Some Users Have Been Hitting EXT4 File-System Corruption On Linux 4.19


  • #71
    Originally posted by birdie
    Michael

    The bug has seemingly been identified and a fix has been submitted to mainline. 4.19/4.20 are currently still affected.

    Check the bug report discussion for more details.

    People who run the affected systems should either apply the patch immediately or downgrade to kernel 4.18. Running e2fsck/your-fs.fsck is pretty much mandatory.
    It has not been submitted to mainline yet. So far it's only in Jens' "block" branch upstream. It hasn't yet been sent in or pulled into Linus Torvalds' mainline branch. I've been monitoring it closely and will have a Phoronix article out when it's actually in Linux Git and/or back-ported.
    Michael Larabel
    https://www.michaellarabel.com/



    • #72
      Originally posted by [email protected] View Post
      I was laughing at all the Windows 10 updates shenanigans since October, but looks like we have our own problems too.

      I'm glad I stayed with kernel 4.15 (Kubuntu 18.04), after the first benchmark Michael did showing only marginal improvements in most games since the beginning of the year.
      Well, there is a reason why most distros won't just grab the latest mainline kernel immediately. Linux development is a kind of unusual case, since the kernel is developed separately from the rest of the OS. "Released" seems to mean something more like the kernel being released to distro developers for integration testing...



      • #73
        Hi guys, and sorry for my English.

        I see that a lot of people can't reproduce the error, and MOST people say they don't see any problem at all.

        But I can reproduce the error.

        I even thought I was having memory/disk errors corrupting things, but no, I found all those bug reports so the problem was not only with me /o/.

        I can format a disk and reproduce the error once again in less than 5 minutes.

        The problem is that I can't send it in a bug report, because it depends on a HUGE amount of data I have.

        I've been collecting financial data, which I'm paying too much to receive, from the Barchart.com service...
        I collected 700GB of data during all this year, downloaded it, and while using a Python script to "convert" it to another format for my own usage, I started to have this problem - imagine my face when I saw 1 year of hard work and big money going down the hole because of a filesystem corruption!

        How I can reproduce it on my machine:

        The Python script loads a huge file into memory:
        myFile = cbor.loads(zstd.decompress(open('FILE', 'rb').read()))  # this loads an 8GB file into memory

        ... process the data...

        ... the process consumes about 25GB of RAM, leaving 7GB free

        ... I start creating many directories
        os.mkdir('output/' + someName)

        ... in each directory created, I save thousands of files. Remember I said I left about 7GB free? Each generated file consumes GBs, so when Python does a write(), memory usage is at 100% (~90% used memory + ~10% fs cache)

        ... now I open another file, and do the same work again.

        Now, I believe that maybe the problem is not in the EXT4 code itself, but in the VM code or something else related, because the problem only appears in this scenario:
        high CPU/RAM usage AND too many filesystem operations (mkdir(), open(O_CREAT), close()) AND too many malloc() and free() calls (the Python script does HOURS of work on dictionaries and lists, as the data is saved in a format like JSON). Because of this, I think that at this point the memory is very fragmented.
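A minimal sketch of the filesystem side of the workload described above: many mkdir() calls, many small file creations, then a read-back check. It is not the poster's script. The function name, parameters, and defaults are all hypothetical, and it does not recreate the 25GB memory pressure. Note also that the read-back is likely served from the page cache, so it would not necessarily detect on-disk corruption.

```python
import hashlib
import os
import tempfile


def stress_fs(root, n_dirs=50, files_per_dir=200, payload_size=4096):
    """Create many directories and files under `root`, then read them back.

    Returns the list of paths that could not be read or whose contents
    no longer match what was written. All names and defaults here are
    illustrative assumptions, not values from the original post.
    """
    expected = {}
    for d in range(n_dirs):
        dir_path = os.path.join(root, f"dir_{d:05d}")
        os.mkdir(dir_path)
        for f in range(files_per_dir):
            payload = os.urandom(payload_size)
            file_path = os.path.join(dir_path, f"file_{f:05d}.bin")
            with open(file_path, "wb") as fh:
                fh.write(payload)
            # Remember a checksum of what we wrote for the later comparison.
            expected[file_path] = hashlib.sha256(payload).hexdigest()

    mismatches = []
    for file_path, digest in expected.items():
        try:
            with open(file_path, "rb") as fh:
                actual = hashlib.sha256(fh.read()).hexdigest()
        except OSError:
            # Unreadable files count as failures too.
            mismatches.append(file_path)
            continue
        if actual != digest:
            mismatches.append(file_path)
    return mismatches


if __name__ == "__main__":
    # Small run in a throwaway directory; point `root` at a scratch
    # filesystem (never at real data) for anything larger.
    with tempfile.TemporaryDirectory() as tmp:
        bad = stress_fs(tmp, n_dirs=5, files_per_dir=20)
        print(f"{len(bad)} mismatching files")
```

Something like this only belongs on a scratch disk, since the whole point of the thread is that the affected kernels can damage the filesystem being exercised.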

        So, this is really a huge and severe bug: almost no one else hits it, and those who do have no idea what just happened. But I have 19 VMs on cheap cloud servers running the same distribution and the same kernel I compiled at home, and I have no problems with any of those machines. And NOTE: I use linux-next (latest Git, not-yet-released kernel, not even reviewed by Linus Torvalds!!!).

        I want to help them fix this issue. I'm going to recompile the 4.18 kernel, and also try other filesystems to see if I hit this error.

        Can anyone here give me an idea of what else I could include in the bug report? It's not a fault like a segmentation fault, kernel panic or core dump, so all I have to include is my (very minimalistic and customized!) .config


        And those who haven't hit this problem yet: thank God, and start using another HD right now and leave the files you already have untouched, because eventually you will do a "ls -l /my/precious" and find out that the directory can't even be listed =]
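The "directory can't even be listed" symptom can be checked for across a whole tree rather than one `ls` at a time. A rough sketch, assuming a hypothetical helper name `find_unlistable_dirs`; it only reports directories the kernel refuses to list (which also includes plain permission errors) and says nothing about file contents:

```python
import os


def find_unlistable_dirs(root):
    """Walk `root` and collect (path, reason) for directories that fail to list."""
    failures = []

    def on_error(err):
        # os.walk passes the OSError raised while scanning a directory.
        failures.append((err.filename, err.strerror))

    # We only care about the errors, so the walk results are discarded.
    for _ in os.walk(root, onerror=on_error):
        pass
    return failures


if __name__ == "__main__":
    # "/my/precious" is the placeholder path used in the post.
    for path, reason in find_unlistable_dirs("/my/precious"):
        print(f"cannot list {path}: {reason}")
```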


        PS.: I do miss when Linus Torvalds was so rude and we didn't have these kinds of bugs... actually I only tried to report two things in the kernel before, and... that's why I'm writing this here instead of on the LKML, because I developed a phobia of their responses to newbies like me lol



        • #74
          Originally posted by birdie
          People who run the affected systems should either apply the patch immediately or downgrade to kernel 4.18. Running e2fsck/your-fs.fsck is pretty much mandatory.
          That's probably a good idea even for those without problems, but I think my fsck.ext4 didn't report any errors when I ran it.

          So maybe the kernel is reporting an error while the filesystem is actually ok?

          That would mean anyone having this problem may still have their FS intact...

          So, don't give up and format the disk like I did, just reboot with an older kernel before it's too late...
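For anyone unsure whether they are on one of the kernels discussed here, a small sketch that compares the running release against the 4.19/4.20 series named earlier in the thread. The `AFFECTED_SERIES` tuple and function names are assumptions for illustration; this is not an authoritative list of affected versions and it cannot tell whether a distro kernel already carries the fix.

```python
import platform
import re

# Series named in this thread as affected; an assumption, not an official list.
AFFECTED_SERIES = ("4.19", "4.20")


def kernel_series(release):
    """Return 'major.minor' from a release string such as '4.19.5-arch1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def is_affected(release=None):
    """True if the release belongs to one of the series listed above."""
    if release is None:
        release = platform.release()
    return kernel_series(release) in AFFECTED_SERIES


if __name__ == "__main__":
    rel = platform.release()
    if is_affected(rel):
        print(f"{rel}: in an affected series, consider booting 4.18 or applying the patch")
    else:
        print(f"{rel}: not in the 4.19/4.20 series")
```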
