EXT4 Data Corruption Bug Hits Stable Linux Kernels


  • #41
    It isn't too hard to hit this on Fedora 17. Install a clean system, apply updates, and reboot; you are now running kernel 3.6.2. Then install drivers and reboot. Install anything else and the filesystem goes read-only; after another reboot the system won't boot, and you need to fsck.



    • #42
      Originally posted by henrymiller View Post
      It isn't too hard to hit this on Fedora 17. Install a clean system, apply updates, and reboot; you are now running kernel 3.6.2. Then install drivers and reboot. Install anything else and the filesystem goes read-only; after another reboot the system won't boot, and you need to fsck.
      Please send a detailed report to the linux-ext4 mailing list. Include the kernel version you are running, the kernel messages (from dmesg) that were printed or that you found in the system logs, the output from e2fsck (which is hopefully saved in /var/log/fsck), and whether you can reliably reproduce it, or how frequently it happens.
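      For illustration, a minimal sketch of gathering that information before mailing the list might look like this (the report file name is an assumption for the example, not a mailing-list convention):

      ```shell
      # Collect the details requested above into one file for the report.
      # "ext4-report.txt" is an example name, not a convention.
      REPORT=ext4-report.txt
      uname -r > "$REPORT"                                           # kernel version
      dmesg 2>/dev/null | grep -iE 'ext4|jbd2' >> "$REPORT" || true  # kernel messages, if readable
      cat /var/log/fsck/* >> "$REPORT" 2>/dev/null || true           # saved e2fsck output, if present
      ```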

      There are many potential causes of file system corruption, and sometimes they can have very similar symptoms. This is why I hate trying to debug such problems on web forums or on Ubuntu's Launchpad. Someone sees a relatively vague description of the problem on the web forum, immediately jumps to the conclusion that it must be the same thing they saw, and piles onto the web thread. Always send us detailed information; if it's redundant with what someone else has sent, that's actually *good*. That way we can confirm a common pattern. But please don't assume a pattern where one doesn't exist; humans tend to do this even when it's not justified.



      • #43
        Quality Assurance

        Quality Assurance (Q.A.) should have found that problem before the merge with mainline... Oh wait a minute, this is that Cathedral and Bazaar methodology being shoved down the throats of everyone.

        Seriously speaking, maybe a double boot should be added like a marksman's double tap to your testing repertoire.

        Maybe consider freezing all changes to Ext4 and make a new Ext5 for new ideas.

        Assume nobody runs on Battery Backup.

        Assume people reboot because the power keeps going out: "On, then off, then on, then off, and a couple more times is common."

        Always plan for the worst case.

        Nevermind...



        • #44
          Originally posted by tytso View Post
          I have a Google+ post where I've posted my latest updates:



          I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus.

          So for all of the kvetching about people not willing to run bleeding edge kernels, please remember that while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.

          Is there more testing that we could do? Yes, as a result of this fire drill, I will probably add some systematic power fail testing before I send a pull request to Linus. But please rest assured that we are already doing a lot of QA work as a regular part of the ext4 development process already.
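          (An xfstests run of the kind described above is typically driven by the suite's check script; the checkout path and the "auto" test group below are assumptions for illustration, not details from the post.)

          ```shell
          # Sketch of invoking an xfstests regression run. XFSTESTS_DIR and
          # the "auto" group are typical choices, assumed here as examples.
          XFSTESTS_DIR=${XFSTESTS_DIR:-/usr/local/src/xfstests}
          CHECK_CMD="./check -g auto"          # run the standard regression group
          if [ -x "$XFSTESTS_DIR/check" ]; then
              (cd "$XFSTESTS_DIR" && $CHECK_CMD) || true
          else
              echo "xfstests not found; would run: $CHECK_CMD in $XFSTESTS_DIR"
          fi
          ```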
          Did anyone seriously think that you were not doing this? Anyway, even if your patches were perfect, someone else could make an "improvement" elsewhere in the kernel that will introduce a bug or trigger a latent race condition. If patches sent to Linus were bug free in all situations, we would not need kernel releases.

          On the topic of systematic power fail testing, I suggest using kexec. It makes such testing easy.
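          A minimal sketch of such a kexec-based power-fail test might look like the following; the target file and the dry-run guard are illustrative assumptions, and note that kexec -e reboots the machine immediately:

          ```shell
          # Simulate an abrupt "power cut" by jumping straight into a
          # pre-loaded kernel with kexec -e (no clean unmount, no shutdown).
          # DRY_RUN defaults to on so this sketch is safe to run as-is.
          DRY_RUN=${DRY_RUN:-1}
          TARGET=${TARGET:-/tmp/victim-file}       # file on the filesystem under test
          dd if=/dev/urandom of="$TARGET" bs=1M count=4 2>/dev/null || true
          sync                                     # data we expect to survive the "crash"
          CRASH_CMD="kexec -e"
          if [ "$DRY_RUN" = 1 ]; then
              echo "would run: $CRASH_CMD"
          else
              $CRASH_CMD                           # reboots the machine immediately!
          fi
          ```

          After the machine comes back up, checking whether the synced data survived (and whether e2fsck finds damage) is the interesting part.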

          By the way, what is your opinion of data integrity in ZFS? I have never seen you comment on it.
          Last edited by ryao; 24 October 2012, 11:38 PM.



          • #45
            Originally posted by squirrl View Post
            Quality Assurance (Q.A.) should have found that problem before the merge with mainline... Oh wait a minute, this is that Cathedral and Bazaar methodology being shoved down the throats of everyone.
            Nothing is shoved down anybody's throat. You don't have to use the latest bleeding edge kernel. Feel free to use Debian Stable, or RHEL 6 if you like. Heck, use Windows 8 if it floats your boat.

            Seriously speaking, maybe a double boot should be added like a marksman's double tap to your testing repertoire.
            Double boot was a theory; it turned out not to be true. (Remember, when we do our bug hunting on an open mailing list, sometimes early theories are exposed which turn out not to be true; it's unfortunate when people think it's worthwhile to put those theories in news stories just to drive web hits for advertising $$$, but that's just the way things roll; don't believe everything you see on the web.)

            In fact we regularly mount and unmount the file system many times during the regression tests which I run all the time (which would have been equivalent to a double boot).

            Assume nobody runs on Battery Backup.
            Actually, using the nobarrier mount option (which should only be used if you have hardware RAID or an enterprise storage array) is pretty rare outside of people who should be using enterprise Linux distributions; and the enterprise Linux distributions do a lot of testing.

            Assume people reboot because the power keeps going out: "On, then off, then on, then off, and a couple more times is common."
            I dogfood changes before they get pushed out, and I certainly do things like forced poweroffs. (Usually due to X server or suspend/resume bugs, oh well.) Yes, we should probably add systematic powerfail testing to the regular regression testing that gets done during the development cycle. There's always room for improvement.

            Always plan for the worst case.
            Absolutely. We certainly design with this in mind. There is an awful lot of paranoia built into ext4. In fact, the ext4 error messages that got tripped here came specifically from the paranoid double checks that are built into ext4. You can even add the mount option "block_validity", which will increase the checks at the cost of more CPU overhead --- I run with this mount option when I run my regression tests, for example.
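            As a concrete illustration (the device and mountpoint are placeholders, not details from the thread), turning on those extra checks looks like this:

            ```shell
            # Mount an ext4 filesystem with the block_validity option enabled.
            # /dev/sdb1 and /mnt/test are placeholder names for this sketch.
            MNT_OPTS="block_validity"
            if [ -b /dev/sdb1 ]; then
                mount -t ext4 -o "$MNT_OPTS" /dev/sdb1 /mnt/test || true
            else
                echo "no such device here; would run: mount -t ext4 -o $MNT_OPTS /dev/sdb1 /mnt/test"
            fi
            ```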

            Maybe consider freezing all changes to Ext4 and make a new Ext5 for new ideas.
            We have considered this. Right now new features get added under experimental feature flags or mount options. One of the users who ran into problems was using experimental new features that are not enabled by default. We can't stop users from trying out new features that aren't enabled by default, just as we couldn't stop them from deciding to use a new ext5 instead of ext4 on production servers. Things like metadata checksums are not enabled by default specifically because they aren't ready yet. Brave users who try them out are invaluable, and I am grateful to people who help us do our testing, since that's the only way we can shake out the last bugs that aren't found in developer environments or via regression tests. But you make your choices, and take your chances, when you turn on such experimental features.

            And there are some real costs with forking the code base and creating a potential new "ext5". We already have had problems where bugs are fixed in ext4, and they aren't propagated to ext3. Just today I found a minor bug which was fixed in ext3, but not in ext2. And sometimes bugs are fixed in extN, but someone forgets to forward port the bug fix to extN+1. If we were to add an "ext5" it would make this problem much worse, since it would triple the potential ways that bug fixes might fail to get propagated to where they are needed.

            Speaking of bug fixes, you can't freeze all changes, because we are still finding bugs. Heck, as ext4 got deployed on thousands and thousands of machines in Google data centers, we found a bug that only showed up because we had deployed ext4 in great numbers. Once we had the bug fix, I checked and found that the exact same bug existed in ext3, where it had not been found despite ten years of testing in enterprise Linux releases by companies such as IBM and Red Hat. (It had probably triggered a couple of times, but it was so rare that testers probably chalked it up to random hardware failure or cosmic rays; it was only because I was correlating failure reports --- and most were caused by hardware failures, not by software bugs --- across a very large number of machines that I could discern the pattern and find this particular bug.)

            The problem is that sometimes bug fixes introduce other bugs. In this particular case, it was a bug fix which was backported to a stable kernel that apparently made this failure mode happen more often. If you really mean "freeze all changes", as opposed to just being full of snark, then that would also mean not taking any bug fixes. And if you want to stay on an older version of Linux, feel free... that's what people who are using RHEL 5, or RHEL 4, or even in some cases RHAS 2.1 have chosen.



            • #46
              hasn't ZFS proven to be a superior FS to ext4???

              so why aren't people using ZFS or even its copy btrfs or whatever it's called?


              "[hamish@Griffindor ~]$ uname -r
              3.4.11-1.fc16.i686.PAE
              [hamish@Griffindor ~]$ "

              you named your computer griffindor??? gtfo of linux



              • #47
                Originally posted by Pallidus View Post
                why aren't people using ZFS or even its copy btrfs or whatever it's called?
                aren't people?

                Originally posted by Pallidus View Post
                you named your computer griffindor??? gtfo of linux
                not everybody follows your naming scheme of w00tw00tOMGJustlookatmyHWdrool

                you gtfo, troll.



                • #48
                  Butter and ZitFace are still in heavy development, right?



                  Originally posted by Pallidus View Post
                  hasn't ZFS proven to be a superior FS to ext4???

                  so why aren't people using ZFS or even its copy btrfs or whatever it's called?


                  "[hamish@Griffindor ~]$ uname -r
                  3.4.11-1.fc16.i686.PAE
                  [hamish@Griffindor ~]$ "

                  you named your computer griffindor??? gtfo of linux



                  • #49
                    Originally posted by Pallidus View Post
                    hasn't ZFS proven to be a superior FS to ext4???

                    so why aren't people using ZFS or even its copy btrfs or whatever it's called?


                    "[hamish@Griffindor ~]$ uname -r
                    3.4.11-1.fc16.i686.PAE
                    [hamish@Griffindor ~]$ "

                    you named your computer griffindor??? gtfo of linux
                    ZFS Linux integration at the distribution level is in development in Gentoo and its derivatives. I am the primary Gentoo developer working on it. Other distributions have no such plans.

                    To elaborate on that, a great deal of the heavy lifting in the port of the kernel code is done by professional developers at LLNL so that ZFS could be used as a backend for the Lustre file system on their supercomputer. However, the effort to integrate ZFS into Linux distributions is entirely being done by volunteers and it goes beyond simple packaging. I imagine more distributions will integrate ZFS support after the community paves the way.

                    ZFSOnLinux upstream does not yet support the full range of architectures that Gentoo supports, and the boot loader options are currently limited to GRUB 2, which handles only single-drive and mirror configurations. There are also other issues, such as the need for an initramfs to use ZFS as your rootfs and the lack of Anaconda integration, which is important to Sabayon Linux. These are things that volunteer developers get to fix.

                    That does not include a great deal of upstream collaboration that we (or more specifically, I) have already done in terms of improvements to the kernel code. I have played a role in identifying causes of deadlocks that prevented swap on ZFS from working, developing preemption support, developing support for newer kernels and getting 32-bit support into a usable state, among other things. This was and continues to be on a volunteer basis.

                    There are plenty of people using what we have already accomplished, but it is a work in progress. You are more than welcome to try it out.



                    Alternatively, Sabayon Linux has ZFS support, an installer and binary packages, so if you do not like following my instructions, you can just install that. ZFS rootfs support in the installer does not exist yet though.
                    Last edited by ryao; 25 October 2012, 10:34 AM.



                    • #50
                      In the case of the machine's name, it is one of the few machines that still follows an old naming convention we adopted well over a decade ago, back before Harry Potter was quite so common and when I was a hell of a lot younger.

                      Also note that the word itself is misspelled, inheriting a mistake from well over a decade ago as well.

                      More recently, we named our router box Tuurngait based on Penumbra. Does that up my nerd cred any?

                      Sorry, serious discussions going on. Carry on fixing things.
                      Last edited by Hamish Wilson; 25 October 2012, 12:02 PM.

