EXT4 Data Corruption Bug Hits Stable Linux Kernels
-
Originally posted by henrymiller View Post:
It isn't too hard to hit this on Fedora 17. Install a clean system, do upgrades, reboot, and you are running kernel 3.6.2; then install drivers, reboot, install anything, and the filesystem goes read-only; reboot again and the system won't boot, so you need to fsck.
There are many potential causes of file system corruption, and sometimes they can have very similar symptoms. This is why I hate trying to debug such problems on web forums or on Ubuntu's Launchpad. Someone sees a relatively vague description of the problem on a web forum, immediately jumps to the conclusion that it must be the same thing they saw, and piles onto the thread. Always send us detailed information; if it's redundant with what someone else has sent, that's actually *good*. That way we can confirm a common pattern. But please don't assume a pattern where one doesn't exist; humans tend to do this even when it's not justified.
-
Quality Assurance
Quality Assurance (Q.A.) should have found that problem before the merge with mainline... Oh wait a minute, this is that Cathedral and Bazaar methodology being shoved down the throats of everyone.
Seriously speaking, maybe a double boot should be added, like a marksman's double tap, to your testing repertoire.
Maybe consider freezing all changes to Ext4 and make a new Ext5 for new ideas.
Assume nobody runs on Battery Backup.
Assume people reboot because the power keeps going out: "On then Off then On then off and a couple more times is common."
Always plan for the worst case.
Nevermind...
-
Originally posted by tytso View Post:
I have a Google+ post where I've posted my latest updates:
I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So for all of the kvetching about people not willing to run bleeding edge kernels, please remember that while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.
Is there more testing that we could do? Yes, as a result of this fire drill, I will probably add some systematic power fail testing before I send a pull request to Linus. But please rest assured that we are already doing a lot of QA work as a regular part of the ext4 development process already.
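For reference, the xfstests runs tytso describes look roughly like the following sketch. The device names and mount points here are illustrative placeholders, not his actual setup, and anything on the scratch devices will be destroyed:

```shell
# Fetch and build xfstests, the regression suite referred to above
git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
cd xfstests-dev && make

# Point the suite at two dedicated test devices (placeholders;
# their contents will be wiped by the tests)
export FSTYP=ext4
export TEST_DEV=/dev/sdb1
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdb2
export SCRATCH_MNT=/mnt/scratch

# Run the automated regression group as root
./check -g auto
</imports>
```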
On the topic of systematic power fail testing, I suggest using kexec. It makes such testing easy.
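A minimal sketch of what kexec-based crash testing can look like (kernel paths and the workload are placeholders): stage the running kernel with kexec, dirty the filesystem, then jump into the new kernel without unmounting or syncing, which discards the page cache and in-flight writes much like a power cut would.

```shell
# Stage the currently running kernel for a warm reboot (run as root)
kexec -l /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initramfs-$(uname -r).img \
      --reuse-cmdline

# Dirty the filesystem with an example write-heavy workload
dd if=/dev/zero of=/mnt/scratch/junk bs=1M count=512 &

# Jump straight into the staged kernel: no unmount, no sync,
# leaving the filesystem in a crash-consistent state
kexec -e

# After the machine comes back up: check the filesystem
fsck.ext4 -f /dev/sdb2
```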
By the way, what is your opinion of data integrity in ZFS? I have never seen you comment on it.
Last edited by ryao; 24 October 2012, 11:38 PM.
-
Originally posted by squirrl View Post:
Quality Assurance (Q.A.) should have found that problem before the merge with mainline... Oh wait a minute, this is that Cathedral and Bazaar methodology being shoved down the throats of everyone.
Seriously speaking, maybe a double boot should be added like a marksman's double tap to your testing repertoire.
In fact we regularly mount and unmount the file system many times during the regression tests which I run all the time (which would have been equivalent to a double boot).
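In script form, that kind of repeated mount/unmount cycling looks roughly like this (a loopback image stands in for a real disk; the paths are examples, and mounting requires root):

```shell
# Build a small ext4 image to exercise
dd if=/dev/zero of=/tmp/ext4.img bs=1M count=64
mkfs.ext4 -q /tmp/ext4.img
mkdir -p /tmp/ext4.mnt

# Mount, touch the filesystem, and unmount repeatedly
for i in $(seq 1 20); do
    mount -o loop /tmp/ext4.img /tmp/ext4.mnt
    echo "pass $i" > /tmp/ext4.mnt/marker
    umount /tmp/ext4.mnt
done

# A clean read-only fsck afterwards means every cycle
# left the filesystem consistent
fsck.ext4 -fn /tmp/ext4.img
```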
Originally posted by squirrl View Post:
Assume nobody runs on Battery Backup.
Assume people reboot because the power keeps going out: "On then Off then On then off and a couple more times is common."
Always plan for the worst case.
Maybe consider freezing all changes to Ext4 and make a new Ext5 for new ideas.
And there are some real costs with forking the code base and creating a potential new "ext5". We already have had problems where bugs are fixed in ext4, and they aren't propagated to ext3. Just today I found a minor bug which was fixed in ext3, but not in ext2. And sometimes bugs are fixed in extN, but someone forgets to forward port the bug fix to extN+1. If we were to add an "ext5" it would make this problem much worse, since it would triple the potential ways that bug fixes might fail to get propagated to where they are needed.
Speaking of bug fixes, you can't freeze all changes, because we are still finding bugs. Heck, as part of deploying ext4 on many thousands of machines in Google data centers, we found a bug that only showed up because we had deployed ext4 in such great numbers. When we found the bug, I checked and found that the exact same bug existed in ext3, where it had gone unnoticed despite ten years of testing in enterprise Linux releases by companies such as IBM and Red Hat. (It had probably triggered a couple of times, but it was so rare that testers probably chalked it up to random hardware failure or cosmic rays; it was only because I was correlating failure reports --- most of which were caused by hardware failures, not by software bugs --- across a very large number of machines that I could discern the pattern and find this particular bug.)
The problem is that sometimes bug fixes introduce other bugs. In this particular case, it was a bug fix which was backported to a stable kernel that apparently made this failure mode happen more often. If you really mean "freeze all changes", as opposed to just being full of snark, then that would also mean not taking any bug fixes. And if you want to stay on an older version of Linux, feel free; that's what people who are using RHEL 5, or RHEL 4, or even in some cases RHAS 2.1 have chosen.
-
hasn't ZFS proven to be a superior FS to ext4???
so why aren't people using ZFS, or even its copy btrfs or whatever it's called?
"[hamish@Griffindor ~]$ uname -r
3.4.11-1.fc16.i686.PAE
[hamish@Griffindor ~]$ "
you named your computer griffindor??? gtfo of linux
-
Originally posted by Pallidus View Post:
why aren't people using ZFS or even it's copy btrfs or whatever it's called?

Originally posted by Pallidus View Post:
you named your computer griffindor??? gtfo of linux
you gtfo, troll.
-
Butter and ZitFace are still in heavy development, right?
Originally posted by Pallidus View Post:
hasn't ZFS proven to be a superior FS to ext4???
so why aren't people using ZFS or even it's copy btrfs or whatever it's called?
"[hamish@Griffindor ~]$ uname -r
3.4.11-1.fc16.i686.PAE
[hamish@Griffindor ~]$ "
you named your computer griffindor??? gtfo of linux
-
Originally posted by Pallidus View Post:
hasn't ZFS proven to be a superior FS to ext4???
so why aren't people using ZFS or even it's copy btrfs or whatever it's called?
"[hamish@Griffindor ~]$ uname -r
3.4.11-1.fc16.i686.PAE
[hamish@Griffindor ~]$ "
you named your computer griffindor??? gtfo of linux
To elaborate on that, a great deal of the heavy lifting in the port of the kernel code is done by professional developers at LLNL so that ZFS could be used as a backend for the Lustre file system on their supercomputer. However, the effort to integrate ZFS into Linux distributions is entirely being done by volunteers and it goes beyond simple packaging. I imagine more distributions will integrate ZFS support after the community paves the way.
ZFSOnLinux upstream does not currently support the full range of architectures supported in Gentoo, and the boot loader options are limited to GRUB 2, which itself supports only single-drive and mirror configurations. There are also other issues, such as the need for an initramfs to use ZFS as your rootfs, and the lack of Anaconda integration, which is important to Sabayon Linux. These are things that volunteer developers get to fix.
That does not include a great deal of upstream collaboration that we (or more specifically, I) have already done in terms of improvements to the kernel code. I have played a role in identifying causes of deadlocks that prevented swap on ZFS from working, developing preemption support, developing support for newer kernels and getting 32-bit support into a usable state, among other things. This was and continues to be on a volunteer basis.
There are plenty of people using what we have already accomplished, but it is a work in progress. You are more than welcome to try it out.
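For anyone who wants a quick look without dedicating disks, ZFS can be pointed at plain files as vdevs. A minimal sketch (pool and dataset names are examples; this needs root, and file-backed pools are for experimentation only, not real data):

```shell
# Create two file-backed vdevs and a mirrored pool on them
truncate -s 128M /tmp/vdev0 /tmp/vdev1
zpool create testpool mirror /tmp/vdev0 /tmp/vdev1

# Make a dataset, write to it, and let a scrub verify checksums
zfs create testpool/data
cp /etc/hostname /testpool/data/
zpool scrub testpool
zpool status testpool

# Tear it all down
zpool destroy testpool
```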
Alternatively, Sabayon Linux has ZFS support, an installer and binary packages, so if you do not like following my instructions, you can just install that. ZFS rootfs support in the installer does not exist yet, though.
Last edited by ryao; 25 October 2012, 10:34 AM.
-
In the case of the machine's name, it is one of the few machines that still follow an old naming convention we adopted well over a decade ago, back before Harry Potter was quite so common and when I was a hell of a lot younger.
Also note that the word itself is misspelled, inheriting a mistake from well over a decade ago as well.
More recently, we named our router box Tuurngait based on Penumbra. Does that up my nerd cred any?
Sorry, serious discussions going on. Carry on fixing things.
Last edited by Hamish Wilson; 25 October 2012, 12:02 PM.