EXT4 Data Corruption Bug Hits Stable Linux Kernels
Stable releases are for people doing serious work who need their systems to function without chasing down kernel bugs all the time. I spent too much time on this as it is (though my employer has a direct interest in the stability of the Linux kernel, so nobody was too unhappy).
I must say I'm very happy with responsiveness here: I first saw fs corruption on Monday, reported it on Tuesday after figuring out that it was definitely 3.6.3 at fault and thus not an already-fixed bug in an old stable kernel, and had a candidate patch from Ted within a few hours, even though I'd dropped this on him without warning and with so little info that he had to dig through every ext*-affecting patch between 3.6.1--3.6.3. I'm sure I couldn't respond to a bug described that vaguely anywhere near that fast. As ever, Ted provides the rest of us with something to aspire to!
Why do *I* have to "feel free" when RedHat is paying its testers perfectly good money?
If you think you can read and understand every detail of a hundred thousand lines of code, you can safely replace Linus.
Now you say one has to be a gnarly kernel hacker to help! Which is it?
Software has bugs. It is simply impossible to dodge them all. Just think about the notorious random number generator bug in a Debian stable release some time ago....
Yes indeed I think all the time about the lack of even the most rudimentary sorts of regression testing.
Thanks to the openness of Linux, this will hit only a very small fraction of Linux users, most likely geeks and contributors.
Oh you mean only the people who are using the latest releases of the two most mainstream distributions?
[hamish@Griffindor ~]$ yum info kernel
Loaded plugins: langpacks, presto, refresh-packagekit
Available Packages
Name : kernel
Arch : i686
Version : 3.6.2
Release : 1.fc16
Size : 26 M
Repo : updates
Summary : The Linux kernel
URL : http://www.kernel.org/
License : GPLv2
Description : The kernel package contains the Linux kernel (vmlinuz), the core
: of any Linux operating system. The kernel handles the basic
: functions of the operating system: memory allocation, process
: allocation, device input and output, etc.
And how do you test for errors you can't reproduce?
Quite. My latest tests suggest that you have to reboot *while a umount is in progress* for this to go wrong -- and that this affects Linux 3.6.1 and quite possibly many earlier versions (untested as yet), though the dangerous race window is much narrower in kernels before 3.6.2 or 3.6.3 and you pretty much have to do the umount and then the reboot -f as the very next command to make it go wrong. It is not plausible that anyone would have thought of testing *that* before I ran into it. But my home server is a test platform that does just that!
This is, to be honest, a somewhat insane thing to do, even though I need to do it in order to reboot reliably due to nested NFS and non-NFS mounts, not all of which may be reachable at umount time. I'm not entirely convinced this is even a bug, though I hope it's a bug because I'm sick of seeing my filesystems corrupted!
It certainly explains why, myself apart, only people using ext4 on removable devices have seen it so far (though anyone making heavy use of umount -l in any context would probably see it soon enough).
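For anyone trying to picture the sequence described above, here is a rough sketch of it, not the poster's actual shutdown script. It only builds the command list (a lazy umount of each mount point, with reboot -f as the very next command) and prints it in dry-run mode by default. The helper names and the /mnt/scratch mount point are hypothetical, made up for illustration. Don't run it with dry_run=False on a machine whose filesystems you care about.

```python
import shlex
import subprocess


def build_shutdown_sequence(mount_points, lazy=True):
    """Build the command list: one umount per mount point, then 'reboot -f'.

    Hypothetical helper for illustration only. 'lazy' selects 'umount -l',
    which detaches the filesystem immediately and finishes cleanup later.
    """
    umount = ["umount", "-l"] if lazy else ["umount"]
    commands = [umount + [mount_point] for mount_point in mount_points]
    # The forced reboot is issued as the very next command after the umounts.
    commands.append(["reboot", "-f"])
    return commands


def run(commands, dry_run=True):
    """Print the commands (dry run, the default) or actually execute them."""
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            # Needs root, and will really reboot the machine.
            subprocess.run(command, check=False)


if __name__ == "__main__":
    # '/mnt/scratch' is a placeholder mount point, not one from the thread.
    run(build_shutdown_sequence(["/mnt/scratch"]))
```

The dry-run default is deliberate: the point is to show the ordering of the commands, not to hand out a ready-made filesystem shredder.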
I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So for all of the kvetching about people not willing to run bleeding edge kernels, please remember that while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.
Is there more testing that we could do? Yes; as a result of this fire drill, I will probably add some systematic power fail testing before I send a pull request to Linus. But please rest assured that we are already doing a lot of QA work as a regular part of the ext4 development process.
Be happy that people like me use the latest beta/alpha stuff and trash their systems, because I also report bugs when I find them.