EXT4 Data Corruption Bug Hits Stable Linux Kernels

tytso replied

24 October 2012, 09:34 PM
I have a Google+ post where I've posted my latest updates:

https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7

I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So for all of the kvetching about people not willing to run bleeding edge kernels, please remember that while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.

Is there more testing that we could do? Yes; as a result of this fire drill, I will probably add some systematic power-fail testing before I send a pull request to Linus. But please rest assured that we are already doing a lot of QA work as a regular part of the ext4 development process.
NullNix replied

24 October 2012, 07:39 PM
Originally posted by PuckPoltergeist

And how do you test for errors you can't reproduce?

Quite. My latest tests suggest that you have to reboot *while a umount is in progress* for this to go wrong -- and that this affects Linux 3.6.1 and quite possibly many earlier versions (untested as yet), though the dangerous race window is much narrower in kernels before 3.6.2 or 3.6.3 and you pretty much have to do the umount and then the reboot -f as the very next command to make it go wrong. It is not plausible that anyone would have thought of testing *that* before I ran into it. But my home server is a test platform that does just that!

This is, to be honest, a somewhat insane thing to do, even though I need to do it in order to reboot reliably due to nested NFS and non-NFS mounts, not all of which may be reachable at umount time. I'm not entirely convinced this is even a bug, though I hope it's a bug because I'm sick of seeing my filesystems corrupted!

It certainly explains why, myself apart, only people using ext4 on removable devices have seen it so far (though anyone making heavy use of umount -l in any context would probably see it soon enough).
Hamish Wilson replied

24 October 2012, 06:47 PM
Now I am confused...

[hamish@Griffindor ~]$ yum info kernel
Loaded plugins: langpacks, presto, refresh-packagekit
Available Packages
Name : kernel
Arch : i686
Version : 3.6.2
Release : 1.fc16
Size : 26 M
Repo : updates
Summary : The Linux kernel
URL : http://www.kernel.org/
License : GPLv2
Description : The kernel package contains the Linux kernel (vmlinuz), the core
: of any Linux operating system. The kernel handles the basic
: functions of the operating system: memory allocation, process
: allocation, device input and output, etc.

[hamish@Griffindor ~]$

[root@Griffindor ~]# yum update
Loaded plugins: langpacks, presto, refresh-packagekit
fedora-awesome | 2.8 kB 00:00
fedora-chromium-stable | 3.4 kB 00:00
rpmfusion-free-updates | 3.3 kB 00:00
rpmfusion-nonfree-updates | 3.3 kB 00:00
updates/metalink | 16 kB 00:00
No Packages marked for Update
[root@Griffindor ~]#

[hamish@Griffindor ~]$ uname -r
3.4.11-1.fc16.i686.PAE
[hamish@Griffindor ~]$

I guess this is good, but...
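The comparison being made by eye in the output above (running kernel versus packaged kernel) can be sketched as a small check. This is only an illustration: the function names are hypothetical, and the set of versions is limited to the releases explicitly named elsewhere in this thread (3.6.1, 3.6.2, 3.6.3). Earlier kernels are described in the thread as untested, so a `False` result here means "not named in this thread", not "known safe".

```python
import re
from typing import Tuple

# Assumption: only the releases explicitly named in this thread.
# Earlier kernels are untested per the thread, so this is not a safety list.
VERSIONS_NAMED_IN_THREAD = {(3, 6, 1), (3, 6, 2), (3, 6, 3)}


def parse_kernel_release(release: str) -> Tuple[int, ...]:
    """Extract the leading numeric version from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch level (e.g. "3.6") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def named_in_thread(release: str) -> bool:
    """Return True if the release matches a version named in this thread."""
    return parse_kernel_release(release) in VERSIONS_NAMED_IN_THREAD


if __name__ == "__main__":
    for release in ("3.4.11-1.fc16.i686.PAE", "3.6.2-1.fc16"):
        print(release, parse_kernel_release(release), named_in_thread(release))
```

With the strings from the output above, the running kernel parses as 3.4.11 and is not in the named set, while the packaged 3.6.2-1.fc16 is.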
PuckPoltergeist replied

24 October 2012, 05:41 PM
Originally posted by tehehe

That's why the kernel should have automatic tests. Code review is important, but it's not a substitute for good test coverage.

And how do you test for errors you can't reproduce?
darkbasic replied

24 October 2012, 05:27 PM
Again, you can't spot such a bug with automatic tests.
tehehe replied

24 October 2012, 04:39 PM
That's why the kernel should have automatic tests. Code review is important, but it's not a substitute for good test coverage.
frantaylor replied

24 October 2012, 04:06 PM
Originally posted by enrico.tagliavini

Feel free to help.

Why do *I* have to "feel free" when RedHat is paying its testers perfectly good money?

If you think you can read and understand a hundred thousand lines of code in every detail, you can safely replace Linus.

Now you say one has to be a gnarly kernel hacker to help! Which is it?

Software has bugs. It is simply impossible to dodge them all. Just think about the notorious random number generator bug in Debian a few stable releases ago....

Yes indeed I think all the time about the lack of even the most rudimentary sorts of regression testing.

Just be thankful for the openness of Linux: this will hit only a very small fraction of Linux users, most likely geeks and contributors.

Oh you mean only the people who are using the latest releases of the two most mainstream distributions?
NullNix replied

24 October 2012, 04:00 PM
Originally posted by necro-lover

stable releases are for pussy's.

Stable releases are for people doing serious work who need their systems to function without chasing down kernel bugs all the time. I spent too much time on this as it is (though my employer has a direct interest in the stability of the Linux kernel, so nobody was too unhappy).

I must say I'm very happy with the responsiveness here: I first saw fs corruption on Monday, reported it on Tuesday after figuring out that it was definitely 3.6.3 at fault and thus not an already-fixed bug in an old stable kernel, and had a candidate patch from Ted within a few hours, even though I'd dropped this on him without warning and with so little info that he had to dig through every ext*-affecting patch between 3.6.1 and 3.6.3. I'm sure I couldn't respond to a bug described that vaguely anywhere near that fast. As ever, Ted provides the rest of us with something to aspire to!
necro-lover replied

24 October 2012, 02:48 PM
Originally posted by Pallidus

LOL is on you because you could be running stable 3.6.1 and be unaffected as well.

PROTIP wait for the kernels to mature, even the stable ones, for at least 15 days before upgrading them PROTIP

"Still, the commit in question *does* change things, and so it's still the most likely culprit."

name and shame plox

I have only used RC/beta/alpha kernels for years, and most of the time alpha/beta/git Radeon drivers as well.

stable releases are for pussy's.
Leave a comment:
mazumoto replied

24 October 2012, 02:22 PM
awww, on stable ? that's evil. last time i lost data on ext4 was on rc-kernels at least ...
glad I'm on btrfs now for all my fs's, although that'll probably blow up any second now :-P