EXT4 Data Corruption Bug Hits Stable Linux Kernels

Written by Michael Larabel in Linux Storage on 24 October 2012 at 06:06 AM EDT. 68 Comments

As a warning for those who are normally quick to upgrade to the latest stable vanilla kernel releases, a serious EXT4 data corruption bug worked its way into the stable Linux 3.4, 3.5, and 3.6 kernel series.

An "apparent serious progressive ext4 data corruption bug in 3.6.3" was recently reported on the Linux kernel mailing list. Theodore Ts'o was able to bisect the kernel and find the bug, which first appeared in the Linux 3.6.2 kernel and has since been back-ported to older stable kernels.

From the user reporting this problem, "The bug did really quite a lot of damage to my /home fs in only a few minutes of uptime, given how few files I wrote to it. What it could have done to a more conventional distro install with everything including /home on one filesystem, I shudder to think."

Ted Ts'o wrote in the thread:

I think I've found the problem. I believe the commit at fault is commit 14b4ed22a6 (upstream commit eeecef0af5e):

jbd2: don't write superblock when if its empty

which first appeared in v3.6.2.

The reason why the problem happens rarely is that the effect of the buggy commit is that if the journal's starting block is zero, we fail to truncate the journal when we unmount the file system. This can happen if we mount and then unmount the file system fairly quickly, before the log has a chance to wrap. After the first time this has happened, it's not a disaster, since when we replay the journal, we'll just replay some extra transactions. But if this happens twice, the oldest valid transaction will still not have gotten updated, but some of the newer transactions from the last mount session will have gotten written by the very latest transactions, and when we then try to do the extra transaction replays, the metadata blocks can end up getting very scrambled indeed.

*Sigh*. My apologies for not catching this when I reviewed this patch. I believe the following patch should fix the bug; once it's reviewed by other ext4 developers, I'll push this to Linus ASAP.

- Ted

One of the affected users has also responded in the thread with further comments on this EXT4 data corruption bug in the stable kernel.

Reviewing Ted's patch, Red Hat's Eric Sandeen raised some questions as to whether it fully fixes the problem, and Ted has already posted a new patch.

The bug hunting is still ongoing...

As for how you get hit by this EXT4 bug, Ted says, "Well, the problem won't show up if the journal has wrapped. So it will only show up if the system has been rebooted twice in fairly quick succession. A full conventional distro install probably wouldn't have triggered a bug... although someone who habitually reboots their laptop instead of using suspend/resume or hibernate, or someone who is trying to bisect the kernel looking for some other bug could easily trip over this --- which I guess is how you got hit by it."

The problematic commit causing this potential EXT4 data issue was also back-ported to the stable Linux 3.4.x and 3.5.x kernels, which are reaching end-of-life. Hopefully an exception will be made and new kernels issued soon.
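
For anyone wanting a quick check of whether a machine is running a kernel from one of the series named here, a minimal Python sketch follows. It only compares the major.minor series (3.4, 3.5, 3.6) against the running kernel's release string. It does not know which point releases actually carry the faulty commit, and distribution kernels may differ from vanilla, so treat a match as a prompt to look closer rather than a diagnosis. The function name in_named_series is hypothetical and chosen for illustration.

```python
import platform
import re

# Kernel series named in this article as having received the faulty commit.
# Assumption: only major.minor is compared; point releases are not checked.
NAMED_SERIES = {(3, 4), (3, 5), (3, 6)}


def in_named_series(release):
    """Return True if a release string such as '3.6.3' is in a named series."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) in NAMED_SERIES


if __name__ == "__main__":
    release = platform.release()  # e.g. the output of `uname -r`
    if in_named_series(release):
        print(release + ": in a series named above, check the point release")
    else:
        print(release + ": not in a series named above")
```

A match is most relevant on systems that are rebooted frequently, since Ted's explanation ties the problem to mounting and unmounting in quick succession.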

That's where things are at right now in the mailing list thread for this serious EXT4 data corruption issue that reached the stable Linux kernel.
