EXT4 Data Corruption Bug Hits Stable Linux Kernels

An "apparent serious progressive ext4 data corruption bug in 3.6.3" has recently been under discussion on the Linux kernel mailing list. Theodore Ts'o bisected the kernel and traced the bug to a commit that first appeared in the Linux 3.6.2 kernel and has since been back-ported to older stable kernels.
From the user reporting this problem, "The bug did really quite a lot of damage to my /home fs in only a few minutes of uptime, given how few files I wrote to it. What it could have done to a more conventional distro install with everything including /home on one filesystem, I shudder to think."
One of the affected users has also posted a response with further comments on this EXT4 data corruption bug in the stable kernel.
Ted Ts'o wrote in the thread:
I think I've found the problem. I believe the commit at fault is commit 14b4ed22a6 (upstream commit eeecef0af5e):
jbd2: don't write superblock when if its empty
which first appeared in v3.6.2.
The reason why the problem happens rarely is that the effect of the buggy commit is that if the journal's starting block is zero, we fail to truncate the journal when we unmount the file system. This can happen if we mount and then unmount the file system fairly quickly, before the log has a chance to wrap. After the first time this has happened, it's not a disaster, since when we replay the journal, we'll just replay some extra transactions. But if this happens twice, the oldest valid transaction will still not have gotten updated, but some of the newer transactions from the last mount session will have gotten written by the very latest transactions, and when we then try to do the extra transaction replays, the metadata blocks can end up getting very scrambled indeed.
*Sigh*. My apologies for not catching this when I reviewed this patch. I believe the following patch should fix the bug; once it's reviewed by other ext4 developers, I'll push this to Linus ASAP.
- Ted
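Ted's description can be illustrated with a toy model. The Python sketch below is not jbd2 code: the three-slot journal, the function names (mount, write, unmount, simulate), and the block names are all hypothetical, and it ignores sequence numbers, checksums, and the real on-disk layout. It only shows the general effect he describes: a journal that is not truncated at unmount is harmless on the first replay, but once a later session overwrites part of it, replaying the mix of old and new transactions can leave a stale metadata block on disk.

```python
# Toy model only: a "disk" is a dict of block name -> contents, and the
# "journal" is a short list of slots, each holding one transaction (a dict).
# None of this mirrors the real jbd2 data structures.

def mount(disk, journal):
    """Recovery: replay every transaction still in the journal, slot by slot."""
    for txn in journal:
        if txn is not None:
            disk.update(txn)


def write(disk, journal, slot, txn):
    """Log a transaction in the given journal slot, then apply it to the disk."""
    journal[slot] = txn
    disk.update(txn)


def unmount(journal, truncate):
    """A correct unmount empties the journal; the 'buggy' one leaves it as-is."""
    if truncate:
        for i in range(len(journal)):
            journal[i] = None


def simulate(truncate):
    """Run two sessions plus a third mount; return disk state after mounts 2 and 3."""
    disk = {}
    journal = [None, None, None]

    # Session 1: three transactions fill the journal.
    mount(disk, journal)
    write(disk, journal, 0, {"inode_table": "v1"})
    write(disk, journal, 1, {"block_bitmap": "v1"})
    write(disk, journal, 2, {"inode_table": "v2"})
    unmount(journal, truncate)

    # Session 2: the replay (if any) only repeats session 1, which is harmless.
    mount(disk, journal)
    after_second_mount = dict(disk)
    # A short session overwrites one middle slot; the oldest entry stays stale.
    write(disk, journal, 1, {"inode_table": "v3"})
    unmount(journal, truncate)

    # Third mount: with an untruncated journal, old and new entries are mixed.
    mount(disk, journal)
    return after_second_mount, dict(disk)


if __name__ == "__main__":
    print("truncating unmount:", simulate(truncate=True))
    print("buggy unmount:     ", simulate(truncate=False))
```

In this sketch, with truncate=False the second mount leaves the disk unchanged, while the third mount rolls inode_table back from v3 to v2. With truncate=True the journal is empty at every mount, nothing is replayed, and inode_table stays at v3.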
Reviewing Ted's patch, Red Hat's Eric Sandeen raised questions about whether it fully fixes the problem, and Ted has already posted a new patch.
The Linux bug hunting is still on-going...
The problematic commit behind this potential EXT4 data corruption issue was previously back-ported to the stable Linux 3.4.x and 3.5.x kernels, which are reaching end-of-life. Hopefully an exception will be made and new kernel releases will be issued soon.
That's where things stand right now in the mailing list thread for this serious EXT4 data corruption issue that reached the stable Linux kernel.