The Linux 4.0 EXT4 RAID Corruption Bug Has Been Uncovered
The original article suggested this looked like an EXT4 RAID issue, and that indeed turned out to be the case. The issue was caused by an MD commit late in the Linux 4.0 kernel cycle: commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd, titled "md/raid0: fix bug with chunksize not a power of 2." In the commit message, SUSE's Neil Brown explained, "Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 in v3.14-rc1 RAID0 has performed incorrect calculations when the chunksize is not a power of 2. This happens because sector_div() modifies its first argument, but this wasn't taken into account in the patch. So restore that first arg before re-using the variable."
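To make the pitfall described in that commit message concrete, here is a minimal userspace sketch in C. It is not the kernel's RAID0 code: sector_div_emul() is a stand-in written for this example that mimics the documented behavior of sector_div() (divide the first argument in place, return the remainder), and start_sector_buggy() and start_sector_fixed() are hypothetical callers. It illustrates only the older bug the commit set out to fix, not the corruption regression that followed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in (an assumption, not the kernel macro) for sector_div():
 * divides *n by base in place and returns the remainder.
 */
static uint32_t sector_div_emul(uint64_t *n, uint32_t base)
{
    assert(base != 0);
    uint32_t rem = (uint32_t)(*n % base);
    *n /= base; /* the caller's variable now holds the quotient */
    return rem;
}

/*
 * Hypothetical caller that forgets the side effect: after computing how many
 * sectors remain in the current chunk, it keeps using `sector` as if it still
 * held the original start sector.
 */
static uint64_t start_sector_buggy(uint64_t bio_sector, uint32_t chunk_sects,
                                   uint32_t *to_chunk_end)
{
    uint64_t sector = bio_sector;
    *to_chunk_end = chunk_sects - sector_div_emul(&sector, chunk_sects);
    return sector; /* this is the quotient, not the start sector */
}

/* Same calculation, but the clobbered variable is restored before reuse. */
static uint64_t start_sector_fixed(uint64_t bio_sector, uint32_t chunk_sects,
                                   uint32_t *to_chunk_end)
{
    uint64_t sector = bio_sector;
    *to_chunk_end = chunk_sects - sector_div_emul(&sector, chunk_sects);
    sector = bio_sector; /* restore the value the division overwrote */
    return sector;
}
```

With a 24-sector chunk (not a power of 2) and a start sector of 1000, both variants report 8 sectors left in the chunk, but the buggy variant goes on to use 41 where 1000 was intended.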
It turns out that this "fix" to an issue present since Linux 3.14-rc1 is what's causing the EXT4 RAID corruption problems on Linux 4.0.x. Eric Work has devised a small fix to address the corruption problem, but for now it's only present within the MD Git tree. Neil Brown commented, "The patch was only added to my tree today. I will send to Linus tomorrow so it should appear in the next -rc. Any -stable kernel released since mid-April probably has the bug. Once the fix gets into Linus' tree, it should get into subsequent -stable releases."
Thus, for now, all EXT4 RAID0 users on the Linux 4.0.x kernel or current Linux 4.1 Git code are advised to downgrade until the next 4.1 release candidate or 4.0.x stable release; otherwise you stand a good chance of hosing your file-system. It also looks like dropping the discard mount option will avoid this serious issue. This isn't a problem for Linux users on distributions like RHEL, Ubuntu, and other fixed-release distributions that don't tend to update major versions of their kernel post-release, but this corruption issue has already become a problem for Arch Linux and other rolling-release distributions whose users quickly jump to new versions of upstream software.
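For those wanting to see whether any of their EXT4 mounts currently use the discard option mentioned above, the following C sketch walks a mount table with the glibc <mntent.h> helpers. The function names ext4_uses_discard() and count_ext4_discard_mounts() are made up for this example, and passing /proc/mounts as the table path is an assumption about the system. It does not check whether a file-system actually sits on an MD RAID0 array, so treat a match as a hint to look closer rather than as a diagnosis.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the entry is an ext4 mount with the "discard" option set. */
static int ext4_uses_discard(const struct mntent *m)
{
    assert(m != NULL);
    return strcmp(m->mnt_type, "ext4") == 0 &&
           hasmntopt(m, "discard") != NULL;
}

/*
 * Counts ext4 mounts using "discard" in a mount table such as /proc/mounts,
 * printing each one found. Returns -1 if the table cannot be opened.
 */
static int count_ext4_discard_mounts(const char *table_path)
{
    FILE *fp = setmntent(table_path, "r");
    if (fp == NULL)
        return -1;

    int count = 0;
    struct mntent *m;
    while ((m = getmntent(fp)) != NULL) {
        if (ext4_uses_discard(m)) {
            printf("%s on %s uses discard\n", m->mnt_fsname, m->mnt_dir);
            count++;
        }
    }
    endmntent(fp);
    return count;
}
```

Calling count_ext4_discard_mounts("/proc/mounts") from a small main() would list the affected mount points; a non-zero count on a RAID0-backed system running an affected kernel is a cue to remount without discard or to downgrade.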