That Linux 5.12 Severe Data Corruption Bug Hits Intel CI Systems - Issue Caused By Swap File
As reported last week, on my test systems with the Linux 5.12 kernel I have been suffering from significant data corruption during benchmarking. Running e2fsck on the EXT4 file-systems would yield a plethora of errors and the file-systems were ultimately not recoverable. Besides having to either recover from a backup image or reinstall from scratch each time, what made it more complex was seeing this behavior even before the EXT4 file-system changes were merged for the 5.12 cycle, and those changes tended to be on the mundane side anyhow. That likely indicates a problem elsewhere in the kernel and not something specific to EXT4, just that many of my test systems are using EXT4.
It's been a slow process bisecting it given that each step requires either restoring a block-level backup or re-installing Ubuntu from scratch, making it much more time consuming than bisecting a "simple" performance regression of the kernel. Plus with everything else on my plate, it's been a rough week dealing with Linux 5.12. At the same time I was left wondering why more folks aren't hitting this nasty bug and screaming about it, but today Intel has joined the chorus.
Phew, others are now seeing this issue too... With the bug impacting Intel and their greater resources, the issue should be buttoned up much quicker.
It turns out Intel's graphics continuous integration systems have been impacted by this issue, so fortunately that led Intel engineers to look into it. There was a notice sent out today regarding the issue.
Intel's Tomi Sarvela noted, "Hitting the bug corrupts the underlying filesystem very thoroughly, wiping out large amount of data from the beginning of the partition which leaves fsck sad with thousands of items lost. Bisection of the IGT testlist was done with two root filesystems, where testable kernel booted from 2. partition, and copy of the 2. partition was stored on 1. partition and could be restored at will."
The analysis on the Intel side reproduced the corruption during their testing and made an important discovery: it appears to be related to systems with an active swap file rather than a swap partition or no swap at all. With the current Linux code, the file-system left trashed is the one containing the swap file.
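For those wanting to check whether a given test box falls into the affected category, here is a minimal Python sketch. It assumes the usual /proc/swaps layout on Linux, where the second column reads "file" for a swap file and "partition" for a swap partition. The function names are my own for illustration and are not taken from Intel's notice.

```python
from pathlib import Path


def parse_swaps(text):
    """Return (path, kind) tuples from /proc/swaps-style text."""
    entries = []
    # The first line is the column header (Filename, Type, Size, ...).
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def active_swap_files(text):
    """Return the paths of active swap areas backed by a file."""
    return [path for path, kind in parse_swaps(text) if kind == "file"]


if __name__ == "__main__":
    swaps = Path("/proc/swaps")
    if swaps.exists():
        files = active_swap_files(swaps.read_text())
        if files:
            print("Active swap file(s):", ", ".join(files))
        else:
            print("No active swap file found.")
    else:
        print("/proc/swaps not available on this system.")
```

If this lists a swap file, that system matches the configuration Intel found to be affected; a system with only a swap partition or no swap at all does not.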
Intel's Chris Wilson was able to bisect the issue on their three systems and found three patches to the Linux kernel's memory management code touching the swap file handling. With the three patches reverted, they are no longer seeing this severe file-system corruption.
So hopefully after further testing, either the reverts of those patches will be upstreamed or a more adequate swap file fix to those patches will be carried out. I'll be testing on my end now thanks to the Intel discoveries. As it stands right now, Linux 5.12 Git is still vulnerable to this nasty issue, so for those relying on a swap file I would certainly recommend avoiding this development kernel until a fix for or revert of those patches has landed.
UPDATE: A fix has landed for Linux 5.12 on 3 March.