Watch Out For Possible Data Loss On Early Linux 5.12 Kernels
( TLDR: I will report this upstream once I have more information and a shorter way to reproduce it than first running a number of benchmarks. For now I am just mentioning it for those who may otherwise be eager to try Linux 5.12 Git on systems where you would rather not face possible data loss. I am still working on obtaining more data, but after the testing of the past few days I am confident in saying there definitely seems to be a nasty issue in 5.12 Git at the moment that may leave you with some headaches. )
With the Linux 5.12 merge window nearing its end, I've begun firing up Linux 5.12 Git benchmarks using the latest Ubuntu Mainline Kernel PPA builds (for easy reproducibility), looking as usual for any performance changes across an assortment of hardware. But on every Linux 5.12 attempt so far I have been left with file-system corruption.
Earlier in the week I woke up to find one of the test systems had rebooted and was stuck at GRUB, unable to boot. Booting a live USB and running e2fsck on that system yielded a ton of errors. After that repair finished and I mounted the partition, none of the data was there.
I tried again on that Ryzen 9 5900X + RX 5600 XT system with a different NVMe drive, again running Ubuntu 20.10 + EXT4 with the latest Linux 5.12 daily build. Sure enough, half-way through the testing the system became inoperable with "structure needs cleaning" errors. After another lengthy e2fsck -y run, none of the data appeared recoverable once the fixed file-system was mounted.
On a third try, repeating the same test run, it happened again. This time the file-system at least shifted over to read-only mode, so I was able to see that the issue was being triggered under memory pressure. The kernel log pointed to AMDGPU driver errors over not enough memory for command submission, which cascaded into other out-of-memory errors and ultimately EXT4 file-system errors...
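For anyone who manages to save a kernel log from a run like this, here is a minimal Python sketch for checking whether the errors show up in the same order I saw (GPU memory, then out-of-memory, then EXT4). The substrings it looks for are my assumptions about how those messages are worded, and the function names are made up for illustration, so adjust the patterns to whatever your own log actually contains.

```python
import re
import sys

# Assumed substrings, matched case-insensitively. The exact wording in a
# given kernel log may differ, so treat these as a starting point only.
PATTERNS = {
    "gpu_memory": re.compile(r"not enough memory for command submission", re.I),
    "oom": re.compile(r"out of memory", re.I),
    "ext4": re.compile(r"ext4-fs error", re.I),
}


def first_hits(lines):
    """Return {category: index of the first line matching that category}."""
    hits = {}
    for index, line in enumerate(lines):
        for name, pattern in PATTERNS.items():
            if name not in hits and pattern.search(line):
                hits[name] = index
    return hits


def failure_order(lines):
    """Return the category names sorted by where they first appear."""
    hits = first_hits(lines)
    return sorted(hits, key=hits.get)


# Only reads a file when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        print(" -> ".join(failure_order(log)))
```

Saving the output of dmesg to a file while the system is still up and passing that file's path as the only argument prints the categories in the order they first occurred. It only shows ordering, not cause.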
With the system at least still up, I was able to see that this was being triggered while running the ParaView workstation visualization software. So after freshly re-installing Ubuntu and moving to the latest daily kernel, I tried again, this time starting with the ParaView benchmark straight away. It passed and didn't exhibit any problems.
I then re-started the benchmarks from the original result file to keep the same tests and ordering as when I originally hit the problem, a run that had completed fine on Linux 5.11. Indeed, upon reaching the ParaView test case the system dropped out again... In none of the cases was e2fsck able to correct the file-system and allow for remounting of the data.
That's all I know at the moment regarding this on Linux 5.12. I've now been able to reproduce it on multiple systems with varying daily kernel builds from this week, simply by using a clean install and then running phoronix-test-suite benchmark 2102250-HA-5900XLINU47. By the time the run reaches ParaView, the system is left in a very bad state.
Linux 5.11 runs fine, but Linux 5.12 with these tests is leaving the tested hardware so far in bad shape.
The EXT4 changes were quite small this time around and were only sent out yesterday, so this would seem to be something happening outside of EXT4; it is just that most of my benchmark systems use that file-system.
If you have a spare system, don't mind possible data loss, and want to help try to reproduce this, then on Ubuntu install the latest daily kernel build, simply run phoronix-test-suite benchmark 2102250-HA-5900XLINU47, and see if your system can survive. The system seemingly needs a GPU, given that ParaView seems to be the ultimate trigger, at least for the combination of tests I've been running. So far none of mine have survived with the Git kernel builds, while they run fine on Linux 5.11 and prior.
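Before kicking off that run, it may be worth confirming which kernel series the machine actually booted, since a stock kernel would not exercise the issue at all. The following is a small Python sketch under the assumption that the release string starts with "major.minor" as on the Ubuntu builds I use; the helper names are hypothetical and it deliberately does not launch the benchmark itself.

```python
import platform
import re


def kernel_series(release):
    """Return (major, minor) parsed from a release string, or None.

    Assumes the string begins with "major.minor", which is how the
    kernels I tested report themselves.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_5_12_series(release):
    """True only when the running kernel reports itself as 5.12.x."""
    return kernel_series(release) == (5, 12)


if __name__ == "__main__":
    release = platform.release()
    if is_5_12_series(release):
        print(f"{release}: 5.12 series, only test on a machine you can afford to wipe")
        print("to reproduce: phoronix-test-suite benchmark 2102250-HA-5900XLINU47")
    else:
        print(f"{release}: not a 5.12 kernel, so this run would not exercise the issue")
```

Running it only prints a reminder, leaving the decision to start the benchmark with you.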