Linux 4.20 To Receive Fix That Prevented Some AMD Raven Ridge Systems From Booting
It looks like the mainline Linux 4.20 kernel will, within a few days, be playing nicely with more AMD hardware, in particular the Raven Ridge Zen+Vega APUs that have been rather troublesome, depending upon the BIOS/motherboard, since their launch almost one year ago.
Queued this morning in the x86/urgent branch, which should be sent to Linus Torvalds soon as part of the fixes for the Linux 4.20 kernel, is a fix to the initialization order of the thresholding machinery in the AMD-specific x86 MCE (Machine Check Exception) code. The commit explains, "Currently, the code sets up the thresholding interrupt vector and only then goes about initializing the thresholding banks. Which is wrong, because an early thresholding interrupt would cause a NULL pointer dereference when accessing those banks and prevent the machine from booting. Therefore, set the thresholding interrupt vector only *after* having initialized the banks successfully."
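To make the ordering problem in that commit message concrete, here is a toy Python model of it. This is only a sketch and not the kernel code: the class `ToyMce`, its methods, and the `boot()` helper are all hypothetical names invented for illustration, and a Python `TypeError` stands in for the kernel's NULL pointer dereference. It assumes an interrupt can arrive as soon as the first setup step has finished.

```python
from typing import Callable, List, Optional


class ToyMce:
    """Toy stand-in for the thresholding setup; not actual kernel code."""

    def __init__(self) -> None:
        self.banks: Optional[List[int]] = None  # "NULL" until initialized
        self.handler: Optional[Callable[[], None]] = None

    def init_banks(self, count: int = 4) -> None:
        # Allocate the per-bank counters the interrupt handler relies on.
        self.banks = [0] * count

    def install_vector(self) -> None:
        # From this point on, a thresholding interrupt gets delivered.
        self.handler = self.threshold_interrupt

    def threshold_interrupt(self) -> None:
        # The handler touches the banks; with banks still None this raises
        # TypeError, the toy equivalent of a NULL pointer dereference.
        for i in range(len(self.banks)):
            self.banks[i] += 1

    def fire_interrupt(self) -> None:
        # An interrupt is ignored if no handler has been installed yet.
        if self.handler is not None:
            self.handler()


def boot(fixed_order: bool) -> str:
    """Run both setup steps, with an early interrupt after the first one."""
    mce = ToyMce()
    if fixed_order:
        steps = [mce.init_banks, mce.install_vector]  # banks first
    else:
        steps = [mce.install_vector, mce.init_banks]  # vector first (buggy)
    try:
        for step in steps:
            step()
            mce.fire_interrupt()  # interrupt may arrive after any step
    except TypeError:
        return "boot failed"
    return "booted"
```

With the vector installed first, the early interrupt reaches the handler before any banks exist and `boot(fixed_order=False)` reports a failed boot. With the banks set up first, the same early interrupt is simply not delivered yet, and `boot(fixed_order=True)` completes.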
The change that introduced this bug into the AMD MCE kernel code was merged in 2016. Among the bug reports related to this issue, Ryzen 2500U / Raven Ridge hardware seems to be the common trait, but depending upon the motherboard and particular BIOS version, the issue may or may not present itself. The workaround up until now has been booting affected kernels with mce=off to avoid the NULL pointer dereference that would halt the boot process.
It's good that this issue is now being cleared up. One of my Raven Ridge systems was running into kernel boot issues with its latest BIOS, so I'll try it again once this commit lands in Linux 4.20. Presumably this change will also get backported to the current stable/LTS series.