AMD Introducing FRU Memory Poison Manager In Linux 6.9

Written by Michael Larabel in AMD on 6 March 2024 at 09:57 AM EST.
Queued for introduction in the upcoming Linux 6.9 kernel cycle is the FRU Memory Poison Manager "FMPM", a driver developed by AMD that may later be adapted for non-AMD platforms. The FRU (field-replaceable unit) Memory Poison Manager is designed to persist information about known bad/faulty memory across reboots.

As noted previously, AMD has been working on row retirement support and other changes for dealing with faulty memory, particularly for the Instinct MI300 series with HBM3 memory. Row retirement phases out a DRAM row once its errors reach a threshold, but that state is lost on restart, so the same errors can repeat after a clean reboot. The forthcoming FRU Memory Poison Manager optionally persists this bad-memory information across reboots.

For consistently faulty memory, the intent is for the known-bad memory to be retired immediately on a new boot rather than hitting the same errors again and only treating it as faulty later on. The AMD FMPM driver for this persistence is queued via the RAS subsystem ahead of the Linux 6.9 cycle. The new "RAS_FMPM" Kconfig switch allows building this driver for saving/restoring memory error information across reboots. The information is archived within the ACPI ERST, the Error Record Serialization Table.
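
As a rough illustration, a kernel configuration can be checked for the new option. Kconfig symbols appear in a .config file with a "CONFIG_" prefix, so "RAS_FMPM" shows up as "CONFIG_RAS_FMPM". The helper name kconfig_state and the sample configuration text below are made up for this sketch and are not taken from the kernel tree.

```python
def kconfig_state(config_text, symbol="RAS_FMPM"):
    """Return 'y', 'm', or 'n' for a Kconfig symbol in .config-style text."""
    key = "CONFIG_" + symbol
    for raw in config_text.splitlines():
        line = raw.strip()
        # Disabled options are written as a comment in .config files.
        if line == "# " + key + " is not set":
            return "n"
        if line.startswith(key + "="):
            return line.split("=", 1)[1]
    return "n"  # symbol absent: treat as disabled


# Hypothetical sample; on a real system read the kernel's own .config instead.
sample = """\
CONFIG_RAS=y
CONFIG_RAS_FMPM=m
# CONFIG_RAS_CEC is not set
"""

print(kconfig_state(sample))
```

On a real machine the same function could be fed the text of the running kernel's configuration file, where the distribution ships one.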

FRU Memory Poison Manager Kconfig


Platform-specific policies will allow setting the behavior around retiring problematic memory at boot time. This merge to RAS.git's "edac-for-next" branch ahead of the Linux 6.9 merge window sums up the FRU Memory Poison Manager driver:
"Memory errors are an expected occurrence on systems with high memory density. Generally, errors within a small number of unique physical locations are acceptable, based on manufacturer and/or admin policy. During run time, memory with errors may be retired so it is no longer used by the system. This is done in mm through page poisoning, and the effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time error handling may occur in the next reboot cycle, leading to terminating jobs due to that already known bad memory. This could be prevented if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent storage. Their driver saves memory error information to the persistent storage during run time. The information is then restored after reset, and known bad memory will be retired before the hardware is used. A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should leverage industry-standard components as much as possible, rather than a bespoke platform driver.

Two components are needed: a record format and a persistent storage interface.

Implement a new module to manage the record formats on persistent storage. Use the requirements for an AMD MI300-based system to start. Vendor- and platform-specific details can be abstracted later as needed."
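
The two components named in the commit message, a record format and a persistent storage interface, can be pictured with a small user-space sketch. This is not the FMPM driver and does not reflect its actual record layout: PoisonRecord, PoisonStore, and boot are hypothetical names, and a JSON file stands in for the ACPI ERST purely for illustration.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PoisonRecord:
    """Hypothetical record format: which unit, and which location is bad."""
    fru_id: str    # assumed identifier for the field-replaceable unit
    address: int   # assumed address of the faulty memory location


class PoisonStore:
    """Hypothetical persistent storage interface (a file stands in for ERST)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return set()
        with open(self.path) as f:
            return {PoisonRecord(**item) for item in json.load(f)}

    def save(self, record):
        # Keep a running log: merge the new record with what is already stored.
        records = self.load()
        records.add(record)
        ordered = sorted(records, key=lambda r: (r.fru_id, r.address))
        with open(self.path, "w") as f:
            json.dump([asdict(r) for r in ordered], f)


def boot(store):
    """Simulate a boot: known bad locations are retired before use."""
    return {r.address for r in store.load()}


path = os.path.join(tempfile.mkdtemp(), "poison.json")
store = PoisonStore(path)

retired = boot(store)                      # first boot: nothing known yet
store.save(PoisonRecord("fru0", 0x1000))   # a run-time error gets logged

retired_next = boot(PoisonStore(path))     # next boot restores the record
print(sorted(retired), sorted(retired_next))
```

The point of the sketch is only the ordering: the error is recorded during one boot, and the following boot retires that location up front instead of rediscovering it through a fresh error.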

Plenty of MI300 enablement continues to make its way into the mainline kernel, and it should also benefit future Instinct generations.