AMD Introducing FRU Memory Poison Manager In Linux 6.9

Written by Michael Larabel in AMD on 6 March 2024 at 09:57 AM EST.

Queued for the upcoming Linux 6.9 kernel cycle is the FRU (Field Replaceable Unit) Memory Poison Manager, "FMPM", a driver developed by AMD that may later be adapted for non-AMD platforms. The FRU Memory Poison Manager persists information about known bad/faulty memory across reboots.

As noted previously, AMD has been working on row retirement support and other changes for dealing with faulty memory, particularly for the Instinct MI300 series with HBM3 memory. Row retirement phases out a DRAM row once its errors cross a threshold, but that state is lost on restart, so the same errors can simply repeat after a clean reboot. The forthcoming FRU Memory Poison Manager optionally persists this information about bad memory across reboots.

For consistently faulty memory, the intent is for the FRU Memory Poison Manager to retire that memory immediately on a new boot, rather than waiting for the errors to recur and only then treating it as faulty. The AMD FMPM driver for this persistence is queued via the RAS subsystem ahead of the Linux 6.9 cycle. The new "RAS_FMPM" Kconfig switch allows building this driver for saving/restoring memory error information across reboots. The information is archived within the ACPI ERST, the Error Record Serialization Table.
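For anyone wanting to check whether a given kernel build has the driver enabled, here is a minimal Python sketch. It assumes the usual kernel convention that a Kconfig symbol such as RAS_FMPM shows up as CONFIG_RAS_FMPM in a .config-style file; the helper name kconfig_state and the sample text are made up for illustration and are not taken from the patches.

```python
import re


def kconfig_state(config_text: str, symbol: str = "RAS_FMPM"):
    """Return 'y' (built-in), 'm' (module), or None for a Kconfig symbol.

    Hypothetical helper: scans .config-style text for CONFIG_<symbol>=y|m.
    Commented lines such as '# CONFIG_X is not set' are treated as disabled.
    """
    pattern = re.compile(rf"^CONFIG_{re.escape(symbol)}=(y|m)$", re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


# Illustrative sample only, not a real distribution config.
sample = """\
CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
CONFIG_RAS_FMPM=m
"""

print(kconfig_state(sample))  # m
```

On a real system the text would typically come from the kernel build's .config file; where a distribution ships that file varies, so the sketch takes the text as an argument rather than guessing a path.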

[Image: FRU Memory Poison Manager Kconfig]

Platform-specific policies will allow setting the behavior around retiring problematic memory at boot time. This merge to RAS.git's "edac-for-next" branch ahead of the Linux 6.9 merge window sums up the FRU Memory Poison Manager driver:

"Memory errors are an expected occurrence on systems with high memory density. Generally, errors within a small number of unique physical locations are acceptable, based on manufacturer and/or admin policy. During run time, memory with errors may be retired so it is no longer used by the system. This is done in mm through page poisoning, and the effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time error handling may occur in the next reboot cycle, leading to terminating jobs due to that already known bad memory. This could be prevented if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent storage. Their driver saves memory error information to the persistent storage during run time. The information is then restored after reset, and known bad memory will be retired before the hardware is used. A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should leverage industry-standard components as much as possible, rather than a bespoke platform driver.

Two components are needed: a record format and a persistent storage interface.

Implement a new module to manage the record formats on persistent storage. Use the requirements for an AMD MI300-based system to start. Vendor- and platform-specific details can be abstracted later as needed."
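To make the two components from the commit message (a record format and a persistent storage interface) more concrete, here is a toy Python model of the idea. It is only a sketch under loud assumptions: a JSON file stands in for the ACPI ERST, the record is a plain mapping of FRU ID to bad addresses, and the names PoisonStore and Machine are hypothetical. None of this reflects the actual kernel driver's data structures.

```python
import json
import os
import tempfile


class PoisonStore:
    """Hypothetical persistent storage interface; a JSON file plays the ERST role."""

    def __init__(self, path):
        self.path = path

    def load(self):
        # Assumed record format: {fru_id: [bad_address, ...]}.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as fh:
            return {fru: set(addrs) for fru, addrs in json.load(fh).items()}

    def save(self, records):
        with open(self.path, "w") as fh:
            json.dump({fru: sorted(addrs) for fru, addrs in records.items()}, fh)


class Machine:
    """Toy system: restores known-bad memory at 'boot', logs new errors at run time."""

    def __init__(self, store):
        self.store = store
        self.records = store.load()
        self.retired = set()
        # Boot time: retire everything already known to be bad.
        for addrs in self.records.values():
            self.retired |= addrs

    def report_error(self, fru_id, addr):
        # Run time: retire the location and persist it for the next boot.
        self.retired.add(addr)
        self.records.setdefault(fru_id, set()).add(addr)
        self.store.save(self.records)


path = os.path.join(tempfile.mkdtemp(), "poison.json")

first_boot = Machine(PoisonStore(path))
first_boot.report_error("fru0", 0x1000)

# A fresh Machine simulates a reboot: the bad address is retired up front.
second_boot = Machine(PoisonStore(path))
print(0x1000 in second_boot.retired)  # True
```

The point of the split is the one the commit message makes: the record layout and the storage backend can change independently, which is what would let vendor- and platform-specific details be abstracted later.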

Plenty of MI300 work continues to make its way into the mainline kernel, and it should also benefit future Instinct generations.
