Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory
For the upcoming Linux 6.9 kernel cycle there are a number of AMD Instinct MI300 additions to the EDAC (Error Detection And Correction) and RAS (Reliability, Availability and Serviceability) drivers.
This work includes adapting the AMD EDAC driver to use the AMD Address Translation Library (ATL), adding MI300 support to the ATL, other MI300 RAS additions, and one new feature for MI300 hardware: row retirement support.
The patch adding MI300 row retirement support to the amd64_edac driver describes it as a way of dealing with defective or errored-out High Bandwidth Memory (HBM) on the MI300:
"AMD MI300 systems have on-die High Bandwidth Memory. This memory has a relatively higher error rate, and it is not individually replaceable like DIMMs.
Uncorrectable ECC errors are individually reported as Deferred errors using the AMD Deferred error interrupt. Each reported error corresponds to a single hardware error.
Correctable ECC errors get reported in batches through MCA Thresholding. Users can configure the threshold limit based on their policy. Each reported correctable error represents a single occurrence of the threshold limit being reached.
The current guidance from AMD designers is that memory affected by ECC errors within a DRAM row should be retired. Action should be taken on every reported ECC error.
Add a helper function to apply this policy for MI300 systems.
This and similar functionality can also be best handled in a separate, generic module. In the meantime, do this in AMD64 EDAC for simplicity."
A code comment within that row retirement support patch reaffirms the intentions of retiring all memory within that DRAM row on errors:
"When a DRAM ECC error occurs on MI300 systems, it is recommended to retire all memory within that DRAM row. This applies to the memory with a DRAM bank."
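To illustrate the idea of retiring a whole DRAM row rather than just the failing location, here is a minimal user-space sketch in C. It is not the kernel patch: the address layout (`COL_SHIFT`, `COL_BITS`), the `retired_pages` log, and the `retire_page()` and `retire_dram_row()` helpers are all hypothetical, and it treats the error address as a flat address whose column bits can be swept directly. The real driver has to translate addresses through the ATL and hand pages to the kernel's memory-failure handling instead of recording them in an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only, NOT the real MI300 mapping:
 * 4 KiB pages, and the column bits of an address sit at bits 12..15. */
#define PAGE_SHIFT   12
#define COL_SHIFT    12
#define COL_BITS     4
#define COLS_PER_ROW (1u << COL_BITS)

/* Stand-in for page offlining: just records the retired page frame numbers. */
struct retired_pages {
    uint64_t pfn[COLS_PER_ROW];
    size_t count;
};

static void retire_page(struct retired_pages *log, uint64_t pfn)
{
    assert(log->count < COLS_PER_ROW);
    log->pfn[log->count++] = pfn;
}

/* On a reported ECC error, sweep every column of the affected row and
 * retire the page backing each one. Returns the number of pages retired. */
static size_t retire_dram_row(uint64_t err_addr, struct retired_pages *log)
{
    const uint64_t col_mask = (uint64_t)(COLS_PER_ROW - 1) << COL_SHIFT;
    const uint64_t row_base = err_addr & ~col_mask; /* clear the column bits */

    log->count = 0;
    for (uint64_t col = 0; col < COLS_PER_ROW; col++) {
        uint64_t addr = row_base | (col << COL_SHIFT);
        retire_page(log, addr >> PAGE_SHIFT);
    }
    return log->count;
}
```

Under these assumed values, a single error at address 0x12345678 retires the 16 pages with frame numbers 0x12340 through 0x1234F, including the page that actually reported the error. That matches the quoted guidance that action is taken on every reported ECC error and covers all memory within the row, not only the failing address.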
This latest AMD MI300 work is set for Linux 6.9 now that the patches are part of RAS.git's "edac-for-next" Git branch.