AMD EDAC/RAS Code Adds GPU/Accelerator Support In Linux 6.5
In addition to the EDAC support for AMD Zen 4 client CPUs that landed yesterday, the set of RAS (Reliability, Availability and Serviceability) updates for the Linux 6.5 kernel has separately brought initial GPU/accelerator support.
This is the code that has been in the works over the past few months for extending the Linux EDAC driver to data center GPUs. In particular, it gets the AMD64 Error Detection and Correction (amd64_edac) driver working for AMD Instinct MI200 GPUs with HBM.
The RAS pull request sent out yesterday for Linux 6.5 explains:
"Add initial support for RAS hardware found on AMD server GPUs (MI200). Those GPUs and CPUs are connected together through the coherent fabric and the GPU memory controllers report errors through x86's MCA so EDAC needs to support them. The amd64_edac driver supports now HBM (High Bandwidth Memory) and thus such heterogeneous memory controller systems."
That code has now been merged for Linux 6.5. While the initial focus is on the MI200 series, it will also be important for the forthcoming AMD Instinct MI300 series.
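For those wanting to check what EDAC reports on such a system, here is a minimal Python sketch that reads the corrected/uncorrected error counters from the standard EDAC sysfs tree. It assumes the GPU memory controllers show up as additional `mcN` entries next to the CPU ones, which has not been verified here on MI200 hardware; the function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    edac_root defaults to the usual EDAC sysfs location; pass another
    directory to test against a fake tree.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # EDAC not loaded or not available on this machine.
        return counts
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable: record it as unknown.
                entry[name] = None
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for controller, values in read_edac_counts().items():
        print(controller, values)
```

On a machine without EDAC loaded, this simply prints nothing.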