New Patches Extend AMD EDAC Linux Driver For Data Center GPUs
The AMD EDAC Linux driver, which handles Error Detection And Correction for AMD x86_64 CPU/memory errors, is being extended to cover AMD data center GPUs such as the Instinct MI200 series and newer, so that their error reporting/correction information can be propagated to this existing driver.
Last month's Linux 6.4 merge window brought AMD EDAC preparations for GPUs, and this Monday morning AMD posted the initial patches that actually extend the EDAC driver to cover AMD Instinct accelerators.
The patch series explains:
"This set adds GPU support to AMD64 EDAC starting with the MI200 (Aldebaran) series.
...
The AMD Instinct™ MI200 series accelerators are the data center GPUs. The MI200 (Aldebaran) series of accelerator devices include Unified Memory Controllers and a data fabric similar to those used in AMD x86 CPU products. The memory controllers report errors using MCA, though these errors are generally handled through GPU drivers that directly manage the accelerator device.
In some configurations, memory errors from these devices will be reported through MCA and managed by x86 CPUs. The OS is expected to handle these errors in similar fashion to MCA errors originating from memory controllers on x86 CPUs. In Linux, this flow includes passing MCA errors to a notifier chain with handlers in the EDAC subsystem.
The AMD64 EDAC module requires information from the memory controllers and data fabric in order to provide detailed decoding of memory errors. The information is read from hardware registers accessed through interfaces in the data fabric.
The accelerator data fabrics are visible to the host x86 CPUs as PCI devices just like x86 CPU data fabrics are already. However, the accelerator fabrics have new and unique PCI IDs.
...
AMD Family 19h Model 30h-3Fh systems can be connected to AMD MI200 accelerator/GPU devices such that the CPU and GPU data fabrics are connected together. In this configuration, the CPU manages error logging and reporting for MCA banks located on the GPUs. This includes HBM memory errors reported from Unified Memory Controllers (UMCs) on the GPUs. The GPU memory errors are handled like CPU memory errors."
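To make the notifier-chain flow described in the quote more concrete, here is a toy user-space model in Python. It is only a sketch of the general pattern (handlers registered with a priority and called in order for each error), not the kernel's actual code; `McaError`, `NotifierChain`, `edac_decoder` and `fallback_logger` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Return codes for this toy model only. The names echo the kernel's
# NOTIFY_* constants, but the values here are arbitrary.
NOTIFY_OK = 0
NOTIFY_STOP = 1


@dataclass
class McaError:
    """Hypothetical, heavily simplified stand-in for a reported MCA error."""
    node: int        # node that reported the error (CPU or GPU)
    corrected: bool  # True for a corrected error, False for uncorrected


Handler = Callable[[McaError], int]


class NotifierChain:
    """Calls registered handlers in descending priority order."""

    def __init__(self) -> None:
        self._handlers: List[Tuple[int, Handler]] = []

    def register(self, handler: Handler, priority: int = 0) -> None:
        self._handlers.append((priority, handler))
        # Python's sort is stable, so handlers with equal priority
        # keep their registration order.
        self._handlers.sort(key=lambda entry: entry[0], reverse=True)

    def call(self, err: McaError) -> int:
        for _, handler in self._handlers:
            if handler(err) == NOTIFY_STOP:
                return NOTIFY_STOP  # a handler asked to end the walk
        return NOTIFY_OK


def demo() -> List[str]:
    log: List[str] = []

    def edac_decoder(err: McaError) -> int:
        kind = "corrected" if err.corrected else "uncorrected"
        log.append(f"edac: {kind} memory error on node {err.node}")
        return NOTIFY_OK

    def fallback_logger(err: McaError) -> int:
        log.append(f"fallback: raw error from node {err.node}")
        return NOTIFY_OK

    chain = NotifierChain()
    chain.register(fallback_logger, priority=0)
    chain.register(edac_decoder, priority=10)  # runs first: higher priority
    chain.call(McaError(node=8, corrected=True))
    return log


if __name__ == "__main__":
    print("\n".join(demo()))
```

The point of the pattern is that the code reporting an error does not need to know who consumes it, which is roughly why GPU-originated errors can reuse the path already used for CPU memory errors.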
Just under 500 lines of code are needed to set up the AMD64 EDAC driver for data center GPU use. The patches are now under review for mainlining in a future kernel series.
This initial enablement focuses on the AMD Instinct MI200 series, while the forthcoming Instinct MI300 series should work much the same way with this EDAC integration.
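For those wanting to check what EDAC reports on a given system, the subsystem exposes per-memory-controller error counters through sysfs. Below is a minimal Python sketch that reads the corrected (`ce_count`) and uncorrected (`ue_count`) totals for each `mcN` directory. It assumes that GPU memory controllers, once these patches land, appear as additional `mcN` entries alongside the CPU ones; that is an assumption about the final behavior, not something confirmed here. The helper name `read_edac_counts` is made up for this example.

```python
from pathlib import Path
from typing import Dict, Optional

# Standard EDAC sysfs location for memory controller instances.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(
    root: Path = EDAC_MC_ROOT,
) -> Dict[str, Dict[str, Optional[int]]]:
    """Return {"mc0": {"ce_count": ..., "ue_count": ...}, ...}.

    A counter is None if its file is missing or unreadable. An empty
    dict is returned when no EDAC driver is loaded.
    """
    counts: Dict[str, Dict[str, Optional[int]]] = {}
    if not root.is_dir():
        return counts
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry: Dict[str, Optional[int]] = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for mc, c in read_edac_counts().items():
        print(f"{mc}: corrected={c['ce_count']} uncorrected={c['ue_count']}")
```

On a machine without an EDAC driver loaded, the script simply prints nothing.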