AMD Posts Patches Implementing RAS Support For AMDGPU Linux Driver
AMD developers today posted a set of twenty patches implementing RAS support for the AMDGPU Linux kernel driver.
The AMDGPU driver is gaining RAS (Reliability, Availability, Serviceability) support that, at least for now, appears to be focused on Vega 20, likely just the Radeon Instinct products and not Radeon VII. The AMDGPU RAS support includes SRAM/VRAM ECC, bad page tracking, and error containment.
When RAS errors occur, the GPU will automatically reset, though the initial Vega 20 A1 hardware currently appears to have an issue.
This AMDGPU RAS implementation comes in at just under three thousand lines of new kernel code. The code is now out for public review. Given the timing, it's not going to be merged for the Linux 5.1 cycle, but it could land for the Linux 5.2 kernel.