Linux Prepares For AMD Servers With Aldebaran GPU Nodes Sporting HBM2
These new heterogeneous AMD system details were revealed today as part of a set of patches prepping the AMD64 EDAC (Error Detection And Correction) kernel driver code for non-CPU nodes. The AMD64 EDAC driver has traditionally been responsible for detecting and reporting system DRAM ECC errors and is now being extended to cover GPU node memory accessible from the CPUs via the xGMI high-speed interconnect.
The public patches note that there will be systems with GPU nodes connected via xGMI links and the GPU dies have HBM2 memory. The patches go on to confirm those nodes as being Aldebaran, the codename for a next-gen AMD CDNA GPU/accelerator that saw initial kernel driver support in Linux 5.13 and continues seeing more open-source driver work around it. Aldebaran is the apparent successor to MI100 "Arcturus" and thus presumably will debut as something along the lines of the AMD Instinct MI200.
These patches, published a short time ago, note that Aldebaran has two dies (further confirming Aldebaran as an MCM design), each with four unified memory controllers (UMCs). Each unified memory controller manages eight memory channels, and each channel is connected to 2GB of HBM2 (or HBM2E) memory.
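To put those figures together, here is a minimal Python sketch that multiplies out the topology described in the patches. The class and field names are hypothetical and used only for illustration; they do not come from the kernel code. The resulting totals are simple arithmetic on the numbers quoted above, not an announced product specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuNodeTopology:
    """Hypothetical model of the memory layout described in the patches."""

    dies: int = 2              # Aldebaran is described as having two dies
    umcs_per_die: int = 4      # four unified memory controllers per die
    channels_per_umc: int = 8  # eight memory channels per UMC
    gb_per_channel: int = 2    # 2GB of HBM2 (or HBM2E) per channel

    @property
    def channels_per_die(self) -> int:
        return self.umcs_per_die * self.channels_per_umc

    @property
    def gb_per_die(self) -> int:
        return self.channels_per_die * self.gb_per_channel

    @property
    def total_channels(self) -> int:
        return self.dies * self.channels_per_die

    @property
    def total_gb(self) -> int:
        return self.dies * self.gb_per_die


topo = GpuNodeTopology()
print(f"channels per die: {topo.channels_per_die}")
print(f"memory per die:   {topo.gb_per_die} GB")
print(f"total channels:   {topo.total_channels}")
print(f"total memory:     {topo.total_gb} GB")
```

Under those assumptions, the layout works out to 32 channels and 64GB per die, or 64 channels and 128GB across both dies.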
The seven patches prepare the EDAC memory driver for connected non-CPU nodes: recognizing the HBM Gen2 memory type, handling address translation on Data Fabric version 3.5, and adding related plumbing. The push to get this Linux support squared away in a timely manner is driven by Linux's dominance in the HPC space and especially by AMD's increasing supercomputer design wins. Most notably, Aldebaran, and in turn this Linux code, is likely what we will see in the upcoming Frontier exascale supercomputer, which has already been described as having a coherent interconnect between its EPYC CPUs and Radeon Instinct GPUs.
Given that these patches arrived with the Linux 5.14 merge window already open, the amd64_edac additions will likely land in Linux 5.15 unless drawn out by an extended review process.