Linux Prepares For AMD Servers With Aldebaran GPU Nodes Sporting HBM2

Written by Michael Larabel in AMD on 30 June 2021 at 02:45 PM EDT.

Kernel patches posted to the mailing list today prepare Linux for newer AMD heterogeneous servers in which Aldebaran GPU nodes are connected to the CPU(s) via xGMI links, with the GPU dies in turn carrying HBM2 memory.

These new heterogeneous AMD system details were revealed today as part of a set of patches preparing the AMD64 EDAC (Error Detection And Correction) kernel driver code for non-CPU nodes. The AMD64 EDAC driver has traditionally dealt with system DRAM ECC errors and is now being extended to cover GPU node memory accessible from the CPUs via the xGMI high-speed interconnect.
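For readers less familiar with EDAC, here is a minimal Python sketch of how its per-memory-controller error counters can be read from sysfs today. It relies on the existing `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` files; whether and how the Aldebaran GPU nodes will show up under this tree once the patches land is an assumption, not something the patches are confirmed to do. The function name `read_edac_counts` is purely illustrative.

```python
from pathlib import Path

# Standard EDAC sysfs location for memory controllers on Linux.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {"mcN": {"ce": int|None, "ue": int|None}} for each controller.

    "ce" is the corrected-error count and "ue" the uncorrected-error count.
    An empty dict is returned when the EDAC tree is absent (driver not
    loaded, or not a Linux host).
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc_dir / filename).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable: report it as unknown.
                entry[key] = None
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

On a system without an EDAC driver loaded, the script simply prints nothing.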

The public patches note that there will be systems with GPU nodes connected via xGMI links, with the GPU dies having HBM2 memory. The patches go on to confirm those nodes as being Aldebaran, the codename for a next-gen AMD CDNA GPU/accelerator that saw initial kernel driver support in Linux 5.13 and continues to see more open-source driver work. Aldebaran is the apparent successor to MI100 "Arcturus" and thus presumably will debut as something along the lines of the AMD Instinct MI200.

These patches published a short time ago note that Aldebaran has two dies (further confirming Aldebaran as an MCM design), each with four unified memory controllers (UMCs). Each unified memory controller manages eight memory channels, each of which is connected to 2GB of HBM2 (or HBM2E) memory.
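Multiplying out the figures quoted in the patches gives a rough idea of the memory capacity involved. The short Python sketch below does only that arithmetic; it assumes every channel is populated with the stated 2GB and says nothing about how shipping products will actually be configured. The function and key names are illustrative.

```python
# Figures as stated in the patches: 2 dies, 4 UMCs per die,
# 8 channels per UMC, 2GB of HBM2/HBM2E per channel.
DIES = 2
UMCS_PER_DIE = 4
CHANNELS_PER_UMC = 8
GB_PER_CHANNEL = 2


def hbm_topology(dies=DIES, umcs_per_die=UMCS_PER_DIE,
                 channels_per_umc=CHANNELS_PER_UMC,
                 gb_per_channel=GB_PER_CHANNEL):
    """Return channel counts and capacities implied by the topology."""
    channels_per_die = umcs_per_die * channels_per_umc
    total_channels = dies * channels_per_die
    return {
        "channels_per_die": channels_per_die,
        "total_channels": total_channels,
        "gb_per_die": channels_per_die * gb_per_channel,
        "total_gb": total_channels * gb_per_channel,
    }


if __name__ == "__main__":
    for key, value in hbm_topology().items():
        print(f"{key}: {value}")
```

Under those assumptions the numbers work out to 32 channels and 64GB per die, or 64 channels and 128GB across both dies.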

The seven patches posted prepare the EDAC memory driver for the notion of connected non-CPU nodes, add recognition of the HBM Gen2 memory type, handle address translation on Data Fabric version 3.5, and take care of related plumbing. Getting this Linux support squared away in a timely manner is being driven by the dominance of Linux in the HPC space, especially given AMD's increasing supercomputer design wins. Most notably, Aldebaran, and in turn this Linux code, is likely what we will see within the upcoming Frontier exascale supercomputer, which has already been mentioned as having a coherent interconnect between the EPYC CPUs and Radeon Instinct GPUs.

Given the timing of these patches with the Linux 5.14 merge window already open, these amd64_edac additions will likely land for Linux 5.15 unless drawn out by an extended review process.
