AMD Preparing More Linux Code For The Frontier Supercomputer

Written by Michael Larabel in AMD on 27 May 2021 at 08:37 PM EDT. 12 Comments

Frontier, the first US exascale supercomputer, is being commissioned by Oak Ridge National Laboratory and the Department of Energy. Powered by AMD CPUs and GPUs, it is in the process of seeing more Linux kernel changes for bringing up the new platform.

Frontier is powered by AMD EPYC processors and Radeon Instinct accelerators. While the system is set to be delivered in 2021, work continues on the Linux software support needed to make this supercomputer a reality. In particular, the latest code sent out deals with coherent handling of GPU memory, as this supercomputer supports a coherent interconnect between the CPUs and GPUs. The latest patch series out of AMD proposes changes to the memory management code around device zone page migration, ultimately to handle page migration and coherent CPU access to video memory.

Building on the recent Heterogeneous Memory Management (HMM) Shared Virtual Memory (SVM) code that is now queued for Linux 5.14, this new patch series is titled "[RFC PATCH 0/5] Support DEVICE_GENERIC memory in migrate_vma_*". Longtime AMD Linux engineer Felix Kuehling sums up the situation as:

AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
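
To make the goal in that message concrete, here is a minimal userspace toy model in Python. It is not kernel code and not AMD's implementation: the names Pool, Page, AddressSpace and migrate_range are hypothetical, invented purely for illustration. It only sketches the idea of moving pages in a range between system RAM and VRAM through collect, copy, and finalize steps (loosely inspired by the kernel's migrate_vma_setup, migrate_vma_pages and migrate_vma_finalize helpers), while the CPU keeps reading those pages directly, as a coherent interconnect would allow.

```python
from dataclasses import dataclass, field
from enum import Enum


class Pool(Enum):
    """Where a page currently lives (hypothetical names for this sketch)."""
    SYSTEM = "system RAM"
    VRAM = "GPU device memory"


@dataclass
class Page:
    pool: Pool
    data: bytes


@dataclass
class AddressSpace:
    """Toy page table: virtual page number -> backing Page."""
    table: dict = field(default_factory=dict)

    def cpu_read(self, vpn):
        # Coherent access is assumed: the CPU reads the page directly,
        # whichever pool holds it. No fault or migrate-back is modelled.
        return self.table[vpn].data


def migrate_range(space, start, end, dst_pool):
    """Move every mapped page in [start, end) to dst_pool; return the count."""
    # Collect: pages in range that are not already in the destination pool.
    sources = {vpn: page for vpn, page in space.table.items()
               if start <= vpn < end and page.pool is not dst_pool}
    # Copy: allocate destination pages and copy the contents over.
    destinations = {vpn: Page(dst_pool, bytes(page.data))
                    for vpn, page in sources.items()}
    # Finalize: point the page table at the new pages.
    space.table.update(destinations)
    return len(destinations)


# Example: three pages start in system RAM; the first two move to VRAM.
space = AddressSpace({n: Page(Pool.SYSTEM, bytes([n])) for n in range(3)})
moved = migrate_range(space, 0, 2, Pool.VRAM)
```

After the call, pages 0 and 1 are backed by the VRAM pool and page 2 stays in system RAM, yet cpu_read returns the same bytes for all three. A real driver would also have to deal with locking, reference counts, and failed migrations, none of which this sketch attempts.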
