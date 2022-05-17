This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory owned by a device that can be mapped into CPU page tables like MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
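The relationship between the three memory types can be summarized as a small property table. The following is a userspace sketch only, not kernel code: the enum, struct, and function names are hypothetical and do not match the kernel's definitions. It assumes the usual reading of the description above, namely that MEMORY_DEVICE_GENERIC is CPU-mappable but not migrated, and that MEMORY_DEVICE_PRIVATE is migratable but not directly CPU-accessible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model for illustration; NOT the kernel's enum memory_type. */
enum mem_type {
    MEM_DEVICE_PRIVATE,
    MEM_DEVICE_GENERIC,
    MEM_DEVICE_COHERENT,
    MEM_TYPE_COUNT
};

/* The two properties the cover letter contrasts. */
struct mem_traits {
    bool cpu_mappable; /* pages can be mapped into CPU page tables */
    bool migratable;   /* pages can be migrated to and from system memory */
};

/* Assumed property table: COHERENT combines one trait from each older type. */
static const struct mem_traits traits[MEM_TYPE_COUNT] = {
    [MEM_DEVICE_PRIVATE]  = { .cpu_mappable = false, .migratable = true  },
    [MEM_DEVICE_GENERIC]  = { .cpu_mappable = true,  .migratable = false },
    [MEM_DEVICE_COHERENT] = { .cpu_mappable = true,  .migratable = true  },
};

/* Look up the traits of a memory type; aborts on an out-of-range value. */
struct mem_traits mem_traits_of(enum mem_type type)
{
    assert((int)type >= 0 && (int)type < MEM_TYPE_COUNT);
    return traits[type];
}

/*
 * In this model, a CPU access to a page that cannot be CPU-mapped must
 * first migrate the page back to system memory. Coherent device memory
 * avoids that step because the CPU can map it directly.
 */
bool cpu_access_needs_migration(enum mem_type type)
{
    return !mem_traits_of(type).cpu_mappable;
}
```

In this model, `cpu_access_needs_migration(MEM_DEVICE_PRIVATE)` is true while `cpu_access_needs_migration(MEM_DEVICE_COHERENT)` is false, which is the practical difference the new type is meant to provide: migration becomes a placement optimization rather than a prerequisite for CPU access.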

System stability and performance are not affected according to our ongoing testing, including xfstests.



How it works: The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map.



The amdgpu driver registers the memory with devmap as MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for this hardware page migration capability is the Frontier supercomputer project. This functionality is not AMD-specific. We expect other GPU vendors to find this functionality useful, and possibly other hardware types in the future.



Our test nodes in the lab are similar to the Frontier configuration, with 0.5 TB of system memory plus 256 GB of device memory split across 4 GPUs, all in a single coherent address space. Page migration is expected to improve application efficiency significantly. We will report empirical results as they become available.

For more than a year we've seen various patches posted by AMD engineers with a stated focus on preparations for the Frontier supercomputer. Most of these patches have involved memory handling under Linux and special purpose memory handling between the CPUs and GPUs. Published on Monday was their latest work on coherent device memory mappings for the Linux kernel. See the latest MEMORY_DEVICE_COHERENT patch series for more technical details if interested.



ORNL photo showing Frontier under construction.

Frontier is the exascale supercomputer currently being built for Oak Ridge National Laboratory and expected to reach full capability this calendar year using a combination of AMD EPYC 3rd Gen CPUs and AMD Instinct MI250X GPUs. The coherent xGMI interconnect between the CPUs and GPUs has been the focus of most of the Frontier-related Linux patches aimed at getting the software support in order. Once fully operational, Frontier should deliver more than 1.5 exaflops of compute performance.