AMD Continues CRIU Work To Checkpoint/Restore ROCm Compute Workloads
Earlier this year AMD went public with prototype CRIU (Checkpoint/Restore In Userspace) support for Radeon GPUs under ROCm, allowing running compute workloads to be checkpointed/frozen and then restored at a later point. This CRIU focus is driven by their large accelerator deployments and forthcoming supercomputers, where workloads, particularly those running in containers, need to be migrated. AMD continues working on CRIU support for GPUs and last week provided an update on the project.
Checkpoint/restore for the GPU is a complex process, especially since CRIU was not designed for scenarios as complex as GPU handling, but AMD has been making progress over the past several months with its ROCm focus. They aren't working on CRIU support for, say, Vulkan or OpenGL. As with the rest of their GPU compute stack, this CRIU support is fully open-source and they plan to upstream all of it. Their CRIU plug-in is the first for GPUs, and it also requires changes to the AMDGPU/AMDKFD kernel driver code for saving and restoring hardware state, memory mappings, and so on.
AMD has been able to get this checkpoint/restore working for "real" TensorFlow/PyTorch workloads on multi-GPU nodes, with the focus on being able to migrate the containers running those workloads.
While AMD continues working on upstreaming the code, it's being actively developed on GitHub.
The current state of AMD ROCm CRIU support was presented at last week's XDC2021 conference. Embedded below is the presentation along with the accompanying slide deck.