AMD Continues CRIU Work To Checkpoint/Restore ROCm Compute Workloads
Checkpointing and restoring GPU state is a complex process, especially since CRIU was not designed for scenarios as involved as GPU handling, but AMD has been making progress over the past several months. The work is focused on ROCm compute: AMD is not working on CRIU support for, say, Vulkan or OpenGL. As with the rest of AMD's GPU compute stack, this CRIU support is fully open-source and the company plans to upstream all of it. AMD's CRIU plug-in is the first for GPUs, and the work also requires changes to the AMDGPU/AMDKFD kernel driver code for saving and restoring the hardware state, memory mappings, etc.
AMD has been able to get this checkpoint/restore working for "real" TensorFlow/PyTorch workloads on multi-GPU nodes, with a focus on being able to migrate the containers running those workloads.
While AMD continues working on upstreaming the code, it's being actively developed on GitHub.
The current state of AMD ROCm CRIU support was presented at last week's XDC2021 conference. Embedded below is the presentation along with the accompanying slide deck.