AMD Begins Prototyping CRIU Support For ROCm Compute
As part of AMD's growing HPC focus and the maturing of their Radeon Open eCosystem (ROCm) GPU compute stack, they closed out this week by making public a prototype implementation of CRIU support for ROCm.
AMD's Radeon open-source graphics software developers are working on Checkpoint/Restore In Userspace (CRIU) handling for ROCm. CRIU makes it possible to freeze a running process and archive its state to disk so that it can be thawed/restored later on. This user-space-based solution is, of course, much trickier when it comes to handling processes that interact with the GPU.
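For readers unfamiliar with the freeze/thaw cycle, here is a minimal Python sketch that only assembles the command lines for the stock `criu dump` and `criu restore` subcommands on an ordinary CPU-only process. The helper names, the PID, and the image directory are hypothetical and chosen for illustration; nothing here reflects the ROCm/AMDKFD interface, which is exactly what the prototype patches are still working out.

```python
from typing import List


def criu_dump_cmd(pid: int, images_dir: str,
                  shell_job: bool = True,
                  leave_running: bool = False) -> List[str]:
    """Build the argv for checkpointing a process tree with `criu dump`.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if shell_job:
        # Needed when the target was started from an interactive shell.
        cmd.append("--shell-job")
    if leave_running:
        # Keep the process alive after the checkpoint instead of killing it.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir: str, shell_job: bool = True) -> List[str]:
    """Build the argv for thawing a checkpoint with `criu restore`."""
    cmd = ["criu", "restore", "--images-dir", images_dir]
    if shell_job:
        cmd.append("--shell-job")
    return cmd


if __name__ == "__main__":
    # Example values only; PID 1234 and /tmp/ckpt are placeholders.
    print(" ".join(criu_dump_cmd(1234, "/tmp/ckpt")))
    print(" ".join(criu_restore_cmd("/tmp/ckpt")))
```

Actually running these commands generally requires root privileges and an installed `criu` binary, and a process that holds GPU state would not be expected to checkpoint cleanly without driver support of the kind discussed here.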
Overnight, an initial set of patches was posted for the AMD Radeon graphics "AMDKFD" kernel code to support CRIU. These 17 patches, with more than two thousand lines of new kernel code, are still in a "request for comments" / prototyping stage.
Ultimately, they are working towards upstreaming this checkpoint/restore support in the AMDKFD driver so that it is usable by the ROCm stack and ROCm applications can be CRIU'ed. The kernel ioctl for the new capabilities is not yet finalized, so it may be a while before this support is squared away.
In any case, those interested in CRIU for AMD Radeon compute workloads can see this patch series for more details.