The State Of ROCm For HPC In Early 2021 With CUDA Porting Via HIP, Rewriting With OpenMP

Written by Michael Larabel in AMD on 21 February 2021 at 12:10 PM EST. 116 Comments

Earlier this month the virtual FOSDEM 2021 conference featured an interesting presentation on how European developers are preparing for AMD-powered supercomputers. They are beginning to work out the best approaches for converting existing NVIDIA CUDA GPU code to run on Radeon GPUs, as well as whether writing new GPU-focused code with OpenMP device offload is worthwhile.

Georgios Markomanolis, the lead HPC scientist at CSC, presented at this month's FOSDEM virtual conference on preparing for supercomputers with AMD GPUs. Their focus at the moment is on the LUMI EuroHPC supercomputer, which is expected to become operational later this year. LUMI is aiming for 550+ PetaFLOPS of peak performance and will be powered by AMD EPYC "Milan" processors and AMD Instinct GPUs (post MI100, perhaps the new "GFX90A").

While awaiting the supercomputer, the HPC researchers in Europe involved with the LUMI consortium have already been busy analyzing the Radeon Open eCosystem (ROCm), the available methods for porting existing CUDA codebases so they can exploit the GPU performance, and the best practices for writing new code.

For converting CUDA code over for AMD GPU execution, the focus is obviously on using AMD's open-source HIP heterogeneous interface. Hipify-Clang can handle source-based translation of CUDA code in large part, and there is also Hipify-Perl for text-based search/replace when migrating from CUDA to HIP. With the HIP-based approaches they have been seeing good results, with roughly 2% overhead.
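To give a feel for what text-based search/replace translation means in practice, here is a minimal C++ sketch. It is not how Hipify-Perl is actually implemented: the function name `toy_hipify` and the rename table are hypothetical and cover only a handful of well-known CUDA runtime names and their HIP counterparts, whereas the real tools handle far more of the API.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative subset of CUDA -> HIP renames (hypothetical table, not taken
// from any real tool). Longer names are listed before their prefixes.
static const std::vector<std::pair<std::string, std::string>> kRenames = {
    {"cuda_runtime.h", "hip/hip_runtime.h"},
    {"cudaMemcpyHostToDevice", "hipMemcpyHostToDevice"},
    {"cudaMemcpyDeviceToHost", "hipMemcpyDeviceToHost"},
    {"cudaDeviceSynchronize", "hipDeviceSynchronize"},
    {"cudaMemcpy", "hipMemcpy"},
    {"cudaMalloc", "hipMalloc"},
    {"cudaFree", "hipFree"},
};

// Apply every rename to the source text with plain substring replacement.
// A real translator also has to deal with comments, macros and API calls
// that have no one-to-one equivalent; this sketch ignores all of that.
std::string toy_hipify(std::string src) {
    for (const auto& rule : kRenames) {
        std::size_t pos = 0;
        while ((pos = src.find(rule.first, pos)) != std::string::npos) {
            src.replace(pos, rule.first.size(), rule.second);
            pos += rule.second.size();  // continue after the inserted text
        }
    }
    return src;
}
```

Feeding a line such as `cudaMalloc(&d, n);` through `toy_hipify` yields `hipMalloc(&d, n);`. A pure text pass like this knows nothing about the language, which is presumably why a Clang-based translator that actually parses the source exists alongside the Perl one.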

With Fortran code, Hipfort is necessary as an interface library for the GPU kernels, and the port involves more manual work than the automatic translation. In at least one test case they found their HIP version to be 30% faster, though part of that may come down to compiler stack differences, as noted.

LUMI researchers are also exploring AMD's OpenMP device offloading support that was recently upstreamed in LLVM and continues to be developed in the downstream "AOMP" project. So far they have found AOMP to have some performance issues but expect it to improve by the time LUMI is deployed. They expect HIP will ultimately perform better than OpenMP offloading but may use OpenMP for Fortran or other complicated codebases.
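For readers unfamiliar with OpenMP device offload, a minimal C++ sketch of the directive-based style looks roughly like the following. The `saxpy` function is a hypothetical example and is not taken from the presentation. When built without OpenMP offload support the pragma is ignored or falls back to the host, so the loop still produces the same result on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a * x + y. With an offload-capable OpenMP compiler the loop
// is sent to the GPU; otherwise it simply runs on the host.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // map(to: ...) copies x to the device; map(tofrom: ...) copies y to the
    // device before the loop and back to the host afterwards.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The appeal is that the loop body stays ordinary C++ (or Fortran) and only the directive changes, which is the trade-off against HIP described above: less porting effort, with performance depending heavily on the compiler.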

Those curious about the ROCm/HIP experiences so far with European researchers preparing for AMD-powered LUMI can see Georgios Markomanolis' PDF slide deck from his FOSDEM presentation as well as the WebM/VP9 and MP4 video recordings.
