AMD's AOMP 19.0-2 Compiler Brings Zero-Copy For CPU-GPU Unified Shared Memory

  • AMD's AOMP 19.0-2 Compiler Brings Zero-Copy For CPU-GPU Unified Shared Memory

    Phoronix: AMD's AOMP 19.0-2 Compiler Brings Zero-Copy For CPU-GPU Unified Shared Memory

    AMD compiler engineers have released AOMP 19.0-2, the newest version of their downstream LLVM/Clang compiler that carries all of their latest work on OpenMP/AOCC GPU device offloading to Radeon and Instinct hardware. The updated AOMP compiler adds run-time support for zero-copy with CPU-GPU unified shared memory, along with various other new features for this GPU/accelerator-focused compiler...


  • #2
    Sigh…..

    23 years ago, Sony, Toshiba and IBM had a dream of developing a very high-performance chip built around the concept of heterogeneous memory management and transport: the Cell Broadband Engine, or just Cell. Lisa Su, now of AMD but then with IBM, led one of those design teams. Introduced to the mass market with the PlayStation 3, the Cell chip with its pre-AMD HSA memory scheme would go on to power the IBM Roadrunner supercomputer, which in 2008 was the world's fastest at a sustained 1 petaflops, and the first system to reach that mark.

    Three years later, a pre-Lisa Su AMD introduced the first Fusion APUs with Llano, with the stated goal that the combined in-die CPU and GPU would eventually be fed by a zero-copy heterogeneous memory scheme. That came in 2014 with the introduction of the Kaveri APU and the unveiling of HSA and the AMD-led HSA Foundation, formed to promote the design and usage of HSA not only in the x86 world but in the ARM world as well. Kaveri, Carrizo and Bristol Ridge would be the only HSA 1.0-compliant AMD Fusion APUs before Lisa Su came to AMD to save it, and in the process killed off Fusion and HSA with the introduction of Zen.

    Now fast forward: 23 years from the first designs of the Cell Broadband Engine Architecture, 18 years from the introduction of the PlayStation 3, 13 years from the introduction of AMD's Fusion APUs, 10 years from the introduction of Kaveri and HSA, and 7 years from the introduction of Zen that killed off Fusion and HSA, we finally have AMD compiler support for heterogeneous zero-copy between the CPU and the GPU. And it seems, while not exclusively, to be optimized for Instinct. What's wild is that the flag begins with "HSA".

    A while ago Bridgman told me that one of the reasons HSA and zero-copy wouldn't work on Zen was that it lost the unified memory addressing between the CPU and the GPU. I think he said the Fusion APUs, at least Kaveri, Carrizo and Bristol Ridge, had a unified 48-bit address space shared between the CPU and GPU. I wonder if this new compiler support is generic enough, once the flags are applied, to support the old Fusion APUs, particularly the HSA 1.0-compliant ones like Kaveri, Carrizo and Bristol Ridge?



    • #3
      Originally posted by Jumbotron View Post
      Sigh…..
      This is still supported on modern hardware today. The magic here that allows shared virtual address spaces between devices and the CPU is the PCIe PRI extension. It basically allows the CPU's virtual address space to be mirrored to the device using the IOMMU on a per process basis.
      Last edited by agd5f; 27 June 2024, 01:08 PM.



      • #4
        Originally posted by agd5f View Post

        This is still supported on modern hardware today. The magic here that allows shared virtual address spaces between devices and the CPU is the PCIe PRI extension. It basically allows the CPU's virtual address space to be mirrored to the device using the IOMMU on a per process basis.
        That's well and good if your computer has the correct implementation of AMD's IOMMU, but at least on my two old rigs with Bristol Ridge APUs, one an HP and the other a Lenovo, the page translation table implementation was utter shite and broken. I've read that this was a problem with high-end HP computers as well. And with updates and support nonexistent, they became pretty worthless for trying to implement HSA and HSAIL, much less ROCm, before the plug was pulled on those features.



        • #5
          Originally posted by agd5f View Post

          This is still supported on modern hardware today. The magic here that allows shared virtual address spaces between devices and the CPU is the PCIe PRI extension. It basically allows the CPU's virtual address space to be mirrored to the device using the IOMMU on a per process basis.
          The problem is that you have to have both a PCIe and an IOMMU implementation that support it. Dropping HSA and abandoning Fusion was a big mistake. All of the GPGPU work that came before was just abandoned. And now we have this piece of shit ROCm, a half-assed, broken CUDA wanna-be...



          • #6
            Originally posted by duby229 View Post

            The problem is that you have to have both a PCIe and an IOMMU implementation that support it. Dropping HSA and abandoning Fusion was a big mistake. All of the GPGPU work that came before was just abandoned. And now we have this piece of shit ROCm, a half-assed, broken CUDA wanna-be...
            Not sure what you are referring to. Nothing has fundamentally changed in ROCm between the original HSA and today. It still has the same programming model, in that the CPU and GPU share a unified per-process virtual address space.

            PRI was always somewhat limited in that device memory never really fit into the PRI model, so even if you used PRI for system memory, you still had to deal with VRAM. We even supported PRI on the original Zen APUs, despite earlier comments from others to the contrary. We eventually dropped it because it didn't really provide any advantages in our driver stack: we already had to manage the GPU's page tables to handle VRAM, so it was just extra code to maintain. The only difference is that rather than using a single set of page tables for both the CPU and GPU, we have to maintain a separate set of page tables for the GPU, but we needed to do that already for VRAM.

            PRI has not gone anywhere. In fact, in the last year or two more vendors have actually started supporting it on their platforms, years after AMD hardware did. The same programming model is still useful for lots of devices. E.g., the NPU on Phoenix APUs uses PRI to support shared virtual memory between the CPU and NPU.



            • #7
              Originally posted by Jumbotron View Post

              That’s well and good if your computer has the correct implementation of AMD’s IOMMU but at least with my two old rigs with Bristol Ridge APUs, one an HP and the other a Lenovo, the page translation table implementation was utter shite and broken. I’ve read where this was a problem with high end HP computers as well. And with updates and support non existent they became pretty worthless trying to implement HSA and HSAIL, much less ROCm before the plug was pulled on those features.
              The functionality worked well on platforms which were validated and sold for Linux, but at the time Windows didn't support the IOMMU, so neither MS nor most OEMs validated the IOMMU on their platforms; sometimes they didn't even enable it. That's where the vast majority of consumer APUs were used, and unfortunately that led to problems when users tried to use Linux on them. Now that MS enables the IOMMU, it just works.

              Also, as I said before, ROCm has not fundamentally changed since the HSA days.



              • #8
                How did MS get to be the arbiter of whether or not AMD's products function as expected on Linux? Years ago? I don't get it...



                • #9
                  Originally posted by duby229 View Post

                  The problem is that you have to have both a PCIe and an IOMMU implementation that support it. Dropping HSA and abandoning Fusion was a big mistake. All of the GPGPU work that came before was just abandoned. And now we have this piece of shit ROCm, a half-assed, broken CUDA wanna-be...
                  This…



                  • #10
                    Originally posted by agd5f View Post

                    The functionality worked well on platforms which were validated and sold for Linux, but at the time, windows didn't support the IOMMU so neither MS nor most OEMs validated the IOMMU on the platforms; sometimes they didn't even enable the IOMMU on the platform. That's where the vast majority of consumer APUs were used. Unfortunately that led to problems when users tried to use Linux on them. Now that MS enables the IOMMU, it just works.

                    Also, as I said before, ROCm has not fundamentally changed since the HSA days.
                    First, thanks for the explanation above. That explains a lot and provides clarity I've never had before. So those of us pinning our hopes on the Fusion / HSA era were screwed from the beginning without knowing it, thanks once again to Microsuck. But to the last point, maybe that's the problem. There was some reason or reasons the industry didn't run with AMD's HSA at the beginning, even with multiple vendors from both the x86 and ARM worlds on board, as evidenced by the HSA Foundation, and even Oracle. Obviously there was the shakeout from Lisa Su coming in, reprioritizing AMD's vision and roadmap, scrapping Fusion and developing Zen. But AMD was talking up a heterogeneous future even before Kaveri, from the beginning of Fusion with Llano in 2011. It seems to me a lost opportunity due to delayed and ill-developed software and fragmented, chaotic messaging and corporate vision.
