AMD HSA Offloading Support Dropped From The GCC Compiler

  • #21
    Originally posted by __martinj View Post
    You have to distinguish between HSAIL (HSA Intermediate Language) and HSA run-time (and there are other HSA specifications too). HSAIL generation has been dropped from GCC, GCC continues to be able to utilize HSA run-time to offload to GCN devices.
    Thanks for making that clear - I hadn't gone through the patches yet so wasn't sure what "removing HSA" actually meant in this context.

    Originally posted by __martinj View Post
    To my best knowledge LLVM never produced HSAIL and their "use of HSA" is also just using the run-time to run GCN ISA kernels.
    AFAIK we do still generate HSAIL from an LLVM back end (the OpenCL API code uses clang/llvm to convert from OpenCL C to HSAIL in some cases) then we use the proprietary shader compiler to convert from HSAIL to GPU ISA. The all-open OpenCL, HCC and HIP-clang all bypass HSAIL and go straight to GPU ISA as you say.
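
    (To make the HSAIL-vs-run-time distinction concrete: "using the HSA run-time" just means calling into a user-space library. Below is a minimal sketch — an illustration only, assuming a standard ROCm install that ships hsa/hsa.h and libhsa-runtime64 — that initializes the run-time and enumerates the agents it could dispatch kernels to:

    Code:
    // Minimal HSA run-time usage: initialize and enumerate agents.
    // Build (assumption): g++ list_agents.cpp -lhsa-runtime64
    #include <hsa/hsa.h>
    #include <cstdio>

    // Callback invoked once per agent; prints its name and device type.
    static hsa_status_t print_agent(hsa_agent_t agent, void*) {
        char name[64] = {0};
        hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        std::printf("agent: %s (%s)\n", name,
                    type == HSA_DEVICE_TYPE_GPU ? "GPU" : "CPU/other");
        return HSA_STATUS_SUCCESS;
    }

    int main() {
        if (hsa_init() != HSA_STATUS_SUCCESS) return 1;  // bring up run-time
        hsa_iterate_agents(print_agent, nullptr);        // walk all agents
        hsa_shut_down();
        return 0;
    }

    Kernels themselves are loaded through the same library as GCN ISA code objects; no HSAIL is involved anywhere on that path.)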

    Originally posted by guglovich View Post
I waited for Kaveri for several years, used it for a long time, and then ROCm appeared — with Kaveri removed from it. Thanks, AMD, for the marketing and for deceiving customers!
We didn't actually remove Kaveri support from ROCm, but we did break it for a few kernel cycles while reconciling the original APU code paths (using ATS/PRI) with the newer dGPU code paths (using GPUVM). The biggest problem was that we had very little success getting OEMs to add SBIOS support for the HSA functionality that the HSA/ROCm stack requires.

    The shift from APU focus to dGPU focus was definitely rough on Kaveri/HSA early adopters (I got a lot of flak from friends as well) but we are far enough along on large scale dGPU system support to circle back and ramp up efforts to make ROCm more accessible on developer platforms. I'm not sure if that will go back as far as Kaveri but given that most of my home systems are Kaveri-based I'll probably try to get it working myself even if we don't do it officially.
    Last edited by bridgman; 04 August 2020, 05:59 PM.



    • #22
Originally posted by bridgman View Post
The shift from APU focus to dGPU focus was definitely rough on Kaveri/HSA early adopters [...] I'm not sure if that will go back as far as Kaveri but given that most of my home systems are Kaveri-based I'll probably try to get it working myself even if we don't do it officially.

With all due respect, why all this discussion around Kaveri when it comes to HSA/ROCm? Carrizo, and by extension Bristol Ridge, were the first AMD APUs to be fully HSA 1.0 compliant. And yet AMD, who created the HSA Foundation, bypassed Carrizo and Bristol Ridge in favor of Zen-based APUs even though Carrizo and Bristol Ridge are fully HSA 1.0 compliant.

Kaveri is Steamroller-based, with DDR3 memory support only, integrated GCN Gen 2 GPU cores, and vector support only up to AVX. Carrizo, on the other hand, is Excavator-based, with DDR4 memory support, integrated GCN Gen 3 GPU cores (partly based on Volcanic Islands and partly on Pirate Islands), and AVX2 vector support. Bristol Ridge's GPU is clocked higher and delivers more GFLOPS than even the highest-end Kaveri. Going further, Bristol Ridge, which powers my desktop and my laptop, is one up on Carrizo by being an Excavator+ part. AMD applied high-density ("thin") libraries, normally reserved mostly for GPU manufacturing, to the 28nm process it had been using for too long, and got additional power savings and IPC out of the Excavator cores above and beyond Carrizo, which already had an advantage over the Steamroller-based Kaveri.

In many ways Bristol Ridge, as the last, fully optimized Excavator part, was also the pinnacle of the long-running Bulldozer design. Say what you will about the wisdom and efficacy of the Bulldozer core design — AMD, along with GlobalFoundries, ran it as far as it could go before going "clean sheet" with Zen. I and many other users sat out buying into AMD's "Fusion" design scheme until there was a fully HSA 1.0 compliant chip such as Carrizo and particularly Bristol Ridge. I'd like to see AMD "dance with the one who brought 'em", backtrack, and make ROCm fully usable for Carrizo and Bristol Ridge like they have with Zen-based APUs.

**Additional** Just to put a finer point on the performance increase of Bristol Ridge over Kaveri: not only is Bristol Ridge HSA 1.0 compliant where Kaveri is not, but here are the CPU and GPU speeds of each part. This does NOT take into account the additional IPC improvements that Bristol Ridge has over both Carrizo and Kaveri. Notice how each part is built on GlobalFoundries' 28nm node, yet Bristol Ridge is HIGHER performance, particularly in the GPU, because of the high-density "thin" libraries GloFo used for the end of the Bulldozer run with the Excavator+ based Bristol Ridge.

Kaveri - Pro A10-8850B _ CPU Base Clock 3.9 GHz _ Turbo Clock 4.1 GHz _ GPU Clock 800 MHz _ 819 GFLOPS _ TDP 95 Watts
Bristol Ridge - Pro A12-9800 _ CPU Base Clock 3.8 GHz _ Turbo Clock 4.2 GHz _ GPU Clock 1108 MHz _ 1134 GFLOPS _ TDP 65 Watts

Here's how the above looks when you take a Kaveri with a TDP of 65 Watts to match the Bristol Ridge.

Kaveri - A10-8750B _ CPU Base Clock 3.6 GHz _ Turbo Clock 4.0 GHz _ GPU Clock 757 MHz _ 775 GFLOPS _ TDP 65 Watts
Bristol Ridge - Pro A12-9800 _ CPU Base Clock 3.8 GHz _ Turbo Clock 4.2 GHz _ GPU Clock 1108 MHz _ 1134 GFLOPS _ TDP 65 Watts
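
(As a sanity check on those GFLOPS figures — assuming all three parts have the full 512 GCN shader complement: peak single-precision throughput works out to shaders × 2 FLOPs/clock × GPU clock, so 512 × 2 × 0.757 GHz ≈ 775 GFLOPS for the A10-8750B and 512 × 2 × 1.108 GHz ≈ 1135 GFLOPS for the A12-9800, matching the numbers above.)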

      So...Bristol Ridge has MORE IPC, HIGHER CPU clock speeds, HIGHER Turbo Clock speeds, HIGHER GPU clock speeds, HIGHER GPU processing power as measured in GFLOPS, is the PINNACLE of the Bulldozer Architecture before AMD moved on to Zen AND, most of all, is HSA 1.0 compliant while Kaveri is NOT. Come on AMD....show some ROCm love for Carrizo / Bristol Ridge.
      Last edited by Jumbotron; 04 August 2020, 08:53 PM. Reason: Additional Edit. The highest end Kaveri had the same GPU core count as Carrizo and Bristol Ridge. I said otherwise. Corrected now.



      • #23
        Originally posted by phred14 View Post
        A long, long time ago I was working for IBM, and when the Vector Facility became available to us I was looking forward to doing circuit simulation with it. It turned out to be a fairly minor performance improvement, nothing like what we were hoping for. It was explained to us as the "gather-scatter problem". You had to gather the information together out of a sparse matrix and put it into a form that could be pushed into the Vector Facility, and then the results had to be scattered back into the sparse matrix. There was so much overhead that it wasn't a big win. It would seem to me that something similar has to be done for GPU computing... Presuming the entire problem can't be done just by the GPU alone, you wind up with significant overhead getting in and out of the GPU.

        I saw HSA as a way around this, and a case where Unified Memory might actually be a win. But that also requires that the code understand that it's running on a UMA system and knows that it doesn't have to push data from place to place prior to working with it. If the "normal" model remains the separate GPU card, I could see UMA systems running the same way, moving data needlessly from one part of main memory to another, just to look like a normal GPU system.

        I never got a good answer to this question.
        Good news / bad news...

        The bad news is that each of the major vendors took a different approach to unified memory, with each approach having some advantages and disadvantages over the other, and that none of the early models really survived in their original forms. The good news is that over the last few years everyone has been re-aligning on a fairly common model.

        Programming for a unified memory model is definitely different from programming for a model with separate CPU and GPU memory and that will probably never change, ie there will probably always be two slightly different code paths for separate memory and unified memory. The challenge for the industry is to achieve enough standardization in terms of memory models that the same unified memory code path can be used for multiple vendors with only an acceptably small number of runtime differences.
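
        To make the two-code-paths point concrete, here is a rough HIP sketch (an illustration only — the attribute query is from the public HIP API, and the kernel and sizes are placeholders). The unified path allocates once and lets both sides touch it; the separate path needs a staging buffer and explicit copies:

        Code:
        // Sketch: one kernel, two memory-management paths.
        // Assumes a ROCm/HIP install; build with hipcc.
        #include <hip/hip_runtime.h>

        __global__ void scale(float* x, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= 2.0f;
        }

        int main() {
            const int n = 1 << 20;
            int dev = 0, managed = 0;
            hipGetDevice(&dev);
            hipDeviceGetAttribute(&managed, hipDeviceAttributeManagedMemory, dev);

            float* x = nullptr;
            if (managed) {
                // Unified-memory path: one allocation, visible to CPU and GPU.
                hipMallocManaged((void**)&x, n * sizeof(float));
                for (int i = 0; i < n; ++i) x[i] = 1.0f;
                scale<<<(n + 255) / 256, 256>>>(x, n);
                hipDeviceSynchronize();  // results directly visible to the CPU
            } else {
                // Separate-memory path: host buffer plus explicit copies.
                float* h = new float[n];
                for (int i = 0; i < n; ++i) h[i] = 1.0f;
                hipMalloc((void**)&x, n * sizeof(float));
                hipMemcpy(x, h, n * sizeof(float), hipMemcpyHostToDevice);
                scale<<<(n + 255) / 256, 256>>>(x, n);
                hipMemcpy(h, x, n * sizeof(float), hipMemcpyDeviceToHost);
                delete[] h;
            }
            hipFree(x);
            return 0;
        }

        The standardization question is whether that if/else ever collapses to just the first branch on every vendor's hardware.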

        I'm not sure everyone has accepted that challenge yet, although we have.

        In our case we went all-in on the APU approach initially, although it's fair to argue that was a case of the technology getting ahead of the market, and we ended up having to drop back and offer a less spiffy programming model in order to run on the dGPUs of the time. What everyone is trying to do with unified memory today is to duplicate what we did on APUs (using ATS/PRI from IOMMUv2 via Address Translation Cache/ATC on the GPU) without requiring IOMMUv2.

        After a bunch of bumping around we all settled on an HMM implementation that could go upstream (Jerome Glisse deserves huge credit for having the patience to get it there) and I think you'll find that everyone settles on HMM as the foundation for unified memory. There will be some more bumping around as all the vendors get solid recoverable page fault support into their GPUs but I think that will settle down fairly quickly.

        Originally posted by Jumbotron View Post
        With all due respect, why all this discussion around Kaveri when it comes to HSA/ROCm?
        Finally, an easy question

        The reason for "all this discussion about Kaveri" (1 question, 1 answer) was that guglovich asked about it and I happened to have a few Kaveri systems myself.

        Originally posted by Jumbotron View Post
        Carrizo and by extension Bristol Ridge were the first AMD APU's to be fully HSA 1.0 compliant. And yet AMD, who created the HSA Foundation, bypassed Carrizo and Bristol Ridge for Zen base APUs even though Carrizo and Bristol Ridge are fully HSA 1.0 compliant.
That surprises me - I thought we put more polish into Carrizo than we did into Kaveri and Raven Ridge. Can you give me more specifics? Carrizo was effectively the last APU we shipped before the market focus shifted from APU to dGPU. My impression was that it was pretty well supported.

        Originally posted by Jumbotron View Post
        So...Bristol Ridge has MORE IPC, HIGHER CPU clock speeds, HIGHER Turbo Clock speeds, HIGHER GPU clock speeds, HIGHER GPU processing power as measured in GFLOPS, is the PINNACLE of the Bulldozer Architecture before AMD moved on to Zen AND, most of all, is HSA 1.0 compliant while Kaveri is NOT. Come on AMD....show some ROCm love for Carrizo / Bristol Ridge.
In fairness, the primary difference between Kaveri and Carrizo in terms of HSA compliance was CWSR (compute wave save/restore), which only comes into play if you have long-running shader programs and need to maintain interactivity / quality of service. In almost every other respect Kaveri was the big jump in terms of HW functionality, with 48-bit GPU addressing (to match the CPU) and unified addressing, MEC blocks adding effectively unlimited compute queues and HW scheduling, and ATC + IOMMUv2 with ATS/PRI support adding demand-paged unified memory support.

For what it's worth, everything I said about revisiting APU support for developers and casual users applies just as much to Carrizo & Bristol Ridge, since they are more recent parts as you said. Unfortunately the "I don't know how far back we will go" part applies as well.

One change we made recently (starting with Renoir, I think) was using dGPU code paths for integrated GPUs rather than the original ATC/IOMMUv2 path for memory. The downside is that you lose the ability to malloc memory and then have the GPU access it without an API call to make it accessible, but the upside is that the same code can run unchanged on dGPU and APU. The ability to have the GPU immediately access malloc'ed memory will hopefully come to dGPUs over time via HMM integration (see phred14's question), but I suspect it would be cleaner to stay with ATC/IOMMUv2 paths for unified memory on APUs.
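
        As a rough illustration of that extra API call on the dGPU-style path (the calls are from the public HIP API; the surrounding program is made up):

        Code:
        // Sketch: making plain malloc'ed memory visible to the GPU on the
        // dGPU-style code path, where it is not automatically accessible.
        #include <hip/hip_runtime.h>
        #include <cstdlib>

        int main() {
            const size_t bytes = 1 << 20;
            float* buf = static_cast<float*>(std::malloc(bytes));

            // The extra step: pin and map the allocation for device access.
            hipHostRegister(buf, bytes, hipHostRegisterDefault);

            void* dev_view = nullptr;
            hipHostGetDevicePointer(&dev_view, buf, 0);  // GPU-visible alias

            // ... launch kernels that read/write dev_view here ...

            hipHostUnregister(buf);
            std::free(buf);
            return 0;
        }

        On the ATC/IOMMUv2 path none of that boilerplate is needed - the GPU can simply dereference the malloc'ed pointer.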
        Last edited by bridgman; 05 August 2020, 01:46 PM.



        • #24
Originally posted by bridgman View Post
That surprises me - I thought we put more polish into Carrizo than we did into Kaveri and Raven Ridge. Can you give me more specifics? Carrizo was effectively the last APU we shipped before the market focus shifted from APU to dGPU. My impression was that it was pretty well supported.

Thanks for the reply bridgman. Here is what I found on RadeonOpenCompute's GitHub page concerning hardware and software support. I've edited the "Limited or No Support" portion of the page to include only the relevant bits concerning Carrizo / Bristol Ridge support — or, more to the point, the lack of support. As you also brought up Raven Ridge support, I included that part as well.

Finally, to your thought that Carrizo "was effectively the last APU we shipped": that is technically not correct. Bristol Ridge and its low-power version Stoney Ridge were the last Bulldozer-based APUs before AMD went clean-sheet with the Zen-core APUs with integrated Vega GPUs. Now, to be clear, Bristol Ridge IS in fact a Carrizo part — just optimized by GlobalFoundries using their high-density "thin" libraries to increase the "dark silicon" space between transistors in order to clock up the CPU and GPU cores, with additional P-states added so that Bristol Ridge would not clock down as much as Carrizo under thermal stress. However, the BIG difference between Carrizo and Bristol Ridge is that Carrizo was ONLY a mobile part. AMD decided to brush off Carrizo, optimize it with GloFo's high-density libraries, and release it as Bristol Ridge, a desktop CPU, when there was a delay in getting Zen out the door after Lisa Su became CEO of AMD, scrapped Bulldozer as a continuing core architecture, and started over with Zen.

In all seriousness, I have both an HP desktop with a Bristol Ridge A12-9800 and a Lenovo IdeaPad laptop with a Bristol Ridge A12-9700P. I know you only have Kaveri-based computers to work on at home. I would seriously entertain the idea of shipping either my desktop or laptop (my expense) to you for getting Carrizo / Bristol Ridge APUs fully ROCm supported, if you thought that a worthy pursuit. As I said, I would totally pick up the expense of shipping to and from you. Let me know.

Thanks again for your reply and all the work you do on behalf of us on the "Red" Team and Linux in general.



          Not supported or limited support under ROCm

          Limited support
          • AMD "Carrizo" and "Bristol Ridge" APUs are enabled to run OpenCL, but do not yet support HCC, HIP, or our libraries built on top of these compilers and runtimes.
            • As of ROCm 2.1, "Carrizo" and "Bristol Ridge" require the use of upstream kernel drivers.
• In addition, various "Carrizo" and "Bristol Ridge" platforms may not work due to OEM and ODM choices when it comes to key configuration parameters, such as inclusion of the required CRAT tables and IOMMU configuration parameters in the system BIOS.
            • Before purchasing such a system for ROCm, please verify that the BIOS provides an option for enabling IOMMUv2 and that the system BIOS properly exposes the correct CRAT table. Inquire with your vendor about the latter.
          • AMD "Raven Ridge" APUs are enabled to run OpenCL, but do not yet support HCC, HIP, or our libraries built on top of these compilers and runtimes.
            • As of ROCm 2.1, "Raven Ridge" requires the use of upstream kernel drivers.
• In addition, various "Raven Ridge" platforms may not work due to OEM and ODM choices when it comes to key configuration parameters, such as inclusion of the required CRAT tables and IOMMU configuration parameters in the system BIOS.
            • Before purchasing such a system for ROCm, please verify that the BIOS provides an option for enabling IOMMUv2 and that the system BIOS properly exposes the correct CRAT table. Inquire with your vendor about the latter.
          Not supported
          • "Tonga", "Iceland", "Vega M", and "Vega 12" GPUs are not supported in ROCm 2.9.x
          • We do not support GFX8-class GPUs (Fiji, Polaris, etc.) on CPUs that do not have PCIe 3.0 with PCIe atomics.
            • As such, we do not support AMD Carrizo and Kaveri APUs as hosts for such GPUs.



          • #25
            Originally posted by Jumbotron View Post
Finally, to your thought that Carrizo "was effectively the last APU we shipped": that is technically not correct. Bristol Ridge and its low-power version Stoney Ridge were the last Bulldozer-based APUs before AMD went clean-sheet with the Zen-core APUs with integrated Vega GPUs.
            Sorry, that comment was a bit ambiguous. I should have said "the last APU we shipped before the HSA market focus shifted...", ie before we implemented an HSA subset on dGPUs (full HSA minus access to unpinned memory), added tools to simplify porting from CUDA, and launched the ROCm brand name. We kept making APUs as you say, but outside of the embedded space our business units were focused more on gaming and general desktop markets.

            Originally posted by Jumbotron View Post
In all seriousness, I have both an HP desktop with a Bristol Ridge A12-9800 and a Lenovo IdeaPad laptop with a Bristol Ridge A12-9700P. I would seriously entertain the idea of shipping either my desktop or laptop (my expense) to you for getting Carrizo / Bristol Ridge APUs fully ROCm supported, if you thought that a worthy pursuit. As I said, I would totally pick up the expense of shipping to and from you. Let me know.
            Thanks, but I'm pretty sure we have Carrizo systems at the office - it's just my home systems that are mostly Kaveri-based, along with a newer Ryzen box for builds and dGPU work. I'm not sure about laptops but will find out.

            One challenge I mentioned earlier is that we had little success getting the required SBIOS changes adopted by laptop vendors. The primary missing thing was the CRAT ACPI table, although IIRC there are several other changes required.
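
            (For anyone wanting to check whether their SBIOS did the right thing: the amdkfd driver exposes the topology it built — derived from the CRAT table on APUs — under sysfs. A small sketch, assuming the upstream sysfs layout; if no node reports a non-zero simd_count, the GPU was not enumerated for compute:

            Code:
            // Sketch: report KFD topology nodes; a GPU usable for compute
            // should appear as a node with simd_count > 0.
            // Build (assumption): g++ -std=c++17 kfd_check.cpp
            #include <filesystem>
            #include <fstream>
            #include <iostream>
            #include <string>

            int main() {
                namespace fs = std::filesystem;
                const fs::path nodes = "/sys/class/kfd/kfd/topology/nodes";
                if (!fs::exists(nodes)) {
                    std::cout << "no KFD topology - is amdkfd loaded?\n";
                    return 1;
                }
                for (const auto& node : fs::directory_iterator(nodes)) {
                    std::ifstream props(node.path() / "properties");
                    std::string key;
                    long val;
                    while (props >> key >> val) {  // "name value" lines
                        if (key == "simd_count")
                            std::cout << node.path().filename().string() << ": "
                                      << (val > 0 ? "GPU node" : "CPU-only node")
                                      << " (simd_count " << val << ")\n";
                    }
                }
                return 0;
            }
            )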
            Last edited by bridgman; 05 August 2020, 03:04 PM.



            • #26
Originally posted by bridgman View Post
One challenge I mentioned earlier is that we had little success getting the required SBIOS changes adopted by laptop vendors. The primary missing thing was the CRAT ACPI table, although IIRC there are several other changes required.

Thanks again for the reply. Yeah... your last sentence is a real pain point. I understand to an extent the turmoil AMD was going through towards the end of the Bulldozer era, with a change of hands in upper management, design teams, and even the CEO. AMD was in a bit of a mess before Lisa and the gang turned things around. There were no Carrizo desktop CPUs and, worse yet, no server parts. Even with Bristol Ridge coming in as a stopgap desktop part before Zen could properly come out the door, Carrizo and Bristol Ridge CPUs got put into some really crap consumer PCs by companies that didn't care much whether the BIOS they used had proper IOMMUv2 support, not to mention the required CRAT and ACPI tables. And AMD probably had too much on their plate at the time to really care. That is ESPECIALLY true with my HP Pavilion desktop running the highest-end Bristol Ridge APU: it throws out, depending on what version of Ubuntu I'm running, anywhere from 8-12 ACPI errors on boot. And my Lenovo IdeaPad, which has the 9700P mobile Bristol Ridge, throws out a message about not knowing whether IOMMUv2 is activated, but at least it doesn't throw any ACPI errors. Unfortunately, when I go into the Lenovo's BIOS to see if IOMMU is on, there is no option for it at all.

**Sigh** I guess I'm just going to hold on to my Bristol Ridge babies until either they die first or Zen 4 gets released with 3rd Gen Infinity Architecture fully baked, along with a fully baked HMM in the Linux kernel. I may be wrong to see it this way, but with Zen 4, 3rd Gen Infinity Architecture and HMM, the dream of AMD Fusion from back in 2011, along with HSA, will at long last be realized without too many "special hoops". It seems to me that HSA ended up being the compute version of AMD's 3DNow!: a kinda cool concept, but ultimately never broadly adopted by the industry.

Thanks again for the info. And sorry if I seemed a bit too pedantic in my posts. I was once a Linux newbie (and still consider myself one to an extent), so my posts are as much for people new to Linux and the CPU architecture world who might be lurking around sites like Phoronix just gaining knowledge about the whole thing. I know I did this for a couple of years before even joining Phoronix. Your posts have always been enlightening and useful, and I personally thank you for that! Cheers!



              • #27
bridgman how can this code be used this year? Did a fork appear, or can a patch be applied? And is it advantageous to install an old LLVM, for example? I really haven't tried it. If it is faster than the latest version without HSA, I would install it.



                • #28
                  Originally posted by guglovich View Post
bridgman how can this code be used this year? Did a fork appear, or can a patch be applied? And is it advantageous to install an old LLVM, for example? I really haven't tried it. If it is faster than the latest version without HSA, I would install it.
I'm not exactly sure what you are asking; both LLVM and GCC have native backend support for various AMD GPUs. HSAIL was an intermediate language that was never really used much.
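
                  For what it's worth, here's a minimal sketch of what using GCC's native GCN support looks like today, via OpenMP target offload (assuming a GCC built with the amdgcn-amdhsa offload target, e.g. a distro gcc-offload-amdgcn package; details vary by distro and GPU):

                  Code:
                  // saxpy.cpp - OpenMP target offload through GCC's GCN back end.
                  // Build (assumption): g++ -fopenmp -foffload=amdgcn-amdhsa saxpy.cpp
                  #include <cstdio>
                  #include <vector>

                  int main() {
                      const int n = 1 << 20;
                      std::vector<float> x(n, 1.0f), y(n, 2.0f);
                      const float a = 3.0f;
                      float* xp = x.data();
                      float* yp = y.data();

                      // The target region is compiled for the GPU and launched
                      // through the HSA run-time by libgomp's GCN plugin.
                      #pragma omp target teams distribute parallel for \
                          map(to: xp[0:n]) map(tofrom: yp[0:n])
                      for (int i = 0; i < n; ++i)
                          yp[i] = a * xp[i] + yp[i];

                      std::printf("y[0] = %f\n", yp[0]);  // expect 5.0
                      return 0;
                  }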



                  • #29
                    Originally posted by agd5f View Post

I'm not exactly sure what you are asking; both LLVM and GCC have native backend support for various AMD GPUs. HSAIL was an intermediate language that was never really used much.
But is the support still there? I just found out that there is AMD HSA support in pocl, a working OpenCL implementation, and I would like to understand which backend my APU can be used with and for what tasks.
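
                    (For reference, a quick way to see which backend/device an OpenCL stack like pocl actually exposes is to enumerate platforms and devices — a small sketch, assuming an OpenCL ICD loader is installed; pocl's HSA driver, where built in, can reportedly be selected with the POCL_DEVICES=hsa environment variable:

                    Code:
                    // Sketch: list OpenCL platforms/devices to see what pocl exposes.
                    // Build (assumption): g++ list_cl.cpp -lOpenCL
                    #include <CL/cl.h>
                    #include <cstdio>

                    int main() {
                        cl_uint nplat = 0;
                        clGetPlatformIDs(0, nullptr, &nplat);
                        cl_platform_id plats[8];
                        if (nplat > 8) nplat = 8;
                        clGetPlatformIDs(nplat, plats, nullptr);
                        for (cl_uint p = 0; p < nplat; ++p) {
                            char pname[256] = {0};
                            clGetPlatformInfo(plats[p], CL_PLATFORM_NAME,
                                              sizeof pname, pname, nullptr);
                            cl_uint ndev = 0;
                            clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
                            cl_device_id devs[8];
                            if (ndev > 8) ndev = 8;
                            clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, ndev, devs, nullptr);
                            for (cl_uint d = 0; d < ndev; ++d) {
                                char dname[256] = {0};
                                clGetDeviceInfo(devs[d], CL_DEVICE_NAME,
                                                sizeof dname, dname, nullptr);
                                std::printf("%s: %s\n", pname, dname);
                            }
                        }
                        return 0;
                    }
                    )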
