AMD HSA Offloading Support Dropped From The GCC Compiler


  • #11
    Originally posted by oleid View Post

    HSA is used as well, but not so much with GCC, it would seem. AMD's own SDK is clang-based. I'm using it with tensorflow; works fine.
    I had been under the impression that tensorflow==cuda, and that maybe there was a quite convoluted way to get tensorflow on top of AMD hardware with ROCm. Can you comment on how hard this is to do in practice? (Pointers would be even nicer.) I'm at a point where I think it would behoove me to start learning tensorflow, and am looking to put together some hardware, figuring my 2013-era stuff is inadequate.

    Comment


    • #12
      Originally posted by ms178 View Post
      I don't have a crystal ball, but what about future HPC-capable APUs?! Heterogeneous computing (and HSA) might make a comeback with them; it would be great to have some sort of compiler support for exploiting all of their capabilities, and the HSA stack could be a base. But maybe with oneAPI Level Zero (or through other means) there are better ways to start, with better chances of succeeding in the market?!

      I haven't heard anything about AMD's software strategy for heterogeneous computing in a long time (in fact, their HSA summit in 2012 was the last big event around it). And with all of their recent design wins in the server market, it would be great to hear what they plan for the future to make better use of their capabilities on the software side (on Linux and Windows, please).
      A long, long time ago I was working for IBM, and when the Vector Facility became available to us I was looking forward to doing circuit simulation with it. It turned out to be a fairly minor performance improvement, nothing like what we were hoping for. It was explained to us as the "gather-scatter problem". You had to gather the information together out of a sparse matrix and put it into a form that could be pushed into the Vector Facility, and then the results had to be scattered back into the sparse matrix. There was so much overhead that it wasn't a big win. It would seem to me that something similar has to be done for GPU computing... Presuming the entire problem can't be done just by the GPU alone, you wind up with significant overhead getting in and out of the GPU.

      I saw HSA as a way around this, and a case where Unified Memory might actually be a win. But that also requires that the code understand that it's running on a UMA system and know that it doesn't have to push data from place to place prior to working with it. If the "normal" model remains the separate GPU card, I could see UMA systems running the same way, moving data needlessly from one part of main memory to another, just to look like a normal GPU system.

      I never got a good answer to this question.

      Comment


      • #13
        Originally posted by phred14 View Post

        A long, long time ago I was working for IBM, and when the Vector Facility became available to us I was looking forward to doing circuit simulation with it. It turned out to be a fairly minor performance improvement, nothing like what we were hoping for. It was explained to us as the "gather-scatter problem". You had to gather the information together out of a sparse matrix and put it into a form that could be pushed into the Vector Facility, and then the results had to be scattered back into the sparse matrix. There was so much overhead that it wasn't a big win. It would seem to me that something similar has to be done for GPU computing... Presuming the entire problem can't be done just by the GPU alone, you wind up with significant overhead getting in and out of the GPU.

        I saw HSA as a way around this, and a case where Unified Memory might actually be a win. But that also requires that the code understand that it's running on a UMA system and know that it doesn't have to push data from place to place prior to working with it. If the "normal" model remains the separate GPU card, I could see UMA systems running the same way, moving data needlessly from one part of main memory to another, just to look like a normal GPU system.

        I never got a good answer to this question.
        AMD actually has several recent patents on near-memory scatter-gather implementations... probably to be implemented in a smart interposer or an enhanced HBM buffer die.

        Comment


        • #14
          Originally posted by cb88 View Post

          AMD actually has several recent patents on near-memory scatter-gather implementations... probably to be implemented in a smart interposer or an enhanced HBM buffer die.
          Actually the guy who explained the "gather-scatter problem" to us in IBM days worked for AMD, last I saw. If you have a number or title on any of those patents, I'd be curious to look them up and see if he's on any of them.

          Comment


          • #15
            Originally posted by phred14 View Post

            A long, long time ago I was working for IBM, and when the Vector Facility became available to us I was looking forward to doing circuit simulation with it. It turned out to be a fairly minor performance improvement, nothing like what we were hoping for. It was explained to us as the "gather-scatter problem". You had to gather the information together out of a sparse matrix and put it into a form that could be pushed into the Vector Facility, and then the results had to be scattered back into the sparse matrix. There was so much overhead that it wasn't a big win.
            Which is why "real" vector ISAs, starting with the Cray-1 back in the 1970s, support scatter-gather memory ops in order to efficiently vectorize sparse-matrix-style problems.

            It would seem to me that something similar has to be done for GPU computing...
            You can view scatter-gather ops as a way to extract memory-level parallelism (MLP). GPUs do something similar with the SIMT programming model: massive numbers of lightweight threads able to expose lots of MLP.
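
            To make the gather-scatter pattern concrete, here is a minimal Python/NumPy sketch of a sparse matrix-vector product done the "gather, compute densely, scatter back" way (the matrix data is made up purely for illustration):

            import numpy as np

            # A tiny sparse matrix in COO form: (row, col, value) triples.
            rows = np.array([0, 0, 1, 2, 2, 2])
            cols = np.array([1, 3, 0, 1, 2, 3])
            vals = np.array([5.0, 2.0, 3.0, 1.0, 4.0, 6.0])
            x = np.array([1.0, 2.0, 3.0, 4.0])  # dense input vector

            # Gather: pull the needed x entries into a dense, vector-friendly layout.
            gathered = x[cols]

            # The dense part vectorizes trivially (vector unit, SIMT threads, ...).
            products = vals * gathered

            # Scatter: accumulate the partial products back into the result rows.
            y = np.zeros(3)
            np.add.at(y, rows, products)  # scatter-add; handles repeated row indices
            print(y)  # [18.  3. 38.]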

            Comment


            • #16
              Underfox on Twitter posts a lot of AMD patents; this was one of them:

              Complete Patent Searching Database and Patent Data Analytics Services.


              Comment


              • #17
                Originally posted by phred14 View Post

                I had been under the impression that tensorflow==cuda, and that maybe there was a quite convoluted way to get tensorflow on top of AMD hardware with ROCm. Can you comment on how hard this is to do in practice? (Pointers would be even nicer.) I'm at a point where I think it would behoove me to start learning tensorflow, and am looking to put together some hardware, figuring my 2013-era stuff is inadequate.
                Yes, your hardware will probably be too old. But for simple stuff, using only your CPU works just fine; more than enough for learning purposes.

                Nevertheless, if you happen to find a better GPU, here is what I did: I'm using docker on ArchLinux, as I didn't want to break my setup during upgrades.

                AMD has a few notes on their docker images here.

                Basically:
                • Use a very recent mainline kernel, or use the non-mainline module with kernel 5.4 (the current LTS kernel) via DKMS (e.g. the ArchLinux packages linux-lts and rock-dkms-bin).
                • Install Docker.
                • Install rocm/tensorflow via Docker.
                The mainline kernel didn't work for me, hence I'm using the DKMS module. But you might have better luck.

                I use the following command to map the current folder into the Docker container as /data:

                sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --volume $PWD:/data yourname/tensorflow-rocm

                And this is my personal Dockerfile; it has a few more goodies I need:

                FROM rocm/tensorflow

                # Initialize the image we are working with
                RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install \
                python3-matplotlib \
                protobuf-compiler python-pil python-lxml python-tk
                RUN pip3 install https://github.com/microsoft/onnxcon.../v1.6.5.tar.gz
                RUN pip3 install https://github.com/onnx/keras-onnx/a.../v1.6.5.tar.gz
                RUN pip3 install https://github.com/onnx/tensorflow-o...ive/master.zip
                RUN pip3 install onnxruntime pandas tensorflow_model_optimization
                RUN pip3 install Cython contextlib2 jupyter matplotlib

                RUN apt-get clean && rm -rf /var/lib/apt/lists/*
                Building the image from the Dockerfile is done the usual way:

                docker build --tag=yourname/tensorflow-rocm:latest PATH/TO/DOCKERFILE
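
                Once the container is running, a quick sanity check from inside it shows whether the ROCm build of TensorFlow actually sees the GPU (assuming the rocm/tensorflow image ships TensorFlow 2.x; a 1.x image would need a different check):

                import tensorflow as tf

                # An empty list here means the GPU is not visible to TensorFlow,
                # e.g. a missing /dev/kfd mapping or an unsupported kernel/driver.
                print(tf.__version__)
                print(tf.config.list_physical_devices('GPU'))

                # Log device placement for a small matmul to see whether it runs
                # on the GPU or silently falls back to the CPU.
                tf.debugging.set_log_device_placement(True)
                a = tf.random.uniform((1024, 1024))
                print(tf.reduce_sum(tf.matmul(a, a)))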

                Comment


                • #18
                  Originally posted by szymon_g View Post
                  RIP HSA.
                  Shame, on paper it looked great.
                  It's really the "HSA" name and associated cross-vendor standardization initiative that haven't caught on, but the software stack and most of the associated hardware (everything except ATS/PRI from IOMMUv2) are still very active; it's just called the ROCm stack now.

                  The HSA/ROCm stack was extended to include dGPUs, which did not have demand-paged memory at the time and could not assume IOMMUv2 support, since other CPU vendors do not expose ATS/PRI functionality on their PCIe lanes. We recently started using the dGPU paths on APUs as well, in order to make it easier for APUs to serve as development vehicles for dGPUs. I believe Renoir was the first to use the dGPU paths, but I need to check.

                  Comment


                  • #19
                    Originally posted by phred14 View Post
                    I had been under the impression that tensorflow==cuda, and that maybe there was a quite convoluted way to get tensorflow on top of AMD hardware with ROCm. Can you comment on how hard this is to do in practice? (Pointers would be even nicer.) I'm at a point where I think it would behoove me to start learning tensorflow, and am looking to put together some hardware, figuring my 2013-era stuff is inadequate.

                    Comment


                    • #20
                      I waited several years for Kaveri, used it for a long time, and then ROCm appeared with Kaveri dropped from support. Thanks, AMD, for the marketing and for deceiving customers!

                      Comment
