AMD HSA Offloading Support Dropped From The GCC Compiler
Originally posted by ms178:
I don't have a crystal ball, but what about future HPC-capable APUs? Heterogeneous computing (and HSA) might make a comeback with them; it would be great to have some sort of compiler support for exploiting all of their capabilities, and the HSA stack could be a base. But maybe with oneAPI Level Zero (or through other means) there are better ways to start, with better chances of succeeding in the market?
I haven't heard about AMD's software strategy for heterogeneous computing in a long time (in fact, their HSA summit in 2012 was the last big event around it). And with all of their recent design wins in the server market, it would be great to hear what they plan for the future to make better use of their hardware from the software side (on Linux and Windows, please).
Originally posted by phred14:
A long, long time ago I was working for IBM, and when the Vector Facility became available to us I was looking forward to doing circuit simulation with it. It turned out to be a fairly minor performance improvement, nothing like what we were hoping for. It was explained to us as the "gather-scatter problem": you had to gather the information together out of a sparse matrix and put it into a form that could be pushed into the Vector Facility, and then the results had to be scattered back into the sparse matrix. There was so much overhead that it wasn't a big win. It would seem to me that something similar has to be done for GPU computing... Presuming the entire problem can't be done by the GPU alone, you wind up with significant overhead getting in and out of the GPU.
I saw HSA as a way around this, and a case where Unified Memory might actually be a win. But that also requires that the code understand that it's running on a UMA system and knows that it doesn't have to push data from place to place prior to working with it. If the "normal" model remains the separate GPU card, I could see UMA systems running the same way, moving data needlessly from one part of main memory to another, just to look like a normal GPU system.
I never got a good answer to this question.
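The gather-scatter overhead phred14 describes can be sketched in a few lines (a hypothetical toy example in Python, not the original vector code): a sparse row is packed into dense buffers so a vector unit could process it, and the results are unpacked back into sparse form afterwards — the packing and unpacking is pure overhead around the one operation that actually gets accelerated.

```python
def gather(sparse_row):
    """Pack a {column: value} sparse row into parallel dense arrays."""
    cols = sorted(sparse_row)
    vals = [sparse_row[c] for c in cols]
    return cols, vals

def vector_scale(vals, factor):
    """The only part a vector unit (or GPU) would actually accelerate."""
    return [v * factor for v in vals]

def scatter(cols, vals):
    """Unpack dense results back into sparse {column: value} form."""
    return dict(zip(cols, vals))

row = {2: 1.5, 7: -3.0, 11: 0.25}   # sparse row: column index -> value
cols, vals = gather(row)            # overhead: gather into dense buffers
scaled = vector_scale(vals, 2.0)    # the "fast" vectorized part
result = scatter(cols, scaled)      # overhead: scatter results back
print(result)                       # {2: 3.0, 7: -6.0, 11: 0.5}
```

On a unified-memory system the dense staging copies could in principle be skipped entirely, which is the win phred14 hoped HSA would deliver.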
Originally posted by cb88:
AMD actually has several recent patents on near-memory scatter-gather implementations... probably to be implemented in a smart interposer or an enhanced HBM buffer die.
Originally posted by phred14:
I had been under the impression that TensorFlow == CUDA, and that maybe there was a quite convoluted way to get TensorFlow on top of AMD hardware with ROCm. Can you comment on how hard this is to do in practice? (Pointers would be even nicer.) I'm at a point where I think it would behoove me to start learning TensorFlow, and I'm looking to put together some hardware, figuring my 2013-era stuff is inadequate.
Nevertheless, if you happen to find a better GPU, here is what I did: I'm using Docker on Arch Linux, as I didn't want to break my setup during upgrades.
AMD has a few notes on their docker images here.
Basically:
- Use a very recent mainline kernel, or use the out-of-tree module with kernel 5.4 (the current LTS kernel) via DKMS (e.g. the Arch Linux packages linux-lts and rock-dkms-bin).
- Install Docker.
- Install rocm/tensorflow via Docker.
I use the following script to map the current folder to a docker container:
sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --volume "$PWD":/data yourname/tensorflow-rocm
And this is my personal Dockerfile; it has a few more goodies I need:
FROM rocm/tensorflow
# Initialize the image we are working with
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install \
python3-matplotlib \
protobuf-compiler python-pil python-lxml python-tk
RUN pip3 install https://github.com/microsoft/onnxcon.../v1.6.5.tar.gz
RUN pip3 install https://github.com/onnx/keras-onnx/a.../v1.6.5.tar.gz
RUN pip3 install https://github.com/onnx/tensorflow-o...ive/master.zip
RUN pip3 install onnxruntime pandas tensorflow_model_optimization
RUN pip3 install Cython contextlib2 jupyter matplotlib
RUN apt-get clean && rm -rf /var/lib/apt/lists/*
Build the image with:
docker build --tag=yourname/tensorflow-rocm:latest PATH/TO/DOCKERFILE
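Once the image is built and the container is running, a quick sanity check can confirm that TensorFlow actually sees the GPU (a minimal sketch, assuming the rocm/tensorflow image ships TensorFlow 2.x; an empty GPU list usually means /dev/kfd or /dev/dri was not passed through, or your user lacks the video group):

```python
# Run inside the container to verify TensorFlow can enumerate the ROCm GPU.
try:
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    msg = f"TensorFlow {tf.__version__} sees {len(gpus)} GPU(s)"
except ImportError:
    msg = "TensorFlow not installed - run this inside the container"
print(msg)
```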
Originally posted by szymon_g:
RIP HSA. Shame; on paper it looked great.
The HSA/ROCm stack was extended to include dGPUs, which did not have demand-paged memory at the time and could not assume IOMMUv2 support, since other CPU vendors do not expose ATS/PRI functionality on their PCIe lanes. We recently started using the dGPU paths on APUs as well, in order to make it easier for APUs to serve as development vehicles for dGPUs. I believe Renoir was the first to use the dGPU paths, but I need to check.