AMD Job Posting Confirms More Details Around Their AI GPU Compute Stack Plans


  • Panix
    replied
Originally posted by Developer12:

It happens in LLVM, but not for good reasons. Most of them are historical "we need OpenCL and we don't want to build it" reasons, or "we would like to somehow use the same driver for Windows and Linux, and we can't ship the Linux driver on Windows" reasons.

    All those extra dialects are BAD, BAD news. It's already really bad that they all have their own forks of LLVM with random modifications to try to make it work better with GPU code. That shit is impossible to upstream, and it makes it really difficult (if not impossible) for them to move to new versions of LLVM. Further fragmentation with MLIR is not going to help the situation.

    AMD is not a shining example of good driver development. Everyone on Windows knows that already. AMD's driver and its dumb locking issues are the reason there's no sane way to tear down a DRM scheduler. AMD could pull their head out of their ass, drop their LLVM-based driver that is buggy garbage anyway, and move people to the parallel community drivers for their cards that actually work and perform well, but no, they're letting Valve do all the work. It was because AMD's LLVM-based compiler is so terrible that Valve funded the development of the ACO compiler for AMD cards. And then there's the absolute clusterfuck that is their "maybe it works, maybe it doesn't" pretend-support for most of their cards, which is the main reason AI workloads are staying the fuck away from wasting time with AMD. Look on the forums and you'll see an endless river of "I spent three days trying to get ROCm to work on my last-gen AMD card and it keeps crashing in weird ways when I make small changes" or "I used to be able to run my workload on card X with ROCm Y, but a small update to ROCm Y.1 broke everything, so I guess I'll just use the old version forever."

    AMD is not the adult in the room when it comes to GPU drivers. At this point the only way their cards will ever get traction is if AMD gets the hell out of the way and just leaves the driver development to other people. That's how it played out with gaming on Linux, and that's how it's going to play out with AI.
Agreed. AMD is so far behind in that field - their focus was on gaming, and only recently have they invested any money or time in AI/compute. My next GPU will be Nvidia - I have no choice. Practically everyone I've chatted with has said to 'buy Nvidia' for Blender and Stable Diffusion (video editing too - the recommendations are for an Nvidia GPU). So....



  • Developer12
    replied
Originally posted by oleid:

I'm not so sure about that. MLIR can quite successfully be used for Nvidia hardware.

    And people are busy implementing the remaining bits of PTX in MLIR's GPU dialect.
Development for all GPU manufacturers happens in LLVM. They even have their own dialects in MLIR to support their hardware's specific features. It would be stupid if AMD did not support MLIR, and by extension LLVM.

But I agree, it is a lot more work for distributions to package and use LLVM. At least Gentoo supports multiple parallel LLVM installations.
It happens in LLVM, but not for good reasons. Most of them are historical "we need OpenCL and we don't want to build it" reasons, or "we would like to somehow use the same driver for Windows and Linux, and we can't ship the Linux driver on Windows" reasons.

    All those extra dialects are BAD, BAD news. It's already really bad that they all have their own forks of LLVM with random modifications to try to make it work better with GPU code. That shit is impossible to upstream, and it makes it really difficult (if not impossible) for them to move to new versions of LLVM. Further fragmentation with MLIR is not going to help the situation.

    AMD is not a shining example of good driver development. Everyone on Windows knows that already. AMD's driver and its dumb locking issues are the reason there's no sane way to tear down a DRM scheduler. AMD could pull their head out of their ass, drop their LLVM-based driver that is buggy garbage anyway, and move people to the parallel community drivers for their cards that actually work and perform well, but no, they're letting Valve do all the work. It was because AMD's LLVM-based compiler is so terrible that Valve funded the development of the ACO compiler for AMD cards. And then there's the absolute clusterfuck that is their "maybe it works, maybe it doesn't" pretend-support for most of their cards, which is the main reason AI workloads are staying the fuck away from wasting time with AMD. Look on the forums and you'll see an endless river of "I spent three days trying to get ROCm to work on my last-gen AMD card and it keeps crashing in weird ways when I make small changes" or "I used to be able to run my workload on card X with ROCm Y, but a small update to ROCm Y.1 broke everything, so I guess I'll just use the old version forever."

    AMD is not the adult in the room when it comes to GPU drivers. At this point the only way their cards will ever get traction is if AMD gets the hell out of the way and just leaves the driver development to other people. That's how it played out with gaming on Linux, and that's how it's going to play out with AI.
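
    For what it's worth, the only thing that ever settles those "does ROCm version Y still work on card X" threads is a quick smoke test after every update. The following is a minimal sketch, assuming a ROCm build of PyTorch is installed (on such builds the HIP backend is exposed through the regular torch.cuda API and torch.version.hip is a version string rather than None):

    Code:
    # Minimal ROCm/PyTorch smoke test (assumes a ROCm build of torch is installed).
    import torch

    def smoke_test() -> bool:
        print("torch:", torch.__version__, "| hip:", torch.version.hip)
        if not torch.cuda.is_available():
            print("No usable GPU device found by this build.")
            return False
        print("device:", torch.cuda.get_device_name(0))

        # Tiny matmul on the GPU, checked against the CPU result, so a broken
        # runtime or miscompiling update fails loudly right here.
        a = torch.randn(256, 256)
        b = torch.randn(256, 256)
        gpu = (a.cuda() @ b.cuda()).cpu()
        cpu = a @ b
        ok = torch.allclose(gpu, cpu, atol=1e-3)
        print("matmul matches CPU:", ok)
        return ok

    if __name__ == "__main__":
        raise SystemExit(0 if smoke_test() else 1)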



  • oleid
    replied
Originally posted by Developer12:

MLIR might alleviate some of the issues, but the fundamental truth is that LLVM was just never intended or architected to compile code for GPUs. Mesa with NIR will *always* outperform it.
I'm not so sure about that. MLIR can quite successfully be used for Nvidia hardware.

    And people are busy implementing the remaining bits of PTX in MLIR's GPU dialect.
Development for all GPU manufacturers happens in LLVM. They even have their own dialects in MLIR to support their hardware's specific features. It would be stupid if AMD did not support MLIR, and by extension LLVM.

But I agree, it is a lot more work for distributions to package and use LLVM. At least Gentoo supports multiple parallel LLVM installations.



  • Developer12
    replied
Originally posted by oleid:

That was discussed here before, but I couldn't find the thread anymore. My takeaway was: NIR is better suited for graphics driver needs, while LLVM is better for compute.

    Edit: also, there is now MLIR, and apparently the projects mentioned in the article make heavy use of it. That didn't exist back when NIR was started.

    Edit 2:
    I think the main problem with using LLVM (the project) for GPUs is LLVM (the intermediate representation). MLIR is the solution here, one reason being that it is higher-level than LLVM IR.
LLVM has been "good for compute" only because you historically got OpenCL for free from it (originally targeting CPUs). With rusticl, this advantage evaporates.

    MLIR might alleviate some of the issues, but the fundamental truth is that LLVM was just never intended or architected to compile code for GPUs. Mesa with NIR will *always* outperform it.

    MLIR also does *nothing* to alleviate the massive, massive headache that is packaging LLVM to work with Mesa. They operate on totally different release schedules, LLVM doesn't play nice when loading multiple versions of itself, there's gobs and gobs of workaround code involved, and it takes FOREVER for users to actually get access to new features and fixes, let alone drop old buggy versions of LLVM. And then there's the fact that LLVM is goddamn impossible to package cleanly, so the burden makes distros hate it.
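
    On the rusticl point: it is already easy to see which implementation is actually providing OpenCL on a given system (Mesa's rusticl, AMD's ROCm/LLVM stack, or something else). A minimal sketch, assuming pyopencl is installed; note that rusticl typically has to be opted into per gallium driver via the RUSTICL_ENABLE environment variable (e.g. RUSTICL_ENABLE=radeonsi), which is an assumption about the local setup rather than anything this thread establishes:

    Code:
    # List OpenCL platforms/devices to see which implementation is providing CL.
    # Assumes pyopencl is installed; rusticl usually needs RUSTICL_ENABLE=<driver>
    # (e.g. RUSTICL_ENABLE=radeonsi) set in the environment before this runs.
    import pyopencl as cl

    for platform in cl.get_platforms():
        print(f"Platform: {platform.name} ({platform.vendor}), {platform.version}")
        for device in platform.get_devices():
            print(f"  Device: {device.name}")
            print(f"    driver version: {device.driver_version}")

    The platform name in the output should make it obvious which stack the application is actually running on.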



  • ssokolow
    replied
    *yawn* Let me know when I can feel confident that Easy Diffusion, automatic1111, Kohya_ss, and any other Stable Diffusion-related container-esque distro that comes to my attention can be trusted to Just Work with their "clone our repo and run this shell script" install flow.

The whole reason I upgraded from an Athlon II X2 270 (a 2011 CPU), paired with a brand new Cyber Monday'd RTX 3060 12GiB, to a Ryzen 5 7600 was that it required too much expertise to swap out the bundled copy of TensorFlow for one built without a requirement for AVX. I don't want to risk that again, but with "one built with support for not just AMD, but my AMD card" substituted into that sentence.



  • CochainComplex
    replied
Originally posted by Spacefish:

Nah, they just didn't enable the consumer cards by default in ROCm. They were still supported and worked perfectly fine. GCN / Polaris / Vega was supported anyway, since that was still the era when the accelerators sold for ML / compute were the same architecture as the consumer gaming cards. The Vega 64 consumer card even had HBM2 memory.

    And RDNA1 was/is supported as well. You can run PyTorch on a 5700 XT today! ML (MIOpen) support for RDNA took a little while, but it has worked fine for some years now!
Yes, it's improving. I've been complaining about this for years, but I do see some improvement, which is great.



  • oleid
    replied
Originally posted by Developer12:

    There are broadly two ways to build a GPU driver: get your shader compiler from LLVM, or use NIR. AMD chose the former while everyone else chose the latter. Yes, it means that AMD got OpenCL for free. But, LLVM is both a terrible compiler for GPUs and an absolute boat-anchor when it comes to release schedules and packaging.
That was discussed here before, but I couldn't find the thread anymore. My takeaway was: NIR is better suited for graphics driver needs, while LLVM is better for compute.

    Edit: also, there is now MLIR, and apparently the projects mentioned in the article make heavy use of it. That didn't exist back when NIR was started.

    Edit 2:
    I think the main problem with using LLVM (the project) for GPUs is LLVM (the intermediate representation). MLIR is the solution here, one reason being that it is higher-level than LLVM IR.
    Last edited by oleid; 14 October 2024, 01:37 AM.



  • Developer12
    replied
    Oh, and another thing: the sheer fact that AMD is going to LLVM for this tells you they've LEARNED ABSOLUTELY NOTHING.

    There are broadly two ways to build a GPU driver: get your shader compiler from LLVM, or use NIR. AMD chose the former while everyone else chose the latter. Yes, it means that AMD got OpenCL for free. But, LLVM is both a terrible compiler for GPUs and an absolute boat-anchor when it comes to release schedules and packaging.

With rusticl shipping OpenCL support, there's even less reason to choose LLVM than there ever was, and yet AMD is doubling down on this stupid decision.



  • Developer12
    replied
Is the GPU support going to continue to only cover a handful of GPUs, while leaving everyone else to wonder which others are buggy, broken, and a total waste of time?

Because I'm all out of patience. Even if SOME of AMD's GPUs work, it's not worth my time and risk to try to find out which ones those are, especially when the next update could break them. AMD makes good hardware, but their software stack completely negates any positivity with a steaming pile of uncertainty, labour, and risk.



  • Spacefish
    replied
Originally posted by pWe00Iri3e7Z9lHOX2Qx:

    When it comes to GPU compute, "broad coverage" is the exact opposite of what AMD has done so far.
Nah, they just didn't enable the consumer cards by default in ROCm. They were still supported and worked perfectly fine. GCN / Polaris / Vega was supported anyway, since that was still the era when the accelerators sold for ML / compute were the same architecture as the consumer gaming cards. The Vega 64 consumer card even had HBM2 memory.

And RDNA1 was/is supported as well. You can run PyTorch on a 5700 XT today! ML (MIOpen) support for RDNA took a little while, but it has worked fine for some years now!
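
    For concreteness, here is a minimal sketch of what "PyTorch on a 5700 XT" looks like in practice, assuming a ROCm build of torch. The HSA_OVERRIDE_GFX_VERSION environment variable it checks is a commonly reported workaround for RDNA1 parts (gfx1010), not something in AMD's official support matrix, so treat that detail as an assumption:

    Code:
    # Check whether a ROCm build of PyTorch can actually see and use an RDNA1 card.
    # Users commonly report needing HSA_OVERRIDE_GFX_VERSION (e.g. "10.3.0") set
    # before launch on cards like the 5700 XT -- anecdotal workaround, not official.
    import os
    import torch

    print("HSA_OVERRIDE_GFX_VERSION =", os.environ.get("HSA_OVERRIDE_GFX_VERSION"))
    print("torch", torch.__version__, "| hip", torch.version.hip)

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print("device:", props.name)
        # Run one tiny op end to end so a broken runtime fails here,
        # not hours into a training run.
        x = torch.ones(1024, device="cuda")
        print("sum on GPU:", x.sum().item())
    else:
        print("ROCm device not visible to this PyTorch build.")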

