Rusticl OpenCL Driver Nearing Cross-Vendor Shared Virtual Memory Support

  • coder
    Senior Member
    • Nov 2014
    • 8924

    #21
    Originally posted by ultimA View Post
    what ends up happening is that there will be a lot of round-trips of data back and forth, or trips of small chunks of data instead of more optimized large batches, because the programmer treats the memories as if they were unified, but they are not.
    So, your solution is only to give the programmer a screwdriver, because if you give them a more powerful tool, you're afraid they'll misuse it? That's almost always a bad argument.

    Originally posted by ultimA View Post
    Without SVM, the programmer is forced to think about transfer sizes, shoving data between algorithmic stages at the right times, overlapping data transfers with processing and so on.
    Someone who doesn't know when SVM is appropriate and how best to use it shouldn't be programming GPUs. Their programs will suck, but such a naive programmer's code is probably already rife with performance pitfalls, like overuse of synchronization and atomics.

    Clearly, the right answer is that unsophisticated programmers should stick to higher-level languages & frameworks, while those of us with a clue should have the tools we need, so that we don't have to kludge up ugly and inferior workarounds.
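
    To make the contrast concrete, here's a rough host-side sketch of the two styles in plain OpenCL 2.0 (ctx, queue, stage_kernel, host_data, N, global and err are placeholders I'm assuming were already set up, error handling omitted): explicit buffers, where the programmer schedules the transfer and can overlap it with other work via events, versus coarse-grained SVM, where one allocation is mapped and unmapped around the kernel launch.

        #include <CL/cl.h>
        #include <string.h>

        /* Style 1: explicit buffer, the programmer decides what moves and when. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err);
        cl_event upload;
        /* Non-blocking write, so the copy can overlap other queued work. */
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, N * sizeof(float), host_data, 0, NULL, &upload);
        clSetKernelArg(stage_kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, stage_kernel, 1, NULL, &global, NULL, 1, &upload, NULL);

        /* Style 2: coarse-grained SVM, one pointer visible to host and device. */
        float *svm = clSVMAlloc(ctx, CL_MEM_READ_WRITE, N * sizeof(float), 0);
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, svm, N * sizeof(float), 0, NULL, NULL);
        memcpy(svm, host_data, N * sizeof(float));    /* host writes through the same pointer */
        clEnqueueSVMUnmap(queue, svm, 0, NULL, NULL);
        clSetKernelArgSVMPointer(stage_kernel, 0, svm);
        clEnqueueNDRangeKernel(queue, stage_kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    Whether the second version performs well depends entirely on what the runtime does underneath that map/unmap, which is exactly what this argument is about.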

    Originally posted by ultimA View Post
    The absence of a shared memory space basically forces you to think about these problems (but ofc you can still choose sub-optimal solutions), whereas SVM lets the programmer get away without thinking at all. And if you are targeting efficient and fast code, not just code offloaded to the GPU, then you'll need to come up with similarly complex schemes with SVM as without it, save for a few cases.
    Here's something for you to think about: why did Nvidia go to the trouble and incur the overhead of making NVLink cache-coherent?
    Last edited by coder; 05 January 2025, 06:05 PM.


    • ultimA
      Senior Member
      • Jul 2011
      • 292

      #22
      Originally posted by coder View Post
      So, your solution is only to give the programmer a screwdriver, because if you give them a more powerful tool, you're afraid they'll misuse it? That's almost always a bad argument.
      It's not a "solution". I merely stated that devs who want the right performance generally prefer to manage memory transfers themselves, hence will stay away from making use of SVM.

      Originally posted by coder View Post
      Someone who doesn't know when SVM is appropriate and how best to use it shouldn't be programming GPUs.
      Nice theory, but "should"s and "shouldn't"s often don't hold in practice. If only people who know what they are doing and how to use their tools developed software, we wouldn't have shitty software in the world.

      Originally posted by coder View Post
      Here's something for you to think about: why did Nvidia go to the trouble and incur the overhead of making NVLink cache-coherent?
      You've got it completely backwards. Nvidia wanted to provide a unified memory model for multiple GPUs, because they correctly recognized that ease of development would help their high-performance solutions spread. To make multiple GPUs share memory efficiently, they needed an interconnect with more bandwidth and lower latency than PCIe; that became NVLink. Increasing the available cross-GPU bandwidth is not NVLink's sole purpose: that bandwidth is not just a goal in itself but also a means to other ends.
      Last edited by ultimA; 05 January 2025, 09:27 PM.


      • ultimA
        Senior Member
        • Jul 2011
        • 292

        #23
        Originally posted by cb88 View Post

        Except that is false... the performance cost is due to the LACK of this on most SDKs thus requiring GPU->SYSTEM->GPU for any transfers.
        Sorry, but you are wrong on multiple levels.

        First, you don't need a shared virtual address space between devices to enable direct GPU-to-GPU transfers without CPU intervention. That feature is called P2P copy (or P2P transfer) and has been supported by both AMD and NVIDIA in their runtimes, and for AMD even in OpenCL (I'm not sure about Intel). As for Rusticl's attempt to make this cross-vendor, its utility is very limited. They cannot make it work with NVIDIA, since NVIDIA has deliberately locked this feature behind CUDA, which is why Rusticl points out its implementation is only for Intel and AMD. And trying to build a high-performance multi-GPU system out of Intel iGPUs (or even discrete cards)... well, let's just say there's a reason nobody has bothered to implement this feature yet.

        Second, the performance issue I described wasn't even about whether the data flows through the CPU or not. Even if it flows directly GPU-to-GPU, it will be much slower if the cards/drivers have to figure out what to move, when and where, doing multiple needless copies in small chunks, than if the developer thinks it through and optimizes for it, which is absolutely necessary for acceptable performance in any non-trivial GPU-compute application.
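
        For reference, the host-API side of a cross-device copy looks the same either way. A hedged sketch with two GPUs in one OpenCL context (dev_a, dev_b, produce_kernel, buf_a, buf_b, nbytes, global and err are placeholders), where it is the driver, not this code, that decides whether the bytes move GPU-to-GPU directly or bounce through system RAM:

            #include <CL/cl.h>

            /* One queue per device, both devices in the same context. */
            cl_command_queue q_a = clCreateCommandQueueWithProperties(ctx, dev_a, NULL, &err);
            cl_command_queue q_b = clCreateCommandQueueWithProperties(ctx, dev_b, NULL, &err);

            /* Device A produces data into buf_a. */
            cl_event produced;
            clEnqueueNDRangeKernel(q_a, produce_kernel, 1, NULL, &global, NULL, 0, NULL, &produced);

            /* Hand the result to device B. Whether this becomes a direct P2P DMA or a
               GPU -> system RAM -> GPU bounce is the implementation's call, not this code's. */
            clEnqueueCopyBuffer(q_b, buf_a, buf_b, 0, 0, nbytes, 1, &produced, NULL);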
        Last edited by ultimA; 05 January 2025, 09:37 PM.


        • ultimA
          Senior Member
          • Jul 2011
          • 292

          #24
          Originally posted by Svyatko View Post

          But in the case of shared memory (APU - iGPU) the memory is ... shared?
          Not so simple. If it's a system with a CPU and an iGPU, then there's no reason to do cross-vendor SVM in the first place, so all this is for naught.
          On the other hand, if it's a system with an iGPU and a discrete GPU, then the arguments from my previous post apply - the two GPUs still need to exchange data, but manual memory management (instead of SVM) will be even more important as the iGPU is a lot slower to access "its" memory.


          • coder
            Senior Member
            • Nov 2014
            • 8924

            #25
            Originally posted by ultimA View Post
            You've got it completely backwards. Nvidia wanted to provide a unified memory model for multiple GPUs,
            Making it cache-coherent is not a natural consequence of having a unified address space. You basically just dodged the question. They went out of their way to make it cache-coherent, which cannot be explained without acknowledging that they intend programmers to use it like shared memory.

            I'm not going to bother responding to your other points, because they're just doubling down on your prior position without adding any new insights or buttressing your claims with citations.


            • coder
              Senior Member
              • Nov 2014
              • 8924

              #26
              Originally posted by ultimA View Post
              They cannot make it work with NVIDIA, since NVIDIA has deliberately locked this feature behind CUDA, which is why Rusticl points out its implementation is only for Intel and AMD.
              I recall reading about someone getting SVM working on Nvidia GPUs without going through CUDA, but I don't remember the details. I wouldn't presume it's impossible for Rusticl to do; perhaps it's just a lower priority, or takes more effort due to the state of their Mesa driver.

              Originally posted by ultimA View Post
              Trying to build a high-performance multi-GPU system out of Intel iGPUs (or even discrete cards)... well, let's just say there's a reason nobody has bothered to implement this feature yet.
              Intel supports multi-GPU Ponte Vecchio configurations. Indeed, this is what they have in the Aurora supercomputer.


              • coder
                Senior Member
                • Nov 2014
                • 8924

                #27
                Originally posted by ultimA View Post
                Not so simple. If it's a system with a CPU and an iGPU, then there's no reason to do cross-vendor SVM in the first place,
                SVM abso-fucking-lutely makes sense, when you're sharing data between a CPU and iGPU!! The fact that Rusticl happens to enable cross-vendor sharing is a happy side-effect of implementing support for different vendors' GPUs within the same userspace stack.
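
                To illustrate the CPU+iGPU case, here's a rough sketch assuming a device that reports fine-grained buffer SVM (dev, ctx, queue, kernel, N and global are placeholders): the host writes through the pointer, the kernel reads it, and there is no staging copy anywhere.

                    #include <CL/cl.h>

                    cl_device_svm_capabilities caps;
                    clGetDeviceInfo(dev, CL_DEVICE_SVM_CAPABILITIES, sizeof(caps), &caps, NULL);

                    if (caps & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) {
                        float *shared = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                                   N * sizeof(float), 0);
                        for (size_t i = 0; i < N; i++)    /* host fills the buffer directly, no upload step */
                            shared[i] = (float)i;
                        clSetKernelArgSVMPointer(kernel, 0, shared);
                        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
                        clFinish(queue);                  /* results are visible through the same pointer */
                        clSVMFree(ctx, shared);
                    }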


                • zboszor
                  Senior Member
                  • May 2016
                  • 188

                  #28
                  Anyway, regardless of whether SVM is slow or not, this functionality (either via SVM or via cl_ext_buffer_device_address, BDA for short) is needed for chipStar and AdaptiveCpp.

                  Also, Intel's unified shared memory (USM) is needed to support Intel's stuff, like oneDNN and OpenVINO.
                  According to the MR comment, the pieces to support SVM/BDA will also allow implementing USM. See https://gitlab.freedesktop.org/mesa/...5#note_2464776
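
                  For anyone curious what USM looks like from the host side, here's a hedged sketch against the documented cl_intel_unified_shared_memory extension (the function-pointer typedefs are simplified by me, and platform, ctx, dev, queue, kernel, N, global and err are placeholders). The entry points have to be fetched by name, since they are extension functions:

                      #include <CL/cl.h>

                      /* Simplified pointer types matching the documented extension signatures. */
                      typedef void * (CL_API_CALL *usm_shared_alloc_fn)(cl_context, cl_device_id,
                              const void * /* properties */, size_t, cl_uint, cl_int *);
                      typedef cl_int (CL_API_CALL *usm_set_arg_fn)(cl_kernel, cl_uint, const void *);
                      typedef cl_int (CL_API_CALL *usm_free_fn)(cl_context, void *);

                      usm_shared_alloc_fn usm_alloc = (usm_shared_alloc_fn)
                          clGetExtensionFunctionAddressForPlatform(platform, "clSharedMemAllocINTEL");
                      usm_set_arg_fn usm_set_arg = (usm_set_arg_fn)
                          clGetExtensionFunctionAddressForPlatform(platform, "clSetKernelArgMemPointerINTEL");
                      usm_free_fn usm_free = (usm_free_fn)
                          clGetExtensionFunctionAddressForPlatform(platform, "clMemFreeINTEL");

                      /* A "shared" USM allocation is one pointer usable from both host and device. */
                      float *p = usm_alloc(ctx, dev, NULL, N * sizeof(float), 0, &err);
                      p[0] = 1.0f;                        /* host writes directly */
                      usm_set_arg(kernel, 0, p);
                      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
                      clFinish(queue);
                      usm_free(ctx, p);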


                  • Jabberwocky
                    Senior Member
                    • Aug 2011
                    • 1205

                    #29
                    Originally posted by ultimA View Post
                    It makes developing with OpenCL easier, but actively using it comes at a great performance cost. So demand isn't that great, which is why other stacks prioritized other more useful features. But it makes a nice marketing headline "Hey, we are the first to implement this thing that most people do not want to use."
                    Meanwhile, everyone running multiple large models on consumer hardware relies on this to prevent their batches from crashing.

                    This even helped me figure out the limits of my cards when running monolithic models one at a time in chains. It's faster to offload to RAM than to reload the entire thing from disk.

                    Recovering from running out of VRAM usually required manual intervention at best and a system restart at worst.

                    The big difference here IMO is OpenCL vs CUDA, not SVM (Shared Virtual Memory) vs VM (Virtual Memory).


                    • ultimA
                      Senior Member
                      • Jul 2011
                      • 292

                      #30
                      Originally posted by coder View Post
                      Making it cache-coherent is not a natural consequence of having a unified address space. You basically just dodged the question. They went out of their way to make it cache-coherent, which cannot be explained without acknowledging that they intend programmers to use it like shared memory.
                      I didn't dodge the question: they did it to ease development on their platform, and they needed NVLink to also make it performant. This goes for cache coherency just as it goes for a unified address space.

                      Originally posted by coder View Post
                      SVM abso-fucking-lutely makes sense, when you're sharing data between a CPU and iGPU!!
                      Indeed it does. I said that in this case there's no reason to make it cross-vendor, not that there's no reason for SVM at all.

