PoCL-Remote Allows OpenCL To Be Transparently Used Across Networked Systems

    Phoronix: PoCL-Remote Allows OpenCL To Be Transparently Used Across Networked Systems

    PoCL began as an open-source project providing a CPU-based OpenCL implementation and over the years has added support for various LLVM back-ends, such as targeting AMD HSA, Intel Level Zero, and NVIDIA CUDA/PTX. The latest back-end, merged ahead of Portable Computing Language 5.0, is a remote back-end that allows OpenCL code to be transparently utilized on networked systems for distributed computing...
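
    For anyone wondering what "transparently" means in practice: in principle a stock OpenCL host program needs no source changes, because the remote driver sits behind the standard platform/device model. A minimal enumeration sketch, assuming a PoCL build with the remote driver and a server already running on the other machine (the environment variables in the comment follow the PoCL remote documentation but may differ by version):

    /* enumerate.c - plain OpenCL device enumeration; nothing PoCL-specific.
     * Build: gcc enumerate.c -lOpenCL -o enumerate
     * Hedged usage sketch (exact variables and port may differ by PoCL version):
     *   POCL_DEVICES=remote POCL_REMOTE0_PARAMETERS=server.example.com:10998/0 ./enumerate
     */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);

        for (cl_uint p = 0; p < nplat; p++) {
            char pname[256];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof pname, pname, NULL);

            cl_device_id devices[16];
            cl_uint ndev = 0;
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &ndev) != CL_SUCCESS)
                continue;

            for (cl_uint d = 0; d < ndev; d++) {
                char dname[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof dname, dname, NULL);
                /* A device served over the network is listed here like any local one. */
                printf("%s: %s\n", pname, dname);
            }
        }
        return 0;
    }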


  • #2
    Does this work for any OpenCL application or does it have to be intentionally implemented?

    Comment


    • #3
      Very impressive. I especially like that this implementation is network-agnostic. rCUDA has been around since 2011 but still wants to see InfiniBand underneath...

      Comment


      • #4
        Originally posted by schmidtbag View Post
        Does this work for any OpenCL application or does it have to be intentionally implemented?
        Supposed to work for any OpenCL program.
        Michael Larabel
        https://www.michaellarabel.com/
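
        A sketch of why no per-application changes should be needed: the host code below is ordinary OpenCL with nothing network-specific in it, so whichever driver answers the platform query (local CPU, GPU, or a remote server) runs it unchanged. Error checking omitted for brevity:

        /* vecadd.c - minimal, unmodified OpenCL host program.
         * Build: gcc vecadd.c -lOpenCL -o vecadd
         */
        #include <stdio.h>
        #include <CL/cl.h>

        static const char *src =
            "__kernel void add(__global const float *a, __global const float *b,\n"
            "                  __global float *c) {\n"
            "    size_t i = get_global_id(0);\n"
            "    c[i] = a[i] + b[i];\n"
            "}\n";

        int main(void) {
            float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
            cl_platform_id plat; clGetPlatformIDs(1, &plat, NULL);
            cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);
            cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

            /* Buffer contents are shipped to wherever the device lives. */
            cl_mem A = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
            cl_mem B = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
            cl_mem C = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

            cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
            clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
            cl_kernel k = clCreateKernel(prog, "add", NULL);
            clSetKernelArg(k, 0, sizeof A, &A);
            clSetKernelArg(k, 1, sizeof B, &B);
            clSetKernelArg(k, 2, sizeof C, &C);

            size_t n = 4;
            clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(q, C, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
            printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
            return 0;
        }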

        Comment


        • #5
          Nice, I was looking for something like this back in the day when writing my thesis.

          In some ways it could be utilized for digital preservation of performance-intensive software: run an old OS in an unaccelerated (i.e. QEMU) instance and offload the intensive parts outside the emulated boundary onto the host.

          Comment


          • #6
            Now I’m wondering if this can be used for distributed generative AI locally, like local llama or stable diffusion

            Comment


            • #7
              Originally posted by pegasus View Post
              rCUDA has been around since 2011 but still wants to see InfiniBand underneath...
              That’s a really good deal for the company that makes InfiniBand adapters. I wonder who they are…

              Comment


              • #8
                Throwing arbitrary kernels at transparent distributed systems doesn't sound like the best idea.

                I can't see that performing well. Cool concept, but when you want to leverage such systems you really need the systems and kernels designed accordingly, with resources managed intelligently.

                Maybe it will be usable for really big bulk batches, but overall that's a recipe for poor hardware utilization.

                Improper granularity in distributing work can cause precisely the same kind of performance regression you get when you slice SMT work too fine: adding more threads merely increases the synchronization bottleneck, so you actually lose performance with every added thread...

                "Across network systems" sounds like a LOT of latency, even if you are at 100 gbit throughput, you still pay a number of penalties while traversing PHYs, buffers and encoders / decoders. And GPU vendors still struggle to properly schedule work at "same chip level".

                You really need a "you do these specific things and you do those" workload graph, with latency and bandwidth requirements and constraints, a "distributed jobs markup language" of sorts to give the available node scheduler the right hints.


                Comment


                • #9
                  Originally posted by Eirikr1848 View Post
                  Now I’m wondering if this can be used for distributed generative AI locally, like local llama or stable diffusion
                  I'm not sure OpenCL is the best API for dataflow parallelism, which is how you'd want to scale single-inference performance across multiple nodes. You'd really rather set up your dataflow graph and just let it run, instead of having a host node try to orchestrate everything. How much difference it makes in practice... hard to say.

                  BTW, I'm surprised it's apparently not using RDMA.

                  Comment


                  • #10
                    Originally posted by ddriver View Post
                    Maybe it will be usable for really big bulk batches, but overall that's a recipe for poor hardware utilization.
                    I could see it performing well with kernels that work exclusively on local data and each take a comparatively long time. In such cases, it would be an easy path to utilize distributed computation, without having to rewrite your code.

                    I don't have examples in mind, but I'm sure some exist. Maybe they built it for precisely such a workload.
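
                    As a hypothetical illustration of such a workload (my example, not anything PoCL ships): an embarrassingly parallel kernel where each work-item moves a handful of bytes but burns millions of iterations of compute, so the transfer cost amortizes well even over a network:

                    /* Hypothetical compute-heavy OpenCL kernel: each work-item reads one
                     * float, iterates the logistic map a few million times, and writes one
                     * float back. Bytes moved per item are tiny relative to the compute,
                     * so shipping the buffers across the network is a small fraction of
                     * total runtime.
                     */
                    __kernel void logistic_map(__global const float *x0,
                                               __global float *out,
                                               const unsigned int iters)
                    {
                        size_t i = get_global_id(0);
                        float x = x0[i];
                        for (unsigned int n = 0; n < iters; n++)
                            x = 3.9f * x * (1.0f - x);   /* chaotic regime; pure ALU work */
                        out[i] = x;
                    }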

                    Comment
