PoCL-Remote Allows OpenCL To Be Transparently Used Across Networked Systems

    Phoronix: PoCL-Remote Allows OpenCL To Be Transparently Used Across Networked Systems

    PoCL began as an open-source project providing a CPU-based OpenCL implementation and over the years has added support for various LLVM back-ends, such as targeting AMD HSA, Intel Level Zero, and NVIDIA CUDA/PTX. The latest back-end, merged ahead of Portable Computing Language 5.0, is a remote back-end that allows OpenCL code to be transparently utilized on networked systems for distributed computing...
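
    For anyone wondering what "transparently" means in practice: in principle a stock OpenCL host program needs no source changes, because the remote driver sits behind the standard platform/device model. A minimal enumeration sketch, assuming a PoCL build with the remote driver and a server already running on the other machine (the environment variables in the comment follow the PoCL remote documentation but may differ by version):

    /* enumerate.c - plain OpenCL device enumeration; nothing PoCL-specific.
     * Build: gcc enumerate.c -lOpenCL -o enumerate
     * Hedged usage sketch (exact variables and port may differ by PoCL version):
     *   POCL_DEVICES=remote POCL_REMOTE0_PARAMETERS=server.example.com:10998/0 ./enumerate
     */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);

        for (cl_uint p = 0; p < nplat; p++) {
            char pname[256];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof pname, pname, NULL);

            cl_device_id devices[16];
            cl_uint ndev = 0;
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &ndev) != CL_SUCCESS)
                continue;

            for (cl_uint d = 0; d < ndev; d++) {
                char dname[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof dname, dname, NULL);
                /* A device served over the network is listed here like any local one. */
                printf("%s: %s\n", pname, dname);
            }
        }
        return 0;
    }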


  • #2
    Does this work for any OpenCL application or does it have to be intentionally implemented?

    Comment


    • #3
      Very impressive. I especially like that this implementation is network-agnostic. rCUDA has been around since 2011 but still wants to see InfiniBand underneath...

      Comment


      • #4
        Originally posted by schmidtbag View Post
        Does this work for any OpenCL application or does it have to be intentionally implemented?
        Supposed to work for any OpenCL program.
        Michael Larabel
        https://www.michaellarabel.com/
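
        A sketch of why no per-application changes should be needed: the host code below is ordinary OpenCL with nothing network-specific in it, so whichever driver answers the platform query (local CPU, GPU, or a remote server) runs it unchanged. Error checking omitted for brevity:

        /* vecadd.c - minimal, unmodified OpenCL host program.
         * Build: gcc vecadd.c -lOpenCL -o vecadd
         */
        #include <stdio.h>
        #include <CL/cl.h>

        static const char *src =
            "__kernel void add(__global const float *a, __global const float *b,\n"
            "                  __global float *c) {\n"
            "    size_t i = get_global_id(0);\n"
            "    c[i] = a[i] + b[i];\n"
            "}\n";

        int main(void) {
            float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
            cl_platform_id plat; clGetPlatformIDs(1, &plat, NULL);
            cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);
            cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

            /* Buffer contents are shipped to wherever the device lives. */
            cl_mem A = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
            cl_mem B = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
            cl_mem C = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

            cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
            clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
            cl_kernel k = clCreateKernel(prog, "add", NULL);
            clSetKernelArg(k, 0, sizeof A, &A);
            clSetKernelArg(k, 1, sizeof B, &B);
            clSetKernelArg(k, 2, sizeof C, &C);

            size_t n = 4;
            clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(q, C, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
            printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
            return 0;
        }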

        Comment


        • #5
          Nice, I was looking for something like this back in the day when writing my thesis.

          In some ways it could be utilized for digital preservation of performance-intensive software: run an old OS in an unaccelerated (i.e. QEMU) instance and offload the intensive parts outside the emulated boundary onto the host.

          Comment


          • #6
            Now I’m wondering if this can be used for distributed generative AI locally, like local llama or stable diffusion

            Comment


            • #7
              Originally posted by pegasus View Post
              rCUDA has been around since 2011 but still wants to see InfiniBand underneath...
              That’s a really good deal for the company that makes InfiniBand adapters. I wonder who they are…

              Comment


              • #8
                Throwing arbitrary kernels at transparent distributed systems doesn't sound like the best idea.

                I can't see that performing well. Cool concept, but when you want to leverage such systems you really need the systems and kernels designed accordingly, with resources managed intelligently.

                Maybe it will be usable for really big bulk batches, but overall that's a recipe for poor hardware utilization.

                Improper granularity in distributing work can cause precisely the same kind of performance regression you get when you slice SMT work too fine: adding more threads merely increases the synchronization bottleneck, so you actually lose performance with every added thread...

                "Across network systems" sounds like a LOT of latency, even if you are at 100 gbit throughput, you still pay a number of penalties while traversing PHYs, buffers and encoders / decoders. And GPU vendors still struggle to properly schedule work at "same chip level".

                You really need a "you do these specific things and you do those" workload graph, with latency and bandwidth requirements and constraints, a "distributed jobs markup language" of sorts to give the available node scheduler the right hints.


                Comment


                • #9
                  Originally posted by Eirikr1848 View Post
                  Now I’m wondering if this can be used for distributed generative AI locally, like local llama or stable diffusion
                  I'm not sure OpenCL is the best API for dataflow parallelism, which is how you'd want to scale single-inference performance across multiple nodes. You'd really rather set up your dataflow graph and just let it run, instead of having a host node try to orchestrate everything. How much difference it makes in practice... hard to say.

                  BTW, I'm surprised it's apparently not using RDMA.

                  Comment


                  • #10
                    Originally posted by ddriver View Post
                    Maybe it will be usable for really big bulk batches, but overall that's a recipe for poor hardware utilization.
                    I could see it performing well with kernels that work exclusively on local data and each take a comparatively long time. In such cases, it would be an easy path to utilize distributed computation, without having to rewrite your code.

                    I don't have examples in mind, but I'm sure some exist. Maybe they built it for precisely such a workload.
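
                    As a hypothetical illustration of such a workload (my example, not anything PoCL ships): an embarrassingly parallel kernel where each work-item moves a handful of bytes but burns millions of iterations of compute, so the transfer cost amortizes well even over a network:

                    /* Hypothetical compute-heavy OpenCL kernel: each work-item reads one
                     * float, iterates the logistic map a few million times, and writes one
                     * float back. Bytes moved per item are tiny relative to the compute,
                     * so shipping the buffers across the network is a small fraction of
                     * total runtime.
                     */
                    __kernel void logistic_map(__global const float *x0,
                                               __global float *out,
                                               const unsigned int iters)
                    {
                        size_t i = get_global_id(0);
                        float x = x0[i];
                        for (unsigned int n = 0; n < iters; n++)
                            x = 3.9f * x * (1.0f - x);   /* chaotic regime; pure ALU work */
                        out[i] = x;
                    }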

                    Comment
