PoCL-Remote Allows OpenCL To Be Transparently Used Across Networked Systems


  • #11
    Originally posted by ddriver
    Throwing arbitrary kernels at transparent distributed systems doesn't sound like the best idea.

    I can't see that performing well. Cool concept, but when you want to leverage such systems, you really need to have the systems and kernels designed accordingly and manage resources intelligently.

    Maybe it will be usable for really big, thorough bulk batches, but overall that's a recipe for poor hardware utilization.

    Improper granularity in work distribution can cause precisely the same types of performance regressions as slicing SMT work too finely, where adding more threads merely increases the synchronization bottleneck, so you actually lose performance for every added thread...

    "Across networked systems" sounds like a LOT of latency; even at 100 Gbit throughput you still pay a number of penalties while traversing PHYs, buffers and encoders/decoders. And GPU vendors still struggle to properly schedule work at the "same chip" level.

    You really need a "you do these specific things and you do those" workload graph, with latency and bandwidth requirements and constraints, a "distributed jobs markup language" of sorts to give the scheduler of the available nodes the right hints.
    Is there someone throwing arbitrary kernels around? Obviously distributed execution is not for any arbitrary program/kernel. Whether it makes sense to offload at all - across PCIe or across the network - boils down to the application's operations-per-data ratio as well as its latency requirements, regardless of what is actually used to perform the distribution (PoCL-R, MPI or whatever).
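    As a rough back-of-envelope illustration of that ratio argument (the numbers below are made up for this sketch, not taken from the post or the paper), offloading only pays off when round-trip latency plus transfer time plus remote compute time beats local compute time:

        /* Back-of-envelope offload check; all figures are illustrative. */
        #include <stdio.h>

        int main(void)
        {
            double bytes        = 100e6;      /* data moved per launch          */
            double flop_per_b   = 500.0;      /* operations-per-data ratio      */
            double link_Bps     = 100e9 / 8;  /* 100 Gbit/s link, in bytes/s    */
            double local_flops  = 1e12;       /* local device throughput        */
            double remote_flops = 20e12;      /* remote device throughput       */
            double rtt          = 200e-6;     /* network round-trip latency (s) */

            double work     = bytes * flop_per_b;
            double t_local  = work / local_flops;
            double t_remote = rtt + bytes / link_Bps + work / remote_flops;

            printf("local:  %.2f ms\n", t_local * 1e3);
            printf("remote: %.2f ms\n", t_remote * 1e3);
            printf("offloading %s here\n", t_remote < t_local ? "pays off" : "does not pay off");
            return 0;
        }

    With a high enough operations-per-data ratio the transfer cost amortizes; cut flop_per_b by an order of magnitude in this sketch and the verdict flips.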

    The original use case for PoCL-R was to research adaptive edge offloading of heterogeneous workloads in latency-critical scenarios (especially with modern networking techniques in mind). This is one of the selling points of 5G and beyond: being able to offload also latency-critical tasks. Whether and how widely it can be done is still an open research question which involves low-latency compression techniques among other topics.

    Another use case we study for PoCL-R is massive-scale HPC, where similar considerations apply, with the main difference that the applications are usually not latency critical. How widely it can replace MPI in that domain is one open question, and whether extending OpenCL to cover more of the use cases is worthwhile is another that can be studied in the future.

    Where things get interesting in the future is the mix: the low-latency, high-performance cases. These haven't even been imagined yet because the foundational technology and tooling haven't been available.

    BR,
    Pekka



    • #12
      Originally posted by coder
      I'm not sure OpenCL is the best API for dataflow parallelism, which is how you want to scale single-inference performance using multiple nodes. You'd really rather set up your dataflow graph and just let it go, instead of having a host node try to orchestrate everything. How much difference it makes in practice... hard to say.

      BTW, I'm surprised it's apparently not using RDMA.
      OpenCL works well for dataflow parallelism thanks to the power of command queues, events, and now even command buffering. In the truest sense, the pipes introduced in OpenCL 2.0 are a proper dataflow programming/execution model, but we have found that for most use cases they are unnecessary. Whether it works efficiently for "streaming" use cases boils down to how fast you can (re)launch the heterogeneous task graphs built out of command queues (and whether the runtime can parallelize commands from multiple CQs).
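      To make the command queue/event point concrete, here is a minimal host-side sketch - plain OpenCL, nothing PoCL-R specific, with the kernel source and names invented for illustration - of a two-stage task graph where an event forms the edge between commands enqueued in two different queues:

          /* Minimal two-stage task graph expressed with OpenCL queues and events. */
          #include <CL/cl.h>
          #include <stdio.h>

          static const char *src =
              "__kernel void stage1(__global float *d) { d[get_global_id(0)] *= 2.0f; }\n"
              "__kernel void stage2(__global float *d) { d[get_global_id(0)] += 1.0f; }\n";

          int main(void)
          {
              cl_platform_id plat; cl_device_id dev;
              clGetPlatformIDs(1, &plat, NULL);
              clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

              cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
              /* Two in-order queues; the runtime may overlap them where the events allow. */
              cl_command_queue q1 = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);
              cl_command_queue q2 = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

              cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
              clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
              cl_kernel k1 = clCreateKernel(prog, "stage1", NULL);
              cl_kernel k2 = clCreateKernel(prog, "stage2", NULL);

              float host[1024] = {0};
              cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                          sizeof(host), host, NULL);
              clSetKernelArg(k1, 0, sizeof(buf), &buf);
              clSetKernelArg(k2, 0, sizeof(buf), &buf);

              size_t gws = 1024;
              cl_event e1, e2;
              /* The event dependency is the graph edge: stage2 waits on stage1,
               * even though the two commands sit in different queues. */
              clEnqueueNDRangeKernel(q1, k1, 1, NULL, &gws, NULL, 0, NULL, &e1);
              clEnqueueNDRangeKernel(q2, k2, 1, NULL, &gws, NULL, 1, &e1, &e2);
              clEnqueueReadBuffer(q2, buf, CL_TRUE, 0, sizeof(host), host, 1, &e2, NULL);

              printf("host[0] = %f\n", host[0]); /* 1.0 for zero-initialized input */
              return 0;
          }

      The point of the transparent-remote approach is that host code like this does not have to change; which machine each device actually lives on is the runtime's concern.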
      For capturing HW accelerators we are utilizing the built-in-kernel interface of OpenCL, a powerful but underutilized concept of the API.
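      For reference, the built-in kernel path looks roughly like this on the host side; the kernel name string below is a placeholder, real names come from the device's CL_DEVICE_BUILT_IN_KERNELS query:

          /* Sketch of using a device's built-in (fixed-function) kernels. */
          #include <CL/cl.h>
          #include <stdio.h>

          int main(void)
          {
              cl_platform_id plat; cl_device_id dev;
              clGetPlatformIDs(1, &plat, NULL);
              clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

              /* Query which built-in kernels the device exposes. */
              char names[2048] = "";
              clGetDeviceInfo(dev, CL_DEVICE_BUILT_IN_KERNELS, sizeof(names), names, NULL);
              printf("built-in kernels: %s\n", names);

              cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
              /* No OpenCL C source involved: the "program" wraps hardware functionality. */
              cl_int err;
              cl_program prog = clCreateProgramWithBuiltInKernels(
                  ctx, 1, &dev, "example.accelerator.op", &err); /* placeholder name */
              if (err == CL_SUCCESS) {
                  cl_kernel k = clCreateKernel(prog, "example.accelerator.op", &err);
                  /* ...set args and enqueue exactly like any other kernel... */
                  (void)k;
              }
              return 0;
          }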

      Related to RDMA: we write on the announcement web page (TL;DR?): "In multi-server setups the effects of server to server transfers can be mitigated somewhat by building PoCL with RDMA support enabled, if RDMA is supported by the networking hardware." And also: "It is worth noting that PoCL-R will leave buffers resident on devices after use, so unchanged buffers do not need to be transferred again on next use. This means that static buffers such as neural network coefficients only need to be uploaded once during launch and afterwards inference can be performed repeatedly without this initial buffer transfer cost."

      There's more in the arXiv article. Jan, who implemented the RDMA support, might want to comment on its current status if there's something to add.
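      The buffer-residency point maps onto a very ordinary host-side pattern; here is a sketch in plain OpenCL (the function and argument names are illustrative, and the buffers/kernel are assumed to be created elsewhere) where the static weights are written once and only the per-frame input and output move each iteration:

          /* Upload static coefficients once, then stream per-frame inputs/outputs.
           * A runtime that keeps buffers resident on the (remote) device never has
           * to re-transfer the weights buffer. */
          #include <CL/cl.h>

          void run_inference_loop(cl_command_queue q, cl_kernel infer,
                                  cl_mem weights, cl_mem input, cl_mem output,
                                  const void *weight_data, size_t weight_bytes,
                                  const void *frames, void *results,
                                  size_t frame_bytes, size_t out_bytes,
                                  int num_frames, size_t global_size)
          {
              /* One-time upload of the static neural network coefficients. */
              clEnqueueWriteBuffer(q, weights, CL_TRUE, 0, weight_bytes, weight_data,
                                   0, NULL, NULL);

              clSetKernelArg(infer, 0, sizeof(weights), &weights);
              clSetKernelArg(infer, 1, sizeof(input), &input);
              clSetKernelArg(infer, 2, sizeof(output), &output);

              for (int i = 0; i < num_frames; ++i) {
                  /* Only the per-frame input and output are transferred per iteration. */
                  clEnqueueWriteBuffer(q, input, CL_FALSE, 0, frame_bytes,
                                       (const char *)frames + (size_t)i * frame_bytes,
                                       0, NULL, NULL);
                  clEnqueueNDRangeKernel(q, infer, 1, NULL, &global_size, NULL,
                                         0, NULL, NULL);
                  clEnqueueReadBuffer(q, output, CL_TRUE, 0, out_bytes,
                                      (char *)results + (size_t)i * out_bytes,
                                      0, NULL, NULL);
              }
          }

      With a runtime that keeps buffers resident on the remote device, the weights upload above crosses the network once, and each loop iteration only pays for the input and output transfers.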


      BR,
      Pekka



      • #13
        Originally posted by coder
        I'm not sure OpenCL is the best API for dataflow parallelism, which is how you want to scale single-inference performance using multiple nodes. You'd really rather set up your dataflow graph and just let it go, instead of having a host node try to orchestrate everything. How much difference it makes in practice... hard to say.

        BTW, I'm surprised it's apparently not using RDMA.
        Truth. However, unless OpenVINO supports AVX CPUs, Nvidia/AMD/Intel GPUs, "VPU-type" devices and FPGAs for workload distribution, distributed OpenCL seems to my n00b4evah brain to be the only viable solution for now.

        Heck, POCL even works on my actually ancient Pentium III Xeon quad-CPU setup with dual PCI Radeon HD 5450 2GB GDDR5 cards. OpenCL on all four CPUs and both GPUs. Sure, image gen is slow as dirt. But better than nothing.

        For adding that kind of hardware to a small lab / home-lab pool, there really doesn't seem to be another option.

        (Shout-out to ArchLinux32 team for keeping old tech alive)

