AMD Unified Inference Frontend 1.1 Released

  • AMD Unified Inference Frontend 1.1 Released

    Phoronix: AMD Unified Inference Frontend 1.1 Released

    AMD in February quietly released version 1.1 of their in-development Unified Inference Front-end "UIF" that aims to be their catch-all solution for AI inference from CPUs to GPUs to FPGAs and other IP from their recent Xilinx acquisition...

  • #2
    Hope they add Versal AI cores to desktop CPUs or GPUs some time soon.
    Really looking forward to the 7040 Series Phoenix APUs, since they include Versal-derived AI cores.

      • #4
        There's just one sad thing about this. We've had a technology for over a decade now that does this unification: it's called OpenCL. It's a shame that such a data-parallel workload got fragmented into 10 different APIs and technologies only to be unified again under a Python umbrella.

        • #5
          Originally posted by Meteorhead View Post
          There's just one sad thing about this. We've had a technology for over a decade now that does this unification: it's called OpenCL. It's a shame that such a data-parallel workload got fragmented into 10 different APIs and technologies only to be unified again under a Python umbrella.
          CUDA might be the culprit here.

          • #6
            How far away can we be from replacing these monolithic / vertically integrated stacks and ecosystems with something more granular?
            At the bottom end the CPU/GPU implements registers and instructions, various cache / memory access capabilities, and then various levels of ALU / core / SIMD / thread / thread group / work group / ... parallelism, for the most part.
            At the next level up there's the overall resource scheduling and the loading of data and code from the host CPU / memory.
            Then there's usually a bit of memory shared between host and GPU, or within the execution groups inside the GPU.

            A lot of that GENERAL sort of stuff already has to be considered even for CPUs these days, with NUMA, threading, multi-core and multi-chiplet designs, localized caches, backplane-connected multiprocessor systems, etc., to say nothing of MPI / distributed computing.

            So obviously LLVM and general compilers already deal with ISA / architecture level optimization: caches, instruction ordering, instruction translation / lowering from higher-level generic IR / pseudocode inputs, resource contention for registers / memory / cache / ALUs, etc.
            Vectorization optimizers handle SIMD generation where possible, along with loop unrolling, loop-invariant code motion, and all that.
            Then layers of scheduling deal with thread / process scheduling and even job scheduling across distributed systems, message passing, MPI, etc.
            There are even performance analyzers and run-time / post-run profiling and optimization to improve and adapt efficiency to complex workloads and architectures.

            And more and more languages, ecosystems, and development models are adopting concurrency models independent of the actual machine-level implementation of cores / threads, e.g. goroutines, CSP, XMOS, Erlang, event-driven / message-passing systems, shared-nothing designs, etc.

            So CUDA and PyTorch and TensorFlow and whatever are great, and so are C++ or DPC++ or OpenMP. But surely, with the right tools, separation of concerns, and intermediate layers of optimizers / translators, we could stop caring at a high level about which CPU / GPU vendor we happen to be targeting at run time; open compiler / optimizer / execution-environment runtime tools would just make your code work fairly efficiently on whatever ISA / architecture / CPU / GPU you may have, with a minimal amount of vendor-specific "lock in" of the entire toolchain / compiler / optimizer / scheduler / runtime / driver / ... etc.

            So whether you start with OpenCL or OpenMP or DPC++ or Rust or Go or C++ or whatever, you'd want it to build / run efficiently once you've indicated, manually or by automatic analysis, where the parallelism / data flow should be focused across the various RAM / IO / distributed resources in your mesh of computing and storage resources.
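
            As a rough illustration of that kind of vendor-agnostic layering (nothing UIF-specific, just a minimal sketch): with OpenMP target offload in plain C, the pragma states the parallelism and data movement once, and the compiler / runtime decide whether it runs on an attached GPU or falls back to the host CPU. The exact offload flags and device support vary by toolchain, so treat this as a sketch rather than a recipe.

            Code:
            // saxpy.c -- illustrative only: plain C + OpenMP target offload.
            // The same source can be mapped to a GPU or run on the host CPU,
            // depending on compiler flags and which devices are present.
            #include <stdio.h>
            #include <stdlib.h>

            static void saxpy(float a, const float *x, float *y, int n)
            {
                // "target" offloads to an accelerator if one is available,
                // otherwise this region falls back to the host.
                // "map" describes data movement between host and device.
                #pragma omp target teams distribute parallel for \
                        map(to: x[0:n]) map(tofrom: y[0:n])
                for (int i = 0; i < n; i++)
                    y[i] = a * x[i] + y[i];
            }

            int main(void)
            {
                enum { N = 1 << 20 };
                float *x = malloc(N * sizeof *x);
                float *y = malloc(N * sizeof *y);
                for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

                saxpy(3.0f, x, y, N);

                printf("y[0] = %f\n", y[0]);   /* expect 5.000000 */
                free(x);
                free(y);
                return 0;
            }

            Something like gcc -fopenmp -foffload=... or clang -fopenmp -fopenmp-targets=... picks the offload target, but the specific flags depend on how the toolchain was built.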

            • #7
              Originally posted by Meteorhead View Post
              There's just one sad thing about this. We've had a technology for over a decade now that does this unification: it's called OpenCL. It's a shame that such a data-parallel workload got fragmented into 10 different APIs and technologies only to be unified again under a Python umbrella.
              I blame the big software vendors, but Google most of all. After Apple withdrew support for OpenCL, it was only Google that had the clout to make it stick. Instead, they chose to push RenderScript as their preferred GPU compute solution for Android and literally banned Android devices from shipping with OpenCL support installed. Microsoft was always going to go its own way, with things like DirectCompute, C++AMP, etc. So, it really came down to Google... and they failed us.

              Among vendors, Intel is the lone OpenCL holdout. For that, my next GPU will be Intel. But, the damage was already done. Without a vibrant community, OpenCL got off track and fell too far behind CUDA.

              AMD got distracted by HSA, which went nowhere, and then threw in the towel and made its CUDA clone, HIP. That left a window of at least 5 years, crucial ones for deep learning, in which CUDA was basically the only viable API. Between that and Nvidia's support for educational institutions & deep learning researchers, they easily sewed up that market.

              Originally posted by aviallon View Post
              CUDA might be the culprit here.
              Eh, I don't blame Nvidia as much as Google. Vendors -- especially dominant ones -- are always inclined to push their own proprietary APIs and frameworks. In many ways, CUDA did pioneer GPU compute programmability, not entirely unlike how Mantle prototyped a new GPU rendering API. I think OpenCL generally benefited from that.

              It's only the biggest software vendors and platform companies who have the clout to push standards on hardware vendors (e.g. Vulkan).
              Last edited by coder; 07 March 2023, 03:07 AM.

              • #8
                Originally posted by pong View Post
                A lot of that GENERAL sort of stuff already has to be considered even for CPUs these days, with NUMA, threading, multi-core and multi-chiplet designs, localized caches, backplane-connected multiprocessor systems, etc., to say nothing of MPI / distributed computing.
                I like how OpenCL exposes memory hierarchies and workgroup parallelism. We need more of those ideas to permeate userspace APIs and OS schedulers.
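
                To make that concrete, here's a minimal OpenCL C kernel sketch of those two ideas: __local (per-work-group) memory and barrier-synchronized work-items. The kernel name and arguments are made up for illustration, the host-side setup (clCreateBuffer, clSetKernelArg, clEnqueueNDRangeKernel, ...) is omitted, and it assumes a power-of-two work-group size.

                Code:
                // reduce.cl -- illustrative kernel only; host code omitted.
                // Each work-group reduces its slice of `in` through fast
                // on-chip __local memory, then writes one partial sum to `out`.
                __kernel void partial_sum(__global const float *in,
                                          __global float *out,
                                          __local float *scratch,   // per-work-group scratch
                                          const unsigned int n)
                {
                    size_t gid = get_global_id(0);    // position in the global NDRange
                    size_t lid = get_local_id(0);     // position within this work-group
                    size_t lsz = get_local_size(0);   // work-group size (assumed power of two)

                    // Stage one element per work-item into local memory.
                    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
                    barrier(CLK_LOCAL_MEM_FENCE);     // everyone sees the staged data

                    // Tree reduction inside the work-group, entirely in local memory.
                    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
                        if (lid < stride)
                            scratch[lid] += scratch[lid + stride];
                        barrier(CLK_LOCAL_MEM_FENCE);
                    }

                    // One result per work-group goes back out to global memory.
                    if (lid == 0)
                        out[get_group_id(0)] = scratch[0];
                }

                The __global / __local split and the explicit barriers are exactly the memory-hierarchy and work-group concepts in question; the higher-level frameworks mostly end up generating this sort of thing for you.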

                Originally posted by pong View Post
                Then layers of scheduling deal with thread / process scheduling and even job scheduling across distributed systems, message passing, MPI, etc.
                Not well. The threading APIs we have are ancient, and OS kernels are organized around them. This area has been slow to adapt, because you need simultaneous changes both in the threading APIs and at the OS level. I think Apple has done some interesting things here.
