The high-level story is that most work in this space is done using TensorFlow or PyTorch, so if you can route that code to your NPU (or GPU) you're golden.
Apple began M1 life with none of this in place. Since then they have routed much of TensorFlow and PyTorch to their NPU or GPU, but it's an ongoing, large project, especially for PyTorch. On the other hand, as described here: https://www.semianalysis.com/p/nvidi...itritonpytorch
there are interesting developments in the world of PyTorch that are drastically reducing the size of this burden.
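As I read it, the development in question is the PyTorch 2.x compile path (Dynamo/Inductor emitting Triton kernels): vendors plug in a compiler backend rather than hand-porting every operator. A minimal sketch of what that looks like from user code (the model and shapes here are just placeholders I made up):

```python
import torch
import torch.nn as nn

# Stand-in model; any nn.Module is handled the same way.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# torch.compile (PyTorch 2.x) captures the model's ops and hands them to a
# backend (Inductor by default), which can generate kernels for the target
# hardware instead of relying on a pre-written kernel per primitive.
compiled = torch.compile(model)

x = torch.randn(64, 512)
out = compiled(x)  # first call triggers compilation; later calls reuse it
print(out.shape)   # torch.Size([64, 10])
```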
So it's not really a question of compiling something like C++ code in the one case and not in the other. It's more a question of: did you write your code using TensorFlow or PyTorch? And if so, hopefully you mostly use the primitives that are already routed to HW and not the primitives that still run on the CPU...
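On the Apple side you can see exactly this split in PyTorch's MPS backend (which targets the GPU): supported primitives run on Metal kernels, unsupported ones either error out or, if you opt in, fall back to the CPU. A rough sketch, assuming a recent PyTorch build with MPS support:

```python
import os
# Opt in to CPU fallback for ops the MPS backend doesn't cover yet;
# must be set before torch is imported.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

# Route work to Apple's GPU when the MPS backend is available.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(1024, 1024, device=device)
y = x @ x          # runs on the GPU when the matmul primitive is routed there
print(y.device)    # mps:0 on an Apple Silicon Mac, cpu elsewhere
```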
It's easy to make Intel look good if you find something written for OpenVINO, but that's no different from making Apple look good by finding code that targets their APIs. As far as I can tell, *realistically* (as opposed to dick-measuring) the situation is as I've described: everything interesting and rapidly moving (e.g. the new diffusion art stuff) begins in PyTorch or TensorFlow, and what you care about is how well those map to your hardware.
A better version of your complaint would be about Apple AMX. There it WOULD be ideal if code were just compiled straight to AMX rather than having to call Apple APIs. In my discussion of AMX in my PDFs I give a long justification for why Apple is doing it this way (for now...), along with many tens of pages showing just how much AMX has already changed since the first release, which clarifies to some extent why Apple doesn't want to slow down the design changes to cope with backward compatibility, not yet anyway.