Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids

  • #21
    Originally posted by brucethemoose

    Apples and oranges. The Apple NPU can keep up with a small Nvidia GPU (or a large Apple GPU) using a fraction of the task energy... it's actually quite remarkable, and not even in the same ballpark as AVX512-VNNI.

    VNNI, on the other hand, will "just work" with zero code changes on stuff that already runs on the CPU, I think, while the changes required for the NPU are more drastic even at a high level. And there aren't weird quirks (like the NPU RNG being funky, or it not liking low precision in specific scenarios), nor is there any need to partition models and shuffle stuff around.
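For concreteness, the kind of work AVX512-VNNI accelerates is the fused int8 multiply-accumulate at the heart of quantized CPU inference. A rough NumPy sketch of what a single VPDPBUSD step computes (illustrative arithmetic only, not actual intrinsics):

```python
import numpy as np

# Sketch of one AVX512-VNNI VPDPBUSD step: multiply 4 unsigned 8-bit
# values by 4 signed 8-bit values and accumulate the sum into a 32-bit
# accumulator -- the inner loop of int8 inference on the CPU.
a = np.array([10, 20, 30, 40], dtype=np.uint8)   # activations (u8)
w = np.array([-1, 2, -3, 4], dtype=np.int8)      # weights (s8)
acc = np.int32(100)                              # running 32-bit accumulator
acc += np.sum(a.astype(np.int32) * w.astype(np.int32))
print(acc)  # 100 + (-10 + 40 - 90 + 160) = 200
```

Because the hardware fuses this into one instruction per 64 byte-pairs, existing int8 CPU kernels pick it up with no source changes, which is the "just works" point above.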

    The NPU thing is all over the place.
    The high-level story is that most work in this space is done using TensorFlow or PyTorch, so if you can route that code to your NPU (or GPU) you're golden.
    Apple began M1 life with none of this in place. Since then they have routed much of TensorFlow and PyTorch to their NPU or GPU, but it's an ongoing, large project, especially for PyTorch. On the other hand, as described here:
    there are interesting developments in the world of PyTorch that are drastically reducing the size of this burden.

    So it's not really a question of compiling something like C++ code in the one case and not in the other. It's more a question of: did you write your code using TensorFlow or PyTorch? And if so, hopefully you mostly use the primitives that are already routed to hardware and not the primitives that still run on the CPU...
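To illustrate the routing point: in PyTorch, the usual mechanism is device selection, e.g. the MPS backend on Apple silicon. A minimal sketch (the device names are real PyTorch backends; the tiny model is just a placeholder):

```python
import torch

# Pick the best available backend, falling back to the CPU. On Apple
# silicon, "mps" routes supported ops to Metal (GPU); unsupported
# primitives still fall back to the CPU -- which is exactly the
# "did your primitives get routed?" question above.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(16, 4).to(device)  # placeholder model
x = torch.randn(8, 16, device=device)
y = model(x)
print(y.shape)  # torch.Size([8, 4]), wherever it ran
```

The same user code runs on whatever backend is available; the open question for any given model is how many of its ops the backend actually covers.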

    It's easy to make Intel look good if you find something written for OpenVINO, but that's no different from making Apple look good by finding code that targets their APIs. As far as I can tell, *realistically* (as opposed to dick-measuring), the situation is as I've described: everything interesting and rapidly moving (e.g. the new Diffusion art stuff) begins in PyTorch or TensorFlow, and what you care about is how well those map to your hardware.

    A better version of your complaint would be for Apple AMX. There it WOULD be ideal if code were just compiled straight to AMX rather than having to call Apple APIs. In my discussion of AMX in my PDFs I give a long justification for why Apple are doing it this way (for now...) along with many tens of pages showing just how much AMX has already changed since the first release, which clarifies to some extent why Apple doesn't want to have to slow down the design changes to cope with backward compatibility, not yet anyway.