
Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids


  • #11
    Did absolutely none of those workloads succeed when forced down below AVX512_CORE? For any that could succeed, it would be fascinating to see AVX2 results, as that represents the relevant baseline experienced by consumer Intel 12th & 13th gen chips and AMD Zen < 4.

    Who cares? Well, it seems relevant to me, as small bits of various sorts of AI/ML processing are more and more likely to trickle down into consumer applications. Seeing how badly AVX2 gets beaten would speak to how significant it might be for one of the CPU vendors to start adding this stuff into consumer chips (as indeed AMD has done with AVX-512 in Zen 4).
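
    For anyone wanting to try that, here is a minimal sketch of how the cap can be forced, assuming the workload runs through oneDNN, which documents the ONEDNN_MAX_CPU_ISA environment variable (oneDNN also has a programmatic setter, if I recall correctly, but the env var is the simplest knob); other frameworks need their own switch.

    ```cpp
    // Minimal sketch: cap the ISA oneDNN may dispatch to, before the library
    // initializes. Assumes the benchmark goes through oneDNN; the variable is
    // read once, so it must be set before the first primitive is created.
    #include <cstdlib>
    #include <cstdio>

    int main(int argc, char** argv) {
        // Documented values include SSE41, AVX2, AVX512_CORE, AVX512_CORE_VNNI,
        // AVX512_CORE_BF16 and AVX512_CORE_AMX.
        setenv("ONEDNN_MAX_CPU_ISA", argc > 1 ? argv[1] : "AVX2", /*overwrite=*/1);
        printf("capped at %s\n", getenv("ONEDNN_MAX_CPU_ISA"));
        // ... load the model and run the benchmark as usual from here ...
        return 0;
    }
    ```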



    • #12
      Wonder which market AMX is targeted at...

      - These are server processors, so they're not sitting at the edge to speed up ML inference; we have iGPUs for that anyway.

      - Ultra large networks which don't fit into VRAM?
      Probably not, as these nets need so many training steps that it would take forever. Segmenting/clustering the nets and running them on multiple accelerators is much better. Besides, the overhead of loading the weights / training batch into VRAM shrinks from generation to generation: PCIe 5.0 x16 has a pretty large bandwidth of ~50 GByte/sec (PCIe link to the accelerator), while Sapphire Rapids has ~500 GByte/sec (CPU <-> DDR5 RAM).
      Even when the data doesn't fit into the HBM2 memory of the accelerator card, the accelerator is probably still faster, as long as you do enough operations on that data before writing it back to main memory.

      - Simulation? Same thing as with ML probably.

      For most "enterprise" applications which require a high matrix throughput, there are specialized accelerator cards, why do they integrate something like that into a server CPU? It only wastes some silicon for ~98% of the users / makes the CPU more expensive to manufacture.

      That trend is probably why we have been seeing ARM CPUs and things like Bergamo in datacenters recently.
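
      A rough back-of-the-envelope check of the bandwidth figures above, taking the quoted ~50 GByte/sec (PCIe 5.0 x16) and ~500 GByte/sec (CPU <-> DDR5) at face value; the 10 GB working-set size is purely an illustrative assumption:

      ```cpp
      // Back-of-the-envelope: time to move a working set over PCIe 5.0 x16
      // versus streaming it from local DDR5, using the figures quoted above.
      #include <cstdio>

      int main() {
          const double working_set_gb = 10.0;   // assumed weight/batch size
          const double pcie5_x16_gbs  = 50.0;   // ~GByte/sec, host -> accelerator
          const double ddr5_gbs       = 500.0;  // ~GByte/sec, CPU <-> local DRAM

          printf("PCIe copy : %.2f s\n", working_set_gb / pcie5_x16_gbs);  // ~0.20 s
          printf("DDR5 read : %.2f s\n", working_set_gb / ddr5_gbs);       // ~0.02 s
          // A one-time ~0.2 s transfer is easily amortized over a long training
          // run, which is the point about shrinking transfer overhead.
          return 0;
      }
      ```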



      • #13
        I'm not an ML expert, but like others here I fail to see the use case for CPU matrix accelerators over GPUs in most of these workloads. It makes more sense on mobile and embedded, where you want good power efficiency and lower cost.



        • #14
          As I understand it, the main benefit is in latency for single-shot inference. Yes, a GPU will complete 100k inferences much faster than this, but what about a single inference in the middle of a very long pipeline that requires general-purpose compute, then inference, then more compute?

          That is the type of workload which will benefit. The question is, how common is it?

          Also, someone mentioned the edge but claimed it isn't relevant. That may be somewhat true for AMD (although their embedded line very much does target the edge), but Intel's Xeon Scalable line goes down to ~$400 parts with lower TDPs. They are still targeting the edge with many of these processors.
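
          A tiny sketch of that pattern kept entirely on the CPU. The 128x128 layer size and the naive loop are purely illustrative assumptions; the point is to measure the in-place cost of the step and weigh it against the round trip of a synchronous offload (copy out, launch, copy back):

          ```cpp
          // "Compute, one small inference, more compute" with the inference
          // step done in place on the CPU. The naive GEMM stands in for the
          // single inference embedded in an otherwise general-purpose pipeline.
          #include <chrono>
          #include <cstdio>
          #include <vector>

          int main() {
              const int N = 128;  // assumed "inference-sized" layer, illustrative only
              std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

              auto t0 = std::chrono::steady_clock::now();
              for (int i = 0; i < N; ++i)
                  for (int k = 0; k < N; ++k)
                      for (int j = 0; j < N; ++j)
                          C[i * N + j] += A[i * N + k] * B[k * N + j];
              auto t1 = std::chrono::steady_clock::now();

              long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
              printf("in-place step: %lld us\n", us);
              // If a measured offload round trip costs more than this, keeping the
              // step on the CPU wins on latency despite the GPU's far higher
              // throughput on large batches.
              return 0;
          }
          ```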



          • #15
            RISC-V is of course working on an extension specifically for accelerating matrix ops, but as so much of the ecosystem is still new elsewhere, it is not considered a high priority.

            The focus is on doing things right, and evaluating the state of the art is how you can then do better.

            e.g.: https://riscv.org/blog/2022/07/archi...alibaba-cloud/



            • #16
              Originally posted by Vorpal View Post
              Cool, I guess. But the numbers won't look as good if you also compare to running the same problem on the GPU, which most "serious" ML research does, especially on servers!

              Maybe for inference at the edge it could be useful, but there you won't have server processors; it's mostly a mix of some mobile x86 and a lot of ARM, and maybe some IoT RISC-V these days too. None of which have AMX.

              So what is the actual use case?
              And where are the GPU comparisons?
              Like the “E”-cores, Intel marketing seems to care more about using Apple terms than matching Apple use cases.
              Apple’s story is
              - for LATENCY compute you use the CPU. Apple AMX is an adjunct to the CPU, and to this end supports FP64 and FP32 (along with some smaller formats). Apple AMX essentially fills the role of AVX512 (and can perform many AVX512-style vector operations) but extended out to matrices, not just vectors.
              - for THROUGHPUT compute you mainly use the GPU
              - for the specialized case of throughput that is inference (and mainly visual) you use the NPU.

              Intel’s story is that they have AVX-512 already, but they don’t (at least at the high end) have a GPU on-board...
              So the idea seems to be to provide something that’s kinda like a small GPU/NPU - capable of some of the inference (though not as low power as Apple’s NPU), and some of the training throughput (lower performance than a GPU, but lower latency, so better for cases of small batches, maybe).
              The one thing that’s a poor match is that it’s not really the same sort of thing as Apple’s AMX. That’s more an engine for generic HPC, not a specialized AI engine. (The very first design started off with AI as one of the use cases, but that’s been deprecated in the multiple subsequent redesigns.)

              The problem is the same one as with their entire SPR accelerator strategy. Accelerators are valuable for OPTIONALITY, as an add-on to a CPU that only does the task occasionally. They are much less valuable for users that run a task 24/7, who should be using a dedicated ASIC.
              The SPR accelerators would be great for home users (cf Apple users with accelerators on M1); they are much less interesting at the high end (if you are doing serious training, why aren’t you using a GPU? If you’re doing serious inference, likewise use specialized HW…)
              Last edited by name99; 17 January 2023, 03:38 AM.



              • #17
                Originally posted by Spacefish View Post
                Wonder which market AMX is targeted at...

                - These are server processors, so they're not sitting at the edge to speed up ML inference; we have iGPUs for that anyway.

                - Ultra large networks which don't fit into VRAM?
                Probably not, as these nets need so many training steps that it would take forever. Segmenting/clustering the nets and running them on multiple accelerators is much better. Besides, the overhead of loading the weights / training batch into VRAM shrinks from generation to generation: PCIe 5.0 x16 has a pretty large bandwidth of ~50 GByte/sec (PCIe link to the accelerator), while Sapphire Rapids has ~500 GByte/sec (CPU <-> DDR5 RAM).
                Even when the data doesn't fit into the HBM2 memory of the accelerator card, the accelerator is probably still faster, as long as you do enough operations on that data before writing it back to main memory.

                - Simulation? Same thing as with ML probably.

                For most "enterprise" applications which require a high matrix throughput, there are specialized accelerator cards, why do they integrate something like that into a server CPU? It only wastes some silicon for ~98% of the users / makes the CPU more expensive to manufacture.

                That trend is probably why we have been seeing ARM CPUs and things like Bergamo in datacenters recently.
                I don’t understand the negativity here; they’re bringing dedicated tensor hardware to the CPU. Not everyone needs a GPU, but many still want the performance of Nvidia’s Tensor cores, AMD’s Matrix cores, Google’s TPU, etc. Giving customers another option is great. Did you have the same complaint about wasted silicon/cost when Nvidia added Tensor cores to their gaming GPUs, or when AMD added AI accelerators to the RX 7900 series?
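
                For the curious, this is roughly what driving that tensor hardware looks like at the lowest level: a minimal, illustrative int8 tile multiply using the AMX intrinsics. This is a sketch under several assumptions (Linux, a Sapphire Rapids class CPU, gcc 11+ with -mamx-tile -mamx-int8); real code would come from oneDNN or another library rather than hand-written tiles, and the B operand would need VNNI-style packing for anything other than constant test data.

                ```cpp
                // Minimal AMX sketch: C (16x16 int32) += A (16x64 int8) * B (packed int8).
                #include <immintrin.h>
                #include <unistd.h>
                #include <sys/syscall.h>
                #include <cstdint>
                #include <cstdio>
                #include <cstring>

                // Linux requires a one-time opt-in before tile state may be used.
                #define ARCH_REQ_XCOMP_PERM 0x1023
                #define XFEATURE_XTILEDATA  18

                // 64-byte tile configuration: palette 1, tiles 0..2 as 16 rows x 64 bytes.
                struct alignas(64) TileCfg {
                    uint8_t  palette_id;
                    uint8_t  start_row;
                    uint8_t  reserved[14];
                    uint16_t colsb[16];   // bytes per row for each tile
                    uint8_t  rows[16];    // rows for each tile
                };

                int main() {
                    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
                        puts("kernel refused AMX tile permission");
                        return 1;
                    }

                    TileCfg cfg{};
                    cfg.palette_id = 1;
                    for (int t = 0; t < 3; ++t) { cfg.colsb[t] = 64; cfg.rows[t] = 16; }
                    _tile_loadconfig(&cfg);

                    alignas(64) int8_t  A[16 * 64], B[16 * 64];
                    alignas(64) int32_t C[16 * 16];
                    memset(A, 1, sizeof A);
                    memset(B, 2, sizeof B);

                    _tile_loadd(1, A, 64);     // load A into tile 1 (stride = 64 bytes/row)
                    _tile_loadd(2, B, 64);     // load (pre-packed) B into tile 2
                    _tile_zero(0);             // clear the int32 accumulator tile
                    _tile_dpbssd(0, 1, 2);     // signed int8 dot products accumulated to int32
                    _tile_stored(0, C, 64);    // 16 rows of 16 int32 = 64-byte stride
                    _tile_release();

                    printf("C[0][0] = %d\n", C[0]);  // 64 products of 1*2 -> 128
                    return 0;
                }
                ```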



                • #18
                  Originally posted by WannaBeOCer View Post

                  I don’t understand the negativity here; they’re bringing dedicated tensor hardware to the CPU. Not everyone needs a GPU, but many still want the performance of Nvidia’s Tensor cores, AMD’s Matrix cores, Google’s TPU, etc. Giving customers another option is great. Did you have the same complaint about wasted silicon/cost when Nvidia added Tensor cores to their gaming GPUs, or when AMD added AI accelerators to the RX 7900 series?
                  The negativity comes down to the same point I made: optionality. In desktop/mobile chips AMX (and the other accelerators) would be a fantastic addition. But Intel is selling them (one accelerator at a time... so many SKUs, all with different accelerators!) to a very different sort of market, a market where you tend to do the same thing often enough that you are generally better off buying dedicated HW (a GPU, a DPU, an inference NPU) rather than doing the work on a (very expensive!) CPU chip.

                  i.e. the objection is mostly that this is a lousy business strategy, not a complaint against the tech per se. Maybe that's wrong (Intel SHOULD know its customers better than we do!), but it feels like part of the general aimlessness of Intel over the past five years or so: desperately trying to talk up what they have rather than thinking carefully about where the different attacks are coming from (AMD or ARM with many cores, DPUs and GPUs with accelerators, Apple and ARM for mobile). Ultimately this is a reflection of how Intel's turnaround times are just far too long. They are trying to predict what the competition will look like in 7 years (difficult even under the best of conditions, impossible if you insist on wearing rose-colored glasses) and then are incapable of adjusting halfway through even as reality becomes clearer.



                  • #19
                    Originally posted by name99 View Post

                    The negativity comes down to the same point I made: optionality. In desktop/mobile chips AMX (and the other accelerators) would be a fantastic addition. But Intel is selling them (one accelerator at a time... so many SKUs, all with different accelerators!) to a very different sort of market, a market where you tend to do the same thing often enough that you are generally better off buying dedicated HW (a GPU, a DPU, an inference NPU) rather than doing the work on a (very expensive!) CPU chip.

                    i.e. the objection is mostly that this is a lousy business strategy, not a complaint against the tech per se. Maybe that's wrong (Intel SHOULD know its customers better than we do!), but it feels like part of the general aimlessness of Intel over the past five years or so: desperately trying to talk up what they have rather than thinking carefully about where the different attacks are coming from (AMD or ARM with many cores, DPUs and GPUs with accelerators, Apple and ARM for mobile). Ultimately this is a reflection of how Intel's turnaround times are just far too long. They are trying to predict what the competition will look like in 7 years (difficult even under the best of conditions, impossible if you insist on wearing rose-colored glasses) and then are incapable of adjusting halfway through even as reality becomes clearer.
                    AMX would be overkill for a mobile chip. AVX512_VNNI would have been enough to accelerate AI tasks like Adobe Premiere Pro's AI features and a few others. This is why, for content creation, I recommend Apple's hardware (for its Neural Engine) or an Nvidia RTX GPU.
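
                    For reference, the AVX512_VNNI contribution is essentially one instruction family: VPDPBUSD and friends, which fuse int8 multiplies with an int32 accumulate. A minimal sketch (the values are arbitrary; compile with AVX-512 VNNI enabled, e.g. -march=sapphirerapids or -mavx512f -mavx512vnni):

                    ```cpp
                    // One VPDPBUSD: 64 unsigned int8 times 64 signed int8, accumulated
                    // in groups of four into sixteen int32 lanes.
                    #include <immintrin.h>
                    #include <cstdint>
                    #include <cstdio>

                    int main() {
                        __m512i acc = _mm512_setzero_si512();
                        __m512i a   = _mm512_set1_epi8(3);     // treated as unsigned bytes
                        __m512i b   = _mm512_set1_epi8(2);     // treated as signed bytes
                        acc = _mm512_dpbusd_epi32(acc, a, b);  // each lane: 4 * (3*2) = 24

                        alignas(64) int32_t out[16];
                        _mm512_store_si512(out, acc);
                        printf("lane 0 = %d\n", out[0]);       // prints 24
                        return 0;
                    }
                    ```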

                    I’m sure we’ll see Xeon-W based Sapphire Rapids chips for workstations.

                    In this market people buy hardware and replace it every five years. Sapphire Rapids will save on rack space and power. GPU accelerators and other dedicated accelerators cost between $4,000 and $20,000 just for the add-in card, which uses more space and power. Looking at the benchmarks, Sapphire Rapids outperforms my Titan RTX by ~4% while the total system power draw is less than my desktop's, and in a data center it takes up less space.
                    Last edited by WannaBeOCer; 17 January 2023, 06:38 PM.



                    • #20
                      Originally posted by WannaBeOCer View Post

                      AMX would be overkill for a mobile chip. AVX512_VNNI would have been enough to accelerate AI tasks like Adobe Premiere Pro's AI features and a few others. This is why, for content creation, I recommend Apple's hardware (for its Neural Engine) or an Nvidia RTX GPU.
                      Apples and oranges. The Apple NPU can keep up with a small Nvidia GPU (or a large Apple GPU) using a fraction of the task energy... it's actually quite remarkable, and not even in the same ballpark as AVX512-VNNI.

                      But VNNI, on the other hand, will "just work" with zero code changes on stuff that already runs on the CPU, I think, while the changes required for the NPU are more drastic even at a high level. And there aren't weird quirks (like the NPU RNG being funky, or it not liking low precision in specific scenarios), nor is there any need to partition models and shuffle stuff around.
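
                      That "just works" behaviour comes from runtime dispatch: the library probes CPUID once and picks a kernel family, so application code never changes. A rough sketch of the probe (bit positions are from the Intel SDM for CPUID leaf 7, subleaf 0; the dispatch targets are just illustrative):

                      ```cpp
                      // Runtime feature probe of the kind CPU ML libraries do at load time.
                      #include <cpuid.h>
                      #include <cstdio>

                      int main() {
                          unsigned eax, ebx, ecx, edx;
                          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 1;

                          bool vnni = ecx & (1u << 11);                         // AVX512-VNNI
                          bool amx  = (edx & (1u << 24)) && (edx & (1u << 25)); // AMX-TILE + AMX-INT8

                          if (amx)       puts("dispatch: AMX int8 kernels");
                          else if (vnni) puts("dispatch: AVX-512 VNNI kernels");
                          else           puts("dispatch: AVX2 fallback");
                          return 0;
                      }
                      ```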



