Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids


  • filbo
    replied
    Did absolutely none of those workloads succeed when forced down below AVX512_CORE? For any that could succeed, it would be fascinating to see AVX2 results, as that represents the relevant baseline experienced by consumer Intel 12th & 13th gen chips and AMD Zen < 4.

    Who cares? It seems relevant to me, as small bits of various sorts of AI/ML processing are more and more likely to trickle down into consumer applications. Seeing how badly AVX2 gets beaten would speak to how significant it might be for one of the CPU vendors to start adding this stuff into consumer chips. (As indeed AMD have done, with AVX512 in Zen 4)
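    For reference, here is a minimal sketch of how a run like that could be forced down to AVX2: oneDNN, which backs PyTorch's CPU path, honors the ONEDNN_MAX_CPU_ISA environment variable. Which ops actually dispatch through oneDNN depends on the build, and the model and iteration count below are arbitrary placeholders.

    ```python
    # Sketch: cap oneDNN's ISA level before PyTorch loads, then time a
    # bf16-heavy op. Re-run with AVX512_CORE or AVX512_CORE_AMX to compare.
    import os
    os.environ["ONEDNN_MAX_CPU_ISA"] = "AVX2"  # must be set before importing torch

    import time
    import torch

    model = torch.nn.Linear(1024, 1024).to(torch.bfloat16).eval()
    x = torch.randn(256, 1024, dtype=torch.bfloat16)

    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(1000):
            model(x)
    print(f"1000 iterations: {time.perf_counter() - start:.3f} s")
    ```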


  • brucethemoose
    replied
    Originally posted by jeisom View Post
    I am still in the dark about who these instructions are targeted at. AI/ML workloads run much better on a dedicated hardware AI accelerator or a GPU, and if you are in the market for one of these chips, a low-end compute engine of some sort would also be under consideration. It would probably make more sense to just put an ML accelerator unit on the chip, with instructions to move data over to it more efficiently. Then again, that approach would be better suited to consumer chips, not enterprise ones. It is really surprising that neither AMD nor Intel has released chips with this in mind, as it has been done for years on ARM-based chips.

    It would be nice to see more AI/ML that benefits end users and doesn't run in the cloud. Most of what I see is media processing, image categorization (at least on Apple), and some work in content creation (Photoshop).
    See the posts above, but basically it's for hilariously large models like the ones Facebook runs, or for light-use instances where a dedicated PCIe accelerator just isn't worth the extra cost but a lesser CPU doesn't quite cut the mustard. AMX/OpenVINO is also extremely easy to "switch on" in PyTorch and the like, which is not the case for some other accelerators; see the sketch just below.
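    A rough sketch of what "switching it on" looks like, assuming a Sapphire Rapids host and an oneDNN-backed PyTorch build; the model and shapes are placeholders, and intel_extension_for_pytorch is optional:

    ```python
    # Sketch: bf16 autocast on CPU; on AMX-capable chips oneDNN can route
    # these ops to the AMX tiles. ipex adds extra operator fusion if installed.
    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).eval()

    try:
        import intel_extension_for_pytorch as ipex  # optional
        model = ipex.optimize(model, dtype=torch.bfloat16)
    except ImportError:
        pass  # plain autocast below still works on stock PyTorch

    x = torch.randn(8, 1024)
    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        out = model(x)
    print(out.shape)
    ```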

    Centaur made an x86 CPU that is precisely what you are describing, and it was excellent on paper, but it never caught on: https://fuse.wikichip.org/news/3256/...s-an-ai-punch/

    Intel laptop CPUs have had proprietary AI accelerators for years, but they probably need beefier, less niche designs for more "general" use. And AMD is now shipping an AI accelerator in its laptop chips, but it too needs some software enablement. The only one that "just works" right now is Apple CoreML, and it's actually quite good.



    And yeah, the content-creation AI is coming like a tidal wave. There is already a brewing controversy over text-to-image and the accompanying Photoshop/GIMP/Krita plugins; it's just not mainstream yet because it's so finicky to set up (largely thanks to Nvidia :/).
    Last edited by brucethemoose; 16 January 2023, 05:32 PM.


  • jeisom
    replied
    I am still in the dark about who these instructions are targeted at. AI/ML workloads run much better on a dedicated hardware AI accelerator or a GPU, and if you are in the market for one of these chips, a low-end compute engine of some sort would also be under consideration. It would probably make more sense to just put an ML accelerator unit on the chip, with instructions to move data over to it more efficiently. Then again, that approach would be better suited to consumer chips, not enterprise ones. It is really surprising that neither AMD nor Intel has released chips with this in mind, as it has been done for years on ARM-based chips.

    It would be nice to see more AI/ML that benefits end users and doesn't run in the cloud. Most of what I see is media processing, image categorization (at least on Apple), and some work in content creation (Photoshop).


  • brucethemoose
    replied
    Originally posted by Michael View Post

    Right, but I'll obviously get criticism for comparing a $17k processor against a couple-hundred-dollar consumer card, and that likely isn't a very useful comparison in practice...
    TBH an RTX 2080 (the equivalent of the still very common Nvidia Tesla T4) or an RTX 3060 will probably smoke Sapphire Rapids in these benchmarks.

    In the CUDA ML world, there is basically zero difference between a high-end RTX gaming card and the Quadro/server cards other than VRAM size. They perform the same, barring the top-end HBM cards like the A100, which have no desktop equivalent, and they are the same compilation target on the software side of things. So maybe readers will complain, but those complaints would just be nonsense.



    What I would really be interested in is a "hybrid" benchmark with ColossalAI or DeepSpeed, where some GPU is doing the heavy lifting and the CPU is handling everything else that doesn't fit into VRAM. Other than the aforementioned light-workload scenario, this is where Sapphire Rapids could really shine over EPYC as an ML server host. But I realize this is a tall order.
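    For the curious, a rough sketch of the kind of DeepSpeed setup that does this, assuming deepspeed is installed and a CUDA GPU is present; the model, batch size, and optimizer settings are placeholders:

    ```python
    # Sketch: DeepSpeed ZeRO stage 2 with optimizer state offloaded to CPU
    # memory, so the GPU holds only parameters, gradients, and activations.
    import torch
    import deepspeed

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    )

    ds_config = {
        "train_batch_size": 8,
        "bf16": {"enabled": True},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"},  # host RAM does the bookkeeping
        },
    }

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    ```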
    Last edited by brucethemoose; 16 January 2023, 04:01 PM.


  • peterdk
    replied
    They would be relevant, I guess, if they outperformed the CPU, which is certainly possible, even on consumer cards.


  • Michael
    replied
    Originally posted by brucethemoose View Post

    ML benchmarks should run "out of the box" on regular RTX cards, or even AMD cards with ROCm PyTorch builds.

    These don't look like they will overflow the VRAM pool.
    Right, but I'll obviously get criticism for comparing a $17k processor against a couple-hundred-dollar consumer card, and that likely isn't a very useful comparison in practice...


  • brucethemoose
    replied
    Originally posted by Michael View Post

    Because I don't have review samples of any of the professional cards to test...
    ML benchmarks should run "out of the box" on regular RTX cards, or even AMD cards with ROCm PyTorch builds.

    These don't look like they will overflow the VRAM pool.


    Originally posted by Vorpal View Post
    Cool, I guess. But the numbers won't look as good if you also compare to running the same problem on the GPU, which most "serious" ML research does, especially on servers!
    Some pre/post-processing or utility scripts run on the CPU, and sometimes users run the less expensive bits of a GPU model on the CPU to save VRAM; for example:
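    A hand-rolled sketch of that kind of split, with illustrative sizes: the memory-hungry but computationally cheap embedding table stays in host RAM, while the compute-heavy layers live on the GPU and only the activations cross the bus.

    ```python
    # Sketch: manual CPU/GPU placement to save VRAM on a large embedding.
    import torch

    embed = torch.nn.Embedding(500_000, 1024)          # stays in CPU RAM
    body = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
        num_layers=4,
    ).to("cuda")                                        # heavy compute on GPU

    tokens = torch.randint(0, 500_000, (1, 128))
    hidden = embed(tokens).to("cuda")                   # transfer activations only
    out = body(hidden)
    ```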

    There are even some work-in-progress frameworks that do this "automatically" on existing codebases, like Microsoft DeepSpeed and ColossalAI.

    ("DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective." - microsoft/DeepSpeed)




    Historically, Facebook is kinda infamous for running models too big to fit on GPUs, which is why Intel has basically made custom Facebook SKUs, like Cooper Lake, for years.

    In other cases, something simple and/or infrequent, like face detection, is just not worth buying a GPU instance for if the CPU will get it done; for example:
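    That case fits in one snippet: classical face detection on the CPU with OpenCV's bundled Haar cascade, no GPU involved ("photo.jpg" is a placeholder path):

    ```python
    # Sketch: CPU-only face detection with OpenCV's stock Haar cascade.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    img = cv2.imread("photo.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"found {len(faces)} face(s)")
    ```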
    Last edited by brucethemoose; 16 January 2023, 03:45 PM.


  • Michael
    replied
    Originally posted by Vorpal View Post
    And where are the GPU comparisons?
    Because I don't have review samples of any of the professional cards to test...


  • c117152
    replied
    Originally posted by Vorpal View Post
    So what is the actual use case?
    It's likely they finally reached the SIMD width where proper vector and matrix instructions, like RISC-V's RVV, are actually cheaper than piling more complexity into the decoder pathway.


  • Vorpal
    replied
    Cool, I guess. But the numbers won't look as good if you also compare to running the same problem on the GPU, which most "serious" ML research does, especially on servers!

    Maybe it could be useful for inference at the edge, but there you won't have server processors; you'll mostly have a mix of some mobile x86 and a lot of ARM, and maybe some IoT RISC-V these days too. None of those have AMX.

    So what is the actual use case?
    And where are the GPU comparisons?
    Last edited by Vorpal; 16 January 2023, 02:39 PM. Reason: Fix typo
