Intel Xeon Max Performance Delivers A Powerful Combination With AMX + HBM2e


  • #11
    Originally posted by jrdoane View Post

    I'm sure there are cases where it makes more of a difference than others. Clearly it's a winner when it comes to raw memory bandwidth. The real question is whether the HBM2e being physically far closer to the cores than external memory is what's really making the difference. There should be a very real latency benefit from not routing all of those wires and traces through the motherboard, compared to just an interposer.
    The bandwidth is what brings the improvement (it's High Bandwidth Memory, not Low Latency Memory); I think the latency is actually worse than with good DDR5.

    Apart from wire length, there is also the very real problem of far more wires/pins: a DDR5 DIMM has ~288 pins while an HBM stack needs over 1,000 signals. You would need many more layers on your motherboard (we are already at around 7?) and many more pins on the CPU socket. This CPU has 4 HBM stacks and already about 4,700 pins (LGA 4677); it would need more than 8,000 pins for external HBM.
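
A rough back-of-the-envelope sketch of that pin math (the per-stack signal count, socket figure and DIMM pin count are approximations used only for illustration):

# Rough estimate of what moving the four HBM stacks off-package would cost in pins.
# All figures are ballpark assumptions for illustration, not vendor specifications.

SIGNALS_PER_HBM_STACK = 1024    # HBM uses a ~1024-bit-wide data interface per stack
HBM_STACKS = 4                  # Xeon Max carries four HBM2e stacks
CURRENT_SOCKET_PINS = 4677      # LGA 4677, the Sapphire Rapids / Xeon Max socket
DDR5_DIMM_PINS = 288            # a DDR5 DIMM slot, for comparison

extra_pins = HBM_STACKS * SIGNALS_PER_HBM_STACK
print(f"Extra socket pins for off-package HBM: ~{extra_pins}")
print(f"Hypothetical socket size: ~{CURRENT_SOCKET_PINS + extra_pins} pins")
print(f"That is roughly {extra_pins // DDR5_DIMM_PINS} DDR5 DIMM interfaces' worth of pins")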



    • #12
      I wonder if this performance advantage of HBM2e only shows up in synthetic benchmarks with small datasets; in a real-life situation, where 10x the memory is needed, would it perform worse?
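
As a crude way to frame that, a sketch comparing a working set against the 64 GB of on-package HBM2e a Xeon Max socket carries (the workload sizes are made up for illustration):

# Sketch: does a working set still fit in on-package HBM once it grows 10x?
# Assumes the 64 GB of HBM2e per Xeon Max socket; the workload sizes are invented.

HBM_CAPACITY_GB = 64

def fits_in_hbm(working_set_gb: float) -> bool:
    return working_set_gb <= HBM_CAPACITY_GB

benchmark_gb = 6                      # hypothetical synthetic-benchmark footprint
real_world_gb = benchmark_gb * 10     # the "10x memory" scenario from the comment

for name, size in (("synthetic benchmark", benchmark_gb), ("10x real-world", real_world_gb)):
    where = "fits in HBM" if fits_in_hbm(size) else "spills to DDR5 (or relies on HBM caching mode)"
    print(f"{name}: {size} GB -> {where}")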



      • #13
        Originally posted by schmidtbag View Post
        I'm surprised HBM2e could make that big of a performance difference.
        I know there are some differences, but isn't that a big part of what gives Apple such a boost with the on-package unified memory on its M1/M2 etc. processors?



        • #14
          Originally posted by Michael View Post
          Right, but basically either way someone will complain "but it's a consumer GPU, with ABC you would have seen XYZ instead"... or "why no pro cards?", etc.
          Also, if you're worried about upsetting Intel with how well a $350 gaming card might measure up to a $10k HPC CPU, I'd suggest the article take pains to point out that inexpensive dGPUs don't have enough RAM for things like LLMs, nor are they suitable for training medium or large models.
          Last edited by coder; 08 July 2023, 04:04 PM.
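
To put rough numbers on the RAM point, a quick sketch of weight-only memory footprints (the VRAM figure and bytes-per-parameter values are the usual rule-of-thumb assumptions, not measurements from the article):

# Rough VRAM needed just to hold LLM weights, ignoring activations, KV cache and
# optimizer state (training needs several times more than these figures).

GAMING_CARD_VRAM_GB = 16    # assumed capacity of a ~$350 gaming card, for illustration

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

for params in (7, 13, 70):
    for precision, bpp in (("FP16/BF16", 2), ("INT8", 1)):
        gb = weights_gb(params, bpp)
        verdict = "fits" if gb <= GAMING_CARD_VRAM_GB else "does not fit"
        print(f"{params}B @ {precision}: ~{gb:.0f} GB of weights -> {verdict} in {GAMING_CARD_VRAM_GB} GB VRAM")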



          • #15
            These new Intel CPUs really look tailored for a very specific task; when you take the AMX instructions out of the equation, you get the same performance as before, plus perhaps some improvements due to HBM. May I say this is not so interesting, given that GPUs already do that kind of computation much better?



            • #16
              blackshard HBM can also accelerate non-ML-related workloads.
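
A classic example is a STREAM-style triad, which is purely bandwidth-bound and has nothing to do with ML; a minimal NumPy sketch (the array size is chosen arbitrarily):

# Minimal STREAM-like triad: a bandwidth-bound, non-ML kernel whose throughput
# scales almost directly with memory bandwidth (DDR5 vs. HBM2e).
import time
import numpy as np

N = 100_000_000                 # ~0.8 GB per float64 array; adjust to taste
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)
scalar = 3.0

start = time.perf_counter()
a[:] = b + scalar * c           # triad: two reads and one write per element
elapsed = time.perf_counter() - start

bytes_moved = 3 * N * 8         # rough traffic estimate, ignoring write-allocate
print(f"Triad bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")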



              • #17
                Originally posted by Anux View Post
                The bandwidth is what brings the improvement (it's High Bandwidth Memory, not Low Latency Memory); I think the latency is actually worse than with good DDR5.

                Apart from wire length, there is also the very real problem of far more wires/pins: a DDR5 DIMM has ~288 pins while an HBM stack needs over 1,000 signals. You would need many more layers on your motherboard (we are already at around 7?) and many more pins on the CPU socket. This CPU has 4 HBM stacks and already about 4,700 pins (LGA 4677); it would need more than 8,000 pins for external HBM.
                I'm not talking about latency that's a function of frequency; I'm talking about latency in the sense of signal propagation delay due to the length of the connections between the CPU and memory. Simply put, signals take longer to propagate over a longer wire. HBM has the advantage of being physically closer to the CPU, with far shorter electrical paths than external (removable) DDR5. I'm not sure what sort of impact that has on memory performance.
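
For a sense of scale, a back-of-the-envelope sketch of flight time versus total access latency (the 7 ps/mm figure is a typical FR-4 trace value and the lengths are guesses, not measurements):

# Back-of-the-envelope: signal propagation (flight) time vs. overall DRAM latency.
# ~7 ps/mm is a typical FR-4 stripline figure; the trace lengths are rough guesses.

PS_PER_MM = 7.0

def round_trip_ns(trace_mm: float) -> float:
    return 2 * trace_mm * PS_PER_MM / 1000    # out and back, in nanoseconds

for name, mm in (("HBM on interposer", 5), ("DIMM slot via motherboard", 100)):
    print(f"{name:26s} ~{mm:3d} mm -> ~{round_trip_ns(mm):.2f} ns round trip")

print("Typical loaded DDR5 access latency is on the order of 80-100 ns, for comparison")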



                • #18
                  I'm curious: I believe SPR AMX only supports INT8 and BF16, yet your tests are labeled FP16. Were there tests available specifically for BF16?
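
One way to check is the CPU feature flags the Linux kernel exposes; Sapphire Rapids should report amx_tile, amx_int8 and amx_bf16, with AMX FP16 only arriving in a later generation. A small sketch, assuming a Linux host:

# Check /proc/cpuinfo for the AMX feature flags the Linux kernel advertises.
# Sapphire Rapids / Xeon Max should show amx_tile, amx_int8 and amx_bf16.

def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("amx_tile", "amx_int8", "amx_bf16", "amx_fp16"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")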



                  • #19
                    Thanks for the article. I'm curious: you seem to test AVX and AMX, but why do your command-line options use only the prehistoric -msse4.1 -msse4.2? Have you actually scanned the built binaries for any AVX-512/AMX instructions, just to make sure they are really being used and you aren't benchmarking something else entirely?
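
For what it's worth, disassembling the built binary and counting the relevant mnemonics is a quick way to verify this; a rough sketch along those lines (the binary path is a placeholder, and it assumes binutils' objdump is installed):

# Disassemble a benchmark binary and count AVX-512 / AMX mnemonics to confirm
# the compiler actually emitted them. Requires objdump from binutils.
import re
import subprocess
import sys
from collections import Counter

binary = sys.argv[1] if len(sys.argv) > 1 else "./a.out"    # placeholder path
disasm = subprocess.run(["objdump", "-d", binary],
                        capture_output=True, text=True, check=True).stdout

counts = Counter()
for line in disasm.splitlines():
    if re.search(r"\bzmm\d+", line):                        # AVX-512 uses zmm registers
        counts["AVX-512 (zmm)"] += 1
    if re.search(r"\b(ldtilecfg|tileloadd|tilestored|tilezero|tdpbf16ps|tdpbssd|tdpbsud|tdpbusd|tdpbuud)\b", line):
        counts["AMX (tile*/tdp*)"] += 1

print(counts if counts else "no AVX-512/AMX instructions found")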



                    • #20
                      Originally posted by kgardas View Post
                      Thanks for the article. I'm curious: you seem to test AVX and AMX, but why do your command-line options use only the prehistoric -msse4.1 -msse4.2? Have you actually scanned the built binaries for any AVX-512/AMX instructions, just to make sure they are really being used and you aren't benchmarking something else entirely?
                      Unfortunately some of the compiler options got cut off in the output... Those footnotes for the graphs are all auto-generated, and for some large code bases the output doesn't all get properly reflected.
                      Michael Larabel
                      https://www.michaellarabel.com/
