Intel Xeon Max Performance Delivers A Powerful Combination With AMX + HBM2e


  • #11
    Originally posted by jrdoane View Post

    I'm sure there are cases where it makes more of a difference than others. Clearly it's a winner when it comes to raw memory bandwidth. The real question is whether the HBM2e being physically far closer to the cores than external memory is what's really making the difference. There should be a very real latency benefit from not routing all of those wires and traces through the motherboard, compared to just an interposer.
    The bandwidth is what brings the improvement (it's High Bandwidth Memory, not Low Latency Memory); I think the latency is actually worse than with good DDR5.

    Apart from wire length, there is also the very real problem of far more wires/pins: a DDR5 DIMM has ~288 pins while an HBM stack needs over 1,000 signals. You would need many more layers on your motherboard (we are already at around 7?) and many more pins on the CPU socket. This CPU has 4 HBM stacks and already about 4,700 pins (LGA 4677); it would need more than 8,000 pins for external HBM.
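
A rough back-of-the-envelope sketch of that pin math (the per-stack signal count, socket figure and DIMM pin count are approximations used only for illustration):

# Rough estimate of what moving the four HBM stacks off-package would cost in pins.
# All figures are ballpark assumptions for illustration, not vendor specifications.

SIGNALS_PER_HBM_STACK = 1024    # HBM uses a ~1024-bit-wide data interface per stack
HBM_STACKS = 4                  # Xeon Max carries four HBM2e stacks
CURRENT_SOCKET_PINS = 4677      # LGA 4677, the Sapphire Rapids / Xeon Max socket
DDR5_DIMM_PINS = 288            # a DDR5 DIMM slot, for comparison

extra_pins = HBM_STACKS * SIGNALS_PER_HBM_STACK
print(f"Extra socket pins for off-package HBM: ~{extra_pins}")
print(f"Hypothetical socket size: ~{CURRENT_SOCKET_PINS + extra_pins} pins")
print(f"That is roughly {extra_pins // DDR5_DIMM_PINS} DDR5 DIMM interfaces' worth of pins")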



    • #12
      I wonder if this performance advantage of HBM2e only shows up in synthetic benchmarks with small datasets; in a real-life situation, where 10x the memory is needed, would it perform worse?
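
As a crude way to frame that, a sketch comparing a working set against the 64 GB of on-package HBM2e a Xeon Max socket carries (the workload sizes are made up for illustration):

# Sketch: does a working set still fit in on-package HBM once it grows 10x?
# Assumes the 64 GB of HBM2e per Xeon Max socket; the workload sizes are invented.

HBM_CAPACITY_GB = 64

def fits_in_hbm(working_set_gb: float) -> bool:
    return working_set_gb <= HBM_CAPACITY_GB

benchmark_gb = 6                      # hypothetical synthetic-benchmark footprint
real_world_gb = benchmark_gb * 10     # the "10x memory" scenario from the comment

for name, size in (("synthetic benchmark", benchmark_gb), ("10x real-world", real_world_gb)):
    where = "fits in HBM" if fits_in_hbm(size) else "spills to DDR5 (or relies on HBM caching mode)"
    print(f"{name}: {size} GB -> {where}")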



      • #13
        Originally posted by schmidtbag View Post
        I'm surprised HBM2e could make that big of a performance difference.
        I know there are some differences, but isn't that a big part of what gives Apple such a boost with the on-package unified memory on its M1/M2 etc. processors?



        • #14
          Originally posted by Michael View Post
          Right, but basically either way someone will complain "but it's a consumer GPU, with ABC you would have seen XYZ instead"... or "why no pro cards?", etc.
          Also, if you're worried about upsetting Intel with how well a $350 gaming card might measure up to a $10k HPC CPU, I'd suggest the article take pains to point out that inexpensive dGPUs don't have enough RAM for things like LLMs, nor are they suitable for training medium or large models.
          Last edited by coder; 08 July 2023, 04:04 PM.
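
To put rough numbers on the RAM point, a quick sketch of weight-only memory footprints (the VRAM figure and bytes-per-parameter values are the usual rule-of-thumb assumptions, not measurements from the article):

# Rough VRAM needed just to hold LLM weights, ignoring activations, KV cache and
# optimizer state (training needs several times more than these figures).

GAMING_CARD_VRAM_GB = 16    # assumed capacity of a ~$350 gaming card, for illustration

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

for params in (7, 13, 70):
    for precision, bpp in (("FP16/BF16", 2), ("INT8", 1)):
        gb = weights_gb(params, bpp)
        verdict = "fits" if gb <= GAMING_CARD_VRAM_GB else "does not fit"
        print(f"{params}B @ {precision}: ~{gb:.0f} GB of weights -> {verdict} in {GAMING_CARD_VRAM_GB} GB VRAM")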



          • #15
            These new Intel CPUs really look tailored for a very specific task; when you take the AMX instructions out of the equation, you get the same performance as before, plus perhaps some improvements due to HBM. May I say this is not so interesting, given that GPUs already do that kind of computation much better?



            • #16
              blackshard HBM can also accelerate non-ML-related workloads.
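
A classic example is a STREAM-style triad, which is purely bandwidth-bound and has nothing to do with ML; a minimal NumPy sketch (the array size is chosen arbitrarily):

# Minimal STREAM-like triad: a bandwidth-bound, non-ML kernel whose throughput
# scales almost directly with memory bandwidth (DDR5 vs. HBM2e).
import time
import numpy as np

N = 100_000_000                 # ~0.8 GB per float64 array; adjust to taste
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)
scalar = 3.0

start = time.perf_counter()
a[:] = b + scalar * c           # triad: two reads and one write per element
elapsed = time.perf_counter() - start

bytes_moved = 3 * N * 8         # rough traffic estimate, ignoring write-allocate
print(f"Triad bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")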



              • #17
                Originally posted by Anux View Post
                The bandwidth is what brings the improvement (it's High Bandwidth Memory, not Low Latency Memory); I think the latency is actually worse than with good DDR5.

                Apart from wire length, there is also the very real problem of far more wires/pins: a DDR5 DIMM has ~288 pins while an HBM stack needs over 1,000 signals. You would need many more layers on your motherboard (we are already at around 7?) and many more pins on the CPU socket. This CPU has 4 HBM stacks and already about 4,700 pins (LGA 4677); it would need more than 8,000 pins for external HBM.
                I'm not talking about latency that's a function of frequency; I'm talking about latency in the sense of signal propagation delay due to the length of the connections between the CPU and memory. Simply put, signals take longer to propagate over a longer wire. HBM has the advantage of being physically closer to the CPU, with far shorter electrical paths than external (removable) DDR5. I'm not sure what sort of impact that has on memory performance.
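
For a sense of scale, a back-of-the-envelope sketch of flight time versus total access latency (the 7 ps/mm figure is a typical FR-4 trace value and the lengths are guesses, not measurements):

# Back-of-the-envelope: signal propagation (flight) time vs. overall DRAM latency.
# ~7 ps/mm is a typical FR-4 stripline figure; the trace lengths are rough guesses.

PS_PER_MM = 7.0

def round_trip_ns(trace_mm: float) -> float:
    return 2 * trace_mm * PS_PER_MM / 1000    # out and back, in nanoseconds

for name, mm in (("HBM on interposer", 5), ("DIMM slot via motherboard", 100)):
    print(f"{name:26s} ~{mm:3d} mm -> ~{round_trip_ns(mm):.2f} ns round trip")

print("Typical loaded DDR5 access latency is on the order of 80-100 ns, for comparison")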



                • #18
                  I'm curious: I believe SPR AMX only supports INT8 and BF16, yet your tests are labeled FP16. Were there tests available specifically for BF16?
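
One way to check is the CPU feature flags the Linux kernel exposes; Sapphire Rapids should report amx_tile, amx_int8 and amx_bf16, with AMX FP16 only arriving in a later generation. A small sketch, assuming a Linux host:

# Check /proc/cpuinfo for the AMX feature flags the Linux kernel advertises.
# Sapphire Rapids / Xeon Max should show amx_tile, amx_int8 and amx_bf16.

def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("amx_tile", "amx_int8", "amx_bf16", "amx_fp16"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")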



                  • #19
                    Thanks for the article. I'm curious: you seem to test AVX and AMX, but why do your command-line options use only the prehistoric -msse4.1 -msse4.2? Have you actually scanned the built binaries for any AVX-512/AMX instructions, just to make sure they are really being used and you aren't benchmarking something else entirely?
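
For what it's worth, disassembling the built binary and counting the relevant mnemonics is a quick way to verify this; a rough sketch along those lines (the binary path is a placeholder, and it assumes binutils' objdump is installed):

# Disassemble a benchmark binary and count AVX-512 / AMX mnemonics to confirm
# the compiler actually emitted them. Requires objdump from binutils.
import re
import subprocess
import sys
from collections import Counter

binary = sys.argv[1] if len(sys.argv) > 1 else "./a.out"    # placeholder path
disasm = subprocess.run(["objdump", "-d", binary],
                        capture_output=True, text=True, check=True).stdout

counts = Counter()
for line in disasm.splitlines():
    if re.search(r"\bzmm\d+", line):                        # AVX-512 uses zmm registers
        counts["AVX-512 (zmm)"] += 1
    if re.search(r"\b(ldtilecfg|tileloadd|tilestored|tilezero|tdpbf16ps|tdpbssd|tdpbsud|tdpbusd|tdpbuud)\b", line):
        counts["AMX (tile*/tdp*)"] += 1

print(counts if counts else "no AVX-512/AMX instructions found")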



                    • #20
                      Originally posted by kgardas View Post
                      Thanks for the article. I'm curious: you seem to test AVX and AMX, but why do your command-line options use only the prehistoric -msse4.1 -msse4.2? Have you actually scanned the built binaries for any AVX-512/AMX instructions, just to make sure they are really being used and you aren't benchmarking something else entirely?
                      Unfortunately some of the compiler options got cut off in the output... Those footnotes for the graphs are all auto-generated, and for some large code bases the output doesn't all get properly reflected.
                      Michael Larabel
                      https://www.michaellarabel.com/
