Intel Xeon Max Delivers A Powerful Performance Combination With AMX + HBM2e
The Intel Xeon Max 9480 flagship Sapphire Rapids CPU with HBM2e memory tops out at 56 cores / 112 threads, so how can it compete with the latest AMD EPYC processors hitting 96 cores for Genoa (or 128 cores with the forthcoming Bergamo)? Besides the on-package HBM2e that is unique to the Xeon Max family, the other ace Xeon Max shares with the rest of the Sapphire Rapids line-up is support for the Advanced Matrix Extensions (AMX). Today's benchmarks of the Intel Xeon Max show precisely how HBM2e and AMX together allow it to compete with -- and outperform -- AMD's EPYC 9554 and 9654 processors in AI workloads that effectively leverage AMX and the onboard high-bandwidth memory.
The Intel Xeon Max 9468/9480 benchmarks published on Phoronix last week provided a look at the performance of these "SPR HBM2e" processors with the high-bandwidth memory outright disabled, running in DDR5+HBM2e caching mode, and then in HBM2e-only mode, to see what the 64GB of high-bandwidth memory per processor means for HPC/server performance in 2023. Overall the high-bandwidth memory improved Xeon Max performance by 18~20% based on the geometric mean of all the tests, while some workloads improved by far more. But 56 cores still falls short of the top core counts offered by AMD EPYC and Ampere Computing, or even the non-Max Sapphire Rapids Xeon Platinum parts providing 60 cores. AMD EPYC 9004 series processors also offer 12 channels of DDR5 memory rather than the 8 channels of Sapphire Rapids.
For workloads that are memory intensive and able to effectively leverage not only the HBM2e memory but also the Advanced Matrix Extensions (AMX) new to Sapphire Rapids, that can be enough to compete against the current AMD EPYC Genoa processors and even outperform those top-tier parts. Advanced Matrix Extensions are Intel's play with 4th Gen Xeon Scalable for accelerating AI workloads. Intel has been working on the open-source/Linux support for AMX going back to mid-2020 and since then has landed the enablement code in the GCC and LLVM/Clang compilers, the Linux kernel, and other toolchain components. Intel has also contributed AMX-related bits to their Cloud Hypervisor, KVM, and other common Linux software components. Intel is already working on the software side for AMX-COMPLEX, which will premiere with Granite Rapids D.
Thanks to upstreaming their support well in advance of the Sapphire Rapids launch, on the compiler side there has been Advanced Matrix Extensions support going back to GCC 11 and LLVM/Clang 13. I covered AMX performance in much more detail earlier in the year in Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids.
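For readers wanting to double check that a given installation exposes AMX before diving into benchmarks, the Linux kernel advertises the relevant feature flags in /proc/cpuinfo. Here is a minimal, illustrative check in Python -- just a convenience sketch rather than anything used for the formal benchmarking in this article:

```python
#!/usr/bin/env python3
# Quick sanity check that the kernel is exposing the AMX feature flags
# (amx_tile / amx_bf16 / amx_int8) plus AVX-512 FP16 on this system.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for feature in ("amx_tile", "amx_bf16", "amx_int8", "avx512_fp16"):
    print(f"{feature}: {'present' if feature in flags else 'missing'}")
```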
Plus with Intel's robust open-source software ecosystem, they implemented AMX support pre-launch in their prominent and widely-used projects like the oneDNN neural network library and the OpenVINO toolkit. AMX support has also since been added to other key HPC software projects like libxsmm. Intel's timely open-source support, maintenance of countless widely-used open-source software projects, and stellar contributions to existing third-party open-source projects remain top-notch in the industry. I chose to use OpenVINO for this testing as I am very familiar with it from over the years, it has proven to work out well for benchmarking across the x86_64 hardware spectrum, and it has had express AMX support from early on.
To analyze the combined AMX and HBM2e impact on performance, the Open Visual Inference and Neural network Optimization (OpenVINO) toolkit was used. As shown in the prior Xeon Max article, HBM2e usage can be toggled by switching to Flat (1LM) mode in the BIOS and simply not assigning anything to the HBM2e, to HBM caching (2LM) mode, or to HBM-only mode by depopulating the DDR5 DIMMs. With oneDNN/OpenVINO, AMX usage can also be toggled via the ONEDNN_MAX_CPU_ISA environment variable, where AVX512_CORE_AMX is the default on Sapphire Rapids and can optionally be pulled back to lower AVX/AVX-512 levels; a minimal sketch of that toggle follows the list below. For this article to assess the combined impact on the Xeon Max 9480, the following configurations were tested:
Xeon Max 9480 2P, No HBM, Max AVX512 FP16 - Running the two Xeon Max 9480 processors with just DDR5 memory being used and limiting oneDNN/OpenVINO to AVX-512 FP16 but no AMX, a.k.a. the barebones run.
Xeon Max 9480 2P, No HBM - The flagship Xeon Max CPUs with AMX being used by the software under test but the HBM2e memory is not being used (DDR5 only).
Xeon Max 9480 2P, HBM Caching - The Xeon Max 9480 processors with AMX and running in the HBM2e caching mode between the combined 128GB of HBM memory and 512GB of DDR5 system memory.
Xeon Max 9480 2P, HBM Only - The most optimal configuration for Xeon Max 9480 processors, running entirely off the 128GB of HBM2e (64GB per socket) and with AMX being used. No DDR5 was populated, and these OpenVINO benchmarks are able to run with less than 2GB of RAM per thread, making this configuration suitable for testing.
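For those curious what the AMX toggle looks like in practice, below is a minimal Python sketch of capping oneDNN's ISA dispatch via ONEDNN_MAX_CPU_ISA before the OpenVINO CPU plugin initializes. The model path and static input shape are placeholders rather than the actual OpenVINO test profiles used for this article:

```python
#!/usr/bin/env python3
# Minimal sketch of the AMX toggle: cap oneDNN's ISA dispatch via
# ONEDNN_MAX_CPU_ISA before the OpenVINO CPU plugin initializes,
# then run inference as usual. Model path and input are placeholders.
import os
import numpy as np

# AVX512_CORE_AMX is the oneDNN default on Sapphire Rapids; dropping to
# AVX512_CORE_FP16 (or lower) takes AMX out of the picture.
os.environ["ONEDNN_MAX_CPU_ISA"] = "AVX512_CORE_FP16"

from openvino.runtime import Core  # import after setting the environment

core = Core()
model = core.read_model("resnet-50.xml")      # placeholder model file
compiled = core.compile_model(model, "CPU")

# Assumes a static input shape for simplicity.
dummy = np.random.rand(*compiled.input(0).shape).astype(np.float32)
request = compiled.create_infer_request()
results = request.infer([dummy])
print(results[compiled.output(0)].shape)
```

Pairing that with ONEDNN_VERBOSE=1 will print the dispatched primitives, which is a quick way to confirm whether avx512_core_amx kernels are actually in use for a given run.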
These Xeon Max numbers were compared to both the AMD EPYC 9554 2P (64 cores per socket) and AMD EPYC 9654 2P (96 cores per socket) processors. The AMD EPYC 9654 is the current top-end SKU ahead of Bergamo and Genoa-X shipping soon.
The Xeon Max processors continue to be tested in a Supermicro Hyper SuperServer SYS-221H-TN platform. Thanks to Intel and Supermicro for providing the server hardware that made this Xeon Max performance benchmarking possible.
During the testing, the CPU package power consumption was recorded for comparison purposes across the varying Xeon Max run configurations.
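On the Linux side, package power can be sampled from the kernel's RAPL powercap sysfs interface. Below is a minimal sketch of polling those counters directly -- an illustrative approximation of the approach rather than the Phoronix Test Suite's own sensor monitoring code, and it assumes the standard intel-rapl sysfs paths (reading energy_uj typically requires root on recent kernels):

```python
#!/usr/bin/env python3
# Sample per-package CPU power from the Linux RAPL powercap interface
# (/sys/class/powercap/intel-rapl:*). Illustrative sketch only.
import glob, time

def read_energy_uj(path):
    with open(path + "/energy_uj") as f:
        return int(f.read())

packages = sorted(glob.glob("/sys/class/powercap/intel-rapl:[0-9]"))
interval = 1.0  # seconds between samples

before = [read_energy_uj(p) for p in packages]
time.sleep(interval)
after = [read_energy_uj(p) for p in packages]

for pkg, e0, e1 in zip(packages, before, after):
    # energy_uj is a monotonically increasing microjoule counter (wraparound
    # is ignored in this simple sketch); power = delta energy / delta time.
    watts = (e1 - e0) / 1e6 / interval
    with open(pkg + "/name") as f:
        name = f.read().strip()
    print(f"{name} ({pkg}): {watts:.1f} Watts")
```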