Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids
One of the most exciting features of Intel's 4th Gen Xeon Scalable "Sapphire Rapids" processors is the introduction of Advanced Matrix Extensions (AMX). The Intel AMX ISA extensions are intended for speeding up AI and machine learning workloads. This article looks at the machine learning performance of the Xeon Platinum 8490H processors with AMX enabled versus disabled.
Intel disclosed AMX three years ago and while the Sapphire Rapids launch was delayed multiple times, that delay allowed Intel engineers additional time to get the software support squared away. On the compiler side, the initial AMX enablement work premiered in GCC 11 back in 2021 and has been part of LLVM 12 since its 2021 release. GNU Assembler (Gas) support for AMX debuted in 2020 as well. The Linux kernel patches around AMX handling have been upstream since Linux 5.16, Linux 5.17 added AMX support for KVM, and Linux 6.0 carried an AMX power management fix. AMX support has also worked its way into other components of the Linux virtualization stack, such as QEMU 7.0.
For Sapphire Rapids with AMX the initial accelerator is the Tile Matrix Multiply (TMUL) unit, which supports the BF16 and INT8 data types to speed up the matrix multiplication at the heart of AI/ML workloads. As Intel has already stated, AMX-FP16 support is on the way for Granite Rapids, adding FP16 as an input type. AMX is a standalone extension separate from AVX and its presence can be checked via /proc/cpuinfo with the new "amx_bf16", "amx_int8", and "amx_tile" flags. All current Xeon Sapphire Rapids processors support the Advanced Matrix Extensions.
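As a rough illustration of that /proc/cpuinfo check, the short Python sketch below scans the "flags" lines for the three AMX feature names the kernel exposes. The helper function name is illustrative, not part of any standard API:

```python
# Sketch: look for the kernel-exposed AMX feature flags ("amx_bf16",
# "amx_int8", "amx_tile") in a /proc/cpuinfo dump. The function name
# amx_flags_present is a hypothetical helper for this example.

AMX_FLAGS = {"amx_bf16", "amx_int8", "amx_tile"}

def amx_flags_present(cpuinfo_text):
    """Return the subset of AMX flags found in cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        # Each logical CPU has a "flags : ..." line listing its features.
        if line.startswith("flags"):
            found |= AMX_FLAGS & set(line.split(":", 1)[1].split())
    return found

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # Non-Linux systems have no /proc/cpuinfo
    present = amx_flags_present(text)
    if AMX_FLAGS <= present:
        print("All AMX flags present:", sorted(present))
    else:
        print("Missing AMX flags:", sorted(AMX_FLAGS - present))
```

On a Sapphire Rapids system all three flags should be reported; on older Xeons or non-x86 hardware the set will come back empty.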
Just as it took time for AVX-512 support to become widely adopted among open-source software packages, Intel AMX adoption will take some time as well. While the compiler support and programming reference manual documentation have been around for 2+ years, Intel software partners have only had Sapphire Rapids servers in recent months, and other independent software developers still await hardware availability or access to 4th Gen Xeon Scalable processors from the public cloud providers. There is also the Intel DevCloud.
Those wanting to learn all the fine details about the Advanced Matrix Extensions instruction set can see chapter three of the Intel reference manual, which covers the technical details at length.
Let's move on to looking at what can leverage Advanced Matrix Extensions right away before getting to the exciting benchmark numbers for both raw performance and power efficiency.