Google Engineer Introduces "Light AVX" Support Within LLVM

Merged to LLVM 16 Git yesterday -- just prior to LLVM 16 feature development ending -- was the introduction of a "light" AVX concept to this open-source compiler. This light AVX mode allows the compiler to generate 256-bit loads/stores even when the -mprefer-vector-width=128 compiler option is set to prefer a 128-bit vector width.
This light mode of AVX can be enabled for the Clang compiler by passing +allow-light-256-bit to the -mattr= compiler option. The light AVX mode is wired up for Intel Icelake processors and older, where 256-bit/512-bit AVX use can carry a performance (power/frequency) impact. Newer Intel CPUs don't have any major problems with AVX-512 use -- in case you missed it, see my AVX-512 Sapphire Rapids benchmark comparison. Similarly, AMD's AVX-512 support introduced with Zen 4 processors doesn't have those frequency/power problems either.
Ilya Tocar summed up this light AVX work for LLVM with the commit message:
AVX/AVX512 instructions may cause frequency drop on e.g. Skylake. The magnitude of frequency/performance drop depends on instruction (multiplication vs load/store) and vector width. Currently users, that want to avoid this drop can specify -mprefer-vector-width=128. However this also prevents generations of 256-bit wide instructions, that have no associated frequency drop (mainly load/stores).
Add a tuning flag that allows generations of 256-bit AVX load/stores, even when -mprefer-vector-width=128 is set, to speed-up memcpy&co. Verified that running memcpy loop on all cores has no frequency impact and zero CORE_POWER:LVL[12]_TURBO_LICENSE perf counters.
Makes coping memory faster e.g.:
BM_memcpy_aligned/256 80.7GB/s ± 3% 96.3GB/s ± 9% +19.33% (p=0.000 n=9+9)
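For those wanting to poke at this themselves, a fixed-size aligned copy loop along the lines of what the benchmark name suggests could look roughly like the C sketch below. This is not the benchmark from the commit: the helper names copy_blocks and memcpy_gbps, the 32-byte alignment, and the clock()-based timing are all my own assumptions for illustration. The idea is to build it once with only -mprefer-vector-width=128 and once with the light AVX tuning also enabled, then compare the reported throughput and the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

enum { BLOCK_BYTES = 256 };  /* mirrors the "/256" in BM_memcpy_aligned/256 */

/* Copy one fixed-size block `iters` times and return the bytes copied.
   The constant size lets the compiler expand memcpy inline, which is
   where the choice between 128-bit and 256-bit load/stores shows up. */
static size_t copy_blocks(unsigned char *dst, const unsigned char *src,
                          size_t iters) {
    assert(dst != NULL && src != NULL);
    size_t total = 0;
    for (size_t i = 0; i < iters; i++) {
        memcpy(dst, src, BLOCK_BYTES);
        /* GCC/Clang-specific compiler barrier: keeps each copy from being
           optimized away or hoisted out of the loop. */
        __asm__ volatile("" : : "r"(dst) : "memory");
        total += BLOCK_BYTES;
    }
    return total;
}

/* Hypothetical timing helper: runs copy_blocks on 32-byte-aligned buffers
   and returns GB/s, or -1.0 if allocation fails. clock() measures CPU
   time and is coarse, so use a large iteration count. */
static double memcpy_gbps(size_t iters) {
    unsigned char *src = aligned_alloc(32, BLOCK_BYTES);
    unsigned char *dst = aligned_alloc(32, BLOCK_BYTES);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, BLOCK_BYTES);
    clock_t start = clock();
    size_t bytes = copy_blocks(dst, src, iters);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    free(src);
    free(dst);
    return secs > 0.0 ? (double)bytes / secs / 1e9 : 0.0;
}
```

Calling memcpy_gbps() from a small main() with a large iteration count and printing the result gives a rough single-core figure. The commit's own verification ran the memcpy loop on all cores and checked the turbo license perf counters, which this sketch does not attempt.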
This "light" AVX option for prior generations of Intel CPUs will be found in LLVM 16.0, which is expected to be released around 7 March.