Google Engineer Introduces "Light AVX" Support Within LLVM

Written by Michael Larabel in LLVM on 25 January 2023 at 09:00 AM EST.

Google engineer Ilya Tocar has introduced the notion of "light" AVX support within the LLVM compiler infrastructure. The aim is to capture some of the benefits of Advanced Vector Extensions (AVX) while avoiding the power/frequency impact that AVX-512 use has on older generations of Intel processors.

Merged to LLVM 16 Git yesterday, just prior to the end of LLVM 16 feature development, was the introduction of the "light" AVX concept to this open-source compiler. This light AVX mode allows 256-bit loads/stores to be generated even when the preference (set via the -mprefer-vector-width=128 compiler option) is for a 128-bit vector width.


This light AVX mode can be enabled for the Clang compiler by passing +allow-light-256-bit to the -mattr= compiler option. It is wired up to be used on Intel Ice Lake processors and older, where encountering 256-bit/512-bit AVX use can have a performance (power/frequency) impact. Newer Intel CPUs don't have any major problems with AVX-512 use -- in case you missed it, see my AVX-512 Sapphire Rapids benchmark comparison. Similarly, AMD's AVX-512 support introduced with Zen 4 processors doesn't suffer from these frequency/power problems.

Ilya Tocar summed up this light AVX work for LLVM with the commit message:

AVX/AVX512 instructions may cause frequency drop on e.g. Skylake. The magnitude of frequency/performance drop depends on instruction (multiplication vs load/store) and vector width. Currently users, that want to avoid this drop can specify -mprefer-vector-width=128. However this also prevents generations of 256-bit wide instructions, that have no associated frequency drop (mainly load/stores).

Add a tuning flag that allows generations of 256-bit AVX load/stores, even when -mprefer-vector-width=128 is set, to speed-up memcpy&co. Verified that running memcpy loop on all cores has no frequency impact and zero CORE_POWER:LVL[12]_TURBO_LICENSE perf counters.

Makes coping memory faster e.g.:
BM_memcpy_aligned/256 80.7GB/s ± 3% 96.3GB/s ± 9% +19.33% (p=0.000 n=9+9)
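
To make the memcpy case from the commit message concrete, here is a minimal C sketch of a fixed-size 256-byte copy, loosely modeled on the BM_memcpy_aligned/256 benchmark name. The function names and the BLOCK_BYTES constant are hypothetical and are not taken from the LLVM patch or its benchmark. The compile command in the comment is only a suggestion for inspecting the generated assembly, and the resulting register width has not been verified here.

```c
/*
 * Sketch only: a fixed-size 256-byte copy. All names are hypothetical.
 *
 * To inspect code generation, one might compile to assembly with
 *   clang -O2 -march=skylake-avx512 -mprefer-vector-width=128 -S copy.c
 * and check whether the copy uses 128-bit (xmm) or 256-bit (ymm) moves.
 * Which registers appear depends on the LLVM version and CPU tuning;
 * that outcome is an assumption and has not been verified here.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK_BYTES = 256 };

/* Copy one 256-byte block. The constant size typically lets the compiler
 * expand memcpy inline into vector loads/stores rather than a libc call. */
void copy_block(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, BLOCK_BYTES);
}

/* Copy `blocks` consecutive blocks; returns the number of bytes copied. */
size_t copy_blocks(unsigned char *dst, const unsigned char *src, size_t blocks)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < blocks; i++)
        copy_block(dst + i * BLOCK_BYTES, src + i * BLOCK_BYTES);
    return blocks * BLOCK_BYTES;
}
```

A fixed-size copy like this is the kind of code where load/store width matters, since the compiler chooses the vector width when it expands the copy inline.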

This "light" AVX option for prior generations of Intel CPUs will be found in LLVM 16.0, which is expected to be released around 7 March.
