GCC 11 Squeezes In Another Zen 3 Optimization

Written by Michael Larabel in GNU on 18 March 2021 at 03:30 AM EDT.

Just weeks ahead of the GCC 11 stable release, we saw Znver3 tuning work out of SUSE that lets the GNU Compiler Collection better cater to the AMD Zen 3 microarchitecture. That tuning work follows the initial patch at the end of last year that introduced "Znver3" and flipped on the new instructions. Now another patch working on the Zen 3 tuning for GCC has been posted and already merged.

Jan Hubicka of SUSE has been the one working on this AMD Zen 3 tuning support for GCC 11 that is coming in at the last minute, presumably due to AMD wanting it timed for the EPYC 7003 series debut. Following the initial tuning patch from Monday, a second patch was posted on Wednesday. This latest patch enables the use of AVX2 "GATHER" instructions on Zen 3. AVX2 GATHER support allows vector elements to be loaded from non-contiguous memory locations, but over the years there have been mixed feelings and mixed results regarding its usefulness.
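To make the gather idea concrete, here is a minimal C sketch of the kind of indirectly indexed loop where a vectorizer may choose a gather instruction. The function name `gather_copy` and its parameters are hypothetical and are not taken from the patch or from TSVC. Whether GCC actually emits a gather for this loop depends on the compiler version, on options such as `-O3 -march=znver3`, and on its cost model.

```c
#include <stddef.h>

/*
 * Hypothetical example: copy table[idx[i]] into out[i].
 * The loads from `table` are non-contiguous because the positions come
 * from `idx`, which is the access pattern AVX2 gather instructions serve.
 * A compiler is free to vectorize this with gathers or to keep it scalar.
 */
void gather_copy(float *out, const float *table, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = table[idx[i]];
}
```

Calling `gather_copy(out, table, idx, 4)` with `idx = {7, 0, 3, 3}` fills `out` with elements 7, 0, 3 and 3 of `table`. Inspecting the generated assembly is the only reliable way to see whether a gather was actually used.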

While AMD CPUs back to Excavator have fully supported AVX2 including the GATHER instructions, GCC has not been auto-generating those instructions for AMD processors in recent releases. Back in 2018, GCC disabled the auto-generation of gather instructions since they were found to be slow except on Skylake and newer Intel processors. Namely, these AVX2 gather instructions were found to be slow on Intel Haswell, where AVX2 was introduced, and on Zen 1 processors.

Now with Zen 3, the AVX2 gather performance is in better shape and thus with "-march=znver3" under this very latest patch those instructions will now be auto-generated where appropriate.

With this latest Znver3 patch, which was already merged, Hubicka noted:

For TSVC it get used by 5 benchmarks with following runtime improvements:

s4114: 1.424 -> 1.209 (84.9017%)
s4115: 2.021 -> 1.065 (52.6967%)
s4116: 1.549 -> 0.854 (55.1323%)
s4117: 1.386 -> 1.193 (86.075%)
vag: 2.741 -> 1.940 (70.7771%)

there is regression in

s4112: 1.115 -> 1.184 (106.188%)
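For reading those figures: the percentages appear to be the new runtime expressed as a share of the old runtime, so anything below 100% is a speedup and anything above is a regression. A small C sketch of that arithmetic follows. The helper name `runtime_ratio_percent` is my own for illustration and is not something from the patch.

```c
/*
 * Hypothetical helper: new runtime as a percentage of the old runtime.
 * Below 100 means the new code is faster, above 100 means it is slower.
 */
double runtime_ratio_percent(double old_time, double new_time)
{
    return 100.0 * new_time / old_time;
}
```

With the numbers above, `runtime_ratio_percent(1.424, 1.209)` gives about 84.90 for s4114 and `runtime_ratio_percent(1.115, 1.184)` gives about 106.19 for s4112. The 52.7% for s4115 works out to roughly a 1.9x speedup.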

Hubicka also added that Intel's ICC is using gather for some of the tests while LLVM Clang and AOCC are not.

I'll have some fresh AMD Znver3 GCC benchmarks up shortly and have already prepared some benchmarks of AOCC 3.0 on a Ryzen 9 5950X system. It's great seeing all of this AMD Zen 3 compiler work happen, albeit too bad it didn't happen months ago so that it would be ready in a released compiler by the time these desktop and server processors began shipping, as we have come to enjoy from Intel's timely open-source compiler enablement work over the past many years.
