GCC 13 Now Enables 512-bit Vector For AMD Zen 4 Tuning

Written by Michael Larabel in AMD on 7 February 2023 at 08:00 AM EST.

GNU Compiler Collection expert Jan Hubicka at SUSE continues working on last-minute tweaks to GCC 13 to benefit AMD's latest Zen 4 processors.

Back in October AMD contributed basic Zen 4 "znver4" support to GCC 13. Over the past two months or so, SUSE has carried out a lot of tuning work to enable more optimizations than the initial Znver4 support, which followed the same paths and cost tables as Zen 3.

The past few weeks have seen a lot of GCC 13 tuning for Zen 4, and at least one more optimization landed today ahead of the GCC 13.1 stable release due out in the next month or two.

GCC 13 Znver4 AVX-512 vector change

Hubicka's new patch enables 512-bit vectors for Zen 4. Up to now the compiler tuning for Zen 4 preferred 256-bit AVX instructions over 512-bit AVX instructions in the auto-vectorizer. However, further testing has shown that 512-bit vectors are the better choice overall. Hubicka explained in the commit making the one-line code change:

Enable 512 bit vector for zen4

While internally 512 registers are splits into two 256 halves, 512 bit vectors reduces number of instructions to retire and has chance to improve parallelism. There are few tsvc benchmarks that improves significantly:

           runtime
benchmark  256bit  512bit
s2275      48.57   20.67    -58%
s311       32.29   16.06    -50%
s312       32.30   16.07    -50%
vsumr      32.30   16.07    -50%
s314       10.77   5.42     -50%
s313       21.52   10.85    -50%
vdotr      43.05   21.69    -50%
s316       10.80   5.64     -48%
s235       61.72   33.91    -45%
s161       15.91   9.95     -38%
s3251      32.13   20.31    -36%

And there are no benchmarks with off-noise regression.  The basic matrix multiplication loop improves by 32%.  It is also expected that 512 bit vectors are more power efficient (I can't measure that).

The down side is that loops with low trip counts may get slower when the unvectorized prologue and epilogue is hit more often.  With SPECfp this problem happens with x264 (12% regression) and bwaves (6% regression) and this is tracked in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410 and will need more work on vectorizer to support masked epilogues.

After some additional testing it seems that using 512 bit vectors by default is now overall better choice.
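To make the low-trip-count trade-off from the commit message more concrete, here is a minimal sketch in C of how a loop's iterations divide between the full-width vector body and the scalar remainder. The helper name split_loop and its struct are hypothetical and purely illustrative. The model is deliberately simplified: it ignores any alignment prologue and whatever other strategy the compiler actually picks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how `trip_count` iterations split between the
   full-width vector body and the scalar remainder loop. */
struct loop_split {
    size_t vector_iters;  /* iterations executed with full-width vectors */
    size_t scalar_iters;  /* leftover iterations handled one element at a time */
};

struct loop_split split_loop(size_t trip_count, size_t vector_bits,
                             size_t elem_bytes)
{
    /* The vector must hold at least one element. */
    assert(elem_bytes > 0 && vector_bits >= 8 * elem_bytes);

    size_t lanes = vector_bits / (8 * elem_bytes);  /* elements per vector */
    struct loop_split s = { trip_count / lanes, trip_count % lanes };
    return s;
}
```

Under this model, a loop over 12 floats gets one 256-bit vector iteration plus 4 scalar iterations, but with 512-bit vectors (16 floats per vector) the vector body never runs and all 12 iterations fall to the scalar loop. For a loop over 1000 floats the remainder is at most a handful of iterations either way, which is why the regressions are tied to low trip counts.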

It's great seeing this last-minute tuning work for the Znver4 target continue ahead of the GCC 13.1 annual feature release coming up in March or April. (Granted, ideally this work would have been handled before the Zen 4 launch, so that an already-released compiler would offer this tuned support for those wishing to exploit "-march=znver4".) It will be fun to see how GCC 13.1 performance compares to AMD's AOCC 4.0 compiler on Ryzen 7000 series and EPYC 9004 series processors.
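For those wanting to try the two vector widths themselves, a simple loop like the one below is a reasonable candidate for the auto-vectorizer. This is only an illustrative sketch: the saxpy function and file name are my own example rather than anything from the commit or the TSVC suite, and the command lines in the comment assume a GCC build that knows the znver4 target.

```c
#include <stddef.h>

/* Illustrative loop, not taken from the patch or from TSVC.
 *
 * Possible way to compare the two widths (file name is an assumption):
 *   gcc -O3 -march=znver4 -mprefer-vector-width=256 -fopt-info-vec -c saxpy.c
 *   gcc -O3 -march=znver4 -mprefer-vector-width=512 -fopt-info-vec -c saxpy.c
 *
 * -mprefer-vector-width overrides the width the tuning would otherwise pick,
 * and -fopt-info-vec reports which loops were vectorized.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* restrict promises the compiler that x and y do not overlap. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Checking the generated assembly for zmm versus ymm register use is one way to confirm which width the compiler actually chose.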

Over on the upstream LLVM/Clang side, (sadly) there isn't anything new in the review queue. The lone Znver4 commit there remains the initial enablement from December, which copied over the tunings from Zen 3 and flipped on the new instructions.
