GCC 13 Now Enables 512-bit Vector For AMD Zen 4 Tuning
Back in October, AMD contributed basic Zen 4 "znver4" support to GCC 13. Over the past two months or so, SUSE has carried out a lot of tuning work to enable more optimizations than the initial Znver4 support, which followed the same paths and cost tables as Zen 3.
That Zen 4 tuning has continued over the past few weeks, and at least one more optimization landed today ahead of the GCC 13.1 stable release due out in the next month or two.
Hubicka's new patch enables 512-bit vectors for Zen 4. Until now, the compiler tuning for Zen 4 preferred 256-bit AVX instructions over 512-bit AVX-512 instructions in the auto-vectorizer. However, further testing has shown that 512-bit vectors are the better default overall. Hubicka explained in the commit message accompanying the one-line code change:
Enable 512 bit vector for zen4
While internally 512 registers are split into two 256 halves, 512 bit vectors reduce the number of instructions to retire and have a chance to improve parallelism. There are a few tsvc benchmarks that improve significantly:
benchmark   256bit   512bit   change
s2275        48.57    20.67     -58%
s311         32.29    16.06     -50%
s312         32.30    16.07     -50%
vsumr        32.30    16.07     -50%
s314         10.77     5.42     -50%
s313         21.52    10.85     -50%
vdotr        43.05    21.69     -50%
s316         10.80     5.64     -48%
s235         61.72    33.91     -45%
s161         15.91     9.95     -38%
s3251        32.13    20.31     -36%
And there are no benchmarks with off-noise regression. The basic matrix multiplication loop improves by 32%. It is also expected that 512 bit vectors are more power efficient (I can't measure that).
The down side is that loops with low trip counts may get slower when the unvectorized prologue and epilogue is hit more often. With SPECfp this problem happens with x264 (12% regression) and bwaves (6% regression) and this is tracked in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410 and will need more work on vectorizer to support masked epilogues.
After some additional testing it seems that using 512 bit vectors by default is now overall better choice.
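To make the low-trip-count downside concrete, here is a rough sketch of how a loop's iterations divide between the vector body and the leftover scalar iterations. It is a simplified model and not GCC's actual logic: it assumes 32-bit elements (8 lanes per 256-bit vector, 16 lanes per 512-bit vector), ignores any alignment prologue, and ignores that the compiler may vectorize the epilogue with narrower vectors. The names `split_trip_count` and `struct split` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* How a loop of n iterations divides between full vector iterations
   and leftover scalar iterations. Simplified model, see text above. */
struct split {
    size_t vector_iters;  /* iterations of the vectorized loop body */
    size_t scalar_iters;  /* elements left for the scalar epilogue */
};

struct split split_trip_count(size_t n, size_t lanes)
{
    assert(lanes > 0);
    struct split s = { n / lanes, n % lanes };
    return s;
}
```

Under these assumptions, a loop of 15 float elements runs one 8-lane vector iteration plus 7 scalar iterations with 256-bit vectors, but never enters the vector body with 512-bit vectors and handles all 15 elements in scalar code. Masked epilogues, as mentioned in the commit message, would be one way to cover such leftovers with vector instructions.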
It's great to see this last-minute tuning work for the Znver4 target continue ahead of the annual GCC 13.1 feature release coming up in March or April. (Granted, ideally this work would have been handled before the Zen 4 launch, so that an already-released compiler would offer this tuned support to those wishing to use "-march=znver4".) It will be fun to see how GCC 13.1 performance compares to AMD's AOCC 4.0 compiler on Ryzen 7000 series and EPYC 9004 series processors.
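For those who want to check what the auto-vectorizer picks on their own setup, a small loop like the sketch below can be compiled to assembly and inspected. The `saxpy` function is an illustrative stand-in and is not taken from TSVC or from the GCC patch.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]: a simple loop the auto-vectorizer can
   handle. "restrict" promises the compiler the arrays do not overlap. */
void saxpy(float alpha, const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Assuming a GCC build with Znver4 support, compiling with "gcc -O3 -march=znver4 -S saxpy.c" and then again with the existing x86 option "-mprefer-vector-width=256" added allows comparing the two outputs: 512-bit code uses zmm registers, while 256-bit code uses ymm registers. Which width is actually chosen depends on the compiler version and its cost model.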
Over on the upstream LLVM/Clang side, there (sadly) isn't anything new in the review queue: the lone Znver4 commit there is the initial enablement from December, which copied over the tunings from Zen 3 and flipped on the new instructions.