AMD Zen 4 Cost Table & Tuning Patches Posted For The GCC Compiler

Written by Michael Larabel in AMD on 6 December 2022 at 08:17 AM EST.

Back in October, AMD sent out their initial Zen 4 "znver4" enablement for the GCC compiler. That initial Zen 4 support has since been merged for GCC 13, but it carried over the cost tables from Zen 3 and did little in the way of tuning, mostly just flipping on the new instructions supported by the Ryzen 7000 series and EPYC 9004 series processors. Today there are finally some juicy tuning patches being sent out for GCC.

Sent out today was an initial set of tuning patches for the Zen 4 support with GCC, including a cost table to reflect the instruction costs on Zen 4 processors rather than just carrying over the old costs from Zen 3.

While AMD engineers sent out the initial Znver4 patches in October, this GNU Compiler Collection tuning work is being carried out by SUSE engineers. SUSE engineers have taken up some of the Zen enablement and tuning work in the past, and that is again the case now. AMD and SUSE are longtime partners, from working on the GCC compiler, to the RadeonHD driver days when they were starting AMD's open-source driver effort, to various other open-source collaborations over the years.

Longtime GCC compiler developer Jan Hubicka of SUSE sent out part 1 with the cost table updates for Zen 4. Jan commented on that work:

this patch updates cost of znver4 mostly based on data measured by Agner Fog. Compared to previous generations x87 became bit slower which is probably not big deal (and we have minimal benchmarking coverage for it). One interesting improvement is reduction of FMA cost. I also updated costs of AVX256 loads/stores based on latencies (not throughput which is twice of avx256). Overall AVX512 vectorization seems to improve noticeably some of TSVC benchmarks but since internally 512 vectors are split to 256 vectors it is somewhat risky and does not win in SPEC scores (mostly by regressing benchmarks with loop that have small trip count like x264 and exchange), so for now I am going to set AVX256_OPTIMAL tune but I am still playing with it. We improved since ZNVER1 on choosing vectorization size and also have vectorized prologues/epilogues so it may be possible to make avx512 small win overall.

In general I would like to keep cost tables latency based unless we have a good reason to not do so. There are some interesting differences in znver3 tables that I also patched and seems performance neutral. I will send that separately.

Bootstrapped/regtested x86_64-linux, also benchmarked on SPEC2017 along with AVX512 tuning. I plan to commit it tomorrow unless there are some comments.

Jan followed up a short time after that with the part two patches for new tuning flags:

this patch adds tunes needed for zen4 microarchitecture. I added two new knobs. TARGET_AVX512_SPLIT_REGS which is used to specify that internally 512 vectors are split to 256 vectors. This affects vectorization costs and reassociation width. It probably should also affect RTX costs however I doubt it is very useful since RTL optimizers are usually not judging between 256 and 512 vectors.

I also added X86_TUNE_AVOID_256FMA_CHAINS. Since fma has improved in zen4 this flag may not be a win except for very specific benchmarks. I am still doing some more detailed testing here.
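For readers unfamiliar with the term: going by the flag's name, an "FMA chain" is a run of multiply-add operations where each one needs the result of the one before it. The snippet below is only a rough illustration of that idea and is not taken from the patch; the function names are made up for this example. The first loop feeds every multiply-add through a single accumulator, while the second splits the work over two accumulators so two shorter chains can overlap. Whether GCC actually emits FMA instructions for either loop depends on the compiler version and options.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every multiply-add needs the previous value of acc,
   so the operations form a single long dependency chain. */
float dot_chained(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Two accumulators: two independent, shorter chains.
   The summation order changes, so for general floating-point inputs
   the result may differ from dot_chained() in the last bits. */
float dot_two_chains(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (i < n)              /* odd element left over */
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```

How much the second form helps comes down to the latency of the multiply-add on a given CPU, which fits Jan's remark that the flag may matter less now that FMA has improved on Zen 4.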

Otherwise I disabled gathers on zen4 for 2 parts and 4 parts. We can open code them and since the latencies has only increased since zen3 opencoding is better than actual instruction. This shows at 4 tsvc benchmarks.
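As a quick aside on what "open coding" a gather means: instead of one hardware gather instruction that fetches several indexed elements at once, the compiler emits ordinary individual loads. A minimal C sketch of the four-element case follows; the function name is hypothetical and this is source-level illustration only, not what the patch itself contains.

```c
#include <assert.h>
#include <stddef.h>

/* Open-coded 4-element gather: four ordinary scalar loads through an
   index array, rather than a single hardware gather instruction. */
void gather4_open_coded(float out[4], const float *base, const int idx[4])
{
    assert(out != NULL && base != NULL && idx != NULL);
    for (size_t lane = 0; lane < 4; lane++)
        out[lane] = base[idx[lane]];
}
```

With only two or four elements to fetch, a handful of plain loads is cheap, which is presumably why the short gathers are the ones being turned off.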

I ended up setting AVX256_OPTIMAL. This is a compromise. There are some tsvc benchmarks that increase noticeably (up to 250%) however there are also few regressions. Most of these can be solved by increasing vec_perm cost in the vectorizer. However this does not cure about 14% regression on x264 that is quite important. Here we produce vectorized loops for avx512 that probably would be faster if the loops in question had high enough iteration count. We hit this problem with avx256 too: since the loop iterates few times, only prologues/epilogues are used. Adding another round of prologue/epilogue code does not make it better.
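The small trip count problem is easy to see with some back-of-the-envelope arithmetic. The sketch below is a deliberately simplified model, not how GCC's vectorizer actually computes anything: it just divides a loop's iteration count by the number of lanes in a vector. The struct and function names are invented for this example, and alignment peeling and vectorized epilogues are ignored.

```c
#include <assert.h>

/* Simplified model: split `trip_count` loop iterations into full vector
   iterations and leftover iterations for a vector with `lanes` elements. */
struct loop_split {
    unsigned vector_iters;  /* iterations of the main vector loop */
    unsigned leftover;      /* iterations left for prologue/epilogue code */
};

struct loop_split split_loop(unsigned trip_count, unsigned lanes)
{
    assert(lanes > 0);
    struct loop_split s = { trip_count / lanes, trip_count % lanes };
    return s;
}
```

For 32-bit elements, a 512-bit vector has 16 lanes and a 256-bit vector has 8. Under this model a loop that runs 12 times gets `split_loop(12, 16)`, which is zero main vector iterations with all 12 left over, while `split_loop(12, 8)` manages one vector iteration with 4 left over. That is the situation Jan describes, where only the prologue/epilogue code ends up running.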

Finally I enabled avx stores for constant sized memcpy and memset. I am not sure why this is an opt-in feature. I think for most hardware this is a win.
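To make "constant sized memcpy and memset" concrete, here is the kind of code in question. When the size is known at compile time, a compiler can expand the call inline into a few wide moves instead of calling into the C library. The struct and function names below are made up for illustration, and whether a given GCC build really uses AVX stores here depends on its version, the optimization level, and the tuning in effect.

```c
#include <assert.h>
#include <string.h>

/* A fixed-size 64-byte object, so the sizes below are compile-time constants. */
struct packet {
    unsigned char bytes[64];
};

/* Constant-sized memcpy: a candidate for inline expansion into vector moves. */
void copy_packet(struct packet *dst, const struct packet *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}

/* Constant-sized memset: likewise a candidate for inline vector stores. */
void clear_packet(struct packet *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}
```

Comparing the assembly from something like `gcc -O2 -march=znver4 -S` before and after the patch would be one way to see what the tuning change does in practice.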

Jan is looking to get these Znver4 tuning patches merged soon for GCC 13. The GCC 13.1 stable release should be out in March or April with this initial Zen 4 support. That is the same version where Intel, with their more timely compiler enablement patches, is introducing support for Grand Ridge, Granite Rapids, Meteor Lake, and other new CPUs and instruction set extensions well ahead of the CPU launches rather than after the fact.

For those wanting optimized compiler support right now for Ryzen 7000 series and EPYC 9004 "Genoa" series processors, AMD's AOCC 4.0 compiler is available in binary form with their optimized Zen 4 compiler support.
