
The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance


  • #51
    Originally posted by mlau
    I found that the best performance with zen can be had when building with "-march=znver1 -mtune=broadwell -mprefer-avx128". The performance increase over -mtune=znver1 in e.g. scimark is as high as 20% in some instances.
    In cases where tuning for broadwell helps, it would be interesting to try `-march=znver1 -mno-prefer-avx128`. (And maybe `-mno-avx256-split-unaligned-store`).

    -mtune=znver1 enables `-mprefer-avx128` in gcc 6.3 and 7.1 (I checked on Godbolt with -fverbose-asm). Zen is designed to handle 256b vectors with its very high uop throughput. IIRC, its front-end can issue 6 uops per clock, but only 5 uops per clock if they all come from single-uop instructions. So it has better uop throughput when running 256b AVX code (where each instruction decodes to 2 uops).

    Auto-vectorization with wider vectors can have more overhead, so -mprefer-avx128 may be a good default for Ryzen even if it loses on Scimark. But if it loses in general, you should file a missed-optimization bug on gcc's bugzilla and let them know they might want to tweak their tuning options.

    Also, I see that Zen tuning includes -mavx256-split-unaligned-store, but not -mavx256-split-unaligned-load. Haswell and later tuning disables both. -mtune=generic enables both, even though that's only good on Sandybridge/IvyBridge and on AMD Bulldozer-family CPUs. The worst part is that it affects integer AVX2 loads/stores since gcc 7.1. No AVX2-supporting CPU benefits from splitting loads (except maybe Excavator?), but `-mavx2` doesn't affect tuning options. So `-mavx2` without `-mtune=haswell` (or similar) is very likely to be sub-optimal, unless the compiler can prove that arrays are always aligned. (I filed a gcc bug about this issue.)

    If gcc doesn't know for sure that an array is 32B-aligned, those options make it emit

    vmovdqu [rdi], xmm0
    vextracti128 [rdi+16], ymm0, 1

    instead of

    vmovdqu [rdi], ymm0

    for `_mm256_storeu_si256(dst, v)`, or for auto-vectorized code.

    This is really bad if your data is actually 32B-aligned most/all of the time, but the compiler doesn't figure that out. In that case it's a pure downside even on Sandybridge.

    This splitting of loads and stores is usually a win on Sandybridge for 16B-aligned data that's not 32B-aligned, but it loses on Haswell.


    • #52
      I found this test weird. It's not just *tuning* options that are being changed, it's which set of instruction-set extensions are enabled. That's interesting to test too (e.g. whether Zen benefits a lot or a little from letting the compiler auto-vectorize with AVX2, use BMI2's more efficient shift instructions, and stuff like that).

      -O3 -march=znver1 -mtune=generic would enable all of Zen's instruction-sets, but *tune* the same as plain -O3.

      -march=k8-sse3 is a really weird choice. K10 (-mtune=amdfam10 or -mtune=barcelona) would seem to be more sensible. -mtune=bdver4 (Excavator) would also be a good choice to compare against, since it's the next-most-recent AMD CPU.

      Anyway, it's hard to know which effects are from different tuning and which are from instruction-sets.