
The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance


  • Peter_Cordes
    replied
    I found this test weird. It's not just *tuning* options that are being changed, it's which set of instruction-set extensions are enabled. That's interesting to test too (e.g. whether Zen benefits a lot or a little from letting the compiler auto-vectorize with AVX2, use BMI2's more efficient shift instructions, and stuff like that).

    -O3 -march=znver1 -mtune=generic would enable all of Zen's instruction-sets, but *tune* the same as plain -O3.

    -march=k8-sse3 is a really weird choice. K10 (-mtune=amdfam10 or -mtune=barcelona) would seem to be more sensible. -mtune=bdver4 (Excavator) would also be a good choice to compare against, since it's the next-most-recent AMD CPU.

    Anyway, it's hard to know which effects are from different tuning and which are from instruction-sets.



  • Peter_Cordes
    replied
    Originally posted by mlau View Post
    I found that the best performance with zen can be had when building with "-march=znver1 -mtune=broadwell -mprefer-avx128". The performance increase over -mtune=znver1 in e.g. scimark is as high as 20% in some instances.
    In cases where tuning for broadwell helps, it would be interesting to try `-march=znver1 -mno-prefer-avx128`. (And maybe `-mno-avx256-split-unaligned-store`).

    -mtune=znver1 enables `-mprefer-avx128` in gcc6.3 and 7.1 (I checked on Godbolt with -fverbose-asm). Zen is designed to handle 256b vectors with its very high uop throughput. IIRC, its front-end can issue 6 uops per clock, but if they're all from single-uop instructions it can issue only 5 uops per clock. So it has better uop throughput when running 256b AVX code (which decodes to 2 uops per instruction).

    Auto-vectorization with wider vectors can have more overhead, so -mprefer-avx128 may be a good default for Ryzen even if it loses on Scimark. But if it loses in general, you should file a missed-optimization bug on gcc's bugzilla and let them know they might want to tweak their tuning options.

    Also, I see that Zen tuning includes -mavx256-split-unaligned-store, but not -mavx256-split-unaligned-load. Haswell and later tuning disables both. -mtune=generic enables both, even though it's only good on Sandybridge/IvyBridge, and on AMD Bulldozer-family. The worst part is that it affects integer AVX2 loads/stores since gcc7.1. No AVX2-supporting CPU benefits from splitting loads (except maybe Excavator?), but `-mavx2` doesn't affect tuning options. So `-mavx2` without `-mtune=haswell` or something is very likely to be sub-optimal, unless the compiler can prove that arrays are always aligned. (I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 about this issue)

    If gcc doesn't know for sure that an array is 32B-aligned, those options make it emit

        vmovdqu      [rdi], xmm0
        vextracti128 [rdi+16], ymm0, 1

    instead of

        vmovdqu      [rdi], ymm0

    for `_mm256_storeu_si256(dst, v)`, or for auto-vectorized code.

    This is really bad if your data is actually 32B-aligned most/all of the time, but the compiler doesn't figure that out. In that case it's a pure downside even on Sandybridge.

    This splitting of loads and stores is usually a win on Sandybridge for 16B-aligned data that's not 32B-aligned, but it loses on Haswell.



  • Sin2x
    replied
    I believe this needs to be retested with GCC 7. GCC 6 is not optimized for Zen, according to the Gentoo wiki: https://wiki.gentoo.org/wiki/Ryzen#GCC_6.3.2B

    EDIT: Wrong thread.



  • mlau
    replied
    I found that the best performance with zen can be had when building with "-march=znver1 -mtune=broadwell -mprefer-avx128". The performance increase over -mtune=znver1 in e.g. scimark is as high as 20% in some instances.



  • s_j_newbury
    replied
    I wonder if the FLAC regression is due to some kind of interaction with the run-time SIMD detection?

    FLAC assumes recent gcc versions always have certain ISA extensions available and builds them all in. I think that's a really common misconception, borne out of the fact that most distribution toolchains are generic x86-64, and the generic build does always make them available. Since I always use a target-specific toolchain (including AVX math functions, which isn't the default even on supported CPUs!), I always patch FLAC, amongst a few other projects like Boost and Chromium, to only include support for enabled extensions and to disable the run-time detection.



  • liam
    replied
    Originally posted by name99 View Post

    Because you have eight cores/sixteen threads available and there will be situations where enough of those cores are doing enough memory intensive operations that bandwidth becomes the bottleneck...
    To put it differently, you're giving this eight core SoC basically the same bandwidth as an iPad... (Not exactly comparable --- I assume Ryzen has two independent memory controller queues and can sustain more open pages than iPad, which has essentially one controller that is run at 128-bit wide rather than 64 bits wide, but basically same order of magnitude.)

    This is the constant on-going cheapness of the x86 world --- both Intel and AMD are so obsessed with product segmentation that they cripple their commodity CPUs in terms of the memory controllers, so that the server revenue is not compromised. That might seem like a great plan except all it's going to do is push everyone with a bandwidth-intensive but compute light workload to buy an ARM server in a year or three...
    My question was rhetorical. From the reviews I've read thus far, BANDWIDTH isn't an issue. Having that big victim cache certainly helps matters.
    If you've come across a review that shows this to be an issue, I'd definitely read it.



  • name99
    replied
    Originally posted by liam View Post


    Why do you need more than two channels in a single socket, non-rdimm system?
    Because you have eight cores/sixteen threads available and there will be situations where enough of those cores are doing enough memory intensive operations that bandwidth becomes the bottleneck...
    To put it differently, you're giving this eight core SoC basically the same bandwidth as an iPad... (Not exactly comparable --- I assume Ryzen has two independent memory controller queues and can sustain more open pages than iPad, which has essentially one controller that is run at 128-bit wide rather than 64 bits wide, but basically same order of magnitude.)

    This is the constant on-going cheapness of the x86 world --- both Intel and AMD are so obsessed with product segmentation that they cripple their commodity CPUs in terms of the memory controllers, so that the server revenue is not compromised. That might seem like a great plan except all it's going to do is push everyone with a bandwidth-intensive but compute light workload to buy an ARM server in a year or three...



  • carewolf
    replied
    Originally posted by zboson View Post

    AMD has supported RdRand since Excavator. Zen is the first AMD arch to support RdSeed. You can see the new Zen instructions here
    http://www.anandtech.com/show/11170/...0x-and-1700/10

    Zen has two exclusive instructions (that the rest of x86 does not have), but it dropped FMA4 and XOP.

    You can also see what is supported here
    http://openbenchmarking.org/system/1...01800X/cpuinfo

    I wonder if Zen's instruction set has everything Kaby Lake has?
    Yes, except the Intel-specific stuff, of course. There have always been a few things AMD and Intel do differently — performance counters, for instance — and it appears memory encryption and buffer-overflow protection extensions are the new ones. Ryzen in particular doesn't have MPX (buffer-overflow guards), TSX (transactional memory) or SGX (secure enclaves), but AMD already has alternatives for the last two. It remains to be seen whether MPX becomes generally used, and whether AMD will adopt it or make their own version.



  • atomsymbol
    replied
    Originally posted by phoronix View Post
    Phoronix: The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance

    The latest in our AMD Ryzen Linux benchmarking is looking at the impact of compiled binaries when making use of Zen "znver1" compiler optimizations with the GNU Compiler Collection (GCC) compared to other optimization levels like Bulldozer and K8-SSE3.

    http://www.phoronix.com/vr.php?view=24234
    On my A10-7850K machine, the GCC 6.3.0 compiler generates slower code with -march=bdver3 than with -mavx (the latter is without any -march).

    It would be useful to know whether this GCC code-generation shortfall is an issue on Ryzen as well.

    Is there an openbenchmarking.org result comparing "-O3 -mavx" with "-O3 -march=znver1" on Ryzen?



  • zboson
    replied
    Originally posted by qsmcomp View Post

    It seems that AMD's not supporting RDRND?
    AMD has supported RdRand since Excavator. Zen is the first AMD arch to support RdSeed. You can see the new Zen instructions here
    http://www.anandtech.com/show/11170/...0x-and-1700/10

    Zen has two exclusive instructions (that the rest of x86 does not have), but it dropped FMA4 and XOP.

    You can also see what is supported here
    http://openbenchmarking.org/system/1...01800X/cpuinfo

    I wonder if Zen's instruction set has everything Kaby Lake has?

