Michael! Everyone knows how to configure GCC except you, how dare you do things differently than they do?
The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance
-
Originally posted by qsmcomp View Post
Aggressive inlining will make generated code larger and might do harm to caching / branch predicting?
If you are generating code for very small embedded devices, you might want to start from the -Os optimization level and add the options from -O3 that improve performance for your code.
Last edited by carewolf; 04 March 2017, 06:28 AM.
-
Originally posted by qsmcomp View Post
It seems that AMD's not supporting RDRND?
-
Originally posted by zboson View Post
AMD has supported RdRand since Excavator. Zen is the first AMD arch to support RdSeed. You can see the new Zen instructions here
http://www.anandtech.com/show/11170/...0x-and-1700/10
Zen has two Zen-exclusive instructions (which other x86 CPUs do not have), but it dropped FMA4 and XOP.
You can also see what is supported here
http://openbenchmarking.org/system/1...01800X/cpuinfo
I wonder if Zen's instruction set has everything Kaby Lake has?
- Likes 1
-
Originally posted by liam View Post
Why do you need more than two channels in a single socket, non-rdimm system?
-
Originally posted by name99 View Post
Because you have eight cores/sixteen threads available and there will be situations where enough of those cores are doing enough memory intensive operations that bandwidth becomes the bottleneck...
To put it differently, you're giving this eight-core SoC basically the same bandwidth as an iPad. (Not exactly comparable: I assume Ryzen has two independent memory controller queues and can sustain more open pages than the iPad, which has essentially one controller that runs 128 bits wide rather than 64 bits wide, but it's basically the same order of magnitude.)
This is the constant, ongoing cheapness of the x86 world: both Intel and AMD are so obsessed with product segmentation that they cripple the memory controllers of their commodity CPUs so that server revenue is not compromised. That might seem like a great plan, except all it's going to do is push everyone with a bandwidth-intensive but compute-light workload to buy an ARM server in a year or three...
If you've come across a review that shows this to be an issue, I'd definitely read it.
-
I wonder if the FLAC regression is due to some kind of interaction with the run-time SIMD detection?
FLAC assumes recent gcc versions always have certain ISA extensions available and builds them all in. It's a really common misconception, I think born of the fact that most distribution toolchains are generic x86-64, and the generic build does always make them available. Since I always use a target-specific toolchain (including AVX math functions, which aren't the default even on supported CPUs!), I always patch FLAC, amongst a few other projects like Boost and Chromium, to only include support for enabled extensions and to disable the run-time detection.
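To make the distinction concrete, here is a rough C sketch of the two kinds of detection. It is not FLAC's actual code; the function names are made up for illustration. The compile-time check relies on the `__AVX2__` macro that gcc defines when AVX2 is enabled for the build, and the run-time check uses the GCC/Clang builtin `__builtin_cpu_supports`, guarded so the file still builds on non-x86 targets.

```c
#include <stdbool.h>

/* Hypothetical helper: was AVX2 enabled for this build
   (e.g. by -mavx2 or -march=znver1)? Decided by the preprocessor. */
bool avx2_enabled_at_build(void)
{
#ifdef __AVX2__
    return true;
#else
    return false;
#endif
}

/* Hypothetical helper: does the CPU we are running on report AVX2?
   Decided when the program runs. */
bool avx2_present_at_runtime(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false; /* no x86 feature detection available on this target */
#endif
}

/* Target-specific build policy: take the AVX2 path only if the build
   enabled it, and never consult the run-time check. */
const char *pick_path(void)
{
    return avx2_enabled_at_build() ? "avx2" : "generic";
}
```

With a generic x86-64 toolchain `pick_path()` returns "generic" even on a CPU where `avx2_present_at_runtime()` is true, which is the gap run-time dispatch is meant to close.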
- Likes 1
-
I believe this needs to be retested with GCC 7. GCC 6 is not optimized for Zen, according to the Gentoo wiki: https://wiki.gentoo.org/wiki/Ryzen#GCC_6.3.2B
EDIT: Wrong thread.
-
Originally posted by mlau View Post
I found that the best performance with Zen can be had when building with "-march=znver1 -mtune=broadwell -mprefer-avx128". The performance increase over -mtune=znver1 in e.g. scimark is as high as 20% in some instances.
-mtune=znver1 enables `-mprefer-avx128` in gcc 6.3 and 7.1 (I checked on Godbolt with -fverbose-asm). Zen is designed to handle 256b vectors with its very high uop throughput. IIRC, its front-end can issue 6 uops per clock, but if they're all from single-uop instructions it can only issue 5 uops per clock. So it has better uop throughput when running 256b AVX code (which decodes to 2 uops per instruction).
Auto-vectorization with wider vectors can have more overhead, so -mprefer-avx128 may be a good default for Ryzen even if it loses on Scimark. But if it loses in general, you should file a missed-optimization bug on gcc's bugzilla and let them know they might want to tweak their tuning options.
Also, I see that Zen tuning includes -mavx256-split-unaligned-store, but not -mavx256-split-unaligned-load. Haswell and later tuning disables both. -mtune=generic enables both, even though it's only good on Sandybridge/IvyBridge, and on AMD Bulldozer-family. The worst part is that it affects integer AVX2 loads/stores since gcc7.1. No AVX2-supporting CPU benefits from splitting loads (except maybe Excavator?), but `-mavx2` doesn't affect tuning options. So `-mavx2` without `-mtune=haswell` or something is very likely to be sub-optimal, unless the compiler can prove that arrays are always aligned. (I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 about this issue)
If gcc doesn't know for sure that an array is 32B-aligned, those options make it emit
vmovdqu [rdi], xmm0 / vextracti128 [rdi+16], ymm0, 1
instead of
vmovdqu [rdi], ymm0
for `_mm256_storeu_si256(dst, v)`, or for auto-vectorized code.
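If you want to reproduce that, something like the following should do as a test case (a sketch only; `copy32_unaligned` is a made-up name). Build it with `gcc -O3 -mavx2 -S`, then again with `-mtune=haswell` added, and compare the store in the asm output. The `memcpy` fallback is only there so the file still compiles when AVX is not enabled.

```c
#include <stdint.h>
#include <string.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper: copy 32 bytes from src to dst.
   Neither pointer is known to be 32B-aligned, so the compiler has to
   treat both the load and the store as potentially unaligned. */
void copy32_unaligned(uint8_t *dst, const uint8_t *src)
{
#ifdef __AVX__
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, v);
#else
    memcpy(dst, src, 32); /* portable fallback when built without AVX */
#endif
}
```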
This is really bad if your data is actually 32B-aligned most/all of the time, but the compiler doesn't figure that out. In that case it's a pure downside even on Sandybridge.
This splitting of loads and stores is usually a win on Sandybridge for 16B-aligned data that's not 32B-aligned, but it loses on Haswell.
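One way to avoid the split when you know your data really is 32B-aligned is to tell the compiler so. Here is a minimal sketch, assuming GCC or Clang (for `__builtin_assume_aligned`) and C11 (for `aligned_alloc`); the helper names are invented for this example. With the alignment promised, the compiler should have no reason to apply the split-unaligned tuning to this loop, but check the asm for your own compiler version.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: add 1 to every element of a buffer that the
   caller promises is 32-byte aligned. */
void add_one_aligned32(int32_t *dst, size_t n)
{
    /* Promise 32B alignment to the optimizer.
       Undefined behaviour if dst is not actually 32B-aligned. */
    int32_t *p = __builtin_assume_aligned(dst, 32);
    for (size_t i = 0; i < n; i++)
        p[i] += 1;
}

/* Hypothetical helper: allocate n zeroed int32_t values, 32B-aligned.
   The caller releases the buffer with free(). */
int32_t *alloc_aligned32(size_t n)
{
    /* aligned_alloc wants a size that is a multiple of the alignment,
       so round the byte count up to the next multiple of 32. */
    size_t bytes = (n * sizeof(int32_t) + 31) & ~(size_t)31;
    if (bytes == 0)
        bytes = 32;
    int32_t *p = aligned_alloc(32, bytes);
    if (p)
        memset(p, 0, bytes);
    return p;
}
```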