The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance
According to an analysis of Ryzen cache by hardware.fr (which Chrome translates well): latency is very poor when a core from one of the two four-core-clusters accesses something in the L3 cache of the other.
LOL. I pulled the highlighted line from the Himeno website. I was trying to see if compiler flags would have any impact on the results (it doesn't seem so).
It appears to be more of an optimization AMD has to do in Ryzen's cache.
Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.
This may be so, but frankly it is typical of any CPU design. Effectively all CPUs are works in progress. However, in this case I really think AMD accomplished most of its goals. We are getting very good performance out of the box on most existing software. That is a good thing for most of us, and frankly there is nothing to be disappointed about here.
What will be interesting is the impact new compilers, and possibly even new Linux versions, have on performance. Michael is already investigating this somewhat, but I'm really interested in what, if anything, compilers can do for Ryzen a year from now, when support should be firmed up.
I don't know about the Zen architecture, but with the Bulldozer architecture -mvzeroupper is not necessary. It's only Intel (and maybe Zen now as well) that suffers from the false dependency on the upper half of the AVX registers when it's dirty.
That is one nonsensical line. I would always enable -finline-functions first. The rest mainly makes sense together with profile-guided optimization, so after you have generated a profile, you can use that profile with -funroll-loops etc. (In fact, I believe that is the default on the second run of a profile-guided build.)
I wish more build systems supported profile-generating and profile-using builds, or could do both in sequence: build an instrumented binary, run a bunch of tests and benchmarks, then recompile with the generated profile.
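For anyone who wants to try that by hand, here is a minimal sketch of the two-pass sequence, written as a small Python helper that only assembles the command lines (it does not run them). The file names bench.c and bench and the helper name pgo_commands are made up for illustration; -fprofile-generate, -fprofile-use and -march=znver1 are the real GCC options, but whether this layout fits a given project is an assumption.

```python
import shlex


def pgo_commands(source, output, march="znver1", workload_args=()):
    """Return the three command lines of a two-pass GCC PGO build.

    Hypothetical helper: it only builds argument lists, nothing is executed.
    """
    base = ["gcc", "-O2", f"-march={march}", source, "-o", output]
    # Pass 1: instrumented build; running the result writes profile data.
    generate = base + ["-fprofile-generate"]
    # Training run: the instrumented binary on a representative workload.
    train = [f"./{output}", *workload_args]
    # Pass 2: rebuild the same source using the collected profile.
    use = base + ["-fprofile-use"]
    return [generate, train, use]


if __name__ == "__main__":
    # Print the commands so they can be pasted into a shell or a Makefile.
    for cmd in pgo_commands("bench.c", "bench"):
        print(shlex.join(cmd))
```

The training run matters most: a profile gathered on an unrepresentative workload can steer the second pass in the wrong direction.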
Aggressive inlining will make generated code larger and might do harm to caching / branch predicting?
I believe you are talking to someone who knows of that possibility, but has found from experience that it is more likely to be an improvement. That's why he says "first".
In fact, I'm going by my experience compiling code for a router, so that “negative optimization” might not hold for a mainstream desktop processor.
According to an analysis of Ryzen cache by hardware.fr (which Chrome translates well): latency is very poor when a core from one of the two four-core-clusters accesses something in the L3 cache of the other.
I suspected that would be an issue. So it's basically a NUMA issue, and the OS schedulers aren't yet aware how to best place threads on Zen.
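Until schedulers catch up, one workaround is to pin a process to a single four-core cluster yourself. Below is a rough Python sketch using os.sched_setaffinity (Linux only). The helper names are made up, and the assumption that logical CPUs 0-3 form the first cluster and 4-7 the second is just that, an assumption: with SMT enabled the numbering can differ, so check lscpu or /sys/devices/system/cpu/cpu*/topology first.

```python
import os


def ccx_cpus(ccx_index, cores_per_ccx=4):
    """Logical CPU numbers of one cluster, ASSUMING contiguous numbering."""
    start = ccx_index * cores_per_ccx
    return set(range(start, start + cores_per_ccx))


def pin_to_ccx(ccx_index, cores_per_ccx=4):
    """Restrict the calling process to one cluster.

    Returns the CPU set that was applied, or an empty set if nothing was
    changed (non-Linux platform, or none of the wanted CPUs are available).
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    # Only keep CPUs this process is actually allowed to run on.
    wanted = ccx_cpus(ccx_index, cores_per_ccx) & os.sched_getaffinity(0)
    if wanted:
        os.sched_setaffinity(0, wanted)
    return wanted
```

Calling pin_to_ccx(0) before spawning worker threads keeps them all on one cluster, which avoids the slow cross-cluster L3 path at the cost of using only half the cores.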
The page I'd linked to explains that they used software written by the authors of AIDA64, which will be integrated into a future version of that program. It also said that AMD had told them that bandwidth between the two clusters was 22GB/s, compared with at least 175GB/s within each.
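To put those two figures side by side, a quick back-of-the-envelope calculation (the numbers are the ones quoted above; the variable names are mine):

```python
# Bandwidth figures quoted from the hardware.fr article, in GB/s.
inter_cluster_gbps = 22   # between the two four-core clusters
intra_cluster_gbps = 175  # lower bound within a single cluster

# How many times more bandwidth is available inside a cluster than across.
ratio = intra_cluster_gbps / inter_cluster_gbps
print(f"Within a cluster: at least {ratio:.1f}x the cross-cluster bandwidth")
```

So the cross-cluster link offers roughly one eighth of the bandwidth available inside a cluster, and since 175GB/s is a lower bound, the real gap may be wider.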
Zen is also dual channel if I recall, whereas Skylake (not sure which was first) is quad-channel. This means Zen is more affected by memory bandwidth. That's maybe the second most disappointing thing about Zen after sticking with AVX128. I'm still likely to build a Zen system. It will be the first desktop I have built in years.
Why do you need more than two channels in a single socket, non-rdimm system?