The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance

zboson replied

03 March 2017, 02:31 PM
Originally posted by AsuMagic

Interesting. So Zen will be more competitive with Intel on tasks that are less memory-intensive and more cache efficient? Which is why it seems to perform poorly on some games?

Zen is also dual-channel, if I recall, whereas Skylake (not sure which was first) is quad-channel. This means Zen is more affected by memory bandwidth. That's maybe the second most disappointing thing about Zen, after sticking with AVX128. I'm still likely to build a Zen system. It will be the first desktop I have built in years.

Last edited by zboson; 03 March 2017, 02:43 PM.
AsuMagic replied

03 March 2017, 02:18 PM
Originally posted by carewolf

I wish more build systems supported profile-generating and profile-using builds, or could do both in sequence: first make an instrumented build, then run a bunch of tests and benchmarks, and then compile again with the generated profile.

+1. CMake is a pain in the ass to configure in that regard.
carewolf replied

03 March 2017, 02:05 PM
Originally posted by qsmcomp

with
-O3 -fno-inline-functions -funroll-loops -fpeel-loops -ftracer
the results might be more 'tricky'.

That is one nonsensical line. I would always enable -finline-functions first. The rest mainly makes sense together with profile-guided optimization, so after you have generated a profile, you can use that profile with -funroll-loops etc. (In fact, I believe those are enabled by default on the second, profile-using run of a profile-guided build.)

I wish more build systems supported profile-generating and profile-using builds, or could do both in sequence: first make an instrumented build, then run a bunch of tests and benchmarks, and then compile again with the generated profile.
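
For anyone who has not done this by hand, here is a minimal sketch of the two-pass build using GCC's -fprofile-generate and -fprofile-use options. The file names pgo_demo.c and driver.c and the function count_above are hypothetical, made up only for illustration; a real project would use its own tests and benchmarks as the training run.

```c
/* pgo_demo.c (hypothetical file name): a small function with a
 * data-dependent branch, the kind of code a profile can inform.
 *
 * Assumed manual workflow with GCC (driver.c is a hypothetical file
 * that provides main and feeds representative data):
 *   1. gcc -O2 -fprofile-generate pgo_demo.c driver.c -o demo   (instrumented build)
 *   2. ./demo                          (training run, writes .gcda profile data)
 *   3. gcc -O2 -fprofile-use pgo_demo.c driver.c -o demo        (profile-using build)
 */
#include <stddef.h>

/* Count how many samples are strictly greater than the threshold. */
size_t count_above(const int *samples, size_t n, int threshold)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i) {
        if (samples[i] > threshold)  /* how often this is taken is recorded in step 2 */
            ++hits;
    }
    return hits;
}
```

As far as I know, GCC's documentation lists -funroll-loops, -fpeel-loops and -ftracer among the options that -fprofile-use turns on, which is why they need no extra flags in step 3.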
carewolf replied

03 March 2017, 01:56 PM
This is in gcc:
/*****************************************************************************/
/* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
/*****************************************************************************/

/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
   split.  */
DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
          ~(m_NEHALEM | m_SANDYBRIDGE | m_GENERIC))

/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
   split.  */
DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
          ~(m_NEHALEM | m_SANDYBRIDGE | m_BDVER | m_ZNVER1 | m_GENERIC))

/* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
   the auto-vectorizer.  */
DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
          | m_ZNVER1)

Though that is GCC 7. I wonder if -mno-prefer-avx128 works. I am not sure all the tunings have been set from extensive benchmarking; some are just carried over from the Bulldozer settings and might not make sense any longer. But it certainly explains a few dramatic benchmark differences if -march=native for Zen only does AVX 128 bits at a time.
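
To make the masks in those DEF_TUNE entries easier to read, here is a minimal sketch of how a per-CPU bitmask table like this can be queried. The bit values, the M_* names, the M_HASWELL entry and the tune_enabled helper are all hypothetical and chosen only for illustration; GCC's real m_* definitions and lookup code differ.

```c
#include <stdbool.h>

/* Hypothetical one-bit-per-CPU-family masks (not GCC's real values). */
#define M_NEHALEM     (1u << 0)
#define M_SANDYBRIDGE (1u << 1)
#define M_BDVER       (1u << 2)
#define M_BTVER2      (1u << 3)
#define M_ZNVER1      (1u << 4)
#define M_GENERIC     (1u << 5)
#define M_HASWELL     (1u << 6)

/* Same mask expressions as the three DEF_TUNE entries quoted above.
 * A leading ~ means "enabled for every CPU except the ones listed". */
static const unsigned avx256_unaligned_load_optimal =
    ~(M_NEHALEM | M_SANDYBRIDGE | M_GENERIC);
static const unsigned avx256_unaligned_store_optimal =
    ~(M_NEHALEM | M_SANDYBRIDGE | M_BDVER | M_ZNVER1 | M_GENERIC);
static const unsigned avx128_optimal = M_BDVER | M_BTVER2 | M_ZNVER1;

/* A tuning applies to a CPU when that CPU's bit is set in the mask. */
bool tune_enabled(unsigned tune_mask, unsigned cpu_bit)
{
    return (tune_mask & cpu_bit) != 0;
}
```

Read this way, znver1 gets avx128_optimal and has unaligned 256-bit stores split, while unaligned 256-bit loads are left whole.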
AsuMagic replied

03 March 2017, 01:54 PM
Originally posted by VikingGe

According to some AIDA benchmarks, Ryzen seems to have a rather ridiculous memory latency that is much worse than that of Bulldozer, which already wasn't particularly great. That might explain the mixed results.

And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.

Interesting. So Zen will be more competitive with Intel on tasks that are less memory-intensive and more cache efficient? Which is why it seems to perform poorly on some games?
zboson replied

03 March 2017, 01:52 PM
It's too bad AMD dropped FMA4. The most disappointing thing about Zen is that it only has two 128-bit FMA units, so its peak FLOPS is the same as Sandy Bridge's. Haswell has twice the peak FLOPS of Zen because it can do two 256-bit FMAs per cycle. I think FMA4 is better than FMA3, so I wish they had kept it. XOP was also interesting because it fixed a few things that are otherwise only fixed in AVX-512, which currently exists only in Knights Landing. Zen is still awesome though.
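
The arithmetic behind that comparison can be sketched as follows. The function name and its parameters are hypothetical, and the Sandy Bridge figures (one 256-bit add plus one 256-bit multiply per cycle, no FMA) are my assumption rather than something stated above; this counts single-precision operations per core per cycle and treats an FMA as two operations.

```c
/* Peak single-precision floating-point operations per cycle per core.
 *   units        : vector execution units that can issue every cycle
 *   vector_bits  : width of each unit in bits
 *   flops_per_op : 2 for an FMA (multiply + add), 1 for a plain add or mul
 */
unsigned peak_sp_flops_per_cycle(unsigned units, unsigned vector_bits,
                                 unsigned flops_per_op)
{
    unsigned lanes = vector_bits / 32;  /* 32-bit floats per vector */
    return units * lanes * flops_per_op;
}
```

With these assumptions, Zen is peak_sp_flops_per_cycle(2, 128, 2), Sandy Bridge is peak_sp_flops_per_cycle(2, 256, 1), and Haswell is peak_sp_flops_per_cycle(2, 256, 2), which gives 16, 16 and 32 respectively.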

Last edited by zboson; 03 March 2017, 02:27 PM.
zboson replied

03 March 2017, 01:49 PM
Originally posted by qsmcomp

Try -march=haswell -mtune=haswell -mno-rdrnd and -march=haswell -mtune=znver1 -mno-rdrnd -mprefer-avx128 -mvzeroupper.
You will see more interesting results.

Why `-mno-rdrnd`?

I don't know about the Zen architecture, but with the Bulldozer architecture -mvzeroupper is not necessary. It's only Intel that suffers (maybe Zen now as well) from the false dependency on the upper half of the AVX registers when it's dirty.
VikingGe replied

03 March 2017, 01:46 PM
Originally posted by edwaleni

Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.

According to some AIDA benchmarks, Ryzen seems to have a rather ridiculous memory latency that is much worse than that of Bulldozer, which already wasn't particularly great. That might explain the mixed results.

And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.
curaga replied

03 March 2017, 01:33 PM
Michael, you can disable individual instruction sets even when using -march, for example: -march=bdver1 -mno-xop -mno-fma4
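
One way to check that such a combination took effect is to look at the compiler's predefined macros. The sketch below assumes GCC's __XOP__ and __FMA4__ macros, which are defined only when the corresponding instruction set is enabled for the compilation; the two function names are hypothetical.

```c
/* Report whether this translation unit was compiled with XOP / FMA4
 * code generation enabled, based on GCC's predefined macros. */
int built_with_xop(void)
{
#ifdef __XOP__
    return 1;
#else
    return 0;
#endif
}

int built_with_fma4(void)
{
#ifdef __FMA4__
    return 1;
#else
    return 0;
#endif
}
```

Compiled with -march=bdver1 alone, both should return 1; adding -mno-xop -mno-fma4 should make both return 0. The same information should be visible without any code by dumping the macros, e.g. gcc -march=bdver1 -mno-xop -mno-fma4 -dM -E - < /dev/null and searching the output for XOP and FMA4.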
edwaleni replied

03 March 2017, 12:51 PM
Originally posted by indepe

The Himeno benchmark seems to be designed to always do the opposite of what you'd expect. (just joking)

LOL. I pulled the highlighted line from the Himeno website. I was trying to see if compiler flags would have any impact on the results (it doesn't seem so).

It appears to be more a matter of optimization AMD has to do in Ryzen's cache.

Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.