The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance
-
Originally posted by indepe
The Himeno benchmark seems to be designed to always do the opposite of what you'd expect. (just joking)
It appears to be more a matter of optimization AMD still has to do in Ryzen's cache.
Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.
-
Originally posted by edwaleni
Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.
And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.
-
Originally posted by qsmcomp
Try -march=haswell -mtune=haswell -mno-rdrnd and -march=haswell -mtune=znver1 -mno-rdrnd -mprefer-avx128 -mvzeroupper.
You will see more interesting results.
I don't know about the Zen architecture, but on the Bulldozer architecture -mvzeroupper is not necessary. Only Intel (and maybe Zen now as well) suffers from the false dependency on the upper half of the AVX registers when it is dirty.
-
It's too bad AMD dropped FMA4. The most disappointing thing about Zen is that it only has two 128-bit FMA units, so its peak FLOPS per cycle is the same as Sandy Bridge's. Haswell has twice the peak FLOPS of Zen because it can do two 256-bit FMAs per cycle. I think FMA4 is better than FMA3, so I wish they had kept it. XOP was also interesting because it fixed a few things that are otherwise only fixed in AVX-512, which currently exists only in Knights Landing. Zen is still awesome though.
Last edited by zboson; 03 March 2017, 02:27 PM.
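For anyone who wants to check the peak-FLOPS comparison in this post, here is a rough back-of-the-envelope sketch in Python. The function name is made up for illustration. The Zen and Haswell unit counts are the ones stated in the post. The Sandy Bridge line is an added assumption: no FMA, but one 256-bit multiply port and one 256-bit add port that can issue in the same cycle. It only counts single-precision lanes per cycle and ignores clocks, memory and everything else.

```python
def peak_sp_flops_per_cycle(units: int, width_bits: int, flops_per_lane: int) -> int:
    """Peak single-precision FLOPs per cycle for `units` identical vector units.

    flops_per_lane is 2 for an FMA unit (multiply + add in one instruction)
    and 1 for a plain add-only or multiply-only unit.
    """
    lanes = width_bits // 32  # number of 32-bit float lanes per unit
    return units * lanes * flops_per_lane


# Unit counts as stated in the post; these are not measurements.
zen1 = peak_sp_flops_per_cycle(units=2, width_bits=128, flops_per_lane=2)
haswell = peak_sp_flops_per_cycle(units=2, width_bits=256, flops_per_lane=2)

# Assumption (not from the post): Sandy Bridge issues one 256-bit multiply
# and one 256-bit add per cycle, with no fused multiply-add.
sandy_bridge = 2 * peak_sp_flops_per_cycle(units=1, width_bits=256, flops_per_lane=1)

print("Zen:", zen1, "Sandy Bridge:", sandy_bridge, "Haswell:", haswell)
```

Under those assumptions Zen and Sandy Bridge both come out at 16 per core per cycle and Haswell at 32, which matches the "same as Sandy Bridge, half of Haswell" claim.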
- Likes 2
-
Originally posted by VikingGe
According to some AIDA benchmarks, Ryzen seems to have a rather ridiculous memory latency that is much worse than that of Bulldozer, which already wasn't particularly great. That might explain the mixed results.
And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.
-
This is in gcc:
/*****************************************************************************/
/* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
/*****************************************************************************/

/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
   split.  */
DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
          ~(m_NEHALEM | m_SANDYBRIDGE | m_GENERIC))

/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
   split.  */
DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
          ~(m_NEHALEM | m_SANDYBRIDGE | m_BDVER | m_ZNVER1 | m_GENERIC))

/* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
   the auto-vectorizer.  */
DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
          | m_ZNVER1)
Though that is GCC 7. I wonder if -mno-prefer-avx128 works. I am not sure all of these tunings were set from extensive benchmarking; some are just carried over from the Bulldozer settings and might not make sense any longer. But it certainly explains a few dramatic benchmark differences if -march=native on Zen only does AVX 128 bits at a time.
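As far as I understand it, each DEF_TUNE selector is a bitmask of processors, and a tuning is switched on when the bit of the -mtune target is set in that mask. Below is a toy model of that in Python, not GCC's actual code: the Cpu enum, its bit positions and the helper names are made up, and the whole Bulldozer family is collapsed into a single BDVER bit. The three selectors are transcribed from the excerpt above.

```python
from enum import IntEnum


class Cpu(IntEnum):
    # Hypothetical bit positions; GCC's real processor list is much longer.
    GENERIC = 0
    NEHALEM = 1
    SANDYBRIDGE = 2
    HASWELL = 3
    BDVER = 4
    BTVER2 = 5
    ZNVER1 = 6


def m(cpu: Cpu) -> int:
    """One-bit mask for a processor, standing in for the m_* macros."""
    return 1 << cpu


# Selector masks transcribed from the DEF_TUNE excerpt.
SELECTORS = {
    "256_unaligned_load_optimal":
        ~(m(Cpu.NEHALEM) | m(Cpu.SANDYBRIDGE) | m(Cpu.GENERIC)),
    "256_unaligned_store_optimal":
        ~(m(Cpu.NEHALEM) | m(Cpu.SANDYBRIDGE) | m(Cpu.BDVER)
          | m(Cpu.ZNVER1) | m(Cpu.GENERIC)),
    "avx128_optimal":
        m(Cpu.BDVER) | m(Cpu.BTVER2) | m(Cpu.ZNVER1),
}


def tune_enabled(name: str, target: Cpu) -> bool:
    """True when the target's bit is set in the named selector mask."""
    return bool(SELECTORS[name] & m(target))


for target in (Cpu.HASWELL, Cpu.ZNVER1):
    flags = {name: tune_enabled(name, target) for name in SELECTORS}
    print(target.name, flags)
```

In this model a znver1 target gets avx128_optimal switched on and 256-bit unaligned stores treated as not optimal (so they are split), while a haswell target gets the opposite on both counts.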
-
Originally posted by qsmcomp
with
-O3 -fno-inline-functions -funroll-loops -fpeel-loops -ftracer
the results might be more 'tricky'.
I wish more build systems supported profile-generating and profile-using builds, or could do both in sequence: first build an instrumented binary, then run a bunch of tests and benchmarks, and then recompile with the generated profile.
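Done by hand, that two-pass flow is short enough to script. Here is a minimal sketch in Python, assuming a single-file C program built with gcc. The -fprofile-generate and -fprofile-use flags are the standard GCC ones, but the function names, the -O2 default and the idea of passing training arguments are just illustration, not how any particular build system does it.

```python
import subprocess
from typing import List, Sequence


def pgo_commands(source: str, binary: str,
                 train_args: Sequence[str] = (),
                 opt: str = "-O2") -> List[List[str]]:
    """Return the three command lines of a two-pass GCC profile build."""
    return [
        # Pass 1: instrumented build that records execution counts.
        ["gcc", opt, "-fprofile-generate", source, "-o", binary],
        # Training run: exercising the binary writes the profile data.
        [f"./{binary}", *train_args],
        # Pass 2: rebuild using the recorded profile.
        ["gcc", opt, "-fprofile-use", source, "-o", binary],
    ]


def run_pgo_build(source: str, binary: str,
                  train_args: Sequence[str] = ()) -> None:
    """Run the three steps in order, stopping at the first failure."""
    for cmd in pgo_commands(source, binary, train_args):
        subprocess.run(cmd, check=True)
```

Calling run_pgo_build("bench.c", "bench") would do the whole thing, with bench.c being a hypothetical source file. The catch is the one the post is really about: the training run has to resemble the real workload, which is why it belongs next to the tests and benchmarks in the build system.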
- Likes 1
-
Originally posted by carewolf
I wish more build systems supported profile-generating and profile-using builds, or could do both in sequence: first build an instrumented binary, then run a bunch of tests and benchmarks, and then recompile with the generated profile.
- Likes 1