
The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance


  • zboson
    replied
    Originally posted by AsuMagic

    Interesting. So Zen will be more competitive with Intel on tasks that are less memory-intensive and more cache efficient? Which is why it seems to perform poorly on some games?
    Zen is also dual-channel if I recall, whereas Skylake (not sure which was first) is quad-channel. This means Zen is more affected by memory bandwidth. That's maybe the second most disappointing thing about Zen after sticking with AVX128. I'm still likely to build a Zen system. It will be the first desktop I have built in years.
    Last edited by zboson; 03 March 2017, 02:43 PM.



  • AsuMagic
    replied
    Originally posted by carewolf
    I wish more build systems had support for making profile-generating and profile-using builds, or could do both: first making one, then running a bunch of tests and benchmarks, and then compiling with the generated profile.
    +1. CMake is a pain in the ass to configure in that regard.



  • carewolf
    replied
    Originally posted by qsmcomp

    with
    -O3 -fno-inline-functions -funroll-loops -fpeel-loops -ftracer
    the results might be more 'tricky'.
    That is one nonsensical line. I would always enable -finline-functions first. The rest mainly makes sense together with profile-guided optimization: after you have generated a profile, you can use that profile with -funroll-loops etc. (in fact, I believe those are on by default in the second, profile-using run of a profile-guided build).

    I wish more build systems had support for making profile-generating and profile-using builds, or could do both: first making one, then running a bunch of tests and benchmarks, and then compiling with the generated profile.
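    The two-pass build described above can be tried by hand on a toy file. This is only a sketch: the file name pgo_demo.c, the PGO_DEMO_MAIN guard and the function count_below are made up for illustration, and only the -fprofile-generate and -fprofile-use flags come from GCC itself.

```c
/* Toy workload for trying profile-guided optimization by hand.
 *
 * Build steps (the file name pgo_demo.c is an assumption):
 *   gcc -O2 -fprofile-generate -DPGO_DEMO_MAIN pgo_demo.c -o pgo_demo
 *   ./pgo_demo        (the training run; it writes a .gcda profile)
 *   gcc -O2 -fprofile-use -DPGO_DEMO_MAIN pgo_demo.c -o pgo_demo
 */
#include <stddef.h>

/* Counts the values below a limit. The branch inside the loop is the
 * kind of thing the profile records as mostly taken or mostly not. */
size_t count_below(const int *v, size_t n, int limit)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] < limit)
            hits++;
    return hits;
}

#ifdef PGO_DEMO_MAIN
#include <stdio.h>

/* Hypothetical training workload: values cycle through 0..9. */
int main(void)
{
    enum { N = 1000 };
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = i % 10;
    printf("%zu\n", count_below(data, N, 3));
    return 0;
}
#endif
```

    In a real project the training run would be the test suite or the benchmarks mentioned above rather than a fixed loop.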



  • carewolf
    replied
    This is in gcc:
    /*****************************************************************************/
    /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
    /*****************************************************************************/

    /* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
       split.  */
    DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
              ~(m_NEHALEM | m_SANDYBRIDGE | m_GENERIC))

    /* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
       split.  */
    DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
              ~(m_NEHALEM | m_SANDYBRIDGE | m_BDVER | m_ZNVER1 | m_GENERIC))

    /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
       the auto-vectorizer.  */
    DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
              | m_ZNVER1)

    Though that is GCC 7. I wonder if -mno-prefer-avx128 works. I am not sure all the tunings have been set from extensive benchmarking; some are just carried over from the Bulldozer settings and might not make sense any longer. But it certainly explains a few dramatic benchmark differences if -march=native for Zen only does AVX 128 bits at a time.
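    One way to see what the auto-vectorizer actually does under these tunings is to compile a small kernel and read the assembly. A minimal sketch, assuming a GCC with znver1 support; the function name saxpy and the file name are made up for illustration, and whether -mno-prefer-avx128 really overrides the tuning is exactly the open question here.

```c
#include <stddef.h>

/* Hypothetical test kernel: y[i] += a * x[i].
 *
 * Suggested experiment (the file name saxpy.c is an assumption):
 *   gcc -O3 -march=znver1 -S saxpy.c -o znver1.s
 *   gcc -O3 -march=znver1 -mno-prefer-avx128 -S saxpy.c -o wide.s
 * Then compare the vectorized loop bodies: %xmm registers mean 128-bit
 * operations, %ymm registers mean 256-bit operations.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* restrict tells the compiler x and y do not overlap, which makes
     * the loop straightforward to vectorize. */
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

    If both .s files use only %xmm registers in the loop, the flag did not change the vector width on that compiler version.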



  • AsuMagic
    replied
    Originally posted by VikingGe
    According to some AIDA benchmarks, Ryzen seems to have a rather ridiculous memory latency that is much worse than that of Bulldozer, which already wasn't particularly great. That might explain the mixed results.

    And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.
    Interesting. So Zen will be more competitive with Intel on tasks that are less memory-intensive and more cache efficient? Which is why it seems to perform poorly on some games?



  • zboson
    replied
    It's too bad AMD dropped FMA4. The most disappointing thing about Zen is that it only has two 128-bit FMA units. Its peak FLOPS is the same as Sandy Bridge's. Haswell has twice the peak FLOPS of Zen because it can do two 256-bit FMAs per cycle. I think FMA4 is better than FMA3, so I wish they had kept it. XOP was also interesting because it fixed a few things that are otherwise only fixed in AVX-512, which currently exists only in Knights Landing. Zen is still awesome though.
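    The peak figures above follow from simple arithmetic: units times lanes per vector, doubled for FMA because each lane does a multiply and an add. A small sketch of that calculation for double precision; the function name is made up, and modeling Sandy Bridge as one 256-bit add unit plus one 256-bit multiply unit is an assumption added here, not something stated in the post.

```c
#include <assert.h>

/* Peak double-precision FLOPs per cycle per core for a group of
 * identical vector units. A lane does 1 FLOP per cycle on a plain add
 * or multiply unit and 2 on an FMA unit (one multiply plus one add). */
static int peak_dp_flops_per_cycle(int units, int width_bits, int is_fma)
{
    assert(units > 0 && width_bits > 0 && width_bits % 64 == 0);
    int lanes = width_bits / 64;   /* 64-bit doubles per vector */
    return units * lanes * (is_fma ? 2 : 1);
}
```

    With two 128-bit FMA units (Zen) this gives 2 x 2 x 2 = 8, with two 256-bit FMA units (Haswell) 2 x 4 x 2 = 16, and with two 256-bit non-FMA units (the Sandy Bridge model assumed above) 2 x 4 x 1 = 8.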
    Last edited by zboson; 03 March 2017, 02:27 PM.



  • zboson
    replied
    Originally posted by qsmcomp
    Try -march=haswell -mtune=haswell -mno-rdrnd and -march=haswell -mtune=znver1 -mno-rdrnd -mprefer-avx128 -mvzeroupper.
    You will see more interesting results.
    Why `-mno-rdrnd`?

    I don't know about the Zen architecture, but on the Bulldozer architecture -mvzeroupper is not necessary. It's only Intel that suffers (maybe Zen now as well) from the false dependency on the upper half of the AVX registers when it's dirty.



  • VikingGe
    replied
    Originally posted by edwaleni
    Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.
    According to some AIDA benchmarks, Ryzen seems to have a rather ridiculous memory latency that is much worse than that of Bulldozer, which already wasn't particularly great. That might explain the mixed results.

    And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.



  • curaga
    replied
    Michael, you can disable instruction sets even with -march. -march=bdver1 -mno-xop -mno-fma4
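    To confirm that the -mno-* switches took effect, the compiler's predefined macros can be checked from inside the code being built. A minimal sketch using GCC's __AVX__, __FMA__, __FMA4__ and __XOP__ macros; the enum names and the function enabled_isa are made up for illustration.

```c
/* Bit flags for the ISA extensions probed below (hypothetical names). */
enum { ISA_AVX = 1, ISA_FMA3 = 2, ISA_FMA4 = 4, ISA_XOP = 8 };

/* Reports which extensions the compiler was allowed to use for this
 * translation unit, based on its predefined macros.
 *
 * The same information is available without any code via:
 *   gcc -march=bdver1 -mno-xop -mno-fma4 -dM -E - < /dev/null
 * and searching the output for XOP and FMA. */
unsigned enabled_isa(void)
{
    unsigned mask = 0;
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __FMA__
    mask |= ISA_FMA3;
#endif
#ifdef __FMA4__
    mask |= ISA_FMA4;
#endif
#ifdef __XOP__
    mask |= ISA_XOP;
#endif
    return mask;
}
```

    Built with -march=bdver1 -mno-xop -mno-fma4, the ISA_FMA4 and ISA_XOP bits would be expected to be clear while ISA_AVX stays set.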



  • edwaleni
    replied
    Originally posted by indepe

    The Himeno benchmark seems to be designed to always do the opposite of what you'd expect. (just joking)
    LOL. I pulled the highlighted line from the Himeno website. I was trying to see if compiler flags would have any impact on the results (it doesn't seem so).

    It appears to be more a matter of optimization AMD has to do in Ryzen's cache.

    Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.

