The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance


  • #11
    Originally posted by qsmcomp View Post
    Try -march=haswell -mtune=haswell -mno-rdrnd and -march=haswell -mtune=znver1 -mno-rdrnd -mprefer-avx128 -mvzeroupper.
    You will see more interesting results.
    Have results to share?



    • #12
      Originally posted by indepe View Post

      The Himeno benchmark seems to be designed to always do the opposite of what you'd expect. (just joking)
      LOL. I pulled the highlighted line from the Himeno website. I was trying to see if compiler flags would have any impact on the results (they don't seem to).

      It appears to be more a matter of optimization AMD still has to do in Ryzen's cache.

      Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.



      • #13
        Michael, you can disable instruction sets even with -march. -march=bdver1 -mno-xop -mno-fma4
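A quick way to confirm which extensions a given flag combination actually leaves enabled is to test the macros GCC predefines for them. This is a minimal sketch: `__XOP__` and `__FMA4__` are GCC's predefined macros, while the function names are made up for illustration.

```c
#include <stdio.h>

/* Returns 1 if this translation unit was compiled with XOP enabled.
   GCC defines __XOP__ only when XOP code generation is on. */
int built_with_xop(void)
{
#ifdef __XOP__
    return 1;
#else
    return 0;
#endif
}

/* Same check for FMA4, using GCC's __FMA4__ macro. */
int built_with_fma4(void)
{
#ifdef __FMA4__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print what the build flags left enabled. */
void report_isa(void)
{
    printf("XOP:  %s\n", built_with_xop() ? "enabled" : "disabled");
    printf("FMA4: %s\n", built_with_fma4() ? "enabled" : "disabled");
}
```

Calling `report_isa()` from a `main` built with `-march=bdver1` should report both as enabled, and adding `-mno-xop -mno-fma4` should flip both to disabled. The same macros can be inspected without any source file via `gcc -march=bdver1 -mno-xop -mno-fma4 -dM -E - < /dev/null`.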



        • #14
          Originally posted by edwaleni View Post
          Based on the results PTS has shown, Ryzen definitely is a work in progress. Some things seem to have gotten a great deal of attention, other areas less so. Ryzen2/Ryzen Server will probably handle Himeno much better in relative terms.
          According to some AIDA benchmarks, Ryzen seems to have a rather ridiculous memory latency that is much worse than that of Bulldozer, which already wasn't particularly great. That might explain the mixed results.

          And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.



          • #15
            Originally posted by qsmcomp View Post
            Try -march=haswell -mtune=haswell -mno-rdrnd and -march=haswell -mtune=znver1 -mno-rdrnd -mprefer-avx128 -mvzeroupper.
            You will see more interesting results.
            Why `-mno-rdrnd`?

            I don't know about the Zen architecture, but with the Bulldozer architecture -mvzeroupper is not necessary. It's only Intel that suffers (maybe Zen now as well) from the false dependency on the upper half of the AVX registers when it's dirty.



            • #16
              It's too bad AMD dropped FMA4. The most disappointing thing about Zen is that it only has two 128-bit FMA units. Its peak FLOPS is the same as Sandy Bridge. Haswell has twice the peak FLOPS of Zen because it can do two 256-bit FMAs per cycle. I think FMA4 is better than FMA3, so I wish they had kept it. XOP was also interesting because it fixed a few things that are otherwise only fixed in AVX512, which currently exists only in Knights Landing. Zen is still awesome though.
              Last edited by zboson; 03 March 2017, 02:27 PM.
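The peak-FLOPS comparison in this post is simple arithmetic, sketched below. The FMA unit counts and widths for Zen and Haswell are the ones stated in the post; modelling Sandy Bridge as one 256-bit add pipe plus one 256-bit multiply pipe with no FMA is an added assumption, and all function names are hypothetical.

```c
/* Number of elements that fit in one vector register. */
unsigned lanes(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}

/* Peak FLOPs per cycle per core when every FMA unit issues each cycle.
   One FMA counts as two floating-point operations per lane. */
unsigned fma_flops_per_cycle(unsigned fma_units, unsigned vector_bits,
                             unsigned element_bits)
{
    return fma_units * lanes(vector_bits, element_bits) * 2;
}

/* Assumed model for a core without FMA: one add pipe and one multiply
   pipe, each retiring one operation per lane per cycle. */
unsigned add_mul_flops_per_cycle(unsigned vector_bits, unsigned element_bits)
{
    return 2 * lanes(vector_bits, element_bits);
}
```

For single precision, `fma_flops_per_cycle(2, 128, 32)` gives 16 for Zen and `fma_flops_per_cycle(2, 256, 32)` gives 32 for Haswell, while `add_mul_flops_per_cycle(256, 32)` gives 16 for Sandy Bridge under the assumed model. These are theoretical ceilings per core per cycle, not measured throughput.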



              • #17
                Originally posted by VikingGe View Post
                According to some AIDA benchmarks, Ryzen seems to have a rather ridiculous memory latency that is much worse than that of Bulldozer, which already wasn't particularly great. That might explain the mixed results.

                And Intel should have roughly 2x the FMA throughput per core, maybe that matters too in that test. Not sure what exactly it does, never really cared.
                Interesting. So Zen will be more competitive with Intel on tasks that are less memory-intensive and more cache efficient? Which is why it seems to perform poorly on some games?



                • #18
                  This is in gcc:
                  /*****************************************************************************/
                  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
                  /*****************************************************************************/

                  /* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
                  split. */
                  DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
                  ~(m_NEHALEM | m_SANDYBRIDGE | m_GENERIC))

                  /* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
                  split. */
                  DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
                  ~(m_NEHALEM | m_SANDYBRIDGE | m_BDVER | m_ZNVER1 | m_GENERIC))

                  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
                  the auto-vectorizer. */
                  DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
                  | m_ZNVER1)

                  Though that is GCC 7. I wonder if -mno-prefer-avx128 works. I am not sure all the tunings have been set from extensive benchmarking; some are just carried over from the Bulldozer settings and might not make sense any longer. But it would certainly explain a few dramatic benchmark differences if -march=native on Zen makes the auto-vectorizer use AVX only 128 bits at a time.
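One way to answer the -mno-prefer-avx128 question is to compile a small vectorizable loop both ways and compare the assembly. The loop below is a minimal sketch; the file name `saxpy.c` and the output names in the comment are made up, and the flag spelling is the GCC 7 one from this thread (newer GCC releases, as far as I know, use -mprefer-vector-width= instead).

```c
#include <stddef.h>

/* y[i] += a * x[i]
   restrict tells the compiler the arrays do not overlap, which makes
   the loop an easy target for the auto-vectorizer.

   Suggested comparison (file names are hypothetical):
     gcc -O3 -march=znver1 -S saxpy.c -o tuned.s
     gcc -O3 -march=znver1 -mno-prefer-avx128 -S saxpy.c -o wide.s
   Then look at which registers the loop body uses: xmm means 128-bit
   vectors, ymm means 256-bit vectors. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

If `tuned.s` uses only xmm registers in the loop and `wide.s` uses ymm registers, then the tuning flag is what limits the vector width and the negated option overrides it. This only shows what the compiler emits, not which variant runs faster on Zen.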



                  • #19
                    Originally posted by qsmcomp View Post

                    with
                    O3 -fno-inline-functions -funroll-loops -fpeel-loops -ftracer
                    the results might be more 'tricky'.
                    That is one nonsensical line. I would always enable -finline-functions first. The rest mainly makes sense together with profile-guided optimization: after you have generated a profile, you can use that profile with -funroll-loops etc. (in fact, I believe those are enabled by default on the second, profile-using run of a profile-guided build).

                    I wish more build systems had support for profile-generating and profile-using builds, or could do both in sequence: first make an instrumented build, then run a bunch of tests and benchmarks, and then compile again with the generated profile.
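For reference, the two-pass workflow described here looks roughly like this with plain GCC. It is a minimal sketch: `-fprofile-generate` and `-fprofile-use` are real GCC options, but `bench.c`, the function, and the idea that a `main` in that file feeds it representative data are all assumptions for illustration.

```c
#include <stddef.h>

/* Count the elements above a threshold. How often the branch is taken
   depends on the input, which is exactly what a profile records.

   Two-pass build (file and binary names are hypothetical):
     1. gcc -O2 -fprofile-generate bench.c -o bench
     2. ./bench          # run on representative input, writes .gcda files
     3. gcc -O2 -fprofile-use bench.c -o bench */
size_t count_above(const int *values, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (values[i] > threshold)
            count++;
    }
    return count;
}
```

GCC's documentation lists -funroll-loops, -fpeel-loops and -ftracer among the options that -fprofile-use turns on, which matches the guess above about the second run. The profile is only as good as the workload in step 2, so the training run should resemble real use.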



                    • #20
                      Originally posted by carewolf View Post
                      I wish more build systems had support for profile-generating and profile-using builds, or could do both in sequence: first make an instrumented build, then run a bunch of tests and benchmarks, and then compile again with the generated profile.
                      +1. CMake is a pain in the ass to configure in that regard.
