The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance


  • #41
    Michael! Everyone knows how to configure GCC except you, how dare you do things differently than they do?



    • #42
      Originally posted by qsmcomp View Post

      Aggressive inlining will make generated code larger and might do harm to caching / branch predicting?
Possibly. But it can also make code smaller in some cases; -Os actually enables some inlining that -O2 does not, for this reason. I don't know if GCC has made that level of inlining a separate option yet; it would be nice if it did. Inlining is rather important for vectorization and good loop unrolling: the compiler cannot join operations from nested function calls unless they are inlined.

If you are generating code for very small embedded devices, you might want to start from -Os and add options from -O3 that improve the performance of your code.
      Last edited by carewolf; 04 March 2017, 06:28 AM.



      • #43
        Originally posted by qsmcomp View Post

        It seems that AMD's not supporting RDRND?
        AMD has supported RdRand since Excavator. Zen is the first AMD arch to support RdSeed. You can see the new Zen instructions here


Zen has two exclusive instructions (found in no other x86 CPU), but it dropped FMA4 and XOP.

        You can also see what is supported here


I wonder if Zen's instruction set has everything Kaby Lake has?



        • #44
          Originally posted by zboson View Post

          AMD has supported RdRand since Excavator. Zen is the first AMD arch to support RdSeed. You can see the new Zen instructions here
          http://www.anandtech.com/show/11170/...0x-and-1700/10

Zen has two exclusive instructions (found in no other x86 CPU), but it dropped FMA4 and XOP.

          You can also see what is supported here
          http://openbenchmarking.org/system/1...01800X/cpuinfo

I wonder if Zen's instruction set has everything Kaby Lake has?
Yes, except the Intel-specific stuff, of course. There have always been a few things AMD and Intel do differently, various tooling of the CPU in particular (performance counters, for instance), and it appears memory encryption and buffer-overflow-protection extensions are the new ones. Ryzen in particular doesn't have MPX (buffer overflow guards), TSX (transactional memory) or SGX (encrypted memory enclaves), but AMD already has alternatives for the last two. It remains to be seen whether MPX becomes widely used, and whether AMD adopts it or makes its own version.



          • #45
            Originally posted by liam View Post


            Why do you need more than two channels in a single socket, non-rdimm system?
            Because you have eight cores/sixteen threads available and there will be situations where enough of those cores are doing enough memory intensive operations that bandwidth becomes the bottleneck...
            To put it differently, you're giving this eight core SoC basically the same bandwidth as an iPad... (Not exactly comparable --- I assume Ryzen has two independent memory controller queues and can sustain more open pages than iPad, which has essentially one controller that is run at 128-bit wide rather than 64 bits wide, but basically same order of magnitude.)

This is the constant, ongoing cheapness of the x86 world --- both Intel and AMD are so obsessed with product segmentation that they cripple their commodity CPUs' memory controllers so that server revenue is not compromised. That might seem like a great plan, except all it's going to do is push everyone with a bandwidth-intensive but compute-light workload to buy an ARM server in a year or three...



            • #46
              Originally posted by name99 View Post

              Because you have eight cores/sixteen threads available and there will be situations where enough of those cores are doing enough memory intensive operations that bandwidth becomes the bottleneck...
              To put it differently, you're giving this eight core SoC basically the same bandwidth as an iPad... (Not exactly comparable --- I assume Ryzen has two independent memory controller queues and can sustain more open pages than iPad, which has essentially one controller that is run at 128-bit wide rather than 64 bits wide, but basically same order of magnitude.)

              This is the constant on-going cheapness of the x86 world --- both Intel and AMD are so obsessed with product segmentation that they cripple their commodity CPUs in terms of the memory controllers, so that the server revenue is not compromised. That might seem like a great plan except all it's going to do is push everyone with a bandwidth-intensive but compute light workload to buy an ARM server in a year or three...
My question was rhetorical. From the reviews I've read thus far, BANDWIDTH isn't an issue. Having that big victim cache certainly helps matters.
              If you've come across a review that shows this to be an issue, I'd definitely read it.



              • #47
                I wonder if the FLAC regression is due to some kind of interaction with the run-time SIMD detection?

FLAC assumes recent GCC versions always have certain ISA extensions available and builds them all in. It's a really common misconception, I think, borne out of the fact that most distribution toolchains are generic x86-64, and the generic build does always make them available. Since I always use a target-specific toolchain (including AVX math functions, which isn't the default even on supported CPUs!), I always patch FLAC, among a few other projects like Boost and Chromium, to only include support for enabled extensions and to disable the run-time detection.



                • #48
I found that the best performance with Zen can be had when building with "-march=znver1 -mtune=broadwell -mprefer-avx128". The performance increase over -mtune=znver1 in e.g. scimark is as high as 20% in some instances.



                  • #49
                    I believe this needs to be retested with GCC 7. GCC 6 is not optimized for Zen according to Gentoo wiki: https://wiki.gentoo.org/wiki/Ryzen#GCC_6.3.2B

                    EDIT: Wrong thread.



                    • #50
                      Originally posted by mlau View Post
                      I found that the best performance with zen can be had when building with "-march=znver1 -mtune=broadwell -mprefer-avx128". The performance increase over -mtune=znver1 in e.g. scimark is as high as 20% in some instances.
                      In cases where tuning for broadwell helps, it would be interesting to try `-march=znver1 -mno-prefer-avx128`. (And maybe `-mno-avx256-split-unaligned-store`).

-mtune=znver1 enables `-mprefer-avx128` in gcc6.3 and 7.1 (I checked on Godbolt with -fverbose-asm). Zen is designed to handle 256b vectors with its very high uop throughput. IIRC, its front-end can issue 6 uops per clock, but only 5 uops per clock if they're all from single-uop instructions. So it has better uop throughput when running 256b AVX code (which decodes to 2 uops per instruction).

                      Auto-vectorization with wider vectors can have more overhead, so -mprefer-avx128 may be a good default for Ryzen even if it loses on Scimark. But if it loses in general, you should file a missed-optimization bug on gcc's bugzilla and let them know they might want to tweak their tuning options.

                      Also, I see that Zen tuning includes -mavx256-split-unaligned-store, but not -mavx256-split-unaligned-load. Haswell and later tuning disables both. -mtune=generic enables both, even though it's only good on Sandybridge/IvyBridge, and on AMD Bulldozer-family. The worst part is that it affects integer AVX2 loads/stores since gcc7.1. No AVX2-supporting CPU benefits from splitting loads (except maybe Excavator?), but `-mavx2` doesn't affect tuning options. So `-mavx2` without `-mtune=haswell` or something is very likely to be sub-optimal, unless the compiler can prove that arrays are always aligned. (I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 about this issue)

If gcc doesn't know for sure that an array is 32B-aligned, those options make it emit

    vmovdqu [rdi], xmm0
    vextracti128 [rdi+16], ymm0, 1

instead of

    vmovdqu [rdi], ymm0

for `_mm256_storeu_si256(dst, v)`, or for auto-vectorized code.

                      This is really bad if your data is actually 32B-aligned most/all of the time, but the compiler doesn't figure that out. In that case it's a pure downside even on Sandybridge.

                      This splitting of loads and stores is usually a win on Sandybridge for 16B-aligned data that's not 32B-aligned, but it loses on Haswell.

