The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance

  • #41
    Michael! Everyone knows how to configure GCC except you, how dare you do things differently than they do?



    • #42
      Originally posted by qsmcomp View Post

      Aggressive inlining will make generated code larger and might do harm to caching / branch predicting?
      Possibly. But it can also make code smaller in some cases. -Os actually enables some inlining that -O2 does not, for this reason. I don't know if gcc has made that level of inlining a separate option yet; it would be nice if they did. Inlining is rather important for vectorization and good loop unrolling: the compiler cannot join operations from nested function calls unless they are inlined.

      If you are generating code for very small embedded devices, you might want to start from the -Os optimization level and add the options from -O3 that improve the performance of your code.
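To check which optimizer flags actually differ between two levels on a given GCC, one option is to diff the output of `gcc -Q --help=optimizers`. The sketch below is only an illustration: the helper names (`parse_optimizers`, `query_level`, `diff_levels`) are made up for this example, and it assumes a `gcc` binary on the PATH. It makes no claim about which flags differ on any particular GCC version.

```python
import re
import subprocess

# Matches report lines such as "  -finline-functions      [enabled]".
# Lines that carry a value instead of a state (e.g. "-fvect-cost-model=...")
# are deliberately skipped.
_FLAG_LINE = re.compile(r"^\s*(-f[\w-]+)\s+\[(enabled|disabled)\]\s*$")


def parse_optimizers(text):
    """Return {flag: bool} for every line reporting [enabled] or [disabled]."""
    states = {}
    for line in text.splitlines():
        match = _FLAG_LINE.match(line)
        if match:
            states[match.group(1)] = match.group(2) == "enabled"
    return states


def query_level(level, cc="gcc"):
    """Ask the compiler which optimizer flags are on at a level such as '-Os'."""
    result = subprocess.run(
        [cc, "-Q", "--help=optimizers", level],
        capture_output=True, text=True, check=True,
    )
    return parse_optimizers(result.stdout)


def diff_levels(first, second):
    """Return {flag: (state_in_first, state_in_second)} where the states differ."""
    return {
        flag: (first.get(flag), second.get(flag))
        for flag in sorted(set(first) | set(second))
        if first.get(flag) != second.get(flag)
    }
```

Calling `diff_levels(query_level("-O2"), query_level("-Os"))` on your own toolchain lists the flags to consider when starting from -Os and adding individual -O3 options.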
      Last edited by carewolf; 04 March 2017, 06:28 AM.



      • #43
        Originally posted by qsmcomp View Post

        It seems that AMD doesn't support RDRAND?
        AMD has supported RdRand since Excavator. Zen is the first AMD arch to support RdSeed. You can see the new Zen instructions here
        http://www.anandtech.com/show/11170/...0x-and-1700/10

        Zen adds two Zen-exclusive instructions (ones no other x86 CPU has), but it drops FMA4 and XOP.

        You can also see what is supported here
        http://openbenchmarking.org/system/1...01800X/cpuinfo

        I wonder if Zen's instruction set has everything Kaby Lake has?
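On a Linux box, the same information as the cpuinfo link can be read locally. A minimal sketch, assuming an x86 kernel that lists features on a `flags` line in `/proc/cpuinfo` (other architectures use different keys); the function names are made up for this example:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. a non-x86 kernel)


def report(flags, wanted=("rdrand", "rdseed", "fma4", "xop", "clzero")):
    """Map each feature name of interest to whether the CPU advertises it."""
    return {name: name in flags for name in wanted}


def read_local_flags(path="/proc/cpuinfo"):
    """Read and parse the flags of the machine this script runs on."""
    with open(path) as handle:
        return cpu_flags(handle.read())
```

`report(read_local_flags())` then shows at a glance whether RdRand, RdSeed, FMA4 and XOP are advertised on the machine in question.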



        • #44
          Originally posted by phoronix View Post
          Phoronix: The Impact Of GCC Zen Compiler Tuning On AMD Ryzen Performance

          The latest in our AMD Ryzen Linux benchmarking is looking at the impact of compiled binaries when making use of Zen "znver1" compiler optimizations with the GNU Compiler Collection (GCC) compared to other optimization levels like Bulldozer and K8-SSE3.

          http://www.phoronix.com/vr.php?view=24234
          On my A10-7850K machine, the GCC 6.3.0 compiler generates slower code with -march=bdver3 than with -mavx (the latter is without any -march).

          It would be useful to know whether this GCC shortcoming is an issue on Ryzen as well.

          Is there an openbenchmarking.org result comparing "-O3 -mavx" with "-O3 -march=znver1" on Ryzen?
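In the absence of a published result, the comparison can be run by hand. The following is a rough sketch, not a substitute for a proper benchmark suite: `bench.c` is a hypothetical single-file benchmark, the helper names are invented for this example, and the timing is simple wall-clock time over a few runs.

```python
import shlex
import subprocess
import time

# The two flag sets asked about above.
VARIANTS = {
    "avx-only": "-O3 -mavx",
    "znver1": "-O3 -march=znver1",
}


def compile_cmd(flags, src="bench.c", out="bench", cc="gcc"):
    """Build the compiler command line for one flag set."""
    return [cc, *shlex.split(flags), src, "-o", out]


def best_runtime(binary, repeats=3):
    """Run the binary several times and return the fastest wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([binary], check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(src="bench.c"):
    """Compile and time every variant; returns {variant name: seconds}."""
    results = {}
    for name, flags in VARIANTS.items():
        out = f"./bench-{name}"
        subprocess.run(compile_cmd(flags, src, out), check=True)
        results[name] = best_runtime(out)
    return results
```

Note that a binary built with -march=znver1 may fail with an illegal-instruction error on a CPU that lacks the Zen feature set, so `compare()` is only meaningful when run on Ryzen itself.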



          • #45
            Originally posted by zboson View Post

            AMD has supported RdRand since Excavator. Zen is the first AMD arch to support RdSeed. You can see the new Zen instructions here
            http://www.anandtech.com/show/11170/...0x-and-1700/10

            Zen adds two Zen-exclusive instructions (ones no other x86 CPU has), but it drops FMA4 and XOP.

            You can also see what is supported here
            http://openbenchmarking.org/system/1...01800X/cpuinfo

            I wonder if Zen's instruction set has everything Kaby Lake has?
            Yes, except the Intel-specific stuff of course. There have always been a few things AMD and Intel have done differently, CPU tooling in particular (performance counters being one example), and it appears that memory encryption and buffer-overflow protection extensions are now new ones. Ryzen in particular doesn't have MPX (buffer overflow guards), TSX (transactional memory) and SGX (memory encryption), but AMD already has alternatives for the last two. It remains to be seen whether MPX will become generally used and whether AMD will adopt it or make its own version.



            • #46
              Originally posted by liam View Post


              Why do you need more than two channels in a single socket, non-rdimm system?
              Because you have eight cores/sixteen threads available and there will be situations where enough of those cores are doing enough memory intensive operations that bandwidth becomes the bottleneck...
              To put it differently, you're giving this eight core SoC basically the same bandwidth as an iPad... (Not exactly comparable --- I assume Ryzen has two independent memory controller queues and can sustain more open pages than iPad, which has essentially one controller that is run at 128-bit wide rather than 64 bits wide, but basically same order of magnitude.)

              This is the constant on-going cheapness of the x86 world --- both Intel and AMD are so obsessed with product segmentation that they cripple their commodity CPUs in terms of the memory controllers, so that the server revenue is not compromised. That might seem like a great plan except all it's going to do is push everyone with a bandwidth-intensive but compute light workload to buy an ARM server in a year or three...
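The "same order of magnitude" point can be put into rough numbers with the usual peak-bandwidth formula (channels × bus width × transfer rate). The memory speeds below are assumptions chosen for illustration, not figures from this thread: DDR4-2400 for the dual-channel desktop and LPDDR4-3200 for a single 128-bit tablet controller.

```python
def peak_bandwidth_gbs(channels, bus_bits, mega_transfers):
    """Theoretical peak bandwidth in GB/s (decimal units).

    channels       -- number of independent memory channels
    bus_bits       -- width of each channel in bits
    mega_transfers -- transfer rate in MT/s (e.g. 2400 for DDR4-2400)
    """
    bytes_per_transfer = bus_bits // 8
    return channels * bytes_per_transfer * mega_transfers / 1000


# Assumed configurations, for illustration only.
desktop = peak_bandwidth_gbs(channels=2, bus_bits=64, mega_transfers=2400)
tablet = peak_bandwidth_gbs(channels=1, bus_bits=128, mega_transfers=3200)

# Share of the desktop bandwidth per core if all eight cores stream at once.
per_core = desktop / 8

print(f"desktop: {desktop:.1f} GB/s, tablet: {tablet:.1f} GB/s, "
      f"per core: {per_core:.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth is lower and depends on the controller, which is the caveat made in the parenthetical above.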



              • #47
                Originally posted by name99 View Post

                Because you have eight cores/sixteen threads available and there will be situations where enough of those cores are doing enough memory intensive operations that bandwidth becomes the bottleneck...
                To put it differently, you're giving this eight core SoC basically the same bandwidth as an iPad... (Not exactly comparable --- I assume Ryzen has two independent memory controller queues and can sustain more open pages than iPad, which has essentially one controller that is run at 128-bit wide rather than 64 bits wide, but basically same order of magnitude.)

                This is the constant on-going cheapness of the x86 world --- both Intel and AMD are so obsessed with product segmentation that they cripple their commodity CPUs in terms of the memory controllers, so that the server revenue is not compromised. That might seem like a great plan except all it's going to do is push everyone with a bandwidth-intensive but compute light workload to buy an ARM server in a year or three...
                My question was rhetorical. From the reviews I've read, thus far, BANDWIDTH isn't an issue. Having that big victim cache certainly helps matters.
                If you've come across a review that shows this to be an issue, I'd definitely read it.



                • #48
                  I wonder if the FLAC regression is due to some kind of interaction with the run-time SIMD detection?

                  FLAC assumes recent gcc versions always have certain ISA extensions available and builds them all in. It's a really common misconception, I think born out of the fact that most distribution toolchains are generic x86-64, and the generic build does always make them available. Since I always use a target-specific toolchain (including AVX math functions, which aren't the default even on supported CPUs!), I always patch FLAC, along with a few other projects like Boost and Chromium, to only include support for enabled extensions and to disable the run-time detection.
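To see which extensions a given -march setting enables at compile time, GCC can dump its predefined macros with `-dM -E`. The sketch below parses that dump; it is only an illustration of the compile-time side and is not how FLAC's build system works. The helper names and the short macro list are chosen for this example.

```python
import re
import subprocess

# A few of the ISA macros GCC predefines when the matching extension is enabled.
ISA_MACROS = ("__SSE2__", "__SSE4_2__", "__AVX__", "__AVX2__", "__FMA__")

_DEFINE = re.compile(r"^#define\s+(\w+)\s+1\s*$")


def enabled_isa(macro_dump, wanted=ISA_MACROS):
    """Return the subset of 'wanted' macros that the dump defines as 1."""
    defined = set()
    for line in macro_dump.splitlines():
        match = _DEFINE.match(line)
        if match:
            defined.add(match.group(1))
    return defined & set(wanted)


def dump_macros(march, cc="gcc"):
    """Run the preprocessor on empty input and return all predefined macros."""
    result = subprocess.run(
        [cc, f"-march={march}", "-dM", "-E", "-x", "c", "-"],
        input="", capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Comparing `enabled_isa(dump_macros("x86-64"))` with `enabled_isa(dump_macros("znver1"))` shows what a generic distribution build can assume versus what a target-specific toolchain enables.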



                  • #49
                    I found that the best performance with Zen can be had when building with "-march=znver1 -mtune=broadwell -mprefer-avx128". The performance increase over -mtune=znver1 in e.g. scimark is as high as 20% in some instances.
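For anyone who wants to reproduce this, the two flag sets and the gain calculation can be written down as follows. This is a sketch with invented helper names; it assumes a benchmark that reports a score where higher is better, and the scores themselves have to come from your own runs.

```python
# Baseline: -march=znver1 alone also tunes for znver1.
BASELINE_FLAGS = ["-march=znver1"]

# The combination reported above: Zen instruction set, Broadwell tuning,
# and a preference for 128-bit AVX operations.
MIXED_FLAGS = ["-march=znver1", "-mtune=broadwell", "-mprefer-avx128"]


def percent_gain(baseline_score, candidate_score):
    """Relative improvement of a higher-is-better score, in percent."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return (candidate_score - baseline_score) / baseline_score * 100.0
```

A result of 20.0 from `percent_gain` corresponds to the "as high as 20%" case; negative values mean the mixed tuning was slower for that test.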



                    • #50
                      I believe this needs to be retested with GCC 7. GCC 6 is not optimized for Zen according to Gentoo wiki: https://wiki.gentoo.org/wiki/Ryzen#GCC_6.3.2B

                      EDIT: Wrong thread.
