
AMD Compiler Optimization Benchmarks With GCC 4.10 (GCC 5.0)


  • AMD Compiler Optimization Benchmarks With GCC 4.10 (GCC 5.0)

    Phoronix: AMD Compiler Optimization Benchmarks With GCC 4.10 (GCC 5.0)

    As a continuation of yesterday's brief GCC 4.9 vs. GCC 4.10 (GCC 5.0) comparison with the AMD A10 A-Series "Kaveri" APU, here are some benchmarks using the GCC 4.10 development snapshot with a variety of CFLAGS/CXXFLAGS to see their current impact on performance across several Linux benchmarks...

    http://www.phoronix.com/vr.php?view=MTc2NDg

  • #2
    And with the default settings that each distro uses?

    • #3
      Originally posted by Nille View Post
      And with the default settings that each distro uses?
      I looked in the GCC docs and it seemed to suggest that there was no "default" setting, which surprised me at first but I guess makes sense. That means distros would actually have to pick a CPU for each build, though.

      • #4
        Originally posted by Nille View Post
        And with the default settings that each distro uses?
        Nille, the article is about new compiler developments and not about distros. You will not find a distro that ships 4.10 out of the box. The article is meant to give an outlook on what is coming in terms of new optimizations with the next major version of gcc.

        Trying to run benchmarks with a prerelease of gcc-4.10 and the compiler options the distros use is ... like trying to compare people of different countries based on their performance driving a new car's unfinished prototype. It would be as stupid as the Eurovision Song Contest.

        Each distro has a different opinion on what is best for its users. Many distros sacrifice a bit of performance for more security, for example when they use the stack-protector feature of gcc and write-protect code segments through linker options. You would be comparing apples to oranges and the result would only paint a false picture. Some people would start believing "distro A" is worse than "distro B" only because it is slower, without actually knowing where the difference comes from. It leads to bad, uninformed opinions.

        In general, if it is important to you that a piece of software runs very fast, then try to recompile the package or get the original code. This way you can target the optimizations directly at your specific hardware (e.g. with -march=native), which a distro cannot do, because it always needs to keep the code compatible with as many processors as possible. You can also disable options and features of a package: distros try to enable as many features as possible so that as many people as possible find a package useful, while most people do not need every single feature. Trimming down the features can make code more compact and faster, too.
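
        For illustration, a minimal sketch (the file name and exact flags are examples, not from the article): a trivially vectorizable loop that GCC may compile to wider SIMD instructions when allowed to target the local CPU.

        /* saxpy.c -- hypothetical example. Build it two ways and compare
           the generated assembly:
             gcc -O3 -march=x86-64 -mtune=generic -S saxpy.c -o generic.s
             gcc -O3 -march=native -S saxpy.c -o native.s
           On a Bulldozer/Kaveri-class CPU the native build may use AVX or
           FMA instructions that the generic build is not allowed to emit. */
        void saxpy(float a, const float *x, float *y, int n)
        {
            int i;
            for (i = 0; i < n; i++)
                y[i] = a * x[i] + y[i]; /* GCC can auto-vectorize this loop */
        }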

        If it is still not fast enough then you will probably already know that your hardware is too slow and you would only be trying to blame the software for it.

        • #5
          It's time to learn Gentoo?

          • #6
            Originally posted by bridgman View Post
            I looked in the GCC docs and it seemed to suggest that there was no "default" setting, which surprised me at first but I guess makes sense. That means distros would actually have to pick a CPU for each build, though.
            The default that distros use is -march=x86-64 -mtune=generic -O2. No need to pick a specific CPU.

            • #7
              Originally posted by souenzzo View Post
              It's time to learn Gentoo?
              You can always learn Antergos (Arch Linux). Compiling with your own compiler flags is a matter of modifying /etc/makepkg.conf with your preferences and compiling software from the repos with yaourt: yaourt -Sb for official packages and yaourt -S for packages from the AUR.
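
              The relevant makepkg.conf edit could look like this (a sketch; the values are illustrative, not the distribution's defaults):

              # /etc/makepkg.conf
              CFLAGS="-march=native -O2 -pipe"
              CXXFLAGS="${CFLAGS}"

              Packages rebuilt afterwards with yaourt -Sb will pick these flags up.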

              • #8
                Originally posted by sdack View Post
                In general, if it is important to you that a piece of software runs very fast, then try to recompile the package or get the original code. This way you can target the optimizations directly at your specific hardware (e.g. with -march=native),
                I once thought that, too. Then I benchmarked my Bulldozer: -mtune=generic vs. -march=native. And guess what? There was no difference! That's why I'd like to see -mtune=generic in these benchmarks, too. After all, this is roughly what you'll get from the distributions.
                Last edited by oleid; 08-16-2014, 04:50 PM.

                • #9
                  Originally posted by Nille View Post
                  And with the default settings that each distro uses?
                  For Fedora, the 64-bit packages are built with "-O2 -g -mtune=generic". The 32-bit ones are built with "-O2 -g -march=i686 -mtune=atom".

                  • #10
                    Originally posted by oleid View Post
                    I once thought that, too. Then I benchmarked my Bulldozer: -mtune=generic vs. -march=native. And guess what? There was no difference! That's why I'd like to see -mtune=generic in these benchmarks, too. After all, this is roughly what you'll get from the distributions.
                    -mtune= only selects the CPU type for instruction scheduling. -march= selects the CPU type for the instructions to use.

                    This certainly makes a difference, but it does not show with just any application; this is what the article is trying to present, by the way. For example, -march=k8 will select the standard x86 instruction set up to and including SSE2, -march=amdfam10 will further include SSE3 and SSE4A instructions, and -march=bdver1 will also include SSE4.1, SSE4.2 and AVX instructions.

                    Whereas -mtune=k8 will tell the compiler to schedule instructions for an L1=64k/L2=512k cache configuration and that moving an MMX/SSE register to integer has a cost of 5. Using -mtune=amdfam10 will make it use the same L1/L2 configuration, but sets the cost of MMX/SSE-to-integer moves down to 3. And -mtune=bdver1 will use L1=16k/L2=2048k for the cache sizes and sets the cost down to 2.

                    There are a lot more parameters hidden behind these switches; these are just some of the parameters GCC uses to make its decisions. The parameter "generic" simply picks good average values for all of them. The differences will not show unless you know exactly what to look for and choose an application that you know will benefit significantly from it. Only with very precise benchmarking tools and setups can one also detect the difference this makes for other applications. Results usually vary so much that one needs many runs before a clear difference becomes visible, because the effects are tiny and the run-to-run variation adds a lot of noise to the measurements. Hence the focus on ImageMagick and C-Ray.

                    What makes this special for gcc is that turning code into machine instructions is not a simple thing. The compiler needs to detect patterns in the code before it can decide to use the newer SSE4 and AVX instructions over the older MMX/SSE/SSE2 ones. Putting this into a compiler and making it use every last feature of new CPUs is a challenge.
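
                    To separate the two switches, a minimal sketch (hypothetical file name; the flags are the ones discussed above). Both builds draw from the same instruction set, granted by -march; only the scheduling/cost tables chosen by -mtune differ:

                    /* dot.c -- compare the generated assembly of:
                         gcc -O2 -march=bdver1 -mtune=generic -S dot.c -o a.s
                         gcc -O2 -march=bdver1 -mtune=bdver1  -S dot.c -o b.s */
                    float dot(const float *a, const float *b, int n)
                    {
                        float s = 0.0f;
                        int i;
                        for (i = 0; i < n; i++)
                            s += a[i] * b[i];
                        return s;
                    }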
                    Last edited by sdack; 08-16-2014, 05:59 PM.

                    • #11
                      Originally posted by sdack View Post
                      -mtune= only selects the CPU type for instruction scheduling. -march= selects the CPU type for the instructions to use.
                      [...]
                      Perfectly explained.

                      • #12
                        Originally posted by sdack View Post
                        -mtune= only selects the CPU type for instruction scheduling. -march= selects the CPU type for the instructions to use.
                        Yes, it's called generic optimization. The compiler will generate multiple versions of the very same code and decide at runtime which version to use.

                        Originally posted by sdack View Post
                        This certainly makes a difference, but it does not show with just any application; this is what the article is trying to present, by the way. For example, -march=k8 will select the standard x86 instruction set up to and including SSE2, -march=amdfam10 will further include SSE3 and SSE4A instructions, and -march=bdver1 will also include SSE4.1, SSE4.2 and AVX instructions.
                        The article wants to present the influence of different architecture optimizations on performance. But that has nothing to do with runtime CPU dispatching.

                        Originally posted by sdack View Post
                        [...]
                        There are a lot more parameters hidden behind these switches; these are just some of the parameters GCC uses to make its decisions. The parameter "generic" simply picks good average values for all of them. The differences will not show unless you know exactly what to look for and choose an application that you know will benefit significantly from it. Only with very precise benchmarking tools and setups can one also detect the difference this makes for other applications. Results usually vary so much that one needs many runs before a clear difference becomes visible, because the effects are tiny and the run-to-run variation adds a lot of noise to the measurements. Hence the focus on ImageMagick and C-Ray.
                        And that's exactly what I benchmarked maybe a year ago: -mtune=generic vs. -march=native on my E-350 for C-Ray and GraphicsMagick. Using the current compiler of that time (I believe it was gcc 4.7.x), there was no difference. I'm redoing the benchmark to check whether it's still true for gcc 4.9. Of course these results only apply to this very CPU with that compiler, like any empirical result.

                        Obviously, if you are doing numeric simulations you will compile with -march=native, but for most distribution packages this won't make a difference. When I got my E-350, I compiled a lot of packages with my own CFLAGS to get the most out of this CPU; nowadays I simply use the distribution-provided packages.

                        My point is only to include -mtune=generic in these benchmarks to get a glimpse of whether generic tuning does a good job for this CPU (it could be interesting for the compiler people).

                        • #13
                          Originally posted by oleid View Post
                          I once thought that, too. Then I benchmarked my Bulldozer: -mtune=generic vs. -march=native. And guess what? There was no difference! That's why I'd like to see -mtune=generic in these benchmarks, too. After all, this is roughly what you'll get from the distributions.
                          I believe on x64 that generic == k8

                          • #14
                            Originally posted by carewolf View Post
                            I believe on x64 that generic == k8
                            Arch Linux uses these:

                            CPPFLAGS="-D_FORTIFY_SOURCE=2"
                            CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
                            CXXFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
                            LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro"
                            Other binary distributions likely use something similar, perhaps with special flags for certain packages such as lame.

                            • #15
                              Originally posted by oleid View Post
                              Yes, it's called generic optimization. The compiler will generate multiple versions of the very same code and decide at runtime which version to use.
                              It does not matter what it is called; it is not being done here.

                              Some applications do have code to detect the CPU at run-time and can switch to different functions or plugins which make use of a particular instruction set. However, such a feature needs to be put into the code by the programmer; it does not come automatically from using gcc.

                              I suggest you read the documentation.
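
                              A minimal sketch of such hand-written dispatch, using GCC's __builtin_cpu_supports() (available since GCC 4.8; the function names are made up for illustration):

                              #include <stdio.h>

                              /* In real code the AVX path would live in its own
                                 translation unit compiled with -mavx, or carry a
                                 target attribute; these stubs only show the shape. */
                              static void work_avx(void)     { puts("AVX path"); }
                              static void work_default(void) { puts("baseline path"); }

                              int main(void)
                              {
                                  __builtin_cpu_init(); /* initialize CPU feature detection */
                                  if (__builtin_cpu_supports("avx"))
                                      work_avx();
                                  else
                                      work_default();
                                  return 0;
                              }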

                              Originally posted by oleid
                              My point is only to include -mtune=generic in these benchmarks to get a glimpse of whether generic tuning does a good job for this CPU (it could be interesting for the compiler people).
                              Yeah, and I would like to see some strippers, but this would also not be quite on the topic of the article.
