Intel Core i7 AVX GCC Compiler Tuning Results

  • Intel Core i7 AVX GCC Compiler Tuning Results

    Phoronix: Intel Core i7 AVX GCC Compiler Tuning Results

    For those owners of Intel's latest-generation Core i3/i5/i7 "Sandy Bridge" processors, here's a quick look at the impact of some GCC tuning options specific to these latest AVX-enabled Intel processors...

    http://www.phoronix.com/vr.php?view=MTA3NjE

  • #2
    Holy shit, I didn't expect such a boost with a simple compiler flag.
    ## VGA ##
    AMD: X1950XTX, HD3870, HD5870
    Intel: GMA45, HD3000 (Core i5 2500K)


    • #3
      Originally posted by darkbasic
      Holy shit, I didn't expect such a boost with a simple compiler flag.
      As many Gentoo users have found, compiler flags generally have one of the following effects:

      1: No effect at all
      2: The compiled binary does not work
      3: The compiled binary is slower
      4: The compiled binary is faster

      In addition, performance increases often require certain combinations of compiler flags, making the tweak more complex than adding "-march".

      I once had a bash script for a number of CLI binaries (notably ffmpeg, faac, flac) which would iterate through cflag combinations and compilers (gcc versus icc). After each iteration, the script would run an automated benchmark on the resulting binary. Results were dumped to a file and sorted. The issue that I ran into was that the results would change depending on factors such as platform arch, available memory, and CPU affinity. Other issues involved (pre)linking and libraries, killing misbehaving binaries, memory reclamation, etc.

      Overall, a system with local optimizations performed approximately 50% better on average than a generic "-O2" solution. The problem is that you will never be able to find and fix all of the minor issues caused by the optimizations across all binaries to a level required by a distro. My conclusion was that compiler optimizations are of great benefit to single-task servers (a transcoding server in my case), but are currently out of reach for a general desktop.

      Then I bought a Mac....


      • #4
        Originally posted by russofris
        As many Gentoo users have found, compiler flags generally have one of the following effects:

        1: No effect at all
        2: The compiled binary does not work
        3: The compiled binary is slower
        4: The compiled binary is faster

        In addition, performance increases often require certain combinations of compiler flags, making the tweak more complex than adding "-march".

        I once had a bash script for a number of CLI binaries (notably ffmpeg, faac, flac) which would iterate through cflag combinations and compilers (gcc versus icc). After each iteration, the script would run an automated benchmark on the resulting binary. Results were dumped to a file and sorted. The issue that I ran into was that the results would change depending on factors such as platform arch, available memory, and CPU affinity. Other issues involved (pre)linking and libraries, killing misbehaving binaries, memory reclamation, etc.

        Overall, a system with local optimizations performed approximately 50% better on average than a generic "-O2" solution. The problem is that you will never be able to find and fix all of the minor issues caused by the optimizations across all binaries to a level required by a distro. My conclusion was that compiler optimizations are of great benefit to single-task servers (a transcoding server in my case), but are currently out of reach for a general desktop.

        Then I bought a Mac....
        Incredible. Possible expected outcome is one option out of the entire set of options?! YOU DON'T SAY!!!!

        I too then bought a Mac and love the little things, like sleep that works.


        • #5
          Originally posted by russofris
          As many Gentoo users have found, compiler flags generally have one of the following effects:

          1: No effect at all
          2: The compiled binary does not work
          3: The compiled binary is slower
          4: The compiled binary is faster

          In addition, performance increases often require certain combinations of compiler flags, making the tweak more complex than adding "-march".

          I once had a bash script for a number of CLI binaries (notably ffmpeg, faac, flac) which would iterate through cflag combinations and compilers (gcc versus icc). After each iteration, the script would run an automated benchmark on the resulting binary. Results were dumped to a file and sorted. The issue that I ran into was that the results would change depending on factors such as platform arch, available memory, and CPU affinity. Other issues involved (pre)linking and libraries, killing misbehaving binaries, memory reclamation, etc.

          Overall, a system with local optimizations performed approximately 50% better on average than a generic "-O2" solution. The problem is that you will never be able to find and fix all of the minor issues caused by the optimizations across all binaries to a level required by a distro. My conclusion was that compiler optimizations are of great benefit to single-task servers (a transcoding server in my case), but are currently out of reach for a general desktop.

          Then I bought a Mac....
          Did you try profile-guided optimization? My guess is that it may be the easiest way to get the best binary without resorting to potentially dangerous flags.


          • #6
            Originally posted by russofris
            As many Gentoo users have found, compiler flags generally have one of the following effects:

            1: No effect at all
            2: The compiled binary does not work
            3: The compiled binary is slower
            4: The compiled binary is faster

            In addition, performance increases often require certain combinations of compiler flags, making the tweak more complex than adding "-march".

            I once had a bash script for a number of CLI binaries (notably ffmpeg, faac, flac) which would iterate through cflag combinations and compilers (gcc versus icc). After each iteration, the script would run an automated benchmark on the resulting binary. Results were dumped to a file and sorted. The issue that I ran into was that the results would change depending on factors such as platform arch, available memory, and CPU affinity. Other issues involved (pre)linking and libraries, killing misbehaving binaries, memory reclamation, etc.

            Overall, a system with local optimizations performed approximately 50% better on average than a generic "-O2" solution. The problem is that you will never be able to find and fix all of the minor issues caused by the optimizations across all binaries to a level required by a distro. My conclusion was that compiler optimizations are of great benefit to single-task servers (a transcoding server in my case), but are currently out of reach for a general desktop.

            Then I bought a Mac....
            My friend, do you have those results posted anywhere? I would be interested in seeing the best flag combination for those apps, as I use my PC mostly for transcoding. Perhaps you still have those automated benchmarks? I would greatly appreciate you sharing the knowledge.

            regards,
            rz


            • #7
              Originally posted by russofris
              Overall, a system with local optimizations performed approximately 50% better on average than a generic "-O2" solution.
              That sounds extremely good. Do you remember which programs made up the bulk of this increase in performance? Personally, I see little to no point in optimizing an entire system, kernel included, unless you need extremely low latency. If you are using computationally intensive programs on a very regular basis for tasks which span long periods of time, then yes, I think there's a good reason for compiling these with more aggressive compiler optimizations; I'm talking encoders, compressors, renderers, that kind of stuff. Unless, of course, you think it's just fun to poke around and try to optimize your system as much as possible; then it's no better or worse a way to spend your own time than any other hobby out there.

              Originally posted by WorBlux
              Did you try profile-guided optimization? My guess is that it may be the easiest way to get the best binary without resorting to potentially dangerous flags.
              I agree. One of the very potent optimizations available is loop unrolling; however, because it is difficult to estimate at compile time whether unrolling will pay off, it's not turned on by default. When you use profile-guided optimization, the compiler has the runtime data it needs to make accurate choices when unrolling, and thus turns it on by default. The downside to PGO is that you need to compile in two stages with a test run between them. This can of course be automated, as is done with Firefox, x264, etc.
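
The two-stage build described in this post can be sketched roughly as follows. This is only an illustration, not the setup that Firefox or x264 actually use. It assumes GCC's -fprofile-generate and -fprofile-use options; the helper name pgo_commands and the file names encoder.c and sample.wav are hypothetical. The function only builds the three command lines, so they can be inspected before anything is run.

```python
import shlex


def pgo_commands(source, binary, training_args, cc="gcc", base_flags=("-O2",)):
    """Return the three shell command lines of a two-stage PGO build.

    All arguments are illustrative; nothing is executed here.
    """
    base = [cc, *base_flags]
    # Stage 1: build an instrumented binary that records a profile when run.
    instrumented = base + ["-fprofile-generate", source, "-o", binary]
    # Test run: exercise the instrumented binary on representative input.
    training = ["./" + binary, *training_args]
    # Stage 2: rebuild so that GCC reads the profile written by the test run.
    optimized = base + ["-fprofile-use", source, "-o", binary]
    return [shlex.join(cmd) for cmd in (instrumented, training, optimized)]
```

Calling pgo_commands("encoder.c", "encoder", ["sample.wav"]) returns the instrumented build, the training run, and the final build, in that order. The quality of the result depends on how representative the training input is of real workloads.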


              • #8
                Originally posted by russofris
                Then I bought a Mac....
                ... and??...

                Originally posted by Tgui
                I too then bought a Mac and love the little things, like sleep that works.
                Funny, I keep reading everywhere that Macs are pieces of technological wonder, but my Mac Mini fails to boot properly 1 out of 3 boots. It just stays on the grey screen forever. Also, I can't let it turn off the screen through DPMS, otherwise the screen is never going to wake up again unless I reboot. On the other hand, all my cheap homebuilt PCs running Ubuntu suspend and hibernate perfectly fine all the time.


                • #9
                  Originally posted by devius
                  ... and??...

                  Funny, I keep reading everywhere that Macs are pieces of technological wonder, but my Mac Mini fails to boot properly 1 out of 3 boots. It just stays on the grey screen forever. Also, I can't let it turn off the screen through DPMS, otherwise the screen is never going to wake up again unless I reboot. On the other hand, all my cheap homebuilt PCs running Ubuntu suspend and hibernate perfectly fine all the time.
                  Opposite experience here... I've got a 13" Core 2 Duo MacBook Pro and it sleeps/wakes perfectly, as does my wife's Sandy Bridge MBP (13"), and her old iBook (G4), and her brother's and mother's systems (all of them either in or previously in the publishing industry).

                  My desktop (Athlon x2 5000+, then Phenom II x3, then an x6 and Radeon 4850, then a 4770, then 6850) hasn't woken from sleep properly in the last few years, not even once... even with an Ubuntu reinstall and then an eventual replacement with Mint.

                  I won't say that sleep is universally broken on my PCs in Linux (most of the other ones work), but it is for this one.


                  • #10
                    Originally posted by Veerappan
                    Opposite experience here... I've got a 13" Core 2 Duo MacBook Pro and it sleeps/wakes perfectly, as does my wife's Sandy Bridge MBP (13"), and her old iBook (G4), and her brother's and mother's systems (all of them either in or previously in the publishing industry).

                    My desktop (Athlon x2 5000+, then Phenom II x3, then an x6 and Radeon 4850, then a 4770, then 6850) hasn't woken from sleep properly in the last few years, not even once... even with an Ubuntu reinstall and then an eventual replacement with Mint.

                    I won't say that sleep is universally broken on my PCs in Linux (most of the other ones work), but it is for this one.
                    My guess is BIOS issues, or a particularly buggy chipset that doesn't reload properly on resume.


                    • #11
                      Originally posted by WorBlux
                      Did you try profile-guided optimization? My guess is that it may be the easiest way to get the best binary without resorting to potentially dangerous flags.
                      PGO is great. Gives me an easy 5-10% performance boost when optimising wined3d (can't profile the entire wine app because of a bug). PGO works great with Dolphin as well.


                      • #12
                        Smallpt didn't use any SIMD instructions

                        The real advantage of AVX is the width of its registers: they are 256 bits wide, and Smallpt didn't use SIMD.
                        So, for a really good review, you have to include some programs that make use of SIMD.


                        • #13
                          @ Do I have the results? No. The experiments were back during the Gentoo 1.4 days (GCC 3.X, ICC7, probably back in the 2007-ish time frame?). You might find partial results on the Gentoo forums if they have posts from back then.
                          @ Do I remember which binaries? FLAC, FAAC, ImageMagick, LAME and MEncoder (with supporting libraries).
                          @ Do I remember which flags worked best? No, and even if I did, they would probably not apply on modern systems.
                          @ Did you use the super duper new tuner/profiler? No, as I do not believe that it existed at the time. If it did, I was completely unaware of it.

                          What I did was nothing special. I created an array of cflags and then walked through the combinations, running a timed benchmark each iteration. It was honestly 3 lines of bash for-loop-foo per target app and a single file containing comma-delimited cflags. Gentoo made it easy as the build system was already set up.
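
For readers who want to try something similar, the loop described in this post might look roughly like the sketch below. It is not the original bash script, which is not available. It is written in Python, and every name in it (flag_combinations, time_command, rank_builds, bench_bin) is hypothetical. It assumes a single-file C program built directly with gcc rather than through Gentoo's build system, and it does not check that each binary still produces correct output.

```python
import itertools
import subprocess
import time


def flag_combinations(flags):
    """Yield every non-empty subset of `flags`, smallest subsets first."""
    for size in range(1, len(flags) + 1):
        for combo in itertools.combinations(flags, size):
            yield list(combo)


def time_command(cmd):
    """Run `cmd` (an argv list) and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def rank_builds(source, flags, bench_args, cc="gcc"):
    """Build `source` once per flag combination, time one benchmark run of
    each build, and return (seconds, flags) pairs sorted fastest first."""
    results = []
    for combo in flag_combinations(flags):
        # Hypothetical output name; each build overwrites the previous one.
        subprocess.run([cc, "-O2", *combo, source, "-o", "bench_bin"], check=True)
        results.append((time_command(["./bench_bin", *bench_args]), combo))
    return sorted(results)
```

The number of combinations doubles with every flag added, so the candidate list needs to stay short. A single timed run is also noisy, which matches the earlier observation that results shifted with available memory and CPU affinity; repeating each run and keeping the median would be a reasonable refinement.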

                          The biggest reason why I scratched it was that I would end up with a working FLAC binary, but random apps that linked to libflac.so would bomb. At that point, it seemed that it really wasn't important enough to me to invest additional time writing automated tests for every app that linked each library. In addition, I had already found an alternate solution to all of my issues.

                          F


                          • #14
                            Originally posted by AnonymousCoward
                            PGO is great. Gives me an easy 5-10% performance boost when optimising wined3d (can't profile the entire wine app because of a bug). PGO works great with Dolphin as well.
                            Hi,

                            how did you get this going with wined3d only?
                            Do you have some kind of howto?

                            I'm very interested in a 5-10% performance increase in the wine d3d area.
                            Nowadays I'm compiling wine with -march=native, but that doesn't give much of an fps increase.


                            Many thanks,
                            Christian


                            • #15
                              The article mentioned:

                              The -march=corei7-avx option is most appropriate for Sandy Bridge since it enables the Advanced Vector Extensions support as well as the AES and PCLMUL instruction sets for Sandy Bridge. Here's the overview from the GCC i386/x86_64 options page:
                              `core2'
                              Intel Core 2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3 instruction set support.
                              `corei7'
                              Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2 instruction set support.
                              `corei7-avx'
                              Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AES and PCLMUL instruction set support.
                              What about -mtune? Was that used as well?
