GCC 4.6 Compiler Performance With AVX

  • GCC 4.6 Compiler Performance With AVX

    Phoronix: GCC 4.6 Compiler Performance With AVX

    While we are still battling issues with the Intel Linux graphics driver in getting it running properly with Intel's new Sandy Bridge CPUs (at least Intel's Jesse Barnes is now able to reproduce the most serious problem we've been facing, but we'll save the new graphics information for another article), the CPU performance continues to be very compelling. Two weeks ago we published the Intel Core i5 2500K Linux benchmarks, which showed just how well this quad-core CPU costing a little more than $200 USD is able to outperform previous generations of Intel hardware. That was just with running the standard open-source benchmarks and other Linux software, which has not been optimized for Intel's latest micro-architecture. Version 4.6 of the GNU Compiler Collection (GCC), though, is gearing up for release and will bring support for the AVX extensions. In this article, we are benchmarking GCC 4.6 on a Sandy Bridge system to see what benefits there are to enabling the Core i7 AVX optimizations.

    http://www.phoronix.com/vr.php?view=15665

  • #2
    The GCC 4.6.x Gcrypt, GraphicsMagick and HMMer results are so abysmal that bugs must be filed immediately.

    • #3
      Many of these results are very unsurprising.

      One would not expect any kind of change in an HTTP server from AVX/SSE/MMX/AltiVec/NEON, except perhaps in SSL performance or gzip compression performance (which may not be tested by that benchmark, I suspect).

      In many other cases, AVX is basically going to perform identically to SSE2. In a few cases with auto-vectorization of code it's possible for the compiler to cut the number of vector instructions down dramatically (twice as many components per vector in AVX as in SSE).

      In most cases with highly optimized code bases, they are using hand-rolled SSE code. So compiling with AVX turned on will have no effect because the code is explicitly using 128-bit SSE.

      For a lot of things like graphics where the code is very explicitly written around four-component vectors, the most AVX is going to offer is the ability to use double-precision floats instead of single-precision floats. But nobody actually uses double precision there, because it eats up twice as much memory and twice as much bandwidth to the GPU, and it just makes things even slower: GPUs don't use double-precision floats internally, so the driver has to manually convert those buffers from double to single precision before uploading to the GPU. A particularly clever programmer could manage to combine many vector operations and use AVX to perform two such operations simultaneously, but that code will be complex, and writing and maintaining it will be pure hell compared to using an SSE-based vector class.

      The apps that will benefit the most from AVX extensions with this option are applications that (a) have not been hand-optimized to already use SSE primitives and (b) make use of large arrays processed in loops which can actually be auto-vectorized.
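
      As a rough sketch (my example, not from the article), this is the kind of loop GCC's auto-vectorizer can widen from 128-bit SSE to 256-bit AVX when built with something like gcc -std=c99 -O3 -march=corei7-avx; the function name and signature are purely illustrative:

      Code:
      /* With auto-vectorization (enabled at -O3), GCC can process 8 floats
         per 256-bit AVX instruction here instead of 4 per 128-bit SSE
         instruction. */
      void saxpy(float *restrict y, const float *restrict x, float a, int n)
      {
          for (int i = 0; i < n; ++i)
              y[i] = a * x[i] + y[i];
      }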

      I'm fairly sure that will mostly boil down to scientific applications and a handful of unsupported and previously slow as crap codec libraries.

      • #4
        Well, Intel hired CodeSourcery (long-time contributors to GCC) to work specifically on GCC optimizations for the Core iX range, and AFAIK those optimizations were not ready in time to be included in GCC 4.6, so chances are there's a lot more performance coming our way in future GCC releases.

        • #5
          Please can you try these tests again, but without LTO switched on?

          From my testing (even on 4.6) it causes huge regressions in the speed and size of all executables. It isn't as bad when using gold as the linker, but glibc can't be compiled with that yet.

          Also, what flags are being used to compile the software you're running?

          • #6
            Originally posted by FireBurn View Post
            Please can you try these tests again, but without LTO switched on?

            From my testing (even on 4.6) it causes huge regressions in the speed and size of all executables. It isn't as bad when using gold as the linker, but glibc can't be compiled with that yet.
            That is obviously because you are not using -flto properly. You need to pass the optimization flags (CFLAGS, CXXFLAGS) to the linker as well (LDFLAGS), or else the resulting code will not be optimized at all and will thus be larger and slower. Read the documentation on -flto. AFAIK this won't be necessary in GCC 4.6 when it is released, but it certainly is on 4.5.x. As for performance improvements with -flto, it very much depends on the program as per usual, but it pretty much always manages to cut down the executable size by a good margin.

            • #7
              Actually I was using LTO properly. The reason the packages were so big was mostly down to the .a static archives containing all the extra information.

              The resulting binaries were still larger and noticeably slower when using LTO - I tried recompiling my whole system using it.

              • #8
                Originally posted by FireBurn View Post
                Actually I was using LTO properly. The reason the packages were so big was mostly down to the .a static archives containing all the extra information.
                What extra information? If you are using LTO (-flto, -fwhole-program) correctly, there is NO extra information ending up in either binaries or libraries (libs will not work with -fwhole-program). What you are most likely talking about is the GIMPLE code that is placed in the ELF sections during compilation with -flto, which is used (and of course stripped from the resulting binary) to perform LTO during the linker stage. If you do not add the necessary flags to the linker stage (CFLAGS/CXXFLAGS, -flto, -fwhole-program if applicable), then your program will end up unoptimized AND with its ELF sections filled with GIMPLE code, which means slower and bigger.
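
                As a side note (a sketch, not something from this thread), you can check whether GIMPLE bytecode ended up in an object file or binary by listing its sections, since GCC stores the LTO information in .gnu.lto_* sections:

                Code:
                $ objdump -h program | grep -i lto
                # any .gnu.lto_* sections listed here mean LTO bytecode was kept in the file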

                Take a program and either edit its makefile or ./configure it; a concrete invocation is sketched below these steps:

                add -flto -fwhole-program to the CFLAGS/CXXFLAGS
                add -flto -fwhole-program AND whatever optimization flags you use to the LDFLAGS

                example:
                CFLAGS = -O3 -march=native -flto -fwhole-program
                LDFLAGS = -Wall -O3 -march=native -flto -fwhole-program

                compile
                compare program executable size
                benchmark
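
                As a concrete sketch of those steps for an autotools-style package (the package, paths and program name here are hypothetical):

                Code:
                # pass the same optimization + LTO flags to both the compile and link steps
                $ ./configure CFLAGS="-O3 -march=native -flto -fwhole-program" \
                              CXXFLAGS="-O3 -march=native -flto -fwhole-program" \
                              LDFLAGS="-O3 -march=native -flto -fwhole-program"
                $ make
                $ ls -l src/someprogram    # compare the executable size against a non-LTO build
                $ time src/someprogram     # then benchmark it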

                • #9
                  It'd be helpful if you included the exact configure line used to build GCC. In particular, did you use --enable-checking="release" for 4.6? If not then it's defaulting to "yes" because it's a snapshot and you're comparing apples to oranges.
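
                  As a quick sketch of how to check that (assuming the compilers are installed under those names), GCC echoes its configure line back in its verbose output:

                  Code:
                  $ gcc-4.6.0-pre9999 -v 2>&1 | grep 'Configured with'
                  # the 'Configured with:' line shows whether --enable-checking=release was used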

                  Also, I'm a little confused about what optimization options were used with the earlier GCC versions. You say "This was then followed by building out our test library [...] with the core2, corei7, and corei7-avx options". I'm assuming you're talking about 4.6 since the latter two flags don't exist < 4.6. I also see a 4.6.0 entry with no arch listed. Does this mean that no -march flag was used for 4.3.5, 4.4.5, 4.5.2, and the bare 4.6.0? If so, your results are going to vary between versions because they have different -march/-mtune values when none are given on the command line.

                  Code:
                  $ echo "int main() { return 0; }" | gcc-4.3.5 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.3.5/cc1 -E -quiet -v - -mtune=generic
                  $ echo "int main() { return 0; }" | gcc-4.4.5 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.4.5/cc1 -E -quiet -v - -mtune=generic
                  $ echo "int main() { return 0; }" | gcc-4.5.2 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.5.2/cc1 -E -quiet -v - -mtune=generic -march=x86-64
                  $ echo "int main() { return 0; }" | gcc-4.6.0-pre9999 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0-pre9999/cc1 -E -quiet -v - -mtune=generic -march=x86-64
                  (as of 4.5 the default for -march is based on the target tuple, in this case x86_64-unknown-linux-gnu)

                  While it may be that you're demonstrating the differences between the "defaults" of different GCC versions, it would be far more interesting IMO to see how 4.{3..6} with -march=x86-64 (or even core2) stack up.

                  • #10
                    Unless they were using -flto when compiling, there will be no difference between a 4.6 configured with --enable-lto and one configured without it.

                    • #11
                      Originally posted by elanthis View Post
                      Many of these results are very unsurprising.

                      (...)

                      I'm fairly sure that will mostly boil down to scientific applications and a handful of unsupported and previously slow as crap codec libraries.
                      You're pointing out exactly the issues: AVX only helps in spots of code that can actually use its double-width vectors. Also, AMD at least has said that its first-generation AVX will be implemented internally in microcode as two SSE operations, and since I don't have any Intel information about how they did it, even code that hits the AVX optimizations probably won't show dramatic gains.
                      In the end I just hope that benchmarks will focus more on workloads that can actually extract those gains.
                      For example, FFmpeg can be compiled with no hand-written ASM, and if its code matches some of the compiler's auto-vectorization patterns it will likely get some speedup (a build sketch follows below). The same goes for a renderer or for scientific code.
                      As Phoronix uses Linux, I think the main speedup will go largely unnoticed, because the whole desktop works fine with just the SSE2 that even an Atom CPU supports, and some components are written in Python and so on.
                      Also, as the results become fairly predictable, it would be better to benchmark things like a kernel picking up a new scheduling strategy (as was the case with BFS). Otherwise most of these results will be just noise, and at large I personally think that will hurt the compiler work and the hard work of the GCC team.
                      Lately I've found it much more fun to test the JS performance of Firefox myself than to read these benchmarks, and far more people are affected by how a real browser performs.
                      Mono has LLVM JIT support. How much is the start-up time of a big app (MonoDevelop comes to mind) impacted? What about testing its raw number-crunching performance against a GCC/C++ port of the same code, or something along those lines?
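
                      A rough sketch of that FFmpeg idea (configure option names vary between FFmpeg versions, so treat this as illustrative only):

                      Code:
                      # build FFmpeg with the hand-written assembly disabled, so only
                      # compiler-generated (possibly auto-vectorized) code gets measured
                      $ ./configure --disable-asm --extra-cflags="-O3 -march=corei7-avx"
                      $ make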

                      • #12
                        "GCC 4.6 also can be built with the --with-fpmath=avx flag, which will allow the GNU compiler to use AVX floating-point arithmetic."

                        IIRC, binaries made by a compiler that was itself compiled with a newer compiler benefit too.

                        The idea is as outlined below.

                        Binary B1 made by compiler C1 itself compiled with C1 gives performance P1.

                        Binary B2 made by compiler C2 itself compiled with C1 gives performance P2, where P2 > P1.

                        Binary B3 made by compiler C2 itself compiled with C2 gives performance P3, where P3 > P2 > P1.

                        Is that true, or is P2 = P3?

                        Thanks for any insightful comments!

                        • #13
                          Originally posted by sabriah View Post
                          "GCC 4.6 also can be built with the --with-fpmath=avx flag, which will allow the GNU compiler to use AVX floating-point arithmetic."

                          IIRC, binaries made by a compiler that was itself compiled with a newer compiler benefit too.

                          The idea is as outlined below.

                          Binary B1 made by compiler C1 itself compiled with C1 gives performance P1.

                          Binary B2 made by compiler C2 itself compiled with C1 gives performance P2, where P2 > P1.

                          Binary B3 made by compiler C2 itself compiled with C2 gives performance P3, where P3 > P2 > P1.

                          Is that true, or is P2 = P3?

                          Thanks for any insightful comments!
                          That is "quote mining": Michael's statement was not about performance (whether the compiler itself will use AVX or not), but simply that if you set that flag at the configure step (of the configure, make, make install sequence), the resulting compiler is able to enable AVX instruction generation.
                          It also says nothing about performance, because you can cross-compile: you could build the compiler on an Atom, a 486 or a PowerPC CPU and you would still get the same-performing binaries out of it.
                          I also think you misunderstand what benefit the compiler itself would get from using such instructions, as opposed to the final binary's performance. A compiler mostly struggles to make your binary use a minimum of registers and to make the code fit in the L1 cache; those are the main areas where the compiler binary itself could benefit.
                          AVX (and the whole family going back to the MMX era) are SIMD instructions, which means that if you have data that can be processed in parallel as a block, for example a matrix multiplication, you can get benefits there. So if your program is written so that the compiler can recognize those patterns, the resulting instructions benefit from that explicit parallelism, and that is where the gains come from. They mostly combine with loop-unrolling optimizations.
                          Also, most of those instructions benefit floating-point code, which is another interesting point, because regular applications already get fairly good performance without any AVX in the final binaries.

                          • #14
                            Originally posted by ciplogic View Post
                            That is "quote mining"
                            That wasn't the intention, but I get the point. Thanks for the explanation!

                            • #15
                              Originally posted by sabriah View Post
                              IIRC, binaries made by a compiler that was itself compiled with a newer compiler benefit too.

                              The idea is as outlined below.

                              Binary B1 made by compiler C1 itself compiled with C1 gives performance P1.

                              Binary B2 made by compiler C2 itself compiled with C1 gives performance P2, where P2 > P1.

                              Binary B3 made by compiler C2 itself compiled with C2 gives performance P3, where P3 > P2 > P1.

                              Is that true, or is P2 = P3?

                              Thanks for any insightful comments!
                              GCC is built in a three-stage bootstrap, meaning the compiler that gets installed is _always_ compiled with itself.

                              Let's say you have 4.5.2 installed and are building 4.6.0. In the first stage, the 4.6.0 compiler is built with GCC 4.5.2. The second stage then rebuilds 4.6.0 with the compiler that was just built in stage 1. Finally, the third stage uses the compiler built in stage 2 to build itself one more time. The stage 2 and stage 3 compilers are then compared to ensure they are identical.
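
                              A rough sketch of what that looks like in practice (the directory layout and the extra configure options are just illustrative; --with-fpmath=avx is the flag the article mentions):

                              Code:
                              # out-of-tree GCC build; 'make bootstrap' runs all three stages
                              # and compares the stage 2 and stage 3 objects
                              $ mkdir build && cd build
                              $ ../gcc-4.6.0/configure --enable-checking=release --enable-languages=c,c++ --with-fpmath=avx
                              $ make bootstrap
                              $ make install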
