GCC 4.6 Compiler Performance With AVX

  • GCC 4.6 Compiler Performance With AVX

    Phoronix: GCC 4.6 Compiler Performance With AVX

    While we are still battling issues getting the Intel Linux graphics driver to run properly with Intel's new Sandy Bridge CPUs (at least Intel's Jesse Barnes is now able to reproduce the most serious problem we've been facing, but we'll save the new graphics information for another article), the CPU performance continues to be very compelling. Two weeks ago we published the Intel Core i5 2500K Linux benchmarks, which showed just how well this quad-core CPU costing a little more than $200 USD outperforms previous generations of Intel hardware. That was just with the standard open-source benchmarks and other Linux software, none of which has been optimized for Intel's latest micro-architecture. Version 4.6 of the GNU Compiler Collection (GCC), though, is gearing up for release, and it will bring support for the AVX extensions. In this article, we are benchmarking GCC 4.6 on a Sandy Bridge system to see what benefits there are to enabling the Core i7 AVX optimizations.


  • #2
    The GCC 4.6.x Gcrypt, GraphicsMagick, and HMMer results are so abysmal that bugs must be filed immediately.


    • #3
      Many of these results are very unsurprising.

      One would not expect any kind of change in an HTTP server from AVX/SSE/MMX/AltiVec/NEON, except perhaps in SSL performance or gzip compression performance (which may not be tested by that benchmark, I suspect).

      In many other cases, AVX is basically going to perform identically to SSE2. In a few cases with auto-vectorization of code it's possible for the compiler to cut the number of vector instructions down dramatically (twice as many components per vector in AVX as in SSE).
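
      For example, a minimal sketch (the function name and flags here are mine, not from the article) of the kind of loop GCC 4.6's auto-vectorizer can compile down to 256-bit AVX instructions:

      Code:
      /* With -O3 -march=corei7-avx, GCC 4.6 can vectorize this loop using
         256-bit ymm registers (8 floats per iteration) instead of the
         4-float xmm registers SSE is limited to.
         Compile: gcc -std=c99 -O3 -march=corei7-avx -ftree-vectorizer-verbose=1 -c saxpy.c */
      void saxpy(float *restrict y, const float *restrict x, float a, int n)
      {
          for (int i = 0; i < n; i++)
              y[i] = a * x[i] + y[i];
      }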

      Most highly optimized code bases use hand-rolled SSE code, so compiling with AVX turned on will have no effect: the code is explicitly using 128-bit SSE.

      For a lot of things like graphics, where the code is written explicitly around four-component vectors, the most AVX is going to offer is the ability to use double-precision floats instead of single-precision floats. Nobody actually uses double precision there, though, because it eats up twice as much memory and twice as much bandwidth to the GPU, and it just makes things even slower: GPUs don't use double-precision floats internally, so the driver has to convert those buffers from double to single precision before uploading them to the GPU. A particularly clever programmer could manage to combine many vector operations and use AVX to perform two such operations simultaneously, but that code will be complex, and trying to write and maintain it will be pure hell compared to using an SSE-based vector class.
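
      As a hedged illustration of both points (the intrinsic names are from Intel's immintrin.h; the wrapper functions themselves are hypothetical): hand-written SSE stays at 128 bits no matter what -march says, while the AVX variant packs two four-component operations into one 256-bit instruction.

      Code:
      #include <immintrin.h>

      /* Hand-rolled SSE: one 4-float vector per op, even when compiled
         with -mavx, because the 128-bit width is explicit in the code. */
      void add_vec4_sse(float *out, const float *a, const float *b)
      {
          _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
      }

      /* AVX (requires -mavx or -march=corei7-avx): two independent vec4
         additions folded into a single 256-bit operation, provided the
         data can be laid out contiguously. */
      void add_two_vec4_avx(float *out, const float *a, const float *b)
      {
          _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a),
                                              _mm256_loadu_ps(b)));
      }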

      The apps that will benefit the most from the AVX extensions with this option are applications that (a) have not already been hand-optimized to use SSE primitives and (b) make use of large arrays processed in loops that can actually be auto-vectorized.

      I'm fairly sure that will mostly boil down to scientific applications and a handful of unsupported and previously slow-as-crap codec libraries.


      • #4
        Well, Intel hired CodeSourcery (long-time contributors to GCC) to work specifically on GCC optimizations for the Core iX range, and afaik those optimizations were not ready in time to be included in GCC 4.6, so chances are there's a lot more performance coming our way in future GCC releases.


        • #5
          Please can you try these tests again, but without LTO switched on?

          From my testing (even on 4.6) it causes huge regressions in the speed and size of all executables. It isn't as bad when using gold as the linker, but glibc can't be compiled with that yet.

          Also what flags are being used to compile the software you're running?


          • #6
            Originally posted by FireBurn View Post
            Please can you try these tests again, but without LTO switched on?

            From my testing (even on 4.6) it causes huge regressions in the speed and size of all executables. It isn't as bad when using gold as the linker, but glibc can't be compiled with that yet.
            That is obviously because you are not using -flto properly. You need to pass the optimization flags (CFLAGS, CXXFLAGS) to the linker as well (LDFLAGS), or else the resulting code will not be optimized at all and will thus be larger and slower. Read the documentation on -flto. Afaik this won't be necessary in GCC 4.6 when it is released, but it certainly is on 4.5.x. As for performance improvements with -flto, it very much depends on the program as usual, but it pretty much always manages to cut down the executable size by a good margin.


            • #7
              Actually I was using LTO properly. The reason the packages were so big was mostly down to the .a static archives containing all the extra information.

              The resulting binaries were still larger and noticeably slower when using LTO - I tried recompiling my whole system using it


              • #8
                Originally posted by FireBurn View Post
                Actually I was using LTO properly. The reason the packages were so big was mostly down to the .a static archives containing all the extra information.
                What extra information? If you are using LTO (-flto, -fwhole-program) correctly, there is NO extra information ending up in either binaries or libraries (libs will not work with -fwhole-program). What you are most likely talking about is the GIMPLE code that is placed in the ELF sections during compilation with -flto, which is used (and of course STRIPPED from the resulting binary) to perform LTO during the linker stage. If you do not add the necessary flags to the linker stage (CFLAGS/CXXFLAGS, -flto, -fwhole-program if applicable), then your program will end up unoptimized AND its ELF sections filled with GIMPLE code, which means slower and bigger.
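
                One way to see this for yourself (the binary and archive names here are hypothetical): GCC stores the GIMPLE bytecode in .gnu.lto_* ELF sections, so you can check what actually made it into the final files:

                Code:
                # a correctly linked LTO binary should have no .gnu.lto_* sections left
                objdump -h myprog | grep '\.gnu\.lto'
                # static archives built with -flto DO keep them, which is where the
                # extra size in .a files comes from
                objdump -h libfoo.a | grep '\.gnu\.lto'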

                Take a program and either edit its makefile or ./configure it:

                add -flto -fwhole-program to the CFLAGS/CXXFLAGS
                add -flto -fwhole-program AND whatever optimization flags you use to the LDFLAGS

                example:
                Code:
                CFLAGS = -O3 -march=native -flto -fwhole-program
                LDFLAGS = -Wall -O3 -march=native -flto -fwhole-program

                compile
                compare program executable size
                benchmark
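
                Something like this, for a hypothetical two-file project:

                Code:
                # baseline vs. LTO build of the same sources
                gcc -O3 -march=native                       -o prog     main.c util.c
                gcc -O3 -march=native -flto -fwhole-program -o prog-lto main.c util.c
                size prog prog-lto              # compare section sizes
                time ./prog && time ./prog-lto  # crude benchmark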


                • #9
                  It'd be helpful if you included the exact configure line used to build GCC. In particular, did you use --enable-checking="release" for 4.6? If not, then it's defaulting to "yes" because it's a snapshot, and you're comparing apples to oranges.
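
                  For reference, a release-checking build is configured with something like the following (the prefix and language list here are just illustrative):

                  Code:
                  ../gcc-4.6.0/configure --prefix=/opt/gcc-4.6 \
                      --enable-checking=release \
                      --enable-languages=c,c++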

                  Also, I'm a little confused about what optimization options were used with the earlier GCC versions. You say "This was then followed by building out our test library [...] with the core2, corei7, and corei7-avx options". I'm assuming you're talking about 4.6 since the latter two flags don't exist < 4.6. I also see a 4.6.0 entry with no arch listed. Does this mean that no -march flag was used for 4.3.5, 4.4.5, 4.5.2, and the bare 4.6.0? If so, your results are going to vary between versions because they have different -march/-mtune values when none are given on the command line.

                  Code:
                  $ echo "int main() { return 0; }" | gcc-4.3.5 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.3.5/cc1 -E -quiet -v - -mtune=generic
                  $ echo "int main() { return 0; }" | gcc-4.4.5 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.4.5/cc1 -E -quiet -v - -mtune=generic
                  $ echo "int main() { return 0; }" | gcc-4.5.2 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.5.2/cc1 -E -quiet -v - -mtune=generic -march=x86-64
                  $ echo "int main() { return 0; }" | gcc-4.6.0-pre9999 -v -E - 2>&1 | grep cc1
                   /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0-pre9999/cc1 -E -quiet -v - -mtune=generic -march=x86-64
                  (as of 4.5 the default for -march is based on the target tuple, in this case x86_64-unknown-linux-gnu)

                  While it may be that you're demonstrating the differences between the "defaults" of different GCC versions, it would be far more interesting IMO to see how 4.{3..6} -march=x86-64 (or even core2) stack up.
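
                  That is, pinning the same baseline on every compiler under test, along these lines (bench.c is a placeholder):

                  Code:
                  # only the compiler version varies; -march is held constant
                  for v in 4.3.5 4.4.5 4.5.2 4.6.0-pre9999; do
                      gcc-$v -O3 -march=x86-64 -o bench-$v bench.c
                  done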


                  • #10
                    Unless they were using -flto when compiling, there will be no difference whether 4.6 was configured with --enable-lto or not.
