GCC 4.6 Compiler Performance With AVX


  • phoronix
    started a topic GCC 4.6 Compiler Performance With AVX

    Phoronix: GCC 4.6 Compiler Performance With AVX

    While we are still battling issues with the Intel Linux graphics driver in getting that running properly with Intel's new Sandy Bridge CPUs (at least Intel's Jesse Barnes is now able to reproduce the most serious problem we've been facing, but we'll save the new graphics information for another article), the CPU performance continues to be very compelling. Two weeks ago we published the Intel Core i5 2500K Linux benchmarks, which showed just how thoroughly this quad-core CPU, costing a little more than $200 USD, outperforms previous generations of Intel hardware. That was just with running the standard open-source benchmarks and other Linux software, which has not been optimized for Intel's latest micro-architecture. Version 4.6 of the GNU Compiler Collection (GCC), though, is gearing up for release, and it will bring support for the AVX extensions. In this article, we are benchmarking GCC 4.6 on a Sandy Bridge system to see what benefits there are to enabling the Core i7 AVX optimizations.

    http://www.phoronix.com/vr.php?view=15665

  • baryluk
    replied
    Lots of regressions, but it has big potential to provide some really good improvements in some areas. Even 10% in some tests, which is a very good result!


  • sabriah
    replied
    Originally posted by dirtyepic
    the compiler that gets installed is _always_ compiled with itself.
    Thanks for that!


  • dirtyepic
    replied
    Originally posted by sabriah
    IIRC, binaries made by a compiler that was itself compiled by a new compiler benefit too.

    The idea is as outlined below.

    Binary B1, made by compiler C1 (itself compiled with C1), gives performance P1.

    Binary B2, made by compiler C2 (itself compiled with C1), gives performance P2, where P2 > P1.

    Binary B3, made by compiler C2 (itself compiled with C2), gives performance P3, where P3 > P2 > P1.

    Is that true, or is P2 = P3?

    Thanks for any insightful comments!
    GCC is built in a three-stage bootstrap, meaning the compiler that gets installed is _always_ compiled with itself.

    Let's say you have 4.5.2 installed and are building 4.6.0. In the first stage, the 4.6.0 compiler is built with GCC 4.5.2. The second stage then rebuilds 4.6.0 with the compiler that was just built in stage 1. Finally, the third stage uses the compiler built in stage 2 to build itself one more time. The stage 2 and stage 3 compilers are then compared to ensure they are identical.
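
    A minimal sketch of what that looks like when building from source (the version, --prefix, and prerequisites like GMP/MPFR/MPC are glossed over here):

    Code:
    $ tar xjf gcc-4.6.0.tar.bz2
    $ mkdir build && cd build
    $ ../gcc-4.6.0/configure --prefix=/opt/gcc-4.6.0 --enable-languages=c,c++
    $ make            # the default target runs the full 3-stage bootstrap
    $ make install    # what lands in --prefix is the stage-3, self-compiled compiler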


  • sabriah
    replied
    Originally posted by ciplogic
    You are "quote mining" here
    That wasn't the intention, but I get the point. Thanks for the explanation!


  • ciplogic
    replied
    Originally posted by sabriah
    "GCC 4.6 also can be built with the --with-fpmath=avx flag, which will allow the GNU compiler to use AVX floating-point arithmetic."

    IIRC, binaries made by a compiler that was itself compiled by a new compiler benefit too.

    The idea is as outlined below.

    Binary B1, made by compiler C1 (itself compiled with C1), gives performance P1.

    Binary B2, made by compiler C2 (itself compiled with C1), gives performance P2, where P2 > P1.

    Binary B3, made by compiler C2 (itself compiled with C2), gives performance P3, where P3 > P2 > P1.

    Is that true, or is P2 = P3?

    Thanks for any insightful comments!
    You are "quote mining" here. Michael's statement was not about performance but simply about the configure step: if you set that flag when configuring (in the usual configure, make, make install sequence), the resulting compiler is able to generate AVX instructions.
    It says nothing about the compiler's own performance either; with cross-compiling you could build the compiler on an Atom, a 486, or a PowerPC CPU and still get an identically performing binary.
    I also think you misunderstand where the benefit comes from, both for the final binary and for the compiler itself. Mostly the compiler works hard to make your binary use a minimum of registers and to keep the hot code inside the L1 cache; those are the main benefits a compiled program gets.
    AVX instructions (like everything in the line that started in the MMX era) are SIMD instructions, which means that if you process data in parallel blocks, as in a matrix multiplication for example, you can benefit there. If you feed the compiler code in which it can see those patterns, the resulting instructions exploit this explicit parallelism, and that is where the gains are. They mostly combine with loop-unrolling optimization.
    Most of those instructions also help floating-point code, which is an interesting point in itself, because regular applications already get fairly good performance in their final binaries without any AVX.
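
    A rough way to watch this in action (a sketch only: the file and function are made up for illustration, and -march=corei7-avx assumes GCC 4.6):

    Code:
    $ cat > saxpy.c <<'EOF'
    /* y = a*x + y: every iteration is independent, a classic SIMD-friendly loop */
    void saxpy(int n, float a, const float *x, float *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
    EOF
    $ gcc -O3 -march=corei7-avx -S saxpy.c
    $ grep -c ymm saxpy.s

    If the vectorizer triggers, the generated assembly references the 256-bit ymm registers; with plain SSE you would only see the 128-bit xmm ones.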


  • sabriah
    replied
    "GCC 4.6 also can be built with the --with-fpmath=avx flag, which will allow the GNU compiler to use AVX floating-point arithmetic."

    IIRC, binaries made by a compiler that was itself compiled by a new compiler benefit too.

    The idea is as outlined below.

    Binary B1, made by compiler C1 (itself compiled with C1), gives performance P1.

    Binary B2, made by compiler C2 (itself compiled with C1), gives performance P2, where P2 > P1.

    Binary B3, made by compiler C2 (itself compiled with C2), gives performance P3, where P3 > P2 > P1.

    Is that true, or is P2 = P3?

    Thanks for any insightful comments!


  • ciplogic
    replied
    Originally posted by elanthis
    Many of these results are very unsurprising.

    (...)

    I'm fairly sure that will mostly boil down to scientific applications and a handful of unsupported and previously slow as crap codec libraries.
    You're pointing out the issues exactly: AVX only pays off in spots of code that can use its double-wide bandwidth. Also, AMD at least has said that its first-generation AVX support will be implemented internally in microcode as two SSE operations, and since I have no information from Intel about how they did it, even code that hits the AVX optimizations will probably not show such dramatic gains.
    In the end I just hope that benchmarks will focus more on workloads that can actually exploit those gains to the maximum.
    For example, FFmpeg can be compiled with no hand-written assembly, and if its code then hits some of the compiler's auto-vectorization patterns it will likely get some speedup. The same goes for a renderer or for scientific code.
    As Phoronix tests on Linux, I think the main speedup is unlikely to be noticed: the whole desktop runs fine with just the SSE2 that an Atom CPU supports, and some components are written in Python and so on.
    Also, as these results are fairly predictable, it would be better to benchmark, for example, when a kernel picks up a new scheduling strategy (as happened with BFS). Otherwise most of those results are just noise, and at large I personally think they do a disservice to the compiler and to the hard work of the GCC team.
    Lately I have found it much more fun to test Firefox's JS performance for myself than to read those benchmarks, and far more people are affected by how a real browser performs.
    Mono has LLVM JIT support: how much is the start-up time of a big app (MonoDevelop comes to mind) affected? What about testing its raw numeric performance against a GCC/C++ port of the same code, or other code like that?


  • dirtyepic
    replied
    Unless they were using -flto when compiling, there will be no difference between 4.6 configured with --enable-lto and without it.
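
    --enable-lto only builds the LTO machinery into the compiler; nothing happens unless -flto is actually passed at both compile time and link time, along these lines (a.c and b.c standing in for real sources):

    Code:
    $ gcc -O2 -flto -c a.c
    $ gcc -O2 -flto -c b.c
    $ gcc -O2 -flto -o app a.o b.o   # the cross-file optimization happens at this link step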


  • dirtyepic
    replied
    It'd be helpful if you included the exact configure line used to build GCC. In particular, did you use --enable-checking="release" for 4.6? If not, then it defaults to "yes" because it's a snapshot, and you're comparing apples to oranges.
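
    For what it's worth, an installed compiler reports the configure line it was built with, so this is easy to check (output trimmed, shown here only as an illustration):

    Code:
    $ gcc-4.6.0-pre9999 -v 2>&1 | grep 'Configured with'
    Configured with: ../configure --enable-checking=release ...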

    Also, I'm a little confused about which optimization options were used with the earlier GCC versions. You say "This was then followed by building out our test library [...] with the core2, corei7, and corei7-avx options". I'm assuming you're talking about 4.6, since the latter two flags don't exist before 4.6. I also see a 4.6.0 entry with no arch listed. Does this mean that no -march flag was used for 4.3.5, 4.4.5, 4.5.2, and the bare 4.6.0? If so, your results are going to vary between versions because they use different -march/-mtune values when none are given on the command line.

    Code:
    $ echo "int main() { return 0; }" | gcc-4.3.5 -v -E - 2>&1 | grep cc1
     /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.3.5/cc1 -E -quiet -v - -mtune=generic
    $ echo "int main() { return 0; }" | gcc-4.4.5 -v -E - 2>&1 | grep cc1
     /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.4.5/cc1 -E -quiet -v - -mtune=generic
    $ echo "int main() { return 0; }" | gcc-4.5.2 -v -E - 2>&1 | grep cc1
     /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.5.2/cc1 -E -quiet -v - -mtune=generic -march=x86-64
    $ echo "int main() { return 0; }" | gcc-4.6.0-pre9999 -v -E - 2>&1 | grep cc1
     /usr/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0-pre9999/cc1 -E -quiet -v - -mtune=generic -march=x86-64
    (as of 4.5 the default for -march is based on the target tuple, in this case x86_64-unknown-linux-gnu)

    While it may be that you're demonstrating the differences between the "defaults" of different GCC versions, it would be far more interesting IMO to see how 4.{3..6} stack up with -march=x86-64 (or even core2) forced everywhere.
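
    Something along these lines would pin the architecture across versions (bench.c standing in for the actual test source):

    Code:
    $ for v in 4.3.5 4.4.5 4.5.2 4.6.0-pre9999; do gcc-$v -O3 -march=x86-64 -o bench-$v bench.c; done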
