Thread: GCC vs. LLVM Clang Is Mixed On The Ivy Bridge Extreme

  1. #11
    Join Date
    Oct 2009
    Posts
    845

    Quote Originally Posted by WorBlux View Post
    Where in the wild do we actually see O3 though?
    Practically all performance-oriented software: encoders, games, archivers/compressors, emulators, 3D renderers, etc.

  2. #12
    Join Date
    Oct 2009
    Posts
    845

    Quote Originally Posted by name99 View Post
    LLVM (and so XCode) also has link time (ie whole program) optimization, enabled by -O4. I imagine GCC has the same.
    GCC enables link-time optimization with -flto (a flag Clang also supports). As far as I know, GCC does not recognize -O4 at all.

    Quote Originally Posted by name99 View Post
    Apple slides showed that LTO made a substantial difference (5% to 20%) in performance, but of course that is against real world code that is split over a large number of files; it may have much less impact on these sorts of microbenchmarks.
    Yes, it's extremely code-base dependent. You can achieve much of the same effect by manually declaring non-exported functions as static (or, as SQLite did, by joining all source files into one big file before compiling). Personally I've seen little performance gain from LTO on my own code and the code I've benchmarked, but again it really depends on the code in question. The binary also often ends up quite a bit smaller with LTO, which is of course nice.
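As a minimal sketch of the "declare non-exported functions as static" idea, assuming a single translation unit with one exported entry point (the names `square` and `sum_of_squares` are made up for illustration): giving the helper internal linkage tells the compiler that no other file can call it, so it is free to inline it and drop the standalone copy, much as it could after LTO.

```c
#include <stddef.h>

/* Hypothetical internal helper. 'static' gives it internal linkage,
   so the compiler knows every caller is in this file and may inline
   it and discard the out-of-line copy. */
static double square(double x)
{
    return x * x;
}

/* Hypothetical exported entry point; this is the only symbol other
   translation units can see. */
double sum_of_squares(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += square(v[i]);
    return total;
}
```

Whether this actually matches what LTO would have achieved depends on the code base; it only covers helpers that are used within a single file.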

    Quote Originally Posted by name99 View Post
    What's not clear to me is the extent to which either LLVM or GCC have fully optimized their LTO pass. Apple had (PPC specific) tools fifteen years ago that could run whole program optimization and rearrange the function layout so that functions that called each other were packed together (and so took up less TLB coverage and shared overlapping cache lines).
    Well, GCC has a '-fwhole-program' option, which tells the compiler to treat the current compilation unit as the entire program and so enables more aggressive interprocedural optimizations (as in moving blocks of code around to improve cache use, eliminating or consolidating code blocks, etc.).

    Quote Originally Posted by name99 View Post
    but could be run with a profiling pass to get a better understanding of the hot call chains. But as far as I know, the LLVM LTO does not (yet?) do this sort of thing, and I have zero idea about GCC.
    Yes, this (profile-guided optimization) is by far the most effective optimization I've used outside of the -On levels. GCC implements it as -fprofile-generate/-fprofile-use, and it can deliver some great improvements: typically I get about 4-8% on performance-oriented code, sometimes up to 20%.

    I know there was a Google Summer of Code project to implement profile-guided optimization in LLVM, but I haven't heard anything about it since, so I fear it didn't amount to anything.

  3. #13
    Join Date
    Oct 2009
    Posts
    845

    Quote Originally Posted by name99 View Post
    Oh, one thing to add to my earlier comment.
    LLVM (and maybe GCC, but I don't know there) will not automatically vectorize many FP loops if fast-math is not enabled because getting the loop to vectorize requires re-ordering FP operations. This means that using fast-math, if your code allows it, can affect performance by quite a bit more than you might imagine.
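To illustrate why vectorizing an FP loop requires re-ordering, here is a small sketch (function names are made up for illustration). The first function adds strictly left to right, which is the order the compiler must preserve without fast-math. The second keeps four partial sums and combines them at the end, which is roughly the order a 4-wide vectorized reduction would use. Because floating-point addition is not associative, the two can return different results.

```c
#include <stddef.h>

/* Strict left-to-right sum: the evaluation order the compiler has to
   keep when it is not allowed to reassociate FP operations. */
float sum_sequential(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* Four independent partial sums, combined at the end. This mimics the
   order of a 4-lane vectorized reduction, written out in scalar code. */
float sum_four_lanes(const float *v, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; ++k)
            lane[k] += v[i + k];
    float s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i)  /* leftover elements when n is not a multiple of 4 */
        s += v[i];
    return s;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, and assuming IEEE 754 single precision with no fast-math, the sequential version returns 1 while the four-lane version returns 0, because 1e8f + 1.0f rounds back to 1e8f. That difference is why the compiler needs permission before it re-orders such a loop.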
    I did a quick rundown test on C-Ray using GCC and Clang with and without -ffast-math:

    GCC version: 4.8.1 20130725
    Clang version: 3.3 (tags/RELEASE_33/final)
    Arch Linux 64-bit, Core i5
    Benchmark: cat scene | ./c-ray-mt -t 4 -s 7500x3500 > foo.ppm

    Results are in milliseconds, each the average of 5 benchmark runs (excluding a warm-up run).

    gcc -O3
    5840

    gcc -O3 -funroll-loops
    5704

    gcc -O3 -ffast-math -funroll-loops
    4374

    gcc -Ofast -funroll-loops
    4368

    gcc -Ofast -funroll-loops -march=native
    4351

    On GCC we can see that -ffast-math greatly improves the result (5704 ms down to 4374 ms, roughly 23% less time). Now let's look at Clang:

    clang -O3
    6403

    clang -O3 -funroll-loops
    6396

    clang -O3 -ffast-math -funroll-loops
    7137

    clang -Ofast -funroll-loops
    7122

    clang -Ofast -funroll-loops -march=native
    7153

    On Clang, however, we see that -ffast-math _degrades_ performance markedly on C-Ray (6396 ms up to 7137 ms, roughly 12% slower). Had Michael used it for his Phoronix C-Ray test, Clang would have come out looking MUCH worse than it does now, since GCC got a great boost from -ffast-math.

    Apart from that, it seems that -funroll-loops does nothing performance-wise on Clang, and the same goes for -march=native.
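To put the timings above on a common scale, a tiny helper (the name `percent_change` is made up for illustration) can express each result relative to a baseline run:

```c
/* Relative change from a baseline time to a new time, in percent.
   Negative means the new build is faster than the baseline. */
double percent_change(double baseline_ms, double new_ms)
{
    return (new_ms - baseline_ms) / baseline_ms * 100.0;
}
```

With the numbers posted above, going from '-O3 -funroll-loops' to '-O3 -ffast-math -funroll-loops' gives about -23.3 for GCC (5704 to 4374) and about +11.6 for Clang (6396 to 7137). The differences attributed to -march=native are well under 1% either way.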

  4. #14
    Join Date
    Feb 2008
    Location
    Linuxland
    Posts
    5,199

    Just noting, gcc accepts -O[any positive number], it's just that high numbers get clamped to 3. It's been this way for ages, gcc 4.2 accepts -O666 just fine.

  5. #15
    Join Date
    Oct 2009
    Posts
    845

    Quote Originally Posted by curaga View Post
    Just noting, gcc accepts -O[any positive number], it's just that high numbers get clamped to 3. It's been this way for ages, gcc 4.2 accepts -O666 just fine.
    Good to know. I thought it just ignored anything above -O3 and fell back to the default -O0. Clamping at -O3 makes more sense, though, as the user likely wanted aggressive optimization when asking for a value higher than -O3.

  6. #16
    Join Date
    Mar 2013
    Posts
    49

    Quote Originally Posted by XorEaxEax View Post
    I did a quick rundown test on C-Ray using GCC and Clang with and without -ffast-math:

    GCC version: 4.8.1 20130725
    Clang version: 3.3 (tags/RELEASE_33/final)
    Arch Linux 64-bit, Core i5
    Benchmark: cat scene | ./c-ray-mt -t 4 -s 7500x3500 > foo.ppm

    Results are in milliseconds, each the average of 5 benchmark runs (excluding a warm-up run).

    gcc -O3
    5840

    gcc -O3 -funroll-loops
    5704

    gcc -O3 -ffast-math -funroll-loops
    4374

    gcc -Ofast -funroll-loops
    4368

    gcc -Ofast -funroll-loops -march=native
    4351

    On GCC we can see that -ffast-math greatly improves the result (5704 ms down to 4374 ms, roughly 23% less time). Now let's look at Clang:

    clang -O3
    6403

    clang -O3 -funroll-loops
    6396

    clang -O3 -ffast-math -funroll-loops
    7137

    clang -Ofast -funroll-loops
    7122

    clang -Ofast -funroll-loops -march=native
    7153

    On Clang, however, we see that -ffast-math _degrades_ performance markedly on C-Ray (6396 ms up to 7137 ms, roughly 12% slower). Had Michael used it for his Phoronix C-Ray test, Clang would have come out looking MUCH worse than it does now, since GCC got a great boost from -ffast-math.

    Apart from that, it seems that -funroll-loops does nothing performance-wise on Clang, and the same goes for -march=native.

    Interesting (and remarkable) that we get such a regression from -ffast-math. It'd be interesting (if it's not a hassle) to learn why.
    One possibility (which may or may not be the case) is that vectorization is at fault here. A big push in 3.3 was to ensure that the vectorization cost model was accurate, so that your vectorized code didn't make things worse by spending so much time just shuffling data. But the hope of having an accurate cost model doesn't mean that you ACTUALLY have one. It's possible that there's something severely broken in the cost model (at least for FP vectors) which is giving us these results.

    The -funroll-loops result does not surprise me. The LLVM guys probably believe they have good heuristics for when (or when not) to do this, and are likely correct.
    The architecture specific stuff may be linked to the inaccurate cost model issue? It would be amusing if we learned there was an off-by-one error or something in the micro-architecture specs table that drove the compiler!

    I guess 3.4 will be released in the next month or two, and it would be interesting to revisit this at that point.

  7. #17
    Join Date
    Oct 2009
    Posts
    845

    Quote Originally Posted by name99 View Post
    One possibility (which may or may not be the case) is that vectorization is at fault here.
    Yes, it's quite obvious that something in the heuristics failed, for this code at least; there's no reason why less precise floating-point math should result in much slower code.

    So it likely ties back to what you said earlier that Clang/LLVM will only try to vectorize with -ffast-math enabled, and that it is indeed the vectorization that fails.

    Quote Originally Posted by name99 View Post
    The architecture specific stuff may be linked to the inaccurate cost model issue?
    You mean -march=native? I don't see how; both '-Ofast -funroll-loops' and '-Ofast -funroll-loops -march=native' were equally slow on Clang.

    On the other hand, '-march=native' didn't seem to do anything for GCC either; I think differences of around 50 milliseconds or less can be discarded as noise.

    It's funny though that Michael obviously knew about this problem with -ffast-math on Clang/LLVM, as the upstream (original) version of C-Ray 1.1 ships with '-O3 -ffast-math' in the makefile, and thus Michael modified the makefile to remove -ffast-math for his tests so as to make Clang/LLVM not look so bad. Good old agenda-biased Michael.

    This is why I take all 'tests' done here on Phoronix with a large grain of salt, particularly those between GCC and Clang/LLVM as I know he is extremely pro Clang/LLVM and has an agenda against FSF (which seems to spill over on GCC and other FSF/GNU software).

    Quote Originally Posted by name99 View Post
    I guess 3.4 will be released in the next month or two, and it would be interesting to revisit this at that point.
    Indeed, the competition between GCC and Clang/LLVM is a perfect situation for us as end users (and, I think, for the projects themselves). I use both, though GCC is definitely what I use for release binaries due to its better optimizations. If this changes and Clang/LLVM delivers better code performance, I will use that toolchain for release builds.

    So here's looking forward to Clang 3.4 and GCC 4.9 and the advances they bring.
