GCC 7.0 vs. LLVM Clang 4.0 Performance (January 2017)


  • mlau
    replied
    Something broke LLVM/Clang over the weekend: a git snapshot of llvm/clang from 08.02.2017 is now easily outperformed by gcc-6 in _all_ scimark2 tests...


  • name99
    replied
    Originally posted by zboson
    One thing that could explain why Clang performs so much better than GCC (by factors of two or three) in some cases is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum

    for(int i=0; i<n; i++) sum += x[i];

    Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that if you don't unroll, the loop is latency bound. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.

    The option `-funroll-loops` unrolls the loop eight times but does not break the dependency which is really stupid. This is one of the main reasons that GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
    But Michael's tests for that sort of code (all the matrix stuff) are also sub-optimal. My guess is he is not using the version of LLVM that has Polly built-in, and is not using Graphite on GCC. Both of these analyze certain types of deeply nested loops and restructure them for better performance (better vectorization, better parallelization, better memory access patterns). It would be nice to see both these compilers run at turbo speed (which means also using LTO) rather than just at their basic level.
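To make "restructure them for better memory access patterns" concrete, here is a minimal hand-written sketch of one such restructuring, loop interchange, applied to a square matrix multiply. This is not the output of Polly or Graphite; the function names and the row-major layout are assumptions made for illustration only.

```c
#include <stddef.h>

/* C = A * B for n x n row-major matrices, classic i-j-k order.
   The inner loop reads B down a column (stride n), which is
   unfriendly to the cache. */
void matmul_ijk(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* Same product with the j and k loops interchanged (i-k-j order).
   The inner loop now walks B and C along rows (stride 1), and its
   iterations are independent of each other, which is easier to
   vectorize. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

Both versions accumulate each output element over k in the same order, so they should produce the same values; only the memory access pattern differs.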

    (It would be even nicer if the testing infrastructure also supported PGO. Apple is clearly moving towards making this more-or-less automatic in Xcode, though only the outline is there right now, not the full picture. Maybe at this year's WWDC? Basically, along with your standard unit tests and performance tests run after each compile, you'll be able to provide "training" tests that create profiling data for subsequent PGO. I'd guess Dev Studio already has a workflow like this. Eclipse probably does too, but I have no idea.)


  • name99
    replied
    Why is LTO not yet a standard optimization flag you use? Isn't it robust enough now (and doesn't it use small enough amounts of memory) that it's feasible for use on both compilers (certainly on the LLVM side)?


  • JakubJelinek
    replied
    Originally posted by JakubJelinek

    Michael, it seems e.g. EP-DGEMM spends pretty much all of its runtime in the dgemm_ function from libblas. Do you use different libblas versions (built by different compilers) for this testing?
    I'm asking especially because dgemm_ is written in Fortran, so it isn't clear whether, for the clang case, it has been built with gfortran, flang or the PGI FE for LLVM, or something else.


  • JakubJelinek
    replied
    Originally posted by Michael

    With PTS7 git: phoronix-test-suite winners-and-losers <result file>. If there are any other metrics/stats you would find interesting, I can easily add them. If more stuff ends up being added to it, I will probably rename it from winners-and-losers.
    Michael, it seems e.g. EP-DGEMM spends pretty much all of its runtime in the dgemm_ function from libblas. Do you use different libblas versions (built by different compilers) for this testing?
    Last edited by JakubJelinek; 01 February 2017, 09:31 AM.


  • zboson
    replied
    Originally posted by atomsymbol

    In my opinion, from a certain viewpoint both gcc and clang are flawed.

    Code:
    $ cat a.c
    #include <stddef.h>
    #include <stdlib.h>
    int main() {
        float sum = random();
        const float x = random();
        const size_t n = random();
        for(size_t i=0; i<n; i++) sum += x;
        return sum;
    }
    $ clang --version
    clang version 3.9.1 (tags/RELEASE_391/final)
    Target: x86_64-pc-linux-gnu
    $ clang -O3 -S a.c
    $ cat a.s
    .LBB0_9:
        addss %xmm0, %xmm1
        addss %xmm0, %xmm1
        addss %xmm0, %xmm1
        addss %xmm0, %xmm1
        addss %xmm0, %xmm1
        addss %xmm0, %xmm1
        addss %xmm0, %xmm1
        addss %xmm0, %xmm1
        addq $8, %rdx
        jne .LBB0_9
    $ clang -O3 -S a.c -march=native    # or: -march=bdver3
    $ cat a.s
    .LBB0_4:
        vaddss %xmm1, %xmm0, %xmm1
        decq %rax
        jne .LBB0_4
    More options can be tried at https://godbolt.org
    Note that I said "for float you need -Ofast". Floating point math is not associative. In order to do reductions this way, you have to tell the compiler to assume associative floating point math. I think ICC does this by default, but GCC and Clang do not (Clang is even stricter than GCC, because GCC allows fma contractions by default but Clang does not).
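A small sketch of why the compiler is not allowed to reorder float additions on its own: the helper below (a hypothetical name, not from any library) evaluates the same three-term sum with both groupings and reports whether the results match. The volatile temporaries are only there to force each intermediate result to be rounded to float.

```c
/* Returns 1 if (a + b) + c equals a + (b + c) in float arithmetic,
   0 otherwise. */
int float_add_is_associative(float a, float b, float c) {
    volatile float left = a + b;    /* (a + b), rounded to float */
    left = left + c;                /* (a + b) + c */
    volatile float right = b + c;   /* (b + c), rounded to float */
    right = a + right;              /* a + (b + c) */
    return left == right;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f the two groupings differ: (a + b) + c is 1, while b + c rounds back to -1e8f (the spacing between adjacent floats near 1e8 is 8), so a + (b + c) is 0.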


  • zboson
    replied
    One thing that could explain why Clang performs so much better than GCC (by factors of two or three) in some cases is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum

    for(int i=0; i<n; i++) sum += x[i];

    Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that if you don't unroll, the loop is latency bound. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.

    The option `-funroll-loops` unrolls the loop eight times but does not break the dependency which is really stupid. This is one of the main reasons that GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
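For illustration, here is a hand-written sketch of the difference between a single dependency chain and four independent partial sums, assuming the reduction runs over an array of floats. The function names are made up for this example, and this is what one would write by hand, not the exact code any compiler emits.

```c
#include <stddef.h>

/* Straightforward reduction: every add depends on the previous one,
   so the loop is bound by the latency of one floating-point add. */
float sum_serial(const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unrolled into four independent partial sums. The four adds in the
   loop body do not depend on each other, so the CPU can overlap them.
   This reorders the additions, so for general float inputs the result
   may differ slightly from sum_serial. */
float sum_unrolled4(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)              /* leftover elements */
        sum += x[i];
    return sum;
}
```

Unrolling eight times while still adding into a single sum variable keeps the one long dependency chain, which is the behaviour described above for `-funroll-loops`.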


  • rubdos
    replied
    Originally posted by Steffo
    It seems, soon GCC will be irrelevant. All new tools are based on llvm and now it can beat GCC on many benchmarks.
    Which is a shame; rms was - for once - wrong when deciding not to make gcc modular. (see things like: https://gcc.gnu.org/ml/gcc/2007-11/msg00193.html)


  • Michael
    replied
    Originally posted by Marc Driftmeyer
    Update Bullet to 2.85. Your test is ancient.
    Last I checked (probably about 6 months ago), the built-in benchmarks were broken in their latest release at the time.


  • Marc Driftmeyer
    replied
    Update Bullet to 2.86. Your test is ancient.
