
GCC 7.0 vs. LLVM Clang 4.0 Performance (January 2017)

  • #21
    Originally posted by Marc Driftmeyer View Post
    Update Bullet to 2.85. Your test is ancient.
    Last I checked (probably about six months ago), the benchmark built-ins were broken with their latest release at the time.
    Michael Larabel
    https://www.michaellarabel.com/

    Comment


    • #22
      Originally posted by Steffo View Post
      It seems GCC will soon be irrelevant. All new tools are based on LLVM, and now it can beat GCC on many benchmarks.
      Which is a shame; rms was, for once, wrong when deciding not to make GCC modular. (See e.g.: https://gcc.gnu.org/ml/gcc/2007-11/msg00193.html)

      Comment


      • #23
        One thing that could explain why Clang performs so much better than GCC in some cases (by factors of two or three) is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum

        for(int i=0; i<n; i++) sum += x;

        Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that if you don't unroll, the loop is latency bound. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.

        The option `-funroll-loops` unrolls the loop eight times but does not break the dependency chain, which is really stupid. This is one of the main reasons that GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
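
        To make the point concrete, here is a minimal C sketch of the transformation described above, done by hand (the function names are mine, not from the thread): a plain single-accumulator reduction versus the four-independent-accumulator version that the compiler would emit under -Ofast.

        ```c
        #include <stdio.h>

        /* Single accumulator: every add depends on the previous one,
         * so the loop is bound by floating-point add latency. */
        float sum_serial(const float *x, int n) {
            float sum = 0.0f;
            for (int i = 0; i < n; i++) sum += x[i];
            return sum;
        }

        /* Four independent partial sums: the adds no longer form one
         * dependency chain, so several can be in flight per cycle. */
        float sum_unrolled(const float *x, int n) {
            float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
            int i = 0;
            for (; i + 4 <= n; i += 4) {
                s0 += x[i];
                s1 += x[i + 1];
                s2 += x[i + 2];
                s3 += x[i + 3];
            }
            float sum = (s0 + s1) + (s2 + s3); /* combine partials */
            for (; i < n; i++) sum += x[i];    /* remainder elements */
            return sum;
        }

        int main(void) {
            float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
            printf("%.1f %.1f\n", sum_serial(x, 8), sum_unrolled(x, 8));
            return 0;
        }
        ```

        Note that the unrolled version reassociates the additions, which changes rounding for float; that is exactly why the compiler refuses to do it without -Ofast.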

        Comment


        • #24
          Originally posted by atomsymbol

          In my opinion, from a certain viewpoint both gcc and clang are flawed.

          Code:
          $ cat a.c
          #include <stddef.h>
          #include <stdlib.h>
          int main() {
              float sum = random();
              const float x = random();
              const size_t n = random();
              for (size_t i = 0; i < n; i++) sum += x;
              return sum;
          }
          $ clang --version
          clang version 3.9.1 (tags/RELEASE_391/final)
          Target: x86_64-pc-linux-gnu
          $ clang -O3 -S a.c
          $ cat a.s
          .LBB0_9:
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addq $8, %rdx
                  jne .LBB0_9
          $ clang -O3 -S a.c -march=native    (or: -march=bdver3)
          $ cat a.s
          .LBB0_4:
                  vaddss %xmm1, %xmm0, %xmm1
                  decq %rax
                  jne .LBB0_4
          More options can be tried at https://godbolt.org
          Note that I said "for float you need -Ofast". Floating-point math is not associative; in order to vectorize reductions, you have to tell the compiler to assume associative floating-point math. I think ICC does this by default, but GCC and Clang do not (Clang is even stricter than GCC, because GCC allows FMA contraction by default but Clang does not).
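
          A tiny self-contained illustration of why the compiler needs explicit permission to reassociate: the same three floats summed in two different orders give different results.

          ```c
          #include <stdio.h>

          int main(void) {
              /* Float addition is not associative: regrouping the sum
               * changes the result, so a compiler may not split a float
               * reduction into partial sums without -Ofast/-ffast-math. */
              float a = 1e8f, b = -1e8f, c = 1.0f;
              float left  = (a + b) + c;  /* a and b cancel first: 1.0 */
              float right = a + (b + c);  /* c is absorbed into b: 0.0 */
              printf("%.1f %.1f\n", left, right);
              return 0;
          }
          ```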

          Comment


          • #25
            Originally posted by Michael View Post

             With PTS7 git: phoronix-test-suite winners-and-losers <result file>. If there are any other metrics/stats you would find interesting, I can easily add them. If it ends up adding more stuff, I will probably rename it from winners-and-losers.
             Michael, it seems that e.g. EP-DGEMM spends pretty much all of its runtime in the dgemm_ function from libblas. Do you use different libblas versions (built by different compilers) for this testing?
            Last edited by JakubJelinek; 01 February 2017, 09:31 AM.

            Comment


            • #26
              Originally posted by JakubJelinek View Post

              Michael, it seems that e.g. EP-DGEMM spends pretty much all of its runtime in the dgemm_ function from libblas. Do you use different libblas versions (built by different compilers) for this testing?
              I'm asking especially because dgemm_ is written in Fortran, so it isn't clear whether, for the Clang case, it was built with gfortran, flang, the PGI frontend for LLVM, or something else.

              Comment


              • #27
                Why is LTO not yet among the standard optimization flags you use? Isn't it robust enough now (and light enough on memory) that it's feasible on both compilers (certainly on the LLVM side)?

                Comment


                • #28
                  Originally posted by zboson View Post
                  One thing that could explain why Clang performs so much better than GCC in some cases (by factors of two or three) is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum

                  for(int i=0; i<n; i++) sum += x;

                  Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that if you don't unroll, the loop is latency bound. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.

                  The option `-funroll-loops` unrolls the loop eight times but does not break the dependency chain, which is really stupid. This is one of the main reasons that GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
                  But Michael's tests for that sort of code (all the matrix stuff) are also sub-optimal. My guess is he is not using the version of LLVM that has Polly built-in, and is not using Graphite on GCC. Both of these analyze certain types of deeply nested loops and restructure them for better performance (better vectorization, better parallelization, better memory access patterns). It would be nice to see both these compilers run at turbo speed (which means also using LTO) rather than just at their basic level.

                  (It would be even nicer if the testing infrastructure also supported PGO. Apple is clearly moving towards making this more-or-less automatic in XCode, though only the outline is there right now, not the full picture. Maybe at this year's WWDC? Basically, along with your standard unit tests and performance tests run after each compile, you'll be able to provide "training" tests that create profiling data for subsequent PGO. I'd guess Dev Studio already has a work flow like this. Eclipse probably, but I have no idea.)
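
                  For reference, here is a sketch of what such "turbo" invocations might look like. The flags are real GCC/Clang options; app.c is a placeholder, and -mllvm -polly only does anything if LLVM was built with Polly included.

                  ```shell
                  # GCC with Graphite loop-nest optimization plus LTO:
                  GCC_FLAGS="-O3 -march=native -floop-nest-optimize -flto"

                  # Clang with Polly (requires an LLVM build that includes Polly) plus LTO:
                  CLANG_FLAGS="-O3 -march=native -mllvm -polly -flto"

                  # PGO on GCC is a two-step build (training run, then rebuild):
                  #   gcc $GCC_FLAGS -fprofile-generate app.c -o app && ./app
                  #   gcc $GCC_FLAGS -fprofile-use app.c -o app
                  echo "$GCC_FLAGS"
                  echo "$CLANG_FLAGS"
                  ```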

                  Comment


                  • #29
                    Something broke LLVM/Clang over the weekend: a git snapshot of llvm/clang from 08.02.2017 is now easily outperformed by gcc-6 in _all_ scimark2 tests...

                    Comment
