Intel Broadwell: GCC 4.9 vs. LLVM Clang 3.5 Compiler Benchmarks


  • gens
    replied
    Originally posted by mirza View Post
Another possibility: in the innermost loop there is a 12 vs. 8 difference in the number of memory-access instructions.
Yes, probably it's mostly that, and the FMA multiply-add instruction probably plays a small part in it as well.



  • mirza
    replied
That code looks OK; GCC 5 will probably be a few percent faster than LLVM on this test (a guess).



  • hubicka
    replied
    Originally posted by mirza View Post
Another possibility: in the innermost loop there is a 12 vs. 8 difference in the number of memory-access instructions.
Yes, that nailed it. It is not unrolling but predictive commoning that makes the difference here. I just checked GCC 5 and it also gets down to 8 accesses:

    .L8:
    vmulsd %xmm2, %xmm4, %xmm3
    vmovsd 8(%r10,%r9), %xmm2
    addl $2, %r8d
    vaddsd 8(%r11,%r9), %xmm2, %xmm2
    vmovsd 16(%rax,%r9), %xmm6
    vaddsd %xmm0, %xmm2, %xmm0
    vmovsd 24(%rax,%r9), %xmm2
    vaddsd %xmm6, %xmm0, %xmm0
    vmulsd %xmm6, %xmm4, %xmm6
    vfmadd132sd %xmm5, %xmm3, %xmm0
    vmovsd %xmm0, 8(%rax,%r9)
    vmovsd 16(%r11,%r9), %xmm1
    vaddsd 16(%r10,%r9), %xmm1, %xmm1
    vaddsd %xmm1, %xmm0, %xmm0
    vaddsd %xmm0, %xmm2, %xmm1
    vfmadd132sd %xmm5, %xmm6, %xmm1
    vmovsd %xmm1, 16(%rax,%r9)
    addq $16, %r9
    vmovapd %xmm1, %xmm0
    cmpq %r9, %rsi
    jne .L8

    vfmadd132sd is preferred over the multiply+add sequence.
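At the C level, the effect of predictive commoning here is roughly the following; a minimal sketch (not GCC's literal transformation, and the helper name sor_row is just for illustration):

Code:
        /* Values read or written in one iteration of the j loop are kept in
           registers and reused in the next iteration instead of being
           reloaded from memory. */
        static void sor_row(double *Gi, const double *Gim1, const double *Gip1,
                            int Nm1, double omega_over_four, double one_minus_omega)
        {
            double prev = Gi[0];   /* Gi[j-1]: already updated this sweep */
            double cur  = Gi[1];   /* Gi[j]:   still holds its old value  */

            for (int j = 1; j < Nm1; j++)
            {
                double next = Gi[j + 1];          /* the only remaining Gi[] load */
                double newv = omega_over_four * (Gim1[j] + Gip1[j] + prev + next)
                              + one_minus_omega * cur;
                Gi[j] = newv;
                prev = newv;   /* the value just stored is the next Gi[j-1] */
                cur  = next;   /* the old Gi[j+1] is the next old Gi[j]     */
            }
        }

That is 3 loads and 1 store per j, i.e. 8 memory accesses per body once the loop is unrolled by two (the addl $2 above), against 12 for the naive form.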



  • mirza
    replied
Another possibility: in the innermost loop there is a 12 vs. 8 difference in the number of memory-access instructions.



  • mirza
    replied
Could it be that the vfmadd132sd instruction that GCC uses here, as opposed to LLVM, is slow as hell?
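For reference, the FMA simply comes from contraction of a multiply-add expression; a minimal sketch (assuming an FMA-capable target such as -march=native on Broadwell; the function name is just for illustration):

Code:
        /* With FMA available and contraction enabled (GCC's default outside
           strict ISO modes), a * b + c is emitted as a single vfmadd*sd
           instead of vmulsd + vaddsd.  Rebuilding with -ffp-contract=off
           brings the separate multiply and add back, which is one way to
           test whether the FMA itself is the problem. */
        double madd(double a, double b, double c)
        {
            return a * b + c;
        }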



  • hubicka
    replied
    Originally posted by duby229 View Post
I was under the impression that GCC's -funroll-loops was unstable and produced unpredictable code. I'm not sure whether that's still true, but it was for a long time. At least the Gentoo documentation still advises against using it.
I do not see any wrong-code bugs filed against -funroll-loops in GCC 4.9. The unroller is not a particularly difficult pass and it has not really changed much recently; it mostly tends to expose existing bugs related to aliasing. It is enabled by default with -fprofile-use, and it is disabled at -O3 more or less for historical reasons (from the time when -O3 was just -O2 plus automatic inlining).

    In fact, I want to raise a discussion about enabling it for GCC 5/GCC 6. The main problem is that the default unrolling limits are probably a bit too aggressive even for -O3, so we may want to come up with more conservative defaults. That needs a bit of benchmarking.
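For anyone who wants to experiment, a conservative cap is just a --param on top of -funroll-loops; this is the combination used for the GCC-with-unrolling numbers further down the thread (compiling SOR.c on its own here is only an illustration):

Code:
        # -funroll-loops is not implied by -O3; max-unroll-times caps how many
        # times any single loop may be unrolled.
        gcc -O3 -march=native -funroll-loops --param max-unroll-times=2 -c SOR.c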



  • duby229
    replied
    Originally posted by hubicka View Post
    Scimark can be downloaded here http://math.nist.gov/scimark2/

    GCC code (with unrolling) http://pastebin.com/mwMQEbzy

    LLVM code http://pastebin.com/D36mA169
I was under the impression that GCC's -funroll-loops was unstable and produced unpredictable code. I'm not sure whether that's still true, but it was for a long time. At least the Gentoo documentation still advises against using it.



  • hubicka
    replied
    Originally posted by gens View Post
I'm interested; could you post the asm of this function with both compilers, or the whole C thing?

    btw http://gcc.godbolt.org/#
    Scimark can be downloaded here http://math.nist.gov/scimark2/

    GCC code (with unrolling) http://pastebin.com/mwMQEbzy

    LLVM code http://pastebin.com/D36mA169



  • gens
    replied
    Originally posted by hubicka View Post
So GCC seems to win in both cases, more so with unrolling. The difference in SOR seems interesting; looking at the assembly I do not see why the GCC loop should run slower. It is a very trivial benchmark:
    Code:
            for (p=0; p<num_iterations; p++)
            {
                for (i=1; i<Mm1; i++)
                {
                    Gi = G[i];
                    Gim1 = G[i-1];
                    Gip1 = G[i+1];
                    for (j=1; j<Nm1; j++)
                        Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1] 
                                    + Gi[j+1]) + one_minus_omega * Gi[j];
                }
            }
I'm interested; could you post the asm of this function with both compilers, or the whole C thing?

    btw http://gcc.godbolt.org/#



  • hubicka
    replied
    Scimark

Clang's -O3 now includes limited loop unrolling, while GCC needs -funroll-loops. I tried compiling with -O3 -march=native -funroll-loops --param max-unroll-times=2 on my Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz:

    GCC 4.8 with unrolling:

    Using 2.00 seconds min time per kenel.
    Composite Score: 2016.48
    FFT Mflops: 1516.51 (N=1024)
    SOR Mflops: 1155.08 (100 x 100)
    MonteCarlo: Mflops: 440.49
    Sparse matmult Mflops: 1871.72 (N=1000, nz=5000)
    LU Mflops: 5098.63 (M=100, N=100)

    GCC 4.8 without:
    Using 2.00 seconds min time per kenel.
    Composite Score: 1936.17
    FFT Mflops: 1430.86 (N=1024)
    SOR Mflops: 1118.35 (100 x 100)
    MonteCarlo: Mflops: 438.66
    Sparse matmult Mflops: 2035.70 (N=1000, nz=5000)
    LU Mflops: 4657.27 (M=100, N=100)


    Clang 3.5:

    Using 2.00 seconds min time per kenel.
    Composite Score: 1877.84
    FFT Mflops: 1071.73 (N=1024)
    SOR Mflops: 1350.53 (100 x 100)
    MonteCarlo: Mflops: 452.64
    Sparse matmult Mflops: 2180.01 (N=1000, nz=5000)
    LU Mflops: 4334.29 (M=100, N=100)

So GCC seems to win in both cases, more so with unrolling. The difference in SOR seems interesting; looking at the assembly I do not see why the GCC loop should run slower. It is a very trivial benchmark:
Code:
        for (p=0; p<num_iterations; p++)
        {
            for (i=1; i<Mm1; i++)
            {
                Gi = G[i];
                Gim1 = G[i-1];
                Gip1 = G[i+1];
                for (j=1; j<Nm1; j++)
                    Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1] 
                                + Gi[j+1]) + one_minus_omega * Gi[j];
            }
        }
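If someone wants to time just this kernel outside of scimark2, a standalone harness is short; a minimal sketch (the 100x100 size matches the SOR line in the scores above, while the iteration count and the omega value are arbitrary assumptions rather than scimark2's settings):

Code:
        #include <stdio.h>
        #include <time.h>

        #define M 100
        #define N 100

        static double G[M][N];

        static void sor_sweeps(int num_iterations, double omega)
        {
            const double omega_over_four = omega * 0.25;
            const double one_minus_omega = 1.0 - omega;
            const int Mm1 = M - 1, Nm1 = N - 1;

            for (int p = 0; p < num_iterations; p++)
                for (int i = 1; i < Mm1; i++)
                {
                    double *Gi   = G[i];
                    double *Gim1 = G[i - 1];
                    double *Gip1 = G[i + 1];
                    for (int j = 1; j < Nm1; j++)
                        Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1]
                                    + Gi[j + 1]) + one_minus_omega * Gi[j];
                }
        }

        int main(void)
        {
            /* Arbitrary initial data; only the loop structure matters for timing. */
            for (int i = 0; i < M; i++)
                for (int j = 0; j < N; j++)
                    G[i][j] = (double) (i * N + j) / (double) (M * N);

            clock_t t0 = clock();
            sor_sweeps(100000, 1.25);                /* iteration count: assumption */
            double secs = (double) (clock() - t0) / CLOCKS_PER_SEC;

            /* Print a result so the computation is not optimized away. */
            printf("G[50][50] = %g, %.3f s\n", G[50][50], secs);
            return 0;
        }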

