I await the day the entire Linux stack (from the kernel to libraries and applications) can be built with Clang.
Intel Broadwell: GCC 4.9 vs. LLVM Clang 3.5 Compiler Benchmarks
Scimark
Clang's -O3 now includes limited loop unrolling, while GCC still needs -funroll-loops. I tried compiling with -O3 -march=native -funroll-loops --param max-unroll-times=2, and on my Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz I get:
GCC 4.8 with unrolling:
Using 2.00 seconds min time per kernel.
Composite Score: 2016.48
FFT Mflops: 1516.51 (N=1024)
SOR Mflops: 1155.08 (100 x 100)
MonteCarlo Mflops: 440.49
Sparse matmult Mflops: 1871.72 (N=1000, nz=5000)
LU Mflops: 5098.63 (M=100, N=100)
GCC 4.8 without:
Using 2.00 seconds min time per kernel.
Composite Score: 1936.17
FFT Mflops: 1430.86 (N=1024)
SOR Mflops: 1118.35 (100 x 100)
MonteCarlo Mflops: 438.66
Sparse matmult Mflops: 2035.70 (N=1000, nz=5000)
LU Mflops: 4657.27 (M=100, N=100)
Clang 3.5:
Using 2.00 seconds min time per kernel.
Composite Score: 1877.84
FFT Mflops: 1071.73 (N=1024)
SOR Mflops: 1350.53 (100 x 100)
MonteCarlo Mflops: 452.64
Sparse matmult Mflops: 2180.01 (N=1000, nz=5000)
LU Mflops: 4334.29 (M=100, N=100)
So GCC seems to win in both cases, more so with unrolling. The difference in SOR seems interesting; looking at the assembly, I do not see why GCC's loop should run slower. It is a very trivial benchmark:
for (p=0; p<num_iterations; p++)
{
for (i=1; i<Mm1; i++)
{
Gi = G[i];
Gim1 = G[i-1];
Gip1 = G[i+1];
for (j=1; j<Nm1; j++)
Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1]
+ Gi[j+1]) + one_minus_omega * Gi[j];
}
}
Originally posted by hubicka View Post
So GCC seems to win in both cases, more with unrolling. The difference in SOR seems interesting; looking at the assembly I do not see why GCC's loop should run slower. [...]

i'm interested
could you post the asm of this function with both compilers
or the whole C thing
btw http://gcc.godbolt.org/#
Originally posted by gens View Post
i'm interested
could you post the asm of this function with both compilers [...]

Scimark can be downloaded here: http://math.nist.gov/scimark2/
GCC code (with unrolling): http://pastebin.com/mwMQEbzy
LLVM code: http://pastebin.com/D36mA169
Originally posted by hubicka View Post
Scimark can be downloaded here: http://math.nist.gov/scimark2/ [...]

I was under the impression that GCC -funroll-loops was unstable and produced unpredictable code. I'm not sure if it's still true, but it was for a long time. At least the Gentoo documentation still advises against using it.
Originally posted by duby229 View Post
I was under the impression that GCC -funroll-loops was unstable and produced unpredictable code. [...]

other possibility: in the innermost loop there is a 12-to-8 difference in the memory-access instruction count.
Originally posted by mirza View Post
other possibility: in the innermost loop there is a 12-to-8 difference in the memory-access instruction count.
.L8:
vmulsd %xmm2, %xmm4, %xmm3
vmovsd 8(%r10,%r9), %xmm2
addl $2, %r8d
vaddsd 8(%r11,%r9), %xmm2, %xmm2
vmovsd 16(%rax,%r9), %xmm6
vaddsd %xmm0, %xmm2, %xmm0
vmovsd 24(%rax,%r9), %xmm2
vaddsd %xmm6, %xmm0, %xmm0
vmulsd %xmm6, %xmm4, %xmm6
vfmadd132sd %xmm5, %xmm3, %xmm0
vmovsd %xmm0, 8(%rax,%r9)
vmovsd 16(%r11,%r9), %xmm1
vaddsd 16(%r10,%r9), %xmm1, %xmm1
vaddsd %xmm1, %xmm0, %xmm0
vaddsd %xmm0, %xmm2, %xmm1
vfmadd132sd %xmm5, %xmm6, %xmm1
vmovsd %xmm1, 16(%rax,%r9)
addq $16, %r9
vmovapd %xmm1, %xmm0
cmpq %r9, %rsi
jne .L8
The vfmadd132sd (fused multiply-add) is preferred over a separate multiply+add sequence.