Intel Broadwell: GCC 4.9 vs. LLVM Clang 3.5 Compiler Benchmarks


  • gens
    replied
    Originally posted by mirza View Post
Another possibility: in the innermost loop there is a 12 vs. 8 difference in the number of memory-access instructions.
Yes, probably it's mostly that, and the FMA multiply-add instruction probably plays a small part in it as well.



  • mirza
    replied
That code looks OK; GCC 5 will probably be a few percent faster than LLVM on this test (a guess).



  • hubicka
    replied
    Originally posted by mirza View Post
Another possibility: in the innermost loop there is a 12 vs. 8 difference in the number of memory-access instructions.
Yes, that nailed it. It is not unrolling but predictive commoning that makes the difference here. I just checked GCC 5 and it also gets down to 8 accesses:

    .L8:
    vmulsd %xmm2, %xmm4, %xmm3
    vmovsd 8(%r10,%r9), %xmm2
    addl $2, %r8d
    vaddsd 8(%r11,%r9), %xmm2, %xmm2
    vmovsd 16(%rax,%r9), %xmm6
    vaddsd %xmm0, %xmm2, %xmm0
    vmovsd 24(%rax,%r9), %xmm2
    vaddsd %xmm6, %xmm0, %xmm0
    vmulsd %xmm6, %xmm4, %xmm6
    vfmadd132sd %xmm5, %xmm3, %xmm0
    vmovsd %xmm0, 8(%rax,%r9)
    vmovsd 16(%r11,%r9), %xmm1
    vaddsd 16(%r10,%r9), %xmm1, %xmm1
    vaddsd %xmm1, %xmm0, %xmm0
    vaddsd %xmm0, %xmm2, %xmm1
    vfmadd132sd %xmm5, %xmm6, %xmm1
    vmovsd %xmm1, 16(%rax,%r9)
    addq $16, %r9
    vmovapd %xmm1, %xmm0
    cmpq %r9, %rsi
    jne .L8

    vfmadd132sd is preferred over the multiply+add sequence.
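At the C level, the effect of predictive commoning here is roughly the following; a minimal sketch (not GCC's literal transformation, and the helper name sor_row is just for illustration):

Code:
        /* Values read or written in one iteration of the j loop are kept in
           registers and reused in the next iteration instead of being
           reloaded from memory. */
        static void sor_row(double *Gi, const double *Gim1, const double *Gip1,
                            int Nm1, double omega_over_four, double one_minus_omega)
        {
            double prev = Gi[0];   /* Gi[j-1]: already updated this sweep */
            double cur  = Gi[1];   /* Gi[j]:   still holds its old value  */

            for (int j = 1; j < Nm1; j++)
            {
                double next = Gi[j + 1];          /* the only remaining Gi[] load */
                double newv = omega_over_four * (Gim1[j] + Gip1[j] + prev + next)
                              + one_minus_omega * cur;
                Gi[j] = newv;
                prev = newv;   /* the value just stored is the next Gi[j-1] */
                cur  = next;   /* the old Gi[j+1] is the next old Gi[j]     */
            }
        }

That is 3 loads and 1 store per j, i.e. 8 memory accesses per body once the loop is unrolled by two (the addl $2 above), against 12 for the naive form.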



  • mirza
    replied
Another possibility: in the innermost loop there is a 12 vs. 8 difference in the number of memory-access instructions.



  • mirza
    replied
Could it be that the vfmadd132sd instruction that GCC uses here, as opposed to LLVM, is slow as hell?
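For reference, the FMA simply comes from contraction of a multiply-add expression; a minimal sketch (assuming an FMA-capable target such as -march=native on Broadwell; the function name is just for illustration):

Code:
        /* With FMA available and contraction enabled (GCC's default outside
           strict ISO modes), a * b + c is emitted as a single vfmadd*sd
           instead of vmulsd + vaddsd.  Rebuilding with -ffp-contract=off
           brings the separate multiply and add back, which is one way to
           test whether the FMA itself is the problem. */
        double madd(double a, double b, double c)
        {
            return a * b + c;
        }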



  • hubicka
    replied
    Originally posted by duby229 View Post
I was under the impression that GCC's -funroll-loops was unstable and produced unpredictable code. I'm not sure whether that's still true, but it was for a long time. At least the Gentoo documentation still advises against using it.
I do not see any wrong-code bugs filed against -funroll-loops in GCC 4.9. The unroller is not a particularly difficult pass and it has not really changed much recently; it mostly tends to expose existing bugs related to aliasing. It is enabled by default with -fprofile-use, and it is disabled at -O3 more or less for historical reasons (from the time when -O3 was just -O2 plus automatic inlining).

    In fact, I want to raise a discussion about enabling it for GCC 5/GCC 6. The main problem is that the default unrolling limits are probably a bit too aggressive even for -O3, so we may want to come up with more conservative defaults. That needs a bit of benchmarking.
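For anyone who wants to experiment, a conservative cap is just a --param on top of -funroll-loops; this is the combination used for the GCC-with-unrolling numbers further down the thread (compiling SOR.c on its own here is only an illustration):

Code:
        # -funroll-loops is not implied by -O3; max-unroll-times caps how many
        # times any single loop may be unrolled.
        gcc -O3 -march=native -funroll-loops --param max-unroll-times=2 -c SOR.c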



  • duby229
    replied
    Originally posted by hubicka View Post
    Scimark can be downloaded here http://math.nist.gov/scimark2/

    GCC code (with unrolling) http://pastebin.com/mwMQEbzy

    LLVM code http://pastebin.com/D36mA169
I was under the impression that GCC's -funroll-loops was unstable and produced unpredictable code. I'm not sure whether that's still true, but it was for a long time. At least the Gentoo documentation still advises against using it.



  • hubicka
    replied
    Originally posted by gens View Post
I'm interested; could you post the asm of this function with both compilers, or the whole C thing?

    btw http://gcc.godbolt.org/#
    Scimark can be downloaded here http://math.nist.gov/scimark2/

    GCC code (with unrolling) http://pastebin.com/mwMQEbzy

    LLVM code http://pastebin.com/D36mA169



  • gens
    replied
    Originally posted by hubicka View Post
So GCC seems to win in both cases, more so with unrolling. The difference in SOR seems interesting; looking at the assembly I do not see why the GCC loop should run slower. It is a very trivial benchmark:
    Code:
            for (p=0; p<num_iterations; p++)
            {
                for (i=1; i<Mm1; i++)
                {
                    Gi = G[i];
                    Gim1 = G[i-1];
                    Gip1 = G[i+1];
                    for (j=1; j<Nm1; j++)
                        Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1] 
                                    + Gi[j+1]) + one_minus_omega * Gi[j];
                }
            }
I'm interested; could you post the asm of this function with both compilers, or the whole C thing?

    btw http://gcc.godbolt.org/#



  • hubicka
    replied
    Scimark

Clang's -O3 now includes limited loop unrolling, while GCC needs -funroll-loops. I tried compiling with -O3 -march=native -funroll-loops --param max-unroll-times=2 on my Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz:

    GCC 4.8 with unrolling:

    Using 2.00 seconds min time per kenel.
    Composite Score: 2016.48
    FFT Mflops: 1516.51 (N=1024)
    SOR Mflops: 1155.08 (100 x 100)
    MonteCarlo: Mflops: 440.49
    Sparse matmult Mflops: 1871.72 (N=1000, nz=5000)
    LU Mflops: 5098.63 (M=100, N=100)

    GCC 4.8 without:
    Using 2.00 seconds min time per kenel.
    Composite Score: 1936.17
    FFT Mflops: 1430.86 (N=1024)
    SOR Mflops: 1118.35 (100 x 100)
    MonteCarlo: Mflops: 438.66
    Sparse matmult Mflops: 2035.70 (N=1000, nz=5000)
    LU Mflops: 4657.27 (M=100, N=100)


    Clang 3.5:

    Using 2.00 seconds min time per kenel.
    Composite Score: 1877.84
    FFT Mflops: 1071.73 (N=1024)
    SOR Mflops: 1350.53 (100 x 100)
    MonteCarlo: Mflops: 452.64
    Sparse matmult Mflops: 2180.01 (N=1000, nz=5000)
    LU Mflops: 4334.29 (M=100, N=100)

So GCC seems to win in both cases, more so with unrolling. The difference in SOR seems interesting; looking at the assembly I do not see why the GCC loop should run slower. It is a very trivial benchmark:
Code:
        for (p=0; p<num_iterations; p++)
        {
            for (i=1; i<Mm1; i++)
            {
                Gi = G[i];
                Gim1 = G[i-1];
                Gip1 = G[i+1];
                for (j=1; j<Nm1; j++)
                    Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1] 
                                + Gi[j+1]) + one_minus_omega * Gi[j];
            }
        }
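If someone wants to time just this kernel outside of scimark2, a standalone harness is short; a minimal sketch (the 100x100 size matches the SOR line in the scores above, while the iteration count and the omega value are arbitrary assumptions rather than scimark2's settings):

Code:
        #include <stdio.h>
        #include <time.h>

        #define M 100
        #define N 100

        static double G[M][N];

        static void sor_sweeps(int num_iterations, double omega)
        {
            const double omega_over_four = omega * 0.25;
            const double one_minus_omega = 1.0 - omega;
            const int Mm1 = M - 1, Nm1 = N - 1;

            for (int p = 0; p < num_iterations; p++)
                for (int i = 1; i < Mm1; i++)
                {
                    double *Gi   = G[i];
                    double *Gim1 = G[i - 1];
                    double *Gip1 = G[i + 1];
                    for (int j = 1; j < Nm1; j++)
                        Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1]
                                    + Gi[j + 1]) + one_minus_omega * Gi[j];
                }
        }

        int main(void)
        {
            /* Arbitrary initial data; only the loop structure matters for timing. */
            for (int i = 0; i < M; i++)
                for (int j = 0; j < N; j++)
                    G[i][j] = (double) (i * N + j) / (double) (M * N);

            clock_t t0 = clock();
            sor_sweeps(100000, 1.25);                /* iteration count: assumption */
            double secs = (double) (clock() - t0) / CLOCKS_PER_SEC;

            /* Print a result so the computation is not optimized away. */
            printf("G[50][50] = %g, %.3f s\n", G[50][50], secs);
            return 0;
        }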

