
Intel Broadwell: GCC 4.9 vs. LLVM Clang 3.5 Compiler Benchmarks


  • #11
    I await the day the entire Linux stack (from kernel to libraries and applications) can be built on Clang.



    • #12
      Scimark

      Clang's -O3 now includes limited loop unrolling, while GCC needs -funroll-loops. I tried compiling with -O3 -march=native -funroll-loops --param max-unroll-times=2 on my Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz:

      GCC 4.8 with unrolling:

      Using 2.00 seconds min time per kernel.
      Composite Score: 2016.48
      FFT Mflops: 1516.51 (N=1024)
      SOR Mflops: 1155.08 (100 x 100)
      MonteCarlo: Mflops: 440.49
      Sparse matmult Mflops: 1871.72 (N=1000, nz=5000)
      LU Mflops: 5098.63 (M=100, N=100)

      GCC 4.8 without unrolling:
      Using 2.00 seconds min time per kernel.
      Composite Score: 1936.17
      FFT Mflops: 1430.86 (N=1024)
      SOR Mflops: 1118.35 (100 x 100)
      MonteCarlo: Mflops: 438.66
      Sparse matmult Mflops: 2035.70 (N=1000, nz=5000)
      LU Mflops: 4657.27 (M=100, N=100)


      Clang 3.5:

      Using 2.00 seconds min time per kernel.
      Composite Score: 1877.84
      FFT Mflops: 1071.73 (N=1024)
      SOR Mflops: 1350.53 (100 x 100)
      MonteCarlo: Mflops: 452.64
      Sparse matmult Mflops: 2180.01 (N=1000, nz=5000)
      LU Mflops: 4334.29 (M=100, N=100)

      So GCC seems to win in both cases, more so with unrolling. The difference in SOR is interesting; looking at the assembly, I do not see why GCC's loop should run slower. It is a very trivial benchmark:
      Code:
              for (p=0; p<num_iterations; p++)
              {
                  for (i=1; i<Mm1; i++)
                  {
                      Gi = G[i];
                      Gim1 = G[i-1];
                      Gip1 = G[i+1];
                      for (j=1; j<Nm1; j++)
                          Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1]
                                      + Gi[j+1]) + one_minus_omega * Gi[j];
                  }
              }



      • #13
        Originally posted by hubicka View Post
        So GCC seems to win in both cases, more so with unrolling. The difference in SOR is interesting; looking at the assembly, I do not see why GCC's loop should run slower. It is a very trivial benchmark:
        Code:
                for (p=0; p<num_iterations; p++)
                {
                    for (i=1; i<Mm1; i++)
                    {
                        Gi = G[i];
                        Gim1 = G[i-1];
                        Gip1 = G[i+1];
                        for (j=1; j<Nm1; j++)
                            Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j-1] 
                                        + Gi[j+1]) + one_minus_omega * Gi[j];
                    }
                }
        I'm interested. Could you post the asm of this function from both compilers, or the whole C file?

        Btw, http://gcc.godbolt.org/#



        • #14
          Originally posted by gens View Post
          I'm interested. Could you post the asm of this function from both compilers, or the whole C file?

          Btw, http://gcc.godbolt.org/#
          Scimark can be downloaded here http://math.nist.gov/scimark2/

          GCC code (with unrolling) http://pastebin.com/mwMQEbzy

          LLVM code http://pastebin.com/D36mA169



          • #15
            Originally posted by hubicka View Post
            Scimark can be downloaded here http://math.nist.gov/scimark2/

            GCC code (with unrolling) http://pastebin.com/mwMQEbzy

            LLVM code http://pastebin.com/D36mA169
            I was under the impression that GCC's -funroll-loops was unstable and produced unpredictable code. I'm not sure if that's still true, but it was for a long time. At least the Gentoo documentation still advises against using it.



            • #16
              Originally posted by duby229 View Post
              I was under the impression that GCC's -funroll-loops was unstable and produced unpredictable code. I'm not sure if that's still true, but it was for a long time. At least the Gentoo documentation still advises against using it.
              I do not see any wrong-code bugs for -funroll-loops in GCC 4.9. The unroller is not a particularly difficult pass and it has not really changed much recently. It tends to expose some bugs related to aliasing. It is enabled by default with -fprofile-use; it is disabled at -O3 more or less for historical reasons (from the time when -O3 was -O2 + automatic inlining). In fact I want to raise a discussion about enabling it for GCC 5/GCC 6. The main problem is that the default unrolling limits are probably a bit too aggressive even for -O3, so we may want to come up with conservative defaults. This needs a bit of benchmarking to be done.



              • #17
                Could it be that the vfmadd132sd instruction GCC uses here, as opposed to LLVM, is slow as hell?



                • #18
                  Another possibility: in the innermost loop there is a 12 vs. 8 difference in the memory-access instruction count.



                  • #19
                    Originally posted by mirza View Post
                    Another possibility: in the innermost loop there is a 12 vs. 8 difference in the memory-access instruction count.
                    Yes, that nailed it. It is not unrolling but predictive commoning that makes the difference here. I just checked GCC 5 and it also gets down to 8 accesses:

                    Code:
                            .L8:
                                vmulsd %xmm2, %xmm4, %xmm3
                                vmovsd 8(%r10,%r9), %xmm2
                                addl $2, %r8d
                                vaddsd 8(%r11,%r9), %xmm2, %xmm2
                                vmovsd 16(%rax,%r9), %xmm6
                                vaddsd %xmm0, %xmm2, %xmm0
                                vmovsd 24(%rax,%r9), %xmm2
                                vaddsd %xmm6, %xmm0, %xmm0
                                vmulsd %xmm6, %xmm4, %xmm6
                                vfmadd132sd %xmm5, %xmm3, %xmm0
                                vmovsd %xmm0, 8(%rax,%r9)
                                vmovsd 16(%r11,%r9), %xmm1
                                vaddsd 16(%r10,%r9), %xmm1, %xmm1
                                vaddsd %xmm1, %xmm0, %xmm0
                                vaddsd %xmm0, %xmm2, %xmm1
                                vfmadd132sd %xmm5, %xmm6, %xmm1
                                vmovsd %xmm1, 16(%rax,%r9)
                                addq $16, %r9
                                vmovapd %xmm1, %xmm0
                                cmpq %r9, %rsi
                                jne .L8

                    vfmadd132sd is preferred over the multiply+add sequence.



                    • #20
                      That code looks OK; GCC 5 will probably be a few percent faster than LLVM on this test (a guess).

