
GCC 7.0 vs. LLVM Clang 4.0 Performance (January 2017)

  • #21
    Originally posted by Marc Driftmeyer View Post
    Update Bullet to 2.85. Your test is ancient.
    Last I checked (probably about six months ago), the benchmark built-ins were broken with their latest release at the time.
    Michael Larabel
    https://www.michaellarabel.com/

    Comment


    • #22
      Originally posted by Steffo View Post
      It seems GCC will soon be irrelevant. All new tools are based on LLVM, and now it can beat GCC on many benchmarks.
      Which is a shame; rms was, for once, wrong when deciding not to make GCC modular. (See e.g.: https://gcc.gnu.org/ml/gcc/2007-11/msg00193.html)

      Comment


      • #23
        One thing that could explain why Clang performs so much better than GCC in some cases (by factors of two or three) is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum

        for(int i=0; i<n; i++) sum += x;

        Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that if you don't unroll, the loop is latency bound. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.

        The option `-funroll-loops` unrolls the loop eight times but does not break the dependency chain, which is really stupid. This is one of the main reasons that GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
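
        To make the point concrete, here is a minimal C sketch of the transformation described above, done by hand (the function names are mine, not from the thread): a plain single-accumulator reduction versus the four-independent-accumulator version that the compiler would emit under -Ofast.

        ```c
        #include <stdio.h>

        /* Single accumulator: every add depends on the previous one,
         * so the loop is bound by floating-point add latency. */
        float sum_serial(const float *x, int n) {
            float sum = 0.0f;
            for (int i = 0; i < n; i++) sum += x[i];
            return sum;
        }

        /* Four independent partial sums: the adds no longer form one
         * dependency chain, so several can be in flight per cycle. */
        float sum_unrolled(const float *x, int n) {
            float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
            int i = 0;
            for (; i + 4 <= n; i += 4) {
                s0 += x[i];
                s1 += x[i + 1];
                s2 += x[i + 2];
                s3 += x[i + 3];
            }
            float sum = (s0 + s1) + (s2 + s3); /* combine partials */
            for (; i < n; i++) sum += x[i];    /* remainder elements */
            return sum;
        }

        int main(void) {
            float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
            printf("%.1f %.1f\n", sum_serial(x, 8), sum_unrolled(x, 8));
            return 0;
        }
        ```

        Note that the unrolled version reassociates the additions, which changes rounding for float; that is exactly why the compiler refuses to do it without -Ofast.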

        Comment


        • #24
          Originally posted by atomsymbol

          In my opinion, from a certain viewpoint both gcc and clang are flawed.

          Code:
          $ cat a.c
          #include <stddef.h>
          #include <stdlib.h>
          int main() {
              float sum = random();
              const float x = random();
              const size_t n = random();
              for (size_t i = 0; i < n; i++) sum += x;
              return sum;
          }
          $ clang --version
          clang version 3.9.1 (tags/RELEASE_391/final)
          Target: x86_64-pc-linux-gnu
          $ clang -O3 -S a.c
          $ cat a.s
          .LBB0_9:
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addss %xmm0, %xmm1
                  addq $8, %rdx
                  jne .LBB0_9
          $ clang -O3 -S a.c -march=native    (or: -march=bdver3)
          $ cat a.s
          .LBB0_4:
                  vaddss %xmm1, %xmm0, %xmm1
                  decq %rax
                  jne .LBB0_4
          More options can be tried at https://godbolt.org
          Note that I said "for float you need -Ofast". Floating-point math is not associative; in order to vectorize reductions, you have to tell the compiler to assume associative floating-point math. I think ICC does this by default, but GCC and Clang do not (Clang is even stricter than GCC, because GCC allows FMA contraction by default but Clang does not).
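
          A tiny self-contained illustration of why the compiler needs explicit permission to reassociate: the same three floats summed in two different orders give different results.

          ```c
          #include <stdio.h>

          int main(void) {
              /* Float addition is not associative: regrouping the sum
               * changes the result, so a compiler may not split a float
               * reduction into partial sums without -Ofast/-ffast-math. */
              float a = 1e8f, b = -1e8f, c = 1.0f;
              float left  = (a + b) + c;  /* a and b cancel first: 1.0 */
              float right = a + (b + c);  /* c is absorbed into b: 0.0 */
              printf("%.1f %.1f\n", left, right);
              return 0;
          }
          ```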

          Comment


          • #25
            Originally posted by Michael View Post

             With PTS7 git: phoronix-test-suite winners-and-losers <result file>. If there are any other metrics/stats you would find interesting, I can easily add them. If it ends up adding more stuff, I will probably rename it from winners-and-losers.
             Michael, it seems that e.g. EP-DGEMM spends pretty much all of its runtime in the dgemm_ function from libblas. Do you use different libblas versions (built by different compilers) for this testing?
            Last edited by JakubJelinek; 01 February 2017, 09:31 AM.

            Comment


            • #26
              Originally posted by JakubJelinek View Post

              Michael, it seems that e.g. EP-DGEMM spends pretty much all of its runtime in the dgemm_ function from libblas. Do you use different libblas versions (built by different compilers) for this testing?
              I'm asking especially because dgemm_ is written in Fortran, so it isn't clear whether, for the Clang case, it was built with gfortran, flang, the PGI frontend for LLVM, or something else.

              Comment


              • #27
                Why is LTO not yet among the standard optimization flags you use? Isn't it robust enough now (and light enough on memory) that it's feasible on both compilers (certainly on the LLVM side)?

                Comment


                • #28
                  Originally posted by zboson View Post
                  One thing that could explain why Clang performs so much better than GCC in some cases (by factors of two or three) is dependency chains and loop unrolling. One example of a dependency chain is a reduction sum

                  for(int i=0; i<n; i++) sum += x;

                  Clang will unroll the loop into four independent sums (for float you need -Ofast), whereas GCC does not unroll at all. The problem is that if you don't unroll, the loop is latency bound. You have to unroll the loop into independent operations to get the full throughput. ICC, incidentally, unrolls twice from what I can tell.

                  The option `-funroll-loops` unrolls the loop eight times but does not break the dependency chain, which is really stupid. This is one of the main reasons that GCC's auto-vectorization is not good enough and why you still need intrinsics with GCC.
                  But Michael's tests for that sort of code (all the matrix stuff) are also sub-optimal. My guess is he is not using the version of LLVM that has Polly built-in, and is not using Graphite on GCC. Both of these analyze certain types of deeply nested loops and restructure them for better performance (better vectorization, better parallelization, better memory access patterns). It would be nice to see both these compilers run at turbo speed (which means also using LTO) rather than just at their basic level.

                  (It would be even nicer if the testing infrastructure also supported PGO. Apple is clearly moving towards making this more-or-less automatic in XCode, though only the outline is there right now, not the full picture. Maybe at this year's WWDC? Basically, along with your standard unit tests and performance tests run after each compile, you'll be able to provide "training" tests that create profiling data for subsequent PGO. I'd guess Dev Studio already has a work flow like this. Eclipse probably, but I have no idea.)
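
                  For reference, here is a sketch of what such "turbo" invocations might look like. The flags are real GCC/Clang options; app.c is a placeholder, and -mllvm -polly only does anything if LLVM was built with Polly included.

                  ```shell
                  # GCC with Graphite loop-nest optimization plus LTO:
                  GCC_FLAGS="-O3 -march=native -floop-nest-optimize -flto"

                  # Clang with Polly (requires an LLVM build that includes Polly) plus LTO:
                  CLANG_FLAGS="-O3 -march=native -mllvm -polly -flto"

                  # PGO on GCC is a two-step build (training run, then rebuild):
                  #   gcc $GCC_FLAGS -fprofile-generate app.c -o app && ./app
                  #   gcc $GCC_FLAGS -fprofile-use app.c -o app
                  echo "$GCC_FLAGS"
                  echo "$CLANG_FLAGS"
                  ```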

                  Comment


                  • #29
                    Something broke LLVM/Clang over the weekend: a git snapshot of llvm/clang from 08.02.2017 is now easily outperformed by gcc-6 in _all_ scimark2 tests...

                    Comment
