GCC vs. LLVM Clang Is Mixed On The Ivy Bridge Extreme

  • GCC vs. LLVM Clang Is Mixed On The Ivy Bridge Extreme

    Phoronix: GCC vs. LLVM Clang Is Mixed On The Ivy Bridge Extreme

    Our latest Linux benchmarks from the Intel Core i7 4960X Ivy Bridge Extreme Edition processor are compiler tests on this $1000 USD processor. The last two stable releases of GCC and LLVM's Clang C/C++ compilers were compared: GCC 4.7.2, GCC 4.8.1, LLVM Clang 3.2, and LLVM Clang 3.3.

    http://www.phoronix.com/vr.php?view=19192

  • #2
    The Botan tests seem to be missing "-march=native".


    • #3
      Originally posted by s_j_newbury
      The Botan tests seem to be missing "-march=native".
      What's really bad about the Botan tests is that they use -O2, which makes them worthless for comparing compiler-vs-compiler code performance. As I've explained to Michael over and over, there are no rules for which optimizations a compiler should enable at -O2. If compiler A decides to enable more optimizations than compiler B at that level, A will win, which says nothing about how the compilers compare at their HIGHEST optimization level. That is the level you use when you want the FASTEST code, which again is what all tests here except the 'build time' test were measuring.

      Again, this is why you compare the compilers at their highest optimization level (-O3, and when it was used GCC won all performance tests): at this level the compilers compete on fair ground, which is to create the fastest binary they can. Now, -O2 can be interesting, as -O2 sometimes beats -O3 when optimization heuristics fail, but only in the context of actually having -O3 results to compare with.

      I don't know why Michael persists in doing this, unless he is consciously trying to cook the results in order to get some pointless Clang/LLVM wins.


      • #4
        I don't think Michael is cooking the books.
        Perhaps he went with -O2 because most people use it as the default?


        • #5
          Originally posted by steverweber
          I don't think Michael is cooking the books.
          Perhaps he went with -O2 because most people use it as the default?
          He is comparing the optimization of two compilers. Again, -O2 on one compiler has nothing in common with -O2 on another beyond the name and the overall goal. It's just a loose balance between optimization and compilation time that the respective compiler developers chose for -O2 on THEIR compiler, so it's absolutely pointless to compare those levels when measuring the best performance a compiler can get out of a certain piece of code (which is what all tests but one did here).

          Meanwhile, -O3 is the standard highest optimization level for both compilers. Here the compilers say 'never mind compile time, enable all the appropriate optimizations to create the fastest code we can', which means this option is the only one worth using when comparing code performance (again, what is done here), UNLESS you compare both -O2 and -O3 to see if this particular test is one of the few where -O2 beats -O3.

          So as it stands, this test is pointless in terms of showing which compiler can generate the fastest code for Botan, as we won't know that until he runs this test using -O3.

          Unlike you, I'm leaning more and more towards Michael's well-known Clang/LLVM bias being the reason for throwing in tests using -O2, as I can't see any other logical explanation for why he continues to do this.

          Now the Botan tests are worthless from a compiler-vs-compiler performance standpoint, which is sad, because it would be interesting to see where Clang/LLVM stands against GCC in those tests as well.


          • #6
            Don't forget, for gcc there's also:
            -Ofast
            Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.


            • #7
              s_j_newbury, -Ofast sounds neat but unsafe as a default.

              If -O3 does not create unsafe code, I have to side with XorEaxEax.

              It would be nice to see the results of all 3 optimization levels, with benchmarks of both performance and compile time.
              Last edited by steverweber; 10-13-2013, 12:16 PM.
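
A comparison across all three levels, as suggested above, amounts to building one binary per compiler and level. The following is a minimal Python sketch that only generates the command lines. It assumes the three levels meant are -O2, -O3 and -Ofast; the source file name `bench.c` and the output naming scheme are placeholders, not anything Phoronix uses.

```python
from itertools import product

COMPILERS = ["gcc", "clang"]
OPT_LEVELS = ["-O2", "-O3", "-Ofast"]  # assumed to be the "3 levels" meant


def build_commands(source="bench.c"):
    """Return one compile command (as an argument list) per compiler/level pair.

    `source` and the output names are hypothetical placeholders.
    """
    commands = []
    for compiler, level in product(COMPILERS, OPT_LEVELS):
        output = f"bench-{compiler}{level}"  # e.g. bench-gcc-O3
        commands.append([compiler, level, source, "-o", output])
    return commands


if __name__ == "__main__":
    for command in build_commands():
        print(" ".join(command))
```

Timing each build step gives the compile-time figure, and timing each resulting binary gives the performance figure.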


              • #8
                Originally posted by s_j_newbury
                Don't forget, for gcc there's also:
                -Ofast
                Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
                Xcode 5 also has an -Ofast (whose main effect is fast math, though it also changes one or two other settings that I forget). I don't know if that has made it into mainline LLVM yet.

                [Xcode's defaults for new projects are also now very aggressive about assuming that pointers don't alias. This is likely an Apple developer mandate (they provide guidelines on how to write code that doesn't rely on aliased pointers, e.g. by using unions rather than pointer casting), and mainline LLVM is probably not as aggressive. So this is an interesting point, but may not be relevant to the Phoronix community which, I am guessing, is more interested in having crappy old code compile properly than in following a company mandate about how to write better-performing new code.]

                LLVM (and so Xcode) also has link-time (i.e. whole-program) optimization, enabled by -O4. I imagine GCC has the same. It seems only sensible that this should be activated when the two compilers are tested against each other. Apple slides showed that LTO made a substantial difference (5% to 20%) in performance, but of course that is for real-world code split over a large number of files; it may have much less impact on these sorts of microbenchmarks.
                What's not clear to me is the extent to which either LLVM or GCC has fully optimized its LTO pass. Apple had (PPC-specific) tools fifteen years ago that could run whole-program optimization and rearrange the function layout so that functions that called each other were packed together (and so took up less TLB coverage and shared overlapping cache lines). They used heuristics in the absence of anything better, but could be run with a profiling pass to get a better understanding of the hot call chains. But as far as I know, the LLVM LTO does not (yet?) do this sort of thing, and I have zero idea about GCC.
                You can do even better with LTO (either heuristic or profile-directed). Rather than rearranging and repacking the code based on functions, you can do so based on basic blocks, so that if(!error){...}else{/*handle error*/...} moves the error-handling code far away to the end of your binary. Now your actual binary (on disk and in RAM) looks like a series of basic blocks that jump between each other, not a series of functions. It looks weird, yes, and is more difficult to reverse-engineer from the binary, but it is perfectly legit and not in any way fragile. Obviously this gives you even better utilization of both your I$ and your TLB. This has been done academically, but I don't know if any commercial products (ICC? Dev Studio?) use it. You can also do the same sort of repacking with global/static data layout, to try to get data that is frequently used together onto the same cache line.


                • #9
                  Oh, one thing to add to my earlier comment.
                  LLVM (and maybe GCC, but I don't know there) will not automatically vectorize many FP loops if fast-math is not enabled because getting the loop to vectorize requires re-ordering FP operations. This means that using fast-math, if your code allows it, can affect performance by quite a bit more than you might imagine.
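
The reason re-ordering needs permission is that floating-point addition is not associative, so summing in a different order can change the result. The following Python sketch (Python floats are IEEE 754 doubles) contrasts a strict in-order sum with a two-lane sum that mimics what a 2-wide vectorized reduction would do. The function names and input values are made up for illustration; this is not compiler output.

```python
def sum_left_to_right(values):
    """Add the values strictly in order, as a loop without fast-math must."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_two_lanes(values):
    """Add even- and odd-indexed values separately, then combine the lanes.

    This mimics the re-ordering a 2-wide vectorized reduction performs.
    """
    lane0 = 0.0
    lane1 = 0.0
    for index, value in enumerate(values):
        if index % 2 == 0:
            lane0 += value
        else:
            lane1 += value
    return lane0 + lane1


values = [1e16, 1.0, -1e16, 1.0]
print(sum_left_to_right(values))  # 1.0: the first 1.0 is lost when added to 1e16
print(sum_two_lanes(values))      # 2.0: the large values cancel within lane 0
```

Both answers are valid roundings of some evaluation order, but they differ, which is why a compiler may only pick the vectorized order when told that exact ordering does not matter.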


                  • #10
                    Originally posted by XorEaxEax
                    What's really bad about the Botan tests is that they use -O2, which makes them worthless for comparing compiler-vs-compiler code performance. As I've explained to Michael over and over, there are no rules for which optimizations a compiler should enable at -O2. If compiler A decides to enable more optimizations than compiler B at that level, A will win, which says nothing about how the compilers compare at their HIGHEST optimization level. That is the level you use when you want the FASTEST code, which again is what all tests here except the 'build time' test were measuring.

                    Again, this is why you compare the compilers at their highest optimization level (-O3, and when it was used GCC won all performance tests): at this level the compilers compete on fair ground, which is to create the fastest binary they can. Now, -O2 can be interesting, as -O2 sometimes beats -O3 when optimization heuristics fail, but only in the context of actually having -O3 results to compare with.

                    I don't know why Michael persists in doing this, unless he is consciously trying to cook the results in order to get some pointless Clang/LLVM wins.
                    Where in the wild do we actually see -O3, though?


                    • #11
                      Originally posted by WorBlux
                      Where in the wild do we actually see -O3, though?
                      Practically all performance-oriented software: encoders, games, archivers/compressors, emulators, 3D renderers, etc.


                      • #12
                        Originally posted by name99
                        LLVM (and so Xcode) also has link-time (i.e. whole-program) optimization, enabled by -O4. I imagine GCC has the same.
                        GCC activates link-time optimization with -flto (which Clang also supports as a flag); -O4 is not recognized at all by GCC, afaik.

                        Originally posted by name99
                        Apple slides showed that LTO made a substantial difference (5% to 20%) in performance, but of course that is for real-world code split over a large number of files; it may have much less impact on these sorts of microbenchmarks.
                        Yes, it's extremely code-base dependent. You can achieve basically the same effect by manually declaring non-exported functions as static (or, like SQLite did, by joining all source files into one big file before compiling). Personally I've seen little performance gain from LTO on my own code and the code I've benchmarked, but again it really depends on the code in question. The binary also often ends up quite a bit smaller with LTO, which is of course nice.

                        Originally posted by name99
                        What's not clear to me is the extent to which either LLVM or GCC has fully optimized its LTO pass. Apple had (PPC-specific) tools fifteen years ago that could run whole-program optimization and rearrange the function layout so that functions that called each other were packed together (and so took up less TLB coverage and shared overlapping cache lines).
                        Well, GCC has a '-fwhole-program' option which enables more aggressive interprocedural optimizations (as in moving blocks of code around to improve cache use, eliminating/consolidating code blocks, etc.).

                        Originally posted by name99
                        but could be run with a profiling pass to get a better understanding of the hot call chains. But as far as I know, the LLVM LTO does not (yet?) do this sort of thing, and I have zero idea about GCC.
                        Yes, this (profile-guided optimization) is by far the most effective optimization I've used outside of the -On levels. GCC implements it as -fprofile-generate/-fprofile-use and it can deliver some great improvements: typically I get ~4-8% on performance-oriented code, sometimes up to 20%.

                        I know there was a Google Summer of Code project to implement profile-guided optimization in LLVM, but I haven't heard anything about it since, so I fear it didn't amount to anything.
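
For readers who have not used the -fprofile-generate/-fprofile-use pair mentioned above, the workflow has three steps: an instrumented build, a training run, and a rebuild that consumes the recorded profile. The Python sketch below only assembles the three command lines. The two flags are the GCC options named in the post; the file names, the -O3 base level and the `training_args` parameter are assumptions made for illustration.

```python
def pgo_steps(source="app.c", binary="./app", training_args=()):
    """Return the three commands of a GCC profile-guided build, in order.

    `source`, `binary` and `training_args` are hypothetical placeholders.
    """
    return [
        # 1. Instrumented build that records profile data when it is run.
        ["gcc", "-O3", "-fprofile-generate", source, "-o", binary],
        # 2. Training run on representative input; this writes the profile.
        [binary, *training_args],
        # 3. Rebuild, letting the optimizer use the recorded profile.
        ["gcc", "-O3", "-fprofile-use", source, "-o", binary],
    ]


if __name__ == "__main__":
    for step in pgo_steps(training_args=("--input", "sample.dat")):
        print(" ".join(step))
```

The gain depends heavily on how representative the training input in step 2 is of real workloads.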


                        • #13
                          Originally posted by name99
                          Oh, one thing to add to my earlier comment.
                          LLVM (and maybe GCC, but I don't know there) will not automatically vectorize many FP loops if fast-math is not enabled because getting the loop to vectorize requires re-ordering FP operations. This means that using fast-math, if your code allows it, can affect performance by quite a bit more than you might imagine.
                          I did a quick rundown test on C-Ray using GCC and Clang with and without -ffast-math:

                          GCC version: 4.8.1 20130725
                          Clang version: 3.3 (tags/RELEASE_33/final)
                          Arch Linux 64-bit, Core i5
                          Benchmark: cat scene | ./c-ray-mt -t 4 -s 7500x3500 > foo.ppm

                          Results are in milliseconds and are the average of 5 benchmark runs (excluding a warm-up run).
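
The averaging scheme described above (one untimed warm-up run, then the mean of the timed runs) can be sketched in Python as follows. This is not the script used for the numbers in this post; the function name, its parameters and the stand-in workload are all assumptions.

```python
import time


def average_ms(run, repeats=5, warmups=1, clock=time.perf_counter):
    """Call `run` warmups + repeats times and average only the timed repeats."""
    for _ in range(warmups):
        run()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = clock()
        run()
        samples.append((clock() - start) * 1000.0)  # seconds -> milliseconds
    return sum(samples) / len(samples)


if __name__ == "__main__":
    # Stand-in workload; a real run would launch the benchmark binary instead.
    print(average_ms(lambda: sum(range(100_000))))
```

To time an actual binary, `run` could wrap a `subprocess.run` call that launches the benchmark command.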

                          gcc -O3
                          5840

                          gcc -O3 -funroll-loops
                          5704

                          gcc -O3 -ffast-math -funroll-loops
                          4374

                          gcc -Ofast -funroll-loops
                          4368

                          gcc -Ofast -funroll-loops -march=native
                          4351

                          On GCC we can see that -ffast-math greatly improves the result. Now let's look at Clang:

                          clang -O3
                          6403

                          clang -O3 -funroll-loops
                          6396

                          clang -O3 -ffast-math -funroll-loops
                          7137

                          clang -Ofast -funroll-loops
                          7122

                          clang -Ofast -funroll-loops -march=native
                          7153

                          On Clang, however, we see that -ffast-math _degrades_ performance markedly on C-Ray. Had Michael used it for his Phoronix C-Ray test, Clang would have come out looking MUCH worse than it does now, since GCC got a great boost from -ffast-math.

                          Apart from that, it seems that -funroll-loops does nothing performance-wise on Clang, and the same goes for -march=native.
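
To put numbers on the -ffast-math effect, the relative change can be computed directly from the timings listed above. The Python sketch below compares the "-O3 -funroll-loops" runs with and without -ffast-math for each compiler; the helper names are made up for illustration.

```python
# Timings in milliseconds, copied from the results above.
TIMES_MS = {
    "gcc": {"-O3 -funroll-loops": 5704, "-O3 -ffast-math -funroll-loops": 4374},
    "clang": {"-O3 -funroll-loops": 6396, "-O3 -ffast-math -funroll-loops": 7137},
}


def percent_change(before_ms, after_ms):
    """Relative change in run time; negative means the run got faster."""
    return (after_ms - before_ms) / before_ms * 100.0


def fast_math_effect(compiler):
    """Change in run time from adding -ffast-math to -O3 -funroll-loops."""
    times = TIMES_MS[compiler]
    return percent_change(times["-O3 -funroll-loops"],
                          times["-O3 -ffast-math -funroll-loops"])


for compiler in TIMES_MS:
    print(f"{compiler}: {fast_math_effect(compiler):+.1f}%")
```

With these figures, GCC's run time drops by about 23% while Clang's rises by about 12%.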


                          • #14
                            Just noting, gcc accepts -O[any positive number], it's just that high numbers get clamped to 3. It's been this way for ages, gcc 4.2 accepts -O666 just fine.
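
The clamping described above can be modelled in a few lines of Python. This is only a toy model of the behaviour as stated in this post, not GCC's actual option parser, and the function name is made up.

```python
def effective_opt_level(flag):
    """Map a numeric -O<n> flag to the level it effectively selects.

    Models only the clamping described in the post; it is not GCC's
    actual option parser and ignores -O, -Os, -Og and -Ofast.
    """
    if not flag.startswith("-O") or not flag[2:].isdigit():
        raise ValueError(f"not a numeric -O flag: {flag!r}")
    return min(int(flag[2:]), 3)


if __name__ == "__main__":
    for flag in ("-O0", "-O2", "-O3", "-O666"):
        print(flag, "->", effective_opt_level(flag))
```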


                            • #15
                              Originally posted by curaga
                              Just noting, gcc accepts -O[any positive number], it's just that high numbers get clamped to 3. It's been this way for ages, gcc 4.2 accepts -O666 just fine.
                              Good to know. I thought it just ignored anything above -O3 and fell back to the default -O0. Clamping at -O3 makes more sense, though, as the user likely wanted aggressive optimization when asking for a value higher than -O3.
