LLVM Clang 12 Leading Over GCC 11 Compiler Performance On Intel Xeon Scalable Ice Lake


  • #11
    Originally posted by jrch2k8 View Post
    ... which is why it is very common that most projects take the "Good Enough" approach instead of the "Optimal" approach
    There is no single optimal approach when you have countless x86-compatible CPUs. You do indeed have to make a compromise and choose the "Good Enough" approach, as you have put it. Otherwise you would need to produce a binary for every CPU and memory configuration out there, because even main memory speed and timings affect your code's performance, and the compiler simply does not know these during optimisation.



    • #12
      Originally posted by ezekrb5 View Post
      I love competition in this space. Glad to see we have multiple options. I'm still on GCC and probably won't move away from it anytime soon. But good for the Clang guys!
      Clang now has support for LTO in the Linux kernel, while GCC does not. Work for GCC was done a long time ago by Andi Kleen, but it was not accepted into the kernel back then. Thanks to Google one can now compile the Linux kernel with LTO optimisation for x86 and Arm, which is great news, and it makes Clang the only option for an LTO kernel. However, just checking now, I see that Andi Kleen is working on it again, so we might see it coming for GCC, too:

      https://git.kernel.org/pub/scm/linux...inux-misc.git/

      I have been using Clang for a while now on Arm myself, but I still prefer GCC as a stable workhorse, because many projects compile out of the box with GCC, while with Clang this is not always the case (though it has gotten easier).
      Last edited by sdack; 04 June 2021, 02:15 PM.



      • #13
        Originally posted by discordian View Post
        Is -O3 actually faster than -O2? Every time I test this it's even or slower. (But I actually test this on power-efficient cores like Arm and Apollo Lake.)
        As with anything, it depends.

        Back in 2017, when I was working on a library project written in C++, we built it with GCC for several different versions of Linux. On my Fedora development machine -O3 was 5% faster than -O2, and adding Profile Guided Optimization on top of that made -O3 15% faster than straight -O2. We didn't use LTO; instead we used an older technique that combines C++ source files into bigger translation units.
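
        For anyone unfamiliar with that technique (sometimes called a "unity" or "jumbo" build), the idea is simply to #include several source files into one large translation unit so the optimizer can see across them. A minimal sketch with made-up file names, not our actual project layout:

        // widget_unity.cpp -- illustrative "unity" translation unit.
        // Each .cpp below is still an ordinary source file; pulling them into
        // one unit lets the compiler inline and fold across what would
        // otherwise be separate object files (a pre-LTO way to get similar
        // cross-file optimization).
        #include "widget_core.cpp"
        #include "widget_render.cpp"
        #include "widget_io.cpp"
        // Built as one unit, e.g.: g++ -O3 -c widget_unity.cpp

        The downsides are longer rebuilds when any one file changes and the occasional clash between file-local names.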

        I find that -O3 works especially well in combination with LTO and PGO.

        It does tend to make the code larger, which can be a problem for mobile CPUs with small caches. We were targeting Xeons. Your mileage may vary.



        • #14
          Originally posted by sdack View Post
          There is no single optimal approach when you have countless x86-compatible CPUs. You do indeed have to make a compromise and choose the "Good Enough" approach, as you have put it. Otherwise you would need to produce a binary for every CPU and memory configuration out there, because even main memory speed and timings affect your code's performance, and the compiler simply does not know these during optimisation.
          Yeap, you are right, but there are things you can do that are generic enough to let -O3 optimize the hell out of the code while still running on most CPUs (not as efficient as going full per-CPU manual optimization, as you correctly imply). For example: find ways to never branch inside a loop (with enough iterations), or split the loop if you do have to test; verify your types are aligned; never allocate inside a loop if it is not absolutely needed; be smart about thread lifetimes; use templates only when they are absolutely needed (overusing them is a compulsion many noob C++ devs have to prove their code is super C++-y, and in most cases it hurts performance and the compiler like hell); and remember that STL implementations are generic, not performant, even though most noob C++ devs think they are both (spoiler alert: in 95% of cases they are not; use custom allocators if you need performance). Etc., etc.
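
          To make one of those points concrete, here is a minimal sketch (names and sizes made up) of hoisting an allocation out of a hot loop so -O3 gets a tight, allocation-free body to work with:

          #include <string>
          #include <vector>

          // Copies, and for long lines allocates, a fresh buffer every iteration.
          void process_slow(const std::vector<std::string>& lines) {
              for (const auto& line : lines) {
                  std::string buffer = line;   // per-iteration allocation/copy
                  // ... work on buffer ...
              }
          }

          // One buffer, reused: the loop body stays free of allocator calls.
          void process_fast(const std::vector<std::string>& lines) {
              std::string buffer;
              buffer.reserve(256);             // grow once up front (size is arbitrary)
              for (const auto& line : lines) {
                  buffer.assign(line);         // reuses existing capacity when it fits
                  // ... work on buffer ...
              }
          }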



          • #15
            Originally posted by jrch2k8 View Post
            Yeap, you are right, but there are things you can do that are generic enough to let -O3 optimize the hell out of the code while still running on most CPUs (not as efficient as going full per-CPU manual optimization, as you correctly imply). For example: find ways to never branch inside a loop (with enough iterations), or split the loop if you do have to test; verify your types are aligned; never allocate inside a loop if it is not absolutely needed; be smart about thread lifetimes; use templates only when they are absolutely needed (overusing them is a compulsion many noob C++ devs have to prove their code is super C++-y, and in most cases it hurts performance and the compiler like hell); and remember that STL implementations are generic, not performant, even though most noob C++ devs think they are both (spoiler alert: in 95% of cases they are not; use custom allocators if you need performance). Etc., etc.
            Well, the consensus with GCC (and Clang too, I believe) is that -O and -O2 do not trade code size for speed, whereas -O3 does. So the moment you use -O3, you can run into the problem that a small instruction cache, high penalties for cache misses, or other costs of the trade-off shoot you in the foot. In your example, not branching within a loop can mean that the loop code gets duplicated and the branch is moved outside of it, so the code now branches between two loops instead of branching within a single loop. When this happens inside nested loops, the effect of the trade-off is amplified and can backfire noticeably. So -O3 always requires greater care than -O2. It also amplifies the differences between GCC and Clang in how far each compiler takes the trade-off for a given target, but the extreme results in some of the benchmarks shown here hint at some other issue. Maybe the GCC devs just stopped caring for Intel and are all on the AMD bandwagon now. *lol*
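
            For illustration, the duplication described above roughly corresponds to what GCC calls loop unswitching (-funswitch-loops, one of the passes -O3 enables on top of -O2). A hand-written sketch of the before/after shape, not actual compiler output:

            // Before: one loop, the branch is evaluated on every iteration.
            void scale(float* out, const float* in, int n, bool invert) {
                for (int i = 0; i < n; ++i) {
                    if (invert)
                        out[i] = 1.0f / in[i];
                    else
                        out[i] = in[i];
                }
            }

            // After unswitching: the branch is hoisted and the loop is duplicated.
            // Cheaper per iteration, but roughly twice the loop code now has to
            // fit in the instruction cache.
            void scale_unswitched(float* out, const float* in, int n, bool invert) {
                if (invert) {
                    for (int i = 0; i < n; ++i)
                        out[i] = 1.0f / in[i];
                } else {
                    for (int i = 0; i < n; ++i)
                        out[i] = in[i];
                }
            }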



            • #16
              Originally posted by discordian View Post
              Is -O3 actually faster than -O2? Every time I test this it's even or slower. (But I actually test this on power-efficient cores like Arm and Apollo Lake.)
              I'd say that in the vast majority of cases, yes, -O3 beats -O2 on both Clang/LLVM and GCC. There will of course be outliers, as optimizations can backfire if the compiler 'guesstimates' wrong, and since -O3 enables more optimizations than -O2, the chance of an optimization 'misfiring' increases. One solution to this problem is PGO (profile-guided optimization), which gives the compiler runtime data on which to base its optimization choices, and barring a compiler bug I can't see a scenario where that would not generate faster code.
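
              As a rough sketch of what the PGO cycle looks like with GCC (the program and file names are made up; only the flags are real, and Clang has an equivalent workflow around -fprofile-instr-generate and llvm-profdata):

              // hot.cpp -- trivial stand-in program for a PGO run-through.
              //
              //   g++ -O3 -fprofile-generate hot.cpp -o hot   // instrumented build
              //   ./hot                                       // run a representative workload
              //   g++ -O3 -fprofile-use hot.cpp -o hot        // rebuild using the recorded profile
              #include <cstdio>

              int main() {
                  long sum = 0;
                  for (int i = 0; i < 100000000; ++i)   // the profiled "workload"
                      sum += (i % 7 == 0) ? 3 : 1;      // a biased branch PGO can record
                  std::printf("%ld\n", sum);
                  return 0;
              }

              The profile only helps if the training run resembles real use, which is the usual caveat with PGO.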

              Having said that, straight-up -O3 will most likely still beat -O2 in the vast majority of cases. Phoronix recently compared -O2 against -O3 on GCC 11 and Clang/LLVM 12 here: https://www.phoronix.com/scan.php?pa...12-gcc11&num=1

              The overall results showed a clear gain for -O3 over -O2 for both GCC and Clang/LLVM on both AMD and Intel hardware, though admittedly that test also added -march=native to the -O3 runs, which will skew the result to some degree.



              • #17
                Originally posted by Steffo View Post
                Holy sh*t! I saw it coming: In the long term, Clang/LLVM wins over GCC.
                Maybe, maybe not (only time will tell). Anytime a new optimization that looks like a big win lands in one of the compilers, the other team thinks about how to take those ideas and do something even better (both teams stand upon the shoulders of the other, which is a complex aerobatic maneuver, sort of like Escher's drawings of endless stairs). The friendly competition will result in both a better GCC and a better LLVM.



                • #18
                  Really interesting evaluation, thanks. This assumes, however, that the developer doesn't change the default flags for his/her specific application.
                  For performance-sensitive applications, developers usually spend some time testing different flags. So another interesting question would be: what is the best performance that can be achieved AFTER some compiler flag mining?

