
LLVM Clang Shows Off Great Performance Advantage On NVIDIA GH200's Neoverse-V2 Cores


  • LLVM Clang Shows Off Great Performance Advantage On NVIDIA GH200's Neoverse-V2 Cores

    Phoronix: LLVM Clang Shows Off Great Performance Advantage On NVIDIA GH200's Neoverse-V2 Cores

    With my recent NVIDIA GH200 Grace CPU benchmarks carried out remotely via GPTshop.ai, besides looking at areas like the 64K kernel page size performance benefits I also ran some fresh benchmarks looking at the performance difference when the binaries were generated by LLVM Clang rather than the default GCC compiler on Ubuntu Linux. This article shows off the performance difference for the 72-core Neoverse-V2 server/HPC processor when leveraging LLVM Clang rather than the GNU Compiler Collection.


  • #2
    Amazing

    Are there any full Clang distros apart from Gentoo, OpenMandriva & Serpent OS?



    • #3
      Originally posted by Kjell View Post
      Amazing

      Are there any full Clang distros apart from Gentoo, OpenMandriva & Serpent OS?
      Public distros, not so much. Internal distributions built with Clang are strongly rumored to exist among the hypervisor crowd (where even a 1% improvement can mean tens of millions of dollars in savings). Of course, they also use other optimizations (PGO, BOLT, etc.).



      • #4
        I wonder if Clang's performance improvements over the last couple of years are one of the reasons why the vast majority of browsers are built with Clang these days. AFAIK, the only browser still built with GCC is Firefox (and only the distro packages; official Mozilla builds for Linux are built with Clang). Also, is it even still possible to build Chromium with GCC?



        • #5
          Originally posted by CommunityMember View Post

          where even a 1% improvement can mean tens of millions of dollars of savings.
          So 8.6% is a big deal then. Compiler magic seems to be the main reason Intel's Clear Linux outperforms everything else. Clear Linux for Arm would be a thing.



          • #6
            If I were the developer of any of the programs that showed a significant difference between GCC and Clang, I would first confirm the results by compiling my program with each compiler and running the tests myself. If the differences were repeatable, I would check what the Clang compiler was doing to my code and use that as a guide to optimize my program, so that its performance was consistent no matter which compiler was used.

            There is no excuse why GraphicsMagick Sharpen should be more than twice as fast with a simple recompile.

            This shows either lazy coding or incompetent coding, and I don't know which is worse.



            • #7
              Originally posted by sophisticles View Post
              There is no excuse why GraphicsMagick Sharpen should be more than twice as fast with a simple recompile.

              This shows either lazy coding or incompetent coding, and I don't know which is worse.
              Or it's a gcc-13 issue in this case (missed vectorization perhaps). Maybe Michael has time to redo this test with clang-18 and gcc-14.
              Last edited by mlau; 19 March 2024, 02:59 AM.



              • #8
                Originally posted by sophisticles View Post
                There is no excuse why GraphicsMagick Sharpen should be more than twice as fast with a simple recompile.

                This shows either lazy coding or incompetent coding, and I don't know which is worse.
                Good autovectorization isn't easy, but it can deliver such substantial improvements. It could be that GCC is just being more cautious with some of its optimizations, but it would perhaps perform similarly with PGO or when coaxed via additional command-line options.

                As you say, the only way to know is probably to dig in and see what's going on. However, it could be somewhat telling just to know how the code sizes compare. I'm also curious about compile times, for that matter.



                • #9
                  Originally posted by mlau View Post
                  Or it's a gcc-13 issue in this case (missed vectorization perhaps).
                  It could even be something like cache management, where you might not see much difference in single-threaded benchmarks but which really matters in a 72-core config.

                  I'd be interested to know just how dependent the performance differences are on SVE/SVE2. If you disabled vector math altogether, would LLVM still hold a commanding lead?



                  • #10
                    Originally posted by sophisticles View Post
                    This shows either lazy coding or incompetent coding and I don't know which us worse.
                    Not necessarily. Performance-optimized code is very often hard to read, sometimes borderline unreadable. The developers may have chosen to honor clarity instead of maximum performance. Not to mention that optimized code is also less portable.

                    To be honest, a 2x gain from switching compilers and/or optimization flags is an outlier. Most of the time it won't yield such a great performance improvement.

