GCC's Profile Guided Optimization Performance With The Ryzen 9 5950X

  • GCC's Profile Guided Optimization Performance With The Ryzen 9 5950X

    Phoronix: GCC's Profile Guided Optimization Performance With The Ryzen 9 5950X

    Given the talk in prior days around patches for PGO'ing the Linux kernel and some readers not being familiar with Profile Guided Optimizations by code compilers, here are some fresh benchmarks on a Ryzen 9 5950X looking at the benefits of applying PGO optimizations to various benchmarks...

    http://www.phoronix.com/scan.php?pag...nchmarks-5950X

  • #2
    Any chance you can benchmark LLVM Propeller? https://github.com/google/llvm-propeller


    • #3
      Interesting results. Nice how everything seemed to gain performance, even if it wasn't by much. In the world of mitigations and security every little bit helps.

      Michael, about how long did it take you to generate the profiled kernel? If that was in the article, I missed it.


      • #4
        What a disappointing and dreadful benchmark. It's the year 2021 and Phoronix is testing PGO on its own ... It's yet another opportunity missed to show the gains when combining LTO with PGO.

        LTO is about optimising code on the global scale, but on its own needs to guess how code pieces interact with one another. PGO can deliver this information. Not only does PGO help small optimisations to yield better results as was shown here and has been known to do so for many years, but PGO can tell LTO precisely how the code pieces interact and so enables the full potential of LTO.

        This is why LTO and PGO are currently a topic for the kernel. It's not about one or the other, it's about using both in combination. How much longer until Phoronix understands this and starts making interesting and memorable benchmarks, which reflect compiler development?


        • #5
          Originally posted by dirlewanger88
          I highly doubt it's being used to good effect in these benchmarks, especially since the article barely bothers to mention it.
          Why not? These are mostly narrow-purpose programs (compressor, chess engine, video encoder, neural network), so the codepaths will be very similar between different runs.

          And it's not as if you need 'the perfect training run' to gain benefits. When I do PGO builds of Blender, for example, I have a very small pool of render test files, yet I get good and often equal performance boosts for renders that are not representative of that pool.

          So even on a very complex 3D renderer (Cycles), you'll be able to draw overall performance benefits from a small sample pool, as long as that pool touches the most important codepaths during rendering, which I'd wager most semi-complex scenes do.

          Some people seem to think that 'Oh, I swerved left during the training run of this car game, that means it will only be optimized for swerving left'. These people don't understand how code works.

          In short, you will practically always gain overall performance with a PGO build, simply by running it through a typical workload, and this will most often still give you a gain, if smaller, with atypical workloads, because the codepaths very seldom vary that much.
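
          A minimal sketch of the workflow described above, assuming GCC. The file names `pgo_demo.c` and `driver.c`, the binary name `demo`, and the function itself are hypothetical placeholders, not taken from the article. The function has a strongly biased branch, which is the kind of information a training run records and the second compile can use.

```c
/* pgo_demo.c: a hypothetical hot loop whose branch bias depends on the input.
 *
 * Typical GCC PGO cycle (file and binary names are placeholders; driver.c is
 * an assumed small main() that reads input and calls count_unprintable):
 *   gcc -O2 -fprofile-generate pgo_demo.c driver.c -o demo   # instrumented build
 *   ./demo typical-input                                      # training run, writes .gcda files on exit
 *   gcc -O2 -fprofile-use pgo_demo.c driver.c -o demo        # rebuild using the recorded profile
 */
#include <assert.h>
#include <stddef.h>

/* Counts bytes that fall outside printable ASCII. For text-like input the
 * "printable" branch is taken almost every time. */
size_t count_unprintable(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x20 && buf[i] < 0x7f)
            continue;   /* hot path for text-like inputs */
        n++;            /* cold path */
    }
    return n;
}
```

          A different text file than the one used for training would still take the same hot branch most of the time, which is the point made above about small sample pools.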


          • #6
            I'd be interested in seeing some benchmarks showing whether there is a difference on newer hardware between x86-64-v1 and x86-64-v2 builds. With older compilers like GCC 10 or 9 (still relatively recent), it's easy to enable the applicable flags manually, but does it make a difference?
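
            A small sketch of how such a comparison could be set up, assuming GCC on x86-64. The file name `bits.c` and the function are hypothetical. As far as I know, `-march=x86-64-v2` is accepted from GCC 11 on, and the manual flag list in the comment follows the feature set the x86-64 psABI assigns to the v2 level; treat it as an approximation for GCC 9/10 rather than an exact equivalent.

```c
/* bits.c: hypothetical test function for comparing ISA levels.
 *
 * Build variants to compare (GCC):
 *   gcc -O2 -c bits.c                      # baseline x86-64
 *   gcc -O2 -march=x86-64-v2 -c bits.c     # GCC 11 and newer
 *   gcc -O2 -mcx16 -msahf -mpopcnt -msse3 -mssse3 -msse4.1 -msse4.2 -c bits.c
 *                                          # roughly the v2 feature set on GCC 9/10
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total number of set bits in an array of 64-bit words. With POPCNT enabled
 * the compiler can use the popcnt instruction for the builtin; without it,
 * it has to fall back to a slower generic sequence or library call. */
uint64_t popcount_words(const uint64_t *words, size_t n)
{
    assert(words != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}
```

            Whether the difference is measurable depends on how much of a given program actually hits instructions that only exist at the v2 level.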


            • #7
              I believe it makes a bigger difference at -O2, since PGO then basically applies -O3-style optimizations only to the hot paths.
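
              For illustration, GCC's `hot` and `cold` function attributes are a manual way to express the same hot/cold split that a profile provides automatically. This is only a sketch with hypothetical function names; per the GCC documentation, the `hot` attribute is ignored when profile feedback from `-fprofile-use` is available, because hot functions are then detected from the profile.

```c
#include <assert.h>
#include <stdio.h>

/* Rarely executed error path: 'cold' asks GCC to optimize it for size
 * and treat calls to it as unlikely. */
__attribute__((cold, noinline))
static int report_bad_digit(char c)
{
    fprintf(stderr, "unexpected character: %c\n", c);
    return -1;
}

/* Frequently executed path: 'hot' asks GCC to optimize it more aggressively. */
__attribute__((hot))
int parse_decimal(const char *s)
{
    assert(s != NULL);
    int value = 0;
    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return report_bad_digit(*s);   /* cold path */
        value = value * 10 + (*s - '0');   /* hot path */
    }
    return value;
}
```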


              • #8
                It is true that LTO and PGO combine well together, but they are also intended (and tested) to be useful separately.

                It takes a while to load but https://lnt.opensuse.org/db_default/..._report/tuning
                reports regular benchmarks with different optimization flags.

                With GCC 10 and Zen, the SPECint results look as follows:
                benchmark | -Ofast -march=native | -Ofast -march=native -flto | -Ofast -march=native -flto -fprofile-use | -Ofast -march=native -flto -fprofile-use -flto
                SPEC/SPEC2017/INT/total | 4.205 | 5.80% | 6.11% | 12.50%
                SPEC/SPEC2017/total | 6.096 | 4.42% | 4.97% | 9.61%
                SPEC/SPEC2017/FP/total | 8.110 | 3.37% | 4.11% | 7.44%

                The benefits of LTO and PGO depend heavily on the particular codebase being compiled. Bigger programs with more complicated control flow (such as parsers) benefit more than simple ones (such as programs that spend most of their time in matrix multiplication).
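
                To put the percentages above into absolute terms, here is a small sketch. It assumes that the first column is the baseline score and that the other columns are relative improvements over that baseline; the post does not state this explicitly, and the helper name is made up for illustration.

```c
#include <assert.h>

/* Score implied by a baseline score and a relative improvement in percent.
 * Assumption: the percentage is measured against the baseline column. */
double score_with_gain(double baseline, double gain_percent)
{
    assert(baseline > 0.0);
    return baseline * (1.0 + gain_percent / 100.0);
}
```

                Under that assumption, the SPECint baseline of 4.205 with a 12.50% gain corresponds to a score of about 4.73, and the FP baseline of 8.110 with a 7.44% gain to about 8.71.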


                • #9
                  This compares GCC-built Firefox with LTO+non-PGO to non-LTO+non-PGO (-O3 optimization level)

                  and this compares GCC-built Firefox with LTO+PGO to non-LTO+non-PGO (-O3 optimization level)

                  With the tp5o benchmarks (which I think give the most relevant number for rendering popular webpages) one gets a 3% improvement for LTO and 12% for LTO+PGO. Responsiveness goes up by 30% with LTO+PGO.

                  LTO reduces code size by 11% and LTO+PGO by 29%.
                  Last edited by hubicka; 18 January 2021, 08:09 AM.


                  • #10
                    It'd be useful if there was an actual guide (or a link to one) in the article on how y'all compiled the kernel with PGO.
