GCC 10 PGO Benchmarks On AMD Ryzen Threadripper 3960X + Ubuntu 19.10


  • GCC 10 PGO Benchmarks On AMD Ryzen Threadripper 3960X + Ubuntu 19.10

    Phoronix: GCC 10 PGO Benchmarks On AMD Ryzen Threadripper 3960X + Ubuntu 19.10

    For those looking for some fresh reference numbers on the impact of using GCC's Profile Guided Optimizations (PGO), here are some benchmark runs looking at the GCC 10 PGO performance on an Ubuntu 19.10 workstation built around the Ryzen Threadripper 3960X...

    http://www.phoronix.com/scan.php?pag...3960X-Xmas-Eve

  • #2
    Linux kernel benchmarks for gcc/clang?


    • #3
      Originally posted by pyler
      Linux kernel benchmarks for gcc/clang?
      https://www.phoronix.com/scan.php?pa...linux-53&num=1 I may do some more with 5.5 Git, but don't expect much in the way of changes.
      Michael Larabel
      http://www.michaellarabel.com/


      • #4
        Nice that there are no significant regressions. 13% is significant enough that distros should start to look at making PGO default.


        • #5
          Originally posted by wizard69
          Nice that there are no significant regressions. 13% is significant enough that distros should start to look at making PGO default.
          But will this help universally for all common CPUs?


          • #6
            Originally posted by oleid

            But will this help universally for all common CPUs?
            Most PGO optimizations are independent of the target CPU and will help in general. One thing to watch for is a program that contains different code paths for different hardware (such as SSE/AVX/AVX2 loops). If training is done on one CPU and the program is then run on a CPU with a different ISA, the code paths that were never exercised during training may end up optimized for size. It is not hard to exclude those parts of the program from profile feedback (and GCC 10 now implements -fprofile-partial-training for this).
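
As a rough illustration of the partial-training case, the sketch below trains on only one of two code paths and then rebuilds with -fprofile-partial-training. It assumes GCC 10 or newer on the PATH. The scratch directory, the file names (paths.c, paths_instr, paths_pgo) and the two functions standing in for per-ISA loop variants are all hypothetical.

```shell
# Hypothetical scratch locations; nothing here comes from a real project.
workdir=$(mktemp -d)
profdir="$workdir/profiles"
mkdir -p "$profdir"
cd "$workdir"

# Toy program with two alternative code paths selected at run time.
cat > paths.c <<'EOF'
#include <stdio.h>
#include <string.h>

/* Stand-ins for per-ISA variants of the same loop; both return the same sum. */
__attribute__((noinline)) static unsigned long sum_path_a(unsigned long n) {
    unsigned long s = 0;
    for (unsigned long i = 0; i < n; i++) s += i;
    return s;
}

__attribute__((noinline)) static unsigned long sum_path_b(unsigned long n) {
    unsigned long s = 0;
    for (unsigned long i = n; i > 0; i--) s += i - 1;
    return s;
}

int main(int argc, char **argv) {
    int use_b = argc > 1 && strcmp(argv[1], "b") == 0;
    printf("%lu\n", use_b ? sum_path_b(2000000UL) : sum_path_a(2000000UL));
    return 0;
}
EOF

# Instrumented build, then a training run that only ever takes path A.
gcc -O2 -fprofile-generate="$profdir" -c paths.c -o paths.o
gcc -fprofile-generate="$profdir" paths.o -o paths_instr
./paths_instr a > /dev/null
rm -f paths.o paths_instr

# Final build: functions never executed in training (sum_path_b here) are
# compiled as if no profile existed instead of being optimized for size.
gcc -O2 -fprofile-use="$profdir" -fprofile-partial-training -c paths.c -o paths.o
gcc paths.o -o paths_pgo

# The untrained path still runs and gives the same answer as the trained one.
./paths_pgo a
./paths_pgo b
```

Whether this makes a measurable difference depends on how much of the program the training run misses; the toy program only shows where the flag goes.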


            • #7
              PGO is more useful with bigger codebases that have complex control flow. The compiler implements static branch prediction, which is often good enough to guess the hot spots of simple benchmarks (for matrix multiplication and friends, you are pretty safe guessing that the innermost loop nest is the important spot).

              https://lnt.opensuse.org/db_default/...report/options compares a normal build to ones with PGO, LTO and LTO+PGO
              SPEC2006 on Zen (with GCC 10 trunk)

              Benchmark                 Normal   PGO      LTO     LTO+PGO
              SPEC/SPEC2006/INT/total   38.165   4.62%    ~       7.44%
              SPEC/SPEC2006/FP/total    60.327   -2.74%   2.92%   ~
              SPEC/SPEC2006/total       49.915   ~        2.45%   3.24%

              SPEC2017 on Zen (with GCC 10 trunk)

              Benchmark                 Normal   PGO      LTO     LTO+PGO
              SPEC/SPEC2017/INT/total   4.189    5.72%    6.89%   12.56%
              SPEC/SPEC2017/total       6.115    4.66%    4.67%   9.11%
              SPEC/SPEC2017/FP/total    8.181    3.85%    3.00%   6.53%


              • #8
                Originally posted by wizard69
                Nice that there are no significant regressions. 13% is significant enough that distros should start to look at making PGO default.
                To build with PGO you need to design a training run, and your build times will double. You also often lose the ability to do reproducible builds. I think it is unrealistic to build all packages with PGO (building everything with LTO is much easier), but it makes sense to use PGO on those where it matters (compilers, interpreters, databases, etc.).


                • #9
                  How to compile packages with PGO? I already have it enabled on GCC 9.2, but I don't have the slightest idea how to compile packages with it.


                  • #10
                    Originally posted by Mario Junior
                    How to compile packages with PGO?
                    On GCC, you first compile your program as normal, but with '-fprofile-generate' added to your CFLAGS/CXXFLAGS. After compilation you run the program, which will be slower than normal because '-fprofile-generate' inserts instrumentation that logs runtime information. Once you're done running the program (ideally touching all the important code paths), the gathered runtime data is saved as .gcda files. By default these are placed in the same directory as their corresponding .o files, but I would suggest placing them outside of the build tree. You can do this by adding a path to the generate option, like '-fprofile-generate=/mnt/foo', which will save the .gcda files to /mnt/foo.

                    Now it's time to compile again. Do a make clean first (this is the main reason I suggest placing the .gcda files outside of the build tree, because 'clean' will sometimes delete *.gcda files as well). This time you replace '-fprofile-generate' with '-fprofile-use', and again, if you've placed the .gcda files somewhere other than the default, use '-fprofile-use=/mnt/foo'. Also, if your program is multi-threaded, the gathered runtime data might have sync problems; if you get such errors when compiling, add the '-fprofile-correction' option.
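
The two GCC passes described above can be sketched as a shell session on a throwaway program. This is a minimal sketch rather than the poster's exact setup: it assumes gcc is on the PATH, and the scratch directory, the file names (demo.c, demo_instr, demo_pgo) and the profiles/ directory standing in for /mnt/foo are all hypothetical.

```shell
# Hypothetical scratch locations; a real project would use its own build tree.
workdir=$(mktemp -d)
profdir="$workdir/profiles"        # plays the role of /mnt/foo, outside the build tree
mkdir -p "$workdir/build" "$profdir"
cd "$workdir/build"

# Toy program with one rarely taken branch, just so there is something to profile.
cat > demo.c <<'EOF'
#include <stdio.h>

int main(void) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000; i++) {
        if (i % 1000 == 0)
            sum += 7;          /* rare path */
        else
            sum += i & 3;      /* common path */
    }
    printf("%lu\n", sum);
    return 0;
}
EOF

# Pass 1: instrumented build. The flag is given at link time as well so the
# profiling runtime gets linked in.
gcc -O2 -fprofile-generate="$profdir" -c demo.c -o demo.o
gcc -fprofile-generate="$profdir" demo.o -o demo_instr

# Training run: the .gcda files are written under $profdir when the program exits.
./demo_instr > /dev/null

# Stand-in for "make clean": the profile data survives because it lives elsewhere.
rm -f demo.o demo_instr

# Pass 2: rebuild the same object from the same directory using the profile.
# For multi-threaded programs, -fprofile-correction could be added here.
gcc -O2 -fprofile-use="$profdir" -c demo.c -o demo.o
gcc demo.o -o demo_pgo
./demo_pgo
```

For a package built with make, the same idea usually means passing the flags through CFLAGS/CXXFLAGS (and LDFLAGS for the instrumented pass) rather than calling gcc by hand.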

                    With Clang/LLVM it is almost the same: instead of '-fprofile-generate' you use '-fprofile-instr-generate', and subsequently '-fprofile-instr-use'. There is one added step, however: after you've done the first run and gathered runtime data, you need to convert it to another format before you can do the final compile. You do this with the 'llvm-profdata' command, for example 'llvm-profdata merge /mnt/foo/bar -output /mnt/foo/baz', and in the final compile you point to the converted data with '-fprofile-instr-use=/mnt/foo/baz'.
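
The Clang/LLVM variant, including the extra conversion step, might look like the sketch below. It assumes clang and a matching llvm-profdata are on the PATH (some distributions install llvm-profdata only with a version suffix). The scratch directory and the file names (demo.c, demo.profraw, demo.profdata) are hypothetical stand-ins for /mnt/foo/bar and /mnt/foo/baz.

```shell
# Hypothetical scratch locations.
workdir=$(mktemp -d)
profdir="$workdir/profiles"
mkdir -p "$profdir"
cd "$workdir"

# Toy program to profile.
cat > demo.c <<'EOF'
#include <stdio.h>

int main(void) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000; i++)
        sum += (i % 1000 == 0) ? 7 : (i & 3);
    printf("%lu\n", sum);
    return 0;
}
EOF

# Pass 1: instrumented build; the raw profile is written to the given file.
clang -O2 -fprofile-instr-generate="$profdir/demo.profraw" demo.c -o demo_instr

# Training run produces the raw profile on exit.
./demo_instr > /dev/null

# Extra step: convert the raw profile into the indexed format clang can read.
llvm-profdata merge -output="$profdir/demo.profdata" "$profdir/demo.profraw"

# Pass 2: optimized build using the converted profile.
clang -O2 -fprofile-instr-use="$profdir/demo.profdata" demo.c -o demo_pgo
./demo_pgo
```

Several raw profiles from different training runs can be passed to a single 'llvm-profdata merge' call, which is where the subcommand gets its name.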

                    As for the results, GCC delivers a bit better performance with PGO than Clang/LLVM for me, but that typically holds true without PGO as well. For multithreaded programs, the Clang/LLVM profiling run is much slower than with GCC, but on the other hand it doesn't seem to produce any errors, so Clang/LLVM doesn't need a '-fprofile-correction' option.
