GCC 9 Compiler Tuning Benchmarks On Intel Skylake AVX-512


  • #11
    Nowadays I compile most apps with
    Code:
    -O2 -march=native -flto
    - this seems like the best combo for maximum performance.

    Comment


    • #12
      If you are an application developer and want to use LTO without suffering from the gigantic link times, you can help it a bit by using clang with thinlto + caching, fyi.

      Comment


      • #13
        Originally posted by Herem View Post
        Michael I was wondering do the geometric mean results include the compilation time results or only the application performance results?
        It includes all benchmark results. (Of course, for lower-is-better results the numbers are inverted, so that all data points are higher-is-better (HIB) and the geometric mean makes sense.)
        Michael Larabel
        https://www.michaellarabel.com/
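The inversion described above can be sketched in a few lines. This is only an illustration of the idea, not the code used to produce the article's charts: the helper names and the sample numbers are made up, and it assumes "inverted" means taking the reciprocal of each lower-is-better value.

```python
import math

def to_higher_is_better(value, lower_is_better):
    """Return the reciprocal of lower-is-better values (assumed meaning of 'inverted')."""
    return 1.0 / value if lower_is_better else value

def geometric_mean(values):
    """Geometric mean computed via logarithms to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical results: (name, value, lower_is_better)
results = [
    ("throughput_mb_s", 200.0, False),  # higher is better
    ("runtime_s", 4.0, True),           # lower is better
    ("compile_time_s", 50.0, True),     # lower is better
]

hib_values = [to_higher_is_better(v, lib) for _, v, lib in results]
overall = geometric_mean(hib_values)
```

Because the geometric mean is multiplicative, each result shifts the mean by its own ratio, which is why a compile-time regression can offset a runtime gain in the combined figure.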

        Comment


        • #14
          Originally posted by Michael View Post
          It includes all benchmark results. (Of course, on lower is better results, the numbers are inverted so all data points are HIB for the geometric mean to make sense.)
          It is to be expected that it will take longer to perform more in-depth optimisation. By including the compilation time results it is disguising the benefits which are being realised from these optimisations.

          Can the compilation timing results be excluded from the geometric mean or maybe create two mean charts, one including compilation time and the other without?

          Comment


          • #15
            Originally posted by Michael View Post
            It includes all benchmark results. (Of course, on lower is better results, the numbers are inverted so all data points are HIB for the geometric mean to make sense.)
            Michael,
            I think it would be more meaningful to do compile time and performance separately. Those are not really metrics that can be combined easily.

            I also wonder, does the compile-time benchmark include parallel make? With LTO it would make sense to use -flto=<parallelism> so that the link-time optimization runs in parallel as well. Otherwise you are comparing single-threaded optimization against multi-threaded compilation. The number passed to -flto has no effect on the generated code.
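As a rough sketch of the suggestion above, a build script could derive the -flto job count from the machine's CPU count. The helper name `lto_flags` is hypothetical, and the -O2 and -march=native flags are simply borrowed from earlier in this thread.

```python
import os

def lto_flags(jobs=None):
    """Build a GCC flag list with parallel LTO (hypothetical helper).

    jobs: number of parallel LTO jobs; defaults to the CPU count.
    """
    if jobs is None:
        jobs = os.cpu_count() or 1  # cpu_count() may return None
    return ["-O2", "-march=native", f"-flto={jobs}"]

# Example: the flags for a 16-thread machine
flags_16 = lto_flags(16)
```

The same list would be passed at both compile and link time, since LTO needs the flag in both steps.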

            Comment


            • #16
              Originally posted by AsuMagic View Post
              If you are an application developer and want to use LTO without suffering from the gigantic link times, you can help it a bit by using clang with thinlto + caching, fyi.
              GCC has -flto=n that performs similarly to clang (it seems to be faster for large programs)

              Comment


              • #17
                Originally posted by hubicka View Post

                GCC has -flto=n that performs similarly to clang (it seems to be faster for large programs)
                Missed the reference to caching. Indeed, on the GCC side caching is not implemented yet, so with LTO one needs to rebuild the whole binary each time. That takes about 8 minutes for Firefox with -flto=16 on my 16-thread Bulldozer (compared to about 50 minutes with plain -flto). Caching is something I plan to look into for GCC 10: GCC already streams the translation units for parallel compilation, so adding a cache there is just a matter of tooling and a bit of work on stabilizing the partitioning algorithm so that it produces similar results after small source changes. That should get Firefox re-linking down to about a minute.

                Clang's ThinLTO has a somewhat different design from GCC's LTO: it trades more code quality for re-linking time improvements. Something similar is also possible on the GCC side (a two-level LTO), but I am not sure how useful it would be.

                I will check why FFTW seems to run faster without LTO. Generally, small stand-alone benchmarks like SciMark do not test LTO very realistically, because the whole hot loop is in one translation unit anyway, so LTO is not very useful there.

                Comment


                • #18
                  Originally posted by -MacNuke- View Post
                  -Ofast introduces unsafe math operations and can break applications and results. It is not a good candidate for comparison at all.
                  And floating point code is absolutely trash without -Ofast. FFTW doesn't count because it uses auto-generated (assembly) code.

                  Comment


                  • #19
                    Originally posted by -MacNuke- View Post

                    -Ofast introduces unsafe math operations and can break applications and results. It is not a good candidate for comparison at all.
                    I've been using -Ofast for a while now and I have yet to come across an app that's broken...
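For context on the "unsafe math" point quoted above: -Ofast enables -ffast-math, which allows the compiler to reassociate floating-point operations. Python never does this on its own, so the sketch below reorders the additions by hand, purely to show that the two groupings can give different answers; the values are chosen for illustration.

```python
# Floating-point addition is not associative: regrouping can change the result.
a, b, c = 1e20, -1e20, 1.0

left = (a + b) + c    # a and b cancel first, then c is added
right = a + (b + c)   # c is absorbed into b (too small to matter), then a cancels b

differs = left != right
```

Whether an application is affected depends on whether it relies on such ordering (or on NaN/infinity handling), which may explain why many programs appear to work fine with -Ofast.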

                    Comment


                    • #20
                      Originally posted by hubicka View Post

                      GCC has -flto=n that performs similarly to clang (it seems to be faster for large programs)
                      If this doesn't have caching, ThinLTO will still stomp GCC for incremental builds.

                      Comment
