GCC 10 Link-Time Optimization Benchmarks On AMD Threadripper


  • #21
    I would be really interested in seeing all these benchmarks done with -O2 instead of -O3, because I typically use -O2 and in some instances -O3 has proved slower.



    • #22
      Well, basically LTO does two things:
      1) It enables cross-module optimization. This applies only to benchmarks that actually contain more than one source file that matters.
      In this test I guess that is only PostgreSQL.
      2) Via the linker plugin it tells the compiler which symbols are used externally. So it does the same as -fwhole-program but is not limited to one source file.
      This is beneficial mostly when privatizing symbols enables more inlining or localization. This happens only for more complex codebases.
      So to see what LTO does you really need a different set of benchmarks than the one presented here.

      In addition, for compile time one wants to parallelize with -flto=n (or -flto=auto for GCC 10).

      Honza
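
For readers who script their builds, the flag choice described above can be sketched as follows. This is only an illustration: the helper name `lto_flag` and the fallback to the local CPU count are assumptions, not something from the post.

```python
import os


def lto_flag(gcc_major, jobs=None):
    """Pick a -flto flag that parallelizes the link-time step.

    Hypothetical helper: GCC 10 and later accept -flto=auto,
    older releases need an explicit job count (-flto=n).
    """
    if gcc_major >= 10:
        return "-flto=auto"
    if jobs is None:
        # Assumption: default to the number of CPUs seen by Python.
        jobs = os.cpu_count() or 1
    return f"-flto={jobs}"


print(lto_flag(10))          # -flto=auto
print(lto_flag(9, jobs=8))   # -flto=8
```

The resulting string would be appended to both the compile and the link command lines.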



      • #23
        BTW, the Mesa packages in my oibaf PPA are built with LTO. I noticed a reduction in binary size (5% and 10% for the classic and Gallium drivers, respectively), but the performance increase was barely noticeable.



        • #24
          Just grabbing some data from our auto-testers: https://lnt.opensuse.org/db_default/...report/options
          This table compares -Ofast, -Ofast + profile feedback, -Ofast -flto and -Ofast -flto + profile feedback on Zen hardware, SPEC2017:

          Benchmark                  -Ofast   -Ofast + profile   -Ofast -flto   -Ofast -flto + profile
          SPEC/SPEC2017/INT/total    4.179    5.64%              7.12%          12.81%
          SPEC/SPEC2017/total        6.109    4.63%              4.86%          9.41%
          SPEC/SPEC2017/FP/total     8.180    3.86%              3.14%          6.86%



          • #25
            Originally posted by Grinch

            Yes that would be an interesting benchmark,
            #14

            Any idea how to get '-flto=jobserver' (-flto=8 in my case) running with 'meson'?

            There's also Mesa support for PGO (profile-guided optimization), which in my experience is typically a more impactful optimization. The option is -Db_pgo= and its values are off/generate/use. Perhaps something for Michael to try out when he does a new PGO benchmark.
            I'll try this with 'next' build...



            • #26
              Originally posted by hubicka
              Well, basically LTO does two things:
              1) It enables cross-module optimization. This applies only to benchmarks that actually contain more than one source file that matters.
              In this test I guess that is only PostgreSQL.
              2) Via the linker plugin it tells the compiler which symbols are used externally. So it does the same as -fwhole-program but is not limited to one source file.
              This is beneficial mostly when privatizing symbols enables more inlining or localization. This happens only for more complex codebases.
              So to see what LTO does you really need a different set of benchmarks than the one presented here.

              In addition, for compile time one wants to parallelize with -flto=n (or -flto=auto for GCC 10).

              Honza
              Hello Honza,

              any hints to mimic

              -flto=n (or -flto=auto for GCC 10)

              running GCC 9.2.1 20191209 on TW with meson?

              -Db_lto=number does not work.

              Some numbers for Mesa git @ TW (mostly NO speedup if any):

              -Db_lto=true
              -rwxr-xr-x 4 root root 8078368 8. Jan 23:34 libvdpau_r600.so.1.0.0
              -rwxr-xr-x 4 root root 8078368 8. Jan 23:34 libvdpau_radeonsi.so.1.0.0

              -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 kms_swrast_dri.so
              -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 r600_dri.so
              -rwxr-xr-x 4 root root 8078408 8. Jan 23:34 r600_drv_video.so
              -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 radeonsi_dri.so
              -rwxr-xr-x 4 root root 8078408 8. Jan 23:34 radeonsi_drv_video.so
              -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 swrast_dri.so

              without
              -rwxr-xr-x 2 root root 9525520 8. Jan 20:12 libvdpau_r600.so.1.0.0
              -rwxr-xr-x 2 root root 9525520 8. Jan 20:12 libvdpau_radeonsi.so.1.0.0

              -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 kms_swrast_dri.so
              -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 r600_dri.so
              -rwxr-xr-x 2 root root 9505072 8. Jan 20:12 r600_drv_video.so
              -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 radeonsi_dri.so
              -rwxr-xr-x 2 root root 9505072 8. Jan 20:12 radeonsi_drv_video.so
              -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 swrast_dri.so

              Size reduction IS fine.
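
To put a number on the listings above, here is a small sketch that computes the relative size reduction. The byte counts are copied from the two `ls` outputs; one file per distinct size is used, since the other entries have identical sizes. The function name `reduction_percent` is just an illustrative choice.

```python
# (size without LTO, size with -Db_lto=true), in bytes, from the listings above
sizes = {
    "libvdpau_r600.so.1.0.0": (9525520, 8078368),
    "r600_dri.so": (18448288, 16874272),
    "r600_drv_video.so": (9505072, 8078408),
}


def reduction_percent(without_lto, with_lto):
    """Relative size reduction of the LTO build, in percent."""
    return 100.0 * (without_lto - with_lto) / without_lto


for name, (plain, lto) in sizes.items():
    print(f"{name}: {reduction_percent(plain, lto):.1f}% smaller with LTO")
```

That works out to roughly 15% for the VDPAU and video driver libraries and about 8.5% for the DRI drivers.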



              • #27
                Originally posted by nuetzel

                #14

                Any idea how to get '-flto=jobserver' (-flto=8 in my case) running with 'meson'?



                I'll try this with 'next' build...
                You usually just need to give the compiler invocation the right prefix so that make launches it with a jobserver pipe (in a Makefile, a leading '+' on the recipe line). If Meson generates ninja files, I doubt it will work.
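
If the jobserver route is not available under ninja, one possible workaround is to skip the jobserver and pass a fixed job count through Meson's per-language argument options (`c_args`, `c_link_args`, `cpp_args`, `cpp_link_args`). The sketch below only assembles such a command line; it is untested against Mesa, and the helper name `meson_lto_args` and the build directory name `build` are assumptions.

```python
def meson_lto_args(jobs, languages=("c", "cpp")):
    """Build Meson -D options that pass -flto=<jobs> to compile and link steps.

    Hypothetical helper: the flag is given for both the compile and the
    link arguments of every listed language.
    """
    flag = f"-flto={jobs}"
    args = []
    for lang in languages:
        args.append(f"-D{lang}_args={flag}")
        args.append(f"-D{lang}_link_args={flag}")
    return args


# Assemble the configure command for 8 parallel LTO jobs.
print(" ".join(["meson", "setup", "build"] + meson_lto_args(8)))
```

Whether this interacts cleanly with -Db_lto=true, which adds its own -flto, would need to be checked on the actual build.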



                • #28
                  What about binary sizes? The whole point of LTO is, IIRC, that it eliminates unused code or so. While that can eventually translate into better performance, e.g. through improved CPU cache use, it is still about binary sizes.

                  P.S. On a side note, when you need something most, you can be sure you won't get it. On microcontrollers LTO is needed most, yet it has proven to optimize out ... the whole firmware?! Gah, I think it is a bit too aggressive; it even seems to ignore explicit __attribute__((used)) hints, dammit. Does anyone know how to get this right in a "runtimeless" environment, where you may need it most thanks to limited flash ROM sizes, etc.? It works like a charm on the "PC" side, shaving about 25% off the binary size of a (fairly large) ~5Mb program.
                  Last edited by SystemCrasher; 27 February 2020, 08:22 AM.
