GCC 10 Link-Time Optimization Benchmarks On AMD Threadripper


  • SystemCrasher
    replied
    What about binary sizes? IIRC the whole point of LTO is that it eliminates unused code. While that can eventually translate to better performance due to e.g. improved CPU cache use, it is still about binary sizes.

    P.S. On a side note, when you need something most, you don't get it for sure. On microcontrollers LTO is needed most, yet it has proven to optimize out ... the whole firmware?! Gah, I think it is a bit too aggressive; it even seems to ignore explicit __attribute__((used)) hints, dammit. Does anyone know how to get this thing right in a "runtimeless" environment, where you may need it most thanks to limited flash ROM sizes, etc.? It works like a charm on the "PC" side, shaving off about 25% of binary size for a (fairly large) ~5 MB program.
    Last edited by SystemCrasher; 27 February 2020, 08:22 AM.

  • carewolf
    replied
    Originally posted by nuetzel View Post

    #14

    Any idea how to get '-flto=jobserver' (-flto=8 in my case) running with 'meson'?



    I'll try this with 'next' build...
    You usually just need to make sure the compiler is launched with the right prefix, so that make starts it with a jobserver pipe. If Meson generates ninja files, I doubt it will work.

  • nuetzel
    replied
    Originally posted by hubicka View Post
    Well, basically LTO does two things:
    1) It enables cross-module optimization. This applies only to benchmarks that actually contain more than one source file that matters.
    In this test I guess it is only PostgreSQL.
    2) Via the linker plugin it tells the compiler which symbols are used externally. So it does the same as -fwhole-program but is not limited to one source file.
    This is beneficial mostly when privatizing symbols enables more inlining or localization. This happens only for more complex codebases.
    So to see what LTO does you really need a different set of benchmarks than presented here.

    In addition, for compile time one wants to parallelize with -flto=n (or -flto=auto for GCC 10).

    Honza
    Hello Honza,

    any hints to mimic

    -flto=n (or -flto=auto for GCC 10)

    running GCC 9.2.1 20191209 on TW with meson?

    -Db_lto=number does not work.

    Some numbers for Mesa git @ TW (mostly NO speedup if any):

    -Db_lto=true
    -rwxr-xr-x 4 root root 8078368 8. Jan 23:34 libvdpau_r600.so.1.0.0
    -rwxr-xr-x 4 root root 8078368 8. Jan 23:34 libvdpau_radeonsi.so.1.0.0

    -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 kms_swrast_dri.so
    -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 r600_dri.so
    -rwxr-xr-x 4 root root 8078408 8. Jan 23:34 r600_drv_video.so
    -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 radeonsi_dri.so
    -rwxr-xr-x 4 root root 8078408 8. Jan 23:34 radeonsi_drv_video.so
    -rwxr-xr-x 8 root root 16874272 8. Jan 23:34 swrast_dri.so

    without
    -rwxr-xr-x 2 root root 9525520 8. Jan 20:12 libvdpau_r600.so.1.0.0
    -rwxr-xr-x 2 root root 9525520 8. Jan 20:12 libvdpau_radeonsi.so.1.0.0

    -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 kms_swrast_dri.so
    -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 r600_dri.so
    -rwxr-xr-x 2 root root 9505072 8. Jan 20:12 r600_drv_video.so
    -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 radeonsi_dri.so
    -rwxr-xr-x 2 root root 9505072 8. Jan 20:12 radeonsi_drv_video.so
    -rwxr-xr-x 4 root root 18448288 8. Jan 20:12 swrast_dri.so

    Size reduction IS fine.
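To put a number on "fine", here is a rough sketch of the arithmetic using the byte sizes from the two listings above. The helper name size_reduction is made up for this example, and it assumes awk is available.

```shell
# Sizes in bytes are copied from the listings: first without LTO, then with -Db_lto=true.
size_reduction() {
    # $1 = size without LTO, $2 = size with LTO; prints the saving in percent.
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

vdpau=$(size_reduction 9525520 8078368)      # libvdpau_r600.so.1.0.0
dri=$(size_reduction 18448288 16874272)      # r600_dri.so
drv_video=$(size_reduction 9505072 8078408)  # r600_drv_video.so
echo "libvdpau: ${vdpau}%  dri: ${dri}%  drv_video: ${drv_video}%"
```

That works out to roughly 15% for the VDPAU and video driver libraries and about 8.5% for the DRI drivers.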

  • nuetzel
    replied
    Originally posted by Grinch View Post

    Yes that would be an interesting benchmark,
    #14

    Any idea how to get '-flto=jobserver' (-flto=8 in my case) running with 'meson'?

    there's also Mesa support for PGO (profile guided optimization) which in my experience is typically a more impactful optimization. The variable is -Db_pgo= and the parameters are off/generate/use. Perhaps something for Michael to try out when he does a new PGO benchmark.
    I'll try this with 'next' build...

  • hubicka
    replied
    Just grabbing some data from our auto-testers: https://lnt.opensuse.org/db_default/...report/options
    This table compares -Ofast, -Ofast + profile feedback, -Ofast -flto, and -Ofast -flto + profile feedback on Zen hardware, SPEC2017:

    Benchmark                  -Ofast   -Ofast + profile feedback   -Ofast -flto   -Ofast -flto + profile feedback
    SPEC/SPEC2017/INT/total    4.179    5.64%                       7.12%          12.81%
    SPEC/SPEC2017/total        6.109    4.63%                       4.86%          9.41%
    SPEC/SPEC2017/FP/total     8.180    3.86%                       3.14%          6.86%

  • oibaf
    replied
    BTW, the Mesa packages in my oibaf PPA are built with LTO. I noticed a reduction in binary size (5% and 10% with classic and gallium drivers), but the performance increase was barely noticeable.

  • hubicka
    replied
    Well, basically LTO does two things:
    1) It enables cross-module optimization. This applies only to benchmarks that actually contain more than one source file that matters.
    In this test I guess it is only PostgreSQL.
    2) Via the linker plugin it tells the compiler which symbols are used externally. So it does the same as -fwhole-program but is not limited to one source file.
    This is beneficial mostly when privatizing symbols enables more inlining or localization. This happens only for more complex codebases.
    So to see what LTO does you really need a different set of benchmarks than presented here.

    In addition, for compile time one wants to parallelize with -flto=n (or -flto=auto for GCC 10).

    Honza

  • xception
    replied
    I would be really interested in seeing these benchmarks all done with -O2 instead of -O3, because I typically use -O2 and in some instances -O3 proved slower.

  • Grinch
    replied
    Originally posted by archsway View Post
    Mesa seems to benefit quite a bit from LTO.

    Add this to the meson command:

    Code:
    -Db_lto=true
    Yes that would be an interesting benchmark, there's also Mesa support for PGO (profile guided optimization) which in my experience is typically a more impactful optimization. The variable is -Db_pgo= and the parameters are off/generate/use. Perhaps something for Michael to try out when he does a new PGO benchmark.
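Roughly how the generate/use cycle could look with Meson, as a sketch under assumptions: the function name pgo_build and its three arguments (source dir, build dir, training command) are placeholders invented here, and the training workload is whatever you consider representative.

```shell
# Sketch of a two-pass PGO build with Meson. All three arguments are
# placeholders supplied by the caller: source dir, build dir, training command.
pgo_build() {
    src_dir=$1
    build_dir=$2
    training_cmd=$3

    # Pass 1: instrumented build that records profile data when run.
    meson setup "$build_dir" "$src_dir" -Db_lto=true -Db_pgo=generate || return 1
    ninja -C "$build_dir" || return 1

    # Run a representative workload so the profile files get written.
    sh -c "$training_cmd" || return 1

    # Pass 2: rebuild in the same build dir using the collected profile.
    meson configure "$build_dir" -Db_pgo=use || return 1
    ninja -C "$build_dir"
}
```

Defining the function runs nothing; a call would look like pgo_build ./mesa ./build "some-benchmark-command", with all three values being your own.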

  • carewolf
    replied
    Originally posted by set135
    This is what I have been using on Gentoo for many years, for all but a few packages:
    CFLAGS=-march=native -O2 -pipe -fno-stack-protector -flto=4 -fuse-linker-plugin
    CXXFLAGS=$CFLAGS
    LDFLAGS=-Wl,-flto=4 $CFLAGS

    My goal was primarily to reduce executable size, and just to see how it works, so it is interesting to see some benchmarks.
    Why pass -flto to the linker? Just link with gcc/g++, and let it deal with the command line. Also -fuse-linker-plugin is redundant. But yes, that will improve binary size greatly, even if you would need -O3 to get the performance benefits of -flto.
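Following that reply, the quoted settings could be trimmed to something like the sketch below. This is just the same flags with the -Wl, wrapper and -fuse-linker-plugin dropped, on the assumption that linking goes through gcc/g++; it is not a tested Gentoo configuration.

```shell
# Same flags as in the quoted post, minus the pieces called out as unnecessary.
CFLAGS="-march=native -O2 -pipe -fno-stack-protector -flto=4"
CXXFLAGS="$CFLAGS"
# No -Wl, wrapper: the gcc/g++ driver itself consumes -flto at link time.
LDFLAGS="$CFLAGS"
```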
