GCC 10 Link-Time Optimization Benchmarks On AMD Threadripper

SystemCrasher replied

27 February 2020, 08:19 AM
What about binary sizes? IIRC, the whole point of LTO is that it eliminates unused code. While that can translate into better performance, e.g. through improved CPU cache use, it is still primarily about binary size.

P.S. On a side note, when you need something most, you can be sure you won't get it. On microcontrollers LTO is needed most, yet it has proven to optimize out ... the whole firmware?! Gah, I think it is a bit too aggressive; it even seems to ignore explicit __attribute__((used)) hints, dammit. Does anyone know how to get this right in a "runtimeless" environment, where you may need it most thanks to limited flash ROM sizes, etc.? It works like a charm on the "PC" side, shaving about 25% off the binary size of a (fairly large) ~5Mb program.

Last edited by SystemCrasher; 27 February 2020, 08:22 AM.
carewolf replied

13 January 2020, 04:30 AM
Originally posted by nuetzel

Any idea how to get '-flto=jobserver' (-flto=8 in my case) running with 'meson'?

I'll try this with 'next' build...

You usually just need to give the compiler command the right prefix so that make launches it with a jobserver pipe. If the build system generates ninja files, I doubt it will work.
Likes 1
nuetzel replied

08 January 2020, 06:51 PM
Originally posted by hubicka

Well, basically LTO does two things:
1) It enables cross-module optimization. This applies only to benchmarks that actually contain more than one source file that matters.
In this test I guess that is only PostgreSQL.
2) Via the linker plugin it tells the compiler which symbols are used externally. So it does the same as -fwhole-program but is not limited to one source file.
This is beneficial mostly when privatizing symbols enables more inlining or localization. This happens only for more complex codebases.
So to see what LTO does you really need a different set of benchmarks than presented here.

In addition, for compile time one wants to parallelize with -flto=n (or -flto=auto for GCC 10).

Honza

Hello Honza,

any hints to mimic

-flto=n (or -flto=auto for GCC 10)

running GCC 9.2.1 20191209 on TW with meson?

-Db_lto=<number> does not work.
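
One possible workaround, sketched under assumptions and not tested against Mesa: since b_lto is a plain true/false option, the job count can be handed to GCC directly through Meson's per-language argument options (c_args, cpp_args, c_link_args, cpp_link_args) while b_lto stays at its default. The build directory name build-lto and the job count of 8 are placeholders, and static libraries may additionally need gcc-ar as the archiver. The snippet only prints the command so it can be reviewed first.

```shell
# Assumption: 8 parallel LTO jobs, matching the -flto=8 mentioned in this thread.
LTO_JOBS=8
LTO_FLAG="-flto=${LTO_JOBS}"

# "build-lto" is a hypothetical build directory name.
# c_args/cpp_args apply at compile time, c_link_args/cpp_link_args at link time.
MESON_CMD="meson setup build-lto \
-Dc_args=${LTO_FLAG} -Dcpp_args=${LTO_FLAG} \
-Dc_link_args=${LTO_FLAG} -Dcpp_link_args=${LTO_FLAG}"

# Print the command for review; run it from the source tree once it looks right.
echo "${MESON_CMD}"
```

Newer Meson releases reportedly add a dedicated b_lto_threads option, which would make this detour unnecessary; treat that as something to check against the Meson documentation for your version.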

Some numbers for Mesa git @ TW (mostly NO speedup if any):

-Db_lto=true
-rwxr-xr-x 4 root root 8078368 8. Jan 23:34 libvdpau_r600.so.1.0.0
-rwxr-xr-x 4 root root 8078368 8. Jan 23:34 libvdpau_radeonsi.so.1.0.0

-rwxr-xr-x 8 root root 16874272 8. Jan 23:34 kms_swrast_dri.so
-rwxr-xr-x 8 root root 16874272 8. Jan 23:34 r600_dri.so
-rwxr-xr-x 4 root root 8078408 8. Jan 23:34 r600_drv_video.so
-rwxr-xr-x 8 root root 16874272 8. Jan 23:34 radeonsi_dri.so
-rwxr-xr-x 4 root root 8078408 8. Jan 23:34 radeonsi_drv_video.so
-rwxr-xr-x 8 root root 16874272 8. Jan 23:34 swrast_dri.so

without
-rwxr-xr-x 2 root root 9525520 8. Jan 20:12 libvdpau_r600.so.1.0.0
-rwxr-xr-x 2 root root 9525520 8. Jan 20:12 libvdpau_radeonsi.so.1.0.0

-rwxr-xr-x 4 root root 18448288 8. Jan 20:12 kms_swrast_dri.so
-rwxr-xr-x 4 root root 18448288 8. Jan 20:12 r600_dri.so
-rwxr-xr-x 2 root root 9505072 8. Jan 20:12 r600_drv_video.so
-rwxr-xr-x 4 root root 18448288 8. Jan 20:12 radeonsi_dri.so
-rwxr-xr-x 2 root root 9505072 8. Jan 20:12 radeonsi_drv_video.so
-rwxr-xr-x 4 root root 18448288 8. Jan 20:12 swrast_dri.so

Size reduction IS fine.
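
To put a number on "fine", the reduction can be worked out from the byte counts in the two listings. This is a small sketch, not part of the original post; reduction_pct is a hypothetical helper name, and files with identical sizes are listed once.

```shell
# Percentage by which a file shrank, given its size before and after (bytes).
reduction_pct() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.2f\n", (before - after) * 100 / before }'
}

# Sizes copied from the two listings above (without LTO, then with -Db_lto=true).
reduction_pct 9525520 8078368     # libvdpau_*.so.1.0.0
reduction_pct 18448288 16874272   # *_dri.so
reduction_pct 9505072 8078408     # *_drv_video.so
```

By this arithmetic the three groups shrink by roughly 15.2%, 8.5% and 15.0% respectively.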
nuetzel replied

08 January 2020, 06:25 PM
Originally posted by Grinch

Yes that would be an interesting benchmark,

Any idea how to get '-flto=jobserver' (-flto=8 in my case) running with 'meson'?

there's also MESA support for PGO (profile guided optimization) which in my experience is typically a more impactful optimization. The variable is -Db_pgo= and the parameters are off/generate/use. Perhaps something for Michael to try out when he does a new PGO benchmark.

I'll try this with 'next' build...
hubicka replied

08 January 2020, 01:06 PM
Just grabbing some data from our auto-testers: https://lnt.opensuse.org/db_default/...report/options
This table compares -Ofast, -Ofast + profile feedback, -Ofast -flto, and -Ofast -flto + profile feedback on Zen hardware, SPEC2017:

Benchmark                  -Ofast   -Ofast + profile feedback   -Ofast -flto   -Ofast -flto + profile feedback
SPEC/SPEC2017/INT/total    4.179    5.64%                       7.12%          12.81%
SPEC/SPEC2017/total        6.109    4.63%                       4.86%          9.41%
SPEC/SPEC2017/FP/total     8.180    3.86%                       3.14%          6.86%
Likes 2
oibaf replied

08 January 2020, 12:01 PM
BTW the Mesa packages in my oibaf PPA are built with LTO. I noticed a reduction in binary size (5% and 10% for the classic and gallium drivers, respectively), but the performance increase was barely noticeable.
Likes 1
hubicka replied

08 January 2020, 12:00 PM
Well, basically LTO does two things:
1) It enables cross-module optimization. This applies only to benchmarks that actually contain more than one source file that matters.
In this test I guess that is only PostgreSQL.
2) Via the linker plugin it tells the compiler which symbols are used externally. So it does the same as -fwhole-program but is not limited to one source file.
This is beneficial mostly when privatizing symbols enables more inlining or localization. This happens only for more complex codebases.
So to see what LTO does you really need a different set of benchmarks than presented here.

In addition, for compile time one wants to parallelize with -flto=n (or -flto=auto for GCC 10).

Honza
Likes 4
xception replied

08 January 2020, 10:53 AM
I would be really interested in seeing these benchmarks all done with -O2 instead of -O3, because I typically use -O2 and in some instances -O3 proved slower.
Grinch replied

08 January 2020, 05:49 AM
Originally posted by archsway

Mesa seems to benefit quite a bit from LTO.

Add this to the meson command:

Code:

-Db_lto=true

Yes, that would be an interesting benchmark. There's also Mesa support for PGO (profile-guided optimization), which in my experience is typically a more impactful optimization. The variable is -Db_pgo= and the parameters are off/generate/use. Perhaps something for Michael to try out when he does a new PGO benchmark.
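
For readers who have not used b_pgo, the usual two-pass flow can be sketched as below. This is an illustration under assumptions, not a tested Mesa recipe: the build directory build-pgo and the workload script ./run-workload.sh are placeholders, and the run and pgo_build helpers are hypothetical names. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
# Two-pass PGO build using Meson's b_pgo option (off/generate/use).
# DRY_RUN=1 prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

pgo_build() {
    builddir=$1
    shift
    run meson setup "$builddir" -Db_pgo=generate   # instrumented build
    run ninja -C "$builddir"
    run "$@"                                       # representative workload
    run meson configure "$builddir" -Db_pgo=use    # switch to using the profile
    run ninja -C "$builddir"
}

# Placeholder build directory and workload command.
pgo_build build-pgo ./run-workload.sh
```

The quality of the result depends on how representative the workload in the middle step is of real use.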
carewolf replied

08 January 2020, 04:28 AM
Originally posted by set135

This is what I have been using on Gentoo for many years, for all but a few packages:
CFLAGS=-march=native -O2 -pipe -fno-stack-protector -flto=4 -fuse-linker-plugin
CXXFLAGS=$CFLAGS
LDFLAGS=-Wl,-flto=4 $CFLAGS

My goal was primarily to reduce executable size, and just to see how it works, so it is interesting to see some benchmarks.

Why pass -flto to the linker? Just link with gcc/g++ and let it deal with the command line. Also, -fuse-linker-plugin is redundant. But yes, that will improve binary size greatly, even if you would need -O3 to get the performance benefits of -flto.
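
Applied to the quoted flags, those two suggestions would give something like the following. This is only a sketch of the advice above, written as shell variable assignments; it assumes the link step goes through the gcc/g++ driver, which is what makes the plain -flto=4 in LDFLAGS meaningful.

```shell
# set135's flags with the two suggestions applied:
# no -Wl,-flto=4 (the gcc/g++ driver handles -flto at link time) and
# no -fuse-linker-plugin (described above as redundant).
CFLAGS="-march=native -O2 -pipe -fno-stack-protector -flto=4"
CXXFLAGS="$CFLAGS"
LDFLAGS="$CFLAGS"
export CFLAGS CXXFLAGS LDFLAGS
```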