GCC's Profile Guided Optimization Performance With The Ryzen 9 5950X
Originally posted by dirlewanger88: The problem with PGO "benchmarks" like these is that "enabling" PGO isn't just a case of flipping a switch. The benefits of using it are entirely dependent on a good training run. I highly doubt it's being used to good effect in these benchmarks, especially since the article barely bothers to mention it.
I don't mind seeing PGO in its best light. Sort of like: "here's the upper bound of what you can expect from PGO."
- Likes 2
Comment
-
Originally posted by hubicka: It is true that LTO and PGO combine well together, but they are also intended (and tested) to be useful separately.
It takes a while to load but https://lnt.opensuse.org/db_default/..._report/tuning
reports regular benchmarks with different optimization flags.
Originally posted by hubicka: The benefits of LTO and PGO depend heavily on the particular codebase being compiled. Bigger programs with more complicated control flow (such as parsers) benefit more than simple ones (such as programs that spend most of their time in matrix multiplication).
Of course, if the loops have already been subject to manual unrolling and vectorization, there's much less room left for improvement, but I'm sure the majority of loops out there have not been brutalized in those ways. Even in hot-spots, simply knowing that compilers can now do unrolling makes me a lot less likely to do it by hand, since it's so error-prone and damages code-maintainability so severely.
Last edited by coder; 18 January 2021, 07:24 PM.
Comment
-
Originally posted by Grinch: Judging by your comment, you don't have a clue; actually, I doubt you've even used it.
Some benchmarks that I've seen produce exactly the same executable for different configurations and thus cannot differ in performance. Yet this does not get caught, and normal run-to-run deviation ends up reported as different results, which turns them into a sham.
Hence I only skim the headlines and have stopped reading most of it.
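One way to check for the "exact same executable" case is to hash the binaries from each configuration before benchmarking them. The following is a minimal sketch, not something the benchmark suite is known to do; `file_digest` and `same_binary` are hypothetical helper names, and the demo files are stand-ins for real build outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(path_a, path_b):
    """True if both files have byte-identical contents."""
    return file_digest(path_a) == file_digest(path_b)


if __name__ == "__main__":
    # Stand-in files; in practice these would be the two build outputs.
    tmp = Path(tempfile.mkdtemp())
    (tmp / "build_a").write_bytes(b"identical payload")
    (tmp / "build_b").write_bytes(b"identical payload")
    print(same_binary(tmp / "build_a", tmp / "build_b"))
```

Identical digests mean the two "configurations" cannot differ in performance. Different digests prove less, since builds can embed timestamps or build IDs that change the bytes without changing the generated code.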
- Likes 1
Comment
-
Originally posted by sdack: -fprofile-partial-training.
Also, it's a GCC-specific flag. For a benchmark suite that compares compilers, it makes perfect sense to stick to options that work across compilers. There's a reason this flag isn't the default: for most use cases of PGO, it's better to keep untouched code small. And again, for the benchmarks done here, I can't see how it would have an impact.
There are lots of more or less exotic/experimental options that could potentially improve performance. GCC has -fno-semantic-interposition, -fdevirtualize-at-ltrans, -fipa-pta, etc.; Clang/LLVM has -fexperimental-new-pass-manager, -fforce-emit-vtables, -fstrict-vtable-pointers, etc.
However, these are in my opinion outside the scope of these tests. There are reasons they are not automatically folded into the general optimization levels: they can backfire and/or take an extremely long time to run.
Originally posted by sdack: The benchmark quality here on Phoronix has been lacking for a while now and it's become hardly worth paying attention to the results.
I understand his situation as well. When he gets conflicting advice, who should he listen to? He also needs to keep this manageable; he can't add every option under the sun. Heck, I'm glad to see a benchmark suite like this even touch PGO, because although it is such a powerful optimization, it's also a hassle and sometimes a pain to configure correctly.
- Likes 3
Comment
-
Originally posted by Grinch: I don't see how this option would affect these benchmarks?
PGO doesn't just optimise code for one specific input data set. The optimised code can still be used for a range of sets. Take the video encoders as an example. One can train PGO with one video and still get performance gains while encoding many other videos with the same executable. This is how PGO is commonly used and this will also be the case for the Linux kernel.
The option -fprofile-partial-training can then make a difference, and I'm sure we will see it come up for the kernel as well. If this had been included in the article, it would at least have been a decent reiteration of PGO, but not even this was achieved.
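For readers who have not done a PGO build by hand, the "train with one video, encode many others" workflow can be sketched as three steps: an instrumented build, one or more training runs, and a rebuild that consumes the profile. The sketch below only assembles the GCC command lines; `pgo_build_steps`, `encoder.c`, `encoder` and `sample.y4m` are hypothetical names, and `-O2` is an assumed base level. The flags `-fprofile-generate`, `-fprofile-use` and `-fprofile-partial-training` are real GCC options (the last one needs GCC 10 or newer).

```python
import shlex


def pgo_build_steps(source, binary, training_inputs, partial_training=False):
    """Return the command lines for a three-step GCC PGO build (illustrative only)."""
    base = ["gcc", "-O2"]  # assumed base optimization level

    # Step 1: instrumented build; running it writes .gcda profile data.
    instrument = base + ["-fprofile-generate", source, "-o", binary]

    # Step 2: training runs on representative inputs.
    training = [[f"./{binary}", item] for item in training_inputs]

    # Step 3: rebuild using the collected profile.
    use = base + ["-fprofile-use"]
    if partial_training:
        # Code never reached in training is then optimized as if no profile
        # existed, instead of being optimized aggressively for size.
        use.append("-fprofile-partial-training")
    use += [source, "-o", binary]

    return [instrument, *training, use]


if __name__ == "__main__":
    steps = pgo_build_steps("encoder.c", "encoder", ["sample.y4m"],
                            partial_training=True)
    for command in steps:
        print(shlex.join(command))
```

Whether `-fprofile-partial-training` helps depends on how much of the program the training input misses: with good coverage it should change little, with narrow coverage it keeps the untrained paths from being shrunk for size.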
- Likes 2
Comment
-
Originally posted by hubicka: This compares GCC-built Firefox with LTO+non-PGO to non-LTO+non-PGO (-O3 optimization level)
and this compares GCC-built Firefox with LTO+PGO to non-LTO+non-PGO (-O3 optimization level)
With the tp5o benchmarks (which I think is the most relevant number for rendering popular webpages) one gets a 3% improvement for LTO, and 12% for LTO+PGO. Responsiveness goes up by 30% with LTO+PGO.
LTO reduces code size by 11% and LTO+PGO by 29%.
(my former Mesa git / radeonsi / r600 / etc / vdpau space numbers)
21bc16a723
normal
-rwxr-xr-x 4 root root 9525520 13. Jan 20:00 libvdpau_radeonsi.so.1.0.0
-rwxr-xr-x 4 root root 9525520 13. Jan 20:00 libvdpau_r600.so.1.0.0
-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 swrast_dri.so
-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 radeonsi_dri.so
-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 r600_dri.so
-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 kms_swrast_dri.so
-rwxr-xr-x 4 root root 9505072 13. Jan 20:00 radeonsi_drv_video.so
-rwxr-xr-x 4 root root 9505072 13. Jan 20:00 r600_drv_video.so
-Db_lto=true
-rwxr-xr-x 2 root root 8078368 13. Jan 21:24 libvdpau_r600.so.1.0.0
-rwxr-xr-x 2 root root 8078368 13. Jan 21:24 libvdpau_radeonsi.so.1.0.0
-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 kms_swrast_dri.so
-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 r600_dri.so
-rwxr-xr-x 2 root root 8074312 13. Jan 21:24 r600_drv_video.so
-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 radeonsi_dri.so
-rwxr-xr-x 2 root root 8074312 13. Jan 21:24 radeonsi_drv_video.so
-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 swrast_dri.so
-Db_lto=true -Db_pgo=use
-rwxr-xr-x 4 root root 5600328 14. Jan 00:11 libvdpau_radeonsi.so.1.0.0
-rwxr-xr-x 4 root root 5600328 14. Jan 00:11 libvdpau_r600.so.1.0.0
-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 swrast_dri.so
-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 radeonsi_dri.so
-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 r600_dri.so
-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 kms_swrast_dri.so
-rwxr-xr-x 4 root root 5567640 14. Jan 00:11 radeonsi_drv_video.so
-rwxr-xr-x 4 root root 5567640 14. Jan 00:11 r600_drv_video.so
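To put these listings next to the Firefox percentages quoted above, the size reductions can be computed directly from the byte counts. This is a small sketch using three of the files from the listings; `reduction_percent` and `size_report` are hypothetical helper names, and the sizes are copied unchanged from the `ls` output.

```python
# Sizes in bytes, copied from the ls listings above.
SIZES = {
    "libvdpau_radeonsi.so.1.0.0": {"normal": 9525520, "lto": 8078368, "lto+pgo": 5600328},
    "radeonsi_dri.so": {"normal": 18444192, "lto": 16878368, "lto+pgo": 11172768},
    "radeonsi_drv_video.so": {"normal": 9505072, "lto": 8074312, "lto+pgo": 5567640},
}


def reduction_percent(baseline, size):
    """Size reduction relative to the baseline, in percent."""
    return 100.0 * (baseline - size) / baseline


def size_report(sizes):
    """Map each file to its reduction per build variant versus the normal build."""
    report = {}
    for name, builds in sizes.items():
        baseline = builds["normal"]
        report[name] = {
            variant: reduction_percent(baseline, size)
            for variant, size in builds.items()
            if variant != "normal"
        }
    return report


if __name__ == "__main__":
    for name, row in size_report(SIZES).items():
        print(f"{name}: LTO -{row['lto']:.1f}%, LTO+PGO -{row['lto+pgo']:.1f}%")
```

For these three files the reductions come out at roughly 8 to 15% with LTO alone and roughly 39 to 41% with LTO+PGO.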
- Likes 1
Comment
-
Originally posted by hubicka: This compares GCC-built Firefox with LTO+non-PGO to non-LTO+non-PGO (-O3 optimization level)
and this compares GCC-built Firefox with LTO+PGO to non-LTO+non-PGO (-O3 optimization level)
With the tp5o benchmarks (which I think is the most relevant number for rendering popular webpages) one gets a 3% improvement for LTO, and 12% for LTO+PGO. Responsiveness goes up by 30% with LTO+PGO.
LTO reduces code size by 11% and LTO+PGO by 29%.
Honza, can we have this (responsiveness on top of 5.21's KWin) for openSUSE's TW KDE, please?
Comment