**coder** · 18 January 2021, 07:08 PM

Originally posted by sdack

What a disappointing and dreadful benchmark. It's the year 2021 and Phoronix is testing PGO on its own ... It's yet another opportunity missed to show the gains when combining LTO with PGO.

If you want to get Michael's attention, you should tag him.

And rather than being so dour, you could simply ask him to rerun the same benchmarks with PGO + LTO. It's a good point, and a nice idea for a follow-on article.

**coder** · 18 January 2021, 07:14 PM

Originally posted by dirlewanger88

The problem with PGO "benchmarks" like these is that "enabling" PGO isn't just a case of flipping a switch. The benefits of using it are entirely dependent on a good training run. I highly doubt it's being used to good effect in these benchmarks, especially since the article barely bothers to mention it.

It sounds like he ran the benchmark once, then used that profiling for PGO. In that case, his results should actually be better than one would see in real-world usage, as his training run exactly matches his test run (and we presume that, like any good benchmark, these are deterministic).

I don't mind seeing PGO in its best light. Sort of like: "here's the upper bound of what you can expect from PGO."

**coder** · 18 January 2021, 07:20 PM

Originally posted by hubicka

It is true that LTO and PGO combine well together, but they are also intended (and tested) to be useful separately.

It takes a while to load but https://lnt.opensuse.org/db_default/..._report/tuning
reports regular benchmarks with different optimization flags.

Nice! Thanks for sharing.

Originally posted by hubicka

The benefits of LTO and PGO depend heavily on the particular codebase being compiled. Bigger programs with more complicated control flow (such as parsers) benefit more than simple ones (such as programs that spend most of their time in matrix multiplication).

While I don't dispute that complicated control flow can benefit from PGO, my hope was that it would help the compiler with loop-unrolling and auto-vectorization. Is there any truth to that, currently?

Of course, if the loops have already been subject to manual unrolling and vectorization, there's much less room left for improvement, but I'm sure the majority of loops out there have not been brutalized in those ways. Even in hot-spots, simply knowing that compilers can now do unrolling makes me a lot less likely to do it by hand, since it's so error-prone and damages code-maintainability so severely.

**Grinch** · 19 January 2021, 12:35 AM

Originally posted by dirlewanger88

I'm willing to bet I understand how "code works" and also how PGO works a damn sight better than you do.

Judging by your comment, you don't have a clue; actually, I doubt you've even used it.

**nuetzel** · 19 January 2021, 01:35 AM

Michael, _Mesa_ (git) with LTO+PGO?

Maybe with 'sisched' re-enabled (patch from me)?

**sdack** · 19 January 2021, 09:06 AM

Originally posted by Grinch

Judging by your comment, you don't have a clue; actually, I doubt you've even used it.

The PGO benchmarks here didn't even include a comparison with and without -fprofile-partial-training. The benchmark quality here on Phoronix has been lacking for a while now and it's become hardly worth paying attention to the results. The benchmarks barely cover more than three runs, and when they do, it's a mess, like here, where 3 runs in one test are compared to 8 runs in another. The fact that the number of runs is shown in the graphs but not kept consistent means Phoronix no longer bothers producing quality; it has become more about the quantity of articles.

Some benchmarks that I've seen yield the exact same executable, so the results cannot genuinely differ. However, this does not get caught; instead, normal run-time deviations show up as different results, which turns the results into a sham.

Hence I only skim through the headlines and have stopped reading most of it.
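The identical-executable case is cheap to check before trusting a comparison. The following is only a minimal sketch of that idea in Python, not how the Phoronix Test Suite works: the function names are made up for illustration, and the noise check is a crude heuristic (gap between means versus sample standard deviation), not a proper significance test.

```python
import hashlib
import statistics
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_executable(path_a, path_b):
    """True when two builds produced byte-identical files."""
    return file_digest(path_a) == file_digest(path_b)


def within_noise(runs_a, runs_b):
    """True when the gap between the two means is no larger than the
    larger sample standard deviation of the two sets of runs.

    Crude heuristic only; each list needs at least two timings.
    """
    gap = abs(statistics.mean(runs_a) - statistics.mean(runs_b))
    noise = max(statistics.stdev(runs_a), statistics.stdev(runs_b))
    return gap <= noise
```

If `same_executable` returns True for two "different" configurations, any gap between their results is run-to-run deviation by construction, and `within_noise` gives a rough idea of how large that deviation is.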

**Grinch** · 19 January 2021, 05:26 PM

Originally posted by sdack

-fprofile-partial-training.

I don't see how this option would affect these benchmarks? It only affects code not exercised by the training run, and these benchmarks run the exact same tests during training as during the final benchmark.

Also, it's a GCC-specific flag; for a cross-compiler benchmark suite like this, it makes perfect sense to stick to options that work across compilers. There's a reason this flag isn't the default: for most use cases of PGO, it's better to keep untouched code small. And again, for the benchmark tests done here, I can't see how it would have an impact.

There are lots of more or less exotic or experimental options that could potentially improve performance. GCC has -fno-semantic-interposition, -fdevirtualize-at-ltrans, -fipa-pta, etc.; Clang/LLVM has -fexperimental-new-pass-manager, -fforce-emit-vtables, -fstrict-vtable-pointers, etc.

However, these are in my opinion outside the scope of these tests. There are reasons they are not automatically folded into the general optimization options: they can backfire and/or take an extremely long time to run.

Originally posted by sdack

The benchmark quality here on Phoronix has been lacking for a while now and it's become hardly worth paying attention to the results.

I certainly agree that the overall methodology has been sloppy at best, but I think it has improved, and Michael has shown a lot of willingness to improve it based upon feedback.

I understand his situation as well: when he gets conflicting advice, who should he listen to? He also needs to keep this manageable; he can't add every option under the sun. Heck, I'm glad to see a benchmark suite like this even touch PGO, because although it is such a powerful optimization, it's also a hassle and sometimes a pain to configure correctly.

**sdack** · 19 January 2021, 08:07 PM

Originally posted by Grinch

I don't see how this option would affect these benchmarks?

Because the benchmarks were chosen wrongly in the first place. All these benchmarks showed was that "PGO makes code run faster". But since the reasoning was based on the recent introduction of LTO and PGO to the kernel, the option -fprofile-partial-training would have made sense.

PGO doesn't just optimise code for one specific input data set. The optimised code can still be used for a range of sets. Take the video encoders as an example. One can train PGO with one video and still get performance gains while encoding many other videos with the same executable. This is how PGO is commonly used and this will also be the case for the Linux kernel.

The option -fprofile-partial-training can then make a difference, and I'm sure we will see it come up for the kernel as well. Had this been included in the article, it would at least have been a decent reiteration of PGO, but not even that was achieved.
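For anyone who hasn't done it by hand, the train-once, use-on-many-inputs cycle looks roughly like the sketch below. It only builds the three command lines and runs nothing. The helper name, `encoder.c`, and `sample.y4m` are made up for illustration, and `-O2` is just an assumed base level; `-fprofile-generate`, `-fprofile-use`, `-fprofile-partial-training` and `-flto` are the actual GCC flags.

```python
import shlex


def pgo_commands(sources, output, training_args, lto=False, partial_training=False):
    """Return the three steps of a GCC PGO cycle as argument lists:
    instrumented build, training run, optimized rebuild."""
    base = ["gcc", "-O2"] + (["-flto"] if lto else [])
    # Step 1: build with instrumentation so the binary records a profile.
    instrumented = base + ["-fprofile-generate", "-o", output] + list(sources)
    # Step 2: run the instrumented binary on a representative input.
    training = ["./" + output] + list(training_args)
    # Step 3: rebuild using the recorded profile.
    use_flags = ["-fprofile-use"]
    if partial_training:
        # Code never reached in training is then compiled as if no
        # profile existed, instead of being optimized for size.
        use_flags.append("-fprofile-partial-training")
    optimized = base + use_flags + ["-o", output] + list(sources)
    return [instrumented, training, optimized]


if __name__ == "__main__":
    # Hypothetical encoder trained on a single sample video.
    for cmd in pgo_commands(["encoder.c"], "encoder", ["sample.y4m"],
                            lto=True, partial_training=True):
        print(shlex.join(cmd))
```

The resulting binary is then used on videos other than the training sample, which is where code paths the training run never touched start to matter.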

**nuetzel** · 19 January 2021, 08:07 PM

Originally posted by hubicka

This compares GCC-built Firefox with LTO+non-PGO to non-LTO+non-PGO (-O3 optimization level)

and this compares GCC-built Firefox with LTO+PGO to non-LTO+non-PGO (-O3 optimization level)

With the tp5o benchmarks (which I think are the most relevant numbers for rendering popular webpages), one gets a 3% improvement for LTO and 12% for LTO+PGO. Responsiveness goes up by 30% with LTO+PGO.

LTO reduces code size by 11% and LTO+PGO by 29%.

Just as a recap from 2020
(my former Mesa git / radeonsi / r600 / etc. / vdpau file-size numbers)

21bc16a723

normal

-rwxr-xr-x 4 root root 9525520 13. Jan 20:00 libvdpau_radeonsi.so.1.0.0
-rwxr-xr-x 4 root root 9525520 13. Jan 20:00 libvdpau_r600.so.1.0.0

-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 swrast_dri.so
-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 radeonsi_dri.so
-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 r600_dri.so
-rwxr-xr-x 8 root root 18444192 13. Jan 20:00 kms_swrast_dri.so
-rwxr-xr-x 4 root root 9505072 13. Jan 20:00 radeonsi_drv_video.so
-rwxr-xr-x 4 root root 9505072 13. Jan 20:00 r600_drv_video.so

-Db_lto=true

-rwxr-xr-x 2 root root 8078368 13. Jan 21:24 libvdpau_r600.so.1.0.0
-rwxr-xr-x 2 root root 8078368 13. Jan 21:24 libvdpau_radeonsi.so.1.0.0

-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 kms_swrast_dri.so
-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 r600_dri.so
-rwxr-xr-x 2 root root 8074312 13. Jan 21:24 r600_drv_video.so
-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 radeonsi_dri.so
-rwxr-xr-x 2 root root 8074312 13. Jan 21:24 radeonsi_drv_video.so
-rwxr-xr-x 4 root root 16878368 13. Jan 21:24 swrast_dri.so

-Db_lto=true -Db_pgo=use

-rwxr-xr-x 4 root root 5600328 14. Jan 00:11 libvdpau_radeonsi.so.1.0.0
-rwxr-xr-x 4 root root 5600328 14. Jan 00:11 libvdpau_r600.so.1.0.0

-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 swrast_dri.so
-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 radeonsi_dri.so
-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 r600_dri.so
-rwxr-xr-x 8 root root 11172768 14. Jan 00:11 kms_swrast_dri.so
-rwxr-xr-x 4 root root 5567640 14. Jan 00:11 radeonsi_drv_video.so
-rwxr-xr-x 4 root root 5567640 14. Jan 00:11 r600_drv_video.so
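To put the listings above next to the percentages quoted for Firefox, the relative size reduction can be worked out from the byte counts. A small sketch in Python; the sizes are copied from the listings, one file per group of hard-linked files, and the function name is just for illustration.

```python
def reduction_percent(baseline, size):
    """Size reduction relative to the baseline build, in percent."""
    return 100.0 * (baseline - size) / baseline


# Byte sizes taken from the three listings (normal, LTO, LTO+PGO).
sizes = {
    "radeonsi_dri.so": {"normal": 18444192, "lto": 16878368, "lto+pgo": 11172768},
    "libvdpau_radeonsi.so.1.0.0": {"normal": 9525520, "lto": 8078368, "lto+pgo": 5600328},
    "radeonsi_drv_video.so": {"normal": 9505072, "lto": 8074312, "lto+pgo": 5567640},
}

if __name__ == "__main__":
    for name, s in sizes.items():
        lto = reduction_percent(s["normal"], s["lto"])
        pgo = reduction_percent(s["normal"], s["lto+pgo"])
        print(f"{name}: LTO -{lto:.1f}%, LTO+PGO -{pgo:.1f}%")
```

For the DRI drivers this comes out at about 8.5% smaller with LTO and about 39% smaller with LTO+PGO; the VDPAU and video libraries shrink by about 15% and 41%.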

**nuetzel** · 19 January 2021, 08:27 PM

Originally posted by hubicka

This compares GCC-built Firefox with LTO+non-PGO to non-LTO+non-PGO (-O3 optimization level)

and this compares GCC-built Firefox with LTO+PGO to non-LTO+non-PGO (-O3 optimization level)

With the tp5o benchmarks (which I think are the most relevant numbers for rendering popular webpages), one gets a 3% improvement for LTO and 12% for LTO+PGO. Responsiveness goes up by 30% with LTO+PGO.

LTO reduces code size by 11% and LTO+PGO by 29%.

BTW

Honza, can we have this (responsiveness on top of 5.21's KWin) for openSUSE's TW KDE, please?

GCC's Profile Guided Optimization Performance With The Ryzen 9 5950X
