**david-nk** · 08 July 2021, 12:47 PM

Not too much changed here in the last 10+ years. GCC attempts to improve performance and manages to improve on 5 benchmarks while regressing on 6 others.
I've developed a system that can automatically choose the best performing compiler for a given translation unit and found that sometimes you have to go all the way back to GCC 4.x releases to get the best performance. Which is unfortunately no longer possible when you compile C++ code (at least when using C++11 or later features).

To stop this wandering around in the dark, extensive automated benchmarking NEEDS to be a mandatory part of patch submission.
It's similar to how it was with Wine before they had a large test suite and automated testing, people kept submitting patches that fixed one application, but might have broken two others (though Wine would also greatly benefit from adding performance-related tests).

**avem** · 08 July 2021, 01:04 PM

Originally posted by david-nk

Not too much changed here in the last 10+ years. GCC attempts to improve performance and manages to improve on 5 benchmarks while regressing on 6 others.
I've developed a system that can automatically choose the best performing compiler for a given translation unit and found that sometimes you have to go all the way back to GCC 4.x releases to get the best performance. Which is unfortunately no longer possible when you compile C++ code (at least when using C++11 or later features).

To stop this wandering around in the dark, extensive automated benchmarking NEEDS to be a mandatory part of patch submission.
It's similar to how it was with Wine before they had a large test suite and automated testing, people kept submitting patches that fixed one application, but might have broken two others (though Wine would also greatly benefit from adding performance-related tests).

I'm OK with Wine having performance regressions if patches still improve application compatibility. For GCC it's not really excusable though. Compilers are all about performance in the first place.

**avem** · 08 July 2021, 01:10 PM

To be fair, GCC 12 has fixed most of the performance regressions that GCC 11 has. However, it's unlikely that the Botan benchmark performance is going to be fixed any time soon: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55278 - there have been no updates for 8 years now.

**coder** · 08 July 2021, 05:48 PM

Originally posted by david-nk

To stop this wandering around in the dark, extensive automated benchmarking NEEDS to be a mandatory part of patch submission.

That's not a bad idea, but it's also not necessarily a recipe for improvement. Each patch would have to be tested with multiple different benchmarks, on multiple different ISAs and microarchitectures, because what might be a big win in some cases could be a small loss in others (or vice versa). It's even conceivable that GCC is near a local maximum in some benchmarks.

What's really needed is a deeper analysis of why performance regresses, in what should be good optimizations, and strategies for dealing with it. For instance, aggressive loop unrolling and inlining could be getting killed by instruction cache misses. In that case, maybe the optimizer lacks the ability to model instruction cache behavior. I'm sure the GCC developers probably have better ideas about what's lacking but maybe haven't the funding to do the big refactoring to address the major deficiencies.

**F.Ultra** · 08 July 2021, 07:59 PM

Originally posted by coder

That's not a bad idea, but it's also not necessarily a recipe for improvement. Each patch would have to be tested with multiple different benchmarks, on multiple different ISAs and microarchitectures, because what might be a big win in some cases could be a small loss in others (or vice versa). It's even conceivable that GCC is near a local maximum in some benchmarks.

What's really needed is a deeper analysis of why performance regresses, in what should be good optimizations, and strategies for dealing with it. For instance, aggressive loop unrolling and inlining could be getting killed by instruction cache misses. In that case, maybe the optimizer lacks the ability to model instruction cache behavior. I'm sure the GCC developers probably have better ideas about what's lacking but maybe haven't the funding to do the big refactoring to address the major deficiencies.

It would be interesting to see the same tables redone with -O2. Since this round was done at -O3, there might be experimental optimizations enabled that cloud the numbers.

**david-nk** · 08 July 2021, 08:06 PM

Originally posted by coder

That's not a bad idea, but it's also not necessarily a recipe for improvement. Each patch would have to be tested with multiple different benchmarks, on multiple different ISAs and microarchitectures, because what might be a big win in some cases could be a small loss in others (or vice versa). It's even conceivable that GCC is near a local maximum in some benchmarks.

Yes, there would have to be hundreds of highly varied benchmarks, across a wide variety of CPUs and memory configurations, otherwise there would be no difference to how things work now. That's probably also the reason it hasn't been done yet: maintaining the test systems is not free and someone needs to fund it.

Originally posted by coder

What's really needed is a deeper analysis of why performance regresses, in what should be good optimizations, and strategies for dealing with it. For instance, aggressive loop unrolling and inlining could be getting killed by instruction cache misses. In that case, maybe the optimizer lacks the ability to model instruction cache behavior. I'm sure the GCC developers probably have better ideas about what's lacking but maybe haven't the funding to do the big refactoring to address the major deficiencies.

Yes, but such work can only realistically happen with a testing framework to guide you, to see if deductions and assumptions are actually correct. Something that might seem like good logic to make a decision and indeed works well for the cases you are currently analyzing might fail in unexpected ways in other situations. Not only are CPUs complex, GCC is complex. Optimization/analysis passes can depend on each other and interfere with each other. It's a common problem with GCC that when one optimization is being applied, it sometimes breaks other ones, sometimes even really basic ones like constant propagation or strength reduction. The extensive testing (which already exists for correctness of code generation, but not performance) would tell you right away whether you overlooked something (you probably did).

Do you know Fishtest? It is what put Stockfish on the path to being the undisputed king of chess engines. When you have an idea that you think might improve the engine, you can have your change tested by the Fishtest framework, which plays thousands of games with the modified engine against the unmodified one. Only if this proves that your patch indeed improves playing strength with sufficiently high probability is it applied. And people have clever ideas all the time, often born from analysis of a position where Stockfish played a verifiably suboptimal move. But just because a change looked good on paper doesn't mean it will have a positive effect. The subject matter is too complex for anyone to reliably predict the outcome. In the end, only a small fraction of the changes manage to pass the tests. The same would undoubtedly be true for GCC - but it lacks this vital control mechanism.
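
To make the "accept only if it is an improvement with sufficiently high probability" idea concrete for compiler patches, here is a minimal sketch. It is not how Fishtest works internally (as far as I know Fishtest runs a sequential test on game results); it swaps in a much simpler exact sign test over per-benchmark runtimes. The function names, the 1% noise threshold and the 0.05 significance level are all assumptions picked for illustration.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """One-sided exact sign test.

    Probability of seeing at least `wins` wins in wins+losses trials
    if the patch had no effect (each benchmark is a coin flip).
    """
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def accept_patch(baseline, patched, alpha=0.05, noise=0.01):
    """Decide whether a patch passes the gate.

    baseline / patched: dicts mapping benchmark name -> runtime in seconds.
    Relative differences smaller than `noise` count as ties and are ignored.
    `alpha` and `noise` are hypothetical defaults, not tuned values.
    """
    wins = losses = 0
    for name, old in baseline.items():
        rel = (old - patched[name]) / old  # > 0 means the patch is faster
        if rel > noise:
            wins += 1
        elif rel < -noise:
            losses += 1
    return sign_test_p_value(wins, losses) < alpha
```

With the "improves 5 benchmarks, regresses 6" pattern from the first post, the p-value is about 0.73, so a gate like this would reject the patch. A real setup would also need repeated runs per benchmark and some weighting by the size of each win or loss, which this sketch ignores.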

**onlyLinuxLuvUBack** · 09 July 2021, 12:28 AM

Originally posted by david-nk

Not too much changed here in the last 10+ years. GCC attempts to improve performance and manages to improve on 5 benchmarks while regressing on 6 others.
I've developed a system that can automatically choose the best performing compiler for a given translation unit and found that sometimes you have to go all the way back to GCC 4.x releases to get the best performance. Which is unfortunately no longer possible when you compile C++ code (at least when using C++11 or later features).

To stop this wandering around in the dark, extensive automated benchmarking NEEDS to be a mandatory part of patch submission.
It's similar to how it was with Wine before they had a large test suite and automated testing, people kept submitting patches that fixed one application, but might have broken two others (though Wine would also greatly benefit from adding performance-related tests).

Is it available on Codeberg?

**oleid** · 09 July 2021, 02:58 AM

Originally posted by david-nk

To stop this wandering around in the dark, extensive automated benchmarking NEEDS to be a mandatory part of patch submission.

That can be tricky to do as part of a CI pipeline. The variances will be high.
What the Rust team does is also count the number of instructions executed for a benchmark (the "instructions:u" counter). It has a high correlation with the wall time of a benchmark and the variance is low.

What is perf.rust-lang.org measuring and why is "instructions:u" the default?

https://internals.rust-lang.org/t/what-is-perf-rust-lang-org-measuring-and-why-is-instructions-u-the-default/9815/3

I think we should explicitly track cache misses (https://github.com/rust-lang-nursery/rustc-perf/issues/370). The short version is that cache misses should correlate well to “expensive (high-latency) memory traffic” and “non-local access patterns”, whereas faults (ignoring memory-mapped files) correlate better with number of allocations, than anything happening during the lifetime of those allocations. By combining the number of instructions with the number of cache misses, we should have a pr...

rustc performance data

https://perf.rust-lang.org/

**binarybanana** · 09 July 2021, 10:03 AM

I really hate the Meson build system. It has no way to override any variable like CFLAGS after configuring the build environment. With makefiles you can do something like

Code:

make tests CFLAGS="-O2 -march=native -pipe"

and it will happily compile and run all tests with those settings even if you use more aggressive optimizations for the actual library code.

I've been experimenting with gentooLTO and added even more optimizations on top. Basically, I switched to -Ofast and added almost anything I could find in man gcc. To my surprise I could rebuild the whole system with only a few build and runtime failures. Those mostly involved auto parallelization.

But I gave up on running test suites because they are not a reliable way to test anything at those optimization levels. They sometimes fail to compile, or err out even when the actual code they're testing works correctly. I noticed that when I tried the "make test CFLAGS=$foo" trick in the build directory of some package and suddenly the tests all passed. That tells me the code being tested is OK since that's the code the tests call into, but the optimizations break the tests themselves and so they fail.

Sadly, with Meson there is no way to really configure this easily. It creates a ninja build and that hardcodes everything, so it can't be overridden arbitrarily. I've been thinking about writing a filter for ninja build scripts, but for now I'm using a gcc wrapper that checks the path of the file being compiled, and if it finds a directory with "test" in the path it filters out all the optimization flags before calling the real gcc.
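
A wrapper like that could look roughly like the sketch below. This is not the actual wrapper from the post, only an illustration of the idea: the real compiler path, the list of flag prefixes treated as "optimization flags" and the source file suffixes are all assumptions, and the names are made up.

```python
import os

REAL_GCC = "/usr/bin/gcc"  # assumption: where the real compiler lives

# Assumption: flag prefixes to strip; "-O" is case-sensitive,
# so the output flag "-o" is left alone.
OPT_PREFIXES = ("-O", "-march=", "-mtune=", "-flto", "-fgraphite", "-ftree-")
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx")


def is_test_path(path):
    """True if any directory component of the path contains "test".

    Crude on purpose: it also matches directories like "latest".
    """
    parts = os.path.normpath(path).split(os.sep)
    return any("test" in part.lower() for part in parts[:-1])


def filter_args(args):
    """Drop optimization flags when a source file lives under a test directory."""
    sources = [a for a in args if a.endswith(SOURCE_SUFFIXES)]
    if not any(is_test_path(s) for s in sources):
        return list(args)
    return [a for a in args if not a.startswith(OPT_PREFIXES)]


def main(argv):
    """Replace this process with the real gcc, using the filtered arguments."""
    os.execv(REAL_GCC, [REAL_GCC] + filter_args(argv[1:]))
```

To use it as a drop-in compiler, the script would call `main(sys.argv)` at the bottom (after `import sys`) and be placed ahead of the real gcc in PATH. Build systems that pass relative source paths would need the path resolved with `os.path.abspath` first, which is left out here.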

It doesn't always work and sometimes things really break, but things are fine more often than I expected.

Ultimately, I want to create automated benchmarks that try different sets of CFLAGS for each package and compare the results so that everything gets compiled with the optimizations that result in the most efficient code possible. The results could even be shared and merged so that most people wouldn't have to run the benchmarks themselves. There would probably have to be different databases of optimal flags for different CPU architectures, though.
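
The bookkeeping side of that could be sketched as follows. No such tool exists yet as far as this post goes, so the data layout (results keyed by CPU architecture and package, then by flag string) and the function names are purely hypothetical, and picking the lowest median runtime is just one reasonable choice.

```python
from collections import defaultdict
from statistics import median


def merge_results(*databases):
    """Merge shared result databases.

    Each database maps (arch, package) -> {flags: [runtimes in seconds]}.
    Runs for the same key and flags are pooled together.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for db in databases:
        for key, by_flags in db.items():
            for flags, runs in by_flags.items():
                merged[key][flags].extend(runs)
    return merged


def best_flags(db):
    """For every (arch, package), pick the flags with the lowest median runtime."""
    return {
        key: min(by_flags, key=lambda f: median(by_flags[f]))
        for key, by_flags in db.items()
    }


# Made-up numbers, only to show the shape of the data.
mine = {("znver2", "zlib"): {"-O2": [1.00, 1.02], "-O3": [0.95, 0.97]}}
theirs = {
    ("znver2", "zlib"): {"-O3": [0.96], "-Ofast": [0.99]},
    ("znver2", "botan"): {"-O2": [2.0], "-O3": [2.3]},
}
best = best_flags(merge_results(mine, theirs))
```

Keying on the architecture keeps results from different CPUs apart, which matches the point about needing separate databases per architecture.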

The Phoronix benchmarks of gcc in particular have been eye-opening. I've been a long-time Gentoo user but even I didn't expect the difference to be that large. I just used "-O3 -march=native" since forever and later LTO+Graphite with the gentooLTO overlay and that's it. Looking at those Phoronix compiler benchmarks, I expect an overall performance uplift in the 15-20% range compared to basic "-O2 -march=native" when every package is compiled with the most optimal flags, and that's huge.

Going by translation unit is another interesting idea, but there are already too many permutations to test when one wants to do it for each individual package on the system. But at least doing it for every binary/library object might be feasible and wouldn't require additional compilation.

GCC 8 Through GCC 11 Stable Plus GCC 12 Compiler Benchmarks
