Announcement

**Michael** · 28 July 2020, 06:12 AM

Originally posted by hubicka View Post

Michael, I tried to reproduce some of these on my setup and they don't.
I wonder if you can simply send me few binaries of those benchmarks that regress worst in your builds (both gcc9 and gcc10), say for Coremark, Himeno, X264 to [email protected]. I can see what is going on.

Sure, I'll be in touch, likely later today. I've got some time later on and plan to poke at them some more.

**Michael** · 28 July 2020, 02:30 PM

Originally posted by hubicka View Post

Michael, I tried to reproduce some of these on my setup and they don't.
I wonder if you can simply send me few binaries of those benchmarks that regress worst in your builds (both gcc9 and gcc10), say for Coremark, Himeno, X264 to [email protected]. I can see what is going on.

Been running some more tests today:

- Tried on an i9-10980XE Cascade Lake and Cascade Lake Xeon systems and did not reproduce...

- I went back to the i9-10900K and picked just a few of the tests where it was impacted the hardest, but then surprisingly the results were similar that run.

- So then I restarted all the same tests on the 10900K as I originally ran, in the same order. This time, a similar performance hit was indeed encountered for the same tests at -O2. At -O3 -march=native, the results were similar to those in the article. I'm still running a confirmation LTO run, as running all the tests is quite time consuming, but:

Looks like something really odd with the 10900K or Comet Lake? I can reproduce it there, but it has something to do with the state of the 10900K system during the -O2 run: the same set of tests slows down on GCC 10.2, but only after seemingly being triggered by some event/behavior of the system during that run. Since it doesn't happen on GCC 9 or at -O3 -march=native, I'm guessing it's not thermal related, and nothing out of the ordinary popped up in dmesg when I checked.

After the LTO confirmation run finishes, I'll do another run with -O2 GCC9/GCC10 on the system while also enabling some perf recording and sensor measurements to see if that shows any strange anomalies for GCC10 -O2 compared to GCC9 -O2. When another system frees up, I'll also retest Cascade Lake, this time running the full set of tests as opposed to just the few that showed the regressed behavior, to see if the same condition triggers odd behavior on non-Comet Lake hardware as well.

**FPScholten** · 28 July 2020, 03:33 PM

Just built my custom Linux 5.7.10-xanmod2 with GCC 10.2 and it is performing extraordinarily well. No problems there, so building the Linux kernel with 10.2 does not lead to performance regressions as far as I can tell. This is on a Haswell 4700MQ, and the build does use -O2.

**hubicka** · 28 July 2020, 03:37 PM

Originally posted by Michael View Post

Been running some more tests today:

- Tried on an i9-10980XE Cascade Lake and Cascade Lake Xeon systems and did not reproduce...

- I went back to the i9-10900K and picked just a few of the tests where it was impacted the hardest, but then surprisingly the results were similar that run.

- So then I restarted all the same tests on the 10900K as I originally ran, in the same order. This time, a similar performance hit was indeed encountered for the same tests at -O2. At -O3 -march=native, the results were similar to those in the article. I'm still running a confirmation LTO run, as running all the tests is quite time consuming, but:

Looks like something really odd with the 10900K or Comet Lake? I can reproduce it there, but it has something to do with the state of the 10900K system during the -O2 run: the same set of tests slows down on GCC 10.2, but only after seemingly being triggered by some event/behavior of the system during that run. Since it doesn't happen on GCC 9 or at -O3 -march=native, I'm guessing it's not thermal related, and nothing out of the ordinary popped up in dmesg when I checked.

After the LTO confirmation run finishes, I'll do another run with -O2 GCC9/GCC10 on the system while also enabling some perf recording and sensor measurements to see if that shows any strange anomalies for GCC10 -O2 compared to GCC9 -O2. When another system frees up, I'll also retest Cascade Lake, this time running the full set of tests as opposed to just the few that showed the regressed behavior, to see if the same condition triggers odd behavior on non-Comet Lake hardware as well.

What I found especially odd are small benchmarks like himeno, where the compiler can hardly make such a big difference (unless the configuration changes substantially by enabling some hardening or ISA extensions). It would indeed make sense to try running them separately and to look for downclocking/overheating events first. It would be nice to understand this, since at first look the himeno inner loops looked almost identical from both compilers.
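As a rough illustration of what checking for downclocking between runs could look like, here is a minimal sketch in C. It assumes a Linux system that exposes the cpufreq file `/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq` (in kHz) and, on Intel hardware, `/sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count`; whether these files exist depends on the kernel and driver in use. The helper names `read_sysfs_long` and `report_cpu0_state` are hypothetical and not part of any benchmark tooling mentioned in this thread.

```c
#include <stdio.h>

/* Hypothetical helper: reads one integer from a sysfs-style text file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int read_sysfs_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = fscanf(f, "%ld", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: prints cpu0's current frequency and throttle count.
 * The paths are assumptions about the running system (see above). */
void report_cpu0_state(const char *label)
{
    long khz = 0, throttles = 0;
    if (read_sysfs_long("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", &khz) == 0)
        printf("%s: cpu0 at %ld kHz\n", label, khz);
    else
        printf("%s: cpu0 frequency not available\n", label);
    if (read_sysfs_long("/sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count", &throttles) == 0)
        printf("%s: cpu0 throttle events so far: %ld\n", label, throttles);
    else
        printf("%s: throttle counter not available\n", label);
}
```

Calling `report_cpu0_state("before")` and `report_cpu0_state("after")` around a single benchmark run would show whether the throttle counter moved or the clock dropped during that run. It samples only one core at two points in time, so it is a sanity check rather than a replacement for proper sensor logging.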

I noticed that people attribute the regression to the -O2 retuning (done by me). The aim was not to make code slower and the compiler faster. The goal was to make the heuristics fit modern C++ codebases better while not giving up on code size (I used Firefox, clang, GCC itself, SPEC and a few more tests to validate it). So GCC now auto-inlines at -O2 (it used to do that at -O3 only) with reduced limits, and it takes inline keywords less seriously (because they are misplaced in C++ codebases). It does substantially more inlining for Firefox/clang at the -O2 level. This actually slowed down compilation, especially with LTO and -O2 on large codebases, so we had to revisit the inliner and speed it up a lot to avoid compile-time regressions with LTO.

The inliner is driven by heuristics and by nature it never gets perfect. I can imagine this hurting certain specific lower-level codebases (where inline is placed more carefully than in usual programs), but as in the kernel it is always possible to use the always_inline attribute to override the heuristics. I would be curious to see testcases where performance degraded and analyse them.
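For readers unfamiliar with that override, a minimal sketch in C follows. `__attribute__((always_inline))` is a GCC extension (also accepted by clang); the functions `clamp_u8` and `sum_clamped` are hypothetical examples and not taken from any of the benchmarks discussed here.

```c
#include <stddef.h>

/* Hypothetical small helper. The attribute asks GCC to inline it at every
 * call site, regardless of what the -O2 inlining heuristics would decide. */
static inline __attribute__((always_inline)) int clamp_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Hypothetical hot loop that relies on the helper being inlined. */
long sum_clamped(const int *values, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += clamp_u8(values[i]);
    return total;
}
```

Whether forcing the inline actually helps has to be measured per case; the attribute only removes the heuristic from the decision.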

I also know that -O2 -fprofile-use may get slower, since it is no longer identical to -O3 -fprofile-use. The reason for this is that optimizations now have different limits for -O2 and -O3. While enabling inlining at -O2 I also decreased its limit (so -O2 does not become another -O3). This was intentional. I think it was a bit of a historical mistake to have the defaults we had. -O2 -fprofile-use enables a lot of extra optimizations including auto-inlining (because those optimizations are no longer aggressively guessing which transform makes sense, but know from feedback what to do). It does, however, still make sense to take a hint from the developer and trade less code size for performance at -O2 than at -O3. This is what optimization levels are for after all: -O2 is generally a good default, -O3 is good for performance-sensitive and not extremely bloated code. The current defaults are certainly not perfect and they can be improved, but I did test them against older GCCs and clang.

**Sadako** · 28 July 2020, 03:37 PM

Originally posted by Michael View Post

After the LTO confirmation run finishes, I'll do another run with -O2 GCC9/GCC10 on the system while also enabling some perf recording and sensor measurements to see if that shows any strange anomalies for GCC10 -O2 compared to GCC9 -O2. When another system frees up, I'll also retest Cascade Lake, this time running the full set of tests as opposed to just the few that showed the regressed behavior, to see if the same condition triggers odd behavior on non-Comet Lake hardware as well.

May I suggest a run at -O2 with -march=native, and a run at -O3 without any -march setting too, to try to narrow the cause down a little?
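The suggestion amounts to filling in a 2x2 matrix, since the two configurations compared so far (-O2 and -O3 -march=native) differ in two variables at once. A small sketch in C that enumerates the four cells is below; the function names, the output binary names, and the source file argument are all hypothetical and only serve to spell out the combinations.

```c
#include <stdio.h>

/* The two variables that differed between the runs compared so far. */
static const char *const opt_levels[] = { "-O2", "-O3" };
static const char *const march_opts[] = { "", "-march=native" };

/* Writes the flag string for one cell of the matrix into buf.
 * Returns what snprintf returns (length of the full string). */
int build_flags(int level_idx, int march_idx, char *buf, size_t size)
{
    const char *march = march_opts[march_idx];
    return snprintf(buf, size, "%s%s%s",
                    opt_levels[level_idx], march[0] ? " " : "", march);
}

/* Prints one illustrative compile command per cell of the 2x2 matrix. */
void print_matrix(const char *source_file)
{
    char flags[64];
    for (int l = 0; l < 2; l++) {
        for (int m = 0; m < 2; m++) {
            build_flags(l, m, flags, sizeof flags);
            printf("gcc %s %s -o bench_%d%d\n", flags, source_file, l, m);
        }
    }
}
```

If the slowdown follows the -O level regardless of -march, that points at the optimization level; if it follows -march, that points at ISA selection instead.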

**arQon** · 29 July 2020, 08:27 AM

Originally posted by ypnos View Post

Could you give some examples that illustrate your anecdote?

From years ago and many projects later? My memory's not even close to that good, sorry.

One was related to inline asm where the O pass screwed up by deciding that since a C var used as the output wasn't modified by the C code it would simply be elided, despite being referenced in the asm. The O2 case I mentioned was, umm, I don't remember but it's referenced in the patches for that GCC, so...
The O3 cases were numerous enough that they basically all just blur into "don't use O3, except MAYBE for one file, as long as it's very very hot and has massive test coverage".

Originally posted by ypnos View Post

I'm really curious as I have developed my software exclusively with -O3 for about 10 years now and never ran into problems.

No offense, but that has to be the most blindly-irrelevant strawman I've seen someone use without ill intent in a long time. :P

It mostly comes down to what the code is. A simple C app and a massive multi-threaded C++ system are miles apart in terms of complexity. Hundreds of thousands of LOC are implicitly an order of magnitude more likely to have a construct or flow that causes the compiler to generate bad code than tens of thousands of LOC. The broken path may only get used 1 time in 10,000: which could be 5000 times a day in a server farm, and realistically "never" in a single-user app. The error may well happen with you simply being oblivious to it. (Which statistically is almost certainly the case for your "I never had problems" position, rather than that actually being the case). It may only exist for a couple of revisions of GCC, when your userbase is 1. And so on, and so on.

The bottom line is, O3 has been repeatedly - almost RELIABLY :P - broken since the dawn of time. Hopefully you're not seriously going to try to argue that isn't the case, right?

Originally posted by ypnos View Post

Yes I stumbled upon one or the other GCC bug and even submitted bug reports but those were not optimization-related.

Good for you. (no sarcasm). Filing bugs on compilers is really important.
Again though: YOURS "weren't optimization-related". You got lucky. Other people's were. That's why the errata for every release of GCC has all those "O<n> bad code generation" entries.

**sdack** · 31 July 2020, 08:25 AM

Originally posted by hubicka View Post

I noticed that people attribute the regression to the -O2 retuning (done by me).

Don't worry about it. They are guessing and trying to be helpful, like I was, too. I've called it a very likely compiler regression, but avoided calling it a definitive one. The harsh drop in performance is unusual for a regression. These are often in the range of 1%-15% or so, however not 50%. It then became something of an earthquake where everyone is waiting for the big one to come and so then one thinks this is it. Anyhow, I hope we get an update to the article.

**birdie** · 31 July 2020, 09:02 AM

Michael Have you found out anything new/interesting?

**Michael** · 31 July 2020, 09:04 AM

Originally posted by birdie View Post

Michael Have you found out anything new/interesting?

Just what I noted in the follow-up post the other day. I haven't had more time yet, but this weekend I'll hopefully have time to get back to trying some other Comet Lake Core CPUs.

**pyler** · 01 August 2020, 07:05 PM

Take down the article until there is a new update.

Seems like GCC is OK, no drama.

GCC Benchmarks At Varying Optimization Levels With Core i9 10900K Show An Unexpected Surprise
