GCC Benchmarks At Varying Optimization Levels With Core i9 10900K Show An Unexpected Surprise
Michael Larabel
https://www.michaellarabel.com/
Originally posted by hubicka View Post
Michael, I tried to reproduce some of these on my setup and they don't reproduce.
I wonder if you can simply send me a few binaries of the benchmarks that regress worst in your builds (both GCC 9 and GCC 10), say for Coremark, Himeno, and x264, to [email protected]. I can see what is going on.
Been running some more tests today:
- Tried on an i9-10980XE Cascade Lake and Cascade Lake Xeon systems and did not reproduce...
- I went back to the i9-10900K and picked just a few of the tests where it was impacted the hardest, but surprisingly the results were similar on that run (the regression didn't show up).
- So then I restarted all the same tests on the 10900K as I originally ran them, and in the same order. This time, indeed, a similar performance hit was encountered at -O2 for the same tests. At -O3 -march=native, the results were similar to those in the article. Still running a confirmation LTO run, as running all the tests is quite time consuming, but:
It's looking like something really odd with the 10900K or Comet Lake? So it can be reproduced there, but it has something to do with the state of the 10900K system during the -O2 run: the same set of tests slow down on GCC 10.2, but seemingly only after being triggered by some event/behavior of the system during the -O2 run. Since it doesn't happen on GCC 9 or at -O3 -march=native, I'm guessing it's not thermal-related, and nothing out of the ordinary popped up in dmesg when checking there.
After the LTO confirmation run happens, I'll do another run with -O2 GCC9/GCC10 on the system while also enabling some of the perf recording and sensor measurements to see if that shows any strange anomalies for GCC10 -O2 compared to GCC9 -O2. When another system frees up I'll also retest Cascade Lake, but now running the full set of tests as opposed to just the few that were showing the regressed behavior, to see if on other non-Comet Lake hardware it triggers some odd behavior as well in that condition.
Michael Larabel
https://www.michaellarabel.com/
Originally posted by Michael View Post
Been running some more tests today: [...]
I noticed that people attribute the regression to the -O2 retuning (done by me). It was not aimed at making code slower and the compiler faster. The goal was to make the heuristics fit modern C++ codebases better while not giving up on code size (I used Firefox, Clang, GCC itself, SPEC, and a few more tests to validate it). So GCC now auto-inlines at -O2 (it used to do that only at -O3) with reduced limits, and it takes the inline keyword less seriously (because it is misplaced in C++ codebases). It does substantially more inlining for Firefox/Clang at the -O2 level. This actually slowed down compilation, especially with LTO and -O2 on large codebases, so we had to revisit the inliner and speed it up a lot to avoid compile-time regressions with LTO.
The inliner is driven by heuristics and by nature it never gets perfect. I can imagine this hurting certain specific lower-level codebases (where inline is placed more carefully than in usual programs), but, as in the kernel, it is always possible to use the always_inline attribute to override the heuristics. I would be curious to see test cases where performance degraded and analyse them.
I also know that -O2 -fprofile-use may get slower, since it is no longer identical to -O3 -fprofile-use. The reason is that optimizations now have different limits for -O2 and -O3. While enabling inlining at -O2 I also decreased its limits (so -O2 does not become another -O3). This was intentional; I think it was a bit of a historical mistake to have the defaults we had. -O2 -fprofile-use enables a lot of extra optimizations including auto-inlining (because those optimizations are no longer aggressively guessing what transform makes sense, but know from feedback what to do). It still makes sense, however, to take a hint from the developer and trade less code size for performance at -O2 compared to -O3. This is what optimization levels are for, after all: -O2 is generally a good default, -O3 is good for performance-sensitive and not extremely bloated code. The current defaults are certainly not perfect and can be improved, but I did test them against older GCCs and Clang.
Originally posted by ypnos View Post
Could you give some examples that illustrate your anecdote?
One was related to inline asm, where the optimizer screwed up by deciding that since a C variable used as the output wasn't modified by the C code, it could simply be elided, despite being referenced in the asm. The O2 case I mentioned was, umm, I don't remember, but it's referenced in the patches for that GCC, so...
The O3 cases were numerous enough that they basically all just blur into "don't use O3, except MAYBE for one file, as long as it's very, very hot and has massive test coverage".
Originally posted by ypnos View Post
I'm really curious as I have developed my software exclusively with -O3 for about 10 years now and never ran into problems.
It mostly comes down to what the code is. A simple C app and a massive multi-threaded C++ system are miles apart in terms of complexity. Hundreds of thousands of LOC are implicitly an order of magnitude more likely than tens of thousands of LOC to contain a construct or flow that causes the compiler to generate bad code. The broken path may only get used one time in 10,000, which could be 5,000 times a day in a server farm and realistically "never" in a single-user app. The error may well happen with you simply being oblivious to it (which, statistically, is almost certainly what lies behind your "I never had problems" position, rather than the problems actually being absent). It may only exist for a couple of revisions of GCC, when your userbase is 1. And so on, and so on.
The bottom line is, O3 has been repeatedly - almost RELIABLY :P - broken since the dawn of time. Hopefully you're not seriously going to try to argue that isn't the case, right?
Originally posted by ypnos View Post
Yes, I stumbled upon one or the other GCC bug and even submitted bug reports, but those were not optimization-related.
Again though: YOURS "weren't optimization-related". You got lucky. Other people's were. That's why the errata for every release of GCC have all those "O<n> bad code generation" entries.
Originally posted by hubicka View Post
I noticed that people attribute the regression to the -O2 retuning (done by me).
Last edited by sdack; 31 July 2020, 08:27 AM.