GCC 11 Compiler Performance Benchmarks With Various Optimization Levels, LTO


  • #31
    `-Ofast` turns on `-ffast-math`, which is really unsafe for some software... I wouldn't even entertain the idea of using that on all software... It's risky.
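
    A minimal sketch of the kind of breakage meant here (hypothetical example; it relies on `-ffast-math` implying `-ffinite-math-only`, which lets GCC assume NaN never occurs):

    ```c
    #include <stdio.h>

    /* Hypothetical demo: build once with -O2 and once with -Ofast.
     *
     *   gcc -O2    fastmath_demo.c -o demo    # the NaN check behaves as written
     *   gcc -Ofast fastmath_demo.c -o demo    # the check may be folded away, since
     *                                         # -ffast-math lets the compiler
     *                                         # assume NaN cannot occur
     */
    static double safe_ratio(double num, double den)
    {
        double r = num / den;      /* 0.0 / 0.0 yields NaN */
        if (r != r) {              /* classic NaN self-comparison test */
            fprintf(stderr, "invalid input, using 0\n");
            return 0.0;
        }
        return r;
    }

    int main(void)
    {
        printf("%f\n", safe_ratio(0.0, 0.0));
        return 0;
    }
    ```

    Whether the branch actually disappears depends on the GCC version and target, but error handling that silently stops firing is exactly the sort of risk that makes a blanket `-Ofast` a bad idea.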


    • #32
      Originally posted by F.Ultra View Post
      Why not? The whole purpose of LTO is for the optimizer in the compiler to have access to all of the source code rather than having to work with one source file at a time. So if LTO produced slower code, then that means the optimizer did not in fact optimize the code -> there is a bug in the optimizer that makes it produce less optimized code.
      Even having access to the entire program code is insufficient to determine program behavior, due to data-dependence and practical limits on computability.

      What you need, for the compiler to make optimal speed/space tradeoffs, is profiling information that tells it the relative frequency of different code paths. PGO could be used to supply that information, although that presumes both that the profile data actually reaches the LTO stage and that the runtime behavior of the code is easily characterizable by a small set of sample workloads.
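
      As a rough sketch of how that profiling information would be fed in (hypothetical file and workload names; the flags are GCC's standard PGO options):

      ```c
      /* Hypothetical PGO + LTO build, commands shown as comments:
       *
       *   gcc -O2 -flto -fprofile-generate pgo_demo.c -o demo   # instrumented build
       *   ./demo < representative_input.txt                     # writes *.gcda profile data
       *   gcc -O2 -flto -fprofile-use pgo_demo.c -o demo        # rebuild using the profile
       *
       * With the profile, the optimizer knows which branch below dominates in the
       * sample workload and can lay out and inline code accordingly; without it,
       * it has to rely on static heuristics.
       */
      #include <stdio.h>

      static long process(long x)
      {
          if (x % 1000 == 0)         /* presumably the cold path */
              return x * 31;
          return x + 1;              /* presumably the hot path */
      }

      int main(void)
      {
          long sum = 0, x;
          while (scanf("%ld", &x) == 1)
              sum += process(x);
          printf("%ld\n", sum);
          return 0;
      }
      ```

      And if real workloads look nothing like representative_input.txt, the profile can steer the optimizer in the wrong direction, which is exactly the caveat above.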


      • #33
        Originally posted by coder View Post
        Even having access to the entire program code is insufficient to determine program behavior, due to data-dependence and practical limits on computability.
        You are missing the point. It is about the size of the scope. A larger scope means more opportunities for optimisations. At the very least it should result in the same code when there are no additional opportunities. And if there are more, then it should show in more optimisations. It is not possible to suddenly have fewer opportunities when the scope gets larger, unless the now-missing opportunities were in fact none at all, which means you have a faulty optimisation process.

        So either some optimisations at file scope are producing fast but faulty code and LTO undoes that (I doubt that, though), or the larger scope of LTO somehow masks some opportunities and this is why it leads to worse performance. It could have other causes, too, and it is all just speculation. However, the technical reasoning for why LTO should in general produce better, not worse, code remains sound. After all, this is the core idea behind LTO: to allow the optimisation process to find more opportunities.

        PGO then is very beneficial to LTO and one should try to combine the two whenever possible, because it does indeed make for a great combination. But the question is: why does LTO, when used on its own with GCC, appear to produce worse code? If this can get fixed, then it could likely lead to better performance when LTO gets combined with PGO. I see no reason why not.
        Last edited by sdack; 18 June 2021, 01:57 PM.


        • #34
          Originally posted by sdack View Post
          You are missing the point. It is about the size of the scope. A larger scope means more opportunities for optimisations.
          No, I get that. I'm just explaining why throwing the barn doors open to global interprocedural optimization might be a net losing proposition, and what makes it hard.

          Originally posted by sdack View Post
          when there are no additional opportunities. And if there are more, then it should show in more optimisations. It is not possible to suddenly have fewer opportunities when the scope gets larger, unless the now-missing opportunities were in fact none at all, which means you have a faulty optimisation process.
          I don't know how to explain it any clearer than I did. You can try rereading my explanation or maybe seek out better authorities on the matter.

          Whether you accept my conjecture or pursue insight elsewhere, the data is what it is. Before deciding it's a bug, you'd do well to understand the root cause.

          Originally posted by sdack View Post
          However, the technical reasoning for why LTO should in general produce better, not worse, code remains sound.
          If you take a step back and think about it, why do we have any higher standard for compiler performance with LTO than with -O3? And there are plenty of cases where -O3 is slower!


          • #35
            Originally posted by coder View Post
            If you take a step back and think about it ...
            I have. As I said, the idea of LTO is to allow for more optimisations. Edge cases aside, GCC LTO has shown itself to underperform in several benchmarks here on Phoronix, both now and in the past, while LLVM/Clang LTO has not. So from the number of benchmarks, as well as from the difference observed with another compiler, not to mention the years of work that went into developing LTO, one can come to the conclusion that there currently seems to be a regression with GCC LTO.

            If you want to tell yourself that all is fine, this is the way it should work, then go ahead.


            • #36
              Originally posted by coder View Post
              If you take a step back and think about it, why do we have any higher standard for compiler performance with LTO than with -O3? And there are plenty of cases where -O3 is slower!
              And cases where -O3 is slower are IMHO also due to bugs in the optimizer. PGO is no panacea either since not all applications will experience the same workload from run to run.

              And we do have a higher standard for LTO than for -O3, since with LTO we give the optimizer far more material to work with, instead of adding extra optimizer steps that are known to be buggy (-O3). Having access to more material should help the optimizer make a better judgement, not a worse one.
              Last edited by F.Ultra; 19 June 2021, 03:00 PM.


              • #37
                Originally posted by sdack View Post
                If you want to tell yourself that all is fine, this is the way it should work, then go ahead.
                That's not what I said. I just said it's a hard problem. That doesn't preclude further improvements, but it's a wholly different diagnosis than saying it's a bug.


                • #38
                  Originally posted by F.Ultra View Post
                  And cases where -O3 is slower are IMHO also due to bugs in the optimizer.
                  I think it's too sloppy to simply label it as a bug. A bug is something other than a limitation. It's a mismatch between intention and implementation. And I don't mean just an intention like "it should be faster", but like specific strategies that are not working as intended. It's also fixable.

                  We know there are computationally hard problems in code optimization. There are also lots of heuristics involved, and it's probably difficult to optimize them all, relative to each other.

                  Originally posted by F.Ultra View Post
                  PGO is no panacea either since not all applications will experience the same workload from run to run.
                  I'm well aware of that, but there are plenty of cases where it is usable and helps considerably. It's nice to have the option of using it, though I think it shouldn't be leaned on as too much of a crutch.

                  Originally posted by F.Ultra View Post
                  And we do have a higher standard for LTO than for -O3, since with LTO we give the optimizer far more material to work with, instead of adding extra optimizer steps that are known to be buggy (-O3). Having access to more material should help the optimizer make a better judgement, not a worse one.
                  Most software is designed pre-LTO, and therefore functions which would provide the greatest benefit by inlining are already defined as inline functions (or are at least somehow visible at file-scope). This limits the upside of LTO to doing inlining mostly where it can't help much (and inlining can always hurt by bloating code size).
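
                  For illustration, the kind of case where LTO's extra inlining scope does still pay off is a small accessor that was never put in a header (hypothetical two-file sketch):

                  ```c
                  /* counter.c (hypothetical): the definition lives in its own translation
                   * unit, so a non-LTO build of the caller only sees a prototype and must
                   * emit an out-of-line call. */
                  int counter_next(int c)
                  {
                      return c + 1;
                  }

                  /* caller.c (hypothetical):
                   *   gcc -O2 caller.c counter.c -o demo          # call stays a real call
                   *   gcc -O2 -flto caller.c counter.c -o demo    # optimizer can inline it
                   *                                               # and may fold the loop
                   *                                               # down to a constant
                   */
                  #include <stdio.h>

                  extern int counter_next(int c);

                  int main(void)
                  {
                      int c = 0;
                      for (int i = 0; i < 1000; i++)
                          c = counter_next(c);
                      printf("%d\n", c);
                      return 0;
                  }
                  ```

                  If counter_next() had instead been a static inline function in a header, the per-file compiler would already have been able to do this, which is exactly the limit on LTO's upside described above.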

                  There's another thing I'm curious about, and that's whether LTO has access to the source used to compile the original object files. If not, then it's really not the same as just giving the compiler more scope. It could be that all the original optimization decisions are baked and all LTO can do is just some additional inlining. That would mean it could do little else than remove some function call overhead, at best -- and just add code bloat, at worst.


                  • #39
                    Originally posted by coder View Post
                    ... but it's a wholly different diagnosis than saying it's a bug.
                    We are saying it could be a bug or a regression. Stop playing the contrarian by pointing out the obvious, that it could also not be the case. Nobody cares for shit like that. It makes for no good discussion.


                    • #40
                      Originally posted by coder View Post
                      I think it's too sloppy to simply label it as a bug. A bug is something other than a limitation. It's a mismatch between intention and implementation. And I don't mean just an intention like "it should be faster", but like specific strategies that are not working as intended. It's also fixable.

                      We know there are computationally hard problems in code optimization. There are also lots of heuristics involved, and it's probably difficult to optimize them all, relative to each other.
                      If the limitation is intended (a.k.a. by design) then it's a feature, but if the limitation is unintentional (e.g. the optimized code ends up slower) then it's a bug, IMHO.

                      Originally posted by coder View Post
                      Most software is designed pre-LTO, and therefore functions which would provide the greatest benefit by inlining are already defined as inline functions (or are at least somehow visible at file-scope). This limits the upside of LTO to doing inlining mostly where it can't help much (and inlining can always hurt by bloating code size).
                      And back then we sometimes did our own LTO by having huge .c files, just like how some C++ devs like to release some of their libs as a single header file. Inlining is IMHO not a benefit of LTO; the main benefit of LTO should be that the optimizer has access to how your functions and data are interoperating (e.g. being able to analyze the complete call path of a variable and determine the equivalent of "restrict" or "const", and so on).
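
                      A small, hypothetical sketch of that kind of whole-program reasoning, beyond plain inlining:

                      ```c
                      /* weights.c (hypothetical): reads only immutable data, but a per-file
                       * build of the caller cannot know that, since it only sees a prototype. */
                      static const int weights[4] = { 1, 2, 4, 8 };

                      int weight_of(int i)
                      {
                          return weights[i & 3];
                      }

                      /* caller.c (hypothetical): built with -flto, the interprocedural analysis
                       * can see that weight_of() has no side effects and depends only on its
                       * argument, so the call below can be treated as loop-invariant and hoisted;
                       * you would otherwise only get that by marking the function
                       * __attribute__((const)) by hand. */
                      #include <stdio.h>

                      extern int weight_of(int i);

                      long weighted_sum(const long *data, long n, int k)
                      {
                          long s = 0;
                          for (long i = 0; i < n; i++)
                              s += data[i] * weight_of(k);   /* candidate for hoisting under LTO */
                          return s;
                      }

                      int main(void)
                      {
                          long data[] = { 1, 2, 3, 4, 5 };
                          printf("%ld\n", weighted_sum(data, 5, 2));
                          return 0;
                      }
                      ```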

                      Originally posted by coder View Post
                      There's another thing I'm curious about, and that's whether LTO has access to the source used to compile the original object files. If not, then it's really not the same as just giving the compiler more scope. It could be that all the original optimization decisions are baked and all LTO can do is just some additional inlining. That would mean it could do little else than remove some function call overhead, at best -- and just add code bloat, at worst.
                      AFAIK it does. In GCC the optimizer works on the GIMPLE representation, and what LTO does is make GCC write that GIMPLE data to disk and then delay the optimization step until the link stage.
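
                      For reference, a sketch of what that looks like from the build side (commands shown as comments; the section names are my assumption, the flags are GCC's documented LTO options):

                      ```c
                      /* Sketch of GCC's -flto pipeline:
                       *
                       *   gcc -O2 -flto -c a.c b.c          # each .o carries GIMPLE bytecode
                       *                                     # (streamed into .gnu.lto_* sections)
                       *   gcc -O2 -flto a.o b.o -o app      # the "link" step reloads the combined
                       *                                     # GIMPLE, runs the interprocedural
                       *                                     # optimizations, then generates code
                       *
                       * Adding -ffat-lto-objects keeps regular machine code in the objects as
                       * well, so they can still be linked by a non-LTO toolchain; objdump -h on
                       * such an object should show both the normal and the LTO sections.
                       */
                      ```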
