Optimizing Mesa Performance With Compiler Flags
-
Originally posted by curaga:
@mark
It's mainly about the inlining. Yes, it can have that big an effect.
C++ templates greatly exacerbate it: when you have templates calling templates calling templates, you can get thousands of pointless function calls without inlining.

I guess I will add the inlining flag to my CXXFLAGS, and to individual C packages as well.
-
@mark
It's mainly about the inlining. Yes, it can have that big an effect.
C++ templates greatly exacerbate it: when you have templates calling templates calling templates, you can get thousands of pointless function calls without inlining.
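To illustrate the point, here is a minimal sketch with made-up names (not taken from any real project): a chain of small templates where every piece of work goes through several nested calls, which -finline-functions (enabled by -O3) typically collapses into straight-line arithmetic.

// Hypothetical illustration: templates calling templates calling templates.
// Without inlining, each mul_add() in the loop costs two extra calls;
// with -finline-functions the whole chain usually disappears.
#include <cstdio>

template <typename T>
T mul(T a, T b) { return a * b; }                  // innermost template

template <typename T>
T mul_add(T a, T b, T c) { return mul(a, b) + c; } // template calling a template

template <typename T>
T dot3(const T* x, const T* y)                     // template calling templates in a loop
{
    T acc = T(0);
    for (int i = 0; i < 3; ++i)
        acc = mul_add(x[i], y[i], acc);
    return acc;
}

int main()
{
    float a[3] = {1.0f, 2.0f, 3.0f};
    float b[3] = {4.0f, 5.0f, 6.0f};
    // e.g. compare the generated assembly:
    //   g++ -O2 -fno-inline -S dot.cpp    vs    g++ -O3 -S dot.cpp
    std::printf("%f\n", dot3(a, b));
    return 0;
}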
-
This change is mainly to benefit 32-bit systems, where SSE support can't be assumed by default; with the i965 driver, though, it can more often than not be assumed that an Intel Core 2 processor or newer is in use. (The older Intel processors are generally using the i915 driver.) By setting the -march=core2 flag for i386 builds, SSE would be used for floating-point math, cmov instructions would become available, and other performance optimizations would apply.
[...]
This patch was ultimately rejected, since it turns out there are still some old Pentium 4 systems that can be found in an i965 driver configuration, where things might break.
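As a rough sketch of the kind of code affected (a hypothetical function, not from Mesa or the rejected patch), the same float-heavy loop can be built once against the default 32-bit target and once with the proposed flags:

// Hypothetical example: scalar float math plus a clamp.
// Possible comparison, assuming a 32-bit-capable GCC toolchain:
//   g++ -m32 -O2 -S blend.cpp                             // default 32-bit target: x87 floating point
//   g++ -m32 -O2 -march=core2 -mfpmath=sse -S blend.cpp   // SSE scalar FP, cmov/minss available
float blend(const float* src, const float* dst, float alpha, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        float v = src[i] * alpha + dst[i] * (1.0f - alpha);
        acc += (v > 1.0f) ? 1.0f : v;   // clamp: a candidate for cmov/minss
    }
    return acc;
}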
-
Originally posted by curaga:
There are quite big differences between O2 and O3 with some software, especially if it's C++ with templates. Bullet physics was close to 10x slower with O2 (and the same with Os) compared to O3, last I tested.
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.
-finline-functions
Integrate all simple functions into their callers. The compiler heuristically decides which functions are simple enough to be worth integrating in this way.
If all calls to a given function are integrated, and the function is declared static, then the function is normally not output as assembler code in its own right.
-funswitch-loops
Move branches with loop invariant conditions out of the loop, with duplicates of the loop on both branches (modified according to result of the condition).
-fpredictive-commoning
Perform predictive commoning optimization, i.e., reusing computations (especially memory loads and stores) performed in previous iterations of loops.
-fgcse-after-reload
When -fgcse-after-reload is enabled, a redundant load elimination pass is performed after reload. The purpose of this pass is to cleanup redundant spilling.
-ftree-vectorize
Perform loop vectorization on trees.
-fipa-cp-clone
Perform function cloning to make interprocedural constant propagation stronger. When enabled, interprocedural constant propagation will perform function cloning when externally visible function can be called with constant arguments. Because this optimization can create multiple copies of functions, it may significantly increase code size (see --param ipcp-unit-growth=value)
I fail to see how a factor of 10 could be reached with this...? Maybe these fipa and commoning thingies work better than they sound. The performance gain seems to come from heavier memory usage.
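For what it's worth, here is a made-up kernel (not from Bullet or Mesa) showing where -O3 helps on its own: -ftree-vectorize can turn a loop like this into SIMD code. That alone rarely explains a 10x gap, though; for template-heavy C++ the inlining curaga describes above is the bigger factor.

// Made-up kernel showing the kind of loop -ftree-vectorize targets:
//   g++ -O2 -S saxpy.cpp   // scalar loop on GCC versions that only vectorize at -O3
//   g++ -O3 -S saxpy.cpp   // loop typically rewritten with SIMD loads/stores
void saxpy(float* __restrict__ y, const float* __restrict__ x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}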
-
While you maybe can't optimize for Core 2 for compatibility reasons, it is certainly safe to enable SSE and SSE2 in 32-bit i965 builds. That optimization could perhaps still be done.
There are indeed i965 chipsets that support the Celeron M processor (and some motherboards may unofficially support Pentium 4 CPUs as well). That processor lacks the SSE3 and SSSE3 support the Core 2 has. It could probably be optimized for Pentium M/Celeron M instead; then at least SSE2 would be enabled.
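One way to check what a given -march setting actually enables is GCC's predefined macros; a small sketch (exact output depends on the compiler version):

// Sketch: report which SSE levels the chosen -march enables at compile time.
// For a quick look without compiling anything, GCC's macro dump works too:
//   gcc -m32 -march=pentium-m -dM -E - < /dev/null | grep -i sse
//   gcc -m32 -march=core2     -dM -E - < /dev/null | grep -i sse
#include <cstdio>

int main()
{
#ifdef __SSE2__
    std::puts("SSE2 enabled at compile time");   // expected with both pentium-m and core2
#endif
#ifdef __SSE3__
    std::puts("SSE3 enabled at compile time");   // expected with core2, not pentium-m
#endif
#ifdef __SSSE3__
    std::puts("SSSE3 enabled at compile time");  // expected with core2, not pentium-m
#endif
    return 0;
}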
-
There are quite big differences between O2 and O3 with some software, especially if it's C++ with templates. Bullet physics was close to 10x slower with O2 (and the same with Os) compared to O3, last I tested.
-
So the flags do exactly what the manpage says: -O2 is a good, stable optimization level, while -O3 needs more compile time and may or may not improve the resulting binary, so it is mostly a waste of energy and time (unless you enjoy tinkering and consider compiling Linux with every flag permutation a game). I would only enable it for individual applications if I am not satisfied with -O2 (ffmpeg seemed to gain a little performance from -O3, but I did not benchmark it).
In my experience, in most cases -O3 does not noticeably improve performance (as in the article), and on top of that the -Os and -O3 flags can break programs with unexpected segfaults.
So the only compile flags I have used for years are -march=..., -O2 and, for gcc, -pipe.
For software it is better anyway to use efficient algorithms to solve the problem; no compiler optimization can turn an exponential algorithm into a linear one, it just produces slightly better exponential code (or not).
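A tiny made-up illustration of that last point: no -O level turns the exponential version below into the linear one.

// Illustration only: algorithm choice dominates compiler flags.
#include <cstdio>

// O(2^n): recomputes the same subproblems over and over; -O3 cannot fix that.
static unsigned long fib_slow(unsigned n)
{
    return n < 2 ? n : fib_slow(n - 1) + fib_slow(n - 2);
}

// O(n): a single pass, fast at any optimization level.
static unsigned long fib_fast(unsigned n)
{
    unsigned long a = 0, b = 1;
    for (unsigned i = 0; i < n; ++i) {
        unsigned long t = a + b;
        a = b;
        b = t;
    }
    return a;
}

int main()
{
    std::printf("%lu %lu\n", fib_slow(35), fib_fast(35));
    return 0;
}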