
Thread: Optimizing Mesa Performance With Compiler Flags

  1. #1
    Join Date
    Jan 2007
    Posts
    14,240

    Default Optimizing Mesa Performance With Compiler Flags

    Phoronix: Optimizing Mesa Performance With Compiler Flags

    Compiler tuning can lead to performance improvements for many computational benchmarks by toying with the CFLAGS/CXXFLAGS, but is there much to gain from optimizing your Mesa build? Here are some benchmark results...

    http://www.phoronix.com/vr.php?view=MTI4NTY

  2. #2

    Default

    I would be very interested in how -O1, -O2, and -O3 compare to -Os (optimize for size). When code is smaller you get fewer cache misses, which leads to faster execution. Ruby is known to run faster with -Os.

  3. #3
    Join Date
    Mar 2011
    Posts
    90

    Default

    Quote Originally Posted by ncopa View Post
    I would be very interested in how -O1, -O2, and -O3 compare to -Os (optimize for size). When code is smaller you get fewer cache misses, which leads to faster execution. Ruby is known to run faster with -Os.
    My friend, please provide a bench or two to back that claim up. I have found no tests for Ruby using -Os. I would welcome a link to those tests.
    P.S. In that search I found the "Falcon patch", which regardless of the flags made Ruby much faster.

  4. #4
    Join Date
    Jun 2012
    Posts
    8

    Default

    I guess the bottleneck of most video games is not OpenGL, unless the game is designed for high-end graphics cards. Check this with any profiler: gl... calls are almost unnoticeable among the game physics and logic. Compiling the actual software and its main libraries instead of the driver could give a very different result.

  5. #5

    Default

    Quote Originally Posted by ryszardzonk View Post
    My friend, please provide a bench or two to back that claim up.
    Use the link in my post (click on "known").

  6. #6
    Join Date
    Apr 2011
    Posts
    35

    Default

    So the flags do exactly what the manpage says: -O2 is a good, stable optimization, while -O3 needs more compile time and may or may not improve the resulting binary, so it is mostly a waste of energy and time (unless you enjoy playing around and consider compiling Linux with all flag permutations a game). I would only enable it for single applications if I am not satisfied with -O2 (it seemed that ffmpeg gained a little performance from -O3, but I did not benchmark this).
    In my experience -O3 does not noticeably improve performance in most cases (as in the article), and on top of that the -Os and -O3 flags can break programs with unexpected segfaults.
    So the only compile flags I have used for years are -march=..., -O2 and, for gcc, -pipe.
    For software it is better anyway to use efficient algorithms to solve a problem; no compiler optimization can turn an exponential algorithm into a linear one, it just produces slightly better exponential code (or not).
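
    To make that last point concrete, here is a toy sketch of mine (nothing to do with Mesa): -O3 can shave constant factors off the slow version, but it cannot change its complexity; only rewriting the algorithm does that.

    // Hypothetical example: fib_slow() stays exponential no matter which -O level
    // you pick; fib_fast() is the linear rewrite a compiler cannot do for you.
    unsigned long fib_slow(unsigned n) {            // exponential algorithm
        return n < 2 ? n : fib_slow(n - 1) + fib_slow(n - 2);
    }

    unsigned long fib_fast(unsigned n) {            // linear algorithm
        unsigned long a = 0, b = 1;
        for (unsigned i = 0; i < n; ++i) {
            unsigned long t = a + b;
            a = b;
            b = t;
        }
        return a;
    }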

  7. #7
    Join Date
    Feb 2008
    Location
    Linuxland
    Posts
    4,975

    Default

    There are quite big differences between O2 and O3 with some software, especially if it's C++ with templates.

    Bullet physics was close to 10x slower with O2 (same result with Os) compared to O3, last I tested.

  8. #8
    Join Date
    May 2011
    Posts
    353

    Default

    While you maybe can't optimize for Core 2 for compatibility reasons, it is certainly safe to enable the use of SSE and SSE2 in 32-bit i965. This optimization could perhaps be done.

    There are indeed i965 chipsets that support the Celeron M processor (and some motherboards may indeed unofficially support Pentium 4 CPUs). That processor does not have the SSE3 and SSSE3 support that the Core 2 has. It could probably be optimized for Pentium/Celeron M; then at least SSE2 would be enabled.

  9. #9

    Default

    Quote Originally Posted by ncopa View Post
    I would be very interested in how -O1, -O2, and -O3 compare to -Os (optimize for size). When code is smaller you get fewer cache misses, which leads to faster execution. Ruby is known to run faster with -Os.
    It does not matter if this code is not a bottleneck.

  10. #10
    Join Date
    Apr 2011
    Posts
    35

    Default

    Quote Originally Posted by curaga View Post
    There are quite big differences between O2 and O3 with some software, especially if it's C++ with templates.

    Bullet physics was close to 10x slower with O2, same result with Os, when compared to O3 last I tested.
    This is an interesting observation. I checked the manpage and searched for the differences between -O2 and -O3, then thought about the differences between C and C++.
    -O3 Optimize yet more. -O3 turns on all optimizations specified by -O2
    and also turns on the -finline-functions, -funswitch-loops,
    -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and
    -fipa-cp-clone options.
    Let's see what we can find there.

    -finline-functions
    Integrate all simple functions into their callers. The compiler heuristically decides which functions are simple enough to be worth integrating in this way.

    If all calls to a given function are integrated, and the function is declared static, then the function is normally not output as assembler code in its own right.
    This affects C as well; it looks like the function call is replaced by the function's code. This should result in less stack usage, but the function has to be so simple that creating a new stack frame costs more than executing the function itself. Seems to be relatively useless.
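
    A toy example of my own (not from Mesa) of the kind of call I think this targets:

    // Hypothetical example: a trivial helper that -finline-functions may integrate
    // into its caller, so the call (and its stack frame) disappears entirely.
    static int square(int x) {                      // simple enough to inline
        return x * x;
    }

    int sum_of_squares(const int *v, int n) {
        int s = 0;
        for (int i = 0; i < n; ++i)
            s += square(v[i]);                      // after inlining: s += v[i] * v[i];
        return s;
    }

    And since square() is static and every call to it gets inlined, no standalone copy of it needs to be emitted, just like the manpage says.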

    -funswitch-loops
    Move branches with loop invariant conditions out of the loop, with duplicates of the loop on both branches (modified according to result of the condition).
    Sounds more like a case for a warning that someone should write more efficient code. This is not C++-specific.
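
    A toy sketch of mine of what I think "unswitching" means here:

    // Hypothetical example: 'negate' never changes inside the loop, so
    // -funswitch-loops can hoist the test out and emit two copies of the loop,
    // one per branch, instead of re-checking 'negate' on every iteration.
    void scale(float *out, const float *in, int n, bool negate) {
        for (int i = 0; i < n; ++i) {
            if (negate)                             // loop-invariant condition
                out[i] = -in[i];
            else
                out[i] = in[i];
        }
    }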

    -fpredictive-commoning
    Perform predictive commoning optimization, i.e., reusing computations (especially memory loads and stores) performed in previous iterations of loops.
    I guess this also depends on the algorithm; it is pretty nice for Fibonacci numbers or heavy reuse of the same memory data and stuff like that. It could be that object-oriented code gains something from this.
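
    A toy sketch of mine of what I think it does (I may be reading it wrong):

    // Hypothetical example: the a[i + 1] loaded in one iteration is the same value
    // as a[i] in the next one, so -fpredictive-commoning can keep it in a register
    // across iterations instead of loading it from memory again.
    void smooth(float *out, const float *a, int n) {
        for (int i = 0; i + 1 < n; ++i)
            out[i] = 0.5f * (a[i] + a[i + 1]);      // a[i + 1] becomes next iteration's a[i]
    }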

    -fgcse-after-reload
    When -fgcse-after-reload is enabled, a redundant load elimination pass is performed after reload. The purpose of this pass is to cleanup redundant spilling.
    I have no idea what a load elimination pass or spilling is.

    -ftree-vectorize
    Perform loop vectorization on trees.
    This basically restructures loops for parallel (SIMD) execution and is not C++-specific.
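
    For example (toy code of mine), a loop like this can be rewritten to handle several floats per instruction once SSE is available (e.g. through -march=...):

    // Hypothetical example: the iterations are independent, so -ftree-vectorize
    // can turn this into SIMD adds that process multiple elements at once.
    void add_arrays(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = a[i] + b[i];
    }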

    -fipa-cp-clone
    Perform function cloning to make interprocedural constant propagation stronger. When enabled, interprocedural constant propagation will perform function cloning when externally visible function can be called with constant arguments. Because this optimization can create multiple copies of functions, it may significantly increase code size (see --param ipcp-unit-growth=value)
    This sounds interesting with regard to C++, but I don't know if I understand it correctly: let A and B be some classes, then A could call some ("externally visible" aka public?) methods b() of B, so the compiler clones these methods b() from B (into A? Or what?). Sounds like the instantiation of A would include B's code in this case. If B is a static class then we would gain A.b() and would not need to call B.b(), if my interpretation is right.
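
    My own guess at a toy example (so take it with a grain of salt): the cloning seems to happen per constant argument rather than per class.

    // Hypothetical example: blend() is externally visible, but both callers pass a
    // constant 'mode', so -fipa-cp-clone may emit a specialized clone per constant
    // value in which the switch folds away, at the price of extra code size.
    float blend(float a, float b, int mode) {
        switch (mode) {
            case 0:  return a + b;
            case 1:  return a * b;
            default: return a;
        }
    }

    float add_blend(float a, float b) { return blend(a, b, 0); }    // clone specialized for mode == 0
    float mul_blend(float a, float b) { return blend(a, b, 1); }    // clone specialized for mode == 1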

    I fail to see how a factor of 10 could be reached with this... Maybe these fipa and commoning thingies work better than they sound. The performance gain seems to come at the cost of heavier memory usage.
