Optimizing Mesa Performance With Compiler Flags

  • Optimizing Mesa Performance With Compiler Flags

    Phoronix: Optimizing Mesa Performance With Compiler Flags

    Compiler tuning can lead to performance improvements for many computational benchmarks by toying with the CFLAGS/CXXFLAGS, but is there much gain out of optimizing your Mesa build? Here are some benchmark results...

    http://www.phoronix.com/vr.php?view=MTI4NTY

  • #2
    I would be very interested in how -O1, -O2, and -O3 compare to -Os (optimize for size). When code is smaller you can get fewer instruction-cache misses, which can lead to faster execution. Ruby is known to run faster with -Os.


    • #3
      Originally posted by ncopa
      I would be very interested in how -O1, -O2, and -O3 compare to -Os (optimize for size). When code is smaller you can get fewer instruction-cache misses, which can lead to faster execution. Ruby is known to run faster with -Os.
      My friend, please provide a benchmark or two to back that claim up. I have found no tests for Ruby using -Os. I would welcome a link to those tests.
      PS: In that search I found the "Falcon patch", which made Ruby much faster regardless of the flags.


      • #4
        I guess the bottleneck of most video games is not OpenGL, unless the game is designed for high-end graphics cards. Check this with any profiler: gl... calls are almost unnoticeable among game physics and logic. Compiling the game itself and its main libraries with these flags, instead of the driver, could give a very different result.


        • #5
          Originally posted by ryszardzonk
          My friend, please provide a benchmark or two to back that claim up.
          Use the link in my post. (click on "known").


          • #6
            So the flags do exactly what the manpage says: -O2 is a good, stable optimization level, while -O3 needs more compile time and may or may not improve the resulting binary, so it is mostly a waste of energy and time (unless you like playing and consider compiling Linux with every flag permutation a game). I would only enable it for individual applications if I am not satisfied with -O2 (ffmpeg seemed to gain a little performance from -O3, but I did not benchmark this).
            In my experience, in most cases -O3 does not improve performance noticeably (as in the article), and additionally the -Os and -O3 flags can break programs with unexpected segfaults.
            So the only compile flags I have used for years are -march=..., -O2 and, for gcc, -pipe.
            For software it is better anyway to use efficient algorithms to solve a problem: no compiler optimization can turn an exponential algorithm into a linear one, it just produces slightly better exponential code (or not).
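
To illustrate the point about algorithms versus flags, here is a minimal sketch using Fibonacci numbers. The function names are made up for this example. The first version makes a number of calls that grows exponentially with n, and an optimization level does not change that growth; the second does the same job in a single linear pass.

```cpp
#include <cstdint>

// Naive recursion: the number of calls grows exponentially with n.
std::uint64_t fib_naive(unsigned n) {
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

// Iterative rewrite: one pass over 0..n-1, linear in n.
std::uint64_t fib_linear(unsigned n) {
    std::uint64_t a = 0, b = 1;
    for (unsigned i = 0; i < n; ++i) {
        std::uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}
```

Both functions return the same values; only the amount of work differs, and that difference comes from the source code rather than from the compiler.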


            • #7
              There are quite big differences between -O2 and -O3 with some software, especially if it's C++ with templates.

              Last I tested, Bullet Physics was close to 10x slower with -O2 than with -O3, and -Os gave the same result as -O2.


              • #8
                While you maybe can't optimize for Core2 for compatibility reasons, it is certainly safe to enable use of SSE and SSE2 in 32-bit i965. This optimization could perhaps be done.

                There are indeed i965 chipsets supporting the Celeron M processor (and some motherboards may unofficially support Pentium 4 CPUs). That processor does not have the SSE3 and SSSE3 support that the Core 2 has. Probably the build could be optimized for Pentium/Celeron M instead; then at least SSE2 would be enabled.


                • #9
                  Originally posted by ncopa
                  I would be very interested in how -O1, -O2, and -O3 compare to -Os (optimize for size). When code is smaller you can get fewer instruction-cache misses, which can lead to faster execution. Ruby is known to run faster with -Os.
                  It does not matter if this code is not a bottleneck.


                  • #10
                    Originally posted by curaga
                    There are quite big differences between -O2 and -O3 with some software, especially if it's C++ with templates.

                    Last I tested, Bullet Physics was close to 10x slower with -O2 than with -O3, and -Os gave the same result as -O2.
                    This is an interesting observation. I checked the manpage for the differences between -O2 and -O3, then thought about the differences between C and C++.
                    -O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.
                    Let's see what we can find there.

                    -finline-functions
                    Integrate all simple functions into their callers. The compiler heuristically decides which functions are simple enough to be worth integrating in this way.

                    If all calls to a given function are integrated, and the function is declared static, then the function is normally not output as assembler code in its own right.
                    This affects C too: it looks like a function call is replaced by the function's code. This should result in less stack usage, but the function has to be simple enough that setting up a new stack frame costs more than executing the function itself. Seems to be relatively useless.
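
For readers who want to see what "integrating a function into its caller" amounts to, here is a small hand-written sketch. The names are invented for illustration, and the second function is only what the result looks like conceptually, not actual compiler output. Besides saving the call itself, having the body in place lets the compiler optimize it together with the surrounding loop.

```cpp
#include <vector>

// A small helper: a typical candidate for inlining.
static int square(int x) { return x * x; }

// As written: one call to square() per element.
int sum_of_squares_call(const std::vector<int>& v) {
    int sum = 0;
    for (int x : v) sum += square(x);
    return sum;
}

// The same loop with the call integrated by hand.
int sum_of_squares_inlined(const std::vector<int>& v) {
    int sum = 0;
    for (int x : v) sum += x * x;
    return sum;
}
```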

                    -funswitch-loops
                    Move branches with loop invariant conditions out of the loop, with duplicates of the loop on both branches (modified according to result of the condition).
                    Sounds more like a case for a warning that someone should write more efficient code. This is not C++ specific.
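
A rough sketch of the transformation the manpage text describes, written out by hand. The function names and the `scale` flag are assumptions made up for the example. In the first version the condition does not change inside the loop but is still tested on every iteration; in the second it is tested once, with one copy of the loop per branch.

```cpp
#include <vector>

// Before: the loop-invariant condition `scale` is checked per element.
std::vector<int> apply_switched(std::vector<int> v, bool scale, int k) {
    for (int& x : v) {
        if (scale) x *= k;
        else       x += k;
    }
    return v;
}

// After unswitching by hand: one check, two specialized loops.
std::vector<int> apply_unswitched(std::vector<int> v, bool scale, int k) {
    if (scale) {
        for (int& x : v) x *= k;
    } else {
        for (int& x : v) x += k;
    }
    return v;
}
```

The first form is often the more readable one, which is one reason to leave this rewrite to the compiler.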

                    -fpredictive-commoning
                    Perform predictive commoning optimization, i.e., reusing computations (especially memory loads and stores) performed in previous iterations of loops.
                    I guess this also depends on the algorithms; it is pretty nice for Fibonacci numbers or heavy reuse of the same data in memory and stuff like that. It could be possible that object oriented code gains something from that.
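
As a hedged illustration of "reusing memory loads performed in previous iterations", consider summing neighbouring elements. The function names are invented here, and the second version is a manual rewrite showing the idea, not what GCC literally emits. In the first loop each iteration reads in[i] and in[i + 1]; in the second, the value read in one iteration is carried over in a local variable for the next.

```cpp
#include <cstddef>
#include <vector>

// As written: two loads per iteration, in[i] and in[i + 1].
std::vector<int> pair_sums(const std::vector<int>& in) {
    std::vector<int> out;
    for (std::size_t i = 0; i + 1 < in.size(); ++i)
        out.push_back(in[i] + in[i + 1]);
    return out;
}

// By hand: the element loaded in this iteration is kept in `prev`
// and reused in the next one, so each element is loaded once.
std::vector<int> pair_sums_carried(const std::vector<int>& in) {
    std::vector<int> out;
    if (in.size() < 2) return out;
    int prev = in[0];
    for (std::size_t i = 1; i < in.size(); ++i) {
        int cur = in[i];
        out.push_back(prev + cur);
        prev = cur;
    }
    return out;
}
```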

                    -fgcse-after-reload
                    When -fgcse-after-reload is enabled, a redundant load elimination pass is performed after reload. The purpose of this pass is to cleanup redundant spilling.
                    I have no idea what a load elimination pass or spilling is.

                    -ftree-vectorize
                    Perform loop vectorization on trees.
                    This basically rewrites loops to process several elements at once with SIMD instructions (data parallelism, not threads), and it is not C++ specific.
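
A minimal example of the kind of loop a vectorizer looks for: every iteration is independent of the others, so several elements could in principle be handled by one SIMD instruction. The function name is made up, the code assumes y has at least as many elements as x, and whether a given GCC version actually vectorizes it depends on the target and flags.

```cpp
#include <cstddef>
#include <vector>

// out[i] = a * x[i] + y[i]; no iteration depends on another one.
// Assumes y.size() >= x.size().
std::vector<float> scaled_add(float a,
                              const std::vector<float>& x,
                              const std::vector<float>& y) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = a * x[i] + y[i];
    return out;
}
```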

                    -fipa-cp-clone
                    Perform function cloning to make interprocedural constant propagation stronger. When enabled, interprocedural constant propagation will perform function cloning when externally visible function can be called with constant arguments. Because this optimization can create multiple copies of functions, it may significantly increase code size (see --param ipcp-unit-growth=value)
                    This sounds interesting with regard to C++, but I don't know if I understand it correctly: Let A and B be some classes, then A could call some ("externally visible" aka public?) methods b() of B, so the compiler clones these methods b() from B (into A? Or what?). Sounds like the instantiation of A includes B-code in this case. If B is a static class then we would gain A.b() and don't need to call B.b(), if my interpretation is right.
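
As I read the manpage text, the cloning is less about classes and more about arguments: if a function is called with a constant argument, the compiler may create a second copy specialized for that constant. Here is a hand-written sketch of that idea; all names are hypothetical, and the "clone" is written manually just to show the shape.

```cpp
// General function: `exponent` is only known at run time.
long power(long base, unsigned exponent) {
    long result = 1;
    for (unsigned i = 0; i < exponent; ++i) result *= base;
    return result;
}

// Hand-written "clone" of power() for callers that always pass 2.
// With the constant known, the loop disappears entirely.
long power_exp2(long base) { return base * base; }

// A caller that passes a constant argument...
long area(long side) { return power(side, 2); }

// ...and the same caller redirected to the specialized copy.
long area_cloned(long side) { return power_exp2(side); }
```

The original power() still has to exist for other callers, which is why the manpage warns about code size.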

                    I fail to see how a factor of 10 could be reached with this...? Maybe these fipa and commoning thingies work better than they sound. The performance gain seems to come from heavier memory usage.


                    • #11
                      This change is mainly to benefit 32-bit systems where SSE support can't be assumed by default, but with the i965 driver, more often than not it can be assumed an Intel Core 2 processor or newer is in use. (The older Intel processors are generally using the i915 driver.) By setting the -march=core2 flag, for i386 builds SSE would now be used for floating-point math and cmov instructions, plus other performance optimizations.
                      [...]
                      This patch was ultimately rejected since it turns out there's still some old Pentium 4s that could be found in an i965 driver configuration where things might break.
                      Then why not use something like -march=i686 -msse -msse2? That would let gcc use cmov and SSE/SSE2 instructions, and the binaries would still run on a P4.
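
One way to check what a given set of flags actually enabled is to look at GCC's predefined macros from inside the code. The sketch below uses `__SSE2__` and `__SSE2_MATH__`; the helper function names are made up. As far as I know, on 32-bit x86 `-msse2` alone makes the instructions available but leaves scalar floating-point math on the x87 unit unless `-mfpmath=sse` is also given, which is what the second macro reflects.

```cpp
// True if this translation unit was compiled with SSE2 instructions enabled.
constexpr bool built_with_sse2() {
#if defined(__SSE2__)
    return true;
#else
    return false;
#endif
}

// True if scalar double-precision math is done with SSE2 rather than x87.
constexpr bool fp_math_uses_sse2() {
#if defined(__SSE2_MATH__)
    return true;
#else
    return false;
#endif
}
```

Building a small test file with the flags in question and printing these two values shows whether the flags had the intended effect.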


                      • #12
                        @mark

                        It's mainly about the inlining. Yes, it can have that big an effect.

                        C++ templates greatly exacerbate that effect: when you have templates calling templates calling templates, you can get thousands of pointless function calls without inlining.
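
A toy version of "templates calling templates calling templates", to show how the calls stack up. The `Vec3` type and its helpers are invented for this example and are not taken from Bullet. If nothing is inlined, a single length_squared() call turns into eight calls: one to length_squared, one to dot, and six to at().

```cpp
#include <cstddef>

template <typename T>
struct Vec3 {
    T data[3];
    // Level 1: element access, a one-line function.
    const T& at(std::size_t i) const { return data[i]; }
};

// Level 2: built on at(), called six times per dot product.
template <typename T>
T dot(const Vec3<T>& a, const Vec3<T>& b) {
    T sum = T();
    for (std::size_t i = 0; i < 3; ++i) sum += a.at(i) * b.at(i);
    return sum;
}

// Level 3: built on dot().
template <typename T>
T length_squared(const Vec3<T>& v) { return dot(v, v); }
```

With inlining, all of this can collapse into three multiplications and two additions, which is where large speedups in heavily templated code can come from.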


                        • #13
                          Originally posted by curaga
                          @mark

                          It's mainly about the inlining. Yes, it can have that big an effect.

                          C++ templates greatly exacerbate that effect: when you have templates calling templates calling templates, you can get thousands of pointless function calls without inlining.
                          OK, makes sense. But shouldn't the programmer use inline functions or macros in this case?
                          I guess I will add the inline flag (-finline-functions) to my CXXFLAGS, and to the CFLAGS for individual C packages.


                          • #14
                            Originally posted by ryao
                            It does not matter if this code is not a bottleneck.
                            True. Modern CPUs also have bigger caches than before, so I would expect the inner loop to fit in cache.

                            It would still be interesting to see how -Os compares.


                            • #15
                              Originally posted by ncopa
                              True. Modern CPUs also have bigger caches than before, so I would expect the inner loop to fit in cache.

                              It would still be interesting to see how -Os compares.
                              I did not benchmark -Os but used it for some months instead of -O2. I felt no difference and sometimes had some segfaults that disappeared after switching back to -O2. I guess -Os is only worth looking at if you really need it and know what you are doing.
