Link-Time Optimizations With GCC 4.8

  • Link-Time Optimizations With GCC 4.8

    Phoronix: Link-Time Optimizations With GCC 4.8

    GCC 4.8 will feature a few improvements to LTO (Link-Time Optimization), but will they translate into better performance for the resulting binaries?..

    http://www.phoronix.com/vr.php?view=MTI5ODE

  • #2
    I saw all the linked results as well.
    Basically, a few percent improvement. About 4-5% over stock.

    Free performance is always good, but maybe not at the cost of 3x the build time and 2.5x the RAM use.

    • #3
      Originally posted by mayankleoboy1
      I saw all the linked results as well.
      Basically, a few percent improvement. About 4-5% over stock.

      Free performance is always good, but maybe not at the cost of 3x the build time and 2.5x the RAM use.

      When it becomes a more consistent win, it will make sense to use it in release builds of binaries that are redistributed. I think it's definitely worth having packaging take 3x longer if it makes the resulting binary 5% faster and strips out lots of dead code too.

      • #4
        I seriously doubt Michael is using LTO correctly.

        When you are using just a single command to compile, like gcc -march=native -O3 -flto -fwhole-program ... it works fine, but when you use a makefile with separate C(XX)FLAGS and LDFLAGS you need to pass the C(XX)FLAGS along to the LDFLAGS, else the optimization will suffer greatly. So you should do something like this:

        CXXFLAGS = -O3 -march=native -flto -fwhole-program
        LDFLAGS = $(CXXFLAGS) -Wall

        I've done many LTO comparisons and there isn't always a gain (a lot of the benefits of LTO can be had by just declaring functions static where appropriate), but I've never come across regressions like the ones in Michael's tests. Hence I suspect he is not passing the C(XX)FLAGS along to the linker through LDFLAGS in the tests that use a makefile with separate C(XX)FLAGS/LDFLAGS, which in turn means the C(XX)FLAGS optimizations aren't being used when generating the final binary.
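
        To illustrate the aside about static, here is a minimal sketch. The names square and sum_of_squares are hypothetical and not taken from any project discussed in this thread. Marking the helper static gives it internal linkage, so the compiler can see every call site within the file and may inline it or drop the standalone copy, even without LTO.

        ```c
        /* Hypothetical example: the function names are illustrative only. */

        /* static => internal linkage: invisible to other .c files, so the
           compiler knows all call sites and may inline or discard it. */
        static int square(int v)
        {
            return v * v;
        }

        /* Externally visible entry point that uses the helper. */
        int sum_of_squares(int n)
        {
            int sum = 0;
            for (int i = 1; i <= n; i++)
                sum += square(i);
            return sum;
        }
        ```

        Without static, the compiler has to assume another object file might call square, so it must keep an out-of-line copy; that is the kind of cross-file knowledge LTO otherwise supplies.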

        • #5
          Originally posted by XorEaxEax
          it works fine, but when you use a makefile with separate C(XX)FLAGS and LDFLAGS you need to pass the C(XX)FLAGS along to the LDFLAGS, else the optimization will suffer greatly. So you should do something like this:

          CXXFLAGS = -O3 -march=native -flto -fwhole-program
          LDFLAGS = $(CXXFLAGS) -Wall

          Is this enough?
          CXXFLAGS = -O3 -march=native -flto -fwhole-program
          LDFLAGS = -flto -Wall

          • #6
            Originally posted by LightBit
            Is this enough?
            CXXFLAGS = -O3 -march=native -flto -fwhole-program
            LDFLAGS = -flto -Wall

            AFAIK you need to pass the optimization flags as well; at least I recall having to do so the last time I benchmarked LTO (which was on 4.7, not 4.8), so:

            CXXFLAGS = -O3 -march=native -flto -fwhole-program
            LDFLAGS = -O3 -march=native -flto -fwhole-program -Wall (... and whatever other linker options you have)

            or just reference the CXXFLAGS variable as I did above:
            LDFLAGS = $(CXXFLAGS) -Wall

            I believe this is necessary because LTO can be applied to object files written in different languages, but I may be wrong. I haven't really dug into LTO, as I haven't gotten any major gains from it for my own code, particularly compared to PGO, which pretty much always yields gains, often significant ones.

            • #7
              Originally posted by XorEaxEax
              I believe this is necessary because LTO can be applied to object files written in different languages, but I may be wrong. I haven't really dug into LTO, as I haven't gotten any major gains from it for my own code, particularly compared to PGO, which pretty much always yields gains, often significant ones.

              I'd never heard of PGO until now, but would love to see some recent benchmarks. Most of the articles I saw were reporting up to ~10% gains.

              Also, from man gcc:
              Code:
              To use the link-time optimizer, -flto needs to be specified at compile time and during the final link.

              • #8
                Originally posted by LightBit
                Is this enough?
                CXXFLAGS = -O3 -march=native -flto -fwhole-program
                LDFLAGS = -flto -Wall

                No, then you get -O0 optimization. LTO means link-time optimization, which means the linker does the optimizing, which in turn means the linker needs the optimization flags, but the compiler does not.

                So
                CXXFLAGS = -flto
                LDFLAGS = -O3 -march=native -flto -fwhole-program

                Would work, but your example would not.

                Note that you can speed up compilation even more by disabling fat object files. By default, GCC produces object files that contain both the code for LTO linking and traditional object code; the latter is not needed if you are going to use LTO on the final link anyway. Edit: use -fno-fat-lto-objects as a compile-time flag.
                Last edited by carewolf; 02-10-2013, 02:21 PM.

                • #9
                  Originally posted by carewolf
                  Note that you can speed up compilation even more by disabling fat object files. By default, GCC produces object files that contain both the code for LTO linking and traditional object code; the latter is not needed if you are going to use LTO on the final link anyway. Edit: use -fno-fat-lto-objects as a compile-time flag.

                  That depends on your toolchain - IIRC non-fat LTO requires gold instead of the usual GNU ld.

                  • #10
                    Additionally, the optimization flags used to compile individual files are not necessarily related to those used at link time. For instance,

                    gcc -c -O0 -flto foo.c
                    gcc -c -O0 -flto bar.c
                    gcc -o myprog -flto -O3 foo.o bar.o


                    This produces individual object files with unoptimized assembler code, but the resulting binary myprog is optimized at -O3. If, instead, the final binary is generated without -flto, then myprog is not optimized.
                    http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

                    • #11
                      Originally posted by curaga
                      That depends on your toolchain - IIRC non-fat LTO requires gold instead of the usual GNU ld.

                      But don't you need to link using GCC or gold anyway to get LTO? The fat object files just make it possible to link without using LTO. So non-fat objects also help to check that you really are getting LTO and not just the old code through the fallback.

                      • #12
                        Originally posted by carewolf
                        No, then you get -O0 optimization. LTO means link-time optimization, which means the linker does the optimizing, which in turn means the linker needs the optimization flags, but the compiler does not.

                        So
                        CXXFLAGS = -flto
                        LDFLAGS = -O3 -march=native -flto -fwhole-program

                        Ah, yes, that makes much more sense. I had just noticed that unless I passed the optimization options in the linker flags, I got poor optimization (likely -O0).

                        Anyway, as I said earlier, I think this is the cause of the regressions in Michael's tests. I doubt he has passed the optimization options to LDFLAGS in the tests where these regressions occur. LTO often doesn't yield any 'worthwhile' gains in my benchmarks, but it also hasn't caused any worse performance for me. The overall benefit I've noticed is that the binaries pretty much always end up quite a bit smaller (likely due to dead/duplicate code removal, more efficient code reordering, etc.).

                        Yes, that pretty much sums it up, good pointer.

                        • #13
                          Originally posted by carewolf
                          But don't you need to link using GCC or gold anyway to get LTO? The fat object files just make it possible to link without using LTO. So non-fat objects also help to check that you really are getting LTO and not just the old code through the fallback.

                          I hit that with my toolchain - I couldn't use non-fat LTO, but I could use fat LTO. I definitely got the benefits (10% smaller binaries).

                          Quote from the gcc manual:
                          -ffat-lto-objects
                          Fat LTO objects are object files that contain both the intermediate language and the object code. This makes them usable for both LTO linking and normal linking. This option is effective only when compiling with -flto and is ignored at link time.

                          -fno-fat-lto-objects improves compilation time over plain LTO, but requires the complete toolchain to be aware of LTO. It requires a linker with linker plugin support for basic functionality. Additionally, nm, ar and ranlib need to support linker plugins to allow a full-featured build environment (capable of building static libraries etc).
                          (emphasis mine)

                          • #14
                            Right, and this speeds up compilation, so the "time to compile" benchmarks are completely messed up.

                            • #15
                              The Dhrystone benchmark is crap.

                              That benchmark is a derivative of the original 1988 dry.c, which was composed of two separate .c files.
                              Those two files were kept separate explicitly to prevent the compiler from inlining functions.

                              For example, suppose we write a tool to benchmark integer math and split it into mul.c and div.c with these functions:

                              Code:
                              int mul(int a, int b)
                              {
                                  return a * b;
                              }
                              Code:
                              int div(int a, int b)
                              {
                                  return a / b;
                              }
                              and from our main we call:

                              Code:
                              int test(int x)
                              {
                                  int sum = 0;
                                  for(int i = 1; i < x; i++)
                                      sum += div(mul(i, 100), 25);
                                  return sum;
                              }
                              Inlining those two functions will generate something like:

                              Code:
                              int test(int x)
                              {
                                  int sum = 0;
                                  for(int i = 1; i < x; i++)
                                      sum += (i * 100) / 25;
                                  return sum;
                              }
                              which the compiler optimizes to

                              Code:
                              int test(int x)
                              {
                                  int sum = 0;
                                  for(int i = 1; i < x; i++)
                                      sum += i * 4;
                                  return sum;
                              }
                              with a huge performance gain.
                              While LTO is good in real use, it can distort many benchmarks, so I'd use it only in real-world scenarios.
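
                              A minimal sketch of the point above, with the assumptions flagged. The helpers are renamed mul_i and div_i (hypothetical names, chosen so that div does not collide with div() from <stdlib.h>), and everything sits in one file for brevity rather than in mul.c and div.c. The loop sums its results so the work cannot simply be discarded, and x is assumed small enough that i * 100 does not overflow int.

                              ```c
                              /* Hypothetical stand-ins for the functions in mul.c and div.c. */
                              int mul_i(int a, int b) { return a * b; }
                              int div_i(int a, int b) { return a / b; }

                              /* What the benchmark asks for: two calls per iteration. */
                              long test_calls(int x)
                              {
                                  long sum = 0;
                                  for (int i = 1; i < x; i++)
                                      sum += div_i(mul_i(i, 100), 25);
                                  return sum;
                              }

                              /* What the loop can be reduced to once both calls are inlined:
                                 (i * 100) / 25 equals i * 4 when nothing overflows. */
                              long test_folded(int x)
                              {
                                  long sum = 0;
                                  for (int i = 1; i < x; i++)
                                      sum += i * 4;
                                  return sum;
                              }
                              ```

                              Under those assumptions both functions return the same value, which is why a compiler that can see across the two source files is free to turn the first loop into the second and the benchmark ends up timing far less work than intended.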
