GCC 4.7 Link-Time Optimization Performance


  • GCC 4.7 Link-Time Optimization Performance

    Phoronix: GCC 4.7 Link-Time Optimization Performance

    With the recent interest regarding Link-Time Optimization support within the Linux kernel by GCC, here are some benchmarks of the latest stable release of GCC (v4.7.1) when benchmarking several common open-source projects with and without the performance-enhancing LTO compiler support.

    http://www.phoronix.com/vr.php?view=17785

  • #2
    The GCC developers always mention that building Firefox with LTO now uses less RAM, etc.

    But what differences are there with large and complicated software like Firefox?



    • #3
      PHP takes three times as long to compile? Someone made a bad assumption there, or else something's wrong.

      But I thought the kernel generally had pretty low CPU usage anyway (I guess the BYTE benchmark being an exception). I'm unsure as to why big speed ups in general software usage would be expected from optimisations the compiler can do to the kernel.



      • #4
        some PGO benchmarks

        Along with LTO, I am interested in seeing some PGO benchmarks.



        • #5
          Originally posted by Cyborg16 View Post
          PHP takes three times as long to compile? Someone made a bad assumption there, or else something's wrong.

          But I thought the kernel generally had pretty low CPU usage anyway (I guess the BYTE benchmark being an exception). I'm unsure as to why big speed ups in general software usage would be expected from optimisations the compiler can do to the kernel.
          Where did you get the kernel stuff from? The test is about the effect of the -flto compile flag on application performance (and compile time).



          I wish the Apache benchmark had more information. It may be good enough for showing relative performance gains, but on its own it says very little.
          On one of my machines I went from 5000 req/s to 26000 req/s on a 6KB file just by using different options like keep-alive and the concurrency level.

          Moreover, there is little LTO can do if Apache uses dynamic modules. All the hooks and code paths still need to be there in case a module needs them.
          It would be more interesting to benchmark Apache compiled with the static-modules option.
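
          For reference (just a sketch; the host, file name and request counts are placeholders), this is the kind of ApacheBench invocation where keep-alive and concurrency come into play:

            # baseline: low concurrency, no keep-alive
            ab -n 100000 -c 10 http://localhost/test-6k.html

            # keep-alive enabled and higher concurrency
            ab -n 100000 -c 100 -k http://localhost/test-6k.html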



          • #6
            Originally posted by orome View Post
            Where did you get the kernel stuff from? The test is about the effect of the -flto compile flag on application performance (and compile time).
            Ah. I was speed-reading and only noticed the first link in the article. Thanks for pointing out my mistake!



            • #7
              Originally posted by mayankleoboy1 View Post
              Along with LTO, I am interested in seeing some PGO benchmarks.
              PGO is trickier. For any given app that you want to compile with PGO, you must write a script that runs the program through some representative task in order to generate a profile. If the profiling tasks are not representative, you may end up optimising the wrong paths.
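
              For reference, the usual GCC workflow looks roughly like this (a sketch; the source file and workload command are placeholders):

                # 1. build with profiling instrumentation
                gcc -O2 -fprofile-generate app.c -o app

                # 2. run a representative workload to collect profile data (.gcda files)
                ./app --typical-workload

                # 3. rebuild using the collected profile
                gcc -O2 -fprofile-use app.c -o app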



              • #8
                I could be wrong, but...

                If you are going to be using LTO, I don't think you are supposed to care about how long it takes to compile. Of course it's going to take longer, because you are optimizing the code on a global rather than a local scale. If you use LTO, you are already saying you care more about the speed of the binary than about how long it takes to compile. It is good to know the compile-time hit that LTO takes, but I don't think it's really that relevant when using LTO. It is definitely a release-build option and not a debugging/development option.

                I'd be very interested in how the binary size changes when using LTO. While there is more inlining, there may be some dead-code elimination and other GCC magic. Just curious if you can say: "when using LTO, the binary size will in general (in/de)crease".
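
                A quick way to check this for any given project (a sketch; file names are placeholders) is to build the same sources both ways and compare:

                  gcc -O2       -o app-nolto main.c util.c
                  gcc -O2 -flto -o app-lto   main.c util.c
                  size app-nolto app-lto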



                • #9
                  Originally posted by FourDMusic View Post
                  I could be wrong, but...

                  If you are going to be using LTO, I don't think you are supposed to care about how long it takes to compile.
                  Agreed. I am impressed that it's only 3 times longer. If there are 100 source files, then the optimiser has 100 times as much stuff to think about at once. I guess the real slowdown is when that makes you hit swap.



                  • #10
                    Originally posted by ssam View Post
                    Agreed. I am impressed that it's only 3 times longer. If there are 100 source files, then the optimiser has 100 times as much stuff to think about at once. I guess the real slowdown is when that makes you hit swap.
                    Well, that's of course only relevant if you're building the source yourself. If LTO is used for pre-compiled binaries, compile time matters very little to the user.



                    • #11
                      Originally posted by ssam View Post
                      Agreed. I am impressed that it's only 3 times longer. If there are 100 source files, then the optimiser has 100 times as much stuff to think about at once. I guess the real slowdown is when that makes you hit swap.
                      Sorry, your math doesn't add up. Let's say you have 100 files that, compiled without LTO, take 100 seconds, and with LTO a theoretical 300 seconds. How many times slower is the optimizer itself, then?
                      Of those 100 seconds, the C/C++ preprocessor has to expand the includes, the scanner has to tokenize each file, and the parser generates an AST (a tree that describes the source code). The AST is then lowered into GIMPLE, the GCC optimizer optimizes the GIMPLE representation (looking at every file individually), register allocation runs, and a .o (object file) is generated for every source file. At the end, the linking step is done.
                      All of that adds up to 100 seconds.
                      In LTO mode things happen a bit differently: the compiler writes out GIMPLE and there is no optimization up front; it happens later, in the linking step.
                      Since header expansion, template expansion, parsing and linking happen in both cases, suppose that optimizing the 100 files individually (from GIMPLE to .o files) takes, say, 50 seconds; then the non-optimizing work is the other 50 seconds.
                      With LTO, the optimizer gets the 300 seconds minus the parsing/expanding/etc. (= 50 seconds) = 250 seconds.
                      So the LTO optimizer basically runs about 5 times as slow (the numbers are fictitious, but close to reality).
                      Anyway, why just 5 times as slow and not 100? Because LTO is not a naive implementation: before optimizing, it is very easy to generate a call graph based on GIMPLE (both the LTO and the non-LTO optimizer do this). This call graph bounds how "many files at the same time" LTO really has to think about. So if a function does its own math routine and at the end calls printf, the optimizer doesn't need to consider any external function other than printf. Another part is that LTO has more information for inlining, so it inlines on a much larger scale. It can also consolidate more functions (like static constructors), since it knows which constructors exist and what their dependency graph is.
                      So LTO is some number of times slower than the entire compilation process (I think the optimizing step is around 30% of the time; I don't have real numbers, but if it were 30%, that would mean the LTO optimizer itself is about 7 times slower, with sometimes great effect and sometimes not so much).
                      In the end, I think LTO has great value for desktop applications: they contain a lot of code that is never used, because most big applications load a lot of frameworks at startup. I hope LTO turns out to be a combination that reduces startup time (since LTO can prove that parts of a function are not used at all and remove them safely, plus the combining of static constructors mentioned earlier). Performance-wise it is hard to say, unless the developer is not aware of how C++ works: most of the time the code uses templates (like the STL), and that expanded code is already well inlined, so there is no way LTO gives the compiler more information than the already specialized templates do.
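
                      For reference, the split described above is visible in how an LTO build is invoked (a sketch with placeholder file names):

                        # per-file step: parse and write GIMPLE into the object files,
                        # no cross-file optimization yet
                        gcc -O2 -flto -c a.c
                        gcc -O2 -flto -c b.c

                        # link step: this is where the whole-program optimization actually runs
                        gcc -O2 -flto a.o b.o -o app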



                      • #12
                        Originally posted by grigi View Post
                        The GCC developers always mention that building Firefox with LTO now uses less RAM, etc.

                        But what differences are there with large and complicated software like Firefox?
                        Firefox gets a lot smaller, but not much faster on the common benchmarks it is hand-optimized for. See
                        http://arxiv.org/abs/1010.2196



                        • #13
                          Originally posted by grigi View Post
                          The GCC developers always mention that building Firefox with LTO now uses less RAM, etc.

                          But what differences are there with large and complicated software like Firefox?
                          See http://arxiv.org/abs/1010.2196 . Firefox gets a lot smaller (it is easy to build an -O3 binary with LTO that has the size of an -Os binary without LTO) and performance is pretty much the same on common benchmarks, since they are hand-optimized (and spend a lot of time in JIT-optimized code). Things have got better since 2010, but the overall picture is still similar.

                          http://gcc.opensuse.org/ has some SPEC results with normal compilation, LTO and LTO+FDO. This gives you some more idea of what kind of speedups you can expect. LTO can be a big win when the codebase is not really hand-tuned, or is too complex to hand-tune for all workloads. It is a smaller performance win for hand-tuned programs (such as those in SPEC). This is not much of a surprise though: most of the optimizations enabled by LTO can also be done manually by the developers, so you can think of LTO as a tool making development easier.

                          Code size is harder to hand-tune than speed, since you need to optimize the whole program, not just a hot spot. Consequently LTO is almost always a big win for code size, unless you enable a lot of inlining or other code-expanding optimizations.



                          • #14
                            Note that compile-time benchmarks should really use -fno-fat-lto-objects. This makes LTO compile times much closer to non-LTO ones. Without this flag the object files actually contain both the non-LTO optimized assembly and the LTO data, so all compilation happens twice.
                            This is done so as not to break build systems that are still not quite ready for LTO-only object files. To use it, you need plugin-enabled binutils with ar/nm/etc. support, so you need to make simple wrappers around these tools that pass the appropriate --plugin flag. This is the reason why -fno-fat-lto-objects is not the default yet. (Both GCC and Firefox build happily with it.)
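
                            A minimal sketch of such a wrapper (the plugin path is an assumption and varies by distribution and GCC version):

                              #!/bin/sh
                              # hypothetical 'ar' wrapper that loads the GCC LTO plugin so ar
                              # can handle LTO-only (slim) object files
                              exec ar --plugin /usr/lib/gcc/x86_64-linux-gnu/4.7/liblto_plugin.so "$@"

                            The objects themselves would then be built with something like gcc -O2 -flto -fno-fat-lto-objects -c foo.c.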



                            • #15
                              The effects on Firefox are summarized in http://arxiv.org/abs/1010.2196 . In short, with LTO you can easily get an -O3-performing binary that has the size of an -Os binary without LTO, so there are huge code-size benefits. Performance benefits are smaller on common benchmarks, since they have been hand-optimized.

                              You can also look at http://gcc.opensuse.org/ . It has SPEC scores without LTO, with LTO, and with LTO+FDO.

                              In general, LTO can have a huge performance impact on codebases that were not hand-tuned, are too large to be hand-tuned for all workloads, or were developed with an LTO-enabled compiler (many HPC apps). The optimizations enabled by LTO can also be done by the developers, so LTO is kind of a tool that makes development easier. LTO also lets benchmark compilers pull various tricks (changing data structures, etc.), but that is not too important in the real world.

                              For code size, applications are harder to tune: instead of tuning a small hot spot you need to optimize the whole application. This is why LTO is almost always a win code-size-wise, unless you enable a lot of inlining or other code-expanding optimizations.

