Following LTO, Linux Kernel Patches Updated For PGO To Yield Faster Performance

  • Following LTO, Linux Kernel Patches Updated For PGO To Yield Faster Performance

    Phoronix: Following LTO, Linux Kernel Patches Updated For PGO To Yield Faster Performance

    Clang LTO for the Linux kernel to provide link-time optimizations for yielding more performant kernel binaries (plus Clang CFI support) looks like it will land for Linux 5.12. With that compiler optimization feature appearing squared away, Google engineers are also working on Clang PGO support for the Linux kernel to exploit profile guided optimizations for further enhancing the kernel performance...

    http://www.phoronix.com/scan.php?pag...O-Linux-Kernel

  • #2
    But PGO overall isn't too widely used by upstream open-source projects due to effectively requiring two compiler builds...
    Could someone explain to me like I'm five?


    Overall pretty cool! I hope GCC will support these features soon or that Fedora will consider using Clang to build the kernel (if those perf increases are actually notable).



    • #3
      So now I'll have to build a kernel, run my desktop, run some games, and rebuild my kernel. Sounds like a perfect job for the PTS to help automate.



      • #4
        Originally posted by kvuj

        Could someone explain to me like I'm five?


        Overall pretty cool! I hope GCC will support these features soon or that Fedora will consider using Clang to build the kernel (if those perf increases are actually notable).
        With most compiler features, you flip it on and compile the code...

        PGO is about profiling the code to see which code paths are actually taken in the real world, which are the hottest, etc. The compiler then uses that information to make more accurate optimization decisions.

        So first you need to: build the kernel/code to be able to profile it -> run the generated program/kernel with workloads you actually use while collecting the profile -> rebuild the kernel/program while passing the previously generated profile to the compiler.
        Michael Larabel
        http://www.michaellarabel.com/
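For anyone who wants the build -> run -> rebuild cycle above spelled out, here is a minimal sketch that only prints the commands for an ordinary userspace C program built with Clang. The flags (`-fprofile-generate`, `llvm-profdata merge`, `-fprofile-use`) are standard Clang/LLVM ones, but the file names (`app.c`, `app.profdata`) and the `pgo_commands` helper are made up for illustration, and nothing here models how the kernel patches themselves collect a profile.

```python
# Sketch of the three PGO stages for a plain userspace C program built with Clang.
# File names and the helper name are hypothetical; this does not describe the
# kernel patch series, which needs its own way of exporting profile data.

def pgo_commands(source="app.c", binary="app", profile="app.profdata"):
    """Return the command lines for each stage, in the order they would be run."""
    # Stage 1: build with instrumentation so the program records counters.
    instrumented_build = ["clang", "-O2", "-fprofile-generate", source, "-o", binary]
    # Stage 2: run a representative workload; this leaves default_*.profraw files behind.
    run_workload = [f"./{binary}"]
    # Merge the raw counter files into one indexed profile.
    merge_profiles = ["llvm-profdata", "merge", f"-output={profile}", "default_*.profraw"]
    # Stage 3: rebuild, handing the merged profile back to the compiler.
    optimized_build = ["clang", "-O2", f"-fprofile-use={profile}", source, "-o", binary]
    return [instrumented_build, run_workload, merge_profiles, optimized_build]


if __name__ == "__main__":
    for command in pgo_commands():
        print(" ".join(command))
```

The commands are printed rather than executed so the sketch runs even without Clang installed; in a real shell the `default_*.profraw` glob would be expanded before `llvm-profdata` sees it.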



        • #5
          Originally posted by kvuj

          Could someone explain to me like I'm five?


          Overall pretty cool! I hope GCC will support these features soon or that Fedora will consider using Clang to build the kernel (if those perf increases are actually notable).
          You build the code with profiling instrumentation left in, then you run it to collect the profiling info: things like how hot a piece of data is, how likely a branch is to be taken, etc.
          Then you use that data to rebuild your code with the hot paths as a priority.

          PGO has been supported by GCC for a long time. It's more a case of needing to instrument something, and instrumenting a kernel is totally different.
          I suppose that much of this work is how to actually profile the Linux kernel, and that the person doing this knew the Clang system better?



          • #6
            Originally posted by grigi
            I suppose that much of this work is how to actually profile the Linux kernel, and that the person doing this knew the Clang system better?
            I would conjecture that it is more about having an itch and scratching it (as much FOSS contribution is). PGO and the resulting improvements are likely important to device makers at both the high end (hyperscalers) and the low end (typically mobile). As Google has been building its Android kernels with Clang for quite some time (for CFI) and has been providing kernel LTO patches, I expect it was only natural to do the PGO work where they saw it could be beneficial, with the compiler they already use for their other activities. Those who want or need to do the same with GCC are free to review and extend the work.
            Last edited by CommunityMember; 14 January 2021, 02:02 PM.



            • #7
              Originally posted by skeevy420
              So now I'll have to build a kernel, run my desktop, run some games, and rebuild my kernel. Sounds like a perfect job for the PTS to help automate.
              It's not something most folks will need or want to do. Good enough is good enough, and it's unlikely to yield significant performance advantages to make it worth it.

              The value is more for large scale environments, HPC clusters, supercomputers, etc. where they run very specific dedicated workloads and every single tiny bit of performance matters, even down to a quarter of a percent improvement that your average end user wouldn't even notice.



              • #8
                Originally posted by kvuj
                Could someone explain to me like I'm five?
                Without profile-guided optimization, the compiler makes a ton of educated guesses about which parts of the code are 'hot' (as in called very often), which branches are taken more often (branch prediction), how many times a loop runs (loop unrolling), etc. When the compiler guesses accurately, something like PGO is unnecessary. However, when we use PGO we often get a 10-20% performance increase, meaning the compiler often guesses inaccurately.

                So how does PGO work? First you compile your program with a special setting that inserts a lot of extra code to gather runtime statistics covering all of the above (hot/cold code, branches, loops). You then run this program in what would be your typical usage pattern (it will be a bit slower at this point given the data gathering). Once you exit, it generates files containing all the runtime data.
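To picture what that "extra code" does, here is a toy sketch in Python rather than compiler output. The `counters` table, the `classify` function and the workload are all invented for illustration; real instrumentation is inserted by the compiler and is far more compact, but the idea of bumping a counter on each path and then turning the counts into branch weights is the same.

```python
from collections import Counter

# Hypothetical stand-in for the counters an instrumented build would insert.
counters = Counter()


def classify(value):
    """Toy function with one branch; each side bumps its own counter."""
    if value < 0:
        counters["negative_path"] += 1
        return "negative"
    counters["non_negative_path"] += 1
    return "non-negative"


def branch_probabilities(counts):
    """Turn raw execution counts into the branch weights a second compile could use."""
    total = sum(counts.values())
    return {name: hits / total for name, hits in counts.items()}


if __name__ == "__main__":
    # "Typical usage": a workload in which negative inputs are rare.
    workload = [5, 3, -1, 8, 2, 9, 7, 4, 6, 1]
    for item in workload:
        classify(item)
    print(branch_probabilities(counters))
```

With this workload the non-negative path dominates, which is exactly the kind of fact the compiler would otherwise have to guess.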

                Now you compile it again using another setting. Instead of guessing, the compiler now looks at the gathered runtime data and actually knows which code is called often, which branches are most likely to be taken, how many times a loop runs, cache information, etc., and chooses how to optimize the code based on this data, again often yielding a 10-20% performance increase over the 'guesswork' the compiler would do otherwise.

                Now, to get the best performance for your usage patterns, the gathering stage should represent those patterns as closely as possible. It can be automated efficiently to cover most usage patterns: software like Firefox automates the PGO gathering stage, and its PGO builds are much faster than non-PGO builds.

                For example, I use PGO for my Blender builds. The difference between the stock distro (Arch) package of Blender and a PGO-built version when rendering on the CPU is ~15%. That's a decent chunk of performance to leave on the table if you do a lot of renders, and of course this impacts other things like simulations, 3D sculpting, etc.



                • #9
                  Originally posted by kvuj

                  Could someone explain to me like I'm five?


                  Overall pretty cool! I hope GCC will support these features soon or that Fedora will consider using Clang to build the kernel (if those perf increases are actually notable).
                  GCC and Clang have supported "Profile Guided Optimisation" (PGO) for quite a while. This is about better shaping the Linux kernel for PGO.

                  Explain it like I'm 5. Hmm. Okay. Here goes, this'll be a little rough but hopefully helps:

                  The headmaster at a school wants to make his school's classrooms better for the kids in them, so that they waste less time on the chores that come before learning, e.g. sharpening pencils or getting supplies.

                  For the purpose of this analogy, there are three ways he can do this:

                  1) Look at each classroom one at a time. Forget all prior information you've already gathered about improving each classroom, and make a series of known good improvements, e.g. re-arrange the chairs to a better layout than having them all over the place, or make sure pencil sharpeners aren't up on the highest shelves. That's standard compilation. You can do a very good job when you look at things this way.

                  2) Gather information about every classroom layout, consider everything. Improve the classrooms, leveraging that bigger picture perspective. Maybe there are some improvements you could make that you wouldn't have seen until you looked at the bigger picture, e.g. you notice that the kids' classrooms aren't near their playground, so you rearrange the classes to use classrooms nearer to their playground. That's LTO, "Link Time Optimisation". LTO gives the compiler a much bigger view of the code being compiled, and provides opportunities for more efficient code.

                  3) Do all of the above, and carry out a time and motion study over a period of time. Pay attention to which activities the kids spend their time on every day across the entire school, and how many times they do each, recording all of this information. Then you go back to the start of the exercise and re-optimise every classroom based on what you now know the students do a lot of. You'll have lots of data to help you see bigger trends, but also small-scale trends: this classroom does painting more often than that one, so make sure its painting supplies are more conveniently located and more regularly refreshed, and maybe even keep the general painting supplies nearer to that classroom. That optimisation would slow things down for other classrooms when they need to refresh their paint supplies, but if they do so less often than the painting-heavy classroom, it's an overall win. That's "Profile Guided Optimisation", or PGO.



                  What they're trying to do with the kernel is best enable the latter and improve profile gathering for the code. The hope is to produce kernels that are even more efficient for their purpose. It's not something your average desktop user is going to care about, but for someone running high-performance, long-running code, like the major supercomputers do, every little bit helps. There are a bunch of optimisations that only make sense to apply when you can see how everything actually gets used. For example, you can spot hot code paths and make sure they're stored adjacent to each other in the final compiled binary, so that a hot function is more likely to be cached, or read into memory, at the same time as the code that calls it.
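The "keep hot code adjacent" idea can be sketched roughly as follows. This is only a toy: the function names and call counts are invented, and a real compiler or linker works from call-graph edges rather than a flat table of counts. It just shows how a profile turns "which functions matter" from a guess into a sort order.

```python
def layout_by_heat(call_counts, hot_threshold):
    """Order functions so that hot ones sit together at the front, hottest first.

    call_counts maps a function name to how often the profile saw it called;
    anything called fewer than hot_threshold times is treated as cold.
    """
    hot = sorted(
        (name for name, count in call_counts.items() if count >= hot_threshold),
        key=lambda name: call_counts[name],
        reverse=True,
    )
    # Cold functions go at the end; their relative order does not matter much,
    # so they are simply sorted by name to keep the output stable.
    cold = sorted(name for name, count in call_counts.items() if count < hot_threshold)
    return hot + cold


if __name__ == "__main__":
    # Hypothetical profile: names and numbers are made up for illustration.
    profile = {
        "handle_syscall": 90000,
        "log_error": 3,
        "schedule": 75000,
        "print_banner": 1,
        "copy_buffer": 40000,
    }
    for position, name in enumerate(layout_by_heat(profile, hot_threshold=1000)):
        print(position, name)
```

Without the profile, there would be no principled way to tell that `print_banner` belongs at the far end rather than next to `schedule`.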



                  • #10
                    I don't know if this project is using it, but there's another option besides compiling the program with profiling support included. You can use "perf" instead. The CPU profiling counters can be recorded and used to generate a profile.

                    It may not be as accurate as an instruction-precise instrumented profile recording, but it has almost no overhead and can be run on any program or the kernel itself.
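As a rough illustration of that trade-off (not of perf itself), the sketch below compares an exact count over an execution trace with a "sampled" view that only looks at every seventh entry. The trace, the function names and the sampling period are all invented, and a real sampler is driven by hardware counters or timers rather than by slicing a list.

```python
from collections import Counter


def exact_profile(trace):
    """Instrumentation-style profile: every executed entry is counted."""
    return Counter(trace)


def sampled_profile(trace, period):
    """Sampling-style profile: only every period-th entry is observed."""
    return Counter(trace[::period])


def share(profile, name):
    """Fraction of the observed entries attributed to one function."""
    return profile[name] / sum(profile.values())


if __name__ == "__main__":
    # Hypothetical trace: 97% of the time in a hot loop, 3% in a rare handler.
    trace = (["hot_loop"] * 97 + ["rare_handler"] * 3) * 100
    exact = exact_profile(trace)
    sampled = sampled_profile(trace, period=7)
    print("entries looked at:", sum(exact.values()), "vs", sum(sampled.values()))
    print("rare_handler share:", share(exact, "rare_handler"), "vs", share(sampled, "rare_handler"))
```

The sampled profile looks at roughly a seventh of the entries and still lands close to the true proportions, which is why it is cheap, but the individual counts are estimates rather than exact figures.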

