Squeezing More Performance Out Of The Linux Kernel With Clang + LTO


  • #21
    Was that thin LTO?

    Comment


    • #22
      Michael, it's either "Clang/GCC build" or "Built with Clang/GCC" :-)

      I'm not a fan of the results. It's not that I don't like your work; I don't like that the kernel has this much untapped performance sitting there purely due to the choice of compiler and its flags. I expected at most a ~1% performance boost, not ~8%.

      Someone has to show the result to Linus.

      Comment


      • #23
        Hey Michael, since we're talking about squeezing performance out of the kernel, I ran some benchmarks today testing the CacULE scheduler vs CFS, if you or anyone else is interested.

        https://openbenchmarking.org/result/...IB-CFSVSCACU49

        Winner: CacULE enabled, 1000HZ, full tickless, low-latency PREEMPT
        Loser: the real-time (-rt patches) kernel
        Runner-up loser: stock Ubuntu kernel
        Runner-up winner: perpetually high's custom kernel with CFS

        Quick backstory if anyone is curious:

        A few days ago a Phoronix user let me know he was using CacULE on his mobile Haswell with good results, so it got me intrigued.

        I was originally using the alpha version of the CacULE scheduler, called cachy, which was not based on CFS. That scheduler was scrapped, and the CacULE creator (Hamad Al Marri) looked at the FreeBSD scheduler, took similar ideas, and applied them to Ingo Molnar's awesome CFS scheduler.

        So CacULE is now just base CFS + optimizations/shortcuts/etc., for people looking to really squeeze the most out of their kernel.

        He explains it nicely here: https://github.com/hamadmarri/cacule...discussions/37

        For those who like to tinker or are curious, I encourage you to give it a shot, if not through your own custom-compiled kernel, then at least through XanMod. It's noticeably fast. (Yes, I know I keep saying that, but it's true.)

        The test results were all taken after a fresh reboot, and I verified through htop that nothing was running in the background. The only difference between each test is the kernel, nothing else.

        Last thing: the full tickless builds are nice because they allow adding "isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3" to GRUB to isolate certain CPU cores (here it's cores 1-3 on a 4-core CPU, so 0 is the first core and 1-3 are the 2nd, 3rd and 4th) if you need to keep RCU callbacks or the scheduler tick away from a certain task. Instead of 1-3, you can also just put "3" and then use taskset -c 3 ./your_program, and the program will run on the last CPU core, with 0 ticks, unbothered by anything else. Very cool stuff.
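
        To make that concrete, here's a rough sketch of the whole workflow (the GRUB file location assumes a Debian/Ubuntu-style setup; "your_program" is just the placeholder from above):

            # 1. Append the isolation parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
            #    GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3"
            sudo update-grub        # or grub2-mkconfig -o /boot/grub2/grub.cfg on non-Debian distros
            sudo reboot

            # 2. After the reboot, pin the task to the isolated core 3:
            taskset -c 3 ./your_program

            # 3. Verify the affinity of the running process:
            taskset -cp "$(pidof your_program)"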

        Comment


        • #24
          Originally posted by sdack View Post
          Perhaps it is too early, but what most people thought could happen one day might be just around the corner now. Clang might finally surpass GCC as the leading compiler.
          "It is hard to make predictions, especially about the future"

          Historically GCC and LLVM have been one-upping each other for quite some time, with specific use cases showing one (or the other) is better for that specific use, and the winner perhaps changing at the next release (of either compiler). When a good idea is added to one compiler which results in much better performance, the other compiler developers will often consider how to accomplish the equivalent (or better).

          For the hyperscalers and the HPC crowd, even a 1% improvement for their specific use cases might save many millions of dollars in infrastructure costs, so it is worth pursuing the grail, and many have put major resources into LLVM as well as into the Linux kernel and other apps.

          (Thin) LTO and PGO have certainly shown improvements in the kernel for some specific use cases. Those who want to see GCC remain a long-term competitor in the kernel space will need to contribute and make equivalent compiler optimizations available there (out of tree does not count). Large organizations and customers drive how vendors spend their resources, and if the kernel is measurably and consistently faster with LLVM, vendors are going to start delivering such kernels to their customers (and once the avalanche has started, it will be too late for the pebbles to vote).
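
          For anyone wanting to reproduce that kind of build, a minimal sketch of a Clang + ThinLTO kernel build (assuming a recent LLVM toolchain and a kernel tree that has CONFIG_LTO_CLANG_THIN; exact option names can vary between versions):

              # Build the whole tree with LLVM/Clang instead of GCC/binutils
              make LLVM=1 defconfig
              # Switch the LTO choice from LTO_NONE to ThinLTO
              # (LTO_CLANG_FULL would select full LTO instead)
              ./scripts/config -d LTO_NONE -e LTO_CLANG_THIN
              make LLVM=1 olddefconfig
              make LLVM=1 -j"$(nproc)"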

          Comment


          • #25
            So what Honza Hubicka was saying about GCC LTO on the Linux kernel seems to be confirmed here for LLVM LTO. LLVM LTO still brings the CFI benefit, at least until a mixed Rust/C kernel nullifies it.

            Comment


            • #26
              In search of something funky, I think it was interesting that LTO lost in both compression tests (zstd and lz4). Of course, not all optimizations are positive in all cases, but LTO… how can that go wrong?

              It could be a statistical fluke, and the differences were small, but the two graphs look identical between the tests.

              Comment


              • #27
                Originally posted by andreano View Post
                In search of something funky, I think it was interesting that LTO lost in both compression tests (zstd and lz4). Of course, not all optimizations are positive in all cases, but LTO… how can that go wrong?

                It could be a statistical fluke, and the differences were small, but the two graphs look identical between the tests.
                I suppose it can affect inlining decisions pretty heavily, and that can have secondary effects elsewhere: either by causing certain inner loops to exceed CPU cache sizes in the extreme case, or, more likely, just by reorganizing how code and data end up laid out in memory, causing things like the CPU's memory prefetching to behave slightly differently (and other such minor, semi-random issues).
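
                A toy illustration of the inlining part, nothing to do with the kernel itself (the file names are made up): without LTO the compiler cannot inline a function defined in another translation unit, while with -flto the link step can, which in turn shuffles code layout around.

                    # helper.c:  int helper(int x) { return 3 * x + 1; }
                    # main.c:    int helper(int x);  int main(void) { return helper(14); }

                    # Plain -O2: the call to helper() in main() stays an out-of-line call
                    clang -O2 -c main.c helper.c
                    clang -O2 main.o helper.o -o no_lto

                    # With -flto the objects carry IR and the link step can inline across files
                    # (depending on the system linker you may need -fuse-ld=lld for the LTO link)
                    clang -O2 -flto -c main.c helper.c
                    clang -O2 -flto main.o helper.o -o with_lto

                    # Compare main() in both binaries; with LTO, helper() is usually folded in
                    objdump -d no_lto   | grep -A8 '<main>:'
                    objdump -d with_lto | grep -A8 '<main>:'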

                Comment


                • #28
                  Originally posted by avem View Post
                  Michael, it's either "Clang/GCC build" or "Built with Clang/GCC" :-)

                  I'm not a fan of the results. It's not that I don't like your work; I don't like that the kernel has this much untapped performance sitting there purely due to the choice of compiler and its flags. I expected at most a ~1% performance boost, not ~8%.

                  Someone has to show the result to Linus.
                  That boost is not necessarily "free". One argument against using LTO is that it can make backtraces unintelligible.

                  Comment


                  • #29
                    Originally posted by avem View Post
                    Michael, it's either "Clang/GCC build" or "Built with Clang/GCC" :-)

                    I'm not a fan of the results. It's not that I don't like your work; I don't like that the kernel has this much untapped performance sitting there purely due to the choice of compiler and its flags. I expected at most a ~1% performance boost, not ~8%.

                    Someone has to show the result to Linus.
                    Since when do you only get 1% when something is auto-vectorized!? Since when do you get only 1% when loops get unrolled and jammed, then inlined? He did mention the chips he was running on, and we can deduce the cache sizes that come with them.

                    Basically, if you have an i7 from the Sandy Bridge era up to now, you've got enough cache for optimised code.
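
                    If you would rather see what the compiler actually did than guess, clang's optimization remarks are handy (hot_path.c here is just an example file name):

                        # Print a remark for every loop clang vectorizes and every call it inlines
                        clang -O2 -Rpass=loop-vectorize -Rpass=inline -c hot_path.c
                        # Or dump all remarks and grep for the interesting ones (they go to stderr)
                        clang -O2 "-Rpass=.*" -c hot_path.c 2>&1 | grep -i unroll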

                    Comment


                    • #30
                      Originally posted by carewolf View Post

                      And overall those changes were statistically insignificant before being cherry-picked...
                      That's a valid point. I'm fine with showing cherry-picked benchmarks, but you should also display the overall average performance to give context. Was the average performance Michael displayed computed from the whole set or from the cherry-picked one?
                      ## VGA ##
                      AMD: X1950XTX, HD3870, HD5870
                      Intel: GMA45, HD3000 (Core i5 2500K)

                      Comment
