Squeezing More Performance Out Of The Linux Kernel With Clang + LTO


  • #31
    Originally posted by perpetually high View Post
    Hey Michael, since we're talking about squeezing performance out of the kernel, I ran some benchmarks today testing the CacULE scheduler vs CFS, if you or anyone else is interested.

    https://openbenchmarking.org/result/...IB-CFSVSCACU49

    Winner: CacULE enabled, 1000HZ, full tickless, low-latency PREEMPT
    Loser: the real-time (-rt patches) kernel
    Runner-up loser: stock Ubuntu kernel
    Runner-up winner: perpetually high's custom kernel with CFS

    Quick backstory if anyone is curious:

    A few days ago a Phoronix user let me know he was using CacULE on his mobile Haswell with good results, so it got me intrigued.

    I was originally using the alpha version of the CacULE scheduler, called cachy, which was not based on CFS. That scheduler was scrapped, and the CacULE creator (Hamad Al Marri) looked at the FreeBSD scheduler, took similar ideas, and applied them to Ingo Molnar's awesome CFS scheduler.

    So CacULE is now just base CFS + optimizations/shortcuts/etc., for people looking to really squeeze the most out of their kernel.

    He explains it nicely here: https://github.com/hamadmarri/cacule...discussions/37

    For those that like to tinker or are curious, I encourage you to give it a shot, if not through your own custom compiled kernel, at least through xanmod. It's noticeably fast. (yes, I know I keep saying that but it's true).

    The test results were all after a fresh reboot, then verified through htop that nothing was running in the background. Only difference between each test is just the kernel, nothing else.

    Last thing: full tickless is nice because it allows adding "isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3" to the kernel command line in GRUB to isolate certain CPU cores (here cores 1-3 on a 4-core machine; 0 is the first core, and 1-3 are the 2nd, 3rd, and 4th cores) if you need to keep RCU callbacks or the scheduler tick away from a certain task. Instead of 1-3, you can also just put "3" and then use taskset -c 3 ./your_program, and the program will run on the last CPU core, with 0 ticks, undisturbed by anything else. Very cool stuff.
    CacULE is indeed very nice; mostly it works flawlessly. But I have encountered one or two games where it had hiccups. Detroit: Become Human (Proton) is one of them: cutscenes have some small stutters.
    Apart from these few negative examples, it is really a great scheduler.

    By the way, I can recommend it as built into the XanMod kernels; there is a CacULE branch: https://xanmod.org/
    It also works fine as an LLVM LTO build. There is one experimental branch that the maintainers say is built with ThinLTO. I have successfully built it myself with full LTO; maybe it breaks on some platforms, I don't know.

    I have used it on Zen 2, Skylake, and Haswell, and it works nicely.
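
The core pinning described in the quoted post can also be done from inside a program. The sketch below is only an illustration and does not come from the posts: `parse_cpu_list` and `pin_to_cpus` are hypothetical helper names, and it assumes Linux, where Python exposes `os.sched_setaffinity`. It parses a kernel-style CPU list such as "1-3" and pins the current process to whichever of those CPUs it is allowed to use, which is roughly what `taskset -c` does.

```python
import os


def parse_cpu_list(spec: str) -> set[int]:
    """Parse a kernel-style CPU list such as "3" or "0,2-3" into a set of ints."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_cpus(spec: str) -> set[int]:
    """Pin the calling process to the CPUs in `spec` that it may run on.

    Returns the affinity set actually applied (empty if nothing was changed).
    """
    if not hasattr(os, "sched_setaffinity"):  # only available on Linux
        return set()
    wanted = parse_cpu_list(spec) & os.sched_getaffinity(0)
    if not wanted:
        return set()
    os.sched_setaffinity(0, wanted)
    return wanted
```

Calling `pin_to_cpus("3")` early in a program would restrict it to CPU 3 if that CPU is available. Note that affinity only controls where this process runs; keeping other tasks, RCU callbacks, and the tick off that core still depends on the boot parameters mentioned above.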
    Last edited by CochainComplex; 22 July 2021, 06:43 AM.

    Comment


    • #32
      Originally posted by avem View Post
      Michael, it's either "Clang/GCC build" or "Built with Clang/GCC" :-)
      Native English speaker here. "clang built" is acceptable, along with "a clang build". For example, you can say, "Apple built hardware is superior to Microsoft." Google "Ford built trucks" for more examples.



      Comment


      • #33
        Do the two geomean results include the suspect stress-ng result?

        If so, I suspect the geomean is significantly impacted by that one result. After all, differences were generally small, so just one outlier could easily distort the results.

        I'm quite curious what's going on with that stress-ng result, because it strikes me as being next to impossible for it to be measuring anything meaningful: context switches are, after all, dominated by various hardware cache flushes, especially post-Spectre. A compiler making even one iota of a difference here would be truly remarkable, let alone such a huge difference.

        I suspect stress-ng is somehow suffering from some kind of dead-code elimination or similar benchmark-only issue; or perhaps there's some kind of bug (e.g. the LLVM build doesn't yet support some of those Spectre mitigations?). Whatever the case, it seems rather implausible for it to be real and justified.
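
To illustrate how much a single outlier can move a geometric mean, here is a minimal sketch. The ratios are made up for the example and are not taken from the article's results; the only library call used is `statistics.geometric_mean` from the Python standard library (3.8+).

```python
from statistics import geometric_mean

# Hypothetical Clang/GCC performance ratios (1.00 = no difference).
# These numbers are invented for illustration; they are NOT from the article.
typical = [1.01, 0.99, 1.02, 1.00, 0.98]
outlier = 2.00  # stand-in for a single suspicious benchmark result

without_outlier = geometric_mean(typical)
with_outlier = geometric_mean(typical + [outlier])

print(f"geomean without outlier: {without_outlier:.3f}")
print(f"geomean with outlier:    {with_outlier:.3f}")
```

With these invented inputs, five results within 2% of parity give a geomean of essentially 1.00, while adding one 2x result lifts it to about 1.12, so reporting the geomean both with and without the suspect test would show how much it contributes.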

        Comment


        • #34
          Originally posted by DanglingPointer View Post
          LLVM Clang and LLD have matured nicely with the kernel. But what has not is the bloody DKMS! If you now try to run VirtualBox guests on a host kernel built using LLVM Clang and LLD with -flto=thin, then DKMS modules don't build correctly, or at all, without jumping through hoops!

          This needs to get fixed otherwise it will continue to impede adoption of this rapidly evolving build tool!
          Those "hoops" were rather expected and straightforward.
          Out of the 6 DKMS modules I use on my Haswell + Kepler laptop, 4 built just fine after adding "export LLVM=1" to /etc/dkms/framework.conf. ntfs3 used two warning options not yet supported by Clang (I didn't bother to look for equivalent flags), so after removing those from the Makefile it started working too. The NVIDIA modules also required "export CC=clang", likely due to their conftest script. (I tested CUDA in Blender and it surprisingly worked; I didn't try other stuff like DRM, Xorg, GL or Vulkan.)

          But it would indeed be nice if upstream/distros shipped a dkms that checks the kernel config to automatically use the correct toolchain.
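
A check like the one wished for above could look roughly like this. It is only a sketch, not how DKMS works today: `toolchain_env` is a hypothetical helper, and it assumes the kernel's .config text is available and contains the `CONFIG_CC_IS_CLANG` symbol that Kconfig records for Clang-built kernels.

```python
def toolchain_env(kernel_config: str) -> dict[str, str]:
    """Suggest environment variables for building out-of-tree modules.

    `kernel_config` is the text of the kernel's .config. If the kernel was
    built with Clang, return the two exports mentioned in the post above;
    otherwise return an empty dict and leave the default toolchain alone.
    """
    enabled = set()
    for line in kernel_config.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)

    if "CONFIG_CC_IS_CLANG" in enabled:
        # LLVM=1 selects the LLVM tools in kbuild; CC=clang is added because
        # some module build scripts look at CC directly (an assumption here).
        return {"LLVM": "1", "CC": "clang"}
    return {}
```

Where the config file lives varies by distro, so the caller would have to locate it; the function itself only inspects the text it is given.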

          Comment


          • #35
            If you have 20 DKMS modules with 20 idiosyncrasies, then it is "hoops"!

            Clang/LLVM built kernels are arguably even more beneficial in such setups where there are several to many demanding DKMS modules requiring performance!

            Unless DKMS gets fixed for all distros through some common-denominator solution, users will be slower to adopt Clang/LLVM built kernels, except where the distro itself ships with the LLVM toolchain as the default build toolchain instead of GCC (e.g. OpenMandriva)!

            All distros pull from the original DKMS project from Dell, so I believe Dell should fix it with a common-denominator solution so all distros are fixed by default! It's in their interest! All their servers, workstations, and laptops sold with Linux would get a free performance boost from the LLVM toolchain with LTO.

            Comment


            • #36
              I've benchmarked various versions of Linux 5.12 and 5.13 with gcc-10, gcc-11, clang-11 and clang-12 using all the "stable" stress-ng microbenchmarks, and the biggest win is in the af-alg microbenchmark. This shows that AF_ALG socket based hashing and crypto can be improved ~2x with Clang. All other microbenchmarks showed about +/- 1% performance change, so the overall win in general use cases is marginal and lower than the standard-deviation jitter in the tests.
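
As a rough illustration of comparing a change against run-to-run jitter, here is a minimal sketch. The run figures are made up and `change_vs_jitter` is a hypothetical helper; it is a crude heuristic based on the sample standard deviation, not a proper significance test.

```python
from statistics import mean, stdev


def change_vs_jitter(baseline: list[float], candidate: list[float]) -> tuple[float, float]:
    """Return (relative change in %, relative run-to-run jitter in %)."""
    base_mean = mean(baseline)
    cand_mean = mean(candidate)
    change = (cand_mean - base_mean) / base_mean * 100.0
    # Use the noisier of the two series as the jitter estimate.
    jitter = max(stdev(baseline) / base_mean, stdev(candidate) / cand_mean) * 100.0
    return change, jitter


# Hypothetical ops/sec from repeated runs; invented numbers, not measured data.
gcc_runs = [1000.0, 1030.0, 970.0, 1010.0, 990.0]
clang_runs = [1010.0, 1040.0, 980.0, 1020.0, 1000.0]

change, jitter = change_vs_jitter(gcc_runs, clang_runs)
print(f"change: {change:+.1f}%  jitter: {jitter:.1f}%")
print("within noise" if abs(change) <= jitter else "likely a real difference")
```

With these invented runs the mean improves by 1% while the jitter is a little over 2%, so the difference would be reported as within noise.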

              Comment


              • #37
                I like how Michael literally shows the results but people still complain.

                Comment


                • #38
                  Originally posted by labyrinth153 View Post
                  I like how Michael literally shows the results but people still complain.
                  I like to remind myself that it is the age of social media, where it is accepted to be open, sharing and outspoken about one's ignorance, and where everything is a fraud, a lie or a conspiracy unless it supports one's ignorance.

                  Comment


                  • #39
                    Originally posted by andyprough View Post

                    But if GCC and Clang are crossing the street together, hand-in-hand, and both are hit by the same bus at the same time, then there's a big problem. Which is why we need to re-write the kernel in rust as quickly as possible.
                    The same bus will have travelled across the country by the time the kernel compilation in Rust is finished.

                    Comment


                    • #40
                      Originally posted by slacka View Post

                      Native English speaker here. "clang built" is acceptable, along with "a clang build". For example, you can say, "Apple built hardware is superior to Microsoft." Google "Ford built trucks" for more examples.
                      "Apple built" or "Ford built" on their own, again, hardly work. You tried to counter me, but you offered full sentences/expressions, which sound just fine.

                      "Clang/GCC built application" also sounds just fine. If we remove "application", it no longer sounds like English, at least to me.

                      Comment
