Was that ThinLTO?
Squeezing More Performance Out Of The Linux Kernel With Clang + LTO
-
Michael, it's either "Clang/GCC build" or "Built with Clang/GCC" :-)
I'm not a fan of the results. It's not that I don't like your work; I don't like that the kernel has so much untapped performance that depends only on the choice of compiler and its flags. I expected at most a ~1% performance boost, not ~8%.
Someone has to show the result to Linus.
Comment
-
Hey Michael, since we're talking about squeezing performance out of the kernel, I ran some benchmarks today testing the CacULE scheduler vs. CFS, if you or anyone else is interested.
https://openbenchmarking.org/result/...IB-CFSVSCACU49
Winner: CacULE enabled, 1000HZ, full tickless, low-latency PREEMPT
Loser: the real-time (-rt patches) kernel
Runner-up loser: stock Ubuntu kernel
Runner-up winner: perpetually high's custom kernel with CFS
Quick backstory if anyone is curious:
A few days ago a Phoronix user let me know he was using CacULE on his mobile Haswell with good results, so it got me intrigued.
I was originally using the alpha version of the CacULE scheduler, called cachy, which was not based on CFS. That scheduler was scrapped, and the CacULE creator (Hamad Al Marri) looked at the FreeBSD scheduler, took similar ideas, and applied them to Ingo Molnar's awesome CFS scheduler.
So CacULE is now just base CFS + optimizations/shortcuts/etc., for people looking to really squeeze the most out of their kernel.
He explains it nicely here: https://github.com/hamadmarri/cacule...discussions/37
For those that like to tinker or are curious, I encourage you to give it a shot, if not through your own custom compiled kernel, at least through xanmod. It's noticeably fast. (yes, I know I keep saying that but it's true).
The test results were all taken after a fresh reboot, with htop confirming that nothing was running in the background. The only difference between the tests is the kernel, nothing else.
Last thing: full tickless kernels are nice because they allow adding "isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3" to the kernel command line in GRUB to isolate certain CPU cores (here it's cores 1-3 on a 4-core machine, so 0 is the first core and 1-3 are the 2nd, 3rd, and 4th cores) if you need to keep RCU callbacks or the timer tick away from a certain task. Instead of 1-3, you can also just put "3" and then run taskset -c 3 ./your_program, and the program will run on the last CPU core, with the tick out of the way, unbothered by anything else. Very cool stuff.
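For anyone who wants to script this, here is a rough Python sketch of the same idea. It only handles the simple "1-3" / "0,2-3" style of CPU list used above (the kernel accepts more elaborate forms that this does not parse), and the function names parse_cpu_list, isolation_cmdline, and pin_current_process are made up for illustration. The pinning helper relies on os.sched_setaffinity, which is Linux-only, and is meant as an approximation of what taskset -c does for the current process, not a replacement for it.

```python
import os


def parse_cpu_list(spec):
    """Parse a simple CPU list such as "1-3" or "0,2-3" into a set of ints.

    Assumption: only comma-separated numbers and plain ranges are handled.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError("bad range: " + part)
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolation_cmdline(spec):
    """Build the three kernel parameters from the post for one CPU list."""
    return "isolcpus={0} nohz_full={0} rcu_nocbs={0}".format(spec)


def pin_current_process(spec):
    """Restrict the calling process to the CPUs in spec (Linux only)."""
    os.sched_setaffinity(0, parse_cpu_list(spec))


if __name__ == "__main__":
    # Print the parameters to append to the kernel command line.
    print(isolation_cmdline("1-3"))
```

Calling pin_current_process("3") before starting the real work would keep the process on core 3, provided that core exists on the machine; the parameters still have to be added to the kernel command line and the machine rebooted for the isolation itself to take effect.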
- Likes 3
Comment
-
Originally posted by sdack:
Perhaps it is too early, but what most people thought could happen one day might be just around the corner now. Clang might finally surpass GCC as leading compiler.
Historically GCC and LLVM have been one-upping each other for quite some time, with specific use cases showing one (or the other) is better for that specific use, and the winner perhaps changing at the next release (of either compiler). When a good idea is added to one compiler which results in much better performance, the other compiler developers will often consider how to accomplish the equivalent (or better).
For the hyperscalers and the HPC crowd, even a 1% improvement for their specific use cases might save them many millions of dollars in infrastructure costs, so it is worth pursuing the grail, and many have put major resources into LLVM as well as into the Linux kernel and other apps.
(Thin) LTO and PGO have certainly shown improvements in the kernel for some specific use cases. Those who want to see GCC remain a long-term competitor in the kernel space will need to contribute and make equivalent compiler optimizations available there (out of tree does not count). Large organizations and customers drive how vendors spend their resources, and if the kernel is measurably and consistently faster with LLVM, the vendors are going to start delivering such kernels to their customers (and once the avalanche has started, it will be too late for the pebbles to vote).
- Likes 4
Comment
-
So what Honza Hubicka was saying about GCC LTO on the Linux kernel seems to be confirmed here for LLVM LTO. LLVM LTO still brings the CFI benefit, until a mixed Rust/C kernel nullifies it.
- Likes 1
Comment
-
In search of something funky, I think it was interesting that LTO lost in both compression tests (zstd and lz4). Of course, not all optimizations are positive in all cases, but LTO… how can that go wrong?
It could be a statistical fluke, and the differences were small, but the two graphs look identical between the tests.
Comment
-
Originally posted by avem:
Michael, it's either "Clang/GCC build" or "Built with Clang/GCC" :-)
I'm not a fan of the results. It's not that I don't like your work; I don't like that the kernel has so much untapped performance that depends only on the choice of compiler and its flags. I expected at most a ~1% performance boost, not ~8%.
Someone has to show the result to Linus.
Basically, if you have an i7 from the Sandy Bridge era or newer, then you've got enough cache for optimised code.
Comment
-
Originally posted by carewolf:
And overall those changes were statistically insignificant before being cherry-picked...
Comment