Benchmarking The Linux Kernel With An "-O3" Optimized Build

  • #61
    Originally posted by stormcrow View Post

    It's dangerous to accept benchmarks at face value when there's a drastic change in performance. As another commenter mentioned, -O3 could be optimizing away certain Spectre mitigations in the kernel in undesirable ways. Unless you know for sure what -O3 actually does on the processor at execution time (and from your post I'd guess you don't, and probably don't know how to analyze machine code either - I don't, but I do know that what you write in code isn't always what the compiler and linker tell the system to do), it should be assumed that something broke, and there should be an investigation of why gcc -O3 is doing what it's doing.

    This is one reason we could use a comprehensive test suite for the Linux kernel that runs through known exploit and bug conditions.
    I would actually expect the opposite.

    Part of aggressive optimization is making code less reliant on branching and on the memory layout always being the same (loop unrolling, for example), which actually makes exploitation harder.

    So I wouldn't be afraid on that account.

    What I would potentially worry about is that there are tons of security engineers testing the security of the kernel on the default O2 configuration, but very few at O3. So even if O3 is, on average, more likely to remove security vulnerabilities than to introduce them, someone has probably already taken care of many O2-specific vulnerabilities, but not of the O3-specific ones.
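    As a toy sketch of the kind of transformation I mean (hand-unrolled here for illustration; whether GCC actually unrolls any particular loop at -O3 depends on its heuristics and flags), fewer loop iterations means fewer conditional branches executed per element:

    Code:
    /* Made-up example: a plain sum vs. a hand-unrolled sum, the kind of
     * shape an aggressive optimizer can produce. */
    #include <stddef.h>

    long sum_simple(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)      /* one loop branch per element */
            s += a[i];
        return s;
    }

    long sum_unrolled4(const long *a, size_t n)
    {
        long s = 0;
        size_t i = 0;

        for (; i + 4 <= n; i += 4)          /* one loop branch per four elements */
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)                  /* leftover tail */
            s += a[i];
        return s;
    }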



    • #62
      Originally posted by discordian View Post

      Actually the opposite: you are bloating up the code, fighting for more cache, and then hopefully getting the improvement back from big, fat, parallel OOO execution.
      Smaller CPUs lack that, and are quite often slower with -O3.

      What happened to the LTO effort btw, that certainly does slim down the kernel and should rarely cause issues.
      Hmmmm, except for dead code removal, you have the same kind of double-edged sword with LTO: inlining may make functions bigger and thus use more cache. This is especially the case if you call two functions that depend on the same leaf function in quick succession and that leaf gets inlined into both (so it is no longer shared).
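      A contrived sketch of that situation (function names made up): if leaf() gets inlined into both callers, its body is duplicated, so the two back-to-back calls no longer share one cached copy of the code:

      Code:
      /* Contrived example: with cross-TU inlining (e.g. under LTO), the body of
       * leaf() may be copied into both callers, growing the text size and the
       * i-cache footprint instead of both callers jumping to one shared copy. */
      static int leaf(int x)
      {
          /* imagine a moderately large body here */
          return x * x + 42;
      }

      static int caller_a(int x) { return leaf(x) + 1; }
      static int caller_b(int x) { return leaf(x) - 1; }

      int hot_path(int x)
      {
          /* two calls in quick succession: a shared leaf() stays warm in cache,
           * whereas two inlined copies of it compete for cache instead */
          return caller_a(x) + caller_b(x);
      }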

      Originally posted by hamishmb View Post
      Oh, the other system without AES instructions was a Core 2 Duo (T7300 possibly). It was decent as well, good enough to daily drive.

      I'm kind of tempted to try some benchmarks with this on a spare Pi 1 model B I have to test the impact on the lowest of the low end (albeit not x86). Would anyone be interested in this?
      It would be fun to see.



      • #63
        So I did some additional benchmarking using various optimizations. I tested zram performance and wireguard throughput. For all tests I did a few runs and used the best three to calculate an average and then the performance difference. Stock "O2" is the baseline.

        Here are the results:

        Code:
        # zramctl -a zstd -s 8G -f -t 1
        # time < /dev/null > /dev/zram3, in s (lower is better)
        # Col 1: -O3+graphite | Col 2: stock -O2 | Col 3: -O3 | Col 4: -Ofast (full commands below)
        3.292   3.469   3.319   3.419
        3.315   3.429   3.318   3.416
        3.304   3.459   3.327   3.359
        -----   -----   -----   -----
        3.304   3.452   3.321   3.398
        -4.3%      0%   -3.8%   -1.5%
        
        # wireguard, same machine
        # iperf3, in GB/s (higher is better)
        5.72    5.86    6.11    6.13
        5.84    5.96    6.18    6.22
        5.83    5.91    6.08    6.23
        ----    ----    ----    ----
        5.80    5.91    6.12    6.19
        -1.9%      0%   +3.5%   +4.7%
        
        Column 1: make -j8 AR=gcc-ar NM=gcc-nm KCFLAGS="-march=native -O3 -falign-functions=64 -fipa-pta -fno-semantic-interposition -fgraphite-identity -fdevirtualize-at-ltrans -floop-nest-optimize -flto=8 -ffat-lto-objects -fuse-linker-plugin -Wl,-O1 -Wl,-as-needed" DISABLE_LTO=-fno-lto KLDFLAGS+='-Wl,-O1 -Wl,--as-needed $(KCFLAGS)'
        Column 2: make -j8
        Column 3: make -j8 AR=gcc-ar NM=gcc-nm KCFLAGS="-march=native -O3 -falign-functions=64 -fipa-pta -fno-semantic-interposition -fdevirtualize-at-ltrans -floop-nest-optimize -flto=8 -ffat-lto-objects -fuse-linker-plugin -Wl,-O1 -Wl,-as-needed" DISABLE_LTO=-fno-lto KLDFLAGS+='-Wl,-O1 -Wl,--as-needed $(KCFLAGS)'
        Column 4: make -j8 AR=gcc-ar NM=gcc-nm KCFLAGS="-march=native -Ofast -falign-functions=64 -fipa-pta -fno-semantic-interposition -fdevirtualize-at-ltrans -floop-nest-optimize -flto=8 -ffat-lto-objects -fuse-linker-plugin -Wl,-O1 -Wl,-as-needed" DISABLE_LTO=-fno-lto KLDFLAGS+='-Wl,-O1 -Wl,--as-needed $(KCFLAGS)'
        The other crap apart from -Ox is from GentooLTO. Remembering that Graphite optimizations can be very hit and miss, I removed that flag for a third round. Then, just because I'm insane and was at it anyway, I tried -Ofast as well. Overall the third set of flags was the fastest (and also the closest to "just -O3"). I don't know how much LTO weighs into my results, so YMMV. Are those ~4% (for some tasks) worth it? You decide.
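        For reference, a small standalone sketch of that arithmetic (average of the three kept runs, then the delta against the stock baseline), hard-coding the zram numbers from columns 1 and 2 above:

        Code:
        /* Averaging the three kept runs and computing the delta vs. baseline. */
        #include <stdio.h>

        static double avg3(double a, double b, double c)
        {
            return (a + b + c) / 3.0;
        }

        int main(void)
        {
            double o3    = avg3(3.292, 3.315, 3.304); /* column 1 (-O3) runs     */
            double stock = avg3(3.469, 3.429, 3.459); /* column 2 (stock) runs   */
            double delta = (o3 - stock) / stock * 100.0;

            printf("O3: %.3f s, stock: %.3f s, delta: %.1f%%\n", o3, stock, delta);
            return 0; /* prints roughly -4.3% */
        }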

        EDIT:
        Compiler: gcc version 12.1.1 20220625 (Gentoo 12.1.1_p20220625 p8)
        CPU: i5 8300H

        EDIT2:
        Fixed math
        Kernel: 5.18.1
        Last edited by binarybanana; 30 June 2022, 02:28 PM.



        • #64
          Originally posted by F.Ultra View Post

          Basically in line with how I predicted it would be back in that other thread: applications that spend lots of time in the kernel, like PostgreSQL, get quite a big benefit, vs. designs that spend most of their time in user space, like RocksDB and Redis.
          If PostgreSQL uses io_uring, then perhaps that benefit might also be reduced.
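          The idea being that io_uring lets an application queue many operations and submit them with a single syscall, so far less time per request is spent crossing into (and running inside) the kernel. A minimal liburing sketch of batched reads (fd, BLOCKS and BLKSZ are just placeholders, error handling omitted):

          Code:
          /* Sketch: queue BLOCKS reads, submit them all with one syscall. */
          #include <liburing.h>

          #define BLOCKS 8
          #define BLKSZ  4096

          int read_batched(int fd)
          {
              static char buf[BLOCKS][BLKSZ];
              struct io_uring ring;

              if (io_uring_queue_init(BLOCKS, &ring, 0) < 0)
                  return -1;

              for (int i = 0; i < BLOCKS; i++) {
                  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                  io_uring_prep_read(sqe, fd, buf[i], BLKSZ, (__u64)i * BLKSZ);
              }
              io_uring_submit(&ring);             /* one syscall for all BLOCKS reads */

              for (int i = 0; i < BLOCKS; i++) {
                  struct io_uring_cqe *cqe;
                  io_uring_wait_cqe(&ring, &cqe); /* reap completions */
                  io_uring_cqe_seen(&ring, cqe);
              }

              io_uring_queue_exit(&ring);
              return 0;
          }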



          • #65
            the -O3 kernel build came out ahead only by 1.3%... Barely a blip overall.
            This is not barely a blip. Barely a blip would be more like 0.13%. If you ripped out all code optimizations from the kernel that were 1.3% or less, the cumulative effect would be crippling!

            For such a simple and naive change, 1.3% is actually pretty good. And for those with a workload that matches one of the more significantly affected benchmarks, it's a downright windfall!
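            A back-of-envelope sketch of that compounding argument (the count of 50 such optimizations is purely hypothetical):

            Code:
            /* Many small wins compound: 50 stacked 1.3% speedups ~ 1.9x overall. */
            #include <math.h>
            #include <stdio.h>

            int main(void)
            {
                int n = 50;                  /* hypothetical number of 1.3% optimizations */
                double factor = pow(1.013, n);
                printf("%d stacked 1.3%% wins => ~%.0f%% faster overall\n",
                       n, (factor - 1.0) * 100.0);
                return 0;                    /* link with -lm */
            }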
            Last edited by coder; 30 June 2022, 02:19 AM.



            • #66
              Originally posted by discordian View Post
              Actually the opposite: you are bloating up the code, fighting for more cache, and then hopefully getting the improvement back from big, fat, parallel OOO execution.
              Smaller CPUs lack that, and are quite often slower with -O3.
              Do you have any benchmark results showing that? My initial reaction is that it's likely only true for embedded-class processors, or for something 15+ years old, but I admittedly haven't seen any tests myself.



              • #67
                Originally posted by F.Ultra View Post

                Neither TLS nor SSH uses the kernel's encryption routines; they both use their own user-space implementations, so they will see "zero impact" from how the kernel is compiled. The reason why e.g. PostgreSQL benefits here is that it makes a ton of syscalls for each request.
                My bad, thanks for the correction.



                • #68
                  Originally posted by milkylainen View Post

                  Of all the stupid things the kernel has accepted/done throughout the years,
                  exposing O3 through experimental tagged kconfig/lxdialog seems like the least of them.
                  This should be a nomination for the best Linux quotes of 2022...



                  • #69
                    Originally posted by discordian View Post

                    What happened to the LTO effort btw, that certainly does slim down the kernel and should rarely cause issues.
                    Looks like they only cared about LTO with Clang and let GCC burn in hell, most likely because they ended up becoming Google slaves and because of Android (which is almost always compiled with Clang).



                    • #70
                      When it came to the -O3 kernel build for other workloads like gaming/graphics, web browsing performance, and various creator workloads there was no measurable benefit from the -O3 kernel.
                      Kernelception! It's the last line of the article btw. I guess tildearrow is on vacation.

