Announcement

**CochainComplex** · 23 June 2022, 07:58 AM

Originally posted by Jannik2099

It's beyond time. O3 consists solely of safe, standards compliant optimizations on gcc and clang. The myth that O3 "has dangerous experimental optimizations" is just a legend from the gcc 4 era.

Yes, compiler bugs exist, but the kernel runs into them at O2 just as much. Enabling O3 would potentially bring an initial wave of new discoveries, but nothing more.

I can't wait for another unqualified Torvalds rant about how "the compiler inserts UB here" - if your code breaks under O3, that's a bug in YOUR code and you should fix it. And there's no guarantee that your bug only ever exhibits at O3. More often than not it will manifest in a later compiler release at O2 too. The kernel is full of this kind of code smell, and years of blaming the compiler have not helped in fixing any of it.

Indeed, if O3 causes issues, 99% of the time it's related to bad code.

**AlanSMac** · 23 June 2022, 08:00 AM

Originally posted by birdie

On a normal system (with a decent amount of RAM and powerful CPU cores) the kernel itself shouldn't take too much CPU time, so the difference between -O2 and -O3 even if the latter is twice as fast should still be minimal. -O3 might be beneficial for hosting/traffic/VPN providers and people with very weak PCs and that's it.

And some -O3 options, if used without moderation, are outright harmful, since they bloat up the code and that leads to L1/L2 caches being eviscerated.

Hosting/traffic/VPN is most of the internet!

**Jannik2099** · 23 June 2022, 08:01 AM

Originally posted by birdie

GCC 12.1.

O2 vs Ofast:

Code:

+ -fallow-store-data-races [enabled]
+ -fassociative-math [enabled]
+ -fcx-limited-range [enabled]
+ -ffinite-math-only [enabled]
+ -fgcse-after-reload [enabled]
+ -fipa-cp-clone [enabled]
+ -floop-interchange [enabled]
+ -floop-unroll-and-jam [enabled]
+ -fmath-errno [disabled]
+ -fpeel-loops [enabled]
+ -fpredictive-commoning [enabled]
+ -freciprocal-math [enabled]
+ -fsemantic-interposition [disabled]
+ -fsigned-zeros [disabled]
+ -fsplit-loops [enabled]
+ -fsplit-paths [enabled]
+ -ftrapping-math [disabled]
+ -ftree-loop-distribution [enabled]
+ -ftree-partial-pre [enabled]
+ -funroll-completely-grow-size [enabled]
+ -funsafe-math-optimizations [enabled]
+ -funswitch-loops [enabled]
+ -fversion-loops-for-strides [enabled]

O2 vs O3:

Code:

+ -fgcse-after-reload [enabled]
+ -fipa-cp-clone [enabled]
+ -floop-interchange [enabled]
+ -floop-unroll-and-jam [enabled]
+ -fpeel-loops [enabled]
+ -fpredictive-commoning [enabled]
+ -fsplit-loops [enabled]
+ -fsplit-paths [enabled]
+ -ftree-loop-distribution [enabled]
+ -ftree-partial-pre [enabled]
+ -funroll-completely-grow-size [enabled]
+ -funswitch-loops [enabled]
+ -fversion-loops-for-strides [enabled]

O3 vs Ofast:

Code:

+ -fallow-store-data-races [enabled]
+ -fassociative-math [enabled]
+ -fcx-limited-range [enabled]
+ -ffinite-math-only [enabled]
+ -fmath-errno [disabled]
+ -freciprocal-math [enabled]
+ -fsemantic-interposition [disabled]
+ -fsigned-zeros [disabled]
+ -ftrapping-math [disabled]
+ -funsafe-math-optimizations [enabled]

Right, and as you can see the only differences between O3 and Ofast that are not floating-point related are -fallow-store-data-races and -fno-semantic-interposition - the former is bordering on suicide, the latter is something that the kernel should indeed look into (clang defaults to it)

**CochainComplex** · 23 June 2022, 08:02 AM

Originally posted by Jannik2099

Unless they patched the kernel elsewhere, no they do NOT. The kernel Makefile would override any previous -O flag.

-Ofast is also more or less meaningless in a kernel context as it mainly deals with floating point optimizations.

I have since edited my comment. I typed it quickly while on the go. You are right, I meant -Ofast in a different context, but I didn't write that clearly.

**Weasel** · 23 June 2022, 08:23 AM

Originally posted by Jannik2099

Right, and as you can see the only differences between O3 and Ofast that are not floating-point related are -fallow-store-data-races and -fno-semantic-interposition - the former is bordering on suicide

It's not any more suicide than -fstrict-aliasing. If it exhibits problems, your code is flawed, period.
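For anyone unsure what "flawed code" means in the -fstrict-aliasing case, here is a minimal sketch in C. The function name `float_bits_memcpy` is made up for illustration, and the expected bit patterns assume a 32-bit IEEE 754 `float`.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: float is 32 bits wide. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/*
 * The flawed pattern would be:
 *
 *     return *(uint32_t *)&f;
 *
 * That reads a float object through a uint32_t lvalue, which the C
 * aliasing rules do not allow. It may appear to work until the
 * optimizer starts relying on those rules.
 *
 * Well-defined alternative: copy the bytes instead of casting the pointer.
 */
uint32_t float_bits_memcpy(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Code written like the version in the comment is broken regardless of the flag; the optimization just makes the breakage visible.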

**archkde** · 23 June 2022, 09:51 AM

Am I the only one who thinks that the distinction (in the compiler) between -O2 and -O3 (and probably -O1 as well) is nonsense to begin with?

**oiaohm** · 23 June 2022, 09:56 AM

Originally posted by Jannik2099

Unless they patched the kernel elsewhere, no they do NOT. The kernel Makefile would override any previous -O flag.

-Ofast is also more or less meaningless in a kernel context as it mainly deals with floating point optimizations.

-Ofast is an absolute no for building kernels if you want stability.

Originally posted by birdie

GCC 12.1.

O2 vs Ofast:

Code:

+ -fallow-store-data-races [enabled]
+ -fassociative-math [enabled]
+ -fcx-limited-range [enabled]
+ -ffinite-math-only [enabled]
+ -fgcse-after-reload [enabled]
+ -fipa-cp-clone [enabled]
+ -floop-interchange [enabled]
+ -floop-unroll-and-jam [enabled]
+ -fmath-errno [disabled]
+ -fpeel-loops [enabled]
+ -fpredictive-commoning [enabled]
+ -freciprocal-math [enabled]
+ -fsemantic-interposition [disabled]
+ -fsigned-zeros [disabled]
+ -fsplit-loops [enabled]
+ -fsplit-paths [enabled]
+ -ftrapping-math [disabled]
+ -ftree-loop-distribution [enabled]
+ -ftree-partial-pre [enabled]
+ -funroll-completely-grow-size [enabled]
+ -funsafe-math-optimizations [enabled]
+ -funswitch-loops [enabled]
+ -fversion-loops-for-strides [enabled]

O2 vs O3:

Code:

+ -fgcse-after-reload [enabled]
+ -fipa-cp-clone [enabled]
+ -floop-interchange [enabled]
+ -floop-unroll-and-jam [enabled]
+ -fpeel-loops [enabled]
+ -fpredictive-commoning [enabled]
+ -fsplit-loops [enabled]
+ -fsplit-paths [enabled]
+ -ftree-loop-distribution [enabled]
+ -ftree-partial-pre [enabled]
+ -funroll-completely-grow-size [enabled]
+ -funswitch-loops [enabled]
+ -fversion-loops-for-strides [enabled]

O3 vs Ofast:

Code:

+ -fallow-store-data-races [enabled]
+ -fassociative-math [enabled]
+ -fcx-limited-range [enabled]
+ -ffinite-math-only [enabled]
+ -fmath-errno [disabled]
+ -freciprocal-math [enabled]
+ -fsemantic-interposition [disabled]
+ -fsigned-zeros [disabled]
+ -ftrapping-math [disabled]
+ -funsafe-math-optimizations [enabled]

Thanks for the good list Birdie

A few things in -Ofast are a problem. There is some floating-point code inside the Linux kernel.
https://elixir.bootlin.com/linux/lat...rnel_fpu_begin.

It is not in many files, but -Ofast turns off the math safeties there too. "-funsafe-math-optimizations" is a very big catch-all.

The big problem with -Ofast is "-fallow-store-data-races":

Allow the compiler to perform optimizations that may introduce new data races on stores, without proving that the variable cannot be concurrently accessed by other threads. Does not affect optimization of local data. It is safe to use this option if it is known that global data will not be accessed by multiple threads.

Examples of optimizations enabled by -fallow-store-data-races include hoisting or if-conversions that may cause a value that was already in memory to be re-written with that same value. Such re-writing is safe in a single threaded context but may be unsafe in a multi-threaded context. Note that on some processors, if-conversions may be required in order to enable vectorization.

The Linux kernel is multi-threaded in lots of places. So yes, this one feature of -Ofast can introduce race conditions in many places in the Linux kernel.
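To make the quoted description concrete, here is a minimal sketch in C of the kind of if-conversion it talks about. The names `shared_flag`, `set_if_needed` and `set_if_needed_converted` are made up for illustration, and the second function is a hand-written imitation of what a compiler might do, not actual GCC output.

```c
#include <stdbool.h>

/* Hypothetical global that other threads might also write. */
int shared_flag = 0;

/* As written: the store only happens when cond is true. */
void set_if_needed(bool cond)
{
    if (cond)
        shared_flag = 1;
}

/*
 * Hand-written imitation of an if-converted version: the branch is
 * replaced by an unconditional load and store. When cond is false the
 * old value is simply written back. Single-threaded this is equivalent,
 * but a write from another thread that lands between the load and the
 * store would be silently overwritten.
 */
void set_if_needed_converted(bool cond)
{
    int old = shared_flag;          /* load */
    shared_flag = cond ? 1 : old;   /* unconditional store */
}
```

Run from a single thread, both functions leave `shared_flag` with the same value. The difference only shows up when something else writes the variable concurrently.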

There are a few problem children in Linux kernel space with -O3 as well, for example "-fpredictive-commoning":

Perform predictive commoning optimization, i.e., reusing computations (especially memory loads and stores) performed in previous iterations of loops.
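As a rough illustration of that description, the sketch below shows a loop and a hand-transformed version that carries a loaded value over to the next iteration. The function names are hypothetical, and the second function is written by hand to mimic the idea; it is not GCC's actual output.

```c
#include <stddef.h>

/* As written: each iteration loads both in[i] and in[i + 1]. */
void pair_sums(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        out[i] = in[i] + in[i + 1];
}

/*
 * Hand-written imitation of the commoned form: the value loaded as
 * in[i + 1] in one iteration is reused as in[i] in the next, so each
 * element is loaded only once. This is only equivalent if nothing
 * (including the stores to out) changes in[] while the loop runs.
 */
void pair_sums_commoned(const int *in, int *out, size_t n)
{
    if (n < 2)
        return;

    int prev = in[0];
    for (size_t i = 0; i + 1 < n; i++) {
        int next = in[i + 1];
        out[i] = prev + next;
        prev = next;
    }
}
```

Loads that must really be repeated have to be marked as such, for example with `volatile` accesses, which is what the kernel's READ_ONCE()/WRITE_ONCE() macros are built on.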

OK, but what if you have just performed an operation that switched the memory mapping from user space to kernel space, or the reverse? In some cases you need to replay the loads and stores exactly, so that the contents seen through the page tables exposed to user space and through the page tables exposed to the kernel are the same.

What is safe to do to user-space code, which only has a single set of page tables accessible, is not always safe when you do it in a kernel.

**oiaohm** · 23 June 2022, 10:05 AM

Originally posted by CochainComplex

Indeed, if O3 causes issues, 99% of the time it's related to bad code.

The 1% that is not bad code happens to be OS kernel code. In kernel code you have access to multiple page-table setups, while a normal application can only see one set of page tables. This changes what the compiler can presume. The result is that to build a kernel with -O3, particular optimisations will need to be disabled in particular places.

In a lot of ways there should be a dedicated -O3 for kernel code. Switching between user-space and kernel-space page tables and interfacing with memory across multiple groups of page tables is the behaviour of protected-mode OS kernels and pretty much nothing else.

Yes, the 1% of good code where -O3 causes issues turns out to be 99% concentrated in OS kernel builds. The horrible part is that these are race conditions as well, so you can get a lot of "works for me" when the build is in fact completely broken and you just don't know it yet.

-O2 for the Linux kernel is the safe choice. -O3 for the Linux kernel is marginally unsafe: the build might be totally fine, but it could also be broken. -Ofast with the Linux kernel is danger diving, because something will be wrong somewhere.

Most of the options currently enabled for -O3 should be harmless to build the Linux kernel with. The problem is that not all options in -O3 are safe.

**F.Ultra** · 23 June 2022, 10:48 AM

Originally posted by birdie

In my 25+ years of using PC, laptops, etc. I've had 0 situations where ntoskrnl.exe or vmlinuz took a discernible amount of CPU time.

That is not how things work, the kernel is not some random exe that executes parallel to userland (there are kernel threads for e.g filesystems that do yes, but not vmlinuz as such) and somehow I think that you already know this?!

If we take e.g a simple ls in a large directory:

Code:

time ls /opt/largedir/ > /dev/null
real 0m0.055s
user 0m0.040s
sys 0m0.016s

We can see that in this particular case 29% of the runtime was spent inside the kernel. Now I have zero reasons to believe that -O3 would decrease this particular number by anything meaningful if at all, but to claim that the efficiency of the kernel is insignificant for your usage of a computer is completely wrong.
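The 29% figure is simply sys divided by real from the `time` output above. Below is a small sketch of that arithmetic, plus an Amdahl-style estimate of what a faster kernel portion would mean for this one measurement. The function names and the 2x factor are assumptions chosen for illustration, not measurements.

```c
/* Share of wall-clock time spent in the kernel, in percent. */
double sys_share_percent(double real_s, double sys_s)
{
    return 100.0 * sys_s / real_s;
}

/*
 * Estimated wall-clock time if only the sys part became `factor`
 * times faster, assuming the rest of the runtime stays unchanged.
 */
double real_after_sys_speedup(double real_s, double sys_s, double factor)
{
    return real_s - sys_s + sys_s / factor;
}
```

With real = 0.055 s and sys = 0.016 s, the first function gives about 29%. A hypothetical 2x faster kernel portion would bring this run down to roughly 0.047 s.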

**birdie** · 23 June 2022, 11:03 AM

Originally posted by F.Ultra

That is not how things work, the kernel is not some random exe that executes parallel to userland (there are kernel threads for e.g filesystems that do yes, but not vmlinuz as such) and somehow I think that you already know this?!

If we take e.g a simple ls in a large directory:

Code:

time ls /opt/largedir/ > /dev/null
real 0m0.055s
user 0m0.040s
sys 0m0.016s

We can see that in this particular case 29% of the runtime was spent inside the kernel. Now I have zero reasons to believe that -O3 would decrease this particular number by anything meaningful if at all, but to claim that the efficiency of the kernel is insignificant for your usage of a computer is completely wrong.

Thank you for proving my point, you're talking about point zero zero times.

Secondly, I bet in your example the kernel spent ~95% of time waiting for IO and ~5% getting you the data. Again, let's make the kernel twice as fast and as a result you'll shave off 0.0002ms? Woah.

You seemingly don't understand how the kernel works either. It's a proxy, it must be a proxy, if it does any serious work, it's badly coded. I can only imagine e.g. some software encryption/decryption algos taking a lot of CPU time which I mentioned earlier for VPN providers.

Experimental -O3 Optimizing The Linux Kernel For Better Performance Brought Up Again
