Now the most interesting question is whether there is a regression with -O2 -march=native, which is a very popular combination.
Benchmarking The Linux 5.19 Kernel Built With "-O3 -march=native"
Originally posted by zamroni111:
Great comment.
I guess phoronix needs to retest with e cores disabled.
It is also possible that the gcc native check went to e cores so the kernel was compiled with optimization for e core instead of p core.
Originally posted by brad0:
Except it is not at all.
Meanwhile, the kernel is NOT written with one architecture in mind (it is more generic and has fewer optimizations targeting one specific instruction set), and yet -march=native is slower.
The only question is whether that is an issue of GCC + Alder Lake, or an issue of -march=native on the kernel in general.
Originally posted by cj.wijtmans:
Don't the e cores have different registers than the p cores? As in no AVX? How would code even run? 🤔
The kernel knows the features of each core, so it will direct AVX-512 calls to the P-cores (if AVX-512 is enabled in the BIOS of Phoronix's Alder Lake system).
There is one gcc process per C source file,
so thousands of gcc processes run during a kernel compilation.
Each gcc process does the native check independently.
Surely, out of those thousands of gcc invocations, some landed on an E-core, so their native checks picked up the E-core spec.
Because Alder Lake is not uniform, instead of simply -march=native, Phoronix should pass the detailed gcc options that match the P-core.
They can run this on a P-core to get the detailed parameters:
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
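A minimal sketch of forcing that check onto a P-core (the assumption that CPU 0 is a P-core needs verifying first, e.g. with lscpu --all --extended):

# Pin the gcc driver (and the cc1 it spawns) to an assumed P-core, so the
# cpuid-based -march=native detection sees P-core features, then print the
# expanded cc1 command line with the concrete -march/-mtune flags:
taskset -c 0 gcc -march=native -E -v - </dev/null 2>&1 | grep cc1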
Originally posted by mcloud:
Maybe the problem is that the kernel isn't the sole purpose of the machine. If you optimize the kernel too much (say, it can now keep an additional variable in a register instead of on the stack), that register can't be used by the next instruction and it needs more push/pop to free it. But I ain't an expert, so what do I know
But, seriously, thank you for *admitting* that, which puts you a long way above many.
Register use within any kernel (or more accurately, within any ABI, which the kernel is a subset of) is always "known". That is to say, pretending you have a machine with 6 registers A-F, it will be explicitly defined that calls will e.g. modify A and B, and leave C-F unchanged. For any callee that *uses* C-F, there will be boilerplate either within the called code or around it within the caller to preserve at least whichever registers it modifies.
The cost of that preservation (via push/pop, as you say) is typically** so tiny that it effectively doesn't exist, other than in artificial benchmarks (e.g. ctx_clock), which makes sense when you think about it: whatever you're calling has to do "something", and that "something" is nearly always at least 1000x more work than the cost of saving and restoring those registers.
** There have been exceptions, but it's really not something you need to worry about. Being able to remove a register from the ABI does have measurable value, for roughly the reasons you think it does, but outside of that, no. The compiled code will use as many registers as it max(can,needs) in every case, but the cost of pushing and popping one or two more is basically irrelevant.
Cache, OTOH, *is* something that can basically be "destroyed" the way you suggest, but again: even in that case, it's destroyed *because* the thing you're calling has genuine work to do. Similarly, Intel's numerous SIMD attempts (MMX, AVX, etc.) do incur significant transition costs, but code that e.g. wastes 200 cycles on setup so that it can use a 3-cycle SIMD write instead of 4x 2-cycle writes with 0 overhead generally means the compiler is broken.
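To make that register-preservation boilerplate visible, here is a sketch (the toy functions f and g are made up; on x86-64 SysV the callee-saved register will typically be rbx):

# A caller that must keep a value live across a call forces exactly one
# push/pop pair for a callee-saved register -- grep the generated assembly:
echo 'long g(long); long f(long a){ return g(a) + a; }' \
  | gcc -O2 -S -x c - -o - | grep -E 'push|pop'

Two instructions wrapped around a real call is exactly the "effectively free" cost described above.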
Originally posted by zamroni111:
Great comment.
I guess phoronix needs to retest with e cores disabled.
It is also possible that the gcc native check went to e cores so the kernel was compiled with optimization for e core instead of p core.
Usually it is OK to run code optimised for a little core on a big core. On the other hand, running code optimised for big cores on little cores gives poor performance (one reason is the assumption of out-of-order execution, which allows the compiler to skip or relax instruction scheduling during optimisation). IIRC the Arm compilers for big.LITTLE, when given the specific -march/-mcpu/-mtune options, compile for the little cores. I stated my suspicion about what is wrong for Alder Lake in a previous comment.
Some further benchmarks could help identify the culprit (including the ones I mentioned in my previous post, as well as using different CPU affinities to pin the process to the big or little cores, and finally checking the effect of -march=alderlake on userspace apps). I would run such benchmarks myself, but I do not have an Alder Lake CPU (and I do not intend to waste money on this abomination).
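The affinity part could look like this (a sketch; ./workload is a stand-in for any benchmark, and the CPU ID ranges assume an 8P+8E part with hyperthreading, so verify them with lscpu --all --extended first):

# Same binary, pinned to P-cores and then to E-cores, to compare results:
taskset -c 0-15 ./workload    # logical CPUs 0-15: assumed P-cores
taskset -c 16-23 ./workload   # logical CPUs 16-23: assumed E-cores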
Originally posted by Barley9432:
And the entire benefit of Gentoo gone out the window with a single article... nice
The fact that compiling with more optimizations has little effect on the kernel, which contains mostly unstructured code, is not a surprise at all.
On the other hand there are other applications, which spend most of their time in loops iterating over regular data structures, where the compiler optimizations and the selection of the appropriate instruction set can make a very large difference in speed and where the compilation from source in Gentoo provides a noticeable improvement.
I normally use Gentoo on my laptops and desktops, but when I happen to use an Ubuntu or a Fedora on the same hardware, e.g. before installing Gentoo, they feel definitely more sluggish.
Even though I always compile a custom kernel configuration, I have never used compilation flags other than the defaults for the kernel itself, because I agree that an improvement over them is unlikely.
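Where the Gentoo-style win actually shows up is in loops like this sketch (the file names and toy function are made up):

# The same loop, compiled for generic x86-64 and for the local CPU:
echo 'void add(float *a, const float *b, long n){ for(long i=0;i<n;i++) a[i]+=b[i]; }' > add.c
gcc -O3 -S add.c -o generic.s
gcc -O3 -march=native -S add.c -o native.s
grep -c ymm generic.s native.s   # 256-bit AVX registers appear only in native.s (on an AVX-capable CPU)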