Now the most interesting question is whether there is a regression with -O2 -march=native, which is a very popular combination.
Benchmarking The Linux 5.19 Kernel Built With "-O3 -march=native"
Originally posted by zamroni111:
Great comment.
I guess phoronix needs to retest with e cores disabled.
It is also possible that the gcc native check went to e cores so the kernel was compiled with optimization for e core instead of p core.
Originally posted by brad0:
Except it is not at all.
Meanwhile, the kernel is NOT written with one architecture in mind (it is more generic and has fewer optimizations targeting one specific instruction set), and yet -march=native is slower.
The only question is whether that is an issue of GCC + Alder Lake, or an issue of -march=native on the kernel in general.
Originally posted by cj.wijtmans:
Don't the e cores have different registers than the p cores? As in no AVX? How would code even run? 🤔
The kernel knows the features of each core, so it will direct AVX-512 calls to the P-cores (if AVX-512 is enabled in the BIOS of Phoronix's Alder Lake system).
There is one gcc process per C source file,
so thousands of gcc processes run during a kernel compilation.
Each gcc process does the native check independently.
Surely, out of those thousands of gcc invocations, some landed on an E-core, so their native checks picked up the E-core spec.
Because Alder Lake is not uniform, instead of simply -march=native, Phoronix should pass the detailed gcc options that match the P-core.
They can run this on a P-core to get the detailed parameters:
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
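A minimal sketch of forcing that check onto a P-core (the assumption that CPU 0 is a P-core needs verifying first, e.g. with lscpu --all --extended):

# Pin the gcc driver (and the cc1 it spawns) to an assumed P-core, so the
# cpuid-based -march=native detection sees P-core features, then print the
# expanded cc1 command line with the concrete -march/-mtune flags:
taskset -c 0 gcc -march=native -E -v - </dev/null 2>&1 | grep cc1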
Originally posted by mcloud:
Maybe the problem is that the kernel isn't the sole purpose of the machine. If you optimize the kernel too much (say, it can now keep an additional variable in a register instead of on the stack), that register can't be used by the next instruction and it needs more push/pop to free it. But I ain't an expert, so what do I know
But, seriously, thank you for *admitting* that, which puts you a long way above many.
Register use within any kernel (or more accurately, within any ABI, which the kernel is a subset of) is always "known". That is to say, pretending you have a machine with 6 registers A-F, it will be explicitly defined that calls will e.g. modify A and B, and leave C-F unchanged. For any callee that *uses* C-F, there will be boilerplate either within the called code or around it within the caller to preserve at least whichever registers it modifies.
The cost of that preservation (via push/pop, as you say) is typically** so tiny that it effectively doesn't exist, other than in artificial benchmarks (e.g. ctx_clock), which makes sense when you think about it: whatever you're calling has to do "something", and that "something" is nearly always at least 1000x more work than the cost of saving and restoring those registers.
** There have been exceptions, but it's really not something you need to worry about. Being able to remove a register from the ABI does have measurable value, for roughly the reasons you think it does, but outside of that, no. The compiled code will use as many registers as it max(can,needs) in every case, but the cost of pushing and popping one or two more is basically irrelevant.
Cache, OTOH, *is* something that can basically be "destroyed" the way you suggest, but again: even in that case, it's destroyed *because* the thing you're calling has genuine work to do. Similarly, Intel's numerous SIMD attempts (MMX, AVX, etc.) do incur significant transition costs, but code that e.g. wastes 200 cycles on setup so that it can use a 3-cycle SIMD write instead of 4x 2-cycle writes with 0 overhead generally means the compiler is broken.
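To make that register-preservation boilerplate visible, here is a sketch (the toy functions f and g are made up; on x86-64 SysV the callee-saved register will typically be rbx):

# A caller that must keep a value live across a call forces exactly one
# push/pop pair for a callee-saved register -- grep the generated assembly:
echo 'long g(long); long f(long a){ return g(a) + a; }' \
  | gcc -O2 -S -x c - -o - | grep -E 'push|pop'

Two instructions wrapped around a real call is exactly the "effectively free" cost described above.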
Originally posted by zamroni111:
Great comment.
I guess phoronix needs to retest with e cores disabled.
It is also possible that the gcc native check went to e cores so the kernel was compiled with optimization for e core instead of p core.
Usually it is OK to run code optimised for a little core on a big core. On the other hand, running code optimised for big cores on little cores gives poor performance (one reason is the assumption of out-of-order execution, which allows the compiler to skip or relax instruction scheduling during optimisation). IIRC the Arm compilers for big.LITTLE, when given the specific -march/-mcpu/-mtune options, compile for the little cores. I stated my suspicion about what is wrong for Alder Lake in a previous comment.
Some further benchmarks could help identify the culprit (including the ones I mentioned in my previous post, as well as using different CPU affinities to pin the process to the big or little cores, and finally checking the effect of -march=alderlake on userspace apps). I would run such benchmarks myself, but I do not have an Alder Lake CPU (and I do not intend to waste money on this abomination).
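The affinity part could look like this (a sketch; ./workload is a stand-in for any benchmark, and the CPU ID ranges assume an 8P+8E part with hyperthreading, so verify them with lscpu --all --extended first):

# Same binary, pinned to P-cores and then to E-cores, to compare results:
taskset -c 0-15 ./workload    # logical CPUs 0-15: assumed P-cores
taskset -c 16-23 ./workload   # logical CPUs 16-23: assumed E-cores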
Originally posted by Barley9432:
And the entire benefit of Gentoo gone out the window with a single article... nice
The fact that compiling with more optimizations has little effect on the kernel, which contains mostly unstructured code, is not a surprise at all.
On the other hand there are other applications, which spend most of their time in loops iterating over regular data structures, where the compiler optimizations and the selection of the appropriate instruction set can make a very large difference in speed and where the compilation from source in Gentoo provides a noticeable improvement.
I normally use Gentoo on my laptops and desktops, but when I happen to use an Ubuntu or a Fedora on the same hardware, e.g. before installing Gentoo, they feel definitely more sluggish.
Even though I always compile a custom kernel configuration, I have never used compilation flags other than the defaults for the kernel itself, because I agree that an improvement over them is unlikely.
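Where the Gentoo-style win actually shows up is in loops like this sketch (the file names and toy function are made up):

# The same loop, compiled for generic x86-64 and for the local CPU:
echo 'void add(float *a, const float *b, long n){ for(long i=0;i<n;i++) a[i]+=b[i]; }' > add.c
gcc -O3 -S add.c -o generic.s
gcc -O3 -march=native -S add.c -o native.s
grep -c ymm generic.s native.s   # 256-bit AVX registers appear only in native.s (on an AVX-capable CPU)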