LLVM Clang 12 Benchmarks At Varying Optimization Levels, LTO

  • #11
    Is there a reason for not having a "-O2 -march=native -flto" test?

    • #12
      Originally posted by DanglingPointer View Post
      Would be good to combine the Clang 12 and GCC 11 results.
      Also some sort of final summary at the end showing the overall-mean winner and the winner by number of first places.
      https://www.phoronix.com/scan.php?pa...-icelake&num=6
      https://www.phoronix.com/scan.php?pa...-clang12&num=1
      https://www.phoronix.com/scan.php?pa...epyc7763&num=1

      These are all rather recent - from within the last two months.

      Michael already pointed out that GCC 11 shows a slightly slower overall result when -flto is added to -O3 -march=native. Nonetheless, GCC and Clang are in a very close race - but Clang often seems to be ahead.
      Last edited by CochainComplex; 26 June 2021, 04:17 AM.

      • #13
        Is -flto the same as -flto=thin now? If not, we might want to test -flto=thin as well; if they are the same, we might want to test the option that activates full LTO.
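
        If I read the Clang docs right, plain -flto still defaults to full (monolithic) LTO, so the two are not the same. The spellings in question, with foo.c as a placeholder source file:

            # Clang 12 LTO modes (foo.c stands in for any source file);
            # linking LTO objects may require lld or a plugin-aware linker
            clang -O3 -flto=thin foo.c -o foo   # ThinLTO: scalable, parallel backend
            clang -O3 -flto=full foo.c -o foo   # full (monolithic) LTO
            clang -O3 -flto      foo.c -o foo   # no argument defaults to full LTO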

        • #14
          Originally posted by coder View Post
          Not all benchmarks put equal pressure on instruction cache. In cases that are more limited by it, perhaps you could get a net benefit with that combination.

          However, in cases where the hotspots are dominated by a small number of loops, aggressive inlining, unrolling, and vectorization is going to be the winning strategy.
          That's exactly why I was wondering what -flto and -Oz (and/or -Os) would do together, especially when tested across a wide array of processors with varying cache sizes, to see whether -march=native plus a different cache size produces different code than plain -Oz with -flto.

          Basically, I'm wondering roughly how much processor cache marks the dividing line between picking speed and size optimizations. And that sentence just made me wonder why -march=native doesn't turn -O2 into -Os on low-cache processors. Maybe nobody has done the tests needed to implement a cache-size-to-optimization-level heuristic? Dumb idea?
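
          Something like this hypothetical test matrix is what I have in mind (bench.c and the flag list are assumptions, not an existing harness):

              # Build the same program at each level, then compare
              # code size and runtime on each machine.
              for opt in -O2 -O3 -Os -Oz; do
                  clang $opt -march=native -flto bench.c -o "bench$opt"
                  size "bench$opt"      # text-segment size as a cache-pressure proxy
                  time "./bench$opt"    # wall-clock runtime
              done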

          • #15
            Originally posted by skeevy420 View Post
            made me wonder why -march=native doesn't turn -O2 into -Os on low-cache processors. Maybe nobody has done the tests needed to implement a cache-size-to-optimization-level heuristic?
            I think the distinction is more code-specific than CPU-specific.

            • #16
              Profile-guided optimization would really be a cool final step in this comparison.
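
              For anyone who wants to try it, a minimal sketch of Clang's instrumentation-based PGO workflow (file names are placeholders):

                  # 1. Build with profile instrumentation
                  clang -O2 -fprofile-instr-generate bench.c -o bench
                  # 2. Run a representative workload (writes default.profraw)
                  ./bench
                  # 3. Merge the raw profile into an indexed one
                  llvm-profdata merge -output=bench.profdata default.profraw
                  # 4. Rebuild using the profile
                  clang -O2 -fprofile-instr-use=bench.profdata bench.c -o bench_pgo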

              • #17
                Originally posted by coder View Post
                I think the distinction is more code-specific than CPU-specific.
                Possibly. It's just hard not to notice things like Zen 2, where the lowest-end part has 4 MB of L3 cache and the highest-end 256 MB. Entire programs can fit in one, while only part of a program fits in the other. I'm not sure what to make of that other than assumptions - part of a program versus multiple programs. My assumption would be that the one with the smaller L3 cache, holding only part of a program, might prefer Os/Oz binaries when considering multitasking and interactive environments.

                • #18
                  Originally posted by skeevy420 View Post
                  Possibly. It's just hard not to notice things like Zen 2, where the lowest-end part has 4 MB of L3 cache and the highest-end 256 MB.
                  But that's unified cache, rather than L1 instruction cache or the micro-op "L0" cache.

                  Originally posted by skeevy420 View Post
                  Entire programs can fit in one while only part of a program can fit in the other.
                  You don't need the entire program to fit in any level of the cache hierarchy. All that's needed is for the hot parts of each hot loop to fit in the L1 instruction cache. When you get too many L1 cache misses, that's when you start to pay a penalty for large code size. Even then, branch prediction can hide some of it by prefetching the code before it's needed.

                  Originally posted by skeevy420 View Post
                  Part of a program versus multiple programs. My assumption would be that the one with the smaller L3 cache, holding only part of a program, might prefer Os/Oz binaries when considering multitasking and interactive environments.
                  L1 and L2 cache are per-core, so they scale well with the number of cores. And the cost of an L1 miss is negligible compared with the length of a timeslice.

                  In other words, I think L3 sizes are mainly about data -- not code.
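
                  That's measurable, by the way: on Linux, perf can count instruction-cache misses directly (event names vary by kernel and CPU; ./bench is a placeholder workload):

                      perf stat -e L1-icache-load-misses,instructions ./bench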

                  • #19
                    coder
                    Yeah, plus the L1 and L2, IIRC, are the same across the entire Zen 2 lineup, with slight differences between Zen iterations. On the lower-end Zen 2s there's the potential for more binaries/loops/instructions to stay resident in L3 without being swapped in and out of cache. That's the only benefit I can see, and I have no idea whether it matters for raw CPU power or benchmarks.
