Glibc's strncasecmp / strcasecmp Get AVX2 & EVEX Optimized Versions, Drops AVX

goldstein.w.n replied

08 April 2022, 11:21 PM
Originally posted by GreenReaper

While this might be a reasonable argument, I feel it has been undermined a little by running the benchmarks in question on Tiger Lake, a far newer architecture which may not put as much emphasis on AVX support as those which launched with it. I don't know if it would be any better - I just think it should be tested on the platforms that are actually impacted by the change.

As for newness: I'm still using many Ivy Bridge and Sandy Bridge servers, since their performance is adequate and they're far cheaper than more modern hardware. But it is possible that this will change by the time the current development version of glibc (2.36) is in use by Debian - we already faced an energy surcharge from one provider in the Netherlands (which relies largely on gas) that effectively reduces the cost differential. Our newest, aside from a few caches, are Xeon-D (Broadwell) - which do at least support AVX2.

There is definitely an issue with running SSE2/SSE4/AVX benchmarks on Tiger Lake, especially since the benchmarks are micro-benchmarks, so things like the LSD on Tiger Lake (which IVB and SnB don't have) can drastically change the expectation.

The reason I think it's alright here is that the benchmarks are more of a sanity test/due diligence that nothing is horribly wrong. The AVX and SSE4.2 code have the exact same logic and instruction order. The only difference is that the AVX version uses the VEX prefix on the SIMD instructions, so the only issues can be in the frontend. This can be fixed for the SSE4.2 code explicitly if someone wants to submit a patch. If the AVX code was faster, it's really only by accident.

Intuitively I would also expect the SSE4.2 version to be better for almost any application, since its smaller code size will add less Icache pressure. This is hard to test, as `str{n}casecmp` is not hot enough that we would notice these small percentage changes in any application. Probably the way to do it would be to write a toy application that is bottlenecked on `qsort` using `str{n}casecmp` as the comparison function. I didn't think it was needed given the exact match in logic, and unfortunately I don't own the hardware.
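A minimal sketch of what such a toy application could look like, assuming a POSIX system. The names `cmp_nocase`, `fill_word`, and `time_nocase_sort` are hypothetical, as are the word count and length passed in; none of this comes from the glibc benchmark suite. It sorts pseudo-random mixed-case words so that nearly all of the time goes to `qsort` calling `strcasecmp`:

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>   /* strcasecmp */
#include <time.h>

/* qsort comparator: each element is a `const char *`, compared ignoring case. */
int cmp_nocase(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcasecmp(*sa, *sb);
}

/* Fill buf with len pseudo-random mixed-case letters and a terminator. */
void fill_word(char *buf, size_t len, unsigned *seed)
{
    for (size_t i = 0; i < len; i++) {
        *seed = *seed * 1103515245u + 12345u;      /* small LCG, not for real randomness */
        unsigned r = (*seed >> 16) & 0x7fff;
        buf[i] = (char)((r & 1 ? 'a' : 'A') + (r >> 1) % 26);
    }
    buf[len] = '\0';
}

/* Sort n random words of len characters each.
 * Returns the CPU seconds spent inside qsort, or -1.0 if allocation fails. */
double time_nocase_sort(size_t n, size_t len)
{
    char *pool = malloc(n * (len + 1));
    char **words = malloc(n * sizeof *words);
    if (!pool || !words) {
        free(pool);
        free(words);
        return -1.0;
    }

    unsigned seed = 1;
    for (size_t i = 0; i < n; i++) {
        words[i] = pool + i * (len + 1);
        fill_word(words[i], len, &seed);
    }

    clock_t t0 = clock();
    qsort(words, n, sizeof *words, cmp_nocase);
    clock_t t1 = clock();

    /* Sanity check: the result really is in case-insensitive order. */
    for (size_t i = 1; i < n; i++)
        assert(strcasecmp(words[i - 1], words[i]) <= 0);

    free(words);
    free(pool);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling something like `time_nocase_sort(1000000, 16)` on an Ivy Bridge or Sandy Bridge machine, once per glibc build, would give a rough comparison. Differences of a few percent would need repeated runs to separate from noise.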
GreenReaper replied

02 April 2022, 07:03 AM
Originally posted by CommunityMember

Well, as the commit mentions, SSE4.2 (which your IVB has, and the code still supports) is roughly equivalent to AVX performance (3-4% difference), so while your IVB laptop is showing its age, this particular change is unlikely to matter much in any real-world scenario.

While this might be a reasonable argument, I feel it has been undermined a little by running the benchmarks in question on Tiger Lake, a far newer architecture which may not put as much emphasis on AVX support as those which launched with it. I don't know if it would be any better - I just think it should be tested on the platforms that are actually impacted by the change.

As for newness: I'm still using many Ivy Bridge and Sandy Bridge servers, since their performance is adequate and they're far cheaper than more modern hardware. But it is possible that this will change by the time the current development version of glibc (2.36) is in use by Debian - we already faced an energy surcharge from one provider in the Netherlands (which relies largely on gas) that effectively reduces the cost differential. Our newest, aside from a few caches, are Xeon-D (Broadwell) - which do at least support AVX2.

Last edited by GreenReaper; 02 April 2022, 07:07 AM.
coder replied

31 March 2022, 10:41 PM
Originally posted by ermo

Or you could just be pragmatic and constrain workloads to the cores that support the ISA you're interested in?

I know it was a longish post, but I did try to address that. It only works if whatever code that's spawning the worker threads also knows exactly what they'll be doing. As soon as you lose visibility into the ISA needs/wants of any functions those threads are calling, you can no longer accurately set the affinity for them.

I think glibc's string functions are a perfect example of advanced ISA usage showing up where you wouldn't necessarily expect it. This is no small point! If the app or OS blocked any thread from running on a little core which tried to use an AVX-512 instruction (which includes EVEX), then you'd end up with the little cores sitting almost entirely idle.

Anyway, that pretty much limits this approach to a library that spawns its own, private worker threads. However, a downside of that approach that I've experienced first hand, is where your app ends up with 3 independent sets of worker threads. So, oversubscribing the available cores by > 3x, within a single process, because the different libs don't know about each other! And what if you need to run multiple instances of that app (as I did)? So, it's far from ideal.
ermo replied

31 March 2022, 02:34 PM
Originally posted by coder

The more you dig into the complexities of handling cores with heterogeneous ISA support, the more intractable it seems.
(...)
I'm left to conclude that "the cat is already out of the bag", on this one. The only practical way to do Big.Little CPUs is for symmetrical ISA support.
(...)

Or you could just be pragmatic and constrain workloads to the cores that support the ISA you're interested in?

On Linux, that'd be taskset (1).
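For code that wants to do the same thing from inside the process, a minimal sketch using the Linux-specific `sched_setaffinity(2)` call follows. The helper name `pin_to_cpus` is hypothetical, and which CPU ids belong to the cores with the ISA you want is an assumption the caller has to supply; this call only restricts placement.

```c
#define _GNU_SOURCE    /* needed for cpu_set_t and sched_setaffinity on glibc */
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Restrict the calling thread to the listed CPU ids.
 * Returns 0 on success and -1 on a bad id or a kernel error. */
int pin_to_cpus(const int *cpus, int count)
{
    assert(cpus != NULL || count == 0);

    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < count; i++) {
        if (cpus[i] < 0 || cpus[i] >= CPU_SETSIZE)
            return -1;
        CPU_SET(cpus[i], &set);
    }
    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof set, &set);
}
```

This has the same limitation as `taskset`: it only helps when the code setting the mask knows what the thread will go on to execute.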

For now, I actually prefer the AMD chiplet model, in part because the products we've seen so far have identical ISA support in SKUs that are comprised of multi-chiplet CPUs.
onlyLinuxLuvUBack replied

31 March 2022, 01:25 PM
Originally posted by V1tol

Well, the problem is that benchmarks don't reflect real-life usage. Doing a synthetic thing like "comparing millions of strings per second" does not mean it works faster. I really like this explanation writeup about the possible penalties of using AVX* occasionally in your app (99% of real-world apps).

I agree with you.
Let's say there were a benchmark website that ran a series of articles...

Season 1, episode 1 could be an article benchmarking str cmp ops with zig/nim/go/
Season 1, episode 2 could be an article benchmarking, this time, 10,000 containers (5000 read/5000 write) of sqlite with zig to RAM DBs
Season 1, episode 3 could be an article benchmarking your real-world tests, like zipping a file, re-encoding a YouTube download, or whatever real-world workloads

Why not have multiple benchmarks to investigate deeply?
coder replied

31 March 2022, 10:15 AM
Originally posted by topolinik

so, those processors are indeed a wrong move from Intel which needs to make the chips able to recover.

I just explained why the problem is harder than it seems. It's not enough simply to run the workload to completion, because you also don't want performance to suffer excessively.

https://www.phoronix.com/forums/foru...09#post1316609

Originally posted by topolinik

It is simply mad the idea of givin' the ball to the developers side and expect any thread they code to be aware of the ISA exposed by the core they're running on.

Especially because you could get preempted in the middle of some ISA-specific block of code and resumed on another core. It's not always an option to set processor affinity, because the code which is doing the ISA-specific operations might not be the same code managing the thread.

I think this highlights that we're moving beyond the point where it makes sense for apps to spawn their own worker threads. There are numerous problems with the approach:
You don't know how many other apps are also trying to use those cores.

Other threads operating on shared data (and maybe holding locks on it) might not be running (especially if the system is busy) at the time, which can hurt performance due to increasing context switches.

If a library is internally spawning worker threads, the app might end up with multiple sets of worker threads all fighting for the same cores.

If some operations are tied to a subset of the cores, the app doesn't necessarily know this.

So, it makes much more sense to have some sort of OS-provided work queues, whereby you can queue up a tuple of (function, state, parameters) to be dispatched by the OS to available cores. If this approach is taken, and some of the operations turn out to use instructions not supported on all of the cores, then the OS can restrict which cores get to run work items from that app, without the app having oversubscribed those cores with threads that are fighting each other for them.
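To make the (function, state, parameters) tuple concrete, here is a small user-space sketch in C. It is only a stand-in for the OS-provided facility described above: every name in it (`work_item`, `wq_submit`, `wq_run_on_core`, the `ISA_*` bits) is hypothetical, and it assumes each item declares its ISA needs up front, which is a simplification compared with an OS discovering them at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ISA feature bits, for illustration only. */
enum { ISA_BASE = 1u << 0, ISA_AVX2 = 1u << 1, ISA_AVX512 = 1u << 2 };

typedef void (*work_fn)(void *state, long param);

/* One queued unit of work: the (function, state, parameters) tuple,
 * plus the ISA features it needs and whether it has run yet. */
struct work_item {
    work_fn  fn;
    void    *state;
    long     param;
    unsigned needs;
    int      done;
};

#define WQ_CAP 64

struct work_queue {
    struct work_item items[WQ_CAP];
    size_t count;
};

/* Queue one item. Returns 0 on success, -1 if the queue is full. */
int wq_submit(struct work_queue *q, work_fn fn, void *state, long param,
              unsigned needs)
{
    assert(q != NULL && fn != NULL);
    if (q->count == WQ_CAP)
        return -1;
    q->items[q->count++] = (struct work_item){ fn, state, param, needs, 0 };
    return 0;
}

/* Stand-in for the dispatcher: run every pending item whose required
 * features are all present in core_isa. Returns how many items ran. */
size_t wq_run_on_core(struct work_queue *q, unsigned core_isa)
{
    size_t ran = 0;
    for (size_t i = 0; i < q->count; i++) {
        struct work_item *it = &q->items[i];
        if (it->done || (it->needs & ~core_isa))
            continue;               /* already ran, or this core lacks a feature */
        it->fn(it->state, it->param);
        it->done = 1;
        ran++;
    }
    return ran;
}

/* Example work function: add param to the long that state points at. */
void add_to_sum(void *state, long param)
{
    *(long *)state += param;
}
```

The point of the design is that the submitter never creates threads; an item that needs AVX-512 simply stays queued until a core that has it picks it up, while cores without it keep draining everything else.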
topolinik replied

31 March 2022, 08:38 AM
Originally posted by coder

Anyway, this prompted much speculation about how to deal with hybrid CPUs having asymmetrical ISA support.

I think this should NOT be the case, as

Experts in the area have pointed out that technically the chip could be designed to catch the error and hand off the thread to the right core, but Intel hasn’t done this here as it adds complexity.

so, those processors are indeed a wrong move from Intel which needs to make the chips able to recover.
It is simply mad the idea of givin' the ball to the developers side and expect any thread they code to be aware of the ISA exposed by the core they're running on.
coder replied

31 March 2022, 08:10 AM
Originally posted by topolinik

I did not know such CPUs even existed, any names as example...?

In Alder Lake, Intel disabled AVX-512 on the big cores. Some motherboards had BIOS settings which would let you re-enable it. Intel then cracked down on those motherboard makers, and presumably could also modify their CPUs with a metal layer change to hard-disable it. Even at the microcode level, you'd imagine they could prevent it from being enabled.

https://www.anandtech.com/show/17047...d-complexity/2

Anyway, this prompted much speculation about how to deal with hybrid CPUs having asymmetrical ISA support.
coder replied

31 March 2022, 08:06 AM
Originally posted by zxy_thf

iirc AVX1 does not support integer types and I could imagine how much trouble it may cause when using floating point units to copy strings.

I'm not sure about that. I think SNaNs should only trigger a FPE, if they're the result of a computation - not simply a load.

Anyway, the AVX version presumably worked. So, if there was anything to work around, they apparently already did it.
coder replied

31 March 2022, 08:02 AM
Originally posted by V1tol

IIRC AVX2 also causes slowdown, that's why this change surprises me.

Heavy AVX2 would cause Haswell CPUs to clock-throttle, slightly. Not nearly as bad as could happen with AVX-512, however. This effect was lessened in successive generations, to the point that I think it's no longer treated as a special case in CPUs like Tiger Lake and Alder Lake.