Glibc's strncasecmp / strcasecmp Get AVX2 & EVEX Optimized Versions, Drops AVX
Originally posted by zxy_thf View Post
iirc AVX1 does not support integer types, and I can imagine how much trouble it might cause to copy strings using the floating-point units.

Anyway, the AVX version presumably worked. So, if there was anything to work around, they apparently already did it.
Originally posted by topolinik View Post
I did not know such CPUs even existed. Any names, as examples?

Anyway, this prompted much speculation about how to deal with hybrid CPUs having asymmetrical ISA support.
Originally posted by coder View Post
Anyway, this prompted much speculation about how to deal with hybrid CPUs having asymmetrical ISA support.
Experts in the area have pointed out that technically the chip could be designed to catch the error and hand off the thread to the right core, but Intel hasn’t done this here as it adds complexity.
It is simply mad, the idea of giving the ball to the developers and expecting any thread they write to be aware of the ISA exposed by the core it's running on.
Originally posted by topolinik View Post
So, those processors are indeed a wrong move from Intel, which needs to make the chips able to recover.
Originally posted by topolinik View Post
It is simply mad, the idea of giving the ball to the developers and expecting any thread they write to be aware of the ISA exposed by the core it's running on.
I think this highlights that we're moving beyond the point where it makes sense for apps to spawn their own worker threads. There are numerous problems with that approach:
- You don't know how many other apps are also trying to use those cores.
- Other threads operating on shared data (and maybe holding locks on it) might not be running at the time (especially if the system is busy), which can hurt performance by increasing context switches.
- If a library internally spawns worker threads, the app might end up with multiple sets of worker threads all fighting for the same cores.
- If some operations are tied to a subset of the cores, the app doesn't necessarily know this.
So, it makes much more sense to have some sort of OS-provided work queues, whereby you can queue up a tuple of (function, state, parameters) to be dispatched by the OS to available cores. If this approach is taken, and some of the operations turn out to use instructions not supported on all of the cores, then the OS can restrict which cores get to run work items from that app, without the app having oversubscribed those cores with threads that are fighting each other for them.
Originally posted by V1tol View Post
Well, the problem is that benchmarks don't reflect real-life usage. Doing something synthetic like "comparing a million strings per second" does not mean it works faster. I really like this explanation write-up about the possible penalties of using AVX* only occasionally in your app (which is 99% of real-world apps).
Let's say there were something like a benchmark website with multiple articles:
- season 1, episode 1: an article benchmarking str cmp ops with zig/nim/go/
- season 1, episode 2: an article with a benchmark, but this time 10,000 containers (5,000 read / 5,000 write) running sqlite via zig against in-RAM DBs
- season 1, episode 3: an article with a benchmark, but this time with real-world tests, like zipping a file, re-encoding a youtube download, whatever real-world workloads
Why not have multiple benches to investigate deeply?
Originally posted by coder View Post
The more you dig into the complexities of handling cores with heterogeneous ISA support, the more intractable it seems.
(...)
I'm left to conclude that "the cat is already out of the bag" on this one. The only practical way to do Big.Little CPUs is with symmetrical ISA support.
(...)
On Linux, that'd be taskset(1).
For now, I actually prefer the AMD chiplet model, in part because the products we've seen so far have identical ISA support in SKUs that are comprised of multi-chiplet CPUs.
Originally posted by ermo View Post
Or you could just be pragmatic and constrain workloads to the cores that support the ISA you're interested in?
I think glibc's string functions are a perfect example of advanced ISA usage showing up where you wouldn't necessarily expect it. This is no small point! If the app or OS blocked any thread from running on a little core which tried to use an AVX-512 instruction (which includes EVEX), then you'd end up with the little cores sitting almost entirely idle.
Anyway, that pretty much limits this approach to a library that spawns its own, private worker threads. However, a downside of that approach, which I've experienced first hand, is that your app can end up with 3 independent sets of worker threads, oversubscribing the available cores by > 3x within a single process, because the different libs don't know about each other! And what if you need to run multiple instances of that app (as I did)? So, it's far from ideal.
Originally posted by CommunityMember View Post
Well, as the commit mentions, SSE4.2 (which your IVB has, and which the code still supports) is roughly equivalent to AVX in performance (3-4% difference), so while your IVB laptop is showing its age, this particular change is unlikely to matter much in any real-world scenario.
As for newness: I'm still using many Ivy Bridge and Sandy Bridge servers, since their performance is adequate and they're far cheaper than more modern hardware. But it is possible that this will change by the time the current development version of glibc (2.36) is in use by Debian - we already faced an energy surcharge from one provider in the Netherlands (which relies largely on gas) that effectively reduces the cost differential. Our newest, aside from a few caches, are Xeon-D (Broadwell) - which do at least support AVX2.
Last edited by GreenReaper; 02 April 2022, 07:07 AM.
Originally posted by GreenReaper View Post
While this might be a reasonable argument, I feel it has been undermined a little by running the benchmarks in question on Tiger Lake, a far newer architecture which may not put as much emphasis on AVX support as those which launched with it. I don't know if it would be any better - I just think it should be tested on the platforms that are actually impacted by the change.
As for newness: I'm still using many Ivy Bridge and Sandy Bridge servers, since their performance is adequate and they're far cheaper than more modern hardware. But it is possible that this will change by the time the current development version of glibc (2.36) is in use by Debian - we already faced an energy surcharge from one provider in the Netherlands (which relies largely on gas) that effectively reduces the cost differential. Our newest, aside from a few caches, are Xeon-D (Broadwell) - which do at least support AVX2.
The reason I think it's alright here is that the benchmarks are more of a sanity test / due diligence check that nothing is horribly wrong. The AVX and SSE4.2 code have the exact same logic and instruction order; the only difference is that the AVX version uses the VEX prefix on the SIMD instructions. The only issues can then be in the frontend. This can be fixed for the SSE4.2 code explicitly if someone wants to submit a patch. If the AVX code was faster, it was really only by accident.
Intuitively I would also expect the SSE4.2 version to be better for almost any application, since its smaller code size adds less icache pressure. This is hard to test, as `str{n}casecmp` is not hot enough that we would notice these small percentage changes in any application. Probably the way to do it would be to write a toy application that bottlenecks on `qsort` using `str{n}casecmp` as the comparison function. I didn't think it was needed given the exact match in logic, and unfortunately I don't own the hardware.