Glibc's strncasecmp / strcasecmp Get AVX2 & EVEX Optimized Versions, Drops AVX
-
For your next CPU buying decision: this patchset, alongside the recent x86-64-v2 through v4 feature levels, is another example of why ISA support can matter. And some extensions matter more than others (AVX2 vs. AVX) and are therefore better supported long term.
-
Originally posted by jabl:
It's all automatic. When a program uses one of these glibc functions that has several optimized versions for different ISA features, glibc performs a CPU check and stores the result. The next time such a function is called, it only needs to look up the stored result and jump to the correct implementation.
-
Originally posted by V1tol: IIRC AVX2 also causes slowdowns, which is why this change surprises me.
I hope to see (zig+musl) vs (nim dynamic with glibc) vs (nim musl) benchmarks, to see the perf differences for these str cmp ops.
-
Originally posted by CommunityMember: And this is why having different ISA features on different CPUs in a system (say P cores having avx-512 and E cores not) can result in the most interesting problems (either via identification of features, or scheduling on cores, or handling the SIGILL). None of the issues are insurmountable, but they do require one to work a bit harder to do the right thing (whatever that might be).
Sure, you can make it "work", as in "not fall over and die", by trapping SIGILL and faulting the thread over to a core which supports that instruction, but if no such cores are free, then the thread that was intended to be running concurrently is now stuck in a queue. That's not a performance win, but performance is the main reason Intel went with Big.Little on non-mobile CPUs.
So, exploring this little thought experiment some more, the next thing you'd do is probably remember which threads need to run on the more capable cores. However, if a program starts lots of worker threads running jobs from a shared work queue, then they're all potentially going to need the "big" cores. This is quite a common approach for programs running lots of threads. So, now you have a situation where the program naively started more threads than there are cores that can actually run them, and they're left fighting each other for execution time on those cores. This will hurt performance relative to running only enough threads to occupy those cores.
Now, let's say you define some new API for querying ISA support, so that you know that all software using that API will know about potential asymmetry between cores and will know not to assume otherwise. And only through that API do you expose any features which differ between the cores. The problem this doesn't solve is that a library like glibc has no way of knowing or controlling which core it's on. Even if it queries for EVEX support every time a function is called (which comes with its own performance penalty), the function could get preempted and resumed on a less-capable core. And at the level where these threads are being spawned and can potentially specify affinity to more capable CPUs, the caller might be oblivious to the fact that they're using a library that's even affected by such an issue.
So, it turns out to be a really thorny problem to make asymmetrical ISA support work well for anything but the most carefully written software. I think I recently saw that OpenBLAS now has a feature supporting it, but that only works because it spawns its own worker threads which run only OpenBLAS code. That's really what it takes to make it work.
I'm left to conclude that "the cat is already out of the bag" on this one. The only practical way to do Big.Little CPUs is with symmetrical ISA support across all cores.
Closing thought: maybe the problem is really having programs create their own worker threads. If you instead submitted packets of work to the OS, the OS could manage the worker threads and learn which cores can successfully run your work packets (if you don't tell it a priori). That would also be a good way to avoid the problem of oversubscription, where multiple programs (or even libraries within a single program, as I've actually seen) each spawn their own set of worker threads, totaling far more than the number of cores.
Last edited by coder; 31 March 2022, 01:00 AM.
-
Originally posted by CommunityMember:
And this is why having different ISA features on different CPUs in a system (say P cores having avx-512 and E cores not) can result in the most interesting problems (either via identification of features, or scheduling on cores, or handling the SIGILL). None of the issues are insurmountable, but they do require one to work a bit harder to do the right thing (whatever that might be).
Ok, this sounds quite obvious for any optimization. But in the past you could optimize for SSE and it ran better across all SSE-supporting CPUs. Now, if you start optimizing for generic AVX2 or AVX-512, it runs better on one generation or vendor but might run even worse on another, depending on the constellation of E-cores, P-cores, TDP, etc.
Last edited by CochainComplex; 31 March 2022, 04:15 AM.
-
Originally posted by onlyLinuxLuvUBack: The phoronix benchmarks and graphs will settle your mind.
I hope to see (zig+musl) vs (nim dynamic with glibc) vs (nim musl) benchmarks, to see the perf differences for these str cmp ops.