
Glibc's strncasecmp / strcasecmp Get AVX2 & EVEX Optimized Versions, Drops AVX


  • #11
    Originally posted by Aryma View Post

    why did they need to remove it in the first place?
    Because software doesn't maintain itself.

    Comment


    • #12
      Originally posted by anarki2 View Post

      Because software doesn't maintain itself.
      For now...

      Dun Dun Dunnnnn

      Comment


      • #13
        For your next CPU buying decision: this patchset is another example, alongside the recent x86-64-v2 through v4 feature levels, of why ISA support can matter, and why some extensions are more important than others (AVX2 vs. AVX) and are thus better supported long term.

        Comment


        • #14
          Originally posted by jabl View Post

          It's all automatic. When a program uses one of these glibc functions that have several optimized versions using different ISA features available, it does a CPU check and stores the result. The next time such a function is used it only needs to lookup the stored result and jump to the correct function.
          And this is why having different ISA features on different CPUs in a system (say P cores having avx-512 and E cores not) can result in the most interesting problems (either via identification of features, or scheduling on cores, or handling the SIGILL). None of the issues are insurmountable, but they do require one to work a bit harder to do the right thing (whatever that might be).
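          The dispatch jabl describes is glibc's IFUNC (indirect function) mechanism: a resolver runs once when the dynamic linker binds the symbol, and every later call jumps straight to the chosen implementation. A minimal sketch of the same idea with made-up function names, assuming GCC on an ELF/glibc target:

          ```c
          #include <ctype.h>

          /* Byte-at-a-time fallback version. */
          static int my_strcasecmp_generic(const char *a, const char *b) {
              unsigned char ca, cb;
              do {
                  ca = (unsigned char)tolower((unsigned char)*a++);
                  cb = (unsigned char)tolower((unsigned char)*b++);
              } while (ca != '\0' && ca == cb);
              return ca - cb;
          }

          /* Stand-in for a vectorized version; a real one would use AVX2. */
          static int my_strcasecmp_avx2(const char *a, const char *b) {
              return my_strcasecmp_generic(a, b);
          }

          /* The resolver runs once, very early, so it should not call into
             libc; GCC's __builtin_cpu_init() is safe to use here. */
          static int (*resolve_strcasecmp(void))(const char *, const char *) {
              __builtin_cpu_init();
              return __builtin_cpu_supports("avx2") ? my_strcasecmp_avx2
                                                    : my_strcasecmp_generic;
          }

          /* All calls to my_strcasecmp go through the resolved pointer. */
          int my_strcasecmp(const char *a, const char *b)
              __attribute__((ifunc("resolve_strcasecmp")));
          ```

          The one-time resolution is exactly why the CPU check costs nothing on later calls, and also why the choice is frozen no matter which core the thread later runs on.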

          Comment


          • #15
            Originally posted by V1tol View Post
            IIRC AVX2 also causes slowdown, that's why this change surprises me.
            The Phoronix benchmarks and graphs will settle your mind.
            I am hoping to see (zig+musl) vs (nim dynamic with glibc) vs (nim musl) to see the perf differences for these str cmp ops.

            Comment


            • #16
              Originally posted by CommunityMember View Post
              And this is why having different ISA features on different CPUs in a system (say P cores having avx-512 and E cores not) can result in the most interesting problems (either via identification of features, or scheduling on cores, or handling the SIGILL). None of the issues are insurmountable, but they do require one to work a bit harder to do the right thing (whatever that might be).
              The more you dig into the complexities of handling cores with heterogeneous ISA support, the more intractable it seems. If you're designing everything from the ground up, with full control over the software, then you can do it easily enough. The problems come from a universe of software that assumes symmetrical ISA levels across all cores, and now you come along and invalidate that assumption.

              Sure, you can make it "work", as in "not fall over and die", by trapping SIGILL and faulting the thread over to a core which supports that instruction, but if no such cores are free, then the thread that was intended to be running concurrently is now stuck in a queue. That's not a performance win, but performance is the main reason Intel went with Big.Little on non-mobile CPUs.
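              A hedged sketch of that trap-and-migrate idea on Linux follows; the choice of core 0 as the capable core is an assumption, and a real implementation would also have to decode which instruction actually faulted:

              ```c
              #define _GNU_SOURCE
              #include <sched.h>
              #include <signal.h>
              #include <string.h>

              static volatile sig_atomic_t migrated;

              /* On SIGILL, pin the current thread to a core assumed to support
                 the missing instruction; the kernel then reschedules it there. */
              static void on_sigill(int sig) {
                  (void)sig;
                  cpu_set_t big;
                  CPU_ZERO(&big);
                  CPU_SET(0, &big);                       /* hypothetical P core */
                  sched_setaffinity(0, sizeof big, &big); /* pid 0 = this thread */
                  migrated = 1;
                  /* Returning re-executes the faulting instruction, now on the
                     more capable core. */
              }

              static void install_sigill_migration(void) {
                  struct sigaction sa;
                  memset(&sa, 0, sizeof sa);
                  sa.sa_handler = on_sigill;
                  sigaction(SIGILL, &sa, NULL);
              }
              ```

              As the paragraph above says, this only keeps the process alive; if no capable core is free, the migrated thread just waits its turn, and the intended parallelism is gone.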

              So, exploring this little thought experiment some more, the next thing you'd do is probably remember which threads need to run on the more capable cores. However, if a program starts lots of worker threads running jobs from a shared work queue, then they're all potentially needing to run on the "big" cores. This is quite a common approach for programs running lots of threads. So, now you have a situation where the program naively started more threads than there are cores that can actually run them, and they're left to fight each other for execution time on those cores. This will hurt performance relative to running only enough threads to occupy those cores.

              Now, let's say you define some new API for querying ISA support, so that you know that all software using that API will know about potential asymmetry between cores and will know not to assume otherwise. And only through that API do you expose any features which differ between the cores. The problem this doesn't solve is that a library like glibc will have no way of knowing or controlling which core it's on. Even if it queries for EVEX support every time a function is called (which comes with its own performance penalty), the function could get preempted and resumed on a less-capable core. And at the level where these threads are being spawned and can potentially specify affinity to more capable CPUs, the caller might be oblivious to the fact that they're using a library that's even affected by such an issue.

              So, it turns out to be a really thorny problem to make asymmetrical ISA support work well for all but the most carefully written software. I think I recently saw that OpenBLAS now has a feature supporting it, but that only works because it spawns its own worker threads which are running only OpenBLAS code. That's really what it takes to make it work.

              I'm left to conclude that "the cat is already out of the bag" on this one. The only practical way to do Big.Little CPUs is with symmetrical ISA support.

              Closing thought: maybe the problem is really having programs create their own worker threads. If you instead submitted packets of work to the OS, then the OS could manage worker threads and learn which cores can successfully run your work packets (if you don't warn it, a priori). That would also be a good move to avoid the problem of oversubscription, where multiple programs (or even libraries within a single program, as I've actually seen) each spawn their own set of worker threads, totaling far more than the number of cores.
              Last edited by coder; 31 March 2022, 01:00 AM.

              Comment


              • #17
                Originally posted by Aryma View Post

                why did they need to remove it in the first place?
                IIRC AVX1 does not support 256-bit integer operations, and I can imagine how much trouble it would cause to copy strings using only the floating-point units.
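                To illustrate the point: the original AVX extended only floating-point operations to 256 bits, while byte-wise integer compares and adds, which case-insensitive comparison leans on, arrived at 256 bits with AVX2. A hedged sketch of lower-casing 32 ASCII bytes at once with AVX2 intrinsics (this is not glibc's actual code):

                ```c
                #include <immintrin.h>

                /* vpcmpgtb / vpand / vpaddb are the 256-bit *integer* ops that
                   plain AVX lacks; with AVX they are only available at 128 bits. */
                __attribute__((target("avx2")))
                static void lower32(const char in[32], char out[32]) {
                    __m256i v     = _mm256_loadu_si256((const __m256i *)in);
                    __m256i below = _mm256_set1_epi8('A' - 1);
                    __m256i above = _mm256_set1_epi8('Z' + 1);
                    /* 0xFF in each lane holding 'A'..'Z' */
                    __m256i isUp  = _mm256_and_si256(_mm256_cmpgt_epi8(v, below),
                                                     _mm256_cmpgt_epi8(above, v));
                    /* add 0x20 only to the uppercase lanes */
                    v = _mm256_add_epi8(v, _mm256_and_si256(isUp,
                                                _mm256_set1_epi8(0x20)));
                    _mm256_storeu_si256((__m256i *)out, v);
                }
                ```

                Doing the same range test and conditional add through the FP side of AVX1 would mean bitcasting and juggling 128-bit integer halves, which is the kind of trouble hinted at above.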

                Comment


                • #18
                  Originally posted by CommunityMember View Post

                  And this is why having different ISA features on different CPUs in a system (say P cores having avx-512 and E cores not) can result in the most interesting problems (either via identification of features, or scheduling on cores, or handling the SIGILL). None of the issues are insurmountable, but they do require one to work a bit harder to do the right thing (whatever that might be).
                  As if proper multithreading weren't already a challenge. Now one has to consider P and E cores and the potential downclocking caused by the use of some instruction set which is only potentially faster if one tames the clock slowdown. To me it appears that it's almost impossible to write generic optimized code anymore. Once you start optimizing for one CPU, it is going to run worse on the others. Ad absurdum, this even questions the existence of such special instruction sets.
                  Ok, this sounds quite obvious for any optimization. But in the past you could optimize for SSE and it ran better across all SSE-supporting CPUs; now if you start optimizing for generic AVX2 or AVX-512 it runs better on one generation or vendor but might run even worse on another, depending on the constellation of E cores, P cores, TDP, etc.
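                  One mitigation for exactly this dilemma is compiler function multi-versioning, where one source function is built for several targets and the right clone is picked at load time. A minimal GCC sketch, with an illustrative target list:

                  ```c
                  #include <stddef.h>

                  /* GCC emits one clone per listed target plus a resolver (the
                     same IFUNC machinery glibc uses), so the same binary adapts
                     to whatever CPU it runs on. */
                  __attribute__((target_clones("avx2", "default")))
                  long dot(const int *a, const int *b, size_t n) {
                      long s = 0;
                      for (size_t i = 0; i < n; i++)
                          s += (long)a[i] * b[i];
                      return s;
                  }
                  ```

                  This sidesteps the "optimized for one CPU, worse on another" trap at the cost of binary size, though like any load-time dispatch it still assumes every core in the machine reports the same features.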
                  Last edited by CochainComplex; 31 March 2022, 04:15 AM.

                  Comment


                  • #19
                    Originally posted by coder View Post
                    ...the complexities of handling cores with heterogeneous ISA...
                    I did not know such CPUs even existed; any names as examples?

                    Comment


                    • #20
                      Originally posted by onlyLinuxLuvUBack View Post
                      The Phoronix benchmarks and graphs will settle your mind.
                      I am hoping to see (zig+musl) vs (nim dynamic with glibc) vs (nim musl) to see the perf differences for these str cmp ops.
                      Well, the problem is that benchmarks don't reflect real-life usage. A synthetic test like "comparing millions of strings per second" does not mean the application actually works faster. I really like this writeup about the possible penalties of using AVX* only occasionally in your app (which is 99% of real-world apps).

                      Comment
