Glibc Adds Arm SVE-Optimized Memory Copy - Can "Significantly" Help Performance


  • #11
    Originally posted by carewolf View Post
    No, I mean the code this story is about. It operates on 32 bytes at a time, aka 256 bits, so it won't run faster on a 512-bit implementation (at least according to this summary).
    Oh, right. The patch does call it an initial implementation, leaving open the possibility for further improvements.

    Realistically, the A64FX is the only 512-bit implementation I'm aware of, so supporting nothing wider than 256 bits isn't currently very consequential. Also, perhaps there's a point of diminishing returns that 256 bits already exceeds, though 512 bits is a typical cacheline size.

    Comment


    • #12
      Originally posted by coder View Post
      I disagree. You & Linus aren't thinking hard enough about the practical realities of software, on such a system. What's going to happen is that software will spawn too many threads, they'll get faulted off of the weak cores, and will simply contend for time on the more capable cores.

      Often, apps are ignorant of what ISA extensions the libraries they're using even employ. So, putting the burden on the app developer to manage threads and affinities based on core capabilities is unreasonable and unrealistic.
      Maybe, maybe not. With source-based Linux distros, there's potential for clever solutions. Libraries generally have multiple paths with fallbacks, and you could use the namespace mechanism as a sort of LD_PRELOAD so that libraries know which path to pick, transparently to the app. Maybe you don't even enable the option until an app shows high CPU demand.
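      To make that concrete, here's a minimal sketch of how a library with multiple internal paths could honor an external hint without the app being aware of it. The FORCE_ISA_LEVEL environment variable and the path functions are hypothetical, standing in for whatever knob a distro or loader mechanism would actually expose:

      ```c
      /* Sketch of transparent code-path selection inside a library.
       * FORCE_ISA_LEVEL is a hypothetical knob (not a real glibc feature);
       * the two paths stand in for, e.g., a scalar and an AVX2/SVE copy loop. */
      #include <assert.h>
      #include <stdlib.h>
      #include <string.h>

      static int copy_baseline(void) { return 0; }  /* path safe on all cores */
      static int copy_wide(void)     { return 1; }  /* stand-in for a wide-vector path */

      /* The entry point an app calls through the library's public API. */
      int lib_copy(void)
      {
          const char *hint = getenv("FORCE_ISA_LEVEL");
          if (hint && strcmp(hint, "baseline") == 0)
              return copy_baseline();  /* externally pinned to the lowest path */
          return copy_wide();          /* default: fastest path detected */
      }

      int main(void)
      {
          assert(lib_copy() == 1);     /* wide path unless the hint is set */
          return 0;
      }
      ```

      The point is that the selection lives entirely inside the library's dispatcher, so the app's code never changes.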

      In any case, while I believe you that making it work well would be tricky, I also believe that at least a few would find it worth the trouble.

      Comment


      • #13
        Originally posted by WorBlux View Post
        there is the potential for clever solutions. Libraries generally have multiple paths and fallback, and you could use namespace mechanism as a sort of LD_PRELOAD so that libraries know which path to pick in a way transparent to the app.
        That's a nice idea, but the problem with a core-specific dispatch mechanism is that once a thread starts executing on a more capable core, it's basically bound to it. When it gets preempted, it could have intermediate state involving ISA extensions not available on the smaller cores. It might be feasible for the OS to track which address ranges correspond to which ISA levels, so it could know when it's safe to migrate a thread to a lower-class core.

        The other problem you'll likely find is that code paths optimized for a particular ISA extension tend to have in-memory data structures specific to that code path. So, if some threads are executing the AVX2 path and others the AVX-512 path, shared state might not be read and written consistently between them.

        Originally posted by WorBlux View Post
        In any case, while I believe you that making it work well would be tricky, I also believe that at least a few would find it worth the trouble.
        I'm not opposed to experimentation and even giving users a non-default option to shoot themselves in the foot.

        Also, what Guest said is right - OS developers can already trap certain opcodes on certain cores, if they wanted to simulate a heterogeneous CPU.

        Comment


        • #14
          Originally posted by atomsymbol
          Secondly, some of the solutions I mentioned to Torvalds are based on forms of binary translation, which would enable an app to call a library function without the app dealing with the problem of whether-or-when the library is using AVX2 or AVX-512. But that is no longer a "single-line patch" nor a "1000-line patch".
          Anything that's JIT-compiled can conceivably do it right, assuming the compiler is sophisticated enough.

          Comment


          • #15
            Originally posted by coder View Post
            I'm trying to understand your point about TSX. So, you agree that it was comparable to TME, but you're concerned that it's gone and doesn't appear to be coming back?
            Yeah, I should have been clearer: TSX is at best segmented to parts of the enterprise space, which is not doing adoption any favors.


            The situation seems to be more complicated than that though.
            Last edited by brucethemoose; 10 June 2022, 12:44 AM.

            Comment


            • #16
              Originally posted by carewolf View Post

              This implementation works on 32 bytes at a time, so 32 * 8 = 256 bits, similar to AVX2.
              Right, but future implementations will be wider.

              My point is that SVE2 software written now is going to support those wider implementations in the future, while AVX2 software written now is going to be stuck with AVX2, and without the more flexible instructions of AVX512.

              ARM is going to standardize SVE2 in everything relatively early, while it appears that Intel will have AVX2-only products floating around a while longer, and AVX-512 remains extremely fragmented.
              Last edited by brucethemoose; 10 June 2022, 12:45 AM.

              Comment


              • #17
                Originally posted by brucethemoose View Post
                Yeah, I should have been clearer: TSX is at best segmented to parts of the enterprise space, which is not doing adoption any favors.
                Well, it wasn't, previously. Haswell had it disabled due to bugs. It did actually work in Skylake and Coffee Lake, until Intel started rolling out microcode updates that disabled it (using security vulnerabilities as the excuse).

                This all came as a great disappointment to me, as I thought TSX and HLE were truly innovative features that really helped move x86 forward. So, I can only shrug in agreement as you tout the benefits and advantages of TME.

                I think there's a real opportunity for AMD to step in and revive TSX and HLE. I hope they make an appearance in Zen 4, though that's probably a long shot.

                Comment


                • #18
                  Originally posted by brucethemoose View Post
                  SVE2 software written now is going to support those wider implementations in the future,
                  Let's see what widths the SVE2 version of memcpy() decides to support.

                  Originally posted by brucethemoose View Post
                  ARM is going to standardize SVE2 in everything relatively early,
                  ARMv9-A already did standardize on it, though ARMv9-A products are still just starting to trickle onto the market.

                  Perhaps the SVE2 requirement is even to blame for the Cortex-A510's weird option to share a single FPU between two cores, in a fashion slightly reminiscent of AMD's Bulldozer. Mediatek has at least one SoC where they gave each A510 core its own FPU, but I think Qualcomm opted for the shared approach, in their latest flagship.

                  Originally posted by brucethemoose View Post
                  it appears that Intel will have AVX2-only products floating around a while longer, and AVX-512 remains extremely fragmented.
                  It's even worse than you say, with Alder Lake supporting only AVX2. Where Alder Lake did manage to raise the bar is in finally widening the E-cores. Previous generations didn't even support AVX!

                  Comment


                  • #19
                    Originally posted by brucethemoose View Post

                    Right, but future implementations will be wider.

                    My point is that SVE2 software written now is going to support those wider implementations in the future, while AVX2 software written now is going to be stuck with AVX2, and without the more flexible instructions of AVX512.
                    Not if the code works on 32 bytes at a time. It is equally stuck at a fixed width, just like AVX2 code.

                    Comment


                    • #20
                      Originally posted by carewolf View Post
                      Not if the code works on 32 bytes at a time. It is equally stuck at a fixed width, just like AVX2 code.
                      SVE isn't fixed width. The S indeed stands for "Scalable". The 32-byte aspect is merely an upper bound of this current implementation. The idea is that even implementations that are only 128 bits wide will still benefit.

                      https://developer.arm.com/documentat...in-your-c-code
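                      The width-agnostic style can be sketched in portable C. Here vector_bytes() is a hypothetical stand-in for SVE's CNTB instruction, which reports the hardware's vector length at run time; on real SVE hardware the same binary would process 16, 32, or 64 bytes per iteration without recompilation:

                      ```c
                      /* Portable sketch of SVE's scalable style: the loop never
                       * hard-codes a vector width but queries one at run time.
                       * vector_bytes() is a hypothetical stand-in for SVE's CNTB. */
                      #include <assert.h>
                      #include <stddef.h>
                      #include <string.h>

                      static size_t vector_bytes(void) { return 32; }  /* pretend: 256-bit vectors */

                      static void copy_scalable(unsigned char *dst, const unsigned char *src, size_t n)
                      {
                          const size_t vl = vector_bytes();
                          size_t i = 0;
                          for (; i + vl <= n; i += vl)       /* one full "vector" per iteration */
                              memcpy(dst + i, src + i, vl);
                          if (i < n)                          /* SVE would mask the tail with a predicate */
                              memcpy(dst + i, src + i, n - i);
                      }

                      int main(void)
                      {
                          unsigned char src[100], dst[100] = {0};
                          for (int i = 0; i < 100; i++)
                              src[i] = (unsigned char)i;
                          copy_scalable(dst, src, sizeof src);
                          assert(memcmp(dst, src, sizeof src) == 0);
                          return 0;
                      }
                      ```

                      On real SVE the tail wouldn't need a separate branch at all; predicated loads and stores handle the partial final vector.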

                      BTW, I noticed sysdeps/aarch64/multiarch/memcpy.c has a whole dispatch loop for different ARM server CPUs. Interestingly, the Neoverse N2 doesn't even dispatch to the new SVE version, in spite of supporting it (maybe due to its 128-bit implementation?). Also, it seems as though the A64FX already had its own optimized version. Not a good advertisement for SVE's generic credentials, I think.
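                      That dispatch loop follows glibc's usual multiarch pattern: pick an implementation once, based on CPU identification, rather than on every call. A rough sketch of the idea, using a plain function pointer instead of a real GNU ifunc resolver, with cpu_has_sve() and cpu_is_a64fx() as hypothetical probes standing in for glibc's HWCAP/MIDR checks:

                      ```c
                      /* Sketch of the multiarch dispatch idea, modeled with a plain
                       * function pointer instead of a real GNU ifunc resolver.
                       * The probe functions are hypothetical stand-ins for glibc's
                       * HWCAP/MIDR checks. */
                      #include <assert.h>
                      #include <stdbool.h>
                      #include <stddef.h>
                      #include <string.h>

                      static void *memcpy_generic(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
                      static void *memcpy_sve(void *d, const void *s, size_t n)     { return memcpy(d, s, n); }
                      static void *memcpy_a64fx(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }

                      static bool cpu_has_sve(void)  { return true;  }  /* pretend HWCAP reports SVE */
                      static bool cpu_is_a64fx(void) { return false; }  /* pretend MIDR check */

                      /* Resolver: consulted once at startup, like an ifunc, so the
                       * chosen implementation is called directly ever after. */
                      static void *(*resolve_memcpy(void))(void *, const void *, size_t)
                      {
                          if (cpu_is_a64fx())
                              return memcpy_a64fx;  /* A64FX keeps its own tuned version */
                          if (cpu_has_sve())
                              return memcpy_sve;    /* generic SVE path */
                          return memcpy_generic;
                      }

                      int main(void)
                      {
                          void *(*my_memcpy)(void *, const void *, size_t) = resolve_memcpy();
                          char dst[16];
                          my_memcpy(dst, "hello", 6);
                          assert(strcmp(dst, "hello") == 0);
                          return 0;
                      }
                      ```

                      The per-CPU special cases are exactly why the N2 can skip the new SVE version even while supporting SVE: the resolver can prefer whatever its tables say is fastest for that microarchitecture.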
                      Last edited by coder; 11 June 2022, 04:11 AM.

                      Comment
