ARMv8.8-A Support With New MOPS Instructions Ready For GCC 12

  • ARMv8.8-A Support With New MOPS Instructions Ready For GCC 12

    Phoronix: ARMv8.8-A Support With New MOPS Instructions Ready For GCC 12

    The latest GCC Git activity for next year's GCC 12 compiler is adding ARMv8.8-A support...

    https://www.phoronix.com/scan.php?pa...C-12-ARMv8.8-A

  • #2
    I use a lot of memcpy and sometimes memset; any idea about the performance improvement from these instructions?
    Developer of Ultracopier/Supercopier and of the game CatchChallenger



    • #3
      aarch64 is lining up the inevitable instruction mess nicely...

      Recent toolchains have started looking like +wiz+bang+extra+w00t+fpu+whatever...
      Something like 30+ extension _sets_...?



      • #4
        For details on the actual instructions, see https://developer.arm.com/documentat...-Forward-only-

        There seems to be a variety of these instructions with slightly different behaviors. They all seem to be similar to the x86 `rep movsb` instruction, so I'd guess similar results. Recent Intel CPUs have an enhanced "rep stosb" that can use larger accesses under the covers, as detailed here: https://msrc-blog.microsoft.com/2021...mset-routines/

        The ARM instructions seem to allow the CPU implementors to define how many bytes they copy at a time, so they could make it much easier for programmers to get high-performing memcpy, memmove, and memset implementations without having to do much. You might still get better performance using vector instructions, but for places like the kernel, where you may not want to save off all those vector registers, it may be better to use these.
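        To make that concrete, here is a minimal sketch of what a memcpy built on the forward-only sequence might look like, using GNU inline asm. The mnemonics (CPYFP/CPYFM/CPYFE) are taken from the ARM documentation linked above; the helper name and register constraints are my own, and it assumes a toolchain that accepts -march=armv8.8-a (or the equivalent +mops extension):

        Code:
        #include <stddef.h>

        /* Hedged sketch: copy n bytes forward using the ARMv8.8-A MOPS
           prologue/main/epilogue sequence. All three instructions must see
           the same registers, which the tied "+r" operands guarantee. */
        static inline void *mops_memcpy_fwd(void *dst, const void *src, size_t n)
        {
            void *d = dst;
            __asm__ volatile(
                "cpyfp [%0]!, [%1]!, %2!\n\t"  /* prologue: implementation picks chunking */
                "cpyfm [%0]!, [%1]!, %2!\n\t"  /* main: bulk of the copy */
                "cpyfe [%0]!, [%1]!, %2!"      /* epilogue: remaining bytes */
                : "+r"(d), "+r"(src), "+r"(n)
                :
                : "cc", "memory");
            return dst;
        }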



        • #5
          Originally posted by alpha_one_x86
          I use a lot of memcpy and sometimes memset; any idea about the performance improvement from these instructions?
          If you have a conscious relationship with these functions, I'd guess you are actually using them far less than many other coding practices do, such as:

          Code:
          std::string("I") + std::string("like") + std::string("to") + std::string("create") + std::string("lots") + std::string("of") + std::string("temporary") + std::string("copies") + std::string("when") + std::string("formatting") + std::string("strings");
          (C++ makes it super easy to create lots of more or less hidden copies and initializations, and compilers know how to turn them into calls to these functions.)

          However, as adler187 said, the big benefit is probably OS kernels that need to copy and zero out pages.



          • #6
            Originally posted by andreano
            If you have a conscious relationship with these functions, I'd guess you are actually using them far less than many other coding practices do
            In multiple places I zero buffers for security reasons or just to initialize a range (like in my DoS protection, where I need a zeroed array to count requests/s).
            In some performance-critical places I serialize into a memory block and copy these 4 KB+ blocks into the network output buffer.
            For me that's 40% of my server's time, and 95% in the case of a DoS/DDoS.
            Developer of Ultracopier/Supercopier and of the game CatchChallenger



            • #7
              Originally posted by alpha_one_x86
              In multiple places I zero buffers for security reasons or just to initialize a range (like in my DoS protection, where I need a zeroed array to count requests/s).
              In some performance-critical places I serialize into a memory block and copy these 4 KB+ blocks into the network output buffer.
              For me that's 40% of my server's time, and 95% in the case of a DoS/DDoS.
              Special memcpy instructions won't help at all for large copies or memsets, since you're limited by the memory system. Things like prefetching, write streaming, and avoiding RFO reads matter more.

              For small, unpredictable sizes, hardware might do better; however, after a decade of trying to speed up rep movsb, using it is rarely faster on modern x86 cores. The glibc memcpy has become incredibly complex because rep movsb has performance issues on every CPU, so you still have to call memcpy rather than inline rep movsb, so that glibc can work around the various issues for the particular CPU you are running on. Next year it will be even more complex, since no CPU has implemented a perfect memcpy instruction.
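              For reference, "inlining rep movsb" means something like the sketch below (x86-64 GNU inline asm; the helper name is made up for illustration). This is exactly the thing you generally should not do, because it bypasses glibc's per-CPU dispatch and workarounds:

              Code:
              #include <stddef.h>

              /* Hedged sketch: rep movsb copies RCX bytes from [RSI] to [RDI],
                 so the D/S/c constraints pin the operands to those registers. */
              static inline void rep_movsb_copy(void *dst, const void *src, size_t n)
              {
                  __asm__ volatile("rep movsb"
                                   : "+D"(dst), "+S"(src), "+c"(n)
                                   :
                                   : "memory");
              }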

              CISC-style instructions tend to backfire spectacularly: who still uses the x87 math instructions like FSIN when a library sin() is much faster (and more accurate)?



              • #8
                It's official, ARM 8.8 is definitively no longer RISC, lol.

                NEON already made it not RISC, but what all the RISC fanbois loved most to poke fun at was rep stosb, which isn't even really an ISA instruction. It's a pointer to a variable-run-length microcode routine, which is exactly the definition of not-RISC... and so...

                Of course, anything SIMD/Vector is not RISC for the same reasons, unless the core is actually the full width of the sum of all the operands for all the clocks required to retire it, there's no decoding required, and no complex addressing for the store step. RISC is a harsh mistress.
                Last edited by linuxgeex; 15 December 2021, 05:22 AM.



                • #9
                  Originally posted by linuxgeex
                  It's official, ARM 8.8 is definitively no longer RISC, lol.

                  NEON already made it not RISC, but what all the RISC fanbois loved most to poke fun at was rep stosb, which isn't even really an ISA instruction. It's a pointer to a variable-run-length microcode routine, which is exactly the definition of not-RISC... and so...

                  Of course, anything SIMD/Vector is not RISC for the same reasons, unless the core is actually the full width of the sum of all the operands for all the clocks required to retire it, there's no decoding required, and no complex addressing for the store step. RISC is a harsh mistress.
                  Only if your definition of RISC implies that all instructions must be single-cycle. But then nothing could ever be called RISC, since all ISAs contain a few complex instructions, and implementations always have the option of using multiple cycles for some instructions if that simplifies the overall design. Such complex or multi-cycle instructions don't use microcode, because they have been designed not to need it (I'd guess this is why the 8.8 memcpy is split into three separate instructions).
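                  As a concrete illustration of that split, here is a hedged sketch of the matching memset sequence (SETP/SETM/SETE, per the ARM documentation linked earlier); progress is carried in the architectural registers between the three instructions rather than in a microcoded loop. The helper name and constraints are my own, and it assumes a -march=armv8.8-a (or +mops) toolchain:

                  Code:
                  #include <stddef.h>

                  /* Hedged sketch: fill n bytes at dst with the byte in val,
                     using the ARMv8.8-A prologue/main/epilogue memset sequence. */
                  static inline void *mops_memset(void *dst, int c, size_t n)
                  {
                      void *d = dst;
                      unsigned long val = (unsigned char)c;
                      __asm__ volatile(
                          "setp [%0]!, %1!, %2\n\t"  /* prologue */
                          "setm [%0]!, %1!, %2\n\t"  /* main */
                          "sete [%0]!, %1!, %2"      /* epilogue */
                          : "+r"(d), "+r"(n)
                          : "r"(val)
                          : "cc", "memory");
                      return dst;
                  }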



                  • #10
                    Originally posted by andreano
                    However, as adler187 said, the big benefit is probably OS kernels that need to copy and zero out pages.
                    You wouldn't need compiler support just for that. The AArch64 backend for the kernel could use special-purpose asm routines for those specific purposes. And because pages are always fixed-size and aligned, they wouldn't even be hard to write or maintain.
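                    As a hedged illustration of that point (PAGE_SIZE and the helper name are assumptions, and this is not the kernel's actual clear_page()): with a hard-coded size and guaranteed alignment there is no tail or alignment dispatch left to write, which is why such a routine stays trivial to maintain:

                    Code:
                    #include <stddef.h>

                    #define PAGE_SIZE 4096UL  /* assumed 4 KiB pages */

                    /* Zero one page. Because the size and alignment are fixed,
                       a plain loop (or a few lines of asm) is all that's needed;
                       the compiler is free to turn this into paired stores. */
                    static void zero_page(void *page)
                    {
                        unsigned long *p = (unsigned long *)page;
                        for (size_t i = 0; i < PAGE_SIZE / sizeof(*p); i++)
                            p[i] = 0;
                    }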

                    Where vector instructions get really annoying for memory ops is all the alignment and size variations. It's really something like general-purpose memcpy() and memset() that would benefit most from these. Plus, you can potentially make the hardware smart enough to avoid RFO and pipeline in some prefetching, like PerformanceExpert mentioned.

