
Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()


    Phoronix: Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()

    While Intel has offered good Ice Lake support since before the CPUs were shipping (with Thunderbolt support taking a bit longer as the lone key exception, since resolved), a feature that has been publicly known since 2017 is the Fast Short REP MOV behavior, and finally with Linux 5.6 it is being made use of for faster memory movements...

    http://www.phoronix.com/scan.php?pag...6-FSRM-Memmove

  • #2
    It will certainly be interesting to see how much of a performance boost this gives.



    • #3
      Is Intel Ice Lake the only CPU supporting this feature?



      • #4
        Don't expect a silver bullet... this Stack Overflow question has some good info: https://stackoverflow.com/questions/...vsb-for-memcpy



        • #5
          Originally posted by cthart View Post
          Don't expect a silver bullet... this Stack Overflow question has some good info: https://stackoverflow.com/questions/...vsb-for-memcpy
          The MOV RAX result of 13260 MB/s, for an 8 KiB block that fits in the L1D cache, is obsolete because recent x86 CPUs can sustain one 256-bit (32-byte) store per cycle using AVX instructions, which on a 4 GHz CPU yields a theoretical maximum of 4 × 32 = 128 GB/s.



          • #6
            Originally posted by atomsymbol View Post

            The MOV RAX result of 13260 MB/s, for an 8 KiB block that fits in the L1D cache, is obsolete because recent x86 CPUs can sustain one 256-bit (32-byte) store per cycle using AVX instructions, which on a 4 GHz CPU yields a theoretical maximum of 4 × 32 = 128 GB/s.
            But isn't AVX mostly disallowed in the kernel?



            • #7
              Originally posted by atomsymbol View Post

              The MOV RAX result of 13260 MB/s, for an 8 KiB block that fits in the L1D cache, is obsolete because recent x86 CPUs can sustain one 256-bit (32-byte) store per cycle using AVX instructions, which on a 4 GHz CPU yields a theoretical maximum of 4 × 32 = 128 GB/s.
              Originally posted by F.Ultra View Post
              But isn't AVX mostly disallowed in the kernel?
              I think there are several points worth mentioning:
              1. With memcpy implemented on x86-64 simply by using the RAX register, one iteration of the memcpy loop has 5 instructions of user code, which translates to at most 4 µops in the µop cache because the CMP+Jcc pair is fused into a single µop. If the loop is properly aligned to maximize fetch/decode bandwidth, Ryzen 3000 can fetch, decode and execute the 5 user instructions in a single clock. A /usr/bin/perf stat measurement shows that on a Ryzen 3000 CPU the real-world IPC is 4.9. With the CPU core running at 4.3 GHz, because about 8 bytes are copied per clock cycle, the measured bandwidth is 33.7 GB/s if both the source and destination ranges are in the L1D cache. If the data is in the L2 cache, the bandwidth is 24.5 GB/s.
              2. Intel Ice Lake desktop CPUs (expected to be released this year) have 2 L1D load ports and 2 L1D store ports, which in theory enables the CPU to copy 2 × 8 = 16 bytes per clock cycle via the RAX register, which at 5 GHz means about 78 GB/s. However, in this case one memcpy iteration has 7 user instructions (27 bytes of code, 6 µops) and it is uncertain whether Ice Lake will be able to sustain processing 6 µops per cycle in this case.
              3. With memcpy implemented on x86-64 by using the YMM0 register, Ryzen 3000's IPC is 4.6. At 4.3 GHz this translates to 126 GB/s of L1D bandwidth and 69 GB/s of L2 bandwidth.
              4. In any properly implemented optimized operating system kernel running on x86-64, enabling AVX is a matter of simply switching a single bit indicating that in case of an interrupt the interrupt handler has to save AVX registers:
                Code:
                atomic { bool old_avx = avx; avx = true; }
                memcpy_avx(src, dst, 8192);
                if (!old_avx) atomic { avx = false; }



              • #8
                Originally posted by atomsymbol View Post



                I think there are several points worth mentioning:
                1. With memcpy implemented on x86-64 simply by using the RAX register, one iteration of the memcpy loop has 5 instructions of user code, which translates to at most 4 µops in the µop cache because the CMP+Jcc pair is fused into a single µop. If the loop is properly aligned to maximize fetch/decode bandwidth, Ryzen 3000 can fetch, decode and execute the 5 user instructions in a single clock. A /usr/bin/perf stat measurement shows that on a Ryzen 3000 CPU the real-world IPC is 4.9. With the CPU core running at 4.3 GHz, because about 8 bytes are copied per clock cycle, the measured bandwidth is 33.7 GB/s if both the source and destination ranges are in the L1D cache. If the data is in the L2 cache, the bandwidth is 24.5 GB/s.
                2. Intel Ice Lake desktop CPUs (expected to be released this year) have 2 L1D load ports and 2 L1D store ports, which in theory enables the CPU to copy 2 × 8 = 16 bytes per clock cycle via the RAX register, which at 5 GHz means about 78 GB/s. However, in this case one memcpy iteration has 7 user instructions (27 bytes of code, 6 µops) and it is uncertain whether Ice Lake will be able to sustain processing 6 µops per cycle in this case.
                3. With memcpy implemented on x86-64 by using the YMM0 register, Ryzen 3000's IPC is 4.6. At 4.3 GHz this translates to 126 GB/s of L1D bandwidth and 69 GB/s of L2 bandwidth.
                4. In any properly implemented optimized operating system kernel running on x86-64, enabling AVX is a matter of simply switching a single bit indicating that in case of an interrupt the interrupt handler has to save AVX registers:
                  Code:
                  atomic { bool old_avx = avx; avx = true; }
                  memcpy_avx(src, dst, 8192);
                  if (!old_avx) atomic { avx = false; }
                Yes, but the AVX registers are not saved when you perform a syscall, in order to reduce the context-switch cost, so it's not just interrupt handlers that are a problem AFAIK. There must be a reason why they went with REP MOV in the 5.6 kernel instead of utilizing AVX like glibc does in userspace.



                • #9
                  Originally posted by F.Ultra View Post

                  Yes, but the AVX registers are not saved when you perform a syscall, in order to reduce the context-switch cost, so it's not just interrupt handlers that are a problem AFAIK. There must be a reason why they went with REP MOV in the 5.6 kernel instead of utilizing AVX like glibc does in userspace.
                  The kernel more often runs cache-cold, so small implementations of string functions are, relatively speaking, more useful in kernel context.

                  One can also argue that the glibc implementations are not as good in real world situations as in microbenchmarks, due to bloated implementations. See e.g. https://people.ucsc.edu/~hlitz/papers/asmdb.pdf



                  • #10
                    Originally posted by F.Ultra View Post
                    Yes, but the AVX registers are not saved when you perform a syscall, in order to reduce the context-switch cost, so it's not just interrupt handlers that are a problem AFAIK. There must be a reason why they went with REP MOV in the 5.6 kernel instead of utilizing AVX like glibc does in userspace.
                    I disagree. If an in-kernel memcpy implementation uses just YMM0, then (in order to preserve user-space YMM register state) it can simply save and restore YMM0 to/from a local stack variable at the beginning and end of the memcpy function.

                    If it is impossible to determine at compile time that the source and destination memory blocks are sufficiently large and properly aligned for AVX memcpy to be profitable, then the kernel should call a generic memcpy function instead.

