Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()

  • Phoronix: Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()

    While Intel has offered good Ice Lake support since before the CPUs were shipping (the lone key exception being Thunderbolt support, which took a bit longer but has since been resolved), one feature that has been publicly known since 2017 is the Fast Short REP MOV behavior, and with Linux 5.6 it is finally being put to use for faster memory movements...
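
    For context on what the kernel change amounts to: on CPUs advertising FSRM, a plain REP MOVSB is fast even for short copies, so memmove() can be patched at boot (via the kernel's alternatives mechanism) to use it instead of an unrolled copy loop. Below is a minimal user-space sketch of such a forward copy, assuming GCC/Clang inline assembly on x86-64; it is an illustration only, not the kernel's actual memmove_64.S code, and a real memmove() additionally has to handle overlapping buffers.

    Code:
    #include <stddef.h>

    /* Copy n bytes forward with REP MOVSB.
     * RDI = destination, RSI = source, RCX = byte count;
     * the "memory" clobber tells the compiler the buffers are modified. */
    static void *repmovsb_copy(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
        return ret;
    }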

  • #2
    It would certainly be interesting to see how much of a performance boost this gives.

    • #3
      Is Intel Ice Lake the only CPU supporting this feature?
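
      FSRM is a CPUID feature bit rather than a new instruction (the instruction itself is the long-standing REP MOVSB), so any CPU that sets the bit gets the new code path. It shows up as the "fsrm" flag in /proc/cpuinfo, and a quick user-space check, sketched here with GCC's <cpuid.h>, would look something like:

      Code:
      #include <cpuid.h>
      #include <stdio.h>

      int main(void)
      {
          unsigned int eax, ebx, ecx, edx;

          /* CPUID.(EAX=7, ECX=0):EDX bit 4 reports FSRM (Fast Short REP MOVSB). */
          if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
              printf("FSRM: %s\n", (edx & (1u << 4)) ? "yes" : "no");
          return 0;
      }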

      • #4
        Don't expect a silver bullet... this Stack Overflow question has some good info: https://stackoverflow.com/questions/...vsb-for-memcpy

        • #5
          Originally posted by atomsymbol

          The MOV RAX result of 13260 MB/s, when the 8 KiB block of data fits in the L1D cache, is obsolete because recent x86 CPUs can sustain one 256-bit (32-byte) store per cycle using AVX instructions, which on a 4 GHz CPU yields a theoretical maximum of 4*32 = 128 GB/s.
          But isn't AVX mostly disallowed in the kernel?

          • #6
            Originally posted by atomsymbol



            I think there are several points worth mentioning:
            1. With memcpy implemented on x86-64 simply by using the RAX register, one iteration of the memcpy loop has 5 instructions of user code, which translates to at most 4 µops in the µop cache because the CMP+Jcc pair is fused into a single µop. If the loop is properly aligned to maximize fetch & decode bandwidth, Ryzen 3000 can fetch, decode and execute the 5 user instructions in a single clock. A /usr/bin/perf stat measurement shows that on a Ryzen 3000 CPU the real-world IPC is 4.9. With the CPU core running at 4.3 GHz, and about 8 bytes copied per clock cycle, the measured bandwidth is 33.7 GB/s if both the source and destination ranges fit in the L1D cache. If the data is in the L2 cache, the bandwidth is 24.5 GB/s.
            2. Intel Ice Lake desktop CPUs (expected to be released this year) have 2 L1D load ports and 2 L1D store ports, which in theory enables the CPU to copy 2*8 = 16 bytes per clock cycle via the RAX register, or about 78 GB/s at 5 GHz. However, in this case one memcpy iteration has 7 user instructions (27 user bytes, 6 µops), and it is uncertain whether Ice Lake will be able to sustain processing 6 µops per cycle in this case.
            3. With memcpy implemented on x86-64 by using the YMM0 register, Ryzen 3000's IPC is 4.6. At 4.3 GHz this translates to 126 GB/s of L1D bandwidth and 69 GB/s of L2 bandwidth.
            4. In any properly implemented, optimized operating-system kernel running on x86-64, enabling AVX is simply a matter of flipping a single bit indicating that, in case of an interrupt, the interrupt handler has to save the AVX registers:
              Code:
              atomic { bool old_avx = avx; avx = true; }
              memcpy_avx(src, dst, 8192);
              if (!old_avx) atomic { avx = false; }
            Yes, but the AVX registers are not saved when you perform a syscall, in order to reduce the context-switch cost, so it's not just interrupt handlers that are a problem, AFAIK. There must be a reason why they went with REP MOVS in the 5.6 kernel instead of utilizing AVX like glibc does in user space.
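
            For reference, the in-kernel equivalent of the pseudocode quoted above is the kernel_fpu_begin()/kernel_fpu_end() pair, which saves and restores the FPU/SSE/AVX state around any in-kernel SIMD use. A rough sketch (memcpy_avx() is hypothetical, not an existing kernel function):

            Code:
            #include <asm/fpu/api.h>     /* kernel_fpu_begin(), kernel_fpu_end(), irq_fpu_usable() */
            #include <linux/string.h>

            static void copy_with_avx(void *dst, const void *src, size_t len)
            {
                if (!irq_fpu_usable()) {     /* SIMD state cannot be touched in this context */
                    memcpy(dst, src, len);
                    return;
                }

                kernel_fpu_begin();          /* save FPU/AVX state, disable preemption */
                memcpy_avx(dst, src, len);   /* hypothetical AVX copy loop */
                kernel_fpu_end();            /* restore state, re-enable preemption */
            }

            That save/restore (plus the preemption-disabled window) is overhead that REP MOVS avoids entirely, since it touches no SIMD state, which is a plausible reason the kernel went that way.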

            • #7
              Originally posted by F.Ultra

              Yes, but the AVX registers are not saved when you perform a syscall, in order to reduce the context-switch cost, so it's not just interrupt handlers that are a problem, AFAIK. There must be a reason why they went with REP MOVS in the 5.6 kernel instead of utilizing AVX like glibc does in user space.
              The kernel more often runs cache-cold, so small implementations of string functions are, relatively speaking, more useful in kernel context.

              One can also argue that the glibc implementations are not as good in real-world situations as they are in microbenchmarks, due to their bloated code. See e.g. https://people.ucsc.edu/~hlitz/papers/asmdb.pdf
