Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()

  • Phoronix: Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()

    While Intel has offered good Ice Lake support since before the CPUs were shipping (the lone key exception being Thunderbolt support, which took a bit longer but has since been resolved), one feature that has been publicly known since 2017 is the Fast Short REP MOV behavior, and with Linux 5.6 it is finally being put to use for faster memory movements...
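
    For context on what the kernel change amounts to: on CPUs advertising FSRM, a plain REP MOVSB is fast even for short copies, so memmove() can be patched at boot (via the kernel's alternatives mechanism) to use it instead of an unrolled copy loop. Below is a minimal user-space sketch of such a forward copy, assuming GCC/Clang inline assembly on x86-64; it is an illustration only, not the kernel's actual memmove_64.S code, and a real memmove() additionally has to handle overlapping buffers.

    Code:
    #include <stddef.h>

    /* Copy n bytes forward with REP MOVSB.
     * RDI = destination, RSI = source, RCX = byte count;
     * the "memory" clobber tells the compiler the buffers are modified. */
    static void *repmovsb_copy(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
        return ret;
    }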

  • #2
    It would certainly be interesting to see how much of a performance boost this gives.

    • #3
      Is Intel Ice Lake the only CPU supporting this feature?
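
      FSRM is a CPUID feature bit rather than a new instruction (the instruction itself is the long-standing REP MOVSB), so any CPU that sets the bit gets the new code path. It shows up as the "fsrm" flag in /proc/cpuinfo, and a quick user-space check, sketched here with GCC's <cpuid.h>, would look something like:

      Code:
      #include <cpuid.h>
      #include <stdio.h>

      int main(void)
      {
          unsigned int eax, ebx, ecx, edx;

          /* CPUID.(EAX=7, ECX=0):EDX bit 4 reports FSRM (Fast Short REP MOVSB). */
          if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
              printf("FSRM: %s\n", (edx & (1u << 4)) ? "yes" : "no");
          return 0;
      }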

      • #4
        Don't expect a silver bullet... this Stack Overflow question has some good info: https://stackoverflow.com/questions/...vsb-for-memcpy

        • #5
          Originally posted by atomsymbol

          The MOV RAX result of 13260 MB/s, when the 8 KiB block of data fits in the L1D cache, is obsolete because recent x86 CPUs can sustain one 256-bit (32-byte) store per cycle using AVX instructions, which on a 4 GHz CPU yields a theoretical maximum of 4*32 = 128 GB/s.
          But isn't AVX mostly disallowed in the kernel?

          • #6
            Originally posted by atomsymbol



            I think there are several points worth mentioning:
            1. With memcpy implemented on x86-64 simply by using the RAX register, one iteration of the memcpy loop has 5 instructions of user code, which translates to at most 4 µops in the µop cache because the CMP+Jcc pair is fused into a single µop. If the loop is properly aligned to maximize fetch & decode bandwidth, Ryzen 3000 can fetch, decode and execute the 5 user instructions in a single clock. A /usr/bin/perf stat measurement shows that on a Ryzen 3000 CPU the real-world IPC is 4.9. With the CPU core running at 4.3 GHz, and about 8 bytes copied per clock cycle, the measured bandwidth is 33.7 GB/s if both the source and destination ranges fit in the L1D cache. If the data is in the L2 cache, the bandwidth is 24.5 GB/s.
            2. Intel Ice Lake desktop CPUs (expected to be released this year) have 2 L1D load ports and 2 L1D store ports, which in theory enables the CPU to copy 2*8 = 16 bytes per clock cycle via the RAX register, or about 78 GB/s at 5 GHz. However, in this case one memcpy iteration has 7 user instructions (27 user bytes, 6 µops), and it is uncertain whether Ice Lake will be able to sustain processing 6 µops per cycle in this case.
            3. With memcpy implemented on x86-64 by using the YMM0 register, Ryzen 3000's IPC is 4.6. At 4.3 GHz this translates to 126 GB/s of L1D bandwidth and 69 GB/s of L2 bandwidth.
            4. In any properly implemented, optimized operating-system kernel running on x86-64, enabling AVX is simply a matter of flipping a single bit indicating that, in case of an interrupt, the interrupt handler has to save the AVX registers:
              Code:
              atomic { bool old_avx = avx; avx = true; }
              memcpy_avx(src, dst, 8192);
              if (!old_avx) atomic { avx = false; }
            Yes, but the AVX registers are not saved when you perform a syscall, in order to reduce the context-switch cost, so it's not just interrupt handlers that are a problem, AFAIK. There must be a reason why they went with REP MOVS in the 5.6 kernel instead of utilizing AVX like glibc does in user space.
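
            For reference, the in-kernel equivalent of the pseudocode quoted above is the kernel_fpu_begin()/kernel_fpu_end() pair, which saves and restores the FPU/SSE/AVX state around any in-kernel SIMD use. A rough sketch (memcpy_avx() is hypothetical, not an existing kernel function):

            Code:
            #include <asm/fpu/api.h>     /* kernel_fpu_begin(), kernel_fpu_end(), irq_fpu_usable() */
            #include <linux/string.h>

            static void copy_with_avx(void *dst, const void *src, size_t len)
            {
                if (!irq_fpu_usable()) {     /* SIMD state cannot be touched in this context */
                    memcpy(dst, src, len);
                    return;
                }

                kernel_fpu_begin();          /* save FPU/AVX state, disable preemption */
                memcpy_avx(dst, src, len);   /* hypothetical AVX copy loop */
                kernel_fpu_end();            /* restore state, re-enable preemption */
            }

            That save/restore (plus the preemption-disabled window) is overhead that REP MOVS avoids entirely, since it touches no SIMD state, which is a plausible reason the kernel went that way.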

            • #7
              Originally posted by F.Ultra

              Yes, but the AVX registers are not saved when you perform a syscall, in order to reduce the context-switch cost, so it's not just interrupt handlers that are a problem, AFAIK. There must be a reason why they went with REP MOVS in the 5.6 kernel instead of utilizing AVX like glibc does in user space.
              The kernel more often runs cache-cold, so small implementations of string functions are, relatively speaking, more useful in kernel context.

              One can also argue that the glibc implementations are not as good in real-world situations as they are in microbenchmarks, due to their bloated code. See e.g. https://people.ucsc.edu/~hlitz/papers/asmdb.pdf
