An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

  • An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

    Phoronix: An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

    Borislav Petkov has set out to improve the Linux kernel's memset function, an area previously criticized by Linus Torvalds and other prominent developers...


  • #2
    Typo:

    Originally posted by phoronix View Post
    Phoronix: An Improved Linux MEMSET Is Being Tackled For Possibly Better Performance

    Borislav Petkov has taken to improve the Linux kernel's memset function with it being an area previously criticzed by Linus Torvalds and other prominent developers...

    http://www.phoronix.com/scan.php?pag...-Better-MEMSET

    Comment


    • #3
      Perhaps it's premature, but I wonder if they could use the new movdiri or movdir64b instructions, introduced in Intel's Tremont core and Tiger Lake CPUs.

      https://www.phoronix.com/scan.php?pa...t-Instructions
      https://fuse.wikichip.org/news/1158/...lus-successor/

      It's hard to find good information on them, but I think they should avoid the write-miss penalty inherent in copy-back caches.
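      A minimal detection sketch for these instructions, assuming the CPUID feature bits documented in Intel's SDM (MOVDIRI is leaf 7/subleaf 0, ECX bit 27; MOVDIR64B is ECX bit 28); compiles with gcc/clang on x86:

      ```c
      /* Sketch: detect MOVDIRI/MOVDIR64B support via CPUID leaf 7.
       * Bit positions are taken from the Intel SDM; x86 only. */
      #include <stdio.h>
      #include <cpuid.h>

      int main(void)
      {
          unsigned int eax, ebx, ecx, edx;

          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
              puts("CPUID leaf 7 not supported");
              return 0;
          }
          printf("MOVDIRI:   %s\n", (ecx & (1u << 27)) ? "yes" : "no");
          printf("MOVDIR64B: %s\n", (ecx & (1u << 28)) ? "yes" : "no");
          return 0;
      }
      ```

      On anything older than Tremont/Tiger Lake both lines should report "no".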

      Comment


      • #4
        BTW, if anyone is wondering about ERMS, this seems like a pretty good discussion of the subject:

        I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy. ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and ...


        You can see if your CPU has it by grepping for erms in /proc/cpuinfo. It was introduced in Ivy Bridge and is documented here (search for "Enhanced REP MOVSB and STOSB Operation"):
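        The same check can also be done directly with CPUID instead of grepping /proc/cpuinfo; a small sketch assuming the bit position from the SDM (ERMS is CPUID leaf 7, subleaf 0, EBX bit 9):

        ```c
        /* Sketch: the CPUID equivalent of `grep erms /proc/cpuinfo`.
         * ERMS is CPUID.(EAX=7,ECX=0):EBX bit 9 per the Intel SDM; x86 only. */
        #include <stdio.h>
        #include <cpuid.h>

        int main(void)
        {
            unsigned int eax, ebx, ecx, edx;

            if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 9)))
                puts("ERMS: supported");
            else
                puts("ERMS: not supported");
            return 0;
        }
        ```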

        Comment


        • #5
          I think this is something glibc optimizes more than the kernel, so how does it compare to their implementation?

          Comment


          • #6
            As much as I love optimizations, I get mixed, icky feelings when beautiful kernel code (arch-specific or not) is tainted with way too much hardware knowledge.
            There is a reason we have a good compiler, and a good compiler plus good userspace libraries, to abstract horrid and fiddly, sensitive stuff like this.

            Comment


            • #7
              Originally posted by milkylainen View Post
              As much as I love optimizations, I get mixed, icky feelings when beautiful kernel code (arch-specific or not) is tainted with way too much hardware knowledge.
              There is a reason we have a good compiler, and a good compiler plus good userspace libraries, to abstract horrid and fiddly, sensitive stuff like this.
              I see your point, but I think this is just a step closer to JIT compilation for the native CPU. That would probably be the best of both worlds - keep your nice code and make good & efficient use of the native hardware.

              I'm not saying Linux will ever get there, but perhaps some alternative kernel will.

              Comment


              • #8
                Originally posted by carewolf View Post
                I think this is something glibc optimizes more than the kernel, so how does it compare to their implementation?
                glibc has ERMS-optimized string and memory copy routines.


                I ran the test program in [1], and what's interesting is that my 6-year-old Haswell is over twice as fast as a Zen+ 2700X
                (e.g. for 512 bytes: 63/75/121 for Haswell vs. 154/174/188 for Zen+). Clang produces slightly faster code than GCC for the 2700X.

                Anyone with a Ryzen 3000 interested in the code at [1]? I'm curious whether Zen 2 has improved there.


                [1] https://lkml.org/lkml/2019/9/13/807

                Comment


                • #9
                  Originally posted by mlau View Post
                  I ran the test program in [1]
                  Meh, he's sampling rdtsc twice per iteration (with mfence) and not measuring how much overhead that adds. Sure, he starts at size 0, but that still calls the code under test, even if it has no work to do.

                  My preferred way to benchmark small operations is to exponentially increase the number of iterations until the loop time just fits within an OS timeslice, and then repeat that number several times and take the minimum. I also subtract off the overhead of sampling the timer. In this way, I can quickly get cycle-accurate timings without too much concern about the system being completely idle.
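                  As a rough illustration of that methodology (this is not code from the thread; clock_gettime() stands in for rdtsc for portability, and the 2 ms budget and 5 repeats are arbitrary choices of mine):

                  ```c
                  /* Sketch: grow the iteration count until one batch fills a
                   * time budget, repeat the batch several times keeping the
                   * minimum, and subtract the cost of reading the clock. */
                  #include <stdio.h>
                  #include <string.h>
                  #include <time.h>

                  static long long now_ns(void)
                  {
                      struct timespec ts;
                      clock_gettime(CLOCK_MONOTONIC, &ts);
                      return ts.tv_sec * 1000000000LL + ts.tv_nsec;
                  }

                  static char buf[512];

                  int main(void)
                  {
                      const long long budget = 2000000;   /* ~2 ms per batch */
                      long long iters = 1;

                      /* 1. Exponentially grow the batch until it fills the budget. */
                      for (;;) {
                          long long t0 = now_ns();
                          for (long long i = 0; i < iters; i++)
                              memset(buf, (int)i, sizeof(buf));
                          if (now_ns() - t0 >= budget)
                              break;
                          iters *= 2;
                      }

                      /* 2. Measure the overhead of sampling the timer itself. */
                      long long t0 = now_ns();
                      long long overhead = now_ns() - t0;

                      /* 3. Repeat the batch several times and keep the minimum. */
                      long long best = -1;
                      for (int r = 0; r < 5; r++) {
                          long long t = now_ns();
                          for (long long i = 0; i < iters; i++)
                              memset(buf, (int)i, sizeof(buf));
                          t = now_ns() - t - overhead;
                          if (best < 0 || t < best)
                              best = t;
                      }

                      printf("memset(512): %.2f ns/op over %lld iterations\n",
                             (double)best / (double)iters, iters);
                      return 0;
                  }
                  ```

                  The minimum (rather than the mean) filters out timeslice preemption and other one-off noise, which is the point of the approach described above.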

                  Anyway, what are your timings for size 0? We ought to know how much rdtsc_ordered() might be contributing to that discrepancy.


                  Originally posted by mlau View Post
                  what's interesting is that my 6-year-old Haswell is over twice as fast as a Zen+ 2700X
                  I wonder if ERMS is patented. Check /proc/cpuinfo - does your 2700X list the erms flag?

                  Originally posted by mlau View Post
                  for 512 bytes: 63/75/121 for Haswell vs. 154/174/188 for Zen+
                  BTW, for those who didn't bother to read the linked benchmark, the numbers correspond to:
                  __builtin_memset()/rep_stosb()/memset_rep()

                  Comment


                  • #10
                    The fence is there to ensure that the TSC count is taken only after previous stores have retired to memory; the benchmark tests how many cycles it takes to store X bytes into the memory hierarchy.
                    Count 0 illustrates the setup costs of the various methods. IIRC, ERMS has a relatively high startup cost, so it only makes sense for larger amounts.

                    I don't think ERMS is patented; from my understanding it's a microcode optimization of the common "rep; stosb" (repeat store byte/word/long) patterns to take
                    advantage of the wide internal datapaths. I.e. you can write "mov ecx, 1000h; rep; stosb" and the microcode will optimize that into 128 256-bit stores almost for free.
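                    A sketch of that pattern as GNU C inline assembly (x86-64 only; the register bindings follow the stosb convention: RDI = destination, AL = fill byte, RCX = count):

                    ```c
                    /* Sketch: memset via "rep stosb", the pattern ERMS-capable
                     * microcode turns into wide internal stores.  x86-64 only. */
                    #include <assert.h>
                    #include <stddef.h>
                    #include <stdio.h>

                    static void memset_rep_stosb(void *dst, int c, size_t n)
                    {
                        asm volatile("rep stosb"
                                     : "+D"(dst), "+c"(n)   /* RDI = dst, RCX = count */
                                     : "a"(c)               /* AL  = fill byte */
                                     : "memory");
                    }

                    int main(void)
                    {
                        static char buf[0x1000];            /* the 1000h from the example above */

                        memset_rep_stosb(buf, 0xAA, sizeof(buf));
                        for (size_t i = 0; i < sizeof(buf); i++)
                            assert((unsigned char)buf[i] == 0xAA);
                        puts("rep stosb: 0x1000 bytes filled");
                        return 0;
                    }
                    ```

                    Whether the microcode fast path actually kicks in depends on the CPU (hence the erms flag), but the instruction sequence itself works on any x86-64 machine.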

                    Comment
