Linus Torvalds Takes On A Performance Patch: "I Relax By Playing With Inline Assembly"

Written by Michael Larabel in Linux Kernel on 27 June 2023 at 05:00 PM EDT. 18 Comments

"Some people relax with a nice drink by the pool, I relax by playing around with inline [Assembly code]," as a nice quote of the day as Linus Torvalds explained after he took on improving upon a performance optimization patch that was proposed for the ongoing Linux 6.5 merge window.

As written about a month ago on Phoronix, there's been a patch to provide a big throughput boost and lower the latency for the csum_partial call, a function commonly used within the Linux kernel for checksumming purposes. The csum_partial function is used from file-systems to networking for checksums. With the proposed patch there could be a 8~9% latency improvement and around a 30% throughput improvement in certain cases.

That performance optimization patch for csum_partial was sent in today via the x86/misc pull request. As Linus Torvalds was reviewing the code he commented on the mailing list:

"Honestly, looking at that patch, my reaction is "why did it get unrolled in 64-byte chunks, if 40 bytes is the magic value"?

Particularly when there is then that "do a carry op each 32 bytes to make 32-byte chunks independent and increase ILP". So even the 64-byte case isn't *actuall* doing a 64-byte unrolling, it's really doing two 32-byte unrollings in parallel.

So you have three "magic" values, and the only one that really matters is likely the 40-byte one.

Yes, yes, 64 bytes is the usual cacheline size, and is "traditional" for unrolling. But there's nothing really magical about it here.

End result: wouldn't it have been nice to just do 40-byte chunks, and make the 64-byte "two overlapping 32-byte chunks" be two of the 40-byte chunks.

Something like the (ENTIRELY UNTESTED!) attached patch?

Again: this is *not* tested. I took a quick look at the generated assembly, and it looked roughly like what I expected it to look like, but it may be complete garbage.

I added a couple of "likely()" things just because it made the generated asm look more natural (ie it followed the order of the source code there), they are otherwise questionable annotations.

Finally: did I already mention that this is completely untested?"

After sending that message he realized a simple mistake in his hand-written Assembly code and added:

"...I noticed this by looking at the patch one more time. No actual *testing* has happened. It might still be buggy garbage even with that "+r". It's just a bit *less* buggy garbage.

I will now go back to my cave and continue pulling stuff, I just had to do something else for a while. Some people relax with a nice drink by the pool, I relax by playing around with inline asm."

Linus Torvalds ended up merging the original csum_partial optimization patch while his version undergoes further comment/discussion.

AMD Linux engineer Borislav Petkov who managed the x86/misc pull request replied to Linus Torvalds' message with:

"And there's a third kind who relax by the pool with a nice drink, *while* playing around with inline asm. ;-P"

18 Comments