Linus Torvalds Takes On A Performance Patch: "I Relax By Playing With Inline Assembly"

Written by Michael Larabel in Linux Kernel on 27 June 2023 at 05:00 PM EDT. 18 Comments
"Some people relax with a nice drink by the pool, I relax by playing around with inline [Assembly code]," as a nice quote of the day as Linus Torvalds explained after he took on improving upon a performance optimization patch that was proposed for the ongoing Linux 6.5 merge window.

As written about a month ago on Phoronix, there's been a patch to provide a big throughput boost and lower the latency for the csum_partial call, a function commonly used within the Linux kernel for checksumming purposes. The csum_partial function is used from file-systems to networking for checksums. With the proposed patch there could be a 8~9% latency improvement and around a 30% throughput improvement in certain cases.

That performance optimization patch for csum_partial was sent in today via the x86/misc pull request. As Linus Torvalds was reviewing the code he commented on the mailing list:
"Honestly, looking at that patch, my reaction is "why did it get unrolled in 64-byte chunks, if 40 bytes is the magic value"?

Particularly when there is then that "do a carry op each 32 bytes to make 32-byte chunks independent and increase ILP". So even the 64-byte case isn't *actuall* doing a 64-byte unrolling, it's really doing two 32-byte unrollings in parallel.

So you have three "magic" values, and the only one that really matters is likely the 40-byte one.

Yes, yes, 64 bytes is the usual cacheline size, and is "traditional" for unrolling. But there's nothing really magical about it here.

End result: wouldn't it have been nice to just do 40-byte chunks, and make the 64-byte "two overlapping 32-byte chunks" be two of the 40-byte chunks.

Something like the (ENTIRELY UNTESTED!) attached patch?

Again: this is *not* tested. I took a quick look at the generated assembly, and it looked roughly like what I expected it to look like, but it may be complete garbage.

I added a couple of "likely()" things just because it made the generated asm look more natural (ie it followed the order of the source code there), they are otherwise questionable annotations.

Finally: did I already mention that this is completely untested?"

After sending that message he realized a simple mistake in his hand-written Assembly code and added:
"...I noticed this by looking at the patch one more time. No actual *testing* has happened. It might still be buggy garbage even with that "+r". It's just a bit *less* buggy garbage.

I will now go back to my cave and continue pulling stuff, I just had to do something else for a while. Some people relax with a nice drink by the pool, I relax by playing around with inline asm."

Linus Torvalds ended up merging the original csum_partial optimization patch while his version undergoes further comment/discussion.

AMD Linux engineer Borislav Petkov who managed the x86/misc pull request replied to Linus Torvalds' message with:
"And there's a third kind who relax by the pool with a nice drink, *while* playing around with inline asm. ;-P"
Related News
About The Author
Michael Larabel

Michael Larabel is the principal author of and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 20,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and automated benchmarking software. He can be followed via Twitter, LinkedIn, or contacted via

Popular News This Week