Linus Torvalds Takes On A Performance Patch: "I Relax By Playing With Inline Assembly"
"Some people relax with a nice drink by the pool, I relax by playing around with inline [Assembly code]," as a nice quote of the day as Linus Torvalds explained after he took on improving upon a performance optimization patch that was proposed for the ongoing Linux 6.5 merge window.
As written about a month ago on Phoronix, there's been a patch to provide a big throughput boost and lower the latency for the csum_partial call, a function commonly used within the Linux kernel for checksumming purposes. The csum_partial function is used from file-systems to networking for checksums. With the proposed patch there could be a 8~9% latency improvement and around a 30% throughput improvement in certain cases.
That performance optimization patch for csum_partial was sent in today via the x86/misc pull request. As Linus Torvalds was reviewing the code he commented on the mailing list:
After sending that message he realized a simple mistake in his hand-written Assembly code and added:
Linus Torvalds ended up merging the original csum_partial optimization patch while his version undergoes further comment/discussion.
AMD Linux engineer Borislav Petkov who managed the x86/misc pull request replied to Linus Torvalds' message with:
As written about a month ago on Phoronix, there's been a patch to provide a big throughput boost and lower the latency for the csum_partial call, a function commonly used within the Linux kernel for checksumming purposes. The csum_partial function is used from file-systems to networking for checksums. With the proposed patch there could be a 8~9% latency improvement and around a 30% throughput improvement in certain cases.
That performance optimization patch for csum_partial was sent in today via the x86/misc pull request. As Linus Torvalds was reviewing the code he commented on the mailing list:
"Honestly, looking at that patch, my reaction is "why did it get unrolled in 64-byte chunks, if 40 bytes is the magic value"?
Particularly when there is then that "do a carry op each 32 bytes to make 32-byte chunks independent and increase ILP". So even the 64-byte case isn't *actuall* doing a 64-byte unrolling, it's really doing two 32-byte unrollings in parallel.
So you have three "magic" values, and the only one that really matters is likely the 40-byte one.
Yes, yes, 64 bytes is the usual cacheline size, and is "traditional" for unrolling. But there's nothing really magical about it here.
End result: wouldn't it have been nice to just do 40-byte chunks, and make the 64-byte "two overlapping 32-byte chunks" be two of the 40-byte chunks.
Something like the (ENTIRELY UNTESTED!) attached patch?
Again: this is *not* tested. I took a quick look at the generated assembly, and it looked roughly like what I expected it to look like, but it may be complete garbage.
I added a couple of "likely()" things just because it made the generated asm look more natural (ie it followed the order of the source code there), they are otherwise questionable annotations.
Finally: did I already mention that this is completely untested?"
After sending that message he realized a simple mistake in his hand-written Assembly code and added:
"...I noticed this by looking at the patch one more time. No actual *testing* has happened. It might still be buggy garbage even with that "+r". It's just a bit *less* buggy garbage.
I will now go back to my cave and continue pulling stuff, I just had to do something else for a while. Some people relax with a nice drink by the pool, I relax by playing around with inline asm."
Linus Torvalds ended up merging the original csum_partial optimization patch while his version undergoes further comment/discussion.
AMD Linux engineer Borislav Petkov who managed the x86/misc pull request replied to Linus Torvalds' message with:
"And there's a third kind who relax by the pool with a nice drink, *while* playing around with inline asm. ;-P"
18 Comments