Big Throughput Boost & Lower Latency With New Patch For Linux Checksum Function
Queued ahead of the Linux 6.5 cycle, which kicks off in about one month, is a new Linux x86 optimization patch that further tunes csum_partial, the function used within the kernel for calculating 32-bit checksums on blocks of data. The newly-optimized csum_partial shows much lower latency and higher throughput on the latest Intel/AMD processors.
The csum_partial function is used throughout the kernel from networking to file-systems for check-summing purposes. A new patch now queued in tip/tip.git is improving the performance of the x86/x86_64 csum_partial implementation. Developer Noah Goldstein noted in the patch:
x86/csum: Improve performance of `csum_partial`
1) Add special case for len == 40 as that is the hottest value. This nets a ~8-9% latency improvement and a ~30% throughput improvement in the len == 40 case.
2) Use multiple accumulators in the 64-byte loop. This dramatically improves ILP and results in up to a 40% latency/throughput improvement (better for more iterations).
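To make the two changes more concrete, here is a minimal user-space sketch in portable C. It is not the kernel's code: the real x86 implementation uses add-with-carry instructions via inline assembly and handles alignment and odd tail lengths. The function names (csum_single_sketch, csum_multi_sketch) and the restriction to lengths that are a multiple of 8 bytes are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Ones' complement add: the carry out of bit 63 is wrapped back into bit 0. */
static inline uint64_t add64_carry(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return s + (s < a);
}

/* Fold a 64-bit ones' complement sum down to 32 bits. */
static inline uint32_t fold64(uint64_t s)
{
    s = (s & 0xffffffffu) + (s >> 32);
    s = (s & 0xffffffffu) + (s >> 32);
    return (uint32_t)s;
}

/* Baseline: one accumulator, so every add waits on the previous one.
 * Assumes len is a multiple of 8 (illustrative restriction). */
uint32_t csum_single_sketch(const unsigned char *buf, size_t len)
{
    uint64_t acc = 0;
    for (size_t i = 0; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);
        acc = add64_carry(acc, w);
    }
    return fold64(acc);
}

/* Sketch of both ideas: a straight-line path for len == 40 and four
 * independent accumulators in the 64-byte loop. Same length assumption. */
uint32_t csum_multi_sketch(const unsigned char *buf, size_t len)
{
    uint64_t w[8];

    if (len == 40) {
        /* Hot case: five 8-byte words, no loop, two short add chains. */
        memcpy(w, buf, 40);
        uint64_t x = add64_carry(w[0], w[1]);
        uint64_t y = add64_carry(w[2], w[3]);
        return fold64(add64_carry(add64_carry(x, y), w[4]));
    }

    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 64 <= len; i += 64) {
        memcpy(w, buf + i, 64);
        /* Four separate dependency chains that the CPU can overlap. */
        a0 = add64_carry(add64_carry(a0, w[0]), w[4]);
        a1 = add64_carry(add64_carry(a1, w[1]), w[5]);
        a2 = add64_carry(add64_carry(a2, w[2]), w[6]);
        a3 = add64_carry(add64_carry(a3, w[3]), w[7]);
    }

    /* Merge the accumulators, then sum any remaining 8-byte words. */
    uint64_t acc = add64_carry(add64_carry(a0, a1), add64_carry(a2, a3));
    for (; i + 8 <= len; i += 8) {
        uint64_t t;
        memcpy(&t, buf + i, 8);
        acc = add64_carry(acc, t);
    }
    return fold64(acc);
}
```

Because ones' complement addition is associative and commutative, both functions return the same checksum for the same input. Only the shape of the dependency chain differs, which is where the instruction-level parallelism gain described in the patch comes from.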
The patch is queued up into TIP's x86/misc branch until the Linux 6.5 merge window gets underway. It's always a joy to see the never-ending performance optimizations to the Linux kernel.