Significant CRC32C Throughput Optimization On The Way To The Linux Kernel

Written by Michael Larabel in Linux Kernel on 23 October 2024 at 09:56 AM EDT. 22 Comments

Google engineer Eric Biggers has contributed many performance optimizations to the crypto code within the Linux kernel over the years, including faster AES-GCM for Intel and AMD CPUs and much faster AES-XTS disk/file encryption on modern CPUs. His latest work improves CRC32C performance on x86/x86_64 processors.

Biggers has patches pending that eliminate the jump table and excessive unrolling in the CRC32C assembly code used on modern Intel/AMD processors. He explains in this patch within his crypto-pending branch:

"crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully unrolled and uses a jump table to jump into the correct location. This optimization is misguided, as it bloats the binary code size and introduces an indirect call. x86_64 CPUs can predict loops well, so it is fine to just use a loop instead. Loop bookkeeping instructions can compete with the crc instructions for the ALUs, but this is easily mitigated by unrolling the loop by a smaller amount, such as 4 times.

Therefore, re-roll the loop and make related tweaks to the code.

This reduces the binary code size of crc_pclmul() from 4546 bytes to 418 bytes, a 91% reduction. In general it also makes the code faster, with some large improvements seen when retpoline is enabled."
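
To make the re-rolling idea concrete, here is a minimal C sketch of a CRC32C loop that is unrolled only 4 times, with a short tail loop for the leftover bytes. It is not the kernel's code: the real routine is hand-written assembly built around hardware instructions, whereas this sketch uses a portable bit-at-a-time update as a stand-in. The names crc32c_byte(), crc32c_rolled(), crc32c_unrolled4() and crc32c() are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the CRC32C (Castagnoli) polynomial. */
#define CRC32C_POLY 0x82F63B78u

/* Fold one byte into the CRC, one bit at a time. This is a portable
 * stand-in (an assumption of this sketch) for a hardware crc instruction. */
static uint32_t crc32c_byte(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1u)));
    return crc;
}

/* Fully rolled reference loop: one update per iteration. */
uint32_t crc32c_rolled(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--)
        crc = crc32c_byte(crc, *p++);
    return crc;
}

/* Same computation with the loop unrolled 4 times, so the pointer and
 * counter bookkeeping is paid once per four updates instead of once per
 * update. A short tail loop handles the remaining 0 to 3 bytes. */
uint32_t crc32c_unrolled4(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len >= 4) {
        crc = crc32c_byte(crc, p[0]);
        crc = crc32c_byte(crc, p[1]);
        crc = crc32c_byte(crc, p[2]);
        crc = crc32c_byte(crc, p[3]);
        p += 4;
        len -= 4;
    }
    while (len--)
        crc = crc32c_byte(crc, *p++);
    return crc;
}

/* Conventional CRC32C wrapper: all-ones initial value and final inversion. */
uint32_t crc32c(const void *data, size_t len)
{
    return crc32c_unrolled4(0xFFFFFFFFu, (const uint8_t *)data, len) ^ 0xFFFFFFFFu;
}
```

Calling crc32c("123456789", 9) yields the standard CRC32C check value 0xE3069283. The point of the sketch is the loop shape: a modest unroll factor keeps the code small and needs no jump table or indirect branch to enter the loop at the right spot.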

With Retpoline enabled (the default state for Intel and AMD CPUs), throughput improves by as much as 66% on Intel Emerald Rapids and by as much as 29% on AMD Zen 2. Some really nice wins:

[Image: Linux faster CRC32C benchmarks]

Hopefully this new code will be buttoned up in time for the upcoming Linux v6.13 kernel cycle, boosting CRC32C kernel crypto performance on modern Intel and AMD processors.
