AES-NI XTS To See 2~3x Performance Recovery After Regressing Hard From Retpolines
It turns out the AES-NI implementation of XTS used on Intel and AMD CPUs regressed hard from the Retpolines functionality merged nearly three years ago for mitigating Spectre... But crypto performance with the AES-NI XTS implementation is now set to recover from that regression, with a huge improvement thanks to a new set of patches.
It seems the hard AES-NI XTS performance regression from Retpolines went unnoticed when Spectre was being mitigated. It happened because the code makes extensive use of indirect calls when processing small quantities of data. Thankfully, Ard Biesheuvel investigated it and worked out a backport-friendly fix that addresses most of the regression. For future kernel releases there is also a rewritten XTS implementation that is more flexible and avoids the nasty issues that led to the poor performance under Retpolines in the first place.
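To illustrate the call pattern at issue, here is a minimal user-space sketch in C. It is not the kernel's AES-NI code: `toy_encrypt_block` is a hypothetical XOR stand-in for a real block cipher, and `encrypt_indirect` / `encrypt_direct` are names made up for this example. The first helper makes one call through a function pointer per 16-byte block, which is the kind of indirect call a Retpoline-enabled build makes more expensive; the second reaches the same result with direct calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Hypothetical stand-in for a block cipher: XOR with the key. NOT AES. */
void toy_encrypt_block(const uint8_t *key, uint8_t *dst, const uint8_t *src)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        dst[i] = src[i] ^ key[i];
}

typedef void (*block_fn_t)(const uint8_t *key, uint8_t *dst,
                           const uint8_t *src);

/*
 * Indirect-call pattern: every 16-byte block goes through a function
 * pointer. With retpolines, each such call is routed through a thunk
 * instead of a plain indirect branch.
 */
void encrypt_indirect(block_fn_t fn, const uint8_t *key, uint8_t *dst,
                      const uint8_t *src, size_t len)
{
    assert(len % BLOCK_SIZE == 0);
    for (size_t off = 0; off < len; off += BLOCK_SIZE)
        fn(key, dst + off, src + off);
}

/*
 * Direct-call pattern: the callee is known at compile time, so no
 * indirect branch is needed in the per-block loop.
 */
void encrypt_direct(const uint8_t *key, uint8_t *dst,
                    const uint8_t *src, size_t len)
{
    assert(len % BLOCK_SIZE == 0);
    for (size_t off = 0; off < len; off += BLOCK_SIZE)
        toy_encrypt_block(key, dst + off, src + off);
}
```

Both helpers produce identical output for the same key and input; only the way the per-block routine is reached differs. An optimizing compiler may turn the function-pointer call into a direct one in a toy like this, so the sketch shows the shape of the problem rather than serving as a benchmark.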
In the end, the new patches improve performance by roughly 2x for 1k and 4k blocks, and by about 3x for 1k blocks that require ciphertext stealing.
Linux crypto expert Eric Biggers of Google commented in response, "Thanks for doing this! I didn't realize that there was such a big performance regression here. Getting rid of these indirect calls looks like the right approach; this all seems to have been written for a world where indirect calls are much faster... I did some quick benchmarks on Zen ("AMD Ryzen Threadripper 1950X 16-Core Processor") with CONFIG_RETPOLINE=y and confirmed the speedup on 4096-byte blocks is around 2x there too. (It's over 2x for AES-128-XTS and AES-192-XTS, and a bit under 2x for AES-256-XTS. And most of the speedup comes from the first patch.) Also, the extra self-tests are passing."
Hopefully the backport-friendly patch will work its way to the stable branches soon, as that alone already provides a hefty speed-up for AES-NI XTS on both Intel and AMD systems with Retpoline-enabled kernels.