Intel's Zswap IAA Compress Batching Work Is Very Interesting For Linux Performance

Written by Michael Larabel in Intel on 13 November 2024 at 08:21 AM EST.
The Intel In-Memory Analytics Accelerator (IAA), found in various Xeon SKUs since Sapphire Rapids, could be of big benefit to Linux servers and workstations thanks to a Linux kernel patch series that has been in the works to provide Zswap IAA compress batching.

The accelerator blocks found within recent generations of Intel Xeon processors have so far seen only limited, niche use, given the initial lack of broad software support for them. Over time we've seen more software adopt IAA and the other accelerators, including the Linux kernel itself. One of the patch series I've been eagerly monitoring is Intel's zswap IAA compress batching work, which uses the Intel In-Memory Analytics Accelerator for parallel compression of the pages in large folios.
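
To make the idea of batching within a large folio more concrete, here is a minimal sketch. It assumes a 4 KiB base page and uses Python's zlib with a thread pool as a stand-in for IAA's parallel hardware compression engines. The function names (split_folio, compress_sequential, compress_batched) are hypothetical illustrations, not kernel or iaa_crypto interfaces; in the actual patches this work happens inside zswap_store() through the kernel's iaa_crypto driver.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def split_folio(folio: bytes) -> list[bytes]:
    """Split a large folio (a contiguous multi-page buffer) into 4 KiB pages."""
    if len(folio) % PAGE_SIZE:
        raise ValueError("folio size must be a multiple of PAGE_SIZE")
    return [folio[i:i + PAGE_SIZE] for i in range(0, len(folio), PAGE_SIZE)]


def compress_sequential(folio: bytes) -> list[bytes]:
    """Baseline: compress each page of the folio one after another."""
    return [zlib.compress(page) for page in split_folio(folio)]


def compress_batched(folio: bytes, engines: int = 8) -> list[bytes]:
    """Submit every page of the folio at once and collect results in order.

    The thread pool is only a stand-in for an accelerator's parallel
    compression engines; it is NOT the IAA or iaa_crypto API.
    """
    pages = split_folio(folio)
    with ThreadPoolExecutor(max_workers=engines) as pool:
        # map() preserves input order, so page i maps to result i
        return list(pool.map(zlib.compress, pages))


if __name__ == "__main__":
    # A 64 KiB folio made of 16 pages of repeating test data
    folio = bytes(range(256)) * (16 * PAGE_SIZE // 256)
    batch = compress_batched(folio)
    assert batch == compress_sequential(folio)
    print(len(batch), "pages compressed")
```

Both paths produce identical compressed pages. The batched version simply submits all of a folio's pages together and waits for the whole set, which is the point where a multi-engine accelerator can overlap the work instead of serializing it.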

Benchmarks by Intel engineers of this Zswap IAA compress batching have shown extremely promising results with the latest Linux kernel code on IAA-enabled Xeon processors:

[Figures: Intel IAA benchmarks (three benchmark charts)]

Sent out last week were the v3 patches for using the IAA accelerators for parallel compression of pages in large folios. The performance summary there is:
"The performance testing data with usemem 30 processes and kernel compilation test show throughput gains and elapsed/sys time reduction with zswap_store() large folios using IAA compress batching.

The iaa_crypto wq stats will show almost the same number of compress calls for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively. We see a latency reduction of 2.5% by distributing compress jobs among all IAA devices on the socket (based on v1 data).

We can expect to see even more significant performance and throughput improvements if we use the parallelism offered by IAA to batch compress the pages comprising a batch of 4K (really any-order) folios, not just batching within large folios. This is the reclaim batching patch 13 in v1, which will be submitted in a separate patch-series.

Our internal validation of IAA compress/decompress batching in highly contended Sapphire Rapids server setups with workloads running on 72 cores for ~25 minutes under stringent memory limit constraints have shown up to 50% reduction in sys time and 3.5% reduction in workload run time as compared to software compressors."
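
The work-queue usage described in that summary (compress jobs on each device's wq.1, decompress jobs handled exclusively by wq.0, and compress jobs spread across all IAA devices on the socket) can be illustrated with a toy scheduler. This is a hedged sketch: the IaaSocketModel class, its round-robin policy, and the caller-chosen decompress device are hypothetical and not taken from the iaa_crypto driver. Note that plain round-robin yields exactly equal per-device counts, whereas the patch summary reports only "almost the same number" of compress calls.

```python
from collections import Counter
from itertools import cycle


class IaaSocketModel:
    """Toy model of routing jobs to the IAA devices on one socket.

    Hypothetical: each device has two work queues; wq.1 takes compress
    jobs and wq.0 takes decompress jobs only. Round-robin placement of
    compress jobs is an illustrative policy, not the driver's logic.
    """

    COMPRESS_WQ = 1
    DECOMPRESS_WQ = 0

    def __init__(self, num_devices: int):
        if num_devices < 1:
            raise ValueError("need at least one IAA device")
        self.num_devices = num_devices
        self._next_device = cycle(range(num_devices))
        self.stats = Counter()  # (device, wq) -> number of calls

    def submit_compress(self) -> tuple[int, int]:
        """Send a compress job to the next device's compress queue (wq.1)."""
        dev = next(self._next_device)
        self.stats[(dev, self.COMPRESS_WQ)] += 1
        return dev, self.COMPRESS_WQ

    def submit_decompress(self, dev: int) -> tuple[int, int]:
        """Send a decompress job to the caller-chosen device's wq.0."""
        if not 0 <= dev < self.num_devices:
            raise ValueError("unknown device")
        self.stats[(dev, self.DECOMPRESS_WQ)] += 1
        return dev, self.DECOMPRESS_WQ


if __name__ == "__main__":
    model = IaaSocketModel(num_devices=4)
    for _ in range(1000):
        model.submit_compress()
    # Per-device wq.1 counts come out even: [250, 250, 250, 250]
    print([model.stats[(d, IaaSocketModel.COMPRESS_WQ)] for d in range(4)])
```

Keeping decompression on a dedicated queue means compress bursts during reclaim do not sit in front of latency-sensitive decompress jobs, while spreading compresses across devices uses the whole socket's hardware.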

It's fascinating work with significant performance benefits, so hopefully it will land in the mainline Linux kernel sooner rather than later and help make for a more compelling IAA experience.
About The Author
Michael Larabel

Michael Larabel is the principal author of Phoronix.com and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 20,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and OpenBenchmarking.org automated benchmarking software. He can be followed via Twitter, LinkedIn, or contacted via MichaelLarabel.com.
