Intel Lands A Nice Memset Performance Optimization In Glibc
Intel engineer Noah Goldstein has landed another nice performance optimization in the GNU C Library (glibc), this time benefiting newer Intel processors.
Goldstein's latest open-source toolchain optimization improves large memset performance by using non-temporal stores.
This latest optimization effort benefits at least Skylake-X and Ice Lake, the latter covering both client and server processors. Goldstein explained the memset optimization in the commit message now in Glibc Git:
"x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Previously we use `rep stosb` for all medium/large memsets. This is notably worse than non-temporal stores for large (above a few MBs) memsets. See [here] for data using different strategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX. Historically, these numbers would not have been so good because of the zero-over-zero writeback optimization that `rep stosb` is able to do. But, the zero-over-zero writeback optimization has been removed as a potential side-channel attack, so there is no longer any good reason to only rely on `rep stosb` for large memsets. On the flip side, non-temporal writes can avoid data in their RFO requests saving memory bandwidth.
...
The results on the memset-large benchmark suite on TGL-client for N=20 runs:
Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2 : 0.928
Geometric Mean across the suite New / Old SSE2 : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers (likely because clients typically have faster single-core bandwidth so saving bandwidth on RFOs is less impactful), but still advantageous."
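For readers unfamiliar with the technique, here is a minimal C sketch of what "non-temporal stores for large memsets" can look like. It is not the glibc patch: glibc's memset is hand-written assembly with its own size cutoff, while the function name `memset_nt_sketch` and the 4 MiB `NT_THRESHOLD` below are hypothetical choices made purely for illustration. The sketch uses the SSE2 streaming-store intrinsic `_mm_stream_si128` for large fills and falls back to the regular `memset` otherwise (and on non-x86 builds).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical cutoff for this sketch only; not the value glibc uses. */
#define NT_THRESHOLD ((size_t)4 << 20)

/* Illustrative memset that switches to non-temporal stores for large sizes. */
static void *memset_nt_sketch(void *dst, int c, size_t n)
{
#if defined(__SSE2__)
    if (n >= NT_THRESHOLD) {
        unsigned char *p = (unsigned char *)dst;

        /* Streaming stores need 16-byte alignment: fill the head normally. */
        size_t head = (size_t)(-(uintptr_t)p) & 15;
        memset(p, c, head);
        p += head;
        n -= head;

        /* Broadcast the fill byte and write 16 bytes at a time,
           bypassing the cache hierarchy. */
        __m128i v = _mm_set1_epi8((char)c);
        size_t blocks = n / 16;
        for (size_t i = 0; i < blocks; i++)
            _mm_stream_si128((__m128i *)(p + 16 * i), v);

        /* Order the weakly-ordered streaming stores before later stores. */
        _mm_sfence();

        /* Fill the remaining tail (fewer than 16 bytes) normally. */
        memset(p + 16 * blocks, c, n % 16);
        return dst;
    }
#endif
    return memset(dst, c, n);
}
```

Calling `memset_nt_sketch(buf, 0, size)` behaves like `memset(buf, 0, size)`; only the store strategy differs once `size` reaches the threshold. Whether this is actually faster depends on the CPU and the buffer size, which is why the commit relies on benchmark data rather than a fixed rule.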
The patch is now in Glibc Git as yet another nice performance optimization thanks to Intel's software team and their relentless open-source tuning contributions across the stack.