Linux 6.8 Picks Up AMD CPU Optimization To Avoid Unnecessarily Serializing MSR Accesses
The x86 CPU pull request is ready for the Linux 6.8 kernel. Besides adding new AMD Zen feature flags that make it easy to tell the different CPU generations apart, it also carries an AMD CPU optimization to avoid an unnecessary MFENCE+LFENCE barrier.
The Linux x86/x86_64 kernel has imposed an MFENCE and LFENCE synchronization barrier when accessing certain MSRs, since that's the behavior Intel CPUs require and the code was added to the kernel by Intel engineers. But the MFENCE+LFENCE barrier isn't needed on AMD processors, so this optimization for Linux 6.8 eliminates it when running on AMD hardware. In particular, the barrier isn't needed for the TSC deadline (TSC_DEADLINE) and x2APIC model-specific registers (MSRs) on AMD processors.
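To illustrate the general idea of gating the barrier on the CPU vendor, here is a minimal user-space C sketch. It is not the kernel patch itself and makes no claim about how that patch is written: the helper names (`vendor_is_amd`, `read_cpu_vendor`, `wrmsr_fence_if_needed`) are hypothetical, it only checks the CPUID vendor string, and it assumes an x86 build with GCC or Clang.

```c
#include <cpuid.h>      /* __get_cpuid (GCC/Clang, x86 only) */
#include <emmintrin.h>  /* _mm_mfence, _mm_lfence (SSE2 intrinsics) */
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when a CPUID vendor string names AMD. */
static bool vendor_is_amd(const char *vendor)
{
    return strncmp(vendor, "AuthenticAMD", 12) == 0;
}

/* Hypothetical helper: copy the CPUID leaf 0 vendor string into out[13].
 * The 12 characters are returned in EBX, EDX, ECX, in that order. */
static bool read_cpu_vendor(char out[13])
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return false;

    memcpy(out, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return true;
}

/* Hypothetical helper: issue MFENCE+LFENCE unless the vendor is AMD.
 * Returns true when the barrier was executed, false when it was skipped. */
static bool wrmsr_fence_if_needed(const char *vendor)
{
    if (vendor_is_amd(vendor))
        return false;   /* assumption of this sketch: AMD skips the barrier */

    _mm_mfence();
    _mm_lfence();
    return true;
}
```

A caller would fetch the vendor once with `read_cpu_vendor()` and pass it to `wrmsr_fence_if_needed()`. A runtime branch like this is only for illustration; the point is simply that the cost of the two fences is paid only where the hardware needs them.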
I wrote about this performance optimization back when it was originally being worked on for the Linux kernel. Now it's landing in Linux 6.8.
Using a modified IPI benchmark, AMD evaluated the performance impact of bypassing this synchronization barrier:
"Comparing the performance of x2AVIC with and without the fix, it can be seen the performance improves by ~4%.
Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option with and without weak_wrmsr_fence() on a Zen4 system also showed significant performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores CCX or CCD and just picks random vCPU.
Average throughput (10 iterations) with weak_wrmsr_fence(),
Cumulative throughput: 4933374 IPI/s
Average throughput (10 iterations) without weak_wrmsr_fence(),
Cumulative throughput: 6355156 IPI/s"
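For a sense of scale, the two mesh-ipi throughput figures quoted above work out to a gain of roughly 29%. The small C helper below shows the arithmetic; `percent_gain` is a hypothetical name used only for this illustration.

```c
/* Hypothetical helper: relative improvement of `improved` over `baseline`,
 * expressed as a percentage. */
static double percent_gain(double baseline, double improved)
{
    return (improved - baseline) / baseline * 100.0;
}
```

Calling `percent_gain(4933374.0, 6355156.0)` with the quoted IPI/s numbers gives about 28.8.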
The change to stop serializing MSR accesses on AMD processors can be found in the x86/cpu pull for Linux 6.8, though it's too bad this behavior wasn't gated as Intel-specific in the first place.