Intel Optimization Around Batched TLB Flushing For Folios Looks Great
The optimization by Intel engineer Huang Ying has been picked up in Andrew Morton's "mm-unstable" branch. If all goes well, this optimization could be merged for the upcoming Linux v6.3 kernel cycle. The patch message sums up the work well:
The TLB flushing will cost quite some CPU cycles during the folio migration in some situations. For example, when migrate a folio of a process with multiple active threads that run on multiple CPUs. After batching the _unmap and _move in migrate_pages(), the TLB flushing can be batched easily with the existing TLB flush batching mechanism. This patch implements that.
We use the following test case to test the patch.
On a 2-socket Intel server,
- Run pmbench memory accessing benchmark
- Run `migratepages` to migrate pages of pmbench between node 0 and node 1 back and forth.
With the patch, the TLB flushing IPI reduces 99.1% during the test and the number of pages migrated successfully per second increases 291.7%.
NOTE: TLB flushing is batched only for normal folios, not for THP folios. Because the overhead of TLB flushing for THP folios is much lower than that for normal folios (about 1/512 on x86 platform).
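To get a feel for why batching cuts the flush IPIs so sharply, and where the "about 1/512" figure comes from, here is a rough back-of-the-envelope sketch. It is a toy model and not the kernel's code: the workload numbers, the batch size of 512, and the `flush_ipis` helper are all hypothetical, and it assumes each flush round sends one IPI to every other CPU running a thread of the process. The only hard facts it leans on are the x86-64 page sizes (4 KiB base pages, 2 MiB THP).

```python
import math

BASE_PAGE = 4 * 1024          # 4 KiB base page on x86-64
THP_SIZE = 2 * 1024 * 1024    # 2 MiB PMD-sized transparent huge page


def flush_ipis(folios, remote_cpus, batch_size=1):
    """Toy model (hypothetical helper): one flush round per batch of
    folios, and each round interrupts every remote CPU once."""
    rounds = math.ceil(folios / batch_size)
    return rounds * remote_cpus


# Hypothetical workload: 10,000 normal folios, threads on 7 other CPUs.
folios, remote_cpus = 10_000, 7

unbatched = flush_ipis(folios, remote_cpus)                 # flush per folio
batched = flush_ipis(folios, remote_cpus, batch_size=512)   # assumed batch size
reduction = 1 - batched / unbatched

# One THP covers this many base pages, so migrating the same amount of
# memory as THP needs roughly 1/512 as many flushes as with base pages.
pages_per_thp = THP_SIZE // BASE_PAGE

print(f"unbatched IPIs: {unbatched}")
print(f"batched IPIs:   {batched}")
print(f"reduction:      {reduction:.1%}")
print(f"base pages per THP: {pages_per_thp}")
```

The model ignores everything that makes real numbers messier (failed migrations, CPUs going idle, flushes triggered elsewhere), so its output should not be compared directly with the 99.1% measured in the patch. It only shows that the IPI count scales with the number of flush rounds rather than the number of folios.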
Intel continues to do a lot of great upstream kernel work: not just timely enablement of its new hardware, but also relentless optimization of the Linux kernel for greater performance and efficiency.