Linux 6.2 Landing Scalability Improvement For Large IBM Power Systems
Up to now, IBM Power processors on Linux have relied on the kernel's generic queued spinlock ("qspinlock") implementation, but that has been found to cause latency and starvation issues on large IBM POWER10 servers, with testing on upwards of 16 sockets and 1,920 CPU threads. Thus for Linux 6.2 there is now a Power-specific queued spinlock implementation designed to overcome the issues of the generic qspinlocks and allow for better scalability on large IBM Power / OpenPOWER systems. This new implementation should also work better in para-virtualized environments.
IBM Power CPUs installed on a Raptor Talos II motherboard.
Nicholas Piggin, who spearheaded this Power qspinlock code, commented in the earlier patch series on it:
Since the RFC series, I tested this on a 16-socket 1920 thread POWER10 system with some microbenchmarks, and that showed up significant problems with the previous series. High amount of spinning on the lock up-front (lock stealing) for SPLPAR mode (paravirt) really hurts scalability when the guest is not overcommitted. However on smaller KVM systems with significant overcommit (e.g., 5-10%), this spinning is very important to avoid performance tanking due to the queueing problem. So rather than set STEAL_SPINS and HEAD_SPINS based on SPLPAR at boot-time, I lowered them and do more to dynamically deal with vCPU preemption. So behaviour of dedicated and shared LPAR mode is now the same until there is vCPU preemption detected. This seems to be leading to better results overall, but some worst-case latencies are significantly up with the lockstorm test (latency is still better than generic queued spinlocks, but not as good as it previously was or as good as simple). Statistical fairness is still significantly better.
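To make the trade-off in that quote more concrete, here is a minimal user-space sketch of a "steal, then queue" spinlock written with C11 atomics. It is not the kernel's Power qspinlock code. The type and function names (`sq_lock_t`, `sq_try_steal`, and so on) are hypothetical, the `STEAL_SPINS` and `HEAD_SPINS` values are made up for illustration, and a simple ticket queue stands in for the kernel's real queueing. The `must_queue` flag is likewise an assumption about how a starved queue head might switch stealing off. Nothing here models vCPU preemption detection.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative tunables: the names echo the quote, the values are invented. */
#define STEAL_SPINS 32u   /* up-front attempts to grab the lock before queueing */
#define HEAD_SPINS  256u  /* failed attempts by the queue head before it blocks stealers */

typedef struct {
    atomic_int  locked;       /* 1 while some thread holds the lock */
    atomic_int  must_queue;   /* 1 while the queue head has disabled stealing */
    atomic_uint next_ticket;  /* next queue position to hand out */
    atomic_uint now_serving;  /* queue position that is currently the head */
} sq_lock_t;

static void sq_init(sq_lock_t *l)
{
    atomic_init(&l->locked, 0);
    atomic_init(&l->must_queue, 0);
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->now_serving, 0u);
}

/* One steal attempt: succeeds only if the lock is free and stealing is allowed. */
static bool sq_try_steal(sq_lock_t *l)
{
    int expected = 0;

    if (atomic_load_explicit(&l->must_queue, memory_order_relaxed))
        return false;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void sq_lock(sq_lock_t *l)
{
    /* Phase 1: bounded spinning on the lock word (lock stealing). */
    for (unsigned i = 0; i < STEAL_SPINS; i++)
        if (sq_try_steal(l))
            return;

    /* Phase 2: join the FIFO queue and wait to become its head. */
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1u,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ; /* busy-wait */

    /* Phase 3: only the head competes with stealers for the lock word. */
    unsigned spins = 0;
    for (;;) {
        int expected = 0;

        if (atomic_compare_exchange_weak_explicit(
                &l->locked, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            break;
        /* After too many failures, stop new stealers so the head cannot starve. */
        if (++spins == HEAD_SPINS)
            atomic_store_explicit(&l->must_queue, 1, memory_order_relaxed);
    }

    /* Re-enable stealing, then hand the head role to the next waiter. */
    atomic_store_explicit(&l->must_queue, 0, memory_order_relaxed);
    atomic_fetch_add_explicit(&l->now_serving, 1u, memory_order_release);
}

static void sq_unlock(sq_lock_t *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

In this sketch, a larger `STEAL_SPINS` lets more threads take the lock without queueing. That is the behaviour the quote describes as important under overcommit but harmful to scalability on a large system that is not overcommitted. A smaller value pushes waiters into the fair queue sooner.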
This Power qspinlock implementation for better system scalability is being merged for Linux 6.2 as part of this pull request. For those making use of the old PowerPC Book3S, this pull request for Linux 6.2 is also now zeroing the general purpose registers on interrupt routine entry. This change is being done to reduce the influence of user registers on speculation within the kernel's system call handlers. Sanitizing the registers on interrupt routine entry may drop Book3S performance by about 1%, but it is being done in the name of security.