Linux 5.9 To Allow Controlling Page Lock Unfairness In Addressing Performance Regression
Following the Linux 5.0 to 5.9 kernel benchmarks on AMD EPYC that showed the in-development Linux 5.9 kernel regressing in some workloads, the bisecting of that issue, and the resulting discussion of the page lock fairness regression, a solution for Linux 5.9 has now landed.
As outlined in detail in the aforelinked article, the performance regression hitting the likes of the Apache HTTPD web server test stems from work carried out by Linux creator Linus Torvalds for Linux 5.9 to improve the page lock fairness. The fundamental issue, though, is quite complicated: making the page lock more "fair" can sometimes hurt performance, as shown in the regressed benchmarks on Phoronix.
Long-term, Linus Torvalds and other upstream developers will be looking at further improving the page lock behavior, but merged today for Linux 5.9 was a short-term solution. The change allows a controlled amount of unfairness in the page lock.
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made the page locking entirely fair, in that if a waiter came in while the lock was held, the lock would be transferred to the lockers strictly in order.
That was intended to finally get rid of the long-reported watchdog failures that involved the page lock under extreme load, where a process could end up waiting essentially forever, as other page lockers stole the lock from under it.
It also improved some benchmarks, but it ended up causing huge performance regressions on others, simply because fair lock behavior doesn't end up giving out the lock as aggressively, causing better worst-case latency, but potentially much worse average latencies and throughput.
Instead of reverting that change entirely, this introduces a controlled amount of unfairness, with a sysctl knob to tune it if somebody needs to. But the default value should hopefully be good for any normal load, allowing a few rounds of lock stealing, but enforcing the strict ordering before the lock has been stolen too many times.
...
This whole issue has exposed just how critical the page lock can be, and how contended it gets under certain loads. And the main contention doesn't really seem to be anything related to IO (which was the origin of this lock), but for things like just verifying that the page file mapping is stable while faulting in the page into a page table.
The controlled amount of unfairness for the page lock can also be user-controlled via the vm.page_lock_unfairness sysctl (backed by sysctl_page_lock_unfairness in the kernel and exposed as /proc/sys/vm/page_lock_unfairness). That value controls how many times the kernel will retry the unfair lock-stealing path, while a value of zero (0) yields the fully fair page lock behavior.
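The bounded-unfairness idea can be sketched with a toy Python model. This is purely illustrative -- the real logic lives in the kernel's mm/filemap.c, and the function and parameter names below are made up for this sketch:

```python
from collections import deque

def simulate_unlocks(queue_ids, stealer_supply, unfairness):
    """Toy model of bounded page lock stealing (hypothetical names).

    queue_ids: ids of threads already queued on the lock, in arrival order.
    stealer_supply: how many late-arriving threads are available to
        opportunistically grab the lock at each unlock.
    unfairness: how many steals the oldest waiter tolerates before it
        is granted a strict, in-order handoff (0 == fully fair).

    Returns the order in which threads end up acquiring the lock.
    """
    waiters = deque((wid, 0) for wid in queue_ids)  # (id, times_stolen)
    order = []
    stealer_id = 0
    while waiters:
        wid, stolen = waiters[0]
        if stolen < unfairness and stealer_supply > 0:
            # Unfair fast path: the lock is simply released and a
            # newcomer grabs it before the queued waiter wakes up.
            order.append(f"stealer{stealer_id}")
            stealer_id += 1
            stealer_supply -= 1
            waiters[0] = (wid, stolen + 1)
        else:
            # Fair path: the lock is handed directly to the oldest
            # waiter, enforcing strict ordering.
            order.append(wid)
            waiters.popleft()
    return order
```

With a value of 0 the queue is strictly FIFO no matter how many stealers show up; with the new default of 5, each waiter can be bypassed at most five times before receiving a fair handoff, capping its worst-case wait. The real knob can be tuned at runtime via `sysctl vm.page_lock_unfairness=5` or by writing to /proc/sys/vm/page_lock_unfairness.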
In the original patch that page_lock_unfairness value defaulted to 1000, but in testing the patches over the past several days I've found a value of 4~5 to offer the best performance -- in select cases, it can even mean better performance than found on Linux 5.8 stable. With today's merge the default value is set to 5. Some runs on a Threadripper box:
"PLU 5" represents what will be the new default Linux 5.9 performance with the 1000 value not being the default in the merged patches. In some cases depending upon hardware, a value of 4 sometimes will perform better. Ultimately though past Linux 5.9 there will hopefully be further improvements to the page lock.
An AMD EPYC 7F72 system was working out great for frequent kernel builds and testing during this long process, among other high core count AMD boxes.
Indeed the performance is much better in this particular test profile than in the earlier Linux 5.9 state or with the patch at its original default value of 1000.
Now that the patch is mainlined with the default page_lock_unfairness value of 5, I am running benchmarks on more systems and with more workloads in looking for any other performance regressions or unexpected behavior.
Thanks to the Phoronix Premium members and their support for making this extended testing possible, with a couple of systems now hammering away on patch testing for more than a week after the regression was bisected. Thanks as well to AMD for their speedy EPYC hardware that makes quick work of testing -- especially when it comes to the kernel bisecting with its repeated builds of the massive codebase.