The Latest On The Linux 5.9 Kernel Regression Stemming From Page Lock Fairness
Last week we reported on a Linux 5.9 kernel regression after benchmarks from Linux 5.0 through 5.9 showed a sharp drop with the latest development kernel. That regression was bisected to code introduced by Linus Torvalds at the start of the Linux 5.9 cycle. Unfortunately it's not a trivial problem, and it is still being analyzed while a proper solution is worked out. So the short story is that it remains a work-in-progress, while this article offers some additional insight along with benchmarks done over the course of the past few days.
For those wondering about that Linux 5.9 kernel regression spotted for the likes of Apache HTTPD / Nginx / Hackbench, it's still being evaluated. Linus Torvalds knows what's going on, but he's still sorting out the best way to resolve those sharp performance drops while preserving the fairness goals that motivated his original patch.
As pointed out in the earlier article, the issue stems from this patch rewriting the wait_on_page_bit_common() logic. While he is contemplating reverting it, he's hoping to come up with a new hybrid model that addresses both the old and new use-cases.
An example of some of the impacts observed in these different test cases across systems.
Here is how Torvalds originally summed up the issue with all of the interesting technical details in an email to Phoronix following the bisect report:
Let's say that a page is locked by some user (doesn't matter why - it might be IO, it might be just for any of the number of "I need to lock this page to guarantee that the state stays consistent"), and a number of other processes come in and want to lock it.
What we do (and what we did before that commit too) is to queue them up in the hashed page wait-queue, and a flag is set in the page structure to say "this page has waiters".
We queue things up in order, so that the oldest waiter is first. That is *meant* to be about fairness, but you'll see later why that part didn't use to matter.
Anyway, this thing didn't change fundamentally by that commit that you bisected the performance regression to - yes, the commit changes that queuing, but not in any fundamental significant way, the change is incidental to the big change which is the "what happens at wakeup".
So the big difference is that wakeup behavior. Both before and after, the basic trigger is the same: the process that holds the page lock does an "unlock_page()", and as it unlocks it also checks that "do we have waiters" bit that got set, and goes off to process that wait list.
And here's the big difference. We *used* to just wake things up on that wait list, and that was it. We had various anti-herd behavior, so we'd only wake up *one* exclusive waiter process (ie somebody who was doing a "lock_page()") so we wouldn't have _all_ those waiters suddenly wake up, but basically we just did "oh, we've now released the page, so wake up the waiters".
That sounds obvious, but it actually has a rather non-obvious effect. In particular, the page is now unlocked, and the waiters have been woken, but they aren't necessarily *running*. They have become runnable, but particularly under load, it's probably some time before those waiters will actually get any CPU time.
In the meantime, all the processes that *weren't* on the wait queue (and the original page locker in particular) and are runnable are free to come in and take that page lock again.
That's actually really efficient - it's good for throughput to let the first possible locker come in and get the lock - but it's extremely unfair. The poor process that was woken up because it has been waiting a long time didn't actually get the lock at all, and quite the reverse - when it finally gets to run, now the page is locked *again* by somebody else who came in and never had to wait for it at all.
And to make things worse, the original waiter now sees that page locked again, so guess what it has to do? Right: it just adds itself back to the end of the wait queue, so now it has to wait all over again, plus all the old waiters are now ahead of it. Doubly unfair.
So that was the old behavior. It's why you ended up having watchdogs fire, because under heavy load and bad luck, the above situation could keep on happening multiple times, so the old waiter could end up waiting for literally tens of seconds.
But it was _efficient_. We'd give the page lock to the first process that came along as long as it was unlocked. So the old behavior is potentially extremely unfair, but it also maximizes giving the page lock away, and thus maximizes throughput (and it might even minimize _average_ latency, because all the lucky people got the page lock very quickly). It just makes for truly horrendous maximum latency because it's so unfair.
So what's the new model?
The new model is all similar up until the wakeup happens. At that point, instead of just waking up the process that waits for the page lock, we *give* the page lock to that process, and tell it "tag, you're it, you've got the page lock", even before it actually gets a CPU to run on.
There's still a very small window between "I unlocked the page" and "I gave the lock to the waiter", but now it's a couple of cycles, so the unfairness isn't really going to happen (and if it happens once in a blue moon, it's not a big deal - it's going to be _one_ process that gets the lock out of turn if it is _really_ really lucky, no re-queueing, and no "this might happen several times").
So now you can basically ignore the tiny chance of unfairness, and it's not going to be compounded by re-queueing or duplication.
And in many cases, you'll actually see performance improvements, because for the case where you just transfer one lock to one waiter, the new model of just transferring it directly is not only fair, it's slightly more efficient too.
That nasty unfair case might have caused huge latency spikes for the unlucky process, and horribly unfairness, but it would have helped all those lockers who came in and were lucky. And maybe they locked the page, and almost immediately unlocked it again, and maybe they wouldn't have hurt the original waiter.
And that's what I suspect is going on, and what I hope I'll see from your profiles: find some common case that gets the lock quickly and releases it immediately, and that used to be helped enormously by the fundamentally unfair locking model we used to have.
And then I hope I can fix that to do something different, and fix the performance regression that way.
Because I suspect the commit you bisected is working exactly as intended, and it's just that the old unfairness was really good for performance. We've seen that before - particularly with lock contention, unfair behavior is often very good for average performance even if it then can be very very bad for the worst-case latency.
So what's going on now, and what's the path forward?