New Set Of 86 Patches Overhaul The Linux Kernel's Preemption Model

Written by Michael Larabel in Linux Kernel on 8 November 2023 at 06:51 AM EST.

Ankur Arora of Oracle on Tuesday sent out a set of 86 patches that rework how the Linux kernel handles preemption. The series builds on earlier proof-of-concept work by prominent Linux kernel engineer Thomas Gleixner.

Arora explains this kernel preemption work in his "request for comments" message:

We have two models of preemption: voluntary and full (and RT which is a fuller form of full preemption.) In this series -- which is based on Thomas' PoC, we try to unify the two by letting the scheduler enforce policy for the voluntary preemption models as well.

(Note that this is about preemption when executing in the kernel. Userspace is always preemptible.)

Background
==

Why?: both of these preemption mechanisms are almost entirely disjoint. There are four main sets of preemption points in the kernel:

1. return to user
2. explicit preemption points (cond_resched() and its ilk)
3. return to kernel (tick/IPI/irq at irqexit)
4. end of non-preemptible sections at (preempt_count() == preempt_offset)

Voluntary preemption uses mechanisms 1 and 2. Full preemption uses 1, 3 and 4. In addition, both use cond_resched_{rcu,lock,rwlock*}, which can be all things to all people because they internally contain 2 and 4.
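As an editorial illustration (not part of Arora's message), the mapping between the two models and the four sets of preemption points can be written down as a small Python table. The names `PreemptPoint` and `MODEL_POINTS` are invented for this sketch and are not kernel identifiers.

```python
from enum import Enum


class PreemptPoint(Enum):
    RETURN_TO_USER = 1      # 1. return to user
    EXPLICIT = 2            # 2. cond_resched() and its ilk
    RETURN_TO_KERNEL = 3    # 3. tick/IPI/irq at irqexit
    END_OF_NONPREEMPT = 4   # 4. preempt_count() == preempt_offset


# Which sets of preemption points each model relies on, per the list above.
MODEL_POINTS = {
    "voluntary": {PreemptPoint.RETURN_TO_USER, PreemptPoint.EXPLICIT},
    "full": {
        PreemptPoint.RETURN_TO_USER,
        PreemptPoint.RETURN_TO_KERNEL,
        PreemptPoint.END_OF_NONPREEMPT,
    },
}

# The only set of preemption points both models have in common.
shared = MODEL_POINTS["voluntary"] & MODEL_POINTS["full"]
print(sorted(p.name for p in shared))  # ['RETURN_TO_USER']
```

Only the return-to-user path is common to both models, which is what "almost entirely disjoint" refers to.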

Now since there's no ideal placement of explicit preemption points, they tend to be randomly spread over code and accumulate over time, as they are added when latency problems are seen. Plus fear of regressions makes them difficult to remove. (Presumably, asymptotically they would spread out evenly across the instruction stream!)

In voluntary models, the scheduler's job is to match the demand side of preemption points (a task that needs to be scheduled) with the supply side (a task which calls cond_resched().)

Full preemption models track preemption count so the scheduler always knows if it is safe to preempt and can drive preemption itself (ex. via dynamic preemption points in 3.)
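To make the contrast concrete, here is a toy Python model (an editorial sketch, not taken from the patch series and not real kernel logic) of the two existing mechanisms described above. The class `ToyCPU`, its fields, and the helper `run` are hypothetical; the method names merely echo the kernel's `preempt_disable()`, `preempt_enable()` and `cond_resched()`.

```python
class ToyCPU:
    """Toy model of one CPU running kernel code; not real kernel logic."""

    def __init__(self, model):
        self.model = model          # "voluntary" or "full" (labels for this sketch)
        self.preempt_count = 0      # depth of nested non-preemptible sections
        self.need_resched = False   # set when another task should run
        self.switches = 0           # how many times the scheduler was invoked

    def _schedule(self):
        self.need_resched = False
        self.switches += 1

    def preempt_disable(self):
        self.preempt_count += 1

    def preempt_enable(self):
        self.preempt_count -= 1
        # Preemption point 4: end of a non-preemptible section (full model).
        if self.model == "full" and self.preempt_count == 0 and self.need_resched:
            self._schedule()

    def cond_resched(self):
        # Preemption point 2: explicit call placed in the code (voluntary model).
        if self.model == "voluntary" and self.preempt_count == 0 and self.need_resched:
            self._schedule()


def run(model):
    cpu = ToyCPU(model)
    cpu.preempt_disable()
    cpu.need_resched = True      # a wakeup arrives inside the critical section
    cpu.preempt_enable()         # full model reschedules here
    after_enable = cpu.switches
    cpu.cond_resched()           # voluntary model reschedules here
    return after_enable, cpu.switches


print(run("full"))       # (1, 1)
print(run("voluntary"))  # (0, 1)
```

In the toy, the full model reschedules as soon as the non-preemptible section ends, while the voluntary model has to wait until the running code happens to reach an explicit `cond_resched()` call.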

Design
==

As Thomas outlines, to unify the preemption models we want to: always have the preempt_count enabled and allow the scheduler to drive preemption policy based on the model in effect.

With this large patch set applied, the system boots and performance is "pretty close" to the Linux 6.6 baseline. A number of pieces are still broken at this RFC stage, however, including non-x86 architectures, kernel livepatching, and other features.

Those wanting to learn more can see the Linux kernel mailing list where this RFC is being discussed.
