Originally posted by EphemeralEft
For a somewhat longer explanation, kernel preemption means that threads running in kernel mode can be preempted, just like userspace threads. The idea is to be able to respond more quickly when some high-priority event happens. This tends to be important for things like audio, industrial automation systems etc., but if latencies get really excessive even plain desktop usage starts to stutter. However, from a throughput perspective no preemption is best, since that minimizes context switching, keeps caches warmer etc. So it's a compromise between throughput and response time, and thus there are a bunch of different preemption models in the kernel.

PREEMPT_NONE, as the name implies, means no kernel preemption.

PREEMPT_VOLUNTARY is like PREEMPT_NONE, except there are a bunch of places in the kernel where a function cond_resched() is called, which basically means "now would be a good time to reschedule". So a bit like co-operative multitasking. It's the most commonly used model in distro kernels. However, the Linux kernel scheduler developers hate it, because all these scheduling points are not under the control of the scheduler. There can be too many of them, leading to excess context switches, or too few of them, leading to stutter and missed deadlines, all depending on the whims of driver or other subsystem developers sprinkling their code with cond_resched().

And PREEMPT_FULL is, well, full preemption, where kernel code can be preempted. In reality there are many places in the kernel where preemption isn't possible, so in PREEMPT_FULL mode there is a "spinlock nesting counter" (the preempt count). When checking whether to preempt, the kernel checks whether that counter is 0, and if there's also a flag set saying "preempt now!" then it preempts at that point (which basically means running the scheduler code to see if there's a higher priority thread it should switch to).
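To make the counter-plus-flag check a bit more concrete, here is a toy model in Python. It is only a sketch of the logic described above, not kernel code: the class `ToyCPU`, its `kernel_preemption` switch and the `request_resched()` helper are made up for illustration, and only the names `preempt_disable()`, `preempt_enable()` and `cond_resched()` are borrowed from the kernel.

```python
class ToyCPU:
    """Toy model of one CPU's preemption state (illustrative, not kernel code)."""

    def __init__(self, kernel_preemption):
        # True models PREEMPT_FULL-style behaviour; False models a kernel
        # that only reschedules at explicit scheduling points.
        self.kernel_preemption = kernel_preemption
        self.preempt_count = 0     # > 0 means "not safe to preempt right now"
        self.need_resched = False  # the "preempt now!" flag
        self.switches = 0          # how many times the scheduler ran

    def _schedule(self):
        # Stand-in for running the scheduler and picking the next thread.
        self.need_resched = False
        self.switches += 1

    def _maybe_preempt(self):
        # The check from the text: counter is 0 and the flag is set.
        if self.kernel_preemption and self.preempt_count == 0 and self.need_resched:
            self._schedule()

    def preempt_disable(self):
        # Entering a region where preemption is not possible (e.g. a spinlock).
        self.preempt_count += 1

    def preempt_enable(self):
        # Leaving such a region; a deferred preemption can happen here.
        self.preempt_count -= 1
        self._maybe_preempt()

    def request_resched(self):
        # Hypothetical helper: some event made a higher priority thread runnable.
        self.need_resched = True
        self._maybe_preempt()

    def cond_resched(self):
        # Explicit "now would be a good time to reschedule" point.
        if not self.kernel_preemption and self.need_resched:
            self._schedule()
```

With `kernel_preemption=True`, a `request_resched()` that arrives between `preempt_disable()` and `preempt_enable()` is only acted on once the counter drops back to 0. With `kernel_preemption=False`, nothing happens until the code gets around to calling `cond_resched()`, which is exactly why the placement of those calls matters so much.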
Then PREEMPT_RT is like PREEMPT_FULL, except there are a lot of changes to minimize the situations where the kernel is in a non-preemptible state, in order to minimize latency, at a much bigger cost in throughput.
Finally, PREEMPT_LAZY is like PREEMPT_FULL, except there's an additional "lazy preempt" flag. The logic is that for tasks in the realtime scheduling classes (RR/FIFO/DEADLINE) it works like PREEMPT_FULL and preempts ASAP. For normal tasks, however, it instead uses the lazy flag to postpone the preemption, in order to get as much as possible of the throughput benefit of run-to-completion. At some point, such as the timer tick, the lazy flag gets upgraded to the full preemption flag, thus keeping latencies in check even for non-realtime tasks. So the aim is to get kind of the best of both worlds: good latency for realtime tasks, and good throughput with still decent latency for non-realtime tasks.
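The lazy flag and its upgrade at the timer tick can be sketched the same way. Again this is just a toy model of the description above under simplifying assumptions, not how the kernel implements it: `ToyLazyCPU`, `request_resched(realtime=...)` and `timer_tick()` are hypothetical names chosen for the example.

```python
class ToyLazyCPU:
    """Toy model of lazy preemption (illustrative, not kernel code)."""

    def __init__(self):
        self.preempt_count = 0          # > 0 means "not safe to preempt"
        self.need_resched = False       # full "preempt now!" flag
        self.need_resched_lazy = False  # the additional "lazy preempt" flag
        self.switches = 0               # how many times the scheduler ran

    def _schedule(self):
        # Stand-in for running the scheduler; both flags are consumed.
        self.need_resched = False
        self.need_resched_lazy = False
        self.switches += 1

    def _maybe_preempt(self):
        # Only the full flag triggers preemption; the lazy flag alone does not.
        if self.preempt_count == 0 and self.need_resched:
            self._schedule()

    def preempt_disable(self):
        self.preempt_count += 1

    def preempt_enable(self):
        self.preempt_count -= 1
        self._maybe_preempt()

    def request_resched(self, realtime):
        # Realtime tasks (RR/FIFO/DEADLINE) preempt ASAP, as in PREEMPT_FULL.
        # Normal tasks only set the lazy flag, so the current task keeps running.
        if realtime:
            self.need_resched = True
        else:
            self.need_resched_lazy = True
        self._maybe_preempt()

    def timer_tick(self):
        # At the tick, a pending lazy request is upgraded to the full flag.
        if self.need_resched_lazy:
            self.need_resched_lazy = False
            self.need_resched = True
        self._maybe_preempt()
```

In this model a `request_resched(realtime=False)` leaves the running task alone until the next `timer_tick()`, while `request_resched(realtime=True)` switches immediately as long as the counter is 0.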
Eventually the scheduler developers want to get rid of PREEMPT_VOLUNTARY and all those cond_resched() calls sprinkled all around, but it'll take time to get there.