Red Hat Proposes Queue PerCPU Work "QPW" For Better Handling Per-CPU Work On RT Linux
Red Hat engineer Leonardo Bras has laid out a proposal for QPW, or "Queue PerCPU Work", as a better means of handling per-CPU operations within the Linux kernel, especially for real-time (RT) workloads.
The proposal, sent out Saturday morning, would have QPW replace work queues for this use case: on non-RT kernels it would simply wrap the existing local_lock and workqueue behavior, while on real-time kernels QPW would lock the target CPU's per-CPU structure and perform the work locally.
Bras explained in the RFC patch proposal:
The problem:
Some places in the kernel implement a parallel programming strategy consisting of local_locks() for most of the work, while some rare remote operations are scheduled on the target CPU. This keeps cache bouncing low since the cacheline tends to be mostly local, and avoids the cost of locks in non-RT kernels, even though the very few remote operations will be expensive due to scheduling overhead.
On the other hand, for RT workloads this can represent a problem: getting an important workload scheduled out to deal with remote requests is sure to introduce unexpected deadline misses.
The idea:
Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks. In this case, instead of scheduling work on a remote cpu, it should be safe to grab that remote cpu's per-cpu spinlock and run the required work locally. The major cost, which is un/locking in every local function, already happens in PREEMPT_RT.
Also, there is no need to worry about extra cache bouncing: The cacheline invalidation already happens due to schedule_work_on().
This will avoid schedule_work_on(), and thus avoid scheduling-out an RT workload.
For patches 2, 3 & 4, I noticed just grabbing the lock and executing the function locally is much faster than scheduling it on a remote CPU.
Proposed solution:
A new interface called Queue PerCPU Work (QPW), which should replace Work Queue in the above mentioned use case.
If PREEMPT_RT=n, this interface just wraps the current local_locks + WorkQueue behavior, so no change in runtime is expected.
If PREEMPT_RT=y, queue_percpu_work_on(cpu, ...) will lock that CPU's per-cpu structure and perform the work on it locally. This is possible because, in functions that can perform work on remote per-cpu structures, the local_lock (which is already a this_cpu spinlock()) will be replaced by a qpw_spinlock(), which is able to get the per_cpu spinlock() for the CPU passed as parameter.
This QPW proposal is now awaiting feedback from other kernel developers. Especially with the real-time patches hopefully being mainlined in the coming months, better handling of per-CPU operations becomes all the more important for the best RT experience.