Meta Proposes Shared Wakequeue For Linux's CFS - Small Throughput Win

Written by Michael Larabel in Linux Kernel on 13 June 2023 at 08:15 AM EDT. 10 Comments

Meta engineers have proposed a shared wakequeue "swqueue" feature for the Linux kernel's CFS scheduler that yields a small throughput improvement and slightly better latency, particularly on AMD systems with multiple CCXs.

Posted today was a "request for comments" on this swqueue CFS feature that Meta engineers have been working on. In their case, they were driven to work on swqueue to improve throughput on AMD EPYC servers running HHVM web server processes for Facebook.

Some of the key takeaways from their RFC patch cover letter are:

We noticed that CPUs were still going idle even when the host was overcommitted. In response, we wrote the "shared wakequeue" (swqueue) feature proposed in this patch set. The idea behind swqueue is simple: it enables the scheduler to be aggressively work conserving by placing a waking task into a per-LLC FIFO queue that can be pulled from by another core in the LLC before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement in throughput, as well as a small, consistent improvement in p95 and p99 latencies, in HHVM. These performance improvements were in addition to the wins from the debugfs knobs mentioned above.
...
The ~1 - 1.6% improvement in HHVM throughput is similarly visible using work-conserving sched_ext schedulers (even very simple ones like global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve performance. In addition to the performance gains observed on our internal web workloads, we also observed an improvement in common workloads such as kernel compile when running shared wakequeue.
...
swqueue in this form seems to provide a small, but noticeable win for front-end CPU-bound workloads spread over multiple CCXs. The reason seems fairly straightforward: swqueue encourages work conservation inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of runnable tasks. As mentioned above, it is complementary to SIS_NODE, which searches for idle cores on the wakeup path.

While swqueue in this form encourages work conservation, it of course does not guarantee it given that we don't implement any kind of work stealing between swqueues. In the future, we could potentially push CPU utilization even higher by enabling work stealing between swqueues, likely between CCXs on the same NUMA node.
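To make the mechanism in the cover letter more concrete, here is a toy Python model of the idea: one FIFO per LLC, a waking task is pushed onto the queue of its LLC, and a CPU about to go idle does an O(1) pull from its own LLC's queue. This is only a sketch for illustration and not the kernel patch. The names `SharedWakequeue`, `ToyScheduler`, `wake`, and `pick_before_idle` are hypothetical, and the CPU-to-LLC mapping is an assumed input.

```python
from collections import deque


class SharedWakequeue:
    """Toy per-LLC FIFO of runnable tasks (hypothetical model, not kernel code)."""

    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        # A waking task goes to the tail of the FIFO.
        self._tasks.append(task)

    def pull(self):
        # O(1) pop from the head; None means the queue is empty.
        return self._tasks.popleft() if self._tasks else None


class ToyScheduler:
    """Maps CPUs to LLC domains and keeps one shared wakequeue per LLC."""

    def __init__(self, cpu_to_llc):
        # cpu_to_llc is an assumed topology, e.g. {cpu_id: llc_id}.
        self.cpu_to_llc = dict(cpu_to_llc)
        self.queues = {llc: SharedWakequeue() for llc in set(self.cpu_to_llc.values())}

    def wake(self, task, cpu):
        # Enqueue the waking task on the queue of the LLC that owns `cpu`.
        self.queues[self.cpu_to_llc[cpu]].push(task)

    def pick_before_idle(self, cpu):
        # A CPU about to go idle only looks at its own LLC's queue.
        # There is no stealing between queues, matching the RFC's stated limitation.
        return self.queues[self.cpu_to_llc[cpu]].pull()
```

With an assumed topology of CPUs 0 and 1 in one LLC and CPUs 2 and 3 in another, two tasks woken on CPU 0 can be picked up by CPU 1 in FIFO order, while CPU 2 finds nothing and would go idle. That second case is the gap the cover letter suggests closing in the future with work stealing between swqueues.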

The swqueue patch set is just over 200 lines of new code and the RFC patches are now out for review on the kernel mailing list.
