Linux 6.6 WQ Change May Help Out AMD CPUs & Other Systems With Multiple L3 Caches

Written by Michael Larabel in Linux Kernel on 10 September 2023 at 05:00 PM EDT.
In addition to the EEVDF scheduler replacing the CFS code in Linux 6.6, another fundamental and interesting change with Linux 6.6 is on the workqueue (WQ) side with a rework that can benefit systems with multiple L3 caches like modern AMD chiplet-based systems.

With the workqueue changes for Linux 6.6, unbound workqueues now support more flexible affinity scopes. Tejun Heo explained in that pull:
"The default behavior is to soft-affine according to last level cache boundaries. A work item queued from a given LLC is executed by a worker running on the same LLC but the worker may be moved across cache boundaries as the scheduler sees fit. On machines with multiple L3 caches, which are becoming more popular along with chiplet designs, this improves cache locality while not harming work conservation too much.

Unbound workqueues are now also a lot more flexible in terms of execution affinity. Differing levels of affinity scopes are supported and both the default and per-workqueue affinity settings can be modified dynamically. This should help working around many of the sub-optimal behaviors observed recently with asymmetric ARM CPUs.

This involved significant restructuring of workqueue code. Nothing was reported yet but there's some risk of subtle regressions. Should keep an eye out."
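For workqueues created with the WQ_SYSFS flag, these scopes are exposed under /sys/devices/virtual/workqueue/ on Linux 6.6. A minimal sketch of inspecting and changing a scope, assuming a 6.6+ kernel and using the writeback workqueue as the example (writing requires root):

```shell
WQ=/sys/devices/virtual/workqueue/writeback

# Read the current affinity scope; "cache" (last-level cache) is the new default.
if [ -r "$WQ/affinity_scope" ]; then
    cat "$WQ/affinity_scope"
else
    echo "affinity_scope not available (pre-6.6 kernel or workqueue lacks WQ_SYSFS)"
fi

# As root, the scope can be changed at runtime, e.g. back to per-NUMA-node behavior:
#   echo numa > "$WQ/affinity_scope"
```

The system-wide default can likewise be set at boot via the workqueue.default_affinity_scope= kernel parameter, with cpu, smt, cache, numa, and system as the possible scopes.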

The patch series posted while this code was being worked on adds more context:
"Unbound workqueues used to spray work items inside each NUMA node, which isn't great on CPUs w/ multiple L3 caches. This patchset implements mechanisms to improve and configure execution locality.
...
This has been mostly fine but CPUs became a lot more complex with many more cores and multiple L3 caches inside a single node with [differing] distances across them, and it looks like it's high time to improve workqueue's locality awareness.
...
Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches. Core-to-core latencies across L3 caches are ~2.6x worse than within each L3 cache. ie. it's worse but not hugely so. This means that the impact of L3 cache locality is noticeable in these experiments but may be subdued compared to other setups."
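The L3 layout described above can be seen from the kernel's sysfs cache topology. A quick sketch, assuming the kernel exposes the L3 cache as index3 (as it does on these AMD parts):

```shell
# Print the distinct sets of CPUs that share an L3 cache; on a Ryzen 9 3900X
# this yields four lists, one per four-L3 chiplet grouping.
lists=$(cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list 2>/dev/null | sort -u)
if [ -n "$lists" ]; then
    echo "$lists"
else
    echo "no L3 cache topology exposed on this system"
fi
```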

The patch series is promising and it will be interesting to see how this code, now part of Linux 6.6, pans out.

AMD EPYC and Ryzen chiplet CPUs

With all of the interesting changes built up for the Linux 6.6 merge window, which culminates today with the Linux 6.6-rc1 release, I'll be benchmarking many systems in the days ahead to look at Linux 6.6 performance on AMD and Intel systems compared to prior kernel releases.
About The Author
Michael Larabel

Michael Larabel is the principal author of Phoronix.com and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 20,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and OpenBenchmarking.org automated benchmarking software. He can be followed via Twitter, LinkedIn, or contacted via MichaelLarabel.com.
