Intel Brews Linux Change For More Efficient Idle CPU Searching Under Heavy System Load
Intel's Chen Yu worked out this new "SIS_UTIL" scheduler feature, which decides how far to search for an idle CPU based on the sum of the average utilization (util_avg) in the LLC domain. It stems from the finding that the kernel's select_idle_cpu() is too time-consuming when looking for an idle CPU while the system is overloaded.
Chen Yu explained in the detailed commit message:
It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down, before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time.
In summary, the lower the util_avg is, the more select_idle_cpu() should scan for idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct(117) when a LLC sched group is overloaded.
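To make that relationship concrete, here is a minimal Python sketch of how a scan depth could be derived from the summed utilization. It is an illustration under stated assumptions, not the kernel code: the function name nr_idle_scan, the quadratic fall-off, and the 16-CPU LLC in the usage note are assumptions. Only the direction (lower utilization means more scanning), the roughly 85% cutoff, and the imbalance_pct of 117 come from the commit message. The cutoff itself follows from 100 / 117, which is about 85.5%.

```python
# Fixed-point capacity of one fully busy CPU, as used by the Linux scheduler.
SCHED_CAPACITY_SCALE = 1024


def nr_idle_scan(sum_util: int, llc_weight: int, imbalance_pct: int = 117) -> int:
    """Hypothetical helper: how many CPUs a wakeup should scan in one LLC domain.

    sum_util      -- sum of util_avg over the LLC's CPUs (0 .. llc_weight * 1024)
    llc_weight    -- number of CPUs sharing the last-level cache
    imbalance_pct -- overload threshold; 117 gives a cutoff near 85% utilization
    """
    if llc_weight <= 0:
        raise ValueError("llc_weight must be positive")

    # Average utilization per CPU, still scaled to 0..1024.
    x = sum_util // llc_weight

    # Assumed quadratic penalty: reaches full scale once x / 1024 >= 100 / imbalance_pct.
    penalty = (x * x * imbalance_pct * imbalance_pct) // (10000 * SCHED_CAPACITY_SCALE)
    penalty = min(penalty, SCHED_CAPACITY_SCALE)

    # Remaining fraction of the LLC worth scanning, converted to a CPU count.
    remaining = SCHED_CAPACITY_SCALE - penalty
    return remaining * llc_weight // SCHED_CAPACITY_SCALE
```

With these assumptions, a completely idle 16-CPU LLC yields a scan depth of 16, a half-utilized one yields 10, and anything at or above roughly 85% utilization yields 0, meaning the wakeup path skips the search entirely.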
Testing the patch with Netperf, Chen Yu reported: "There is -87.9% less CPU scans after patched, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested.try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement - Because this patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock."
The patch did, however, yield a regression in the Stress-NG socket benchmark, which remains an area for further investigation. There was also a slight regression in at least one Hackbench test configuration.
See this commit to TIP's sched/core for more technical information on this fair scheduler change for the kernel. It will be interesting to kick the tires on this change with Linux 5.20 to see which other workloads it helps on an overloaded system and, conversely, what other regressions it may introduce. That is the fun of benchmarking, and in any case it is great to see Intel's continued contributions toward low-level Linux improvements.