Intel Brews Linux Change For More Efficient Idle CPU Searching Under Heavy System Load

Written by Michael Larabel in Intel on 28 June 2022 at 05:22 AM EDT.
A "sched/fair" scheduler change queued this morning into TIP's sched/core for Linux 5.20 aims to make searching for an idle CPU more efficient under heavy system load. The Intel-led change should improve the kernel's efficiency when the system is overloaded but, as with most low-level tuning, runs the risk of regressions.

Intel's Chen Yu developed the new "SIS_UTIL" scheduler feature, which limits the search for an idle CPU based on the sum of the utilization average (util_avg) in the LLC domain. It stems from the finding that the kernel's select_idle_cpu() is too time-consuming when looking for an idle CPU while the system is overloaded.

Chen Yu explained in the detailed commit message:
"It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down, before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time.

"In summary, the lower the util_avg is, the more select_idle_cpu() should scan for idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct(117) when a LLC sched group is overloaded."
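
To make the relationship concrete, here is a rough Python sketch of the idea rather than the kernel code. The 85% cutoff follows from the imbalance_pct value quoted above, since 100/117 is roughly 0.85. The function name nr_idle_scan, its parameters, and the linear falloff between zero load and the cutoff are assumptions made purely for illustration; the commit itself defines the exact curve the kernel uses.

```python
# Illustrative sketch only: not the kernel implementation.
SCHED_CAPACITY_SCALE = 1024  # kernel fixed-point value representing 100% of one CPU
IMBALANCE_PCT = 117          # value cited in the commit message


def nr_idle_scan(sum_util, llc_weight, imbalance_pct=IMBALANCE_PCT):
    """Hypothetical helper: how many CPUs a wakeup may scan.

    sum_util   -- sum of util_avg over all CPUs in the LLC domain
    llc_weight -- number of CPUs in the LLC domain
    """
    if llc_weight <= 0:
        raise ValueError("llc_weight must be positive")

    # Fraction of the LLC domain's total capacity currently in use.
    util_ratio = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)

    # 100 / 117 is about 0.85: at or above this, scanning stops.
    cutoff = 100 / imbalance_pct
    if util_ratio >= cutoff:
        return 0

    # ASSUMPTION: linear falloff from a full scan at zero load down to
    # no scan at the cutoff. The real patch may use a different curve.
    return int(llc_weight * (1 - util_ratio / cutoff))


if __name__ == "__main__":
    cpus = 16
    for pct in (0, 25, 50, 75, 85, 90):
        sum_util = cpus * SCHED_CAPACITY_SCALE * pct // 100
        print(f"{pct:3d}% utilized -> scan up to {nr_idle_scan(sum_util, cpus)} CPUs")
```

The point of the sketch is only the direction of the relationship: an idle LLC domain allows a full scan, the allowed scan depth shrinks as summed utilization rises, and it reaches zero around 85%.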

Testing the patch with Netperf showed: "There is -87.9% less CPU scans after patched, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested.try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement - Because this patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock."

The patch has, however, already produced a regression in the Stress-NG socket benchmark, which remains an area for further investigation. There was also a slight regression in at least one Hackbench test configuration.

See this commit to TIP's sched/core for more technical information on this fair scheduler change for the kernel. It will be interesting to kick the tires on this change for Linux 5.20 to see which other workloads it helps on an overloaded system and, conversely, what other regressions it may introduce. That is the fun of benchmarking, and in any case it is great to see Intel's continued contributions toward low-level Linux improvements.
