With A Few Lines Of Code, AMD's Nice Performance Optimization For Linux 5.20
A patch from AMD to further tune the Linux kernel's scheduler around NUMA imbalancing has been queued up and is slated for introduction in Linux 5.20. For some workloads this scheduler tuning can help significantly on AMD Zen-based systems, and it has the potential to help on Intel Xeon servers as well.
The change from AMD has the fair scheduler consider CPU affinity when allowing NUMA imbalance within the find_idlest_group() function.
AMD engineer K Prateek Nayak explained:
In the case of systems containing multiple LLCs per socket, like AMD Zen systems, users want to spread bandwidth hungry applications across multiple LLCs. Stream is one such representative workload where the best performance is obtained by limiting one stream thread per LLC. To ensure this, users are known to pin the tasks to a subset of the CPUs consisting of one CPU per LLC while running such bandwidth hungry tasks.
...
Ideally we would prefer that each stream thread runs on a different CPU from the allowed list of CPUs. However, the current heuristics in find_idlest_group() do not allow this during the initial placement.
[Example behavior]
Once the first four threads are distributed among the allowed CPUs of socket one, the rest of the threads start piling on these same CPUs when clearly there are CPUs on the second socket that can be used.
Following the initial pile up on a small number of CPUs, though the load-balancer eventually kicks in, it takes a while to get to {4}{4} and even {4}{4} isn't stable as we observe a bunch of ping-ponging from {4}{4} to {5}{3} and back before a stable state is reached much later (1 Stream thread per allowed CPU) and no more migration is required.
We can detect this piling and avoid it by checking if the number of allowed CPUs in the local group is fewer than the number of tasks running in the local group, and use this information to spread the 5th task out into the next socket (after all, the goal in this slowpath is to find the idlest group and the idlest CPU during the initial placement!).
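The heuristic is easy to picture in miniature. Below is a standalone C sketch of the idea, not the actual kernel patch (which works against the scheduler's internal sched_group and cpumask structures in kernel/sched/fair.c); the group layout, CPU numbering, and helper names here are all hypothetical simplifications. The check only tolerates keeping a task in the local NUMA group while that group still has an allowed CPU that isn't already occupied:

/*
 * Standalone sketch (not the actual kernel patch) of the heuristic quoted
 * above: only keep placing tasks into the local NUMA group while the number
 * of tasks already running there is below the number of CPUs the task is
 * allowed to use in that group. All sizes, masks and names are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

#define NCPUS 16  /* hypothetical 2-socket machine, 8 CPUs per socket */

struct group {
    bool cpu_in_group[NCPUS]; /* CPUs spanned by this NUMA group       */
    int  nr_running;          /* tasks currently running in this group */
};

/* Count how many of the task's allowed CPUs fall inside this group. */
static int allowed_cpus_in_group(const struct group *g, const bool allowed[NCPUS])
{
    int n = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (g->cpu_in_group[cpu] && allowed[cpu])
            n++;
    return n;
}

/*
 * Mirror of the check described in the quote: allow the NUMA imbalance
 * (i.e. keep the task local) only while the local group still has a free
 * CPU from the task's allowed set.
 */
static bool allow_numa_imbalance(const struct group *local, const bool allowed[NCPUS])
{
    return local->nr_running < allowed_cpus_in_group(local, allowed);
}

int main(void)
{
    /*
     * Socket 0 spans CPUs 0-7; the task is pinned to one CPU per LLC:
     * CPUs 0 and 4 on socket 0, CPUs 8 and 12 on socket 1 (hypothetical).
     */
    struct group socket0 = { .nr_running = 0 };
    bool allowed[NCPUS] = { [0] = true, [4] = true, [8] = true, [12] = true };

    for (int cpu = 0; cpu < 8; cpu++)
        socket0.cpu_in_group[cpu] = true;

    /*
     * Place 4 threads: the first 2 stay on socket 0, and the rest spill
     * over to the other socket instead of piling onto CPUs 0 and 4.
     */
    for (int t = 0; t < 4; t++) {
        bool local = allow_numa_imbalance(&socket0, allowed);

        printf("thread %d -> %s socket\n", t, local ? "local" : "remote");
        if (local)
            socket0.nr_running++;
    }
    return 0;
}

With two allowed CPUs per socket in this toy setup, the first two threads stay local while the third and fourth spill over to the other socket instead of stacking onto already-busy CPUs, which is the initial placement behavior the patch is after.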
For the Stream memory benchmark test case, this patch was able to increase performance upwards of 40%. On top of the current Linux kernel code, the patch benefited Stream by 36~44% in this common test case.
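Stream results like these presume the one-thread-per-LLC pinning described in the patch message. For illustration, here is a minimal userspace sketch of that style of pinning, not Stream's actual launcher, with hypothetical CPU IDs that depend entirely on the machine's topology (check lstopo or /sys/devices/system/cpu/ for the real layout):

/*
 * Minimal sketch, not Stream's actual launcher: pin each worker thread to
 * one CPU per last level cache so bandwidth-hungry threads don't share an
 * LLC. The CPU IDs are hypothetical and entirely topology-dependent.
 * Build with: gcc -pthread pin.c -o pin
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int llc_first_cpu[] = { 0, 4, 8, 12 }; /* one CPU per LLC (hypothetical) */
#define NTHREADS ((int)(sizeof(llc_first_cpu) / sizeof(llc_first_cpu[0])))

static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Restrict this thread to its chosen CPU. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    printf("worker pinned to CPU %d\n", sched_getcpu());
    /* ... memory-bandwidth-heavy work would run here ... */
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, &llc_first_cpu[i]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

In practice the same placement is usually set up with taskset or OpenMP affinity environment variables rather than explicit pthread calls, but the effect is the same: one bandwidth-hungry thread per LLC.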
While this is an AMD-led optimization aimed at benefiting their Zen-based processors with multiple last level caches per socket, the Linux scheduler change can also benefit Intel CPUs in multi-socket servers. For Stream on an Intel Xeon Scalable "Ice Lake" server, performance saw a 54~82% improvement over the current Linux code.
Not bad at all with this kernel patch just being a few lines of code!
As of this morning the patch was queued into sched/core, making it material to be sent in for the Linux 5.20 merge window later this summer, barring any issues with this code that had been residing on the kernel mailing list. Once the Linux 5.20 cycle is underway with this and any other optimizations, I'll definitely be around with some fresh Xeon and EPYC benchmarks.