Linux Patches Aim To Mitigate An Inconsistent Performance / NUMA Imbalancing Issue
An interesting Linux kernel patch series was posted this week to address inconsistent NUMA imbalance behavior for at least some workloads. The patches target performance differences that have shown up across a number of Linux kernel releases.
Longtime Linux kernel developer Mel Gorman summed up the issue well in his kernel mailing list post:
A problem was reported privately related to inconsistent performance of NAS when parallelised with MPICH. The root of the problem is that the initial placement is unpredictable and there can be a larger imbalance than expected between NUMA nodes. As there is spare capacity and the faults are local, the imbalance persists for a long time and performance suffers.
This is not 100% an "allowed imbalance" problem as setting the allowed imbalance to 0 does not fix the issue but the allowed imbalance contributes to the performance problem. The unpredictable behaviour was most recently introduced by commit c6f886546cb8 ("sched/fair: Trigger the update of blocked load on newly idle cpu").
mpirun forks hydra_pmi_proxy helpers with MPICH that go to sleep before execing the target workload. As the new tasks are sleeping, the potential imbalance is not observed as idle_cpus does not reflect the tasks that will be running in the near future. How bad the problem is depends on the timing of when fork happens and whether the new tasks are still running. Consequently, a large initial imbalance may not be detected until the workload is fully running. Once running, NUMA Balancing picks the preferred node based on locality and runtime load balancing often ignores the tasks as can_migrate_task() fails for either locality or task_hot reasons and instead picks unrelated tasks.
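To make the fork-time blind spot Gorman describes a bit more concrete, here is a deliberately simplified toy model in Python. It is not the kernel's placement logic: the function name `place_forked_tasks`, the two-node layout, and the rule that ties go to the lowest-numbered node (standing in for the forking parent's local node) are all assumptions made for illustration. It only shows why an idle-CPU count that ignores sleeping helpers can let tasks pile up on one node, and it shows the extreme, deterministic case, whereas the real behavior depends on timing.

```python
def place_forked_tasks(num_tasks, num_nodes=2, cpus_per_node=8,
                       count_sleepers=False):
    """Toy model: place forked helpers on the node that looks most idle.

    Each helper goes to sleep right after being placed, so it never shows
    up as "running". With count_sleepers=False the apparent idle count
    ignores those sleepers, mimicking an idle_cpus value that does not
    reflect tasks about to run. Hypothetical illustration only.
    """
    running = [0] * num_nodes    # tasks currently occupying a CPU
    sleeping = [0] * num_nodes   # forked helpers that are asleep

    def apparent_idle(node):
        used = running[node]
        if count_sleepers:
            used += sleeping[node]
        return cpus_per_node - used

    for _ in range(num_tasks):
        # max() returns the first best candidate, so ties go to node 0,
        # our stand-in for the parent's local node (an assumption).
        target = max(range(num_nodes), key=apparent_idle)
        sleeping[target] += 1    # helper sleeps before exec'ing the workload

    # Once the helpers wake and exec, this is the per-node task count.
    return sleeping


print(place_forked_tasks(8))                       # [8, 0]
print(place_forked_tasks(8, count_sleepers=True))  # [4, 4]
```

In this sketch, ignoring the sleepers leaves every node looking fully idle at each fork, so all eight helpers land on node 0. Accounting for them spreads the helpers evenly. The actual patch series works on the scheduler's real data, so treat this purely as intuition for the quoted explanation.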
Gorman confirmed the problem has been around for a while: it started around Linux 5.7~5.8, was fixed in Linux 5.12, and broke again in Linux 5.13. Thankfully, he was able to work out this patch series to address the problem, which affects the NAS Parallel Benchmarks (NPB) with MPICH and potentially other workloads too.
See this patch series for more details on the pending work.