New Scheduler Optimization Can Help Out PostgreSQL & More On Sapphire Rapids
After Intel engineers found significant overhead in some Linux scheduler functions when running PostgreSQL within a Docker instance, a new scheduler patch is on the way for Linux 6.7 that will help at least Ice Lake and Sapphire Rapids with some migration-heavy workloads. Since the change is in the common scheduler code, it is likely to help other hardware platforms too.
Adding to the work already queued for what is expected to be Linux 6.7, the patch "sched/fair: Ratelimit update to tg->load_avg" is quite interesting. Its commit message explains:
"When using sysbench to benchmark Postgres in a single docker instance with sysbench's nr_threads set to nr_cpu, it is observed there are times update_cfs_group() and update_load_avg() shows noticeable overhead on a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group
10.63% 10.04% [kernel.vmlinux] [k] update_load_avg
Annotate shows the cycles are mostly spent on accessing tg->load_avg with update_load_avg() being the write side and update_cfs_group() being the read side. tg->load_avg is per task group and when different tasks of the same taskgroup running on different CPUs frequently access tg->load_avg, it can be heavily contended."
In response to this overhead, seen while running PostgreSQL in a Docker container on a dual-socket Sapphire Rapids server, the patch rate-limits updates to tg->load_avg to at most once per millisecond. In turn, the cost of accessing tg->load_avg is "greatly reduced and performance improved." Benchmarks show it helping PostgreSQL with sysbench on Sapphire Rapids by as much as 21%, Hackbench on Ice Lake by as much as 22%, and Hackbench on Sapphire Rapids by up to 48%. Netperf in an extreme case improved by 189%.
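To illustrate the idea, here is a minimal user-space sketch of rate-limiting writes to a shared counter. It is not the kernel patch itself: the struct names, the update_group_load() helper, and the caller-supplied timestamp are hypothetical, and C11 atomics stand in for the kernel's own primitives. Each CPU keeps its contribution locally and only touches the shared, contended sum when at least a millisecond has passed since its last published update.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Shared by every CPU running tasks of the group: the contended data. */
struct task_group_sketch {
    atomic_long load_avg;
};

/* Per-CPU state (hypothetical layout): only the owning CPU touches it. */
struct cpu_queue_sketch {
    struct task_group_sketch *tg;
    long load_avg;            /* this CPU's current load estimate */
    long contrib;             /* what this CPU last published to the group */
    uint64_t last_update_ns;  /* time of the last published update */
};

/*
 * Publish this CPU's load change to the shared sum, at most once per ms.
 * Returns 1 if the shared counter was written, 0 if the write was skipped.
 */
static int update_group_load(struct cpu_queue_sketch *q, uint64_t now_ns)
{
    /* Rate limit: skip the shared write if we published under 1 ms ago. */
    if (now_ns - q->last_update_ns < NSEC_PER_MSEC)
        return 0;

    long delta = q->load_avg - q->contrib;
    if (delta == 0)
        return 0;

    /* One atomic write to the shared counter, then remember what we sent. */
    atomic_fetch_add(&q->tg->load_avg, delta);
    q->contrib = q->load_avg;
    q->last_update_ns = now_ns;
    return 1;
}
```

The trade-off in a scheme like this is that the shared sum can lag a CPU's true load by up to a millisecond, in exchange for far fewer writes to the contended counter.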
This performance optimization patch adds just a dozen lines of new code to the scheduler. With it now in TIP.git's sched/core branch, it is likely to be sent in for the Linux 6.7 merge window, barring any problems. It will be interesting to benchmark this change with other migration-heavy workloads and on more hardware with the next kernel cycle.