Linux 6.9 Has A Big Rework To CPU Timers - Some Power/Performance Benefits
The Linux 6.9 kernel includes a big rework of the CPU timer code that has been years in the making and brings some power and performance benefits.
Thomas Gleixner summed up this big hierarchical timer model update in the timers/core pull request for the Linux 6.9 merge window. He explained this significant undertaking rather well:
The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer wheel of a CPU which is likely to be busy at the time of expiry. This is done to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by definition as the chance to get the prediction right is close to zero.
2) Due to #1 it is possible that timers are accumulated on a single target CPU.
3) The required computation in the enqueue path is just overhead for dubious value especially under the consideration that the vast majority of timer wheel timers are either canceled or rearmed before they expire.
The timer pull model avoids the above by removing the target computation on enqueue and queueing timers always on the CPU on which they get armed.
This is achieved by having separate wheels for CPU pinned timers and global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global timers which are queued on the CPU local timer wheels.
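To make the split between pinned and global wheels concrete, here is a minimal Python sketch. It is only an illustration of the idea described in the pull request, not the kernel's C implementation: the class name `CpuTimerBase`, its fields, and the use of heaps in place of real timer wheels are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class CpuTimerBase:
    """Toy stand-in for one CPU's timer state (hypothetical, not a kernel structure)."""
    cpu: int
    pinned: list = field(default_factory=list)         # min-heap of (expiry, name)
    global_timers: list = field(default_factory=list)  # min-heap of (expiry, name)

    def arm(self, expiry, name, pinned=False):
        # No target-CPU heuristic: the timer is always queued on the CPU
        # that arms it; only the choice of wheel depends on the timer type.
        wheel = self.pinned if pinned else self.global_timers
        heapq.heappush(wheel, (expiry, name))

    def expire_busy(self, now):
        """While busy, the CPU handles both of its local wheels."""
        fired = []
        for wheel in (self.pinned, self.global_timers):
            while wheel and wheel[0][0] <= now:
                fired.append(heapq.heappop(wheel)[1])
        return fired
```

The point of the sketch is that `arm()` does no work beyond picking one of two local queues, which is where the enqueue-path savings come from.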
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global timers can be ignored as the CPU will wake up before they expire.
- If the first expiring timer is a global timer, then the expiry time is propagated into the timer pull hierarchy and the CPU makes sure to wake up for the first pinned timer.
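The two idle-time cases above can be sketched as a small decision function. This is a simplified Python model with hypothetical names (`IdleDecision`, `evaluate_on_idle`); treating a tie between a pinned and a global expiry as "pinned first" is an assumption of the sketch, not something stated in the pull request.

```python
from typing import NamedTuple, Optional


class IdleDecision(NamedTuple):
    local_wakeup: Optional[int]      # when this CPU must wake itself up
    propagate_global: Optional[int]  # expiry handed to the pull hierarchy, if any


def evaluate_on_idle(next_pinned, next_global):
    """Decide what a CPU does with its two wheels when it goes idle.

    Arguments are the first expiry of each wheel, or None if the wheel is empty.
    """
    if next_global is None:
        # Nothing global queued: only the pinned timer (if any) matters.
        return IdleDecision(next_pinned, None)
    if next_pinned is not None and next_pinned <= next_global:
        # Pinned timer expires first: the CPU wakes up before any global
        # timer is due, so the global timers can be ignored for now.
        return IdleDecision(next_pinned, None)
    # Global timer expires first: hand its expiry to the hierarchy and
    # only wake up locally for the first pinned timer.
    return IdleDecision(next_pinned, next_global)
```

For example, with a pinned timer at 300 and a global timer at 200, the CPU sleeps until 300 and leaves the 200 expiry to whichever CPU is acting as migrator.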
The timer pull hierarchy organizes CPUs in groups of eight at the lowest level and at the next levels groups of eight groups up to the point where no further aggregation of groups is required, i.e. the number of levels is log8(NR_CPUS). The magic number of eight has been established by experimentation, but can be adjusted if needed.
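As a rough worked example of the log8(NR_CPUS) arithmetic, the Python sketch below counts how many levels of eight-way grouping are needed for a given CPU count. The function names are hypothetical, and grouping CPUs purely by consecutive number is an assumption made for illustration; the kernel's actual group assignment may take topology into account.

```python
def hierarchy_levels(nr_cpus, group_size=8):
    """Levels needed until a single top-level group covers all CPUs."""
    levels = 1
    capacity = group_size
    while capacity < nr_cpus:
        capacity *= group_size
        levels += 1
    return levels


def group_path(cpu, levels, group_size=8):
    """Group index of a CPU at each level, lowest level first (illustrative numbering)."""
    return [cpu // group_size ** (level + 1) for level in range(levels)]
```

Under these assumptions, 64 CPUs need two levels and 512 CPUs need three, so even very large systems end up with only a handful of levels to walk.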
In each group one busy CPU acts as the migrator. It's only one CPU to avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether there are other CPUs in the group which have gone idle and have global timers to expire. If there are global timers to expire, the migrator locks the remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can require walking the hierarchy downwards to the CPU level.
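The migrator's job within a single group can be sketched as follows. This Python model is deliberately simplified: it covers one group only, omits locking and the walk through higher levels, and picks the lowest-numbered busy CPU as migrator, which is an arbitrary choice for the example. The `Cpu` and `Group` classes are hypothetical.

```python
import heapq


class Cpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.idle = False
        self.global_timers = []  # min-heap of (expiry, name)


class Group:
    def __init__(self, cpus):
        self.cpus = cpus

    def migrator(self):
        """Exactly one busy CPU acts as migrator (lowest id here, by assumption)."""
        busy = [c for c in self.cpus if not c.idle]
        return min(busy, key=lambda c: c.cpu_id) if busy else None

    def handle_remote_expiry(self, now):
        """Run by the migrator: expire due global timers of idle group members."""
        mig = self.migrator()
        fired = []
        if mig is None:
            return fired  # whole group idle: a higher level would take over
        for cpu in self.cpus:
            if cpu is mig or not cpu.idle:
                continue  # busy CPUs handle their own wheels
            while cpu.global_timers and cpu.global_timers[0][0] <= now:
                _, name = heapq.heappop(cpu.global_timers)
                fired.append((cpu.cpu_id, name, mig.cpu_id))
        return fired
```

Each returned tuple records which idle CPU owned the timer and which CPU expired it, showing that the idle CPU never had to wake up for its global timers.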
Special care is taken when the last CPU goes idle. At this point the CPU is the systemwide migrator at the top of the hierarchy and it therefore cannot delegate to the hierarchy. It needs to arm its own timer device to expire either at the first expiring timer in the hierarchy or at the first CPU local timer, whichever expires first.
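The last-CPU rule boils down to taking the earlier of two deadlines. A minimal Python sketch, with a hypothetical function name and `None` used to mean "no timer queued":

```python
def last_cpu_wakeup(first_hierarchy_expiry, first_local_expiry):
    """Expiry the last idle CPU programs into its own timer device.

    It cannot delegate, so it must cover both the earliest timer anywhere
    in the hierarchy and its own earliest CPU-local timer.
    """
    candidates = [
        t for t in (first_hierarchy_expiry, first_local_expiry) if t is not None
    ]
    return min(candidates) if candidates else None
```

If neither source has a pending timer, the sketch returns `None`, standing in for a CPU that can sleep without programming a wakeup.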
This completely removes the overhead from the enqueue path, which is e.g. for networking a true hotpath and trades it for a slightly more complex idle path.
Gleixner went on to explain the benefits of this big effort:
This has been in development for a couple of years and the final series has been extensively tested by various teams from silicon vendors and ran through extensive CI.
There have been slight performance improvements observed on network centric workloads and an Intel team confirmed that this allows them to power down a die completely on a multi-die socket for the first time in a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific overloaded netperf test which is currently being investigated, but the rest is either positive or neutral performance-wise and positive on the power management side.
This timers pull was merged alongside all of the other TIP.git material for the Linux 6.9 merge window. It will be fun to fire up some benchmarks of Linux 6.9 soon to see how the performance and power are looking overall given the numerous feature changes this cycle.