I saw this on the docket a few days ago and looked at the slides, although I couldn't find the paper on sci-hub.
One of the things they talk about is cores running at different frequencies. But not all CPUs have that capability. If all the cores share a supply voltage, the core running at the highest operating performance point sets the voltage requirement for the whole chip, and there's no energy to be saved by running any of the other cores slower. Intel's FIVR is supposed to enable per-core voltage levels, but my 1st-gen Haswell can't do it, and the latest Alder Lake desktop CPUs only use the FIVR for auxiliary voltages and run the cores directly from external power. So this aspect of the work might not be applicable outside of server parts.
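To make the shared-rail point concrete, here's a back-of-the-envelope sketch (my numbers, not the paper's) using the usual dynamic-power approximation P ~ C * V^2 * f: with a shared rail, slowing one core only saves the linear f term, whereas a per-core rail lets V drop too.

```python
# Rough dynamic-power model: P ~ C * V^2 * f. The capacitance factor C and
# the operating points below are made up for illustration.
def dynamic_power(f_ghz, v_volts, c=1.0):
    return c * v_volts**2 * f_ghz

# Hypothetical operating points: (frequency in GHz, minimum stable voltage)
fast = (4.0, 1.2)
slow = (2.0, 0.9)

full_speed  = dynamic_power(fast[0], fast[1])   # 5.76 units
shared_rail = dynamic_power(slow[0], fast[1])   # slow core stuck at 1.2 V: 2.88 (50%)
per_core    = dynamic_power(slow[0], slow[1])   # slow core drops to 0.9 V: 1.62 (~28%)

print(full_speed, shared_rail, per_core)
```

With these made-up numbers, halving the frequency on a shared rail only halves the power, while halving it with a per-core rail cuts power to roughly a quarter -- which is why shared-rail chips have little to gain from running some cores slower.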
Also, they suggest the idea of keeping recently active cores at a higher frequency, "keeping cores warm", but CPU OPP transitions can be very fast, on the order of 20 us (EDIT: Apparently not. I notice I am confused.), and IIRC only the shallowest C1 idle state keeps cores charged up and ready -- I remember reading somewhere that C1E and below drop the voltage down to the minimum or power gate the core. I have a hunch that the performance gains they're seeing are mostly due to the workload looking less multi-threaded and not triggering Intel's power management algorithm to reduce the turbo ratio limit. The processor they tested on has 16 cores in only a 125 W TDP (can't find the datasheet, alas, and IDK whether Xeons have PL2 >> PL1 = TDP like the consumer parts), and the maximum 1T clock is 1.76x the base clock.
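The turbo-ratio hunch in numbers: Intel parts typically lower the maximum turbo bin as more cores become active, so work packed onto one core can legitimately clock higher than the same work spread across all cores. The table below is hypothetical (I don't have the tested Xeon's actual bins), chosen only so the 1-core entry matches the 1.76x single-thread ratio mentioned above.

```python
# Hypothetical turbo-ratio table: max frequency (GHz) allowed at each
# active-core count. Made-up values, not the tested processor's real bins.
turbo_ghz_by_active_cores = {1: 4.4, 2: 4.2, 4: 3.8, 8: 3.2, 16: 2.5}
base_ghz = 2.5  # assumed base clock so that 4.4 / 2.5 = 1.76x

for n, f in sorted(turbo_ghz_by_active_cores.items()):
    print(f"{n:2d} active cores -> {f} GHz ({f / base_ghz:.2f}x base)")
```

If the scheduler change makes the workload look like a 1- or 2-core job to the power management firmware, that alone could explain a large clock-speed (and hence performance) delta, independent of any cache-warmth effect.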
Finally, this might be working around schedutil's glass jaw with multi-threaded serial workloads. By packing all the threads onto one core, schedutil would see that core as fully loaded even if each task spends a lot of its time asleep. The downside would be unnecessarily higher energy use when the threads are actually independent.
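A minimal sketch of the packing effect, assuming a Linux box (os.sched_setaffinity is Linux-only): pinning a process to one CPU forces all of its threads to ping-pong on that core, which schedutil then observes as a single nearly-saturated core rather than several mostly idle ones.

```python
# Linux-only sketch: pin the current process to CPU 0. All threads it
# spawns inherit the affinity mask, so serialized threads that hand work
# back and forth keep one core busy instead of lightly loading several.
import os

os.sched_setaffinity(0, {0})       # pid 0 = the calling process
print(os.sched_getaffinity(0))     # the effective CPU set, now {0}
```

From schedutil's per-core utilization view, that one core now looks fully loaded and gets a high frequency request, even though each individual thread sleeps most of the time.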