I saw this on the docket a few days ago and looked at the slides, although I couldn't find the paper on sci-hub.
One of the things they talk about is cores running at different frequencies. But not all CPUs have that capability. If all the cores share a supply voltage, the core running at the highest operating performance point sets the voltage requirement for the whole chip, and there's no energy to be saved by running any of the other cores slower. Intel's FIVR is supposed to enable per-core voltage levels, but my 1st-gen Haswell can't do it, and the latest Alder Lake desktop CPUs only use the FIVR for auxiliary voltages and run the cores directly from external power. So this aspect of the work might not be applicable outside of server parts.
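To make the shared-rail point concrete, here's a back-of-the-envelope sketch (my numbers, not the paper's) using the usual dynamic-power approximation P ~ C * V^2 * f: with a shared rail, slowing one core only saves the linear f term, whereas a per-core rail lets V drop too.

```python
# Rough dynamic-power model: P ~ C * V^2 * f. The capacitance factor C and
# the operating points below are made up for illustration.
def dynamic_power(f_ghz, v_volts, c=1.0):
    return c * v_volts**2 * f_ghz

# Hypothetical operating points: (frequency in GHz, minimum stable voltage)
fast = (4.0, 1.2)
slow = (2.0, 0.9)

full_speed  = dynamic_power(fast[0], fast[1])   # 5.76 units
shared_rail = dynamic_power(slow[0], fast[1])   # slow core stuck at 1.2 V: 2.88 (50%)
per_core    = dynamic_power(slow[0], slow[1])   # slow core drops to 0.9 V: 1.62 (~28%)

print(full_speed, shared_rail, per_core)
```

With these made-up numbers, halving the frequency on a shared rail only halves the power, while halving it with a per-core rail cuts power to roughly a quarter -- which is why shared-rail chips have little to gain from running some cores slower.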
Also, they suggest the idea of keeping recently active cores at a higher frequency, "keeping cores warm", but CPU OPP transitions can be very fast, on the order of 20 us (EDIT: Apparently not. I notice I am confused.), and IIRC only the shallowest C1 idle state keeps cores charged up and ready -- I remember reading somewhere that C1E and below drop the voltage down to the minimum or power gate the core. I have a hunch that the performance gains they're seeing are mostly due to the workload looking less multi-threaded and not triggering Intel's power management algorithm to reduce the turbo ratio limit. The processor they tested on has 16 cores in only a 125 W TDP (can't find the datasheet, alas, and IDK whether Xeons have PL2 >> PL1 = TDP like the consumer parts), and the maximum 1T clock is 1.76x the base clock.
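The turbo-ratio hunch in numbers: Intel parts typically lower the maximum turbo bin as more cores become active, so work packed onto one core can legitimately clock higher than the same work spread across all cores. The table below is hypothetical (I don't have the tested Xeon's actual bins), chosen only so the 1-core entry matches the 1.76x single-thread ratio mentioned above.

```python
# Hypothetical turbo-ratio table: max frequency (GHz) allowed at each
# active-core count. Made-up values, not the tested processor's real bins.
turbo_ghz_by_active_cores = {1: 4.4, 2: 4.2, 4: 3.8, 8: 3.2, 16: 2.5}
base_ghz = 2.5  # assumed base clock so that 4.4 / 2.5 = 1.76x

for n, f in sorted(turbo_ghz_by_active_cores.items()):
    print(f"{n:2d} active cores -> {f} GHz ({f / base_ghz:.2f}x base)")
```

If the scheduler change makes the workload look like a 1- or 2-core job to the power management firmware, that alone could explain a large clock-speed (and hence performance) delta, independent of any cache-warmth effect.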
Finally, this might be working around schedutil's glass jaw with multi-threaded serial workloads. By packing all the threads onto one core, schedutil would see that core as fully loaded even if each task spends a lot of its time asleep. The downside would be unnecessarily higher energy use when the threads are actually independent.
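A minimal sketch of the packing effect, assuming a Linux box (os.sched_setaffinity is Linux-only): pinning a process to one CPU forces all of its threads to ping-pong on that core, which schedutil then observes as a single nearly-saturated core rather than several mostly idle ones.

```python
# Linux-only sketch: pin the current process to CPU 0. All threads it
# spawns inherit the affinity mask, so serialized threads that hand work
# back and forth keep one core busy instead of lightly loading several.
import os

os.sched_setaffinity(0, {0})       # pid 0 = the calling process
print(os.sched_getaffinity(0))     # the effective CPU set, now {0}
```

From schedutil's per-core utilization view, that one core now looks fully loaded and gets a high frequency request, even though each individual thread sleeps most of the time.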