
"Nest" Is An Interesting New Take On Linux Kernel Scheduling For Better CPU Performance


#11
Originally posted by ms178 View Post
I use BORE as of late, but I don't know if that already accounts for this.
BORE is just patched CFS. None of the schedulers I know of addresses this task-assignment problem. They are mostly focused on how efficiently tasks are prioritized, but none of them actually takes the CPU cores themselves into account.



#12
Originally posted by V1tol View Post
BORE is just patched CFS. None of the schedulers I know of addresses this task-assignment problem. They are mostly focused on how efficiently tasks are prioritized, but none of them actually takes the CPU cores themselves into account.
Thanks for your insight. I don't know much about these internals, so I'm surprised to learn that such information isn't already accounted for in today's schedulers, given that boost-frequency behavior has a huge impact on performance. The scheduler also needs to know about power management and the core/cache layout. I guess that means there are still performance gains to be had in the future.



#13
Originally posted by birdie View Post
I've been thinking about that for years; it's strange no one has raised the issue earlier. The Linux kernel is notorious for its penchant for juggling tasks between CPU cores for no reason, which empties whatever you had in the L1/L2 caches and adds non-zero delays, since the new CPU core could be at its absolute lowest power setting when it is handed a task to execute.

You could simply run:

7z b -mmt1

and see in top or any graphical process manager how the task is thrown between CPU cores.
There's a reason for that: you typically want to maximize thread uptime. Remember that there are literally hundreds, if not thousands, of threads fighting for CPU time; your thread of interest is going to get bumped at some point. You then have two options: either wait for whoever bumped it to finish and reschedule on the same core, or bump someone else and reschedule on another core ASAP. That's why you see high-workload threads jumping between cores: the OS is trying to maximize their uptime, even if the core allocation is suboptimal.

[Granted, in a situation where you only have a handful of high-workload threads, it would be "more" ideal if those were locked to their cores and the rest of the threads fought over the remaining ones, as in the sketch below. This does incur a latency hit, though, and breaks down as the number of "high-workload" threads increases.]
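
For reference, here is a minimal sketch of that pinned variant using the Linux affinity API (the choice of core 2 is arbitrary, and error handling is kept to the essentials):

#define _GNU_SOURCE            /* for cpu_set_t macros and sched_setaffinity */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);          /* allow this thread on core 2 only */

    /* pid 0 means "the calling thread"; after this call the
     * scheduler will no longer migrate it to another core. */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ...run the hot loop here; it now stays on core 2... */
    return 0;
}

From the shell, taskset -c 2 7z b -mmt1 achieves the same thing without touching any code.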



#14
Like all nice things, this probably comes with some obscure security vulnerability.



#15
Originally posted by gamerk2 View Post

There's a reason for that: you typically want to maximize thread uptime. Remember that there are literally hundreds, if not thousands, of threads fighting for CPU time; your thread of interest is going to get bumped at some point. You then have two options: either wait for whoever bumped it to finish and reschedule on the same core, or bump someone else and reschedule on another core ASAP. That's why you see high-workload threads jumping between cores: the OS is trying to maximize their uptime, even if the core allocation is suboptimal.

[Granted, in a situation where you only have a handful of high-workload threads, it would be "more" ideal if those were locked to their cores and the rest of the threads fought over the remaining ones. This does incur a latency hit, though, and breaks down as the number of "high-workload" threads increases.]
I ran the test on an almost completely idle system (xorg plus a graphical terminal). The thread-uptime explanation doesn't hold in this case: CFS still moves the task across cores willy-nilly.
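
For anyone who wants to reproduce this without staring at top, here is a small sketch that busy-loops and reports every migration the kernel performs (sched_getcpu() is a glibc extension, so this is Linux-specific; stop it with Ctrl-C):

#define _GNU_SOURCE            /* for sched_getcpu() */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int last = -1;
    unsigned long migrations = 0;

    for (;;) {                 /* burn CPU, much like 7z b -mmt1 */
        int cpu = sched_getcpu();
        if (cpu != last) {
            if (last != -1)
                printf("migrated: CPU %d -> CPU %d (total %lu)\n",
                       last, cpu, ++migrations);
            last = cpu;
        }
    }
}

If CFS behaves as described above, this prints a steady trickle of migrations even on an otherwise idle desktop.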



#16
I love this. I feel like the pursuit of pure performance sort of ignores practical realities in many real-world use cases. It's good to see some out-of-the-box thinking on such things.

Another scheduler I think might be handy is one that tries to keep threads on the core they were started on for longer, to maximize cache benefits (and that starts all processes on E cores). A third would be for virtual guests: one that tries to keep a given process on a core, but also tries to spread out across as many cores as possible simultaneously, to take advantage of gang-scheduling realities.



#17
Originally posted by birdie View Post
I've been thinking about that for years; it's strange no one has raised the issue earlier. The Linux kernel is notorious for its penchant for juggling tasks between CPU cores for no reason, which empties whatever you had in the L1/L2 caches and adds non-zero delays, since the new CPU core could be at its absolute lowest power setting when it is handed a task to execute.
From what I've heard, one of the benefits of this is spreading out the heat per core. It may not make a big difference, but enough that the CPU can retain boost clocks for longer. So long as you have a shared L3 cache, the extended boost clocks probably make up for the L1 and L2 emptying.
Speaking of L1: that's pretty much always going to be cycled constantly, even if a thread were totally locked to a single core. It's too small to retain enough data, and it's meant to be small so it can be a lot faster.
Also, as pointed out already, Windows has been doing the same thing for a long while.



#18
Originally posted by CochainComplex View Post
But this implies that the die layout has to be hardcoded for every CPU, or at least every generation, especially for the large server CPUs... or maybe it's covered by staying within a NUMA node (for the larger ones)? Maybe it is less problematic than I think.

Edit: there is another culprit as well.

Let's imagine the scheduler always chooses the best high-clocking core for performance. Statistically, one out of n cores is the best and will always be it. This would result in potential overuse of that core and its neighbours, and in turn more thermal wear-out on that particular core (and its neighbours).
As far as I've understood, this isn't about always choosing the fastest core, but about choosing cores that have been used recently. E.g. a core that is already running at 1.5 GHz may be a better choice than a core that is currently completely power-gated. Also, if you happen to land on the same core the task ran on recently, the data may still be cached there. A toy model of the idea is sketched below.
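
To make that concrete, here is a toy userspace model of the concept (entirely an illustration on my part, not the actual patch: the real Nest logic lives in the kernel's task-placement path, and the core count and policy here are simplifications):

#include <stdbool.h>
#include <stdio.h>

#define NCPUS 8

static bool warm[NCPUS];   /* the "nest": cores used recently */
static bool busy[NCPUS];   /* cores currently running a task  */

/* Prefer an idle warm core; only wake a cold one under pressure,
 * at which point it joins the nest. */
static int pick_cpu(void)
{
    for (int c = 0; c < NCPUS; c++)
        if (warm[c] && !busy[c])
            return c;              /* reuse a warm, idle core */

    for (int c = 0; c < NCPUS; c++)
        if (!warm[c] && !busy[c]) {
            warm[c] = true;        /* reluctantly expand the nest */
            return c;
        }

    return 0;                      /* everything busy: fall back */
}

int main(void)
{
    warm[0] = true;                /* pretend core 0 ran us recently */

    for (int task = 0; task < 4; task++) {
        int c = pick_cpu();
        busy[c] = true;
        printf("task %d -> CPU %d\n", task, c);
    }
    return 0;
}

The only point of the sketch is the ordering: warm cores are reconsidered first, so a task tends to land where the caches may still be hot and the clocks are already up.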
Last edited by bug77; 15 September 2022, 06:37 PM.



                  • #19
                    Patches are here: https://gitlab.inria.fr/nest-public/...image_creation

                    Lets see if this works on something newer than kernel 5.9



#20
Originally posted by user1 View Post

Interesting... What you described here is exactly the behavior of the Windows scheduler that someone described in another forum. I've even seen people there calling the Windows scheduler "dumb", when it turns out the Linux scheduler actually works in a similar way.
Try running the same example on Windows and you will see a massive difference. On Linux the 7z task is for the most part switched between two cores on a 60/40 split (i.e. CFS tries to keep it a bit longer on the original core, then decides to switch it to another core, but quickly moves it back again), while on Windows it is constantly switched between all cores: if you have 4 cores, each one will flatline at 25%. So while the CFS behaviour isn't optimal, it's far from as bad as the Windows one in this regard.
Last edited by F.Ultra; 15 September 2022, 10:52 AM.

