
Is The Linux Kernel Scheduler Worse Than People Realize?


  • Is The Linux Kernel Scheduler Worse Than People Realize?

    Phoronix: Is The Linux Kernel Scheduler Worse Than People Realize?

    A number of Phoronix readers have been pointing out material to indicate that the Linux kernel scheduler isn't as good as most people would assume...

    http://www.phoronix.com/scan.php?pag...-Scheduler-Bad

  • xrysf03
    replied
    Sounds like quite a tabloid topic :-)

    The process scheduler's job is unrewarding. No matter what default behavior it assumes, some people will be happy and others will be outraged.

    The one key problem area to me is the cache hierarchy and its influence on optimal process scheduling (CPU affinity) decisions. For such decisions, with caching in mind, to be optimal, the scheduler would need to know, on a per-thread basis, which other threads the thread at hand belongs with, and how dependent the threads mutually are on "shared state" (memory).

    Yes, on the outside, in user space, there is the concept of a process and its internal threads, and by default the inner threads share the virtual memory space of the mother goose process. Different processes have separate virtual memory spaces. That said, consider that some programs actually use thread-private memory, with truly shared memory objects (prone to cache thrashing) being a tiny fraction. (And worse, if the threads allocate their "actually thread-private" object trees from a shared heap, the individual objects won't be aligned to cache lines.) And some software that's structured into separate processes uses explicitly mapped shared memory windows for inter-process communication.

    That user-space taxonomy maps user-space threads onto a uniform kernel-space notion of "tasks". In the Linux kernel, there's a single declaration of "struct task_struct" that caters for both user-space threads and the mother goose processes. Apparently the inner threads belonging to a single process are kept track of using a "struct task_group" in the kernel, with one of the tasks being flagged as the "group leader". The glue between user space and kernel space is called NPTL = the Native POSIX Thread Library. The kernel does seem to support "group scheduling", but this is possibly where different schedulers vary in their handling of the individual "tasks within a group".

    Principally: imagine what information the kernel would need to correctly trade off CPU utilization vs. cache locality. Based on what should the scheduler decide whether to move a process onto an idle CPU core that's possibly "more distant in the cache hierarchy", vs. keep the runnable thread on a more congested CPU core that's closer to the other running threads, vs. move the whole pack (mother goose process, task group) to a different "NUMA island" that's idle overall? This is a topic for quite some clustering analysis! Academically fun algorithms to consider, but in reality probably a headache. Back to the leading question of this paragraph, what information is needed for the decision: 1) information about the structure of the hardware and 2) information about the structure of the software (threads and processes, to keep things simple).

    The kernel does have (or can obtain) information about the multi-level clustering of CPU cores: SMT pairs (quads, anyone?), multi-core islands sharing L2 cache, NUMA nodes sharing L3 cache and a flock of DRAM controllers, maybe 2nd-order NUMA nodes consisting of 2 or more chips in a common package (socket) glued together by some ring bus with lower latency than QPI/HT but higher latency than the on-chip interconnect... That's some 4 levels of "locality". And the SMT grouping, besides featuring the tightest cache integration, may or may not benefit from "tight coupling" between threads, depending on the load percentage required by the individual threads - so there's a micro-tradeoff at this level of the cache hierarchy that's "orthogonal" to the "cache locality" criterion.

    Actually the arguments for keeping threads within a single SMT cluster are similar to the arguments for keeping them on a single CPU core (sharing the available time of that core). One important aspect is the "hungriness" of each individual thread for CPU time. If your process consists of many threads that are pretty much event-driven and mostly just wait for IO or for another thread to deliver something via a mutex-locked queueing object, they will run most efficiently on a shared core or within a cluster of SMT pseudo-cores. The difficult decisions need to be taken when each thread is ready to eat 100% of a CPU core while the threads concurrently access a shared memory space. (Which would point to the interesting topic of inter-thread locking in such a scenario, and its impact on per-thread efficiency, but that's probably a different story, at the programming level.)

    The difficult part probably is: how does the scheduler get to know whether the threads within a process (tasks within a task group, or a group of processes sharing an SHM window) are tightly or loosely coupled in terms of concurrent memory access? The fine-grained parts of memory allocation happen in user space, in libc. The allocator living in user-space libc asks the kernel to allocate virtual memory pages for the process (task group). Libc could possibly keep track of which thread asked for an allocation of what chunk of memory, but if the virtual memory space is shared, then the thread that requested the allocation doesn't need to be the thread actually using the memory, or using it more often than others. So the libc memory allocator probably doesn't care about "thread affinity" of individual memory allocations. And the kernel on its own doesn't have much chance to glean/infer/track actual memory *usage* (and detect concurrent access patterns, for instance) to use in some heuristic algorithms. It *could* do that theoretically, using some tracing/profiling mechanisms (causing deliberate page faults or some such), but that would have a heavy impact on performance. This is not something you can do routinely in a production setup. So... some kind of automatic "per-task memory access pattern tracing" isn't ever likely to become a metric in the job of a process scheduler :-(

    It might be nifty, for a user-space application programmer, to have an API towards the kernel, where you could tweak some per-thread and per-process knobs, to specify "jumpiness" (willingness to switch to a different core or NUMA node, wrt the time penalty of the jump) and "cuddliness" (sensitivity to distance in a cache hierarchy). Conceptually, these relationships are in fact double-ended: to use kernel-space taxonomy, they create a general graph or mesh of "task-to-task cache affinity", which typically makes sense within a task group (user space process) but with SHM in the game it can be relevant between processes too. (Per-thread CPU hungriness is a dynamic metric that could be monitored at runtime.) So if the application programmer could explicitly hint to the kernel, it might make the scheduler's job easier, while it would not require the application programmer to strictly pin a thread to a particular CPU core (and study the precise hardware organization of the machine it happens to be running on right now).

    Unfortunately, IMO, POSIX does not have such tweakable knobs. Then again, glibc on Linux does provide some Linux-specific syscalls, such as "gettid()" (and e.g. some corresponding non-portable elements in the POSIX-based threading and timing APIs), which do give you some limited direct access to the kernel-space threading model. And you have the /proc filesystem. And POSIX as a standard also undergoes some gradual development. So it is not entirely out of the question that such "jumpiness and cuddliness" hints could one day materialize, if someone among the key sorcerers finds the idea interesting enough.

    So if the scheduler did have some more fine-grained information about the threads, it would then be able to carry out its dual-plane cluster analysis in the tree of cache-coherent system busses and in the mesh of inter-thread affinities, and try to find an optimal mapping of one to the other ;-) Just kidding, this wouldn't be possible on every context switch, but such analysis could run in a dedicated kernel thread at some moderate priority... and an interesting question would be, what to do on thread startup. Have some default "new child affinity" pre-calculated for each existing thread? Have an option to specify "cuddliness" on thread creation and just use that? Lots of interesting questions :-)

    The way it is now, where the scheduler has relatively little information about actual "cuddliness" among the threads within a process, the scheduler's maintainer can tilt the heuristic this or that way, but either way he will please some users while outraging others. And sadly, corner cases, such as those people who prefer to run their CPU-hungry task on bare metal (such as a dedicated CPU core), will never be happy :-(

    Consider that multi-threaded programs can be very different. The threading can be used to achieve different goals. I myself have written heavily multithreaded networking programs, where every thread would mostly wait (sleep, blocking) for an event to appear in a queue or on a socket, process that event and sleep again. My programs typically operated on a shared tree of some internal state = a shared memory space with heavily concurrent access. But there are users who multi-thread their programs to squeeze maximum horsepower out of their multi-core hardware, for any kind of intensive number crunching. The authors of this software typically try to avoid concurrent access to memory, as that thrashes the cache and mutexes stall their desired number crunching. They go out of their way to guarantee local memory allocation within a NUMA node. Those are the miners and less demanding HPC folks. And I can imagine a hybrid piece of software, that needs to combine threads of both these categories: computer games. Which could be handled by some per-thread selective approach on part of the scheduler. And if you happen to have threads that need 100% crunching horsepower AND concurrent access to large memory, *that's* what I call "between a rock and a hard place" :-) That's where I would expect to meet some of the high-end scientific HPC folks.

    The job of a process scheduler is by no means easy.



  • aht0
    replied
    Originally posted by SystemCrasher View Post
    Highly competitive players, absolutely bent on outshooting everyone, could go as far as reducing visuals to the minimum and even disabling textures entirely, etc., to get the highest FPS possible.
    Yeah, it was done that way ages ago. Quake 2 comes to mind, and Soldier of Fortune 1/2 (before the PB anticheat in "2").

    On most modern FPSes you really can't go "under" a certain graphics minimum, though. Anti-cheat engines (client-side PunkBuster, for example) check the game files while you play. If some game file's hash does not match the list of "known good hashes", you get instantly kicked from the server. PunkBuster would also detect if you tried to lower graphics settings via the game console (like "losing that annoying grass" in Call of Duty: MW1); the latter depends on whether the admin took precautions and set the limitation up himself. But it is possible to restrict the usage of particular game settings.

    Try hacking the anti-cheat itself and you get a global permaban from the game.

    You also get autokicked when your PunkBuster client-side version mismatches, your definition files are old, or the service does not work properly for some reason. As an additional insurance policy, server admins can configure server-side PunkBuster to request in-game screenshots from the client's computer. Often enough with private hacks, PunkBuster itself does not detect the hack (it works very well against public hacks), but the admin can check the screenshots coming from the hacker's computer and see him hacking.

    Decent admins watch their server's console and react when players complain about a particular player. There are also unofficial streaming services like PBBans.com or GGC, which distribute data about known and proven cheaters (if the admin has set up the server-side PB to work with one or both) and help keep them away. For example, if someone gets caught on a PBBans-enabled server, he will in future get booted from all other PBBans-enabled servers for that particular game.

    Cheaters are a huge issue in the PC gaming scene.



  • SystemCrasher
    replied
    Originally posted by F1esDgSdUTYpm0iy View Post
    Actually, not just in shooters. It's quite common in RTS (and even MOBA) to reduce your visual settings to the absolute minimum to get the most out of your machine. On that level of gameplay anyhow.
    I guess RTS games are a bit more forgiving of stuff like this. Also, 60 FPS could mean anything: even if the GPU barely handles the average case, it is still 60 FPS. It is going to be a sad experience to learn your FPS has crashed in the middle of a rumble. In shooters it happens to be especially frustrating due to laggy movement and crappy aim. Running without VSync gives a better idea of the available margins and whether the current quality level is okay overall.



  • aht0
    replied
    Ok, yes, thanks for correcting me.



  • F1esDgSdUTYpm0iy
    replied
    Originally posted by aht0 View Post
    I think that as the FPS goes up, input lag itself decreases accordingly. And that does create the feeling that the more FPS, the better, no matter the type of sync. I played without any v-sync at all. The human eye is said not to distinguish much difference between 40 and 80 frames per second; in-game, one can feel the difference. It's all I've found for an explanation.
    I think you're misinterpreting my point, more than anything. I'll try explaining -- VSync by itself already has the secondary effect of causing input lag. I'm using a shitty PC here and, whether or not I have VSync enabled has fairly little effect on the framerate performance of the games I play. It does however have a dramatic effect on the input lag.

    That was my point -- VSync, not FPS is more often than not the actual issue of perceived responsiveness.



  • aht0
    replied
    Originally posted by F1esDgSdUTYpm0iy View Post
    More than anything, VSync is the "issue", not the exact FPS or the refresh rate of the monitor. At least, that's been my experience as a gamer. Things like FreeSync were created, amongst other reasons, because the synchronization itself is a cause of lag.
    I think that as the FPS goes up, input lag itself decreases accordingly. And that does create the feeling that the more FPS, the better, no matter the type of sync. I played without any v-sync at all. The human eye is said not to distinguish much difference between 40 and 80 frames per second; in-game, one can feel the difference. It's all I've found for an explanation.

    One sample I happened to find, from when I had the Afterburner overlay running during recording, showing FPS and CPU threads live. OS: Win 8.1 Pro x64, SAS hardware RAID 0, GTX 660 Ti, 8 GB RAM, FX-8350 CPU. Game: Battlefield 3, Team Deathmatch on a pretty small map.
    https://www.twitch.tv/shaqan/v/21069086

    If Linux with the current scheduler could repeat the core usage displayed there, it should work as-is for gaming. Cores 1-6 were dealing with the game; cores 7-8 were encoding video.



  • F1esDgSdUTYpm0iy
    replied
    Originally posted by SystemCrasher View Post
    Highly competitive players, absolutely bent on outshooting everyone, could go as far as reducing visuals to the minimum and even disabling textures entirely, etc., to get the highest FPS possible.
    Actually, not just in shooters. It's quite common in RTS (and even MOBA) to reduce your visual settings to the absolute minimum to get the most out of your machine. On that level of gameplay anyhow.



  • SystemCrasher
    replied
    Originally posted by gamerk2 View Post
    Games are actually not hugely affected by latency in most situations; remember that you have a target 16ms render time,
    Really depends! E.g. shooters often do ALL computations on a per-frame basis. This includes physics/motion/etc. Needless to say, the player wants it to be as smooth as possible to have a good aim. This makes it a good idea to run at an FPS exceeding the monitor refresh rate. While it does not yield any visual improvement, it improves the latency of the rest of the computations, allowing for improved, less jumpy aim. Highly competitive players, absolutely bent on outshooting everyone, could go as far as reducing visuals to the minimum and even disabling textures entirely, etc., to get the highest FPS possible. The most hardcore players would hate even the 1 ms of latency induced by USB. Whatever; being at all competitive on a weak machine surely takes a serious reduction of visual settings, as it's the only chance to outshoot 'em, at least sometimes.



  • SystemCrasher
    replied
    Originally posted by duby229 View Post
    Certainly SSD's have more bandwidth and lower latency than HDD's.
    They have ORDERS OF MAGNITUDE lower latency. Random seeks kill HDDs; SSDs handle them just fine.

    That's true, but regardless they are still horribly slow, in fact the slowest component of your computer by thousands of times.
    My RAM does ~17 GB/s on a quite powerful desktop. SSD reads are like 200 MB/s, so nearly a hundredfold difference, and the SSD comes with worse latency on top, with write speed worse still. But the intense writes where that could get noticeable are quite rare (otherwise you would wear the drive out really fast anyway).

    To make things more "funny"... ever seen the "Text file busy" message? When you try to overwrite a running executable, the attempt is denied with this error (ETXTBSY). The reason: the system relies on its ability to re-read pages from the file, so it can discard executable pages from RAM even if there is no swap. Should the OS face severe memory pressure, even a swap-less system can get laggy due to page discards and re-reads from a (slow!) HDD. Somehow, I'm not smartass enough to know whether there is a knob to disable this latency-inducing mechanism; I'm very curious about it, and if someone knows the answer they're really welcome. SSDs make the re-read happen orders of magnitude faster. If something runs away, the OOM killer would kick in within 1-2 seconds, keeping the associated lag spike short. Mechanical HDDs would give you some paging-related lag under these circumstances. Nowhere near as bad as with swap, but still above what most users would prefer.

    The best way to implement a highly responsive system is to reduce bloat by removing all the things you'll never use. Make sure there is free RAM and available cores
    Sure, it helps. And btw, the CPU scheduler isn't bad, so even if the CPU is busy and no free cores are available, it is hardly going to be laggy, unless someone terribly screwed things up, e.g. by running CPU-heavy processes at the highest priorities (or even RT). Unless such suicidal actions are in effect, the CPU scheduler will do the trick. RAM, on the other hand, is quite important due to the mentioned issues.

