With that said, the continuous mouse movement that you observed while playing music on Windows can likely be attributed to having the display server in the kernel. Things that live in the kernel are always resident, which insulates them from the effects of disk thrashing. Windows put all of the components required to draw the cursor on your screen into the NT kernel, which is why it performed so well. In the case of Linux, the display server and compositor live in userland, so the kernel can page them out to a swap device as it pleases. This is likely why you experience lag. Some people think that changing the CPU scheduler will help, but the effect that a CPU scheduler has on this is fairly chaotic in nature. The proper way of handling this would be to use a more intelligent page replacement algorithm.
ZFS has ARC, which tends to handle this more gracefully. Unfortunately, there are some outstanding memory management issues that prevent it from handling this as well as it could. Specifically, the Linux kernel's virtual memory support is awful. It lacks support for slab-based allocations, it serializes all allocations behind a single lock, and it does not obey GFP flags. LLNL wrote a compatibility shim that attempts to handle this, but it is far from ideal. In addition, data mapped with mmap() is currently double-cached between the page cache and ZFS ARC, which can create churn as the kernel evicts pages required for your display server only to load them back from ARC. Admittedly, this is better than going to disk, but it still degrades performance.
Using BFS should be fine for desktop users, as they generally have at most a 4-core + 4-HTT system. Servers have more than 8 cores, and they need the throughput and scalability of CFS.
Desktop users generally need responsiveness.
Perhaps another expert will pick up this idea? Because the Linux scheduler is broken! What would you say, for example, about a security concept that provides security for only 90 percent of users?
However the main reason for developing the upgradeable rwlocks was not just to
create more critical sections that other CPUs can have read access to. Ultimately
I had a pipe dream that it could be used to create multiple runqueues as you
have done in your patch. However, what I didn't want to do was to create a
multi runqueue design that then needed a load balancer as that took away one
of the advantages of BFS needing no balancer and keeping latency as low as
possible.
I've not ever put a post up about what my solution was to this problem because
the logistics of actually creating it, and the work required kept putting me
off since it would require many hours, and I really hate to push vapourware.
Code speaks louder than rhetoric. However since you are headed down creating
multi runqueue code, perhaps you might want to consider it.
What I had in mind was to create varying numbers of runqueues in a
hierarchical fashion. Whenever possible, the global runqueue could be grabbed
in order to find the best possible task to schedule on that CPU from the entire
pool. If there was contention however on the global runqueue, it could step
down in the hierarchy and just grab a runqueue effective for a numa node and
schedule the best task from that. If there was contention on that it could
step down and schedule the best task from a physical package, and then shared
cache, then shared threads, and only if all that failed would it just grab a
local CPU runqueue. The reason for doing this is it would create a load
balancer by sheer virtue of the locking mechanism itself rather than there
actually being a load balancer at all, thereby benefiting from the BFS approach
in terms of minimising latency, finding the best global task, not requiring a
load balancer, and at the same time benefit from having multiple runqueues to
avoid lock contention - and in fact use that lock contention as a means to an
end.
Alas to implement it myself I'd have to be employed full time for months
working on just this to get it working...
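
To make the step-down idea in the quoted mail a bit more concrete, here is a rough user-space sketch in Python. It is not BFS code and not anything from the actual patch. The names `RunQueue` and `pick_next_task` are made up for illustration, and it assumes each level of the hierarchy keeps its own list of (deadline, task) pairs, which the mail does not specify. The only point is the locking pattern: try the widest runqueue without blocking, and fall to a narrower one whenever the lock is contended.

```python
import threading


class RunQueue:
    """Hypothetical runqueue: a lock plus a list of (deadline, task_name)."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.tasks = []

    def pop_best(self):
        """Remove and return the earliest-deadline task, or None if empty.

        The caller must hold self.lock.
        """
        if not self.tasks:
            return None
        best = min(self.tasks)
        self.tasks.remove(best)
        return best


def pick_next_task(levels):
    """Pick a task from the widest uncontended runqueue.

    `levels` is ordered widest first, for example
    [global, numa_node, package, shared_cache, smt_siblings, local_cpu].
    Returns (runqueue_name, (deadline, task_name)), or None if nothing is runnable.
    """
    *shared, local = levels

    for rq in shared:
        # Non-blocking attempt: contention means "step down", not "wait".
        if not rq.lock.acquire(blocking=False):
            continue
        try:
            task = rq.pop_best()
        finally:
            rq.lock.release()
        if task is not None:
            return rq.name, task

    # Last resort: the local CPU runqueue, where blocking is acceptable
    # (assumption: it is rarely contended).
    with local.lock:
        task = local.pop_best()
    return (local.name, task) if task is not None else None
```

With nothing else holding a lock, a call always takes the best task from the widest non-empty level. If another CPU holds the global lock, the same call falls through to the node-level queue instead of spinning, so the contention itself spreads the work out, which is the "load balancer by virtue of the locking" the mail describes. How tasks would be placed into the different levels in the first place is left open here, as it is in the mail.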
Last edited by ulenrich; 03-11-2013 at 10:03 AM.