Announcement

**geearf** · 20 January 2024, 02:22 PM

Originally posted by S.Pam View Post

It seems many like Kyber. Do you know if it supports io weights like BFQ?

I don't know anything about it sorry.
Now I was told by a CachyOS dev that there is a regression in 6.7, and 6.6, where none is better for nvme but once it'll be fixed Kyber will be it again.

**ahrs** · 20 January 2024, 11:44 PM

Originally posted by fitzie View Post

i think something like this went unnoticed for so long because people will just use noop for scheduling on nvme. I use mq-deadline on my sata ssds, and bfq on my spinning rust.

I still use none on my nvme drives:

HTML Code:

cat /sys/block/nvme*/queue/scheduler
[none] mq-deadline kyber bfq
[none] mq-deadline kyber bfq

I don't know what the advantage of an I/O scheduler is on fast nvme drives unless you're running many mixed I/O workloads at once. If I did use a scheduler I'd probably also go with Kyber like Pop OS.

**sobrus** · 03 February 2024, 02:56 PM

It's no secret that mq-deadline doesn't scale very well - it was originally done as a proof-of-concept conversion from deadline, when the blk-mq multiqueue layer was written.

I wonder why so widely used scheduler has been not optimized for so many years? It's no secret that linux I/O schedulers sucks? great news!

Apart from that, maybe someone will know. Do we need a scheduler for NVMe? There are two contradicting concepts:
a) nvme is multiqueue, supporting 65536 queues in hardware, etc, it doesn't need any scheduler,
b) while nvme is fast, during heavy load,a task can be starved by other I/O heavy tasks, so we should use something like deadline or kyber for better latency/responsiveness.

Which one is true?

**S.Pam** · 09 February 2024, 10:43 AM

Originally posted by sobrus View Post

Do we need a scheduler for NVMe? There are two contradicting concepts:
a) nvme is multiqueue, supporting 65536 queues in hardware, etc, it doesn't need any scheduler,
b) while nvme is fast, during heavy load,a task can be starved by other I/O heavy tasks, so we should use something like deadline or kyber for better latency/responsiveness.

Which one is true?

Let's break this down in parts:

* What is a queue?
A queue is a buffer to hold some amount of data. The simplest queues are FIFO - First In First Out.

* Why are queues needed?
Queues are needed when the target cannot immediately accept a request. Without the queue, there would be added latency overhead in waiting for and handling each request, leading to poor performance.

* What is a multiqueue, MQ?
Multiple queues are parallel, independent queues. Usually this is good for avoiding lock contention when only one thing at the time can add/relive things to the queue.

* What is queue depth?
The queue depth is how many independent positions there are in the queue.

* Where are the queues located?
The queues, both single and MQ are stored in system RAM. The nvme controller can directly access all the queues over the PCI bus. With SATA drives supporting NCQ, the queue is on the SATA device itself.

The nvme standard supports up to 65536 parallel queues, each 65536 requests deep. In reality, the nvme controller has limited amount of processing cores and the flash media may also be limited in how many parallel requests it can handle.

Once the OS has placed data on the nvme queue, there is limited possibility to rearrange and prioritise each queue and request within them.

So how can we answer the question if an IO scheduler is needed?

First, an MQ scheduler can help avoid contention by letting individual CPU cores use different queues.

Secondly, as the queues fill up, it means the drive or PCI bus has a bottle neck. When this happens, the latency for each IO request increases because the nvme controller has to process all requests before it on the queue first. Because the OS cannot manage data inside the queues, an IO scheduler must be used to prioritise data before submitting it to the nvme queue. The taget might be to improve latency or throughput for different applications. In these cases, it is better to have shallow nvme queue and manage the requests via the scheduler instead of letting the nvme controller do this.

An analogy is the bufferbloat concept in networking. Many broadband routers employ large send/receive buffers to improve bandwidth utilisation, but this also greatly increases latency when the the link is full.

Nvme drives are fast, very fast even, compared to traditional SATA bus based media. The case for large buffers and advanced queuing algorithms is simply not as strong. That said, we also have very fast CPUs and in some cases we do still want to control latency vs throughput in software rather than letting the nvme controller manage it.

**sobrus** · 09 February 2024, 11:06 AM

Thank you for thorough explanation, this makes very much sense, and is perfectly in-line with my findings.
Not only scheduler is more flexible but also cpu overhead is really rather negligible, at least with my fastest nvme drive (sx850x)
Apps like cp hit 100% cpu at around 3GiB/s - way before scheduler makes any impact.

So I will revert back to deadline scheduler (which I've been using for many many years) for all my drives from hdd to nvme.

And if somebody wants to tune I/O speed, the easiest way is to look at read_ahead_kb and nr_requests. Increasing these two params can do wonders in certain workloads (sometimes at the cost of latency). My other nvme drive (MP510) can reach real-life full sequential speed with certain apps only with read_ahead_kb set to 1MB or more (which is like 3x improvement over 64kB despite its advertised 600k read IOPS speed).
Nr_requests is probably more usefull with deadline and hdds to rearrange requests in more linear fashion, but can be tried nevertheless.

**S.Pam** · 09 February 2024, 11:14 AM

It is worth noting that only the BFQ and CFQ schedulers supports IO bandwidth and latency priority via cgroups io.* files. This can help with stuttering in desktops environments.

**sobrus** · 09 February 2024, 11:27 AM

I wasn't aware if it, thanks!

Announcement

MQ-Deadline Scheduler Optimized For Much Better Scalability

Comment

Comment

Comment

Comment

Comment

Comment

Comment