Zone Write Plugging Comes To Linux 6.10 For Better Performance

    Phoronix: Zone Write Plugging Comes To Linux 6.10 For Better Performance

    Along with the IO_uring improvements for Linux 6.10, the block subsystem changes have also been merged for this new kernel version...

  • #2
    Cool, these kinds of improvements will pave the way for future NVMe performance. I have high hopes for Ceph’s SeaStore with ZNS.

    • #3
      The zoned I/O to NVMe bit reminds me of something I wondered about in the past: dirty write-back page "zones" on homogeneous NVMe devices, where lots of user-defined "logical zones" of write-pending page buffers are scattered across RAM and across mapped regions of the NVMe drives.

      In the past I thought it would be nice to control the heuristics for when write-eligible dirty page ranges are actually written out, based on two criteria: a dirtiness-age timeout (i.e., flush anything that has been pending write for longer than NN seconds) and a contiguous-size "quantum" threshold. The idea is that once a group of contiguous dirty pages exceeds some NN MB value, it becomes most desirable to write the oldest such "zones" of >= NN MB out, since that is a near-optimal write size for the underlying NVMe storage to handle efficiently as a group. That helps NVMe performance and lifetime, avoids unneeded CPU/kernel activity, and reduces excess medium wear by not issuing I/O blocks smaller than the minimally desirable size for the drive's characteristics.

      In userspace one seemed to have only relatively basic control via msync / madvise, and at the kernel level there were some I/O scheduler / elevator tunables affecting write-back policy, frequency, and prioritization. But IIRC nothing would simply try to optimize write-back into the largest possible block-I/O write sizes, subject to a timeout and to avoiding too much memory pressure from a large backlog of dirty unwritten I/O page buffers.

      It seemed like one either had to write one's own kernel I/O scheduler or wrestle with somewhat crude knobs to make the existing ones approximate the desired heuristic of which scattered "block write zones" get written when, according to age / I/O-size criteria. A rough userspace sketch of that heuristic is below.
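
      Here is a minimal sketch of that policy, assuming sync_file_range(2) as the per-range flush primitive. The dirty_extent tracking, FLUSH_QUANTUM, and FLUSH_AGE_SECS names are hypothetical illustrations, not an existing kernel interface, and real dirty-extent discovery would need something like /proc/self/pagemap or filesystem cooperation:

      ```c
      /* Hypothetical sketch, not a real kernel interface: track a dirty
       * extent in userspace and kick write-back for exactly that range
       * with sync_file_range(2) when either an age timeout or a
       * contiguous-size "quantum" is exceeded. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <unistd.h>

      #define FLUSH_QUANTUM  (16L * 1024 * 1024) /* assumed "ideal" NVMe write size */
      #define FLUSH_AGE_SECS 30                  /* flush anything dirty longer than this */

      struct dirty_extent {
          off_t  start;         /* byte offset of the contiguous dirty run */
          off_t  len;           /* length of the run; 0 = nothing tracked */
          time_t first_dirtied; /* when the oldest page in the run went dirty */
      };

      /* The two-rule policy: size quantum first, then age timeout.
       * SYNC_FILE_RANGE_WRITE starts write-out of just this range without
       * blocking on completion, so the policy decision stays in userspace. */
      static void maybe_flush(int fd, struct dirty_extent *ext, time_t now)
      {
          if (ext->len == 0)
              return;
          if (ext->len >= FLUSH_QUANTUM ||
              now - ext->first_dirtied >= FLUSH_AGE_SECS) {
              if (sync_file_range(fd, ext->start, ext->len,
                                  SYNC_FILE_RANGE_WRITE) == 0)
                  ext->len = 0; /* range handed to write-back; reset tracking */
              else
                  perror("sync_file_range");
          }
      }

      int main(void)
      {
          int fd = open("scratch.dat", O_RDWR | O_CREAT, 0644);
          if (fd < 0) { perror("open"); return 1; }

          /* Dirty one 16 MiB contiguous region through the page cache. */
          char *buf = calloc(1, FLUSH_QUANTUM);
          struct dirty_extent ext = { .start = 0, .first_dirtied = time(NULL) };
          if (buf && pwrite(fd, buf, FLUSH_QUANTUM, 0) == FLUSH_QUANTUM)
              ext.len = FLUSH_QUANTUM;

          /* The size rule trips immediately here; a smaller write would
           * instead sit until the age rule fires. */
          maybe_flush(fd, &ext, time(NULL));

          free(buf);
          close(fd);
          return 0;
      }
      ```

      Note that vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs already cover the age axis, but only globally; it is the per-range size-quantum rule that has no direct kernel counterpart, which is exactly the gap described above.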
