Axboe Achieves 8M IOPS Per-Core With Newest Linux Optimization Patches


  • coder
    replied
    Originally posted by Yttrium View Post
    It needs to be said that IO is inherently DISK LIMITED.
    No, not if the data is in cache.

    For his tests, he's likely using O_DIRECT to force I/O to bypass the cache. That makes the benchmark relevant for accessing databases too big to fit in memory. So, if that's what you're doing, then the limiting factor for non-exotic storage devices is going to be the storage device itself.

    However, it's certainly possible for someone to use io_uring on slower storage, with an access pattern that exhibits a high cache hit-rate. That's a case we could actually see in things like Samba, which was one of the first to trial an io_uring backend.
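
    For a sense of what that API looks like, here is a minimal single-read sketch with liburing and O_DIRECT. It is illustrative only: the device path and 4 KiB block size are made-up stand-ins, and the record-setting runs rely on features like registered buffers and polled I/O that this sketch omits.

    Code:
    /* Minimal one-read example with liburing and O_DIRECT (sketch only).
     * Build with: gcc demo.c -luring. Path and sizes are illustrative. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        void *buf;

        /* O_DIRECT bypasses the page cache, so the device itself is measured. */
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT needs aligned buffers; 4096 covers common logical block sizes. */
        if (posix_memalign(&buf, 4096, 4096)) return 1;

        if (io_uring_queue_init(8, &ring, 0) < 0)  /* SQ/CQ with 8 entries */
            return 1;

        sqe = io_uring_get_sqe(&ring);             /* grab a submission slot */
        io_uring_prep_read(sqe, fd, buf, 4096, 0); /* async read of block 0 */
        io_uring_submit(&ring);                    /* one syscall submits it */

        io_uring_wait_cqe(&ring, &cqe);            /* reap the completion */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        free(buf);
        return 0;
    }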



  • coder
    replied
    Originally posted by MastaG View Post
    But could this also have a positive effect for a regular desktop user running a web browser and playing some games on Steam, for example?
    Let's be honest: no, probably not.

    If some of these optimizations aren't specific to io_uring, then the potential exists. However, what he's essentially doing is shaving already-tiny overheads here and there, and the only way you're going to see the effect is with extremely IOPS-intensive workloads. The I/O that games do is going to be optimized to be more sequential and less IOPS-intensive, specifically so they run well on machines with far lower-performance SSDs and without oodles of RAM for lots of caching.

    And web browsers are going to do mostly synchronous I/O when updating their cache, history, and the persistent state of web apps. The main hit I think you're seeing from them is the overhead of updating indexing structures, be they at the filesystem metadata level or within file-level databases.



  • cb88
    replied
    Originally posted by George99 View Post
    Not only the same guy but also the same hardware. Otherwise it would be pointless.
    Actually, no... he upgraded his hardware not long ago, and that was part of getting to millions of IOPS. That said, his work would probably make the original system go faster as well. He upgraded, I think, because his CPU couldn't keep up with the Optane drives, etc.



  • sdack
    replied
    Originally posted by WorBlux View Post
    Cost, latency, and cache design would like to interject some words to the contrary...
    Those are again just the worries of the old schoolers. We once had a time when every PC needed to have MS Office installed, and it had to preload itself at boot-up just so people could work on their documents faster. Now Office is a service on the Internet. You do not even know where your "files" are being saved, whether they are saved at all, or whether they are actually files or perhaps records in a database. Much of the "old school" thinking is dying, and we are seeing it on mobile phones, too, where many apps do not really quit but persist in the background.

    And since you have mentioned it: yes, we now have three levels of cache, and yet you want to hold on to the separation of storage from main memory as if it were somehow majorly important. It is not important to users, who just want to get their work done, ideally without having to worry about where and how something gets stored.

    So you may feel triggered by the idea of the "old school" models falling away, because of your worries and perhaps an inability to imagine new ways, but others will embrace it, and it is happening already.
    Last edited by sdack; 17 October 2021, 12:46 PM.



  • sdack
    replied
    Originally posted by blackshard View Post
    I understand, but it is not the idea of optimizing the API that I'm criticizing, but the numbers!
    As long as there is no serious benchmark with consistent variables, all those numbers (7M, 7.4M, 8M IOPS...) are just trash...
    I mean: I could take a 5900X and do 8M IOPS. Then I overclock the 5900X to a higher, stellar frequency and do 9M IOPS, and so I reach a new record; but what does it matter? The API/algorithm below isn't any better; I'm just throwing out a bigger useless number.
    You want to be careful with your choice of words and not shit on this effort just because you do not get what you are looking for.

    We started with switches and punch cards, then magnetic tape, until we arrived at spinning disks with a magnetic coating. All of these had a downside compared to transistor- and capacitor-based storage: they are very slow to access. Even though one can transfer hundreds of megabytes per second from a single spinning disk these days, it still takes milliseconds before a head is positioned over the right track. This makes traditional storage devices very sensitive to random access compared to sequential access.

    The block layer of the Linux kernel, and of practically all operating systems, is designed with this bottleneck in mind. It never mattered much how fast the first byte got accessed as long as the overall throughput stayed high, because huge access times simply made it pointless. Latency sometimes even gets traded away for higher throughput on purpose.

    Now things are changing, and we have storage systems that tear down this access-time bottleneck. Storage devices are no longer moving mechanical contraptions, but are manufactured with lithography processes similar to those of main memory, and we see the technologies overlap and merge. Operating systems need to catch up and adjust. This is what is happening here, and it is only the beginning.

    So calling these numbers "trash" and "useless", because you are looking for traditional benchmark numbers that you can compare to a hard drive or SSD, is just narrow-minded and insulting. What matters is that the block layer opens up to allow very fast access speeds.

    Being able to do 8 million I/O operations per second means one can transfer 8 million random blocks of e.g. 512 bytes into main memory at effectively 4 GB/sec, while main memory itself, the designated "random access memory", has a peak rate of just 25.6 GB/sec (DDR4-3200, single channel). And we are using software (OS, block layer, file system) to perform this transfer. It should make you think and let you appreciate the work, not stoop to insults.
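
    A quick back-of-envelope check of that arithmetic (a sketch; the 25.6 GB/s figure assumes a single DDR4-3200 channel):

    Code:
    /* Back-of-envelope check of the IOPS-to-bandwidth arithmetic above. */
    #include <stdio.h>

    int main(void)
    {
        double iops  = 8e6;        /* 8 million I/O operations per second */
        double block = 512.0;      /* bytes per random block              */
        double ddr4  = 25.6e9;     /* DDR4-3200 single-channel peak, B/s  */

        double bw = iops * block;  /* 4.096e9 B/s, i.e. ~4 GB/s           */
        printf("random I/O bandwidth: %.3f GB/s\n", bw / 1e9);
        printf("share of DDR4-3200 peak: %.0f%%\n", 100.0 * bw / ddr4);
        return 0;
    }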



  • WorBlux
    replied
    Originally posted by sdack View Post
    These gains are great, but this is just a precursor to a more fundamental shift in design that is coming. Samsung is already developing memory chips that combine DDR with FLASH technologies, and other manufacturers will follow with their own innovations. It is now only a matter of time until main memory becomes persistent, software no longer has to load and store data explicitly on storage devices, and all data becomes available at all times.

    I know some game designers are already desperately waiting for such a change, where data no longer has to be streamed from a drive into main memory and a game world no longer has to be cut up into sections just to fit.

    Of course, some people will hold on to the classic design, because of their worries and "old school" thinking, but when people's workflow changes and it is no longer a "load, work, save" process, and people can jump straight to the "work" part, it will cause a shift in designs. Old schoolers will still want to load and save their documents, and count files on a drive as if they were eggs in a basket.
    Cost, latency, and cache design would like to interject some words to the contrary...
    Persistent memory is unlikely on the desktop anytime soon. Certain types of specialized data processors, maybe.

    Also, do we want a world where a memory leak can fill a drive with junk?

    And if you lose power (and hence the program counter and register values), how do you find the data you want later? (Answer: you still need a file system, key-value map, or similar structure to organize things into discrete, name-referenced bags of bits, i.e. files.) Game programmers aren't going to start programming to direct memory addresses again. At best it would be a post-install optimization pass.
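
    To make that concrete: even on persistent memory, the load/store programming model still goes through a named file. A minimal sketch, assuming a DAX-capable filesystem mounted at /mnt/pmem (a hypothetical path) and a kernel with MAP_SYNC support:

    Code:
    /* Sketch: byte-addressable persistence, still organized as a file.
     * Assumes a DAX filesystem at /mnt/pmem (hypothetical mount point). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* The file name is what lets us find these bytes after a power loss. */
        int fd = open("/mnt/pmem/state.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

        /* MAP_SYNC (DAX only) maps the persistent media directly: plain CPU
         * stores hit storage, with no read()/write() round-trips. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(p, "working set == stored state");
        msync(p, 4096, MS_SYNC);   /* flush CPU caches to the media */
        munmap(p, 4096);
        close(fd);
        return 0;
    }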

    Also, even then, the virtual memory the programming model sees is a lie. Virtual memory is mapped to physical memory in chunks which are dynamically allocated.

    So no, this flat, static view of memory is old school, and there are good reasons nobody wants to do it for general-purpose computing.

    What they want instead is to avoid unnecessary data copying. Instead of disk -> memory -> GPU, they want to do disk -> GPU. Doing this DMA off persistent memory is simpler, but it in no way reduces the conceptual need to differentiate between working memory and storage.
    Last edited by WorBlux; 17 October 2021, 11:55 AM.



  • Volta
    replied
    Originally posted by blackshard View Post
    I mean: I could take a 5900X and do 8M IOPS. Then I overclock the 5900X to a higher, stellar frequency and do 9M IOPS, and so I reach a new record; but what does it matter? The API/algorithm below isn't any better; I'm just throwing out a bigger useless number.
    In this case the API/algorithm is better, so your example is plain stupid. The link below describes why IOPS are very important:

    Solution 1: This is because sequential throughput is not how most I/O activity occurs. Random reads/write operations are more representative of normal system ac
    Last edited by Volta; 17 October 2021, 11:54 AM.



  • WorBlux
    replied
    Originally posted by turboNOMAD View Post
    I think the question of MastaG is more like: do any of these optimizations even affect the regular desktop use case? E.g. when loading a game on Steam, IO concurrency (queue depth) is very low, and existing software does not explicitly use the io_uring API. So the question is: will existing desktop applications (like Steam games) see any improvement at all, or were the code paths being optimized already taking negligible time/power (in the desktop use case) before the optimizations?
    It looks like there are some changes in the block layer as well; whether that will really help largely sequential I/O is doubtful, though it may help NVMe in general.

    It's definitely more important on workstations or servers that process data streams/tables/trees.



  • blackshard
    replied
    Originally posted by Yttrium View Post

    It needs to be said that IO is inherently DISK LIMITED. No matter how many IO per core you can do, it's limited to the speed of the disk you're attached to. Jens Axboe managed to do 8M IOPS on a single core, which is amazing because you would need fewer cores and operations to serve all disk bandwidth to your customers. Also, if you look at the pull request you'll see that the changes made are very general and not architecture-specific, so every architecture would see an improvement based on its capability. Most likely this would result in bigger cores and higher frequencies seeing more improvement, but that's the case with all optimisations in general.
    I understand, but it is not the idea of optimizing the API that I'm criticizing, but the numbers!
    As long as there is no serious benchmark with consistent variables, all those numbers (7M, 7.4M, 8M IOPS...) are just trash...
    I mean: I could take a 5900X and do 8M IOPS. Then I overclock the 5900X to a higher, stellar frequency and do 9M IOPS, and so I reach a new record; but what does it matter? The API/algorithm below isn't any better; I'm just throwing out a bigger useless number.



  • sdack
    replied
    These gains are great, but this is just a precursor to a more fundamental shift in design that is coming. Samsung is already developing memory chips that combine DDR with FLASH technologies, and other manufacturers will follow with their own innovations. It is now only a matter of time until main memory becomes persistent, software no longer has to load and store data explicitly on storage devices, and all data becomes available at all times.

    I know some game designers are already desperately waiting for such a change, where data no longer has to be streamed from a drive into main memory and a game world no longer has to be cut up into sections just to fit.

    Of course, some people will hold on to the classic design, because of their worries and "old school" thinking, but when people's workflow changes and it is no longer a "load, work, save" process, and people can jump straight to the "work" part, it will cause a shift in designs. Old schoolers will still want to load and save their documents, and count files on a drive as if they were eggs in a basket.

