
Axboe Achieves 8M IOPS Per-Core With Newest Linux Optimization Patches


  • coder
    replied
    Originally posted by cl333r View Post
    Can io_uring be used with inotify instead of e/poll?
    I don't understand the question. io_uring has its own syscall, io_uring_enter(), for blocking on the completion queue. liburing provides C wrapper functions for this, such as io_uring_wait_cqe() or io_uring_peek_cqe().
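    For illustration, here's a minimal sketch (my own, untested, error handling trimmed) of submitting a request and blocking on its completion with liburing; the no-op request is just a placeholder for a real read or write:

        /* build with: gcc wait_cqe.c -luring */
        #include <liburing.h>
        #include <stdio.h>

        int main(void)
        {
            struct io_uring ring;
            struct io_uring_cqe *cqe;

            if (io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

            /* Queue a no-op request just to have something to wait on. */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_nop(sqe);
            io_uring_submit(&ring);

            /* Block until a completion arrives (wraps io_uring_enter(2)). */
            if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                printf("completion result: %d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);
            }

            io_uring_queue_exit(&ring);
            return 0;
        }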

    If you're new to io_uring, these docs are a decent primer:



  • cl333r
    replied
    Can io_uring be used with inotify instead of e/poll?



  • Space Heater
    replied
    Originally posted by sdack View Post
    UNIX/Linux systems have always dominated the server market because of their persistency. No other OS could deliver the reliability, and thus the uptimes, that UNIX/Linux could.
    IBM i, z/OS, OpenVMS, and HPE NonStop are clear examples of operating systems that typically offer greater availability than Unix and Unix-like systems. Yet Unix and Unix-like systems still took over the market.



  • coder
    replied
    Originally posted by sdack View Post
    I just see the work by Axboe as one of these increments. When the hardware gets any faster, and it always does, io_uring might not be able to keep pace.
    PMEMFILE has been in progress for about 4 years, to enable direct userspace access to persistent memory. That's how long it's been documented on Phoronix, at least.

    I don't see anyone suggesting that the conventional UNIX I/O model is the best hammer for every storage problem. However, neither is persistent memory a suitable solution for everything (or even most things). Even if we forget about the limitations of the current technology and imagine Optane were everything Intel originally billed it as (and then some!), there's still the issue of having to perform every memory operation with transactional semantics. That's going to be a performance dealbreaker all by itself. The impact on performance and programming complexity is surely a tradeoff your game developer friends wouldn't like, however annoyed they are at having to load stuff in from storage.
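    To make the cost concrete, here's an illustrative x86 sketch (my own hypothetical helper, not from any linked article) of just the ordering half of that problem: every store to persistent memory needs an explicit cache-line write-back and a fence before later dependent stores, overhead that volatile RAM never pays:

        /* Hypothetical helper; needs a CPU with CLWB, compile with -mclwb. */
        #include <immintrin.h>
        #include <stdint.h>

        static void persist_store(uint64_t *slot, uint64_t value)
        {
            *slot = value;   /* the store only reaches the CPU cache     */
            _mm_clwb(slot);  /* write the cache line back to the media   */
            _mm_sfence();    /* order it ahead of later dependent stores */
        }

    A real transactional update also needs undo or redo logging on top of this, so the per-operation cost only goes up from here.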

    Originally posted by sdack View Post
    What currently gets celebrated as glorious gains with each patch set can also be seen as an attempt at catching up.
    Not really, because this is just one core and the SSD is new technology that's far-and-away faster than any other NVMe drive. If you scale this up across an entire server CPU, then it'd have no trouble saturating as many SSDs as you could plausibly connect to it.

    To put some numbers to it, I think Axboe said a single SSD could handle only 5.5 M IOPS. If you put 30 of them on a single 64-core Epyc, that's just 165 M IOPS worth of SSD capacity. At 8 M IOPS per core, linear scaling would predict 512 M IOPS. Of course, the server CPUs run at lower clock speeds and we know scaling won't be linear, but I also didn't count the SMT threads.
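    A trivial back-of-envelope check of those figures (the per-SSD and per-core numbers are the claims above; the rest is arithmetic):

        #include <stdio.h>

        int main(void)
        {
            const double ssd_iops  = 5.5e6; /* claimed per Optane SSD    */
            const double core_iops = 8.0e6; /* per core, per the article */
            const int ssds  = 30;
            const int cores = 64;           /* e.g. one 64-core Epyc     */

            printf("SSD capacity: %.0f M IOPS\n", ssds * ssd_iops / 1e6);
            printf("CPU (linear): %.0f M IOPS\n", cores * core_iops / 1e6);
            return 0;   /* prints 165 and 512 */
        }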

    Of course, that's all very simplistic, but I think it's clear the CPU is still far ahead of storage, leaving plenty of cycles for the network stack and for userspace code to do interesting things with the data.
    Last edited by coder; 17 October 2021, 03:21 PM.



  • coder
    replied
    Originally posted by WorBlux View Post
    Yes, there are uses for this tech, but it's slower and less resilient than the current tech.
    One thing I also forgot to point out is that NAND flash needs to be used in fairly large blocks. One of Intel's selling points for Optane was supposedly that you have direct bit-level read/write access.

    Another point worth considering is the addition of memory device support in CXL. This is aimed at having coherent memory pools that aren't directly connected to a single processor node. It's how I think NVDIMMs are likely to be deployed in the future, and would represent a new step in the memory hierarchy. This stands opposed to the flatland sdack is envisioning.



  • sdack
    replied
    Originally posted by coder View Post
    Anyway, the whole debate is happening at a silly level of abstraction. What would make it productive is if we had a specific technology or API with specific performance and functional tradeoffs vs. conventional methods. Without anything concrete, you can debate design philosophy and usage models interminably.
    I do not think of it as silly. If someone were to reimplement an entire OS and its applications just to give people full persistency right now, it would at best end like systemd did: people hated it because of the harshness of the change. I rather see it as a gradual development, and the technology, too, is only advancing in small increments. I just see the work by Axboe as one of these increments. When the hardware gets any faster, and it always does, io_uring might not be able to keep pace. What currently gets celebrated as glorious gains with each patch set can also be seen as an attempt at catching up. Once the hardware gets faster than the software can process, additional hardware will be needed to make use of it.
    Last edited by sdack; 17 October 2021, 02:55 PM.



  • WorBlux
    replied
    Originally posted by coder View Post
    sdack is referring to the way they're unloaded by the OS, and then resumed on demand. That's not something every *nix or any non-recent version of Windows did. And I don't consider it equivalent to swapping, because it's more sophisticated than that.
    That would make more sense in this context.



  • WorBlux
    replied
    Originally posted by sdack View Post
    You are just not seeing the forest for all the trees in it. UNIX/Linux systems have always dominated the server market because of their persistency. No other OS could deliver the reliability, and thus the uptimes, that UNIX/Linux could. Despite the fact that all data in main memory is lost, we used every trick to achieve persistency, to keep the server up and running for as long as possible, and to provide people with near-100% reliability and 24/7 operations. Now the industry is developing memory systems that hold their data until it is overwritten. You think of it as a problem because of the way we currently do things. It is not. Persistency has always been the goal, and the hardware and software are adjusting to it.
    Yes, there are uses for this tech, but it's slower and less resilient than the current tech. Unless persistent memory gets to be both the cheapest and the fastest option (and durable enough), I don't see anyone bothering, except for applications where you are running a known, fixed workload and the development overhead needed to work directly with the hardware is worth the performance/reliability boost.

    We go into the glorious complexity of Intel Optane DIMMs just as Micron exits the 3D XPoint business for CXL persistent memory


    So currently, for an application to use persistent memory, it either needs to be PMEM-aware, or the PMEM is mapped to the LBAs of a fictional SSD.
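    For a sense of what "PMEM-aware" means in practice, here's a minimal sketch using PMDK's libpmem (the file path is hypothetical, error handling trimmed): the application maps a file on a DAX filesystem directly into its address space and must flush its own stores to the persistence domain:

        /* build with: gcc pmem_hello.c -lpmem */
        #include <libpmem.h>
        #include <string.h>

        int main(void)
        {
            size_t mapped_len;
            int is_pmem;

            /* Map the file directly: no page cache, no block layer. */
            char *addr = pmem_map_file("/mnt/pmem/example", 4096,
                                       PMEM_FILE_CREATE, 0666,
                                       &mapped_len, &is_pmem);
            if (addr == NULL)
                return 1;

            strcpy(addr, "hello, persistent world");

            /* Stores land in the CPU caches; the application must
             * explicitly flush them to the persistence domain. */
            if (is_pmem)
                pmem_persist(addr, mapped_len);  /* cache flush + fence  */
            else
                pmem_msync(addr, mapped_len);    /* fall back to msync(2) */

            pmem_unmap(addr, mapped_len);
            return 0;
        }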

    And no current OS is going to go back to using physical rather than virtual memory. Furthermore, the physical memory model is not well suited to a dynamic multi-tenancy environment like the desktop or workstation.

    Also, the big-iron Unix boxes were often priced in the six digits. You paid a lot for the hardware, and in the end everyone moved to commodity hardware and pushed fault handling to new software models.



  • coder
    replied
    Originally posted by blackshard View Post
    the numbers are not contextualized. We don't know the variables in play, so we can't say how much of the bigger number is due to io_uring optimization and how much is due to just more powerful and capable hardware.
    You have to follow the links and Axboe's Twitter to get more details. It's unfortunate that Michael didn't do that work for us in one of the earlier articles. Then he could at least link to it, and we'd see more of the details in one place.

    Still, he's not wrong to be trumpeting Axboe's progress, IMO. It's good times for those doing IOPS-heavy stuff with Linux!



  • blackshard
    replied
    Originally posted by sdack View Post
    You want to be careful with your choice of words and not shit on this effort just because you do not get what you are looking for.

    ...cut...

    Being able to do 8 million I/O operations per second means one can transfer 8 million random blocks of, e.g., 512 bytes into main memory at effectively 4GB/sec, while main memory, the designated "random access memory", has a peak rate of just 25GB/sec (e.g. DDR4-3200). And we are using software (OS, block layer, file system) to perform this transfer. It should make you think and allow you to appreciate the work, and not stoop to insults.
    Probably I didn't explain myself very well, or probably you are just pompous enough to think I'm the first noob passing around here who doesn't care about block-device overhead or the development of a ring-buffer based kernel API like io_uring. Plus, I'm not insulting anyone and never had any intention to.

    Still, I have to clarify: every day this site publishes a new article about new records from the io_uring API. I find these articles lacking for the reasons I explained: the numbers are not contextualized. We don't know the variables in play, so we can't say how much of the bigger number is due to io_uring optimization and how much is due to just more powerful and capable hardware.

