Axboe Achieves 8M IOPS Per-Core With Newest Linux Optimization Patches
-
Originally posted by coder: Anyway, the whole debate is happening at a silly level of abstraction. What would make it productive is if we had a specific technology or API with specific performance and functional tradeoffs vs. conventional methods. Without anything concrete, you can debate design philosophy and usage models interminably.
-
Originally posted by WorBlux: Yes, there are uses for this tech, but it's slower and less resilient than the current tech.
Another point worth considering is the addition of memory device support in CXL. This is aimed at having coherent memory pools that aren't directly connected to a single processor node. It's how I think NVDIMMs are likely to be deployed in the future, and would represent a new step in the memory hierarchy. This stands opposed to the flatland sdack is envisioning.
-
Originally posted by sdack: I just see the work by Axboe as one of these increments. When the hardware gets any faster, and it always does, io_uring might not be able to keep pace.
I don't see anyone suggesting that the conventional UNIX I/O model is the best hammer to attack all storage problems. However, neither is persistent memory the suitable solution for everything (or even most things). If we just forget about the limitations of our conventional technology and imagine Optane were everything Intel originally billed it as (and then some!), there's still the whole issue of having to perform every memory operation with transactional semantics. That's going to be a performance dealbreaker, all by itself. The impact on performance and programming complexity is surely a tradeoff your game developer friends wouldn't like, however annoyed they are at having to load stuff in from storage.
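To make that cost concrete, here's a minimal sketch (my own illustration, not anything from the posts) of the difference between a plain store to DRAM and a store that must actually be durable on x86 persistent memory, using the CLWB instruction:
Code:
/* Illustrative only: assumes an x86 CPU with CLWB support (compile with -mclwb). */
#include <immintrin.h>
#include <stdint.h>

/* Plain DRAM store: one instruction, freely cached and reordered. */
static void store_volatile(uint64_t *p, uint64_t v)
{
    *p = v;
}

/* Durable store to persistent memory: the line must be written back out
 * of the volatile CPU caches, and the flush must be ordered before any
 * later store that assumes this data has already persisted. */
static void store_durable(uint64_t *p, uint64_t v)
{
    *p = v;
    _mm_clwb(p);      /* push the cache line toward the persistence domain */
    _mm_sfence();     /* order the write-back against subsequent stores */
}
Multiply that flush-and-fence overhead by every pointer update in a real data structure (plus undo/redo logging whenever you need atomicity across more than 8 bytes), and "just map it like RAM" stops looking free.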
Originally posted by sdack: What currently gets celebrated as glorious gains with each patch set can also be seen as an attempt at catching up.
To put some numbers to it, I think Axboe said a single SSD could handle only 5.5 M IOPS. If you put 30 of them on a single 64-core Epyc, then that's just 165 M IOPS worth of SSD capacity. At 8 M IOPS per core, linear scaling would predict 512 M IOPS. Of course, server CPUs run at lower clock speeds and we know scaling won't be linear, but I also didn't count the SMT threads.
Of course, that's all very simplistic, but I think it's clear the CPU is still far ahead of storage, leaving plenty of cycles for the network stack and for userspace code to do interesting things with the data.
-
Originally posted by sdack: UNIX/Linux systems have always dominated the server market because of their persistence. No other OS could deliver the reliability, and thus the uptimes, that UNIX/Linux could.
-
Originally posted by cl333r: Can io_uring be used with inotify instead of epoll/poll?
If you're new to io_uring, these docs are a decent primer:
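As for the question itself: there's no inotify opcode in io_uring, but an inotify descriptor is an ordinary pollable fd, so you can let io_uring poll it for you instead of epoll. A minimal sketch, assuming liburing and with error handling omitted:
Code:
/* Sketch: wait for inotify events via io_uring instead of epoll.
 * Assumes liburing is installed; build with -luring. */
#include <liburing.h>
#include <sys/inotify.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int ifd = inotify_init1(IN_NONBLOCK);
    inotify_add_watch(ifd, "/tmp", IN_CREATE | IN_MODIFY);

    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    /* Arm a poll request: it completes when the inotify fd turns readable. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, ifd, POLLIN);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);   /* blocks until something happens in /tmp */
    io_uring_cqe_seen(&ring, cqe);

    char buf[4096];                   /* drain the queued inotify events */
    ssize_t n = read(ifd, buf, sizeof(buf));
    printf("got %zd bytes of inotify events\n", n);

    io_uring_queue_exit(&ring);
    close(ifd);
    return 0;
}
The poll request is one-shot, so you'd re-arm it after each completion; newer kernels also offer a multishot poll mode.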
-
Originally posted by blackshard: I understand, but it is not the idea of optimizing the API that I'm criticizing, it's the numbers!
As long as there isn't a serious benchmark with consistent variables, all those numbers (7M, 7.4M, 8M IOPS...) are just trash...
I mean: I could take a 5900X and do 8M IOPS. Then I overclock the 5900X to some stellar frequency and do 9M IOPS, and so I set a new record; but what does that matter? The API/algorithm underneath isn't any better; it's just throwing out a bigger, useless number.
Originally posted by blackshard: The numbers are not contextualized. We don't know the variables in play, so we can't say how much of the bigger number is due to io_uring optimizations and how much is due to more powerful and capable hardware.
It's all from the same guy, and he shares more info in the tweets that the articles reference.
He's only used two different systems, and the Optane storage he tested against remained the same until he saturated its controller. I don't recall the exact figure before he upgraded to the newer system; I think it might have been around 3.8M. The linked tweet is about his new record on the upgraded system, where he notes he got it to the point that a single one of these storage devices was the new bottleneck, regardless of CPU speed.
So with that goal achieved, he added a 2nd Optane disk (same model) and set out to see how much his particular CPU core could handle across both devices. We're now at 8M, and these devices handle around 5M IOPS each.
So yes, there may be a slight boost from the CPU/system upgrade, but he made steady improvements on both systems, before and after. You don't have to pay attention to the specific numbers, but the scale/ratio of improvement is worthwhile: we went from roughly 2M to 8M, a 4x improvement, and IIRC the starting point was already a big improvement over what AIO was capable of.
As for benchmarking, besides him being the only source and the hardware changing only once, he details that he uses FIO, a common disk I/O benchmarking tool (specifically the t/io_uring tool bundled with it). The linked tweet thread even has him share the command he's been using to get these results: taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n1 /dev/nvme2n1 (as I read the flags: 512-byte blocks, queue depth 128, submit/complete batches of 32, polled I/O, registered files and buffers, a single thread, pinned to core 0).
So... not that many "variables in the game"? The bulk of the reported improvements come from optimizations, very little from more powerful hardware.
Although, in order to benefit from those new records, you would need hardware of similar capability (e.g. without the Optane you'd bottleneck on the storage device considerably earlier; top NVMe products like the WD SN850 or Samsung 980 Pro peak at around 1M IOPS, and regular SATA SSDs like a Crucial MX500 at around 100k IOPS). If you don't saturate the storage device, then, like the developer, you need a CPU that can handle such a load.
In practice, most of us won't run workloads that push hardware like that... we probably don't even regularly saturate a SATA SSD's IOPS. AFAIK, it only matters when a CPU bottleneck keeps you from saturating the IOPS your device is capable of. That shouldn't be an issue for the 100k IOPS SATA SSD; less CPU usage should be required to perform the same amount of I/O, but for short bursts of random I/O you'd probably not notice: 1k I/Os issued on that 100k IOPS device would be done in 10ms, and the blip in CPU usage wouldn't be perceived. (I could be completely misunderstanding the topic here; not an expert.)
-
Originally posted by coder: Of course, that's all very simplistic, but I think it's clear the CPU is still far ahead of storage, leaving plenty of cycles for the network stack and for userspace code to do interesting things with the data.
-
Originally posted by sdack: You do not want CPUs to do the menial work of shuffling memory around. This is part of the point here. Memory is handled by memory controllers and MMUs...
Of course, someone is probably going to chime in about the new Data Streaming Accelerator engines in Sapphire Rapids. However, if your goal is to replace DRAM with nonvolatile memory, then you can't outsource all memory accesses to a separate engine; it's got to be the CPU accessing it.
Originally posted by sdack: ...when Axboe drops his Intel box for an AMD one, because the Intel box could not max out the full potential of an Intel storage device ... You did notice it, too, right?!
Originally posted by sdack: Intel will solve it not by tweaking the CPUs or relying on software, but by seeking to develop standards with the memory industry to integrate these new technologies at the hardware level without relying too much on a CPU's processing power.
They are also pushing CXL and probably helped drive the inclusion of memory devices into recent versions of the spec.
Originally posted by sdack: Consumer SSDs are now reaching 7 GB/s. They are still weak in IOPS, but as you can see, that limitation is falling quickly.