Axboe Achieves 8M IOPS Per-Core With Newest Linux Optimization Patches


  • coder
    replied
    Originally posted by BillBroadley View Post
    Not really, even a semi-nice desktop these days might well have 2 SSDs.
    I was talking about a server with ~30 drives doing ~165 M IOPS. That would be a big, expensive setup explicitly spec'd out for high-IOPS workloads. So, they'd probably have at least considered whether NVDIMMs were a viable option.



  • BillBroadley
    replied
    Originally posted by coder View Post
    Realistically, anyone doing anything like that amount of IOPS is probably going to use NVDIMMs and PMEMFILE.
    Not really, even a semi-nice desktop these days might well have 2 SSDs. For an enthusiast, having something like two 1TB WD SN850s wouldn't be unusual ($160 each). Motherboards with 2 M.2 slots aren't unusual. A pair can manage over 2M IOPS, and that's hardly the most aggressive I/O system I've seen on a high-end desktop or workstation.

    Sure, desktops rarely need 2M IOPS, but games are often written for ease of programming rather than optimal I/O. Additionally, 3D environments generate large amounts of I/O: z-buffers, objects loaded as you run/fly/drive around, on-demand textures (in multiple resolutions), etc. Sure, it might not be 10M IOPS, but having to dedicate 5% of a single core instead of 10% is a win. Doubly so if *gasp* you actually multitask while in games, maybe recording a video stream of the game or running anything else intensive. Even rather sedate games like MS Flight Sim can generate a fair bit of I/O.

    On more mobile platforms running on battery, using 5-10% less power for I/O can be a noticeable savings.
    Last edited by BillBroadley; 19 October 2021, 12:58 AM.



  • coder
    replied
    Originally posted by yump View Post
    64 cores * 3 GHz / (165 MIOP/s) is a little over 1000 CPU cycles per I/O. That doesn't sound like much to me.
    Realistically, anyone doing anything like that amount of IOPS is probably going to use NVDIMMs and PMEMFILE.

    However, if they have some reason not to, then don't forget that these numbers only accounted for a single CPU. You could scale up to more CPUs. In the future, CPUs could scale up to more cores; there's also potential clock scaling, IPC improvements, DDR5, chip stacking (AMD's V-Cache, for instance), and CPUs are continually adding tweaks like TSX or Intel's upcoming userspace interrupts, which could further optimize some otherwise-stubborn syscall overheads. So, I wouldn't worry about CPUs running out of gas anytime soon.

    And, if that's still not enough compute, CXL's recently-added support for memory devices will even enable you to scale up to more than 2 Epyc CPUs sharing a pool of nonvolatile memory.

    As a matter of fact, it's really Optane that's running out of gas! Intel's 2nd generation Optane has only managed 4 layers, while 3D NAND is now up to something like 384 layers?

    According to this, Samsung is developing 5-layer DDR5 DRAM. I don't know how the areal density of DRAM compares with 3D XPoint, but it'd be ironic if Optane even lost the density and GB/$ race to DDR5.

    Last edited by coder; 18 October 2021, 01:35 AM.



  • yump
    replied
    Originally posted by coder View Post
    To put some numbers to it, I think Axboe said the single SSD could handle only 5.5 M IOPS. If you put 30 of them on a single 64-core Epyc, then that's just 165 M IOPS worth of SSD capacity. At 8 M IOPS per core, linear scaling would predict 512 M IOPS. Of course, the server CPUs run a lower clockspeed and we know scaling won't be linear, but I also didn't count the SMT threads.

    Of course, that's all very simplistic, but I think it's clear the CPU is still far ahead of storage, leaving plenty of cycles for the network stack and for userspace code to do interesting things with the data.
    64 cores * 3 GHz / (165 MIOP/s) is a little over 1000 CPU cycles per I/O. That doesn't sound like much to me.
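
    Spelling that estimate out (assuming a flat 3 GHz across all 64 cores and ignoring SMT, as in the quoted post):

    $$ \frac{64 \ \text{cores} \times 3\times10^{9} \ \text{cycles/s}}{165\times10^{6} \ \text{IO/s}} \approx 1160 \ \text{cycles per I/O} $$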



  • Space Heater
    replied
    Originally posted by sdack View Post
    What you have is a case of whataboutism.
    It's definitely not "whataboutism".

    You clearly stated the following:
    Originally posted by sdack View Post
    UNIX/Linux systems have always dominated the server market, because of their persistency. No other OS could deliver the reliability and thus uptimes as UNIX/Linux could.
    I then gave examples of other operating systems that have historically had higher uptimes and reliability than Unix and Unix-like systems, but still lost significant market share to them. Those examples aren't "whataboutism"; they show that your theory of "persistency" being the key to winning the market is completely wrong.

    Originally posted by sdack View Post
    UNIX/Linux beat the dominance of Microsoft's operating systems, because one cannot run a reliable service when every software update requires a reboot.
    This is hilarious, you're now claiming that when you said "No other OS" you really meant only Windows.

    Originally posted by sdack View Post
    Other OSes did not manage to dominate, not because they did not offer persistency, but they lacked in other qualities, which UNIX/Linux has in addition to its persistency.
    Your attempts at backpedaling don't make your original claims any less mendacious.

    Originally posted by sdack View Post
    As you may know, UNIX has also become unpopular and it is now mostly only Linux.
    I clearly wrote "Unix and Unix-like systems", and as you may know Linux is a Unix-like operating system.



  • onlyLinuxLuvUBack
    replied
    A print-on-demand Phoronix t-shirt could be "Go Axboe or go home."



  • sdack
    replied
    Originally posted by Space Heater View Post
    IBM i, z/OS, OpenVMS, and HPE NonStop are clear examples of operating systems that typically have greater availability compared with Unix and Unix-like systems. Yet Unix and Unix-like systems still took over the market.
    What you have is a case of whataboutism. UNIX/Linux beat the dominance of Microsoft's operating systems, because one cannot run a reliable service when every software update requires a reboot. Other OSes did not manage to dominate, not because they did not offer persistency, but they lacked in other qualities, which UNIX/Linux has in addition to its persistency. As you may know, UNIX has also become unpopular and it is now mostly only Linux.



  • coder
    replied
    Originally posted by sdack View Post
    You do not want CPUs to do menial workloads of shuffling memory around. This is part of the point here. Memory is handled by memory controllers and MMUs,
    Memory controllers and MMUs do not remove CPU cores from the direct path of memory copies. DMA engines do that, like the ones PCIe devices have. When you switch from NVMe drives to NVDIMMs, you switch from DMA copies to PIO. So, it actually ties up the CPU more.

    Of course, someone is probably going to chime in about the new data-streaming accelerator engines, in Ice Lake SP or Sapphire Rapids (I forget which). However, if your goal is to replace DRAM with nonvolatile memory, then you can't outsource all memory accesses to a separate engine - it's got to be the CPU accessing it.
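
    Just to make the PIO point concrete, here's a rough sketch of my own (illustrative only; the /mnt/pmem path and file are hypothetical): app-direct access to an NVDIMM is basically an mmap of a file on a DAX mount, and the core doing the memcpy moves every byte itself, whereas a read() from an NVMe drive gets DMA'd by the device while the core does other work.

        /* Illustrative sketch only: "PIO"-style access to persistent memory.
         * Assumes a DAX-capable filesystem mounted at the hypothetical /mnt/pmem,
         * with a pre-sized file at /mnt/pmem/data. */
        #include <fcntl.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define LEN (64 * 1024)

        int main(void)
        {
            int fd = open("/mnt/pmem/data", O_RDWR);
            if (fd < 0)
                return 1;

            void *pmem = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (pmem == MAP_FAILED)
                return 1;

            char buf[4096];
            memset(buf, 0xab, sizeof(buf));

            /* The CPU core itself moves every byte -- no DMA engine involved. */
            memcpy(pmem, buf, sizeof(buf));

            /* Flush CPU caches so the data actually reaches the persistent media. */
            msync(pmem, sizeof(buf), MS_SYNC);

            munmap(pmem, LEN);
            close(fd);
            return 0;
        }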

    Originally posted by sdack View Post
    when Axboe drops his Intel box for an AMD one, because the Intel box could not max out the full potential of an Intel storage device ... You did notice it, too, right?!
    Yeah, I probably made some comment to that effect, when Intel first started shipping these drives. At the time, Rocket Lake hadn't even launched (much less Ice Lake SP or Tiger Lake H). So, the only way you could even use them @ PCIe 4.0 speeds was on AMD, POWER, or ARM CPUs.

    Originally posted by sdack View Post
    Intel will solve it not by tweaking the CPUs or relying on software, but by seeking to develop standards with the memory industry to integrate these new technologies at the hardware level without relying too much on a CPU's processing power.
    Intel is the one developing PMEMFILE, for userspace PIO access to NVDIMMs.

    They are also pushing CXL and probably helped drive the inclusion of memory devices into recent versions of the spec.

    Originally posted by sdack View Post
    Consumer SSDs are now reaching 7GB/sec, although weak in IOPS, but as you can see, this limitation is falling quickly.
    Top-end prosumer SSDs, yes. However, most consumers are slumming it with SATA or slow QLC NVMe drives.



  • sdack
    replied
    Originally posted by coder View Post
    Of course, that's all very simplistic, but I think it's clear the CPU is still far ahead of storage, leaving plenty of cycles for the network stack and for userspace code to do interesting things with the data.
    It is not actually about CPUs, even if it requires a Zen 3 CPU to push these numbers. You do not want CPUs to do menial workloads of shuffling memory around. This is part of the point here. Memory is handled by memory controllers and MMUs, and a multicore CPU like Zen 3 certainly gets its speed not from the main memory, which it has to share with all cores, but from having plenty of cache and fast cache controllers. Too much gets done in software here, obviously, and there is a certain irony when Axboe drops his Intel box for an AMD one, because the Intel box could not max out the full potential of an Intel storage device ... You did notice it, too, right?! Intel will solve it not by tweaking the CPUs or relying on software, but by seeking to develop standards with the memory industry to integrate these new technologies at the hardware level without relying too much on a CPU's processing power. This will keep it cheap while being fast and allows them to eventually sell it on the consumer market. However, SATA was clearly not fast enough, and PCIe/M.2 is not going to do it for long either. Consumer SSDs are now reaching 7GB/sec, although weak in IOPS, but as you can see, this limitation is falling quickly.



  • polarathene
    replied
    Originally posted by blackshard View Post

    I understand, but it is not the idea of optimizing the api that I'm criticizing, but the numbers!
    As long as there is not a serious benchmark with consistent variables, all those numbers (7M, 7.4M, 8M IOPS...) are just trash...
    I mean: I could take a 5900X and do 8M IOPS. Then I overclock the 5900X to a higher, stellar frequency and do 9M IOPS, and so I reach a new record; but what does it matter? The api/algorithm below isn't any better, it's just throwing out a bigger useless number.

    Originally posted by blackshard View Post
    the numbers are not contextualized. We don't know the variables in the game so we can't say how much of the bigger number is due to io_uring optimization and how much due to just more powerful and capable hardware.
    https://twitter.com/axboe/status/1443572396095676416

    It's all by the same guy; he shares more info in the tweets that the articles reference.

    He's only used two different systems, but the Optane storage he tested against remained the same until he saturated its controller. I don't recall the exact number before he upgraded to the newer system; I think it might have been around 3.8M or something. The linked tweet is about his new record on the upgraded system, where he notes he got it to the point that one of these storage devices alone was the new bottleneck, regardless of CPU speed.

    So with that milestone reached, he added a 2nd Optane disk (same model), with the goal of seeing how much he could get his particular CPU core to handle across both devices. We're now at 8M, and these devices handle around 5M IOPS each.

    So yes, there may be a slight boost from the CPU/system upgrade, but he's made steady improvements on both systems, before and after it. You don't have to pay attention to the specific numbers, but the scale/ratio of improvements is worthwhile. We went from like 2M to 8M, a 4x improvement, and that was a big improvement over what AIO was already capable of, IIRC.

    As for benchmarking: besides him being the only source, with the hardware changing only once, he details that he uses fio, a common disk I/O benchmarking tool. The linked tweet thread even has him share the command he's been using to get these results: taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n1 /dev/nvme2n1
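
    For anyone who hasn't looked at what that tool is exercising, here's a bare-bones liburing read I sketched myself purely for illustration; it skips the registered files/buffers and polled I/O that the -p1 -F1 -B1 flags above enable, so it's nowhere near record speed, but it shows the submit/complete model these patches keep optimizing (build with -luring):

        /* Minimal, unoptimized io_uring read via liburing -- illustration only. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <liburing.h>

        #define QD 32
        #define BS 512

        int main(int argc, char **argv)
        {
            static char buf[BS] __attribute__((aligned(4096)));
            struct io_uring ring;
            struct io_uring_cqe *cqe;
            int fd, ret;

            if (argc < 2) {
                fprintf(stderr, "usage: %s <block device or file>\n", argv[0]);
                return 1;
            }
            fd = open(argv[1], O_RDONLY | O_DIRECT);
            if (fd < 0) {
                perror("open");
                return 1;
            }

            /* One shared submission/completion queue pair, depth 32. */
            ret = io_uring_queue_init(QD, &ring, 0);
            if (ret < 0) {
                fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                return 1;
            }

            /* Queue one 512-byte read at offset 0... */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, BS, 0);

            io_uring_submit(&ring);               /* ...submit it in one syscall... */
            ret = io_uring_wait_cqe(&ring, &cqe); /* ...and wait for its completion. */
            if (ret < 0) {
                fprintf(stderr, "wait_cqe: %s\n", strerror(-ret));
                return 1;
            }

            printf("read returned %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);        /* mark the completion as consumed */

            io_uring_queue_exit(&ring);
            close(fd);
            return 0;
        }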

    So... not that many "variables in the game"? The bulk of the reported improvements comes from optimizations, very little from more powerful hardware.

    Although in order to benefit from those new records, you would need hardware of similar capability (e.g. without the Optane, you'd bottleneck on the storage device considerably earlier: top NVMe products like the WD SN850 or Samsung 980 Pro peak at around 1M IOPS, and regular SATA SSDs like a Crucial MX500 at around 100k IOPS). If you don't saturate the storage device, then, like the developer, you need a CPU that can handle such a load.

    In practice, most of us won't run workloads that push hardware like that... we probably don't even regularly saturate a SATA SSD's IOPS. AFAIK, it's only going to matter if you can't already saturate the IOPS capability due to a CPU bottleneck. That shouldn't be an issue for the SATA SSD at 100k IOPS; less CPU usage should be required to perform the same amount of I/O, but for short bursts of random I/O you'd probably not notice... 1k I/Os issued on that 100k IOPS device would be done in 10ms? The blip in CPU usage wouldn't be perceived. (I could be completely misunderstanding the topic here, not an expert.)
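
    Checking my own 10 ms figure (assuming the drive really sustains its 100k IOPS rating for that burst):

    $$ \frac{1000 \ \text{I/Os}}{100{,}000 \ \text{IO/s}} = 0.01 \ \text{s} = 10 \ \text{ms} $$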

