Axboe Achieves 8M IOPS Per-Core With Newest Linux Optimization Patches


  • #41
    Originally posted by coder View Post
    sdack is referring to the way they're unloaded by the OS, and then resumed on demand. That's not something every *nix or any non-recent version of Windows did. And I don't consider it equivalent to swapping, because it's more sophisticated than that.
    That would make more sense in this context.



    • #42
      Originally posted by coder View Post
      Anyway, the whole debate is happening at a silly level of abstraction. What would make it productive is if we had a specific technology or API with specific performance and functional tradeoffs vs. conventional methods. Without anything concrete, you can debate design philosophy and usage models interminably.
      I do not think of it as silly. If someone were to reimplement an entire OS and its applications just to give people full persistency right now, it would at best end like systemd did - people hated it because of the harshness of the change. I rather see it as a gradual development, and the technology is also only advancing in small increments. I just see the work by Axboe as one of these increments. When the hardware gets any faster, and it always does, io_uring might not be able to hold the pace. What gets currently celebrated as glorious gains with each patch set can also be seen as an attempt at catching up. Once it gets any faster than the software can process, additional hardware will be needed to make use of it.
      Last edited by sdack; 17 October 2021, 02:55 PM.



      • #43
        Originally posted by WorBlux View Post
        Yes, there are uses for this tech, but it's slower and less resilient than the current tech.
        One thing I also forgot to point out is that NAND flash needs to be used in fairly large blocks. One of Intel's selling points for Optane was supposedly that you have direct bit-level read/write access.

        Another point worth considering is the addition of memory device support in CXL. This is aimed at having coherent memory pools that aren't directly connected to a single processor node. It's how I think NVDIMMs are likely to be deployed in the future, and would represent a new step in the memory hierarchy. This stands opposed to the flatland sdack is envisioning.



        • #44
          Originally posted by sdack View Post
          I just see the work by Axboe as one of these increments. When the hardware gets any faster, and it always does, io_uring might not be able to hold the pace.
          PMEMFILE has been in progress for about 4 years, to enable direct userspace access to persistent memory. That's how long it's been documented on Phoronix, at least.

          I don't see anyone suggesting that the conventional UNIX I/O model is the best hammer to attack all storage problems. However, neither is persistent memory the suitable solution for everything (or even most things). If we just forget about the limitations of our conventional technology and imagine Optane were everything Intel originally billed it as (and then some!), there's the whole issue of having to perform every memory operation with transactional semantics. That's going to be a performance dealbreaker, all by itself. The impact on performance and programming complexity is surely a tradeoff your game developer friends wouldn't like, however annoyed they are with having to load stuff in from storage.
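          To make that concrete, here's a minimal sketch - my own illustration, not PMEMFILE or any real pmem library, and the file name and undo-log layout are made up - of what giving a single store transactional semantics tends to look like. One logical write turns into several ordered writes and persistence barriers:

          /* Illustration only: an undo-logged update to a mapped file. A real pmem
           * library would use CLWB + SFENCE (or pmem_persist()); msync() stands in
           * for the persistence barrier here. */
          #include <fcntl.h>
          #include <stdint.h>
          #include <sys/mman.h>
          #include <unistd.h>

          struct record {
              uint64_t undo_copy;   /* saved old value                */
              uint64_t log_valid;   /* 1 while the undo entry is live */
              uint64_t value;       /* the "real" data                */
          };

          static void persist(struct record *r)
          {
              /* Flush the (page-aligned) record to stable media. */
              msync(r, sizeof(*r), MS_SYNC);
          }

          static void update_value(struct record *r, uint64_t new_value)
          {
              r->undo_copy = r->value;      /* 1. log the old value...         */
              r->log_valid = 1;
              persist(r);                   /*    ...and make the log durable  */

              r->value = new_value;         /* 2. only now touch the live data */
              persist(r);

              r->log_valid = 0;             /* 3. retire the log entry         */
              persist(r);                   /* three barriers for one store    */
          }

          int main(void)
          {
              int fd = open("undo_demo.bin", O_RDWR | O_CREAT, 0600);  /* made-up file */
              if (fd < 0 || ftruncate(fd, 4096) < 0)
                  return 1;
              struct record *r = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
              if (r == MAP_FAILED)
                  return 1;
              update_value(r, 42);
              munmap(r, 4096);
              close(fd);
              return 0;
          }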

          Originally posted by sdack View Post
          What gets currently celebrated as glorious gains with each patch set can also be seen as an attempt at catching up.
          Not really, because this is just one core and the SSD is new technology that's far-and-away faster than any other NVMe drive. If you scale this up across an entire server CPU, then it'd have no trouble saturating as many SSDs as you could plausibly connect to it.

          To put some numbers to it, I think Axboe said the single SSD could handle only 5.5 M IOPS. If you put 30 of them on a single 64-core Epyc, then that's just 165 M IOPS worth of SSD capacity. At 8 M IOPS per core, linear scaling would predict 512 M IOPS. Of course, the server CPUs run at a lower clockspeed and we know scaling won't be linear, but I also didn't count the SMT threads.
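          Spelling out that back-of-envelope math (same rough figures as above, nothing more authoritative than that):

          #include <stdio.h>

          int main(void)
          {
              /* Rough figures quoted in this thread, not measurements of mine. */
              double iops_per_drive = 5.5e6;   /* one of the new Optane SSDs  */
              double iops_per_core  = 8.0e6;   /* Axboe's single-core peak    */
              int drives = 30, cores = 64;

              printf("SSD side: %.0fM IOPS\n", drives * iops_per_drive / 1e6);  /* 165M */
              printf("CPU side: %.0fM IOPS (naive linear scaling)\n",
                     cores * iops_per_core / 1e6);                              /* 512M */
              return 0;
          }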

          Of course, that's all very simplistic, but I think it's clear the CPU is still far ahead of storage, leaving plenty of cycles for the network stack and for userspace code to do interesting things with the data.
          Last edited by coder; 17 October 2021, 03:21 PM.



          • #45
            Originally posted by sdack View Post
            UNIX/Linux systems have always dominated the server market because of their persistency. No other OS could deliver the reliability, and thus the uptimes, that UNIX/Linux could.
            IBM i, z/OS, OpenVMS, and HPE NonStop are clear examples of operating systems that typically have greater availability compared with Unix and Unix-like systems. Yet Unix and Unix-like systems still took over the market.



            • #46
              Can io_uring be used with inotify instead of e/poll?



              • #47
                Originally posted by cl333r View Post
                Can io_uring be used with inotify instead of e/poll?
                I don't understand the question. io_uring has its own system call for blocking on the completion queue. liburing provides C wrapper functions for this, such as io_uring_wait_cqe() or io_uring_peek_cqe().
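                For illustration, here's a minimal liburing sketch - my own, assuming liburing is installed and you link with -luring, and the file to read comes from argv - that submits one read and then blocks on the completion queue with io_uring_wait_cqe():

                #include <liburing.h>
                #include <fcntl.h>
                #include <stdio.h>
                #include <unistd.h>

                int main(int argc, char **argv)
                {
                    if (argc < 2)
                        return 1;
                    int fd = open(argv[1], O_RDONLY);
                    if (fd < 0)
                        return 1;

                    struct io_uring ring;
                    if (io_uring_queue_init(8, &ring, 0) < 0)
                        return 1;

                    /* Queue one 4 KiB read at offset 0 and submit it. */
                    static char buf[4096];
                    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
                    io_uring_submit(&ring);

                    /* Block until the completion arrives, then mark it consumed. */
                    struct io_uring_cqe *cqe;
                    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                        printf("read returned %d\n", cqe->res);   /* negative = -errno */
                        io_uring_cqe_seen(&ring, cqe);
                    }

                    io_uring_queue_exit(&ring);
                    close(fd);
                    return 0;
                }

                io_uring_peek_cqe() is the non-blocking variant of that wait, if you'd rather poll the completion queue yourself.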

                If you're new to io_uring, these docs are a decent primer:



                • #48
                  Originally posted by blackshard View Post

                  I understand, but it is not the idea of optimizing the API that I'm criticizing, but the numbers!
                  As long as there is not a serious benchmark with consistent variables, all those numbers (7M, 7.4M, 8M IOPS...) are just trash...
                  I mean: I could take a 5900X and do 8M IOPS. Then I overclock the 5900X to a higher stellar frequency and do 9M IOPS, and so I reach a new record; but what does it matter? The API/algorithm below isn't any better, it's just throwing out a bigger useless number.

                  Originally posted by blackshard View Post
                  the numbers are not contextualized. We don't know the variables in the game so we can't say how much of the bigger number is due to io_uring optimization and how much due to just more powerful and capable hardware.


                  They're all by the same guy, and he shares more info in the tweets that the articles reference.

                  He's only used two different systems, but the Optane storage he tested against remained the same until he saturated its controller. I don't recall the exact amount before he upgraded to the newer system; I think it might have been around 3.8M or something. The linked tweet is about his new record on the upgraded system, where he notes he got it to the point that one of these storage devices alone was the new bottleneck, regardless of CPU speed.

                  So with that achieved, he added a 2nd Optane disk (same model) to see how much he could get his particular CPU core to handle across both devices. We're now at 8M, and these devices handle around 5M IOPS each.

                  So yes, there may be a slight boost from the CPU/system upgrade, but he's made steady improvements on both systems, before and after. You don't have to pay attention to the specific numbers, but the scale/ratio of improvements is worthwhile. We went from like 2M to 8M, a 4x improvement, and that was a big improvement over what AIO was capable of already, IIRC.

                  As for benchmarking, besides him being the only source and the hardware only changing once, he details that he uses fio, a common disk I/O benchmarking tool. The linked tweet thread even has him share the command he's been using to get these results: taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n1 /dev/nvme2n1

                  So... not that many "variables in the game"? The bulk of the improvements reported comes from optimizations, very little from more powerful hardware.

                  Although in order to benefit from those new records, you would need hardware of similar capability (e.g. without the Optane, you'd bottleneck on the storage device considerably earlier; top NVMe products like the WD SN850 or Samsung 980 Pro peak at around 1M IOPS, and regular SATA SSDs like a Crucial MX500 at around 100k IOPS). If you don't saturate the storage device first, then, like the developer, you need a CPU that can handle such a load.

                  In practice, most of us won't run workloads that push hardware like that... we probably don't even regularly saturate a SATA SSD's IOPS. AFAIK, it's only going to matter if you can't saturate the device's IOPS capability because the CPU is the bottleneck. That shouldn't be an issue for the SATA SSD with its 100k IOPS. Less CPU usage should be required to perform the same amount of I/O, but for short bursts of random I/O you'd probably not notice... 1k I/Os issued to that 100k IOPS device would be done in 10ms? The blip in CPU usage wouldn't be perceived. (I could be completely misunderstanding the topic here, not an expert.)



                  • #49
                    Originally posted by coder View Post
                    Of course, that's all very simplistic, but I think it's clear the CPU is still far ahead of storage, leaving plenty of cycles for the network stack and for userspace code to do interesting things with the data.
                    It is not actually about CPUs, even when it requires a Zen 3 CPU to push these numbers. You do not want CPUs to do menial workloads of shuffling memory around. This is part of the point here. Memory is handled by memory controllers and MMUs, and a multicore CPU like Zen 3 certainly gets its speed not from the main memory, which it has to share with all cores, but from having plenty of cache and fast cache controllers. Too much gets done in software here, obviously, and there is a certain irony when Axboe drops his Intel box for an AMD one, because the Intel box could not max out the full potential of an Intel storage device ... You did notice it, too, right?! Intel will solve it not by tweaking the CPUs or relying on software, but they will seek to develop standards with the memory industry to integrate these new technologies on the hardware level without relying too much on a CPU's processing power. This will keep it cheap while being fast and allows them to eventually sell it on the consumer market. However, SATA was clearly not fast enough, and PCIe/M.2 is not going to do it for long either. Consumer SSDs are now reaching 7GB/sec, although weak in IOPS, but as you can see, this limitation is falling quickly.



                    • #50
                      Originally posted by sdack View Post
                      You do not want CPUs to do menial workloads of shuffling memory around. This is part of the point here. Memory is handled by memory controllers and MMUs,
                      Memory controllers and MMUs do not remove CPU cores from the direct path of memory copies. DMA engines do that, like what PCIe devices have. When you switch from NVMe drives to using NVDIMMs, you switch from doing DMA copies to PIO. So, it actually ties up the CPU more.

                      Of course, someone is probably going to chime in about the new data-streaming accelerator engines, in Ice Lake SP or Sapphire Rapids (I forget which). However, if your goal is to replace DRAM with nonvolatile memory, then you can't outsource all memory accesses to a separate engine - it's got to be the CPU accessing it.
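                      To make the PIO point concrete, here's a minimal sketch - my own, not Intel's PMEMFILE; the mount point is hypothetical, and MAP_SYNC needs a DAX-capable filesystem (ext4/xfs mounted with -o dax) plus a reasonably recent kernel and glibc - of what "the CPU accessing it" looks like: a plain store through a direct mapping, with no DMA engine in the path.

                      #define _GNU_SOURCE           /* for MAP_SYNC / MAP_SHARED_VALIDATE */
                      #include <fcntl.h>
                      #include <string.h>
                      #include <sys/mman.h>
                      #include <unistd.h>

                      int main(void)
                      {
                          /* Hypothetical file on a DAX-mounted filesystem backed by an NVDIMM. */
                          int fd = open("/mnt/pmem/data", O_RDWR);
                          if (fd < 0)
                              return 1;

                          /* MAP_SYNC makes the mapping a direct window onto the media. */
                          void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                         MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
                          if (p == MAP_FAILED)
                              return 1;

                          /* The CPU core itself moves the bytes - this is the PIO path.
                           * Real code would follow this with CLWB + SFENCE (or pmem_persist())
                           * to guarantee the stores have actually reached persistence. */
                          memcpy(p, "hello, pmem", 12);

                          munmap(p, 4096);
                          close(fd);
                          return 0;
                      }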

                      Originally posted by sdack View Post
                      when Axboe drops his Intel box for an AMD one, because the Intel box could not max out the full potential of an Intel storage device ... You did notice it, too, right?!
                      Yeah, I probably made some comment to that effect, when Intel first started shipping these drives. At the time, Rocket Lake hadn't even launched (much less Ice Lake SP or Tiger Lake H). So, the only way you could even use them @ PCIe 4.0 speeds was on AMD, POWER, or ARM CPUs.

                      Originally posted by sdack View Post
                      Intel will solve it not by tweaking the CPUs or relying on software, but they will seek to develop standards with the memory industry to integrate these new technologies on the hardware level without relying too much on a CPU's processing power.
                      Intel is the one developing PMEMFILE, for userspace PIO access to NVDIMMs.

                      They are also pushing CXL and probably helped drive the inclusion of memory devices into recent versions of the spec.

                      Originally posted by sdack View Post
                      Consumer SSDs are now reaching 7GB/sec, although weak in IOPS, but as you can see, this limitation is falling quickly.
                      Top-end prosumer SSDs, yes. However, most consumers are slumming it with SATA or slow QLC NVMe drives.

