
IO_uring Continues To Prove Very Exciting: Promising io_uring_spawn Announced


  • #21
    Originally posted by F.Ultra View Post
    How on earth will it give you 32x the throughput? Do you have that amount of context switch overhead?
    Yes, duh. My average packet is about 300 bytes. recvmsg to recvmmsg is a big improvement too, but zero-copy UDP on io_uring is another level. Anyone who's written UDP applications has experienced being kernel-limited and syscall-limited: time a normal recvmsg/sendmsg application and, unless you're doing a lot of work per packet, your sys time will be 5-10x your user time.

    io_uring was originally introduced for I/O on files in real filesystems, but I think where it will ultimately deliver the most value is UDP applications. In particular, QUIC is implemented in userspace (and will likely stay userspace-only for a while) and is absolutely dominated by the overhead of the UDP syscall interface (setting aside NICs' less sophisticated offloads for UDP vs. TCP). io_uring drops the context-switching and fine-grained copying overheads to very low levels, and exposes applications to the raw performance of the kernel UDP stack and the NIC.
    Last edited by microcode; 14 September 2022, 12:15 PM.

    Comment


    • #22
      Originally posted by Zeioth View Post
      Can we users do something to enable it/test it?
      TigerBeetle, a fast db written in Zig on top of io_uring, that you can run today...

      https://github.com/tigerbeetledb/tig...demos/io_uring

      Comment


      • #23
        Heh, at this rate someone will come up with an io_uring based init.

        Comment


        • #24
          Originally posted by microcode View Post

          Yes, duh. My average packet is about 300 bytes. recvmsg to recvmmsg is a big improvement too, but zero-copy UDP on io_uring is another level. Anyone who's written UDP applications has experienced being kernel-limited and syscall-limited: time a normal recvmsg/sendmsg application and, unless you're doing a lot of work per packet, your sys time will be 5-10x your user time.

          io_uring was originally introduced for I/O on files in real filesystems, but I think where it will ultimately deliver the most value is UDP applications. In particular, QUIC is implemented in userspace (and will likely stay userspace-only for a while) and is absolutely dominated by the overhead of the UDP syscall interface (setting aside NICs' less sophisticated offloads for UDP vs. TCP). io_uring drops the context-switching and fine-grained copying overheads to very low levels, and exposes applications to the raw performance of the kernel UDP stack and the NIC.
          5-10x sounds more reasonable, but you wrote 32x, which is why I asked about that number specifically.

          And whoever came up with the API for recvmmsg should be taken out behind the shed...
          Last edited by F.Ultra; 15 September 2022, 11:25 AM.

          Comment


          • #25
            Originally posted by Gonk View Post
            Heh, at this rate someone will come up with an io_uring based init.
            That sounds like fun really.

            Comment


            • #26
              Originally posted by F.Ultra View Post
              5-10x sounds more reasonable, but you wrote 32x, which is why I asked about that number specifically.

              And whoever came up with the API for recvmmsg should be taken out behind the shed...
              Just because it looks like a big number doesn't mean it's wrong. recvmsg is EXTREMELY inefficient, and if you are receiving very small packets (again, mine are about 300 bytes) and spending only a couple hundred nanoseconds on each of them (I verify a cleartext MAC with polyval, then send the authenticated packet onto a lock-free channel), then you can easily get system-time ratios like this. I'm getting closer to memcpy speeds now; my current test implementation has 24x the throughput of the recvmsg version, and I think there's still room for improvement.

              Comment


              • #27
                Originally posted by Gonk View Post
                Heh, at this rate someone will come up with an io_uring based init.
                init does spawn processes, i.e. it would benefit from the io_uring_spawn work this article is about

                Comment


                • #28
                  Originally posted by microcode View Post

                  Just because it looks like a big number doesn't mean it's wrong. recvmsg is EXTREMELY inefficient, and if you are receiving very small packets (again, mine are about 300 bytes) and spending only a couple hundred nanoseconds on each of them (I verify a cleartext MAC with polyval, then send the authenticated packet onto a lock-free channel), then you can easily get system-time ratios like this. I'm getting closer to memcpy speeds now; my current test implementation has 24x the throughput of the recvmsg version, and I think there's still room for improvement.
                  I never claimed it was wrong; my apologies if my wording gave that impression. It was just that 32x was such a huge number that I was interested in hearing the explanation. I read lots of UDP data myself (multicast), in the Gbps range with low-latency requirements; however, I only use recv/read and not recvmsg (which I assume you had to use for the extra flags).

                  Comment


                  • #29
                    Originally posted by F.Ultra View Post
                    I never claimed it was wrong; my apologies if my wording gave that impression. It was just that 32x was such a huge number that I was interested in hearing the explanation. I read lots of UDP data myself (multicast), in the Gbps range with low-latency requirements; however, I only use recv/read and not recvmsg (which I assume you had to use for the extra flags).
                    What's the packet rate you're getting? I'm dealing with very small packets, all under 508 bytes of payload. While I do manage to get lots of packets in with the normal syscall interface, there is a lot of overhead, and it grows with packet count rather than payload throughput (i.e. your "Gbps range"), so it doubles when you wait on two packets for the same payload. The kernel is responsible for a portion of that, to be sure, but a lot of it comes from the design of the interface.

                    Sorry if I got antsy, there was no reason for me to be a jerk about it. Not-so-great weekend.
                    Last edited by microcode; 19 September 2022, 09:56 AM.

                    Comment


                    • #30
                      Originally posted by microcode View Post

                      What's the packet rate you're getting? I'm dealing with very small packets, all under 508 bytes of payload. While I do manage to get lots of packets in with the normal syscall interface, there is a lot of overhead, and it grows with packet count rather than payload throughput (i.e. your "Gbps range"), so it doubles when you wait on two packets for the same payload. The kernel is responsible for a portion of that, to be sure, but a lot of it comes from the design of the interface.

                      Sorry if I got antsy, there was no reason for me to be a jerk about it. Not-so-great weekend.
                      The packet rate is probably quite low (I haven't measured it). Since most stock exchanges are aware of the overhead of reading UDP data, they tend to implement their own packet aggregation on top of UDP; e.g. Nasdaq uses this scheme:

                      Code:
                      struct Header {
                              char Session[10];
                              uint64_t SequenceNumber;
                              uint16_t MessageCount;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      
                      struct MessageBlock {
                              uint16_t MessageLength;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      
                      struct MessageHeader {
                              char MessageType;
                              uint64_t Timestamp;
                              uint16_t TrackingNumber;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      
                      struct AddOrder {
                              struct MessageHeader Header;
                              uint32_t InstrumentId;
                              uint64_t OrderReferenceNumber;
                              char BuySellIndicator;
                              int64_t Price;
                              uint32_t Quantity;
                              uint16_t Rank;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      So I get 20 bytes of header per UDP packet + 13 bytes per embedded message at minimum, excluding the actual message body (e.g. an AddOrder). But then it also runs on a 2x Xeon Platinum machine, so there is some hardware behind it, so to speak. And net.core.rmem_max is *large*

                      Comment
