IO_uring Continues To Prove Very Exciting: Promising io_uring_spawn Announced


  • #21
    Originally posted by F.Ultra View Post
    How on earth will it give you 32x the throughput? Do you have that amount of context switch overhead?
    Yes duh. My average packet is about 300 bytes. Going from recvmsg to recvmmsg is a big improvement too, but zero-copy UDP on io_uring is another level. Anyone who's written UDP applications has experienced being kernel-limited and syscall-limited. Time a normal recvmsg/sendmsg application and, unless you're doing a lot of work per packet, your sys time will be 5-10x your user time.

    io_uring was originally introduced for IO on files on real filesystems, but I think where it will ultimately give the most value is in UDP applications. In particular, QUIC is implemented in userspace (and will likely stay userspace-only for a while), and is absolutely dominated by the overhead of the UDP syscall interfaces (setting aside the less sophisticated NIC offload for UDP vs. TCP). io_uring drops the context-switching and fine-grained copying overheads to very low levels, and exposes applications to the raw performance of the kernel UDP stack and the NIC.
    Last edited by microcode; 14 September 2022, 12:15 PM.
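    A minimal sketch of the batching win described above, assuming liburing and an ordinary UDP socket: one io_uring_submit() arms a whole batch of receives instead of paying one recvmsg() syscall per packet. It is not zero-copy, and the port, batch size and buffer size are made up for illustration.

    Code:
    /* Sketch: batch UDP receives through io_uring so one submit replaces many
     * recvmsg() syscalls.  Not zero-copy; BATCH, PKT_MAX and the port are
     * illustrative values, not anything from the discussion above. */
    #include <liburing.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define BATCH   64      /* receives armed per submit (assumed) */
    #define PKT_MAX 512     /* comfortably fits ~300-byte packets */

    static char bufs[BATCH][PKT_MAX];

    static void arm_recv(struct io_uring *ring, int fd, unsigned idx)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            io_uring_prep_recv(sqe, fd, bufs[idx], PKT_MAX, 0);
            io_uring_sqe_set_data(sqe, (void *)(uintptr_t)idx); /* tag with buffer index */
    }

    int main(void)
    {
            struct io_uring ring;
            struct sockaddr_in addr = { .sin_family = AF_INET,
                                        .sin_addr.s_addr = htonl(INADDR_ANY),
                                        .sin_port = htons(9000) }; /* example port */
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            io_uring_queue_init(2 * BATCH, &ring, 0);

            for (unsigned i = 0; i < BATCH; i++)
                    arm_recv(&ring, fd, i);
            io_uring_submit(&ring);              /* one syscall arms 64 receives */

            for (;;) {
                    struct io_uring_cqe *cqe;
                    unsigned rearmed = 0;

                    if (io_uring_wait_cqe(&ring, &cqe) < 0)
                            break;
                    do {    /* drain everything that has already completed */
                            unsigned idx = (unsigned)(uintptr_t)io_uring_cqe_get_data(cqe);
                            if (cqe->res > 0)
                                    printf("packet: %d bytes (buffer %u)\n", cqe->res, idx);
                            io_uring_cqe_seen(&ring, cqe);
                            arm_recv(&ring, fd, idx);     /* re-arm the same buffer */
                            rearmed++;
                    } while (io_uring_peek_cqe(&ring, &cqe) == 0);
                    if (rearmed)
                            io_uring_submit(&ring);       /* refill the batch in one go */
            }

            io_uring_queue_exit(&ring);
            close(fd);
            return 0;
    }
    Multishot receive and provided buffer rings would reduce the overhead further, but even this plain batching shows why the per-packet syscall cost drops out of the profile.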



    • #22
      Originally posted by Zeioth View Post
      Can we users do something to enable it/test it?
      TigerBeetle is a fast database written in Zig on top of io_uring that you can run today:

      tigerbeetle/tigerbeetle on GitHub — the distributed financial transactions database designed for mission-critical safety and performance.



      • #23
        Heh, at this rate someone will come up with an io_uring based init.



        • #24
          Originally posted by microcode View Post

          Yes duh. My average packet is about 300 bytes. recvmsg to recvmmsg is a big improvement too, but zero copy UDP on io_uring is another level. Anyone who's written UDP applications has experienced being kernel limited, and syscall limited. You time a normal recvmsg/sendmsg application, and unless you're doing a lot per packet, your sys time will be 5-10x your user time.

          io_uring was originally introduced for IO with files on real filesystems; but I think where it will ultimately give the most value is in UDP applications. Particularly, QUIC is implemented in userspace (and likely will be only in userspace for a while), and is absolutely dominated by the overhead of UDP syscall interfaces (setting aside less sophisticated offload in NICs vs. TCP). io_uring drops the context switching and fine grained copying overheads to very low levels, and exposes applications to the raw performance of the kernel UDP stack and the NIC.
          5-10x sounds more reasonable, but you wrote 32x, which is why I asked about that number specifically.

          And whoever came up with the API for recvmmsg should be taken back behind the shed...
          Last edited by F.Ultra; 15 September 2022, 11:25 AM.
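          For context on that complaint, a hedged sketch of the recvmmsg() call pattern: every slot needs its own iovec and msghdr wired up by hand before a single call can pull in a batch of datagrams. The batch size and buffer size here are made up.

          Code:
          /* Sketch of the recvmmsg() interface: per-slot iovec + msghdr setup, then
           * one call that fills msgs[i].msg_len with each received datagram's size.
           * VLEN and PKT_MAX are illustrative. */
          #define _GNU_SOURCE
          #include <string.h>
          #include <sys/socket.h>
          #include <sys/uio.h>

          #define VLEN    16
          #define PKT_MAX 512

          int receive_batch(int fd)
          {
                  static char bufs[VLEN][PKT_MAX];
                  struct iovec iov[VLEN];
                  struct mmsghdr msgs[VLEN];

                  memset(msgs, 0, sizeof(msgs));
                  for (int i = 0; i < VLEN; i++) {
                          iov[i].iov_base = bufs[i];
                          iov[i].iov_len  = PKT_MAX;
                          msgs[i].msg_hdr.msg_iov    = &iov[i];
                          msgs[i].msg_hdr.msg_iovlen = 1;
                  }
                  /* One syscall, up to VLEN datagrams; returns how many arrived. */
                  return recvmmsg(fd, msgs, VLEN, 0, NULL);
          }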



          • #25
            Originally posted by Gonk View Post
            Heh, at this rate someone will come up with an io_uring based init.
            That sounds like fun really.



            • #26
              Originally posted by F.Ultra View Post
              5-10x sounds more reasonable but you wrote 32x hence why I asked about that number specifically.

              And whoever came up with the API for recvmmsg should be taken back behind the shed...
              Just because it looks like a big number doesn't mean it's wrong. recvmsg is EXTREMELY inefficient, and if you are receiving very small packets (again, mine are about 300 bytes) and spending only a couple hundred nanoseconds on each of them (I verify a cleartext MAC with polyval, then send the authenticated packet onto a lock-free channel), then you can easily end up with system-time ratios like this. I'm getting closer to memcpy speeds now; my current test implementation has 24x the throughput of the recvmsg version, and I think there's still room for improvement.



              • #27
                Originally posted by Gonk View Post
                Heh, at this rate someone will come up with an io_uring based init.
                init does spawn processes, i.e. it would benefit from the subject of this article.



                • #28
                  Originally posted by microcode View Post

                  Just because it looks like a big number, doesn't mean it's wrong. recvmsg is EXTREMELY inefficient, and if you are receiving very small packets (again, mine are about 300 bytes), and spending only a couple hundred nanoseconds on each of them (I verify a cleartext MAC with polyval, and then send the authenticated packet onto a lock-free channel), then you can easily have system time ratios like this. I'm getting closer to memcpy speeds now, my current test impl is 24x higher throughput than the recvmsg version, and I think there's still room for improvement.
                  I never claimed that it was wrong; my apologies if my wording gave that impression. It was just that 32x was such a huge number that I was interested in hearing the explanation. I read lots of UDP data myself (multicast) in the Gbps range with low-latency requirements, but I only use recv/read and not recvmsg (which I assume you had to use for the extra flags).



                  • #29
                    Originally posted by F.Ultra View Post
                    Never claimed that it was wrong, my apologies if my wording made that impression. It was just that 32x was such a huge number that I was interested in hearing the explanation. I read lots of UDP data myself (multicast) in the Gbps range with low latency requirements, however I only use recv/read and not recvmsg (which I assume you had to use for the extra flags).
                    What's the packet rate you're getting? I'm dealing with very small packets, all less than 508 B of payload, and while I do manage to get lots of packets in with the normal syscall interface, there is a lot of overhead on that, and it scales per packet rather than with payload throughput (i.e. your "Gbps range"), so it doubles when the same payload is split across two packets. The kernel is responsible for a portion of that, to be sure, but a lot of it comes from the design of the interface.

                    Sorry if I got antsy; there was no reason for me to be a jerk about it. Not-so-great weekend.
                    Last edited by microcode; 19 September 2022, 09:56 AM.



                    • #30
                      Originally posted by microcode View Post

                      What's the packet rate you're getting? I'm dealing with very small packets, all less than 508B (payload) and while I do manage to get lots of packets in with the normal syscall interface, there is a lot of overhead on that, and it grows per packet vs. payload throughput (i.e. your "Gbps range"), so doubles when you wait on two packets for the same payload. The kernel is responsible for a portion of that, to be sure, but a lot of it comes from the design of the interface.

                      Sorry if I got antsy, there was no reason for me to be a jerk about it. Not-so-great weekend.
                      The packet rate is probably quite low (I haven't measured it); since most stock exchanges are aware of the overhead of reading UDP data, they tend to implement their own message aggregation on top of UDP. E.g. Nasdaq uses this scheme:

                      Code:
                      /* Packed wire-format structs; the fixed-width types need <stdint.h>. */
                      #include <stdint.h>
                      
                      /* One 20-byte header per UDP datagram. */
                      struct Header {
                              char Session[10];
                              uint64_t SequenceNumber;
                              uint16_t MessageCount;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      
                      /* 2-byte length prefix in front of every embedded message. */
                      struct MessageBlock {
                              uint16_t MessageLength;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      
                      /* 11-byte header shared by all message types. */
                      struct MessageHeader {
                              char MessageType;
                              uint64_t Timestamp;
                              uint16_t TrackingNumber;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      
                      /* Example message body. */
                      struct AddOrder {
                              struct MessageHeader Header;
                              uint32_t InstrumentId;
                              uint64_t OrderReferenceNumber;
                              char BuySellIndicator;
                              int64_t Price;
                              uint32_t Quantity;
                              uint16_t Rank;
                      } __attribute__ ((packed)) __attribute__((__may_alias__));
                      So I get 20 bytes of header per UDP packet + 13 bytes per embedded message at minimum, excluding the actual message body (e.g. an AddOrder). But then it also runs on a 2x Xeon Platinum machine, so there is some hardware behind it, so to speak. And net.core.rmem_max is *large*.
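                      A rough sketch of walking one aggregated datagram with the structs above; the helper name is made up, and byte-order conversion and full bounds checking are left out.

                      Code:
                      /* Hypothetical walk over one aggregated datagram, assuming the packed
                       * struct definitions above are in scope.  Wire byte order is ignored. */
                      #include <stdint.h>
                      #include <string.h>

                      static void parse_datagram(const uint8_t *buf, size_t len)
                      {
                              struct Header hdr;                 /* 20-byte packet header */
                              const uint8_t *p = buf, *end = buf + len;

                              memcpy(&hdr, p, sizeof(hdr));
                              p += sizeof(hdr);

                              for (uint16_t i = 0; i < hdr.MessageCount && p + 2 <= end; i++) {
                                      struct MessageBlock blk;   /* 2-byte length prefix */
                                      memcpy(&blk, p, sizeof(blk));
                                      p += sizeof(blk);
                                      /* blk.MessageLength bytes follow, starting with a
                                         MessageHeader, then the body (e.g. AddOrder). */
                                      p += blk.MessageLength;
                              }
                      }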

