IO_uring Will Be Even Faster With Linux 5.12

  • #11
    Originally posted by uid313 View Post
    And Linux got io_uring support in version 5.1, and io_uring is what is generally called completion queues/ports, something that AmigaOS had. Linux is very late to the game with this feature. Windows already had it long ago too.
    I don't know too much about I/O completion ports and admittedly nothing about AmigaOS, but I would point out that Linux has had non-blocking I/O for ages. io_uring is just a different mechanism, one that doesn't require a syscall for each operation.
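
    To make the "no syscall per operation" point concrete, here's a minimal hedged sketch of a single read going through io_uring via liburing (the file name is a placeholder and error handling is stripped for brevity; assumes liburing is installed, build with gcc demo.c -luring):

        #include <fcntl.h>
        #include <stdio.h>
        #include <liburing.h>

        int main(void)
        {
            struct io_uring ring;
            io_uring_queue_init(8, &ring, 0);        /* shared submission/completion rings */

            int fd = open("somefile", O_RDONLY);     /* placeholder file */
            char buf[4096];

            /* Queue the read; many more operations could be queued here first. */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
            io_uring_submit(&ring);                  /* one syscall submits the whole batch */

            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);          /* reap the completion */
            printf("read returned %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);

            io_uring_queue_exit(&ring);
            return 0;
        }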



    • #12
      There is also the deadline scheduler. Under huge (but really relatively small) DESKTOP load, it loses requests...



      • #13
        Originally posted by ptrwis View Post
        Now we only have to wait for support in the networking libraries of popular programming languages
        If you want optimal performance, I guess it's time to migrate away from your "popular" programming language (which, I guess, optimizes for the number of keystrokes) and start getting things done.



        • #14
          Originally posted by Danny3 View Post
          I have one of the fastest flash drives on the market, but booting distros from it still doesn't give the same performance as booting them from the internal SSD, even though its rated performance is comparable to the SSD and it's connected through USB 3.
          There are a lot of other differences in play here. You should be able to run some performance tests yourself (perhaps KDiskMark, which is unrelated to KDE, or fio, which can test io_uring performance specifically); those will help you better understand how the two drives differ. Sequential I/O is often used to market a drive's performance along with its capacity, but the more technical details that matter are unrelated to that.

          For random I/O the metric is IOPS, and it depends on the block size (filesystems usually use 4K these days; disks also have internal block sizes that can differ) and on how many I/O threads and what queue depth you use, along with the maximum bandwidth the interface (SATA, NVMe, USB, PCIe) can support. Note that SATA vs NVMe is quite a difference in queue depth (32 vs 65k). Threads here are concurrent I/O requests, so you might want to consider 4KQ1T1 (4K blocks, queue depth 1, one thread) as the slowest baseline; an fio job for that case is sketched below.
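
          A hedged sketch of such an fio job file (the filename and size are placeholders; the io_uring engine needs a reasonably recent fio and kernel):

              ; 4KQ1T1 random-read baseline; run with: fio 4kq1t1.fio
              [global]
              ioengine=io_uring
              direct=1
              filename=/path/to/testfile
              size=1G
              runtime=30
              time_based=1

              [randread-4kq1t1]
              rw=randread
              bs=4k
              iodepth=1
              numjobs=1

          Raising iodepth and numjobs from there shows how each drive scales beyond that slowest-baseline case.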

          Next up is the cache/buffer. Bigger, more expensive internal disks often have one, usually a more premium layer of NAND: smaller in capacity but faster. Writes can be fast as long as they don't fill it up; the drive then internally moves the data to the slower, larger-capacity NAND, which is where reads will usually come from (ignoring the kernel caching reads in a system RAM buffer when spare memory is available).

          Then there's the controller chipset; when you update firmware on the product, this is what you're updating, and these can make a notable difference in performance quality. On USB products, the controller often also talks to a bridge chipset (which has its own firmware if it's an adapter/case that lets you add your own internal disk and treat it as external via USB). The bridge may be for SATA or M.2, but these can be a bit nerfed in what they support, quality varies, and performance issues often come from here: some don't support passing through SMART data, others don't support TRIM, and some support a reduced queue depth regardless of what the connected disk offers.

          Then there's the USB side of the bridge, which interfaces to the main system. Performance here depends on the USB controller chipset in that system, which negotiates the best USB interface both ends support, and on its own I/O performance. On embedded-like devices such as SBCs (e.g. the Raspberry Pi), the power supply may be limited; if it's insufficient, that can result in poor performance or in failures/corruption, either ruining some data or causing a disconnect/reset of the device.

          Power is a different topic in its own right, but you also need to take the USB cable's properties into account. Its length affects whether sufficient voltage is carried across (5V is standard unless a power-delivery protocol like USB-PD is in use), and the cable's AWG rating (roughly its thickness/quality) matters too: AWG determines how many amps the cable can carry (volts x amps equals watts of power; some devices expect a specific voltage and amperage, and some can adjust what they receive to accommodate that where possible, usually with some efficiency loss in the process, similar to power-bank supplies).

          Assuming power is all good (which it probably is on a laptop or desktop, especially with devices that have no cable and plug straight into the port; note that the port and chipset also come into play for power, since other devices can draw from the same shared reserve the controller provides), the next factor is how data is transferred over USB. In the past, and on weaker/cheaper devices, that's often BOT (Bulk-Only Transport), which has overheads and other drawbacks, whereas more modern/expensive products support UASP to some degree, which can better leverage SATA/NVMe at much reduced overhead.

          The USB protocol version itself also has different overheads (and power-delivery abilities). USB 3.1 Gen 1 (effectively USB 3.0) uses 8b/10b encoding, which is a 20% overhead loss by itself; USB 3.1 Gen 2, apart from supporting double the bandwidth, uses 128b/132b encoding (about 3% overhead), far more efficient and allowing additional performance gains, but only if both devices support Gen 2. USB 3.2 introduces support for an additional lane that doubles bandwidth again, which IIRC just requires both controller chipsets to negotiate USB 3.2 over the relevant USB-C Gen 1/Gen 2 cable (no new cable type).

          For USB sticks with no cable, especially the cheaper kind, manufacturers can make a lot of decisions that cut supported features and performance in favor of cost/size. They can also succumb to thermal throttling, and even at a similar capacity they probably use cheaper, less performant NAND; unlike an internal disk, they may also have fewer NAND chips providing that capacity (which would increase latency, IIRC). Marketing-wise, though, they'll focus on the capacity and maybe quote the maximum transfer speed the USB version supports, with the device itself far below that, e.g. "USB 3.0, 5,000 Mb/s!" (that's 5 gigabits, i.e. 625 MB/s, or 500 MB/s after USB 3.x Gen 1's 20% encoding loss; add on other losses and you're often at 400 MB/s or lower sequential performance, especially sustained).

          TL;DR:

          There are a lot of differences between whatever USB storage device you have and the internal one: overheads from additional protocols, reduced feature sets from the chipsets involved, and efficiency losses from poor conditions (thermals, power supply, USB cable quality, negotiated USB protocol, etc.).



          • #15
            Originally posted by elbar View Post
            There is also the deadline scheduler. Under huge (but really relatively small) DESKTOP load, it loses requests...
            How is that not a bug?

            Either the kernel should complete the operation or it should fail (though not without a good reason). If it failed, maybe the app isn't checking the return code and just plows ahead as if it succeeded? And if it did fail for a justifiable reason, then the bug would be in the app for not handling the error.
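
            For illustration, a hedged sketch of the checking that gets skipped when an app "plows ahead" (write_all is a hypothetical helper, not from any post above):

                #include <stdio.h>
                #include <unistd.h>

                /* write() can fail outright (-1) or write less than asked;
                 * ignoring either case means silently losing data. */
                int write_all(int fd, const char *buf, size_t len)
                {
                    while (len > 0) {
                        ssize_t n = write(fd, buf, len);
                        if (n < 0) {
                            perror("write");   /* surface the failure */
                            return -1;
                        }
                        buf += n;              /* short write: retry the rest */
                        len -= (size_t)n;
                    }
                    return 0;
                }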



            • #16
              Originally posted by Togga View Post
              If you want optimal performance, I guess it's time to migrate away from your "popular" programming language (which, I guess, optimizes for the number of keystrokes) and start getting things done.
              Ultimately, it comes down to using the right language for the job. When it comes to performance, using a lower-level language can sometimes net you a small multiple of performance gains, but using better datastructures and algorithms can be worth orders of magnitude. And those are things low-level languages don't do well.

              Regarding the point about error-handling I made above, this is a huge win for languages with exceptions. Error-handling often makes up the majority of C code, if it's done properly. Languages with exceptions let you write mostly straight-line code for the normal case, which makes it easier to follow and understand, and that translates into fewer bugs.
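
              As a rough sketch of what that C boilerplate looks like (a hypothetical function using the classic goto-cleanup pattern; most of the lines exist only to detect errors and clean up, which exceptions plus destructors would largely elide):

                  #include <stdio.h>
                  #include <stdlib.h>

                  /* Copy the first 4K of src to dst, C style. */
                  int copy_first_block(const char *src, const char *dst)
                  {
                      int ret = -1;
                      FILE *in = NULL, *out = NULL;
                      char *buf = NULL;

                      if (!(in = fopen(src, "rb")))
                          goto cleanup;
                      if (!(out = fopen(dst, "wb")))
                          goto cleanup;
                      if (!(buf = malloc(4096)))
                          goto cleanup;

                      size_t n = fread(buf, 1, 4096, in);
                      if (ferror(in))
                          goto cleanup;
                      if (fwrite(buf, 1, n, out) != n)
                          goto cleanup;

                      ret = 0;                   /* the actual happy path */
                  cleanup:
                      free(buf);
                      if (out) fclose(out);
                      if (in) fclose(in);
                      return ret;
                  }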

              Honestly, I can't believe how prevalent C still is in desktop apps, where it typically doesn't even have the potential to provide a performance benefit (*cough* GNOME).



              • #17
                Originally posted by polarathene View Post
                There are a lot of differences between whatever USB storage device you have and the internal one: overheads from additional protocols, reduced feature sets from the chipsets involved, and efficiency losses from poor conditions (thermals, power supply, USB cable quality, negotiated USB protocol, etc.).
                Traditionally, I've experienced orders of magnitude worse performance with USB flash drives on Linux than Windows. Even with the exact same computer, which rules out most of your explanations.

                I once heard someone claim that the Linux kernel didn't support asynchronous operations on USB-connected storage the way it did with ATA, SCSI, SATA, etc. If it's not doing any read-ahead or write-back, that would go some way towards explaining the dismal performance I've experienced (we're talking like 3 MB/sec on a USB stick that Windows would access at > 30 MB/sec). And io_uring shouldn't help much, if at all, given that even dd with very large block sizes still couldn't eke out remotely decent performance from it.

                I find this somewhat ironic, since Linux always had noticeably faster floppy disk performance than Windows!



                • #18
                  Bugs should be eaten. In Asia there is a lot of food on the table... Or under the table...



                  • #19
                    Originally posted by coder View Post
                    Traditionally, I've experienced orders of magnitude worse performance with USB flash drives on Linux than Windows. Even with the exact same computer, which rules out most of your explanations.

                    that would go some way towards explaining the dismal performance I've experienced (we're talking like 3 MB/sec on a USB stick that Windows would access at > 30 MB/sec).
                    Check the system buffer by tuning the kernel's vm tunables, like dirty_bytes or dirty_ratio. Those made a huge difference for me with Linux's poor performance writing to a USB stick (1 GB took 7 hours otherwise). IIRC my 32 GB RAM system was allocating a 3.2 GB write-back buffer (the 10% default), but setting a buffer of 15 MB or so meant data was flushed to the actual storage target much sooner, which gave better performance and more closely matched Windows; a sketch of the tunables is below.
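
                    Something like this in sysctl form (example values only, not a recommendation):

                        # /etc/sysctl.d/99-usb-writeback.conf
                        # ~15 MB of dirty data before writers are throttled
                        vm.dirty_bytes = 15728640
                        # ~7.5 MB before background writeback kicks in
                        vm.dirty_background_bytes = 7864320

                    Apply with sysctl --system, or test first at runtime with sysctl -w. Note the kernel treats dirty_bytes and dirty_ratio as mutually exclusive: setting one zeroes the other.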

                    Beyond that, the I/O scheduler may be involved.

                    Chipsets do come into play here: Linux blacklists a few that don't implement UASP properly, whereas Windows either ignores the problem or provides workarounds/drivers, something like that. I think this was a common issue with Seagate externals. Likewise for TRIM support: on the latest Samsung portable disks (T5, T7), I think TRIM is left disabled by Linux at mount and you have to enable it explicitly, as sketched below.
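
                    If I recall the mechanism right, the usual way to enable it is a udev rule flipping the SCSI provisioning_mode to unmap (the vendor/product IDs below are placeholders; check yours with lsusb, and only do this for bridges known to handle UNMAP correctly):

                        # /etc/udev/rules.d/10-usb-trim.rules (placeholder IDs)
                        ACTION=="add|change", ATTRS{idVendor}=="abcd", ATTRS{idProduct}=="1234", \
                          SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"

                    After reloading rules and replugging the device, fstrim on the mounted filesystem should then work, assuming the bridge really supports UNMAP.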

                    My explanations were primarily about the differences/gotchas between an internal disk and the performance of external disks/sticks.



                    • #20
                      Originally posted by coder View Post
                      but using better datastructures and algorithms can be worth orders of magnitude. And those are things low-level languages don't do well.
                      ...
                      Regarding the point about error-handling I made above, this is a huge win for languages with exceptions.
                      Why shouldn't a low-level language do "datastructures" well? On the contrary: rather than trying to fit your problem to existing high-level structures, you have better control and can develop the structure to fit the problem.

                      Sure, exceptions can simplify things, but if you need lots of error recovery and control, I don't see much improvement: just a runtime cost from the exception handling itself, and in my view the number of lines of code needed to deal with it isn't improved either. That said, I find features like destructors in C++ a big help for resource management in many cases. But moving to a higher-level language with higher-level features also comes with a cost, most notably added dependencies and runtime complexity, so for long-term projects I'd rather see the lean-and-mean approach.

