**oiaohm** · 25 August 2022, 09:36 PM

Originally posted by ayumu

Thousands of syscalls (now one more!) and millions of LoCs. Of course, all running in supervisor mode. What could possibly go wrong.

Chromium OS Docs - Linux System Call Table

https://chromium.googlesource.com/chromiumos/docs/+/master/constants/syscalls.md

ayumu, what the heck are you talking about? The Linux kernel has a few hundred syscalls. Even counting all the platform-specific syscalls, the Linux kernel has not crossed 400 syscalls yet.

Micro-kernels try to go for under 100 syscalls. Monolithic kernels historically have been heavier.

ayumu, thousands of syscalls matches MS Windows, with the 2000+ syscalls that have existed since Windows started. But even with Windows there are really only about 1000 syscalls in production releases, because roughly 50% of the Windows syscalls that have ever existed have been deprecated and removed.

1000 syscalls appears to be the upper limit. I think what you are writing is out by a power of 10. Monolithic kernel syscalls you can roughly count in hundreds, and microkernel syscalls you can roughly count in tens.

ayumu, I don't know of an OS kernel that has 1100+ syscalls in a production release. MS Windows is really the biggest I know of, and its syscall count is way higher than the next biggest. Yes, the Linux kernel is not the next biggest. The FreeBSD, OpenBSD and NetBSD kernels are all heavier in syscalls than the Linux kernel; yes, all of the BSDs passed the 500 syscall count in production releases over a decade ago.

It's a surprise to a lot of people that the Linux kernel is not that heavy on syscalls for a monolithic kernel. Yes, the Linux kernel is between 4 and 10 times heavier in syscalls than the typical microkernel.

Code running in userspace does not make it safer either. Minix proved this a long time ago. The hard part is still how to audit everything.

**oiaohm** · 25 August 2022, 09:58 PM

Originally posted by AlanTuring69

I don't understand why such non-Unix syscalls are even being considered by kernel "maintainers"? Has Micro$oft really infiltrated my kernel?

p.s. if you even considered what I said to be true then re-evaluate yourself. It's obvious that this is to help some niche use-cases that just need to readfile simply. IO_uring is overkill for these use-cases and readfile already exists so no reason not to make it more efficient.

https://lwn.net/Articles/813827/ The readfile syscall is stage one. Once readfile is worked out as a syscall, it will be added to the io_uring path as well. Readfile turning 3 operations into 1 has advantages for io_uring too. So improve direct syscall usage first, then improve indirect usage through io_uring next; the improvement is the same thing.

With io_uring, open, read and close are 3 independent entries placed on the ring buffer; readfile implemented on the io_uring ring buffer would reduce that to one entry, saving ring buffer space and processing. Readfile as a syscall allows 3 syscalls to be turned into 1, saving 2 syscalls and their context switches. There are more savings in the readfile syscall than in the io_uring change, but either way there are savings. Of course, the lower savings of the io_uring change make it less important to do first.

The reality is that from day 1 the Linux kernel has included syscalls that are not Unix syscalls.
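As a rough illustration of the three-step sequence described above, here is a minimal Python sketch. `read_small_file` is a hypothetical userspace helper, not the proposed kernel interface; it only marks the three separate calls that a single readfile syscall would stand in for.

```python
import os
import tempfile

def read_small_file(path, bufsize=4096):
    """Hypothetical helper: read up to bufsize bytes with three separate calls.

    Error handling is omitted for brevity; this is only a sketch.
    """
    fd = os.open(path, os.O_RDONLY)   # call 1: open
    data = os.read(fd, bufsize)       # call 2: read
    os.close(fd)                      # call 3: close
    return data

# Usage: write a tiny file, then read it back with the helper.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
data = read_small_file(tmp.name)
os.unlink(tmp.name)
```

A readfile-style syscall would collapse the three numbered calls into one kernel entry, which is the saving being discussed.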

**discordian** · 26 August 2022, 03:44 AM

Another thing is that this could be implemented more reliably, as you don't have to manage a file descriptor and close it on both the success and error paths.
Otherwise you always have to expect that you can run out of fds; some attacks focus on exploiting these kinds of bugs.
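To make the bookkeeping concrete, here is a small Python sketch of what "close it on both success and error paths" looks like today. `read_checked` and its `expected_prefix` parameter are made up for illustration; the point is only that the descriptor has to be released on every exit.

```python
import os
import tempfile

def read_checked(path, expected_prefix, bufsize=4096):
    """Hypothetical helper: open, read, validate, and always close the fd."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, bufsize)
        if not data.startswith(expected_prefix):
            raise ValueError("unexpected file contents")  # error path
        return data                                       # success path
    finally:
        # Runs on both paths. Without it, every failed call would leak one
        # descriptor until the process hits its fd limit.
        os.close(fd)

# Usage: one call that succeeds and one that fails, both closing the fd.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"From: a@example.com\n")
ok = read_checked(tmp.name, b"From:")
try:
    read_checked(tmp.name, b"Subject:")
    failed = False
except ValueError:
    failed = True
```

With a single call that returns the file contents directly, there would be no descriptor for the caller to track at all.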

**archkde** · 26 August 2022, 03:52 AM

Originally posted by F.Ultra

One use case was linked in TFA, the last time this was brought up it was claimed that io_uring is not designed for this open->read->close type of sequence of operations, io_uring is also vastly more complex to set up than a simple call to readfile(), but most importantly Greg is the dev/inventor for both readfile() and io_uring so he obviously has good reasons for introducing a new syscall over his io_uring pet project.

io_uring is not a pet project, and as far as I know, its principal developer is not Greg Kroah-Hartman, but Jens Axboe.

**archkde** · 26 August 2022, 03:56 AM

Originally posted by oiaohm

Two new ways to read a file quickly [LWN.net]

https://lwn.net/Articles/813827/

readfile and io_uring are not two different things here. The readfile syscall makes it one operation to open, read and close a small file. Yes, this takes 3 operations and makes them 1. Readfile can also be implemented on io_uring once the readfile syscall works and is in the kernel. Yes, same thing: instead of 3 io_uring operations you reduce to 1 once readfile is confirmed as a working syscall.

Yes, each of those individual operations, open, read and close, results in having to acquire locks. The same locks. Open has you acquire and release locks, read has you acquire and release locks, and close has you acquire and release locks.

Yeah, I get that. But when I read one file and it needs 3 syscalls instead of one, so what, it doesn't matter. When I read thousands of files, it still doesn't matter, because it's still 2 syscalls due to io_uring (plus the initial io_uring_setup that needs to be done only once).

**oiaohm** · 26 August 2022, 04:53 AM

Originally posted by archkde

Yeah, I get that. But when I read one file and it needs 3 syscalls instead of one, so what, it doesn't matter. When I read thousands of files, it still doesn't matter, because it's still 2 syscalls due to io_uring (plus the initial io_uring_setup that needs to be done only once).

Readfile can also be implemented on io_uring once the readfile syscall works and is in the kernel. Yes, same thing: instead of 3 io_uring operations you reduce to 1 once readfile is confirmed as a working syscall.

You did not read what I wrote carefully enough. Do notice that here I was talking about io_uring. The LWN article https://lwn.net/Articles/813827/ also said that readfile will in time be added to io_uring once it is proven to work.

Yes, each of those individual operations, open, read and close, results in having to acquire locks. The same locks. Open has you acquire and release locks, read has you acquire and release locks, and close has you acquire and release locks.

This bit from my prior post is important. Also, not every program reading thousands of small files is going to be setting up io_uring.

archkde, you are still treating readfile and io_uring as two different things. The readfile syscall is the prototype for what will be readfile on io_uring in future, because io_uring has performance issues doing open, read and close rapidly, as small files cause the locking problem. With io_uring on a small file, you can be attempting to close a file handle while the lock created when it was opened has not been cleared yet. Yes, io_uring bypassed the context switch problem of syscalls, with the result that small files become a problem due to locking. With the overhead of a context switch on each of open, read and close, there is enough time for the lock that open caused to clear before close gets called.

Small files are their own unique headache. Readfile and io_uring are not two separate things. With the locking fun of file access, getting the open, read and close block right in a syscall is most likely safer than attempting to add readfile straight to io_uring, where things are moving faster, so race conditions and other risks come up.

**oleid** · 26 August 2022, 04:59 PM

Readfile sounds just what I need for my maildir.

**binarybanana** · 26 August 2022, 05:23 PM

The thing is that the open->read->close sequence requires 6 context switches, while a readfile syscall would only require 3 (userspace -> kernel -> userspace). This is especially important for systems that are affected by all the slowdowns of CPU mitigations (most of them). These mitigations have a huge impact on the cost of context switching. By now, with all mitigations enabled, it takes like 3x longer to switch a context, if not even more. So the difference is likely even more pronounced than back when the readfile syscall was proposed.

**oiaohm** · 26 August 2022, 05:47 PM

Originally posted by binarybanana

The thing is that the open->read->close sequence requires 6 context switches, while a readfile syscall would only require 3 (userspace -> kernel -> userspace). This is especially important for systems that are affected by all the slowdowns of CPU mitigations (most of them). These mitigations have a huge impact on the cost of context switching. By now, with all mitigations enabled, it takes like 3x longer to switch a context, if not even more. So the difference is likely even more pronounced than back when the readfile syscall was proposed.

Typo: you typed 3 context switches instead of 2. Yes, it is 6 context switches vs 2 context switches, or 3 syscalls vs 1 syscall (simply multiply by 2 to get context switches). Another thing: io_uring and kernel-side locking also have higher overhead due to mitigations, so if you can reduce locking there is a saving here. Of course, you don't want to add readfile functionality to io_uring, which operates at high speed, if the basics are not right. Also, having to set up io_uring first just to test readfile functionality does not work out.
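The arithmetic above can be written out as a tiny Python sketch. The function name `transitions` is made up for illustration, and it simply assumes, as this post does, that each syscall costs two switches (into the kernel and back out).

```python
def transitions(files, syscalls_per_file):
    """Count user/kernel switches, assuming 2 per syscall (in and back out)."""
    return files * syscalls_per_file * 2

one_file_classic = transitions(1, 3)    # open + read + close
one_file_readfile = transitions(1, 1)   # a single readfile call
saved_per_file = one_file_classic - one_file_readfile

# The same counting applied to a directory of 1000 small files.
many_classic = transitions(1000, 3)
many_readfile = transitions(1000, 1)
```

Under that assumption the per-file numbers come out to 6 versus 2, and the gap grows linearly with the number of files read through plain syscalls.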

The heavy future users of the readfile functionality are most likely going to be io_uring users.

There are savings all round once the functionality can be merged and is proven to work. The readfile syscall is basically only step one. Step two is readfile functionality added to io_uring as well. The readfile syscall saves on locking operations and context switches, both of which are more costly due to mitigations. Readfile with io_uring will save in the locking department only.

**F.Ultra** · 26 August 2022, 05:48 PM

Originally posted by archkde

io_uring is not a pet project, and as far as I know, its principal developer is not Greg Kroah-Hartman, but Jens Axboe.

Sorry, I mixed up the two; I should have checked that properly before posting. I also didn't know that "pet project" had a negative connotation in English; what I meant was that io_uring was important to him (which of course wasn't true, since it's Jens's project).

Readfile System Call Revised For Efficiently Reading Small Files
