Readfile System Call Revised For Efficiently Reading Small Files


  • GreenReaper
    replied
    Some of my nginx caches handle millions of thumbnails, and they all get read on startup, taking time, CPU and IOPS. It would be good if that could be reduced.



  • oiaohm
    replied
    Originally posted by archkde View Post
    Readfile is not going to land in io_uring per the article you mentioned. What they are planning is to add functionality to chain the openat with the read (and the close) so that everything can be done in 1 io_uring entry instead of 2.
    That chaining is going to be based on the readfile work.

    Originally posted by archkde View Post
    I don't know where you get the stuff about locks from. I can believe it, but it's probably not a huge problem, since an uncontended lock should be pretty fast (much faster than all the other stuff the kernel has to do when opening/reading/closing a file). If you have more information here, I'd be happy to hear about this.
    Small files turn out to be the case where the lock is not uncontended all of the time. Open and close both end up hitting the file system's directory lock, and in areas like sysfs this lock can be very active. So open has to take and release the lock, and close has to take and release it again. With large file reads and writes this is not a problem; in fact, holding the lock across a large read or write would itself be a problem, because it would block other directory actions.

    The worst case for small files is on network file systems, or other file systems with latency, where the release of the directory lock taken by open can be delayed. On the open-read-close path, that release may not have been processed before close tries to grab the lock again. That is the small-file problem: at times it is better not to release the directory lock at all, because it will not have been released anyway before close asks for it again.

    Remember that multiple applications can be hitting sysfs with stacks of small file reads. io_uring is for the case where one application is going to do a stack of file operations. io_uring is not for an application that needs to perform a single file operation and happens to run alongside a stack of other applications doing the same thing. There need to be two solutions to this problem to reduce the lock thrashing.

    openat with io_uring doing open read and close in one pass solves the same locking problem.



  • archkde
    replied
    Originally posted by oiaohm View Post

    Readfile can also be implemented on io_uring once the readfile syscall works and is in the kernel. Yes, same thing: instead of 3 io_uring operations you reduce to 1 once readfile is confirmed as a working syscall.

    You did not read what I wrote carefully enough. Do notice that here I was talking about io_uring. The LWN article https://lwn.net/Articles/813827/ also said that readfile will in time be added to io_uring once it is proven to work.

    Yes, each of those individual operations (open, read and close) results in having to acquire locks. The same locks. Open has you acquire and release locks, read has you acquire and release locks, and close has you acquire and release locks.

    This bit from my prior post is important. Also, not everything reading thousands of small files is going to set up io_uring.

    archkde, you are still treating readfile and io_uring as two different things. The readfile syscall is the prototype for what will be readfile in io_uring in future, because io_uring has performance issues doing open, read and close rapidly, as small files cause the lock problem. With io_uring on a small file, you can be attempting to close a file handle while the lock taken when it was opened has not been cleared yet. Yes, io_uring bypassed the context switch problem of syscalls, with the result that small files become a problem due to locking. With the overhead of a context switch on each of open, read and close, there is enough time for the lock taken by open to clear before close gets called.

    Small files are their own unique headache. Readfile and io_uring are not two separate things. With the locking fun of file access, getting the open, read and close block right in a syscall first is most likely safer than attempting to add readfile straight to io_uring, where things move faster and so race conditions and other things become a risk.
    Readfile is not going to land in io_uring per the article you mentioned. What they are planning is to add functionality to chain the openat with the read (and the close) so that everything can be done in 1 io_uring entry instead of 2.

    I don't know where you get the stuff about locks from. I can believe it, but it's probably not a huge problem, since an uncontended lock should be pretty fast (much faster than all the other stuff the kernel has to do when opening/reading/closing a file). If you have more information here, I'd be happy to hear about this.

    And it's a pity that applications won't use io_uring where it makes sense. It's rather simple to set up for the synchronous batch submission use case, and the speedup is much larger than what you get from readfile.



  • binarybanana
    replied
    Originally posted by oiaohm View Post

    Typo: you typed 3 context switches instead of 2. Yes, it is 6 context switches vs 2 context switches, or 3 syscalls vs 1 syscall (simply multiply by 2 to get context switches).
    Quite generous of you to call that a typo.
    But yes you're right, you start in user space, switch to kernel space and back, that makes for two context switches.
    Or actually, I'm not sure if a syscall on Linux necessarily means a full context switch happens in all cases. It doesn't seem to be a hard requirement. There is still some cost to switching from user to kernel space of course and the relative increase in cost due to mitigations is quite similar to the cost of a full context switch as far as I know so this is just nitpicking. And I'm not even sure how Linux handles it.



  • F.Ultra
    replied
    Originally posted by archkde View Post

    io_uring is not a pet project, and as far as I know, its principal developer is not Greg Kroah-Hartman, but Jens Axboe.
    Sorry, I mixed up the two; I should have checked that properly before posting. I also didn't know that "pet project" had a negative ring to it in English. What I meant was that io_uring was important to him (which of course wasn't true, since it's Jens's project).



  • oiaohm
    replied
    Originally posted by binarybanana View Post
    The thing is that the open->read->close sequence requires 6 context switches, while a readfile syscall would only require 3 (userspace -> kernel -> userspace). This is especially important for systems that are affected by all the slowdowns of CPU mitigations (most of them). These mitigations have a huge impact on the cost of context switching. By now, with all mitigations enabled, it takes like 3x longer to switch a context, if not even more. So the difference is likely even more pronounced than back when the readfile syscall was proposed.
    Typo: you typed 3 context switches instead of 2. Yes, it is 6 context switches vs 2 context switches, or 3 syscalls vs 1 syscall (simply multiply by 2 to get context switches). Another thing: io_uring and kernel-side locking also have higher overhead due to mitigations, so if you can reduce locking there is a saving here. Of course, you don't want to add readfile functionality to io_uring, which operates at high speed, if the basics are not right. Also, testing readfile functionality by having to set up io_uring first does not work out.

    The heavy future users of the readfile functionality are most likely going to be io_uring users.

    There are savings all round once the functionality can be merged and is proven to work. The readfile syscall is basically only step one. Step two is adding readfile functionality to io_uring as well. The readfile syscall saves on both locking operations and context switches, both of which are more costly due to mitigations. Readfile with io_uring will save in the locking department only.



  • binarybanana
    replied
    The thing is that the open->read->close sequence requires 6 context switches, while a readfile syscall would only require 3 (userspace -> kernel -> userspace). This is especially important for systems that are affected by all the slowdowns of CPU mitigations (most of them). These mitigations have a huge impact on the cost of context switching. By now, with all mitigations enabled, it takes like 3x longer to switch a context, if not even more. So the difference is likely even more pronounced than back when the readfile syscall was proposed.
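    The count above can be made concrete. The following is a minimal Python sketch, not kernel code: `read_small_file` and `kernel_crossings` are hypothetical helper names, and the counter only tracks the calls the function itself issues (an strace of the interpreter may show more). It assumes each syscall costs one entry into the kernel and one return, which gives 6 transitions for open, read and close, against 2 for a single combined call.

```python
import os
import tempfile


def read_small_file(path, max_bytes=4096):
    """Read a small file with the classic open -> read -> close sequence.

    Returns (data, calls), where calls counts the calls issued here.
    """
    calls = 0
    fd = os.open(path, os.O_RDONLY)    # call 1: open
    calls += 1
    try:
        data = os.read(fd, max_bytes)  # call 2: read
        calls += 1
    finally:
        os.close(fd)                   # call 3: close
        calls += 1
    return data, calls


def kernel_crossings(syscalls):
    # Assumption: every syscall enters the kernel once and returns once.
    return 2 * syscalls


if __name__ == "__main__":
    # Create a throwaway small file to read back.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello\n")
    data, n = read_small_file(tmp.name)
    os.unlink(tmp.name)
    # Classic sequence vs. one hypothetical combined call.
    print(data, n, kernel_crossings(n), kernel_crossings(1))
```

    Run as a script, it prints the data read, the number of calls made, and the transition counts for three calls and for one.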



  • oleid
    replied
    Readfile sounds like just what I need for my maildir.



  • oiaohm
    replied
    Originally posted by archkde View Post
    Yeah, I get that. But when I read one file and it needs 3 syscalls instead of one, so what, it doesn't matter. When I read thousands of files, it still doesn't matter, because it's still 2 syscalls due to io_uring (plus the initial io_uring_setup, which needs to be done only once).
    Readfile can also be implemented on io_uring once the readfile syscall works and is in the kernel. Yes, same thing: instead of 3 io_uring operations you reduce to 1 once readfile is confirmed as a working syscall.

    You did not read what I wrote carefully enough. Do notice that here I was talking about io_uring. The LWN article https://lwn.net/Articles/813827/ also said that readfile will in time be added to io_uring once it is proven to work.

    Yes, each of those individual operations (open, read and close) results in having to acquire locks. The same locks. Open has you acquire and release locks, read has you acquire and release locks, and close has you acquire and release locks.

    This bit from my prior post is important. Also, not everything reading thousands of small files is going to set up io_uring.

    archkde, you are still treating readfile and io_uring as two different things. The readfile syscall is the prototype for what will be readfile in io_uring in future, because io_uring has performance issues doing open, read and close rapidly, as small files cause the lock problem. With io_uring on a small file, you can be attempting to close a file handle while the lock taken when it was opened has not been cleared yet. Yes, io_uring bypassed the context switch problem of syscalls, with the result that small files become a problem due to locking. With the overhead of a context switch on each of open, read and close, there is enough time for the lock taken by open to clear before close gets called.

    Small files are their own unique headache. Readfile and io_uring are not two separate things. With the locking fun of file access, getting the open, read and close block right in a syscall first is most likely safer than attempting to add readfile straight to io_uring, where things move faster and so race conditions and other things become a risk.



  • archkde
    replied
    Originally posted by oiaohm View Post

    https://lwn.net/Articles/813827/

    readfile and io_uring are not two different things here. The readfile syscall makes it one operation to open, read and close a small file. Yes, this takes 3 operations and makes them 1. Readfile can also be implemented on io_uring once the readfile syscall works and is in the kernel. Yes, same thing: instead of 3 io_uring operations you reduce to 1 once readfile is confirmed as a working syscall.

    Yes, each of those individual operations (open, read and close) results in having to acquire locks. The same locks. Open has you acquire and release locks, read has you acquire and release locks, and close has you acquire and release locks.
    Yeah, I get that. But when I read one file and it needs 3 syscalls instead of one, so what, it doesn't matter. When I read thousands of files, it still doesn't matter, because it's still 2 syscalls due to io_uring (plus the initial io_uring_setup, which needs to be done only once).
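    The syscall counts being compared in this exchange can be written down as simple arithmetic. This is a sketch using only the figures quoted in the thread: 3 syscalls per file for open, read and close, 1 per file for the proposed readfile, and archkde's figure of 2 syscalls per io_uring batch plus a one-time io_uring_setup. The function names are made up for illustration, and the numbers say nothing about locking or actual runtime.

```python
def classic_syscalls(n_files):
    """open + read + close for every file."""
    return 3 * n_files


def readfile_syscalls(n_files):
    """One proposed readfile call per file."""
    return n_files


def io_uring_syscalls(n_batches=1, include_setup=True):
    """Figure quoted in the thread: 2 syscalls per batch, plus one-time setup."""
    return 2 * n_batches + (1 if include_setup else 0)


if __name__ == "__main__":
    # Columns: files, classic, readfile, io_uring (single batch incl. setup)
    for n in (1, 1000):
        print(n, classic_syscalls(n), readfile_syscalls(n), io_uring_syscalls())
```

    Under these assumptions the per-file approaches grow linearly with the number of files, while the batched figure stays constant as long as everything fits in one submission.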

