Announcement

Collapse
No announcement yet.

Linux Changes Pipe Behavior After Breaking Problematic Android Apps On Recent Kernels

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Originally posted by ed31337 View Post
    Microsoft would constantly change things (…) drop, deprecate, etc
    Anecdotally, they are also trying ridiculously hard not to break userspace, to the point where it hurts their API:
    https://www.joelonsoftware.com/2004/...t-the-api-war/

    That's an old post though, so I may be outdated on that.

    Comment


    • #32
      Originally posted by indepe View Post
      As far as I understand, only a small part of the original improvement was reverted (a probably minor optimization that avoids checking for readers).

      Without knowing the API or its use, it seems the strange thing is that the API allows blocked readers even when the pipe isn't empty anymore. However as long as the API allows that situation, then what else should wake those readers? So I guess the correct behavior is to wake blocked readers on any write. And I guess an application which doesn't want that behavior would not have any blocked readers waiting for a non-empty pipe, in the first place.

      Again, I am saying this without knowing the API or its use, just from the description here and in the links. However if I understood it correctly, this partial revert shouldn't be a problem.
      Well if the pipe isn't empty then there are no blocked readers. I.e read() on such a pipe would not block, but the realm library in particular didn't read from the pipe until EAGAIN when it got edge triggered by EPOLL that there where data and then expected to get a new edge trigger again when some one did a write to the pipe even though it wasn't cleared in between.

      Comment


      • #33
        Originally posted by F.Ultra View Post

        Well if the pipe isn't empty then there are no blocked readers. I.e read() on such a pipe would not block, but the realm library in particular didn't read from the pipe until EAGAIN when it got edge triggered by EPOLL that there where data and then expected to get a new edge trigger again when some one did a write to the pipe even though it wasn't cleared in between.
        If "waking readers" refers to callers of epoll_wait(), then thanks for the clarification, however looking at the documentation of epoll, I would (still) expect the previous and now reverted-to behavior. Or at least consider it more likely, and then confirm by testing it. Of course, I may be missing something.

        This is the documentation I am looking at, dated "2021-03-22": https://man7.org/linux/man-pages/man7/epoll.7.html

        Specifically this sentence:

        The reason for this is that edge-triggered mode delivers events only when changes occur on the monitored file descriptor.
        I would expect "changes occur on the monitored file descriptor" to include any write, not just writes on an empty file or pipe.

        The doc does mention reading-until-EAGAIN as a "suggested way" of use, but does not say that this is required for correct functioning. It even starts a sentence like this:

        Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data,
        and it sounds like the only case when that doesn't happen is if the EPOLLONESHOT option is used. (Though after reading the latest commit comment, I get the impression that "chunk of data" might refer to something different than I would espect.)

        So my perhaps naive interpretation is that the original behavior (now re-created) was the one that also better corresponds to the documentation, and that changing it should require a change of documentation. Also, I don't see why it would be obviously non-sensical, given the variety of possible use cases.
        Last edited by indepe; 01 August 2021, 04:14 PM.

        Comment


        • #34
          Originally posted by indepe View Post

          If "waking readers" refers to callers of epoll_wait(), then thanks for the clarification, however looking at the documentation of epoll, I would (still) expect the previous and now reverted-to behavior. Or at least consider it more likely, and then confirm by testing it. Of course, I may be missing something.

          This is the documentation I am looking at, dated "2021-03-22": https://man7.org/linux/man-pages/man7/epoll.7.html

          Specifically this sentence:



          I would expect "changes occur on the monitored file descriptor" to include any write, not just writes on an empty file or pipe.

          The doc does mention reading-until-EAGAIN as a "suggested way" of use, but does not say that this is required for correct functioning. It even starts a sentence like this:



          and it sounds like the only case when that doesn't happen is if the EPOLLONESHOT option is used. (Though after reading the latest commit comment, I get the impression that "chunk of data" might refer to something different than I would espect.)

          So my perhaps naive interpretation is that the original behavior (now re-created) was the one that also better corresponds to the documentation, and that changing it should require a change of documentation. Also, I don't see why it would be obviously non-sensical, given the variety of possible use cases.
          The text that you quoted from the epoll man just states that you are not guaranteed to get just a single event even if edge-triggered under special circumstances. I.e lets say that you receive 2k of data but only do one read of 1k, you now have 1k left but instead of calling read() again until EAGAIN you call epoll_wait, since you are edge triggered you will now not receive a new event for the remaining 1k of data and your application might be stuck waiting for an event that will never happen (this is the bug in the Android library). However if you receive 1k more data then epoll_wait will now return again, and this is what the text in the man page describes a bit badly.

          edit: Ok so I simplified the text above a bit too much, when you receive 1K more data you will normally not get a new edge-trigger from epoll. If this new 1K however means that the kernel read buffer for your descriptor is exhausted, then this will lead to a second edge-trigger.

          edit2: This can also happen if you use the same epoll fd in more than one thread. You get 1K of data, so epoll_wait wakes up thread A who does a read (10240), this exhausts the kernel read buffer since you read all of the 1K that was coming in, but now you get 1K more data so epoll_wait wakes up thread B, but now thread A calls read again to see if there where more data originally or if it will receive EAGAIN while thread B also does a read to get the data from what was triggered, now it's random which one of the two that will get the new 1K and which one will get the EAGAIN. And this is the reason behind the EPOLLONESHOT flag since that tells epoil to only wake thread A, after tread A have read EAGAIN it rearms the fd and first after this epoll can wake up A or B when new data arrives.
          Last edited by F.Ultra; 01 August 2021, 07:44 PM.

          Comment


          • #35
            Originally posted by F.Ultra View Post

            The text that you quoted from the epoll man just states that you are not guaranteed to get just a single event even if edge-triggered under special circumstances. I.e lets say that you receive 2k of data but only do one read of 1k, you now have 1k left but instead of calling read() again until EAGAIN you call epoll_wait, since you are edge triggered you will now not receive a new event for the remaining 1k of data and your application might be stuck waiting for an event that will never happen (this is the bug in the Android library). However if you receive 1k more data then epoll_wait will now return again, and this is what the text in the man page describes a bit badly.

            edit: Ok so I simplified the text above a bit too much, when you receive 1K more data you will normally not get a new edge-trigger from epoll. If this new 1K however means that the kernel read buffer for your descriptor is exhausted, then this will lead to a second edge-trigger.
            Are we referring to the same example ? The text example given with 2 kB data written and 1 kB read, has only one write. So you would not get a second event in any case. The question of a new event (for the edge triggered mode) arises only if there is a second write (after a partial read). And the example doesn't have that.

            It does say:

            Since the read operation done in 4 does not consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block indefinitely.
            However it "might block indefinitely" simply because there might never be another write. As in the example.

            The text isn't really definite one way or the other, however I would put weight on that "changes occur on the monitored file descriptor" is more likely interpreted as "any write" than "a write on an empty file". Why would only a write on an empty file count as a "change"?

            Originally posted by F.Ultra View Post
            edit2: This can also happen if you use the same epoll fd in more than one thread. You get 1K of data, so epoll_wait wakes up thread A who does a read (10240), this exhausts the kernel read buffer since you read all of the 1K that was coming in, but now you get 1K more data so epoll_wait wakes up thread B, but now thread A calls read again to see if there where more data originally or if it will receive EAGAIN while thread B also does a read to get the data from what was triggered, now it's random which one of the two that will get the new 1K and which one will get the EAGAIN. And this is the reason behind the EPOLLONESHOT flag since that tells epoil to only wake thread A, after tread A have read EAGAIN it rearms the fd and first after this epoll can wake up A or B when new data arrives.
            The special case of the same epoll fd being used twice (that is, when there is only one epoll_create for multiple threads calling epoll_wait), is mentioned only after both of my quotes. Therefore both quotes are independent of that special case.

            Comment


            • #36
              Originally posted by indepe View Post

              Are we referring to the same example ? The text example given with 2 kB data written and 1 kB read, has only one write. So you would not get a second event in any case. The question of a new event (for the edge triggered mode) arises only if there is a second write (after a partial read). And the example doesn't have that.

              It does say:



              However it "might block indefinitely" simply because there might never be another write. As in the example.

              The text isn't really definite one way or the other, however I would put weight on that "changes occur on the monitored file descriptor" is more likely interpreted as "any write" than "a write on an empty file". Why would only a write on an empty file count as a "change"?



              The special case of the same epoll fd being used twice (that is, when there is only one epoll_create for multiple threads calling epoll_wait), is mentioned only after both of my quotes. Therefore both quotes are independent of that special case.
              yes you are right, I have run some tests. My memory have failed me here.

              Comment


              • #37
                Originally posted by indepe View Post

                Are we referring to the same example ? The text example given with 2 kB data written and 1 kB read, has only one write. So you would not get a second event in any case. The question of a new event (for the edge triggered mode) arises only if there is a second write (after a partial read). And the example doesn't have that.

                It does say:



                However it "might block indefinitely" simply because there might never be another write. As in the example.

                The text isn't really definite one way or the other, however I would put weight on that "changes occur on the monitored file descriptor" is more likely interpreted as "any write" than "a write on an empty file". Why would only a write on an empty file count as a "change"?



                The special case of the same epoll fd being used twice (that is, when there is only one epoll_create for multiple threads calling epoll_wait), is mentioned only after both of my quotes. Therefore both quotes are independent of that special case.
                Ok, done some more tests now to refresh my memory, was far too long ago that I messed around with epoll. You are correct in that new writes will wake epoll_wait(), but multiple writes to the fd will be aggregated to a single event if they happen when you are not inside epoll_wait() so in e.g this scenario:

                Code:
                1. server calls epoll_wait()
                2. client sends 1K data
                3. epoll_wait() returns one event
                4. client sends 1K data
                5. client sends 1K data
                6.  server reads 1K
                7. server calls epoll_wait() which returns one event
                8. server reads 1K
                9. server calls epoll_wait (); will now not return until client does another write, which could be never
                So the realm library expected that epoll_wait() would return with an event 3 times but got just 2 since the events that happened outside epoill_wait was merged into one.

                Not really sure how the kernel change applies here though, I tested this on both 5.8 (Ubuntu 20.04) and 2.6 (centos6) and had exactly the same behaviour and here 5.8 should have the "old" behaviour and 2.6 should have the changed one since the "old" behaviour was introduced in 5.5 unless redhat backported it to 2.6 of course.

                Comment


                • #38
                  Originally posted by indepe View Post

                  Are we referring to the same example ? The text example given with 2 kB data written and 1 kB read, has only one write. So you would not get a second event in any case. The question of a new event (for the edge triggered mode) arises only if there is a second write (after a partial read). And the example doesn't have that.

                  It does say:



                  However it "might block indefinitely" simply because there might never be another write. As in the example.

                  The text isn't really definite one way or the other, however I would put weight on that "changes occur on the monitored file descriptor" is more likely interpreted as "any write" than "a write on an empty file". Why would only a write on an empty file count as a "change"?



                  The special case of the same epoll fd being used twice (that is, when there is only one epoll_create for multiple threads calling epoll_wait), is mentioned only after both of my quotes. Therefore both quotes are independent of that special case.
                  OK, back a third time now. Sorry for spamming you but the edit timeout has since long passed. I made a terrible mistake in my earlier tests where I falsely assumed that stdin was implemented as a pipe in Linux.

                  When I changed to use a proper pipe then the behaviour changed between 5.8.0 and 2.6.32; in 2.6.32 writes to the pipe woke up epoll_wait with the aggregated behaviour that I described in my previous post which also is how epoll_wait wakes up for sockets in both 2.6 and 5.8. In 5.8 however with pipes after two mergers of events (and if one did not empty the pipe by reading until EAGAIN) epoll_wait hanged no matter how many more writes I did after that.

                  So the change back in 5.5 made pipes behave differently than sockets in regards to epoll, so Linus reverting to the old behaviour is perhaps not only due to the breakage of the realm-core library but also to have the same behaviour as sockets.

                  Comment


                  • #39
                    Originally posted by F.Ultra View Post
                    In 5.8 however with pipes after two mergers of events (and if one did not empty the pipe by reading until EAGAIN) epoll_wait hanged no matter how many more writes I did after that.
                    Yes, I think this is the crucial point. (I'm not concerned with aggregation, it is correct to aggregate.)

                    Originally posted by F.Ultra View Post
                    So the change back in 5.5 made pipes behave differently than sockets in regards to epoll, so Linus reverting to the old behaviour is perhaps not only due to the breakage of the realm-core library but also to have the same behaviour as sockets.
                    Yes, and also, I think, better conforming to the man page that I quoted. Good to know that all are in line with each other now: stdin, pipes, and sockets.
                    Last edited by indepe; 04 August 2021, 03:54 AM.

                    Comment


                    • #40
                      Originally posted by indepe View Post

                      Yes, I think this is the crucial point. (I'm not concerned with aggregation, it is correct to aggregate.)



                      Yes, and also, I think, better conforming to the man page that I quoted. Good to know that all are in line with each other now: stdin, pipes, and sockets.
                      Actually I would have preferred if stdin and sockets would be changed to behave like the pipes did before this change, should lessen the internal strain on epoll if it's only woken on empty buffers.

                      Comment

                      Working...
                      X