The Latest On The Linux 5.9 Kernel Regression Stemming From Page Lock Fairness


  • #11
    Excellent write-up! This content is exactly why I am a lifetime subscriber to this fine site! Keep up the great work, Michael!



    • #12
      I'm unfortunately not on the kernel mailing lists, so here are my thoughts.

      There are two types of unfairness described:

      a/ the running process can retake the lock before the designated receiver has had a chance to run.
      b/ if the receiver tries and fails to grab the lock, it is put back at the end of the wait queue.

      What if we address only b/, since we have seen that addressing a/ has a performance impact? Here is how I would do it:

      In the queue, add another field for the PID of the designated receiver.

      - After unlocking, finding the "receiver", and marking it, also put its PID in the "last designated" field.
      - When a process fails to take the lock, if its PID matches the "last designated" field, put it at the head of the queue, not at the end.

      I think this would address the worst latency outliers without impacting the overall performance.
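
      A minimal sketch of that idea, purely for illustration; the struct and field names are invented, and this is not the kernel's real page wait-queue code:

      Code:
      /* Sketch of the proposal above: a FIFO wait queue plus a "last
       * designated" PID. A waiter that was designated at unlock but lost
       * the race to a barging process re-enters at the head of the queue
       * instead of the tail. All names are invented for illustration. */
      #include <stdbool.h>
      #include <stddef.h>
      #include <sys/types.h>

      struct waiter {
          pid_t          pid;
          struct waiter *next;
      };

      struct page_lock {
          bool           locked;
          pid_t          last_designated;  /* PID picked at the last unlock */
          struct waiter *head;             /* FIFO wait queue */
          struct waiter *tail;
      };

      /* Failed lock attempt: requeue. The previously designated receiver
       * goes back to the head, everyone else to the tail (the current
       * behaviour). */
      static void requeue_waiter(struct page_lock *l, struct waiter *w)
      {
          if (w->pid == l->last_designated) {
              w->next = l->head;           /* head: it keeps its turn */
              l->head = w;
              if (!l->tail)
                  l->tail = w;
          } else {
              w->next = NULL;              /* tail: normal FIFO order */
              if (l->tail)
                  l->tail->next = w;
              else
                  l->head = w;
              l->tail = w;
          }
      }

      /* Unlock: mark the head of the queue as the designated receiver and
       * remember its PID so a lost race does not push it to the back again. */
      static void unlock_page(struct page_lock *l)
      {
          l->locked = false;
          if (l->head)
              l->last_designated = l->head->pid;
          /* ...wake the designated waiter here... */
      }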



      • #13
        I really enjoy this series of articles; this is such great work.



        • #14
          Nice work, Michael!



          • #15
            Michael, nice job bisecting and bringing this to attention.



            • #16
              Originally posted by GreenByte
              The first idea that came to mind was to switch between the two behaviours: upon one unlock, allow the fastest process to snatch the lock; upon the second one, give it to the queued process. I'm sure the proportions could be tuned and some cleverness could be added, but it could make for the best of both worlds (if it is not a horrible idea).
              This is an excellent solution. The problem with Linus's patch is that it is too fair.

              When unlocking the page, you have to decide which process gets it next:
              Process #1: some very old process that has been waiting a long time to get this page lock (Linus's fairness patch), or
              Process #2: some extremely recent process that happens to want the lock just as it becomes available.

              If you pre-empt #2 in favor of giving the lock to #1, you've just wasted all the data that process #2 has already loaded into the CPU. That is the likely source of this performance regression, because you get hit with a double whammy:
              1. Old Process #1 has to "spin up" by loading its data from the distant, slow system memory instead of using the CPU's cache memories, and
              2. in doing so, the CPU cache memory containing Process #2's data gets clobbered, so Process #2 has to "spin up" too once it finally gets the lock much later on.

              Your solution solves these problems.

              Give the lock to process #2 by default, as it is already primed and ready-to-go from the CPU's cache memories. Once #2 is done and there are no more processes "just on the verge" of acquiring the lock, go back to the "fairness" algorithm and give the lock to Process #1. At this point, there is no cache advantage among any of the waiting processes, and there is less chance of clobbering good data in the CPU's caches too.
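
              A rough sketch of that alternation, just to make the idea concrete; the counter-based toggle and all names here are my own invention, not anything from the actual kernel:

              Code:
              /* Illustrative only: alternate between an "unfair" unlock (leave
               * the lock free so a running, cache-hot process can snatch it)
               * and a "fair" unlock (hand the lock directly to the oldest
               * waiter). Invented names, not real kernel code. */
              #include <stdbool.h>

              struct waiter {
                  bool           owns_lock;   /* set when the lock is handed over */
                  struct waiter *next;
              };

              struct alt_lock {
                  bool           locked;
                  unsigned int   unlocks;     /* toggles fair vs. unfair turns */
                  struct waiter *queue_head;  /* oldest waiter first */
              };

              static void alt_unlock(struct alt_lock *l)
              {
                  struct waiter *oldest = l->queue_head;

                  if ((l->unlocks++ & 1) && oldest) {
                      /* Fair turn: transfer ownership directly to the oldest
                       * waiter so no newcomer can barge in ahead of it. */
                      l->queue_head = oldest->next;
                      oldest->owns_lock = true;   /* and wake it, in real code */
                  } else {
                      /* Unfair turn: just drop the lock; whichever running
                       * process tries first (still warm in the CPU caches)
                       * wins it. */
                      l->locked = false;
                  }
              }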
              Last edited by ed31337; 13 September 2020, 05:51 PM.



              • #17
                Interestingly enough, this issue is not new, unknown, or unexpected. If you look at Java's ReentrantLock documentation, you will see this note:

                The constructor for this class accepts an optional fairness parameter. When set true, under contention, locks favor granting access to the longest-waiting thread. Otherwise this lock does not guarantee any particular access order. Programs using fair locks accessed by many threads may display lower overall throughput (i.e., are slower; often much slower) than those using the default setting, but have smaller variances in times to obtain locks and guarantee lack of starvation. Note however, that fairness of locks does not guarantee fairness of thread scheduling. Thus, one of many threads using a fair lock may obtain it multiple times in succession while other active threads are not progressing and not currently holding the lock. Also note that the untimed tryLock method does not honor the fairness setting. It will succeed if the lock is available even if other threads are waiting.
                https://docs.oracle.com/javase/7/doc...ntLock(boolean)

                This issue has been known for as long as locks with wait-lists have existed. It's not unique to Linux, or even to operating systems per se.

                A couple of ways to mitigate this in the kernel (something that would not be available in Java) would be to have a scheduler hook for the wait queue that transfers the lock and force-wakes the thread, guaranteeing that a newly woken thread holding a lock gets executed in short order (with some guaranteed real-time priority for a few tens of ms); or to inspect the wait queue and only give the lock to a thread that is actually awake or about to be scheduled (again, polling the scheduler for that info).
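
                For the second option, a simplified sketch of what "only hand the lock to a waiter that is already awake" could look like; is_runnable() here is an assumed stand-in for a real scheduler query, and none of this is actual kernel code:

                Code:
                /* Prefer a waiter that is already runnable; fall back to the
                 * head of the queue (classic fair handoff) if everyone is
                 * asleep. is_runnable() stands in for a scheduler query. */
                #include <stdbool.h>
                #include <stddef.h>

                struct waiter {
                    bool           runnable;  /* set by an (imaginary) scheduler hook */
                    struct waiter *next;
                };

                static bool is_runnable(const struct waiter *w)
                {
                    return w->runnable;       /* placeholder for the real query */
                }

                static struct waiter *pick_receiver(struct waiter *queue_head)
                {
                    for (struct waiter *w = queue_head; w != NULL; w = w->next)
                        if (is_runnable(w))
                            return w;
                    return queue_head;        /* nobody awake: fall back to FIFO */
                }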

                There are no easy/great solutions because, again, it's a very well-known problem that has been looked at extensively in parallel computing.
                Last edited by arcivanov; 13 September 2020, 08:12 PM.



                • #18
                  Originally posted by arcivanov
                  There are no easy/great solutions because, again, it's a very well-known problem that has been looked at extensively in parallel computing.
                  Specific situations sometimes allow specific solutions which would not be used in a generally available lock type.

                  Have you, for example, seen the recent FUTEX_SWAP proposal? (Even though it probably doesn't directly apply in this case, just as an example).



                  • #19
                    Originally posted by ed31337
                    Give the lock to process #2 by default, as it is already primed and ready-to-go from the CPU's cache memories. Once #2 is done and there are no more processes "just on the verge" of acquiring the lock, go back to the "fairness" algorithm and give the lock to Process #1. At this point, there is no cache advantage among any of the waiting processes, and there is less chance of clobbering good data in the CPU's caches too.
                    That seems to be no different from the previous behavior. Or how is it different?

                    However, one could let process #2 go through the lock with its cached memory, and on unlock let it do a futex_swap-like operation with process #1 (if both are able to swap run conditions). Just another option; not sure how it would work out.

                    EDIT: Or another possibility: The CPU running process #2 swaps to run process #1 through the lock, and then swaps back to run process #2, with hopefully most of the cache still valid.
                    Last edited by indepe; 13 September 2020, 08:40 PM.



                    • #20
                      Originally posted by indepe
                      Have you, for example, seen the recent FUTEX_SWAP proposal? (Even though it probably doesn't directly apply in this case, just as an example).
                      FUTEX_SWAP solves a different problem entirely, unfortunately. Firstly, it specifically addresses a scenario of two interacting threads, and secondly, the problem here isn't lock exchange but rather the resolution of wait-list priority and the temporal artifact of passing a lock to a thread that is asleep and then waiting an unbounded amount of time for it to wake up. With the unfair behavior, total progress is virtually never obstructed in that way; that's why it's faster by a factor of 2.

                      Again, this is not a new problem and not specifically related to page locking; it's just the general "lock with cold waiters" case. You almost want to start spinning up waiters, transitioning them into a spin loop as they approach the head of the queue, but that would be terrible for CPU utilization and power management.
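
                      Purely as an illustration of that trade-off (invented names, nothing from real kernel code), the decision could look something like this, with the spin threshold being exactly what costs CPU time and power:

                      Code:
                      /* Sketch of "spin up waiters as they approach the head of the
                       * queue": a waiter busy-waits instead of sleeping once it is
                       * within SPIN_DEPTH positions of the head, so it is already
                       * running (and cache-hot) when the lock is handed to it. */
                      #include <stdbool.h>

                      #define SPIN_DEPTH 2   /* how close to the head before spinning */

                      struct waiter {
                          struct waiter *next;
                      };

                      /* Spin (stay hot, at the cost of CPU time and power) or sleep,
                       * based on the waiter's distance from the head of the queue. */
                      static bool should_spin(const struct waiter *head,
                                              const struct waiter *w)
                      {
                          int pos = 0;

                          for (const struct waiter *it = head; it && it != w; it = it->next)
                              pos++;
                          return pos < SPIN_DEPTH;
                      }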
