Linux 5.9 To Allow Controlling Page Lock Unfairness In Addressing Performance Regression


  • #21
    Originally posted by yoshi314 View Post

    this is because most of those non-benchmark articles here are just copy-paste from official announcements, with no added value. i'd rather have something done with a quarter of the effort LWN articles are done with: writing with some insights, even on a small range of topics.
    Phoronix does excellent HW reviews IMO.



    • #22
      Originally posted by indepe View Post
      And if you have 250 user threads but only 32 or 64 CPUs, and an unfair lock that gets a number of threads out of the way (burying them in waiting lists), then that may not have the same negative effect as it could otherwise. In the Apache test this may be balanced by network response for 200/250 users being better with more threads active.
      We've all been discussing that sort of scenario, but I realised last night - while performing an rsync that consistently makes my HTPC *utterly unresponsive*, despite < 10% CPU load (and 3 of its 4 cores completely idle), no memory pressure, etc - that you don't even need that. All you need is ONE task hammering a global lock, and you've got a machine that is literally unusable regardless of how many idle resources it has.

      Most of the time this gets blamed on the scheduler sucking, but my guess is that the real problem, in this case at least, is simply that the network stack is permanently locking out almost the entire system from *something*, and through favoritism / warmth / whatever, it's KEEPING that lock 99.9% of the time.

      So a 30Mb/s rsync, on a quad-core machine, still results in a system where even the mouse doesn't work. And absolutely NOTHING you can do as a user (nice, ionice, etc) has any impact on it at all. It's amazing to me that Linux can still fall over SO completely for such trivial reasons.

      This has been going on forever: it didn't get worse in 5.9, and won't be affected by this at all. But it's an example of how a simple fairness issue can have a massive impact on an entire system even if that system is overwhelmingly IDLE.



      • #23
        Originally posted by arQon View Post
        We've all been discussing that sort of scenario, but I realised last night - while performing an rsync that consistently makes my HTPC *utterly unresponsive*, despite < 10% CPU load (and 3 of its 4 cores completely idle), no memory pressure, etc - that you don't even need that. All you need is ONE task hammering a global lock, and you've got a machine that is literally unusable regardless of how many idle resources it has.
        [...]
        Right, you don't need that, it's just what I think might happen in the Apache test. Although I am surprised that such a global lock (still) exists, it is not a surprise that an unfair implementation can allow a single thread to "occupy" the lock all for itself. And it's a good point.

        I'd consider the existence of such a lock a performance bug... I don't know if it would be possible to write an automated test for it.
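        Something like the following might be a starting point for such a test (just a userspace sketch: a pthread mutex stands in for the lock under test, and the 10 ms pass/fail threshold is completely arbitrary) - one thread hammers the lock in a tight loop while a second one checks how long it can be made to wait:

```c
/* Sketch of a fairness test: one thread hammers a lock, a second thread
 * takes it occasionally and records its worst-case wait. A pthread mutex
 * stands in for the lock under test; the 10 ms threshold is arbitrary. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int stop;
static double worst_ms;

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static void *hammer(void *arg)          /* grabs the lock as often as it can */
{
    (void)arg;
    while (!atomic_load(&stop)) {
        pthread_mutex_lock(&lock);
        /* short critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *victim(void *arg)          /* takes the lock once per millisecond */
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        double t0 = now_ms();
        pthread_mutex_lock(&lock);
        double waited = now_ms() - t0;
        if (waited > worst_ms)
            worst_ms = waited;
        pthread_mutex_unlock(&lock);
        usleep(1000);
    }
    atomic_store(&stop, 1);
    return NULL;
}

int main(void)
{
    pthread_t h, v;
    pthread_create(&h, NULL, hammer, NULL);
    pthread_create(&v, NULL, victim, NULL);
    pthread_join(v, NULL);
    pthread_join(h, NULL);
    printf("worst-case wait: %.3f ms\n", worst_ms);
    return worst_ms > 10.0;             /* fail if starvation exceeds 10 ms */
}
```

        The idea being that with a fair lock the worst case should stay small, while an unfair one can let it blow up - which is the kind of thing an automated run could flag.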



        • #24
          Originally posted by indepe View Post
          Right, you don't need that, it's just what I think might happen in the Apache test.
          I was somewhat OT with that post - sorry for the confusion. I think you've understood the Apache case just fine, I was just pointing out that there's a variation on this specific problem that is actually far worse not just in theory, but also in practice. (And also isn't just a benchmark / oversubscribed HW issue).

          Originally posted by indepe View Post
          I'd consider the existence of such a lock a performance bug...I don't know if it would be possible to write an automated test for it.
          I'm sure it would be trivial to, given that it's 100% reproducible and only requires a single command to trigger.
          It may be that it requires a specific driver though, which makes *running* that AT harder.

          That's a secondary aspect though: the point is really that, to this day, even with the "Big Kernel Lock" now years gone, there are things like page tables that can act as one - and when the code controlling ownership of those resources is as broken as this code was / can be, you're ALWAYS going to keep running into utter trainwrecks like my "rsync over wifi" case.
          And frankly, that's a joke: it's the "fullscreen video" scenario again. (I don't remember where I first saw it, but it's an Onion-like article about a kernel release: 'Linux kernel x.y supports up to 4096 CPUs and 14EB /boot partitions. Torvalds says being able to play 720p video at 30fps without tearing or dropped frames "should be possible in another decade or two".' Something like that. :P).

          Even if the rsync problem on that machine IS a driver issue, we're still talking about a single-threaded piece of code that, for whatever reason, can kill a multi-core machine. I honestly can't even come up with a way for that to happen at all (other than a global lock, basically) - moving a mouse cursor REALLY doesn't require multiple GB of RAM and multiple GHz of CPU (although some DEs are getting close :P). That's not "meh, the driver sucks", it's "there's a massive fundamental problem lurking in the OS itself". So not only is this one not going to be the last such problem, but this one is only barely above "toy" status by comparison to at least one of the others.

          Still, maybe this incident will prompt someone to go looking for other such cases. At least Linus understands that *deliberately* ratf**king the entire system for the sake of an extra 0.1% on a synthetic benchmark is not an acceptable way to do things, so if someone does unearth them they'll probably get fixed eventually.



          • #25
            Originally posted by arQon View Post
            I was somewhat OT with that post - sorry for the confusion. I think you've understood the Apache case just fine, I was just pointing out that there's a variation on this specific problem that is actually far worse not just in theory, but also in practice. (And also isn't just a benchmark / oversubscribed HW issue).
            Yes, and Linus Torvalds also mentioned that there are several cases where unfair locks end up triggering watchdogs. That's maybe not quite as bad as your example, but quite bad already. None of this is theoretical. Even if we don't know what _exactly_ happens in that specific Apache test, and why _exactly_ the numbers get better or worse for a specific setting, we do know that all these variations happen in some cases.

            Speaking in general about locks, I think there is a lack of benchmarks that are good at simulating "real" use cases. Many of those highlighting lock performance test locks either without contention, or with nothing but contention (cases where a single thread would be faster or at least have a big advantage).
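            For example, a lock microbenchmark could make the amount of work done outside the critical section a command-line knob instead of only testing the two extremes. A rough sketch (thread count and iteration counts are made up):

```c
/* Sketch of a lock benchmark where "outside work" is a knob: outside_work
 * of 0 is pure contention, a large value approaches the uncontended case.
 * Real workloads sit somewhere in between. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 8
#define ITERS    100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;
static long outside_work;        /* iterations of private work per lock round */

static void *worker(void *arg)
{
    volatile long local = 0;
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        for (long j = 0; j < outside_work; j++)   /* work that needs no lock */
            local++;
        pthread_mutex_lock(&lock);
        shared_counter++;                         /* tiny critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t t[NTHREADS];
    outside_work = argc > 1 ? atol(argv[1]) : 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("outside_work=%ld counter=%ld\n", outside_work, shared_counter);
    return 0;
}
```

            Timing runs with outside_work at 0 versus something large would show how different the pure-contention picture is from a more realistic mix.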

            As I said before, fairness in a lock is not just about reducing the worst-case latency, but also about keeping as many threads (CPUs) "in business" as possible, for both latency and throughput. When possible, lockless solutions are usually best at that.

            Your example might often be seen as a case where worst-case latency becomes catastrophic for all threads but one. However that is superficial, since it is also a good example of all but one CPU getting blocked, including those that maybe need the lock only once in a while, resulting in bad overall throughput.
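            As a toy illustration of the difference (userspace C11 atomics, not the kernel's actual page lock code): with an unfair test-and-set lock, the thread that just released it tends to win it right back because the cache line is still hot, while a fair ticket lock hands it over strictly in arrival order:

```c
/* Toy comparison of an unfair test-and-set spinlock and a fair ticket
 * lock, using C11 atomics. This is NOT the kernel's page lock, just an
 * illustration of the fairness trade-off. */
#include <stdatomic.h>
#include <stdbool.h>

/* Unfair: whichever thread wins the exchange gets the lock. The thread
 * that just released it often wins again because the cache line is still
 * hot in its core, so other waiters can be starved indefinitely. */
struct tas_lock { atomic_bool locked; };

static void tas_acquire(struct tas_lock *l)
{
    while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
        ;  /* spin */
}

static void tas_release(struct tas_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Fair: each waiter draws a ticket and the lock is granted strictly in
 * ticket order, so nobody can jump the queue - but every handoff stalls
 * until the next thread in line notices it is being served. */
struct ticket_lock { atomic_uint next; atomic_uint serving; };

static void ticket_acquire(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
        ;  /* spin until it's our turn */
}

static void ticket_release(struct ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->serving, 1, memory_order_release);
}

int main(void)
{
    static struct tas_lock a;       /* zero-initialized: unlocked */
    static struct ticket_lock b;    /* zero-initialized: nobody waiting */

    tas_acquire(&a);
    tas_release(&a);
    ticket_acquire(&b);
    ticket_release(&b);
    return 0;
}
```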



            • #26
              Originally posted by indepe View Post
              Speaking in general about locks
              (snip)
              Absolutely. There's always going to be SOME extra load on the machine just from IO etc, but if a benchmark is really only measuring something like "what's the max performance of Apache on an otherwise-idle server?", you're always potentially going to be missing something utterly trivial that cuts that performance in half in the real world, or worse. (Like, say, a wifi driver that's apparently locking and releasing something that's effectively the BKL on a per-byte basis, let alone a per-packet basis! :P).

              That's not to say such benchmarks aren't both valid and valuable in their own right, but it DOES mean you're really just evaluating a single application and a fraction of the kernel. (And, I would bet, also simply simulating the exact same environment that the software was already tested and tuned in).

              The problem with "unfairness" (and let's call it what it really is: bad code) is that it DOES make pathological cases viable. I've NEVER worked on a system where worst-case / predictable throughput wasn't far more important than *peak* throughput - and that includes everything from trading systems all the way down to toys like webservers and NAS's etc.

              re lockless, I think that's a slightly different situation: there, you're really looking at optimising the app's "INTERNAL" costs (that is, avoid paying for the context switch and the kspace primitives, etc) rather than improving the overall throughput of the SYSTEM. I don't think the two are really related, other than that improving one is likely to also improve the other just as a byproduct. Just writing "bad" locking code (e.g. naive read locks) is also simply wasting cycles, and ultimately that's a finite resource across the machine as a whole.
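              For what it's worth, the simplest example of that trade-off is a shared counter (a trivial sketch): the mutex version can end up sleeping in the kernel under contention, while the atomic version never leaves userspace and only pays for the cache-line traffic:

```c
/* Trivial example of replacing a lock with an atomic: each thread bumps a
 * shared counter. The mutex version can sleep in the kernel under
 * contention; the atomic version stays in userspace. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static long locked_counter;
static atomic_long lockfree_counter;

static void *locked_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&mtx);
        locked_counter++;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

static void *lockfree_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(&lockfree_counter, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, locked_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, lockfree_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    printf("locked=%ld lockfree=%ld\n", locked_counter,
           atomic_load(&lockfree_counter));
    return 0;
}
```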

              But the problem we have right now is not that processes are sub-optimal internally, it's that ONE of them, coupled with these defects in the kernel, can effectively take down the whole machine. The solution to the Apache problem isn't to put braindead (sorry Linus) hacks into the kernel so that Apache can KEEP behaving badly, it's to figure out exactly what piece of Apache IS behaving badly, and fix it. Like I say, cycles are a finite resource, and the critical piece here is NOT "Apache is x% slower on the new kernel", it's "The OLD kernel is letting Apache MASSIVELY out-consume resources that it *absolutely should not have the right to* barring a nice/etc EXPLICIT permission to do so". neh?



              • #27
                Originally posted by arQon View Post
                Absolutely. There's always going to be SOME extra load on the machine just from IO etc, but if a benchmark is really only measuring something like "what's the max performance of Apache on an otherwise-idle server?", you're always potentially going to be missing something utterly trivial that cuts that performance in half in the real world, or worse. (Like, say, a wifi driver that's apparently locking and releasing something that's effectively the BKL on a per-byte basis, let alone a per-packet basis! :P).

                That's not to say such benchmarks aren't both valid and valuable in their own right, but it DOES mean you're really just evaluating a single application and a fraction of the kernel. (And, I would bet, also simply simulating the exact same environment that the software was already tested and tuned in).
                Of course benchmarks and tests need to be done at different levels. In this case, a kernel change became the focus of attention which caused a specific regression in performance. This specific change appears to have the question of unfair vs fair locking at its center. Sure, this perspective may just be a partial understanding of "the problem", depending on which scope you are willing to look at. However (so far) it appears to be an avoidable regression on its own. So I'd think both perspectives are valid. I guess that means we mostly agree.

                Originally posted by arQon View Post
                The problem with "unfairness" (and let's call it what it really is: bad code) is that it DOES make pathological cases viable. I've NEVER worked on a system where worst-case / predictable throughput wasn't far more important than *peak* throughput - and that includes everything from trading systems all the way down to toys like webservers and NAS's etc.
                Yes, I'd expect this to be true for most larger systems. Of course with exceptions, which I would expect to be mostly applications using batch processing to solve one specific computation. And even then this will probably change when such an application becomes more complex.

                Originally posted by arQon View Post
                re lockless, I think that's a slightly different situation: there, you're really looking at optimising the app's "INTERNAL" costs (that is, avoid paying for the context switch and the kspace primitives, etc) rather than improving the overall throughput of the SYSTEM. I don't think the two are really related, other than that improving one is likely to also improve the other just as a byproduct. Just writing "bad" locking code (e.g. naive read locks) is also simply wasting cycles, and ultimately that's a finite resource across the machine as a whole.
                I don't know why you would say that about lockless solutions. They can be used at all levels (but not in all cases) where locks are used. That is, in the interaction between threads and processes. Be that in the kernel or in the application, potentially at the global level, the file level, or the page level. It depends more on the logic than the context. And unfortunately, they often require a _lot_ of thinking unless generalized functions are available that apply to a specific use case.

                Originally posted by arQon View Post
                But the problem we have right now is not that processes are sub-optimal internally, it's that ONE of them, coupled with these defects in the kernel, can effectively take down the whole machine. The solution to the Apache problem isn't to put braindead (sorry Linus) hacks into the kernel so that Apache can KEEP behaving badly, it's to figure out exactly what piece of Apache IS behaving badly, and fix it. Like I say, cycles are a finite resource, and the critical piece here is NOT "Apache is x% slower on the new kernel", it's "The OLD kernel is letting Apache MASSIVELY out-consume resources that it *absolutely should not have the right to* barring a nice/etc EXPLICIT permission to do so". neh?
                I'm not sure. So far I don't see a reason to think that the Apache test brings down the "whole machine", although one could imagine that it might, since that has happened in similar situations. I wouldn't know how similar it is to the "rsync" problem you are referring to.

                The Apache problem appears to be how to efficiently handle the situation where many threads (users) access the same file ("test.html") for read-only purposes. At least that's my understanding of it. As I previously wrote (I think in the comments on a preceding article), Apache might approach this problem with a cache of shared read-only data. I don't know if it already does so in one way or another, but it appears that in this specific test the "user" threads eventually hit kernel page locking with contention, and this becomes an issue. Independently of how Apache as an application might address this with other means, it appears that there is a certain interest in the kernel handling this situation well, or at least keeping it from regressing unnecessarily. I suppose there is a general question regarding the efficiency of page locking.
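                Purely hypothetically (this is not what Apache actually does, just the idea of such a cache), it could look something like this: load the file into memory exactly once, then let every worker thread serve from the in-memory copy, so the per-request path doesn't have to go back to the file (and its page locks) at all:

```c
/* Hypothetical sketch of the "shared read-only cache" idea: the file is
 * read into memory exactly once (pthread_once), after which every worker
 * thread serves requests from the in-memory copy instead of re-reading
 * the file on each request. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_once_t load_once = PTHREAD_ONCE_INIT;
static char *cached_data;
static long  cached_len;

static void load_file(void)
{
    FILE *f = fopen("test.html", "rb");   /* the file from the benchmark */
    if (!f)
        return;
    fseek(f, 0, SEEK_END);
    cached_len = ftell(f);
    rewind(f);
    cached_data = malloc(cached_len);
    if (cached_data)
        cached_len = (long)fread(cached_data, 1, cached_len, f);
    fclose(f);
}

/* Called by each worker thread; only the first caller pays for the load. */
static const char *get_cached(long *len)
{
    pthread_once(&load_once, load_file);
    *len = cached_len;
    return cached_data;
}

int main(void)
{
    long len = 0;
    const char *data = get_cached(&len);
    printf("cached %ld bytes at %p\n", len, (const void *)data);
    free(cached_data);
    return 0;
}
```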

                EDIT: The problem here doesn't seem to be a global lock, but the locks specific to each page/file. If that is correct, I would expect a problem for the whole machine only if there are specific files that are accessed by all processes, and if these accesses are subject to the same problem.

                Perhaps a good solution of this case will also be somewhat applicable to other situations (such as those which you mention), so I'm curious what the solution will be.
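                For reference, the change this article is about exposes that choice as a runtime knob - if I have it right, it shows up as /proc/sys/vm/page_lock_unfairness (roughly: how many times the page lock may be "stolen" from a waiter before the kernel falls back to a fair handoff, with 0 meaning fully fair). A trivial way to check what a given kernel is running with:

```c
/* Read the vm.page_lock_unfairness knob (assuming it is exposed as
 * /proc/sys/vm/page_lock_unfairness on the running kernel). 0 should mean
 * a fully fair page lock; larger values allow that many "steals" before
 * the kernel forces a fair handoff. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "r");
    int val;

    if (!f) {
        perror("page_lock_unfairness not available on this kernel");
        return 1;
    }
    if (fscanf(f, "%d", &val) == 1)
        printf("vm.page_lock_unfairness = %d\n", val);
    fclose(f);
    return 0;
}
```

                Writing a different value back (as root, e.g. sysctl -w vm.page_lock_unfairness=0) should make it possible to compare fair and unfair behaviour on the same kernel.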
                Last edited by indepe; 26 September 2020, 06:52 PM.

