FUTEX2 Spun Up A Fifth Time For This Linux Interface To Help Windows Games


  • #31
    Originally posted by indepe View Post
    If you are indeed referring to the benchmarks above, there were two that used yield(): "spinlock" and "ticket_spinlock".

    "spinlock" actually had much better latencies on Linux, whereas "ticket_spinlock" had much better latencies on Windows. (I wonder what the results of a really good benchmark would have been.)

    All this got very confused by the dominant treatment of the fictitious "idle times", which made Windows look better than it deserved. However, the whole test, and especially the lock implementations, were not really done by experts, so all of this needs to be taken in context.
    No, I was referring to a long list of undisclosed Windows games which heavily rely on the behaviour of the Windows scheduler to time actions between threads. The particular case that you are referring to was a similar one, where the devs used spinlocks in userspace (bad fucking idea) and called sched_yield() after releasing the spinlock, with the false idea that this would put the releasing thread at the back of the scheduler queue and put the thread waiting on the spinlock back in the running state. On Linux, however, this would (more frequently than on Windows) just have the first thread keep on running, since the Linux scheduler had no idea that the thread had just released a spinlock that another thread was waiting on, and its heuristics told it that this thread needed more CPU time.
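
    Something like this minimal sketch is the shape of what those devs wrote (illustrative only; the actual game code was never published):

    Code:
    #include <sched.h>
    #include <stdatomic.h>

    static atomic_flag lock_word = ATOMIC_FLAG_INIT;

    static void bad_lock(void)
    {
        /* Burn CPU until the flag becomes free. */
        while (atomic_flag_test_and_set_explicit(&lock_word,
                                                 memory_order_acquire))
            ;
    }

    static void bad_unlock(void)
    {
        atomic_flag_clear_explicit(&lock_word, memory_order_release);
        sched_yield();  /* false hope: "now please run the waiting thread" */
    }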

    On Windows, however, it seems that the scheduler always (or at least more frequently) simply switches threads when its sched_yield() equivalent is called. And, as Linus wrote to the dev, even that only worked because the game was the only app on the system: run a few other apps on the same machine and "schedule the next thread" will not pick the thread that the game intended but one from some other app, and the latencies go out the window, so to speak.

    This is why mutexes are a far better choice: when you lock or wait on one you also inform the kernel/scheduler of what you are doing, so it can take the correct action.
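
    For contrast, here is the same idea as a plain pthread mutex, where the kernel knows exactly who is waiting and whom to wake:

    Code:
    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void good_lock(void)
    {
        pthread_mutex_lock(&m);    /* on contention: sleep in the kernel */
    }

    static void good_unlock(void)
    {
        pthread_mutex_unlock(&m);  /* wakes a sleeping waiter, if any */
    }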

    That benchmark didn't really measure what it thought it measured. Thankfully it looks like the dev took the hints from Linus to heart and became a better developer because of it, so in the end it all turned out for the good.



    • #32
      Originally posted by F.Ultra View Post
      ...
      Nice summary. Thank you.



      • #33
        Originally posted by coder View Post
        It really wasn't. I read all of Malte Skarupke & Linus' posts on the RWT forums (as well as the original blog posts with more details) and it basically comes down to the fact that userspace spinlocks are just a bad idea. Also, sched_yield() wasn't equivalent to Sleep(0) or whatever other idiom Windows games used to force a context switch.
        So did I, and I agree with everything you say here.

        Nevertheless, his benchmarks don't show, in a general sense, that these were "userspace spinlocks that didn't work so well, in Linux". The exception is, mostly, one that doesn't work very well on Windows either. And they don't show, in a general sense, that Windows has "scheduling code to make crap like that work well" compared to Linux, even if in some cases it does.

        Originally posted by coder View Post
        That's completely happenstance! It's highly subject to the specific system under test and what else is going on!
        That's part of what I said.
        Last edited by indepe; 11 July 2021, 05:57 PM.



        • #34
          Originally posted by F.Ultra View Post
          No, I was referring to a long list of undisclosed Windows games which heavily rely on the behaviour of the Windows scheduler to time actions between threads.
          Then that is something I don't know about and might be interested in.

          Originally posted by F.Ultra View Post
          The particular case that you are referring to was a similar one, where the devs used spinlocks in userspace (bad fucking idea) and called sched_yield() after releasing the spinlock, with the false idea that this would put the releasing thread at the back of the scheduler queue and put the thread waiting on the spinlock back in the running state.
          In the case that I was referring to, sched_yield() was called while trying to acquire the lock.

          Originally posted by F.Ultra View Post
          On Linux, however, this would (more frequently than on Windows) just have the first thread keep on running, since the Linux scheduler had no idea that the thread had just released a spinlock that another thread was waiting on, and its heuristics told it that this thread needed more CPU time.

          On Windows, however, it seems that the scheduler always (or at least more frequently) simply switches threads when its sched_yield() equivalent is called. And, as Linus wrote to the dev, even that only worked because the game was the only app on the system: run a few other apps on the same machine and "schedule the next thread" will not pick the thread that the game intended but one from some other app, and the latencies go out the window, so to speak.
          That's the theory, but it's only part of what happens in practice. It doesn't mean that spinlocks generally and necessarily perform better on Windows than on Linux.

          Originally posted by F.Ultra View Post
          This is why mutexes are a far better choice: when you lock or wait on one you also inform the kernel/scheduler of what you are doing, so it can take the correct action.
          Agreed. However, I'd like to add that, for performance reasons, even mutexes (at least on Linux) are usually handled purely in user space as long as there is no contention. The kernel is called only when there is contention.
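
          A minimal sketch of that fast path, along the lines of Ulrich Drepper's "Futexes Are Tricky" (state 0 = free, 1 = locked, 2 = locked with possible waiters); illustrative, not production code:

          Code:
          #include <linux/futex.h>
          #include <stdatomic.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          static atomic_int state;  /* 0 = free, 1 = locked, 2 = locked + waiters */

          static long futex(atomic_int *uaddr, int op, int val)
          {
              return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
          }

          static void lock(void)
          {
              int c = 0;
              if (atomic_compare_exchange_strong(&state, &c, 1))
                  return;                        /* fast path: no syscall at all */
              if (c != 2)
                  c = atomic_exchange(&state, 2);
              while (c != 0) {                   /* slow path: sleep in the kernel */
                  futex(&state, FUTEX_WAIT, 2);
                  c = atomic_exchange(&state, 2);
              }
          }

          static void unlock(void)
          {
              if (atomic_exchange(&state, 0) == 2)  /* 2 = someone may be waiting */
                  futex(&state, FUTEX_WAKE, 1);
          }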

          Originally posted by F.Ultra View Post
          That benchmark didn't really measure what it thought it measured. Thankfully it looks like the dev took the hints from Linus to heart and became a better developer because of it, so in the end it all turned out for the good.
          Yes.
          Last edited by indepe; 11 July 2021, 06:11 PM.



          • #35
            Originally posted by indepe View Post
            Agreed. However, I'd like to add that, for performance reasons, even mutexes (at least on Linux) are usually handled purely in user space as long as there is no contention. The kernel is called only when there is contention.
            Yes, sorry that I didn't mention that; I wasn't sure how deep I should go. If there is no contention, the scheduler does not have to be involved, since that means there is no idle thread waiting on the lock. However, if you set PTHREAD_MUTEX_ADAPTIVE_NP, the mutex will spin a number of times (100 by default) on contention before calling the futex syscall. Depending on your specific needs this can increase or decrease performance, but it will never be as bad as those idiotic userspace spinlocks from the blog post.
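
            For reference, this is how you would opt into it with glibc (PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable GNU extension):

            Code:
            #define _GNU_SOURCE   /* for PTHREAD_MUTEX_ADAPTIVE_NP */
            #include <pthread.h>

            static pthread_mutex_t m;

            static void init_adaptive_mutex(void)
            {
                pthread_mutexattr_t attr;
                pthread_mutexattr_init(&attr);
                /* Spin briefly on contention before the futex syscall. */
                pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
                pthread_mutex_init(&m, &attr);
                pthread_mutexattr_destroy(&attr);
            }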

            edit: looks like Microsoft patented futexes in 2013, long after they appeared in Linux (2003). Oh boy.
            Last edited by F.Ultra; 11 July 2021, 08:39 PM.



            • #36
              Originally posted by F.Ultra View Post
              edit: looks like Microsoft patented futexes in 2013, long after they appeared in Linux (2003). Oh boy.
              Apparently the latest is that Microsoft joined the OIN in 2018, so you can forget that....

              I suppose the patent shouldn't have been granted in the first place.



              • #37
                Originally posted by F.Ultra View Post
                [...] However, if you set PTHREAD_MUTEX_ADAPTIVE_NP, the mutex will spin a number of times (100 by default) on contention before calling the futex syscall. Depending on your specific needs this can increase or decrease performance, but it will never be as bad as those idiotic userspace spinlocks from the blog post.
                Yes, adaptive mutexes are in a sense a combination of spinlocks and mutexes, where even under contention the kernel isn't called until that spin count has been exhausted.

                Using something you might call "adaptive semaphores", my experience has been that the spin count shouldn't be too high, since excessive spinning can eat enormously into CPU usage.
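
                Roughly the shape I mean, as a sketch (SPIN_LIMIT and the helper name are made up; the whole point is that the bound must stay small):

                Code:
                #include <pthread.h>

                #define SPIN_LIMIT 100   /* made-up bound; too high eats CPU */

                /* Spin on trylock for a bounded number of attempts, then
                 * give up and sleep in the kernel until woken. */
                static void spin_then_block(pthread_mutex_t *m)
                {
                    for (int i = 0; i < SPIN_LIMIT; i++)
                        if (pthread_mutex_trylock(m) == 0)
                            return;          /* acquired during the spin phase */
                    pthread_mutex_lock(m);   /* blocking slow path */
                }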



                • #38
                  Originally posted by indepe View Post
                  Yes, adaptive mutexes are in a sense a combination of spinlocks and mutexes, where even under contention the kernel isn't called until that spin count has been exhausted.
                  What's weird is that it's process-global. What you really want is customization on a per-instance basis. If you think about when it makes any sense to do a short spin, it would be for mutexes that tend to be under fairly high contention and held for a very short amount of time, such as those used to serialize access to data structures.

                  BTW, something I was never entirely clear about is whether Windows' CRITICAL_SECTION actually does anything to defer preemption, because that's something else you'd want when using a spinlock. Obviously it couldn't entirely defeat preemption, but maybe it could borrow a little time from future timeslices in order to try to delay it until the critical section is exited.

                  Originally posted by indepe View Post
                  Using something you might call "adaptive semaphores", my experience has been that the spin count shouldn't be too high, since excessive spinning can eat enormously into CPU usage.
                  Right. Time the typical ownership period of the specific mutex. If it's shorter than a syscall, use it to calibrate the mutex's spin time. If it's much longer than a syscall, don't even bother spinning.

                  A further enhancement would be to spin only if you knew the owner was executing. Otherwise, just go ahead and context switch right away.
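
                  The syscall side of that calibration could be measured with something like this hypothetical helper (it times a trivial, never-cached syscall; compare the result against the measured hold time of the mutex):

                  Code:
                  #define _GNU_SOURCE
                  #include <sys/syscall.h>
                  #include <time.h>
                  #include <unistd.h>

                  /* Average round-trip cost of entering the kernel, in ns. */
                  static long syscall_cost_ns(void)
                  {
                      enum { N = 1000 };
                      struct timespec a, b;
                      clock_gettime(CLOCK_MONOTONIC, &a);
                      for (int i = 0; i < N; i++)
                          syscall(SYS_gettid);   /* cheap probe syscall */
                      clock_gettime(CLOCK_MONOTONIC, &b);
                      return ((b.tv_sec - a.tv_sec) * 1000000000L
                              + (b.tv_nsec - a.tv_nsec)) / N;
                  }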



                  • #39
                    Originally posted by coder View Post
                    What's weird is that it's process-global. What you really want is customization on a per-instance basis. If you think about when it makes any sense to do a short spin, it would be for mutexes that tend to be under fairly high contention and held for a very short amount of time, such as those used to serialize access to data structures.
                    In some cases, that's what I do with semaphores, although so far it expresses the priority of the thread, and the willingness to sacrifice CPU usage at that specific point, more than anything else, so to speak.

                    Originally posted by coder View Post
                    BTW, something I was never entirely clear about is whether Windows' CRITICAL_SECTION actually does anything to defer preemption, because that's something else you'd want when using a spinlock. Obviously it couldn't entirely defeat preemption, but maybe it could borrow a little time from future timeslices in order to try to delay it until the critical section is exited.
                    Or perhaps a function that says: "if this thread is close to the end of its time slice, better preempt it now". Of course you'd also want that for locks in general, since it's generally not good if a thread gets preempted while holding a lock. But then, such a function would consume some time in itself. Maybe it's something you would do with a low-priority thread to prevent it from blocking a thread with higher priority. Maybe that's a case where it would make sense to call sched_yield() before taking a lock. <--joke.

                    Originally posted by coder View Post
                    Right. Time the typical ownership period of the specific mutex. If it's shorter than a syscall, use it to calibrate the mutex's spin time. If it's much longer than a syscall, don't even bother spinning.
                    Yes, something like that. For many applications it won't make much of a difference (unless the value is far too high), while for some it may be very worthwhile to have such options in specific places. Maybe a very small value would be good in general, I don't know.



                    • #40
                      Originally posted by coder View Post
                      What's weird is that it's process-global. What you really want is customization on a per-instance basis. If you think about when it makes any sense to do a short spin, it would be for mutexes that tend to be under fairly high contention and held for a very short amount of time, such as those used to serialize access to data structures.

                      BTW, something I was never entirely clear about is whether Windows' CRITICAL_SECTION actually does anything to defer preemption, because that's something else you'd want when using a spinlock. Obviously it couldn't entirely defeat preemption, but maybe it could borrow a little time from future timeslices in order to try to delay it until the critical section is exited.
                      I don't think it does. CRITICAL_SECTION on Windows works just like an adaptive mutex on Linux, in that it spins for just a while (4000 cycles, if I'm not mistaken) before it enters the kernel and puts the thread to sleep, so there is no need to defer preemption.
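
                      Note that, in contrast to the process-global tuning mentioned above, the spin count there is set per instance; the Win32 API exposes it directly:

                      Code:
                      #include <windows.h>

                      static CRITICAL_SECTION cs;

                      static void init_lock(void)
                      {
                          /* Spin up to 4000 times on contention before
                           * waiting in the kernel; 4000 is the figure MSDN
                           * cites for the heap manager's critical section. */
                          InitializeCriticalSectionAndSpinCount(&cs, 4000);
                      }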

