
Futex2 Proposed In Latest Effort For Linux Kernel Optimization That Can Benefit Gamers


  • #21
    Originally posted by duby229 View Post

    There is -no- use case for spinlocks in userspace. If you do use spinlocks in userspace then your entire concept is wrong.
    Well, there totally are some use cases for spinlocks, especially when you almost never actually try to acquire a locked spinlock. The problem with a standard mutex is that it's pretty slow to lock. Both dxvk and proton's vkd3d use spinlocks, and vkd3d got massive CPU-limited performance gains from them (https://github.com/HansKristian-Work/vkd3d/pull/57). Just because you (or Linus fwiw) don't see use cases doesn't mean there aren't some.
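
    To illustrate, a minimal spinlock of the kind being discussed looks roughly like this, using C11 atomics (a sketch only, not vkd3d's actual code; the type and function names are made up):

    Code:
    #include <stdatomic.h>

    /* Minimal test-and-set spinlock. Initialize with {0}. */
    typedef struct { atomic_uint locked; } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        for (;;)
        {
            /* Try to take the lock with one atomic exchange. */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
            /* Lock is held: spin on plain loads so contending CPUs
             * don't bounce the cache line with read-modify-writes. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }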

    Comment


    • #22
      Originally posted by indepe View Post

      No, user space is more performant when there is no contention, or only short-lived contention. Which is why glibc implements those parts of the lock code in user space.
      While this is true for a simple mutex, it is no longer true when waiting on multiple mutexes, and providing a fast and correct implementation in user space (either WAIT-ALL or WAIT-ONE-OF) is going to be very challenging, to say the least (for the record, you're really welcome to try; I tried, I failed, because user space synchronisation on multiple mutexes is really hard to get right in the first place).

      Originally posted by indepe View Post
      "Futex" literally is short of "fast userspace mutex". Futexes have the purpose of allowing much of the locking code to be implemented in user space, because that is faster. That's the point. As I explained above and in previous discussions, that proposal is missing a a good public reasoning to be implemented in the kernel.
      The locking takes place in user space; the waiting (i.e. the synchronisation, i.e. the hard part) takes place in the kernel. Locking the underlying object can be as simple as an atomic store; checking the lock before calling futex can be as simple as an atomic read; but spinning on the lock when there is contention is simply not something you want to do in userspace, and I speak from experience in that very specific case.
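
      To make that split concrete, here is a minimal sketch along the lines of the classic three-state futex mutex from Drepper's "Futexes Are Tricky" paper (a sketch, not glibc's actual implementation; error handling omitted):

      Code:
      #include <linux/futex.h>
      #include <stdatomic.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* 0 = unlocked, 1 = locked, 2 = locked with waiters. */
      static atomic_int lock_word;

      static long futex(atomic_int *uaddr, int op, int val)
      {
          return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
      }

      static void lock(void)
      {
          int c = 0;
          /* Fast path: one atomic CAS in user space, no kernel call. */
          if (atomic_compare_exchange_strong(&lock_word, &c, 1))
              return;
          /* Slow path: mark the lock contended and wait in the kernel. */
          if (c != 2)
              c = atomic_exchange(&lock_word, 2);
          while (c != 0) {
              futex(&lock_word, FUTEX_WAIT, 2);
              c = atomic_exchange(&lock_word, 2);
          }
      }

      static void unlock(void)
      {
          /* One atomic RMW; enter the kernel only if someone may wait. */
          if (atomic_exchange(&lock_word, 0) == 2)
              futex(&lock_word, FUTEX_WAKE, 1);
      }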

      Comment


      • #23
        Originally posted by Emmanuel Deloget View Post
        While this is true for a simple mutex, it is no longer true when waiting on multiple mutexes,
        This is incorrect. What makes you think so?

        Originally posted by Emmanuel Deloget View Post
        and providing a fast and correct implementation in user space (either WAIT-ALL or WAIT-ONE-OF) is going to be very challenging, to say the least (for the record, you're really welcome to try; I tried, I failed, because user space synchronisation on multiple mutexes is really hard to get right in the first place).
        Yes, it requires experience with implementing locks and also with implementing lock-free algorithms. But so does a good implementation inside the kernel. It also requires a good idea, so to speak; it is not just an application of known patterns. This is not something for application programmers, it is for implementers of multithreaded libraries (including glibc).

        Originally posted by Emmanuel Deloget View Post
        The locking takes place in user space; the waiting (i.e. the synchronisation, i.e. the hard part) takes place in the kernel. Locking the underlying object can be as simple as an atomic store; checking the lock before calling futex can be as simple as an atomic read; but spinning on the lock when there is contention is simply not something you want to do in userspace, and I speak from experience in that very specific case.
        I don't know what exactly you are talking about here. Even for a spinlock, locking requires an atomic read-modify-write instruction, not just a store. However, in this case you would not use a spinlock. Nevertheless, you would spin for a limited number of checks in userspace in order to avoid a kernel call when contention is short. Unless you design the lock for a specific use case, and in your use case contention is rarely short.
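
        For example, that limited spin might look like this (a sketch with simplifying assumptions: the pause is x86-specific, and the unlock always wakes; a real lock would track waiters as in a three-state futex mutex):

        Code:
        #include <immintrin.h>   /* _mm_pause(), x86-specific */
        #include <linux/futex.h>
        #include <stdatomic.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #define SPIN_LIMIT 100   /* tuning parameter, workload-dependent */

        static void adaptive_lock(atomic_int *word) /* 0 = free, 1 = held */
        {
            int expected;
            /* Spin briefly in user space to catch short contention. */
            for (int i = 0; i < SPIN_LIMIT; i++) {
                expected = 0;
                if (atomic_compare_exchange_weak(word, &expected, 1))
                    return;          /* acquired without a kernel call */
                _mm_pause();
            }
            /* Contention outlasted the spin budget: sleep in the kernel. */
            for (;;) {
                expected = 0;
                if (atomic_compare_exchange_strong(word, &expected, 1))
                    return;
                syscall(SYS_futex, word, FUTEX_WAIT, 1, NULL, NULL, 0);
            }
        }

        static void adaptive_unlock(atomic_int *word)
        {
            atomic_store(word, 0);
            /* Simplified: always wake one potential waiter. */
            syscall(SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
        }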

        Comment


        • #24
          Originally posted by DadSchoorse View Post
          Well, there totally are some use cases for spinlocks, especially when you almost never actually try to acquire a locked spinlock. The problem with a standard mutex is that it's pretty slow to lock. Both dxvk and proton's vkd3d use spinlocks, and vkd3d got massive CPU-limited performance gains from them (https://github.com/HansKristian-Work/vkd3d/pull/57). Just because you (or Linus fwiw) don't see use cases doesn't mean there aren't some.
          Use futex then. Just because locking in userspace happens doesn't mean it -should- happen. The problem is that you are coming from a Windows background, which doesn't really have a process scheduler, but Linux -DOES-. Windows has something that is more or less an overdeveloped, glorified time slicer at best. As an example, look at how much context switching Windows does in comparison to how much context switching Linux does.

          You might want to read this...
          https://docs.microsoft.com/en-us/pre...ectedfrom=MSDN


          Locking in userspace is a HUGE problem with modern multicore CPUs. All you are doing is introducing massive amounts of context switching. And that's literally it. Nothing else. In user space on Linux you aren't a kernel-level process scheduler, so all you -can- do is introduce massive amounts of context switching. (The actual process scheduler in the kernel will almost certainly work against your locking, resulting in huge amounts of context switching, and that's the problem. If you really need to do locking, then use futex.)
          Last edited by duby229; 13 June 2020, 03:44 PM.

          Comment


          • #25
            Originally posted by DadSchoorse View Post
            Well, there totally are some use cases for spinlocks, especially when you almost never actually try to acquire a locked spinlock. The problem with a standard mutex is that it's pretty slow to lock. Both dxvk and proton's vkd3d use spinlocks, and vkd3d got massive CPU-limited performance gains from them (https://github.com/HansKristian-Work/vkd3d/pull/57). Just because you (or Linus fwiw) don't see use cases doesn't mean there aren't some.
            In that link, simply removing "d3d12_device_get_descriptor_mutex" and the associated array appears to be an optimization. Also, have they tried using PTHREAD_MUTEX_ADAPTIVE_NP? It might have some overhead due to the generality of pthread mutexes, but it could otherwise solve their performance problem, assuming the spinlock's better performance was due to short critical sections causing short-lived contention.

            The spinlock as implemented in your link is simplistic and lacks a back-off algorithm (and a pause instruction in general). Even so, I think a performance-optimized adaptive mutex (userspace-oriented) is preferable for an application even in cases where a spinlock is tempting. However, I have not benchmarked PTHREAD_MUTEX_ADAPTIVE_NP to test whether the performance of its specific implementation fulfills that expectation.
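
            For reference, selecting the adaptive type looks like this (real glibc API; _GNU_SOURCE is required for the non-portable NP constant):

            Code:
            #define _GNU_SOURCE
            #include <pthread.h>

            pthread_mutex_t adaptive_mutex;

            void init_adaptive_mutex(void)
            {
                pthread_mutexattr_t attr;
                pthread_mutexattr_init(&attr);
                /* Spin briefly in user space before sleeping in the kernel. */
                pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
                pthread_mutex_init(&adaptive_mutex, &attr);
                pthread_mutexattr_destroy(&attr);
            }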

            Comment


            • #26
              Originally posted by DadSchoorse View Post
              Well, there totally are some use cases for spinlocks, especially when you almost never actually try to acquire a locked spinlock. The problem with a standard mutex is that it's pretty slow to lock. Both dxvk and proton's vkd3d use spinlocks, and vkd3d got massive CPU-limited performance gains from them (https://github.com/HansKristian-Work/vkd3d/pull/57). Just because you (or Linus fwiw) don't see use cases doesn't mean there aren't some.
              Is there really a significant performance difference between a custom spinlock implementation like the above and pthread_spin_lock/unlock?

              Comment


              • #27
                Originally posted by DadSchoorse View Post
                Well, there totally are some use cases for spinlocks, especially when you almost never actually try to acquire a locked spinlock. The problem with a standard mutex is that it's pretty slow to lock. Both dxvk and proton's vkd3d use spinlocks, and vkd3d got massive CPU-limited performance gains from them (https://github.com/HansKristian-Work/vkd3d/pull/57). Just because you (or Linus fwiw) don't see use cases doesn't mean there aren't some.
                A standard mutex is not pretty slow to lock (on Linux), since it works exactly like a spinlock when the lock is uncontested; only after a few spins under contention does it switch to a futex, so up until then it's all in userspace and works just like a spinlock.

                That said, there is of course still an overhead difference of a few ns between a custom spinlock and a library function like pthread_mutex, and if the locks are taken many, many times per second then it all adds up. This is also why e.g. Linus has never claimed that a spinlock is never to be used in user-space; he just meant that in 99% of the cases where you think a spinlock is the correct answer, in practice it isn't.

                edit: Just to put some real numbers in here: I just tested pthread_mutex vs pthread_spin_lock, doing lock+unlock in a loop 100M times. After 10 rounds each, the number of clock cycles on my CPU was 1481M for the mutex and 1844M for the spinlock (with PTHREAD_PROCESS_PRIVATE set), so for the completely uncontended case a mutex is not only just as fast as a spinlock, it can actually be somewhat faster.
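
                Roughly, the test looked like this (a reconstructed sketch, not the exact code I ran; x86 rdtsc cycle counting, compile with -pthread):

                Code:
                #include <pthread.h>
                #include <stdint.h>
                #include <stdio.h>
                #include <x86intrin.h>   /* __rdtsc(), x86-only */

                #define ITERATIONS 100000000UL

                int main(void)
                {
                    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
                    pthread_spinlock_t s;
                    uint64_t t0, mutex_cycles, spin_cycles;
                    unsigned long i;

                    pthread_spin_init(&s, PTHREAD_PROCESS_PRIVATE);

                    /* Uncontended lock+unlock in a tight loop, cycle-counted. */
                    t0 = __rdtsc();
                    for (i = 0; i < ITERATIONS; i++) {
                        pthread_mutex_lock(&m);
                        pthread_mutex_unlock(&m);
                    }
                    mutex_cycles = __rdtsc() - t0;

                    t0 = __rdtsc();
                    for (i = 0; i < ITERATIONS; i++) {
                        pthread_spin_lock(&s);
                        pthread_spin_unlock(&s);
                    }
                    spin_cycles = __rdtsc() - t0;

                    printf("mutex: %llu cycles, spinlock: %llu cycles\n",
                           (unsigned long long)mutex_cycles,
                           (unsigned long long)spin_cycles);
                    return 0;
                }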
                Last edited by F.Ultra; 14 June 2020, 12:05 PM.

                Comment


                • #28
                  Originally posted by indepe View Post
                  Nowadays many locks are a combination of spinlock and kernel lock. They first spin in user space for a certain time before falling back to a kernel wait.

                  They combine the advantages of each, at the cost of an additional atomic read-modify-write instruction in the unlock path. That's just a few nanoseconds, which bugs those who seek an ideal lock, but it avoids the problems of pure spinlocks. In practice these few nanoseconds are usually of no consequence.

                  In a similar way, the futex2 logic for wait-for-one-of-many should instead be implemented in user-space, where it will usually avoid the need for a kernel call in the absence of contention. Once that is done, there is no need to duplicate the required lists and logic in the kernel.

                  Instead, it may be useful to have a wake-one-of-many kernel API, to allow the kernel scheduler to pick the best candidate when waking one of many. However, that would be optional, not necessary.

                  So, as in a previous discussion, I don't see a need or advantage for the "futex2" concept inside the kernel. If that exists, it is non-obvious and should be stated explicitly. But the proposal doesn't seem to state such a need or advantage, as far as I know.
                  My take on this patch is that it would only be used when there is contention, just like how you use the regular futex today. And there is no way you can write an equivalent function in 100% userspace that handles contention as well as the kernel can here (by e.g. suspending threads); that is what Windows applications use the WaitForMultipleObjects() call for.

                  Comment


                  • #29
                    Originally posted by Space Heater View Post

                    Is there really a significant performance difference between a custom spinlock implementation like the above and pthread_spin_lock/unlock?
                    Actually, on my machine their custom version was slower (2643M cycles vs 1844M cycles for 100M rounds of lock+unlock), which is kind of strange. Since this is a test with 100% no contention, each lock generates one call to both atomic_load_explicit() and atomic_exchange_explicit() in their custom code, which means two loads and one store per lock. If the pthread version uses only the CAS, it has just one load and one store in this case, which makes the uncontended case faster but the contended case slower (I have not looked at how pthread implements its spinlocks).

                    edit: I looked up the actual glibc code, and it's actually quite clever here (comments below are mine):

                    Code:
                    int pthread_spin_lock (pthread_spinlock_t *lock)
                    {
                      int val = 0;
                    
                      /* Fast path: a single atomic exchange. If the lock was
                         free, we own it now, having paid only one RMW. */
                      if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
                        return 0;
                    
                      do {
                        do {
                          /* Contended: spin on a relaxed load only, so waiting
                             CPUs don't bounce the cache line with RMWs. */
                          atomic_spin_nop ();
                          val = atomic_load_relaxed (lock);
                        } while (val != 0);
                        /* The lock looked free: try to grab it with a CAS. */
                      } while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));
                    
                      return 0;
                    }
                    So I see no good reason to use a custom version unless you want to build for a non-pthread environment (and perhaps DXVK can be built on other systems as well?!).
                    Last edited by F.Ultra; 14 June 2020, 12:35 PM.

                    Comment


                    • #30
                      Originally posted by F.Ultra View Post
                      My take on this patch is that it would only be used when there is contention, just like how you use the regular futex today. And there is no way you can write an equivalent function in 100% userspace that handles contention as well as the kernel can here (by e.g. suspending threads); that is what Windows applications use the WaitForMultipleObjects() call for.
                      No, not in 100% userspace, but mostly in userspace using the existing futex. We discussed this in more detail under a previous Phoronix article on a previous version of this patch. An efficient userspace implementation that does some spinning in userspace before using a futex will (or at least can) have the logic to funnel the signaling calls to a single wait point per thread, so that a standard existing futex can be used. It then doesn't require an elaborate kernel patch like this one, which places an unnecessarily large burden on the kernel, including the use of dynamic memory for the lists and possibly locking of shared data structures.
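
                      Roughly, the idea: each waiting thread sleeps on one per-thread futex word, and whichever watched object fires first wakes that single word with a plain FUTEX_WAKE (a sketch only; the per-object waiter registration, which is the hard part, is elided, and all names are illustrative):

                      Code:
                      #include <linux/futex.h>
                      #include <stdatomic.h>
                      #include <sys/syscall.h>
                      #include <unistd.h>

                      /* Per-thread wait point: the only futex this thread sleeps on. */
                      struct waiter {
                          atomic_int signaled; /* 0 = asleep, 1 = woken */
                      };

                      /* Each watched object would keep a (suitably synchronized)
                       * list of registered waiters; registration and teardown,
                       * the genuinely hard part, are elided here. */

                      static void waiter_wait(struct waiter *w)
                      {
                          while (atomic_load(&w->signaled) == 0)
                              syscall(SYS_futex, &w->signaled, FUTEX_WAIT, 0, NULL, NULL, 0);
                          atomic_store(&w->signaled, 0); /* re-arm for the next wait */
                      }

                      /* Called by whichever object fires first: one plain futex
                       * wake, no multi-object support needed in the kernel. */
                      static void waiter_signal(struct waiter *w)
                      {
                          if (atomic_exchange(&w->signaled, 1) == 0)
                              syscall(SYS_futex, &w->signaled, FUTEX_WAKE, 1, NULL, NULL, 0);
                      }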
                      Last edited by indepe; 14 June 2020, 02:18 PM.

                      Comment
