Futex2 Proposed In Latest Effort For Linux Kernel Optimization That Can Benefit Gamers


  • #41
    Originally posted by ryao View Post
    Windows has a synchronization primitive that operates across processes. That is what the Wine developers are trying to implement efficiently. The best that they have managed so far is using eventfd, but it suffers from a file descriptor exhaustion problem and it is slower than patching the kernel to extend futexes. The alternative is to do IPC to the wine server, which is slow in the same sense that FUSE is slow.
    Although I was wondering about that, I saw no indication of it. While I lack practical experience with using futexes across processes, in the abstract I would assume (or at least hope) that the combination of shared memory and the existing Futex API would fulfill this requirement as well. Shared memory would contain the information necessary to dispatch "multiple" signalling calls to "individual" futexes. I don't know if passing that information would make things more complicated, but some information would have to be passed anyway. On the plus side, I would hope that having this information in shared memory makes the non-blocking cases much faster.
    Last edited by indepe; 16 June 2020, 03:29 AM.
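    What indepe is assuming here, in rough outline: with the futex word placed in shared memory, the existing futex(2) interface already works across processes, as long as the non-private FUTEX_WAIT/FUTEX_WAKE operations are used. Below is a minimal sketch with error handling omitted; the object name "/futex_demo" is illustrative only, not part of any existing project.

    Code:
    #include <fcntl.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* glibc provides no futex() wrapper, so invoke the syscall directly. */
    static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    int main(void)
    {
        /* One futex word in a POSIX shared memory object that another
         * process can map under the same (illustrative) name. */
        int fd = shm_open("/futex_demo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(_Atomic uint32_t));
        _Atomic uint32_t *word = mmap(NULL, sizeof *word,
                                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* Waiter: block while the word is still 0. The non-private
         * FUTEX_WAIT is what makes this work across processes. */
        while (atomic_load_explicit(word, memory_order_acquire) == 0)
            futex(word, FUTEX_WAIT, 0);

        /* The signalling process, after mapping the same object, would do:
         *   atomic_store_explicit(word, 1, memory_order_release);
         *   futex(word, FUTEX_WAKE, 1);
         */
        return 0;
    }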



    • #42
      Originally posted by ryao View Post

      Do you want to send a patch fixing it for them or shall I when I find time? I had been thinking of maybe modifying their code to implement spin locks the right way, but there really is no reason to do that when pthread’s spin lock implementation is available.

      By the way, that atomic_load_relaxed() call will turn into a pause instruction on Intel/AMD processors. The pause instruction will stop the hardware thread from executing for a short period during which the other hardware thread sharing the core will see a performance boost as all execution resources become available to it.
      In pthread’s spin lock implementation, "atomic_spin_nop ();" is the pause call. "atomic_load_relaxed()" is probably just a MOV on x86.
      It still lacks a back-off algorithm in the version I'm looking at: "/* TODO Back-off. */"
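      For x86, a rough sketch of what that inner wait loop boils down to, using C11 atomics and the _mm_pause() intrinsic rather than glibc's internal macros (this is an illustration, not the glibc source):

      Code:
      #include <stdatomic.h>
      #include <immintrin.h>   /* _mm_pause(), x86/x86-64 only */

      static void spin_wait(atomic_int *lock)
      {
          /* The relaxed load is a plain MOV; _mm_pause() emits PAUSE so the
           * sibling SMT thread can use the core's execution resources while
           * we spin. */
          while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
              _mm_pause();
      }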



      • #43
        Originally posted by indepe View Post

        In pthread’s spin lock implementation, "atomic_spin_nop ();" is the pause call. "atomic_load_relaxed()" is probably just a MOV on x86.
        It still lacks a back-off algorithm in the version I'm looking at: "/* TODO Back-off. */"
        Good catch. I copy and pasted the wrong function name. Anyway, this is how I know that people are reading what I wrote.



        • #44
          Originally posted by F.Ultra View Post

          Actually on my machine their custom version was slower (2643M cycles vs 1844M cycles for 100M rounds of lock+unlock), which is kind of strange. However, since this is a test of the 100% uncontended case, each lock operation generates one call to both atomic_load_explicit() and atomic_exchange_explicit() for their custom code, which leads to two loads and one store per lock. If the pthread version uses only the CAS, then it has only one load and one store in that case, which makes the uncontended case faster but the contended case slower (I have not looked at how pthread implements its spinlocks).

          edit: looked up the actual glibc code and it's actually quite clever here:

          Code:
          int pthread_spin_lock (pthread_spinlock_t *lock)
          {
            int val = 0;

            if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
              return 0;

            do {
              do {
                atomic_spin_nop ();
                val = atomic_load_relaxed (lock);
              } while (val != 0);
            } while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));

            return 0;
          }
          So I see no good reason to use a custom version unless you want to build for a non-pthread environment (and perhaps DXVK can be built on other systems as well?!).
          I spoke to the developer that wrote that spinlock code. According to him, the loops calling the lock are so tight that function call overhead causes a slowdown. Furthermore, the spin locks virtually never spin because there is almost zero contention. This seems to be one of the few use cases where a spinlock in user space makes sense. The lack of PAUSE instructions should not be an issue given how the spinlock is said to be used.
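          For comparison, a rough reconstruction of the custom lock as F.Ultra describes it above: a relaxed load followed by an exchange, i.e. two loads and one store on the uncontended path, with no PAUSE while spinning. This is a sketch based on that description, not the actual VKD3D/DXVK source:

          Code:
          #include <stdatomic.h>

          typedef atomic_uint custom_spinlock_t;

          static inline void custom_spin_lock(custom_spinlock_t *lock)
          {
              for (;;) {
                  /* Test with a plain load first, then try the exchange;
                   * uncontended, this is one load plus one RMW (load + store). */
                  if (atomic_load_explicit(lock, memory_order_relaxed) == 0
                      && atomic_exchange_explicit(lock, 1u, memory_order_acquire) == 0)
                      return;
              }
          }

          static inline void custom_spin_unlock(custom_spinlock_t *lock)
          {
              atomic_store_explicit(lock, 0u, memory_order_release);
          }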



          • #45
            indepe, see my previous post; it seems that back-off is not needed for VKD3D’s spin lock. As for back-off in the pthread implementation, please propose a patch.



            • #46
              Originally posted by indepe View Post

              Although I was wondering about that, I saw no indication of it. While I lack practical experience with using futexes across processes, in the abstract I would assume (or at least hope) that the combination of shared memory and the existing Futex API would fulfill this requirement as well. Shared memory would contain the information necessary to dispatch "multiple" signalling calls to "individual" futexes. I don't know if passing that information would make things more complicated, but some information would have to be passed anyway. On the plus side, I would hope that having this information in shared memory makes the non-blocking cases much faster.
              You might want to talk to the wine developer that has been doing the esync work about that. He seems to have been trying everything he could possibly do to make this work in a sane way. I don’t remember why, but I vaguely recall hearing that shared memory was not a workable solution for his use case.



              • #47
                Originally posted by ryao View Post
                indepe, see my previous post; it seems that back-off is not needed for VKD3D’s spin lock. As for back-off in the pthread implementation, please propose a patch.
                Thanks for the encouragement. However, although it is easy to write a back-off that is better than none at all in a specific situation (there are lots of examples on the web), an implementation worthy of glibc's pthread requires extensive testing and tuning on many platforms and architectures and in many different situations.

                It should eventually fall back to a context-switching call (such as nanosleep) in case the critical section is preempted on the same CPU.

                The complexity is hinted at here (quick search for an example):
                https://www.boost.org/doc/libs/1_63_.../tweaking.html

                For my own use, I'm experimenting with a performance-optimized adaptive mutex, which requires a different back-off.
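                To make the idea concrete, a deliberately naive sketch of such a back-off: spin with PAUSE a bounded, growing number of times, then fall back to nanosleep() so a preempted lock holder can run. The threshold and sleep duration are arbitrary placeholders, nowhere near the per-platform tuning a glibc patch would need:

                Code:
                #include <stdatomic.h>
                #include <time.h>
                #include <immintrin.h>   /* _mm_pause(), x86/x86-64 only */

                static void spin_wait_with_backoff(atomic_int *lock)
                {
                    unsigned spins = 1;

                    while (atomic_load_explicit(lock, memory_order_relaxed) != 0) {
                        if (spins <= 1024) {                 /* placeholder limit */
                            for (unsigned i = 0; i < spins; i++)
                                _mm_pause();
                            spins *= 2;                      /* exponential back-off */
                        } else {
                            /* Lock holder may have been preempted: yield the CPU. */
                            struct timespec ts = { 0, 1000 };   /* 1 microsecond */
                            nanosleep(&ts, NULL);
                        }
                    }
                }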



                • #48
                  Originally posted by ryao View Post

                  I spoke to the developer that wrote that spinlock code. According to him, the loops calling the lock are so tight that function call overhead causes a slowdown. Furthermore, the spin locks virtually never spin because there is almost zero contention. This seems to be one of the few use cases where a spinlock in user space makes sense. The lack of PAUSE instructions should not be an issue given how the spinlock is said to be used.
                  Still, their custom code was slower on my Ryzen 1600X than the pthread code, but that could of course be down to my (too) quick'n'dirty test.



                  • #49
                    Originally posted by ryao View Post
                    I spoke to the developer that wrote that spinlock code. According to him, the loops calling the lock are so tight that function call overhead causes a slowdown. Furthermore, the spin locks virtually never spin because there is almost zero contention. This seems to be one of the few use cases where a spinlock in user space makes sense. The lack of PAUSE instructions should not be an issue given how the spinlock is said to be used.
                    Possibly so, especially if the lock is usually accessed by only a single thread. In general, however, I would caution that in a case where function call overhead seems to matter, there is also a probability that a certain percentage of the time is spent inside the lock, so the critical section will be preempted once in a while. In that case, longer contention becomes likely if other threads access the lock at all.



                    • #50
                      Originally posted by ryao View Post
                      You might want to talk to the wine developer that has been doing the esync work about that. He seems to have been trying everything he could possibly do to make this work in a sane way. I don’t remember why, but I vaguely recall hearing that shared memory was not a workable solution for his use case.
                      Thanks for the info. I think the group proposing the patch that is the subject of this article are not the WINE developers themselves. Since their proposal is to extend Futexes, it would seem that their solution would also require shared memory. I've been reading up a bit on that, and it appears that using Futexes across processes generally requires shared memory. This would seem to apply to the proposal and to the existing Futex API in the same way. So I don't know how that proposal would fit together with the WINE requirements. Perhaps reading up on esync will provide some insight into that.

                      EDIT:
                      It turns out esync is using shared memory itself. It seems to me that at least the basic principle of the event mechanism can be implemented efficiently using the existing Futex API. To be completely certain about it would require much more information. As best I can tell, the existing Futex API is more than flexible enough, and its shared memory support allows carrying that flexibility across processes.
                      Last edited by indepe; 16 June 2020, 06:28 PM.
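                      A minimal sketch of the kind of event indepe is describing, assuming the event word already lives in memory shared by both processes (for example via shm_open/mmap as shown earlier in the thread). The helper names are illustrative, not part of any existing API:

                      Code:
                      #include <limits.h>
                      #include <linux/futex.h>
                      #include <stdatomic.h>
                      #include <stdint.h>
                      #include <sys/syscall.h>
                      #include <unistd.h>

                      /* 0 = not signalled, 1 = signalled; must live in shared memory. */
                      typedef _Atomic uint32_t shared_event_t;

                      static long futex(shared_event_t *uaddr, int op, uint32_t val)
                      {
                          return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
                      }

                      static void event_set(shared_event_t *ev)
                      {
                          atomic_store_explicit(ev, 1, memory_order_release);
                          /* Non-private op: wakes waiters in any process. */
                          futex(ev, FUTEX_WAKE, INT_MAX);
                      }

                      static void event_reset(shared_event_t *ev)
                      {
                          atomic_store_explicit(ev, 0, memory_order_release);
                      }

                      static void event_wait(shared_event_t *ev)
                      {
                          while (atomic_load_explicit(ev, memory_order_acquire) == 0)
                              futex(ev, FUTEX_WAIT, 0);
                      }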

