Futex2 Proposed In Latest Effort For Linux Kernel Optimization That Can Benefit Gamers


  • #51
    In a sense, the proposal is just using the kernel as a fancy way to do a global lock.



    • #52
      Originally posted by indepe View Post
      In a sense, the proposal is just using the kernel as a fancy way to do a global lock.
      Sorry, but that was a really strange conclusion. What is your reasoning here? It's just a way for a thread to sleep on more than one futex.



      • #53
        Originally posted by F.Ultra View Post
        Sorry, but that was a really strange conclusion. What is your reasoning here? It's just a way for a thread to sleep on more than one futex.
        It's a way of expressing that there is no point in implementing that inside the kernel. (Unless there is a really surprising reason.) That includes the "bWaitAll" option of WFMO. I can see why it might look like a good idea at first, but it isn't. Understanding my joke more directly unfortunately would require a lot of explanation and thinking about how to implement an event mechanism.

        EDIT: However they are right about using Futexes for this purpose. It's just that the existing API is flexible and efficient enough.
        EDIT2: And I think they kind of have it backwards in some sense: they appear to want to use futexes almost like senders of events (or intermediates), but they are receivers (of wake calls).

        Situations where threads are waiting for multiple events are actually very common. Usually they can be solved easily with semaphores (mutexes can be implemented on top of semaphores), or message queues (for example the io_uring API). I could go on, but it will probably sound like blablabla for many.
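
        Just to make the parenthetical concrete, here is a minimal sketch of a mutex built on top of a plain POSIX semaphore (illustrative only; the sem_mutex_* names are made up for the example, and error handling is omitted):

        #include <semaphore.h>

        typedef struct { sem_t sem; } sem_mutex_t;

        /* binary semaphore, count 1 = "unlocked" */
        static void sem_mutex_init(sem_mutex_t *m)   { sem_init(&m->sem, 0, 1); }
        /* lock: take the single unit, blocking if another thread holds it */
        static void sem_mutex_lock(sem_mutex_t *m)   { sem_wait(&m->sem); }
        /* unlock: give the unit back, waking one waiter if any */
        static void sem_mutex_unlock(sem_mutex_t *m) { sem_post(&m->sem); }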

        By the way, I think you are right about pthread_spinlock being fast in a single thread lock/unlock loop. At least the version as it comes on my system appears to be 1 cycle faster. Surprisingly, also 1 cycle faster than using the source code that you posted (or any other pthread source that I found so far), given that atomic operations usually compile to single instructions on x86.
        Last edited by indepe; 17 June 2020, 05:01 PM.



        • #54
          Originally posted by indepe View Post

          It's a way of expressing that there is no point in implementing that inside the kernel. (Unless there is a really surprising reason.) That includes the "bWaitAll" option of WFMO. I can see why it might look like a good idea at first, but it isn't. Understanding my joke more directly unfortunately would require a lot of explanation and thinking about how to implement an event mechanism.

          EDIT: However they are right about using Futexes for this purpose. It's just that the existing API is flexible and efficient enough.
          EDIT2: And I think they kind of have it backwards in some sense: they appear to want to use futexes almost like senders of events (or intermediates), but they are receivers (of wake calls).

          Situations where threads are waiting for multiple events are actually very common. Usually they can be solved easily with semaphores (mutexes can be implemented on top of semaphores), or message queues (for example the io_uring API). I could go on, but it will probably sound like blablabla for many.

          By the way, I think you are right about pthread_spinlock being fast in a single thread lock/unlock loop. At least the version as it comes on my system appears to be 1 cycle faster. Surprisingly, also 1 cycle faster than using the source code that you posted (or any other pthread source that I found so far), given that atomic operations usually compile to single instructions on x86.
          Implementing this with semaphores forces you either to busy-spin or to call sleep() intermittently, causing random latencies and wasting CPU cycles and power. io_uring has horrible latencies overall, as do all fd-related interfaces in Linux.

          I still don't really understand your aversion to being able to sleep on multiple futexes. Ignore WINE for a moment and think of what possibilities you would get from being able to wait on multiple futexes.



          • #55
            Originally posted by indepe View Post

            No, not 100% in userspace, but mostly in userspace using the existing futex. We have discussed this in more detail under a previous Phoronix article on a previous version of this patch. An efficient userspace implementation that does some spinning in userspace before using a futex will (or at least can) have the logic to resolve the signaling calls to a single wait point per thread, so that a standard existing futex can be used. It then won't require an elaborate kernel patch like this one, which places an unnecessarily large burden on the kernel, including the use of dynamic memory for the lists and possibly locking of shared data structures.
            So now I actually took the time to go back and check the actual patch and it contains this text:
            Before calling the syscall again, userspace should traverse the list, trying to re-acquire any of the other futexes, to prevent an immediate -EWOULDBLOCK return code from the kernel.
            So your whole premise that "In a similar way, the futex2 logic for wait-for-one-of-many should instead be implemented in user-space, where it will usually avoid the need for a kernel call in the absence of contention." was wrong, as I assumed earlier. This call is intended to be used by WINE only when there is contention.
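
            To illustrate the pattern that text describes, here is a rough sketch written against the futex_waitv() interface as it later landed in mainline (Linux 5.16 headers or newer); the 2020 patch discussed here differs in detail, and try_acquire() is just a stand-in for whatever userspace fast path the application uses, not part of any real API:

            #include <stdbool.h>
            #include <stdint.h>
            #include <errno.h>
            #include <unistd.h>
            #include <sys/syscall.h>
            #include <linux/futex.h>      /* struct futex_waitv, FUTEX_32 (5.16+ headers) */

            extern bool try_acquire(uint32_t *word);   /* hypothetical userspace fast path */

            int wait_any(uint32_t **words, unsigned n) /* returns index of acquired futex */
            {
                struct futex_waitv wv[128];            /* the kernel caps the list at 128 */

                if (n > 128)
                    return -1;

                for (;;) {
                    /* 1. Traverse the list, trying to acquire any futex in userspace. */
                    for (unsigned i = 0; i < n; i++)
                        if (try_acquire(words[i]))
                            return (int)i;             /* no syscall needed */

                    /* 2. All of them looked taken: sleep on the whole set at once. */
                    for (unsigned i = 0; i < n; i++) {
                        wv[i].val        = 1;          /* expected "locked" value; depends on the lock protocol */
                        wv[i].uaddr      = (uintptr_t)words[i];
                        wv[i].flags      = FUTEX_32;
                        wv[i].__reserved = 0;
                    }
                    long r = syscall(SYS_futex_waitv, wv, n, 0, NULL, 0);
                    if (r < 0 && errno != EAGAIN && errno != EINTR)
                        return -1;
                    /* Woken up (or raced): go back to step 1 and retry the fast path. */
                }
            }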



            • #56
              Originally posted by F.Ultra View Post
              Implementing this with semaphores forces you either to busy-spin or to call sleep() intermittently, causing random latencies and wasting CPU cycles and power. io_uring has horrible latencies overall, as do all fd-related interfaces in Linux.
              Why would you think that? Instead, you make it so that any of many event sources will signal the semaphore. No polling.

              You can use a semaphore to wait for signaling by multiple events, first in userspace, and then by blocking on a futex, without having to do repeated polling.

              No CPU cycles are wasted; on the contrary, polling is avoided. And mutexes ("m" not "f") can be implemented on top of semaphores (lock) and event sources (unlock). That is the concept; in practice you would implement it more directly at a lower level (not as an application writer, but as a library implementor).
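
              To sketch what I mean (my own illustration, not code from the patch; the fsem_* names are made up): a counting semaphore whose uncontended path is a single atomic in userspace, with the futex syscall only used when a waiter actually has to sleep. Every event source posts the same semaphore, so the waiting thread always sleeps on one futex word:

              #include <stdatomic.h>
              #include <unistd.h>
              #include <sys/syscall.h>
              #include <linux/futex.h>

              typedef struct { atomic_int count; } fsem_t;   /* units = pending events */

              static void futex_wait(atomic_int *addr, int expected)
              {
                  syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
              }

              static void futex_wake(atomic_int *addr, int n)
              {
                  syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, n, NULL, NULL, 0);
              }

              /* Called by any of the event sources; never blocks. */
              void fsem_post(fsem_t *s)
              {
                  atomic_fetch_add(&s->count, 1);
                  futex_wake(&s->count, 1);   /* a real version tracks waiters to skip this */
              }

              /* Called by the one waiting thread to wait for "any event". */
              void fsem_wait(fsem_t *s)
              {
                  for (;;) {
                      int c = atomic_load(&s->count);
                      if (c > 0) {
                          /* Fast path: an event is already pending, no syscall at all. */
                          if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
                              return;
                      } else {
                          /* Nothing pending: sleep until a post changes the count. */
                          futex_wait(&s->count, 0);
                      }
                  }
              }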

              Originally posted by F.Ultra View Post
              I still don't really understand your aversion to being able to sleep on multiple futexes. Ignore WINE for a moment and think of what possibilities you would get from being able to wait on multiple futexes.
              Why would you want to wait for multiple futexes, if you can use a single futex to wait for multiple event sources more efficiently?

              Originally posted by F.Ultra View Post
              So your whole premise that "In a similar way, the futex2 logic for wait-for-one-of-many should instead be implemented in user-space, where it will usually avoid the need for a kernel call in the absence of contention." was wrong, as I assumed earlier. This call is intended to be used by WINE only when there is contention.
              The text you quoted refers to the need to find out which additional event source was triggered (if any), since that proposed API will apparently tell you only about one. With a good implementation (in user space), it is not necessary to do this.

              The logic of associating one thread with multiple event sources can be done in user space using the existing API. There is no need for additional code, complexity and inefficiency inside the kernel. I expect that in most use cases the thread will repeatedly wait for the same set of event sources, so this association has to be done only once, not for each call.

              All that is what I have been trying to say in the first place.
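
              And a sketch of the "associate once" part (again purely illustrative, the names are mine): each event source is registered once with a pointer to its waiter's single futex word, and signaling any source just sets a pending bit and wakes that one word:

              #include <stdatomic.h>
              #include <unistd.h>
              #include <sys/syscall.h>
              #include <linux/futex.h>

              struct waiter {
                  atomic_uint pending;   /* bit i set = source i has fired */
                  atomic_int  futex;     /* the one word this thread ever sleeps on */
              };

              struct event_source {
                  struct waiter *w;      /* filled in once at registration time */
                  unsigned       bit;
              };

              /* Done once, not on every wait call. */
              void source_attach(struct event_source *src, struct waiter *w, unsigned bit)
              {
                  src->w   = w;
                  src->bit = bit;
              }

              /* Called by whoever triggers this event source (a real version would
               * skip the wake syscall when the waiter is not actually sleeping). */
              void source_signal(struct event_source *src)
              {
                  atomic_fetch_or(&src->w->pending, 1u << src->bit);
                  atomic_store(&src->w->futex, 1);
                  syscall(SYS_futex, &src->w->futex, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
              }

              /* Returns a bitmask of the sources that fired. */
              unsigned waiter_wait_any(struct waiter *w)
              {
                  unsigned fired;
                  while ((fired = atomic_exchange(&w->pending, 0)) == 0) {
                      atomic_store(&w->futex, 0);
                      /* Re-check after arming, so a signal racing with us is not missed. */
                      if (atomic_load(&w->pending) != 0)
                          continue;
                      syscall(SYS_futex, &w->futex, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
                  }
                  return fired;
              }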



              • #57
                Originally posted by indepe View Post

                Why would you think that? Instead, you make it so that any of many event sources will signal the semaphore. No polling.

                You can use a semaphore to wait for signaling by multiple events, first in userspace, and then by blocking on a futex, without having to do repeated polling.

                No CPU cycles are wasted; on the contrary, polling is avoided. And mutexes ("m" not "f") can be implemented on top of semaphores (lock) and event sources (unlock). That is the concept; in practice you would implement it more directly at a lower level (not as an application writer, but as a library implementor).
                Then you are back with your global lock again ;-).

                You have to understand that WaitForMultipleObjects is a heavily used (and abused) API in Windows applications/games, so WINE is exposed to situations where you can have 100 threads waiting for 50k different events, where each thread waits for a mixed subset of those events.

                So a single semaphore would of course wreak havoc here, and trying to partition the events over multiple semaphores is going to be a real nightmare, since WFMO is one of those stupid APIs where the complete watch list can change dynamically between calls.

                LWN covers it in a currently paywalled article at https://lwn.net/Articles/823513/ (which should be free in a week, I think) where there are comments by the patch author himself. But I can quote him from his last comment there:

                Yes, it can be done in userspace, as mentioned in my original submission of Futex Wait Multiple last year. Still, we experimented with all that, and the performance of a userspace solution or even of a solution based on eventfd (which would be the most obvious way to implement these semantics in userspace with the existing kernel support), is a bottleneck on our real-world scenarios of emulation of windows applications over Wine.
                So they have tried the userspace road (and several different implementations of it), and the futex_waitv() road was what gave them acceptable performance.

                Originally posted by indepe View Post
                The text you quoted refers to the need to find out which additional event source was triggered (if any), since that proposed API will apparently tell you only about one. With a good implementation (in user space), it is not necessary to do this.
                Not really, he explained this further on LWN:
                For the uncontested case, a waiter can grab a futex without going into the kernel. The semantics for FWM are "any of the futexes in the list" so if the consumer acquires any futex, it doesn't call futex_waitv.
                So again, their use case is to use this only when all the events are contested.
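
                For reference, this is the kind of userspace fast path he is talking about: the classic three-state futex lock (0 = free, 1 = locked, 2 = locked with waiters), where the uncontended acquire and release never enter the kernel. A sketch of the well-known pattern, not the actual Wine/futex2 code:

                #include <stdatomic.h>
                #include <unistd.h>
                #include <sys/syscall.h>
                #include <linux/futex.h>

                typedef struct { atomic_int state; } flock_t;   /* name made up for the example */

                void flock_acquire(flock_t *l)
                {
                    int c = 0;
                    /* Fast path: 0 -> 1 with one CAS, no syscall when uncontended. */
                    if (atomic_compare_exchange_strong(&l->state, &c, 1))
                        return;
                    /* Slow path: advertise waiters (state 2) and sleep until we get the lock. */
                    if (c != 2)
                        c = atomic_exchange(&l->state, 2);
                    while (c != 0) {
                        syscall(SYS_futex, &l->state, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
                        c = atomic_exchange(&l->state, 2);
                    }
                }

                void flock_release(flock_t *l)
                {
                    /* Uncontended release: 1 -> 0, no syscall. Otherwise wake one waiter. */
                    if (atomic_exchange(&l->state, 0) == 2)
                        syscall(SYS_futex, &l->state, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
                }

                Once every object has a fast path like this, the only question left is what to sleep on when all of them fail, and that is where futex_waitv() comes in.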
                Last edited by F.Ultra; 19 June 2020, 02:38 AM.



                • #58
                  Originally posted by F.Ultra View Post
                  You have to understand that WaitForMultipleObjects is a heavily used (and abused) API in Windows applications/games, so WINE is exposed to situations where you can have 100 threads waiting for 50k different events, where each thread waits for a mixed subset of those events.

                  So a single semaphore would of course wreak havoc here, and trying to partition the events over multiple semaphores is going to be a real nightmare, since WFMO is one of those stupid APIs where the complete watch list can change dynamically between calls.
                  None of that means that potentially using up to 50,000 futexes is better than using up to 100. Why would it? Why would using 100 futexes instead of 50,000 wreak havoc? Why would havoc in user space be worse than havoc in the kernel?

                  I don't know what you mean by "partitioning" events.

                  Originally posted by F.Ultra View Post
                  LWN covers it in a currently paywalled article at https://lwn.net/Articles/823513/ (which should be free in a week, I think) where there are comments by the patch author himself. But I can quote him from his last comment there:
                  I don't have access yet; I would need to read the full article before responding to it. Does it have something that is not in the texts we have posted links to here?

                  Originally posted by F.Ultra View Post
                  So they have tried the userspace road (and several different implementations of it), and the futex_waitv() road was what gave them acceptable performance.
                  There is no single user space road where you could claim something like... been there, done that. There are hundreds of ways of doing something like that.

                  In the software I'm working on, there are several different ways (more than 3) in which threads wait for multiple events, all very performant in my opinion, and all in user space except for contention. However I don't know all the requirements and difficulties potentially encountered specifically in WINE.

                  Originally posted by F.Ultra View Post
                  Not really, he explained this further on LWN
                  [...]
                  So again, their use case is to use this only when all the events are contested.
                  That's a different point: one is without contention, the other is after waking up from blocking due to contention. Regarding the former: you should always determine contention before making a futex call; that is not the point of debate. The point of debate is whether you need a complex kernel API extension to achieve an efficient solution. I don't think you do, and I still haven't heard a real argument why you would. Someone I don't know saying 'we tried and failed' is not an argument for me. I'm curious whether there will be one in the article.



                  • #59
                    Here is something that could be done to resolve this: they could post the source for a performance test that demonstrates the typical situation in which they think they need an API extension to achieve the necessary performance, as simple as possible and preferably in less than 200 lines, along with the timings they think are required and the timings their solution actually achieves.



                    • #60
                      Originally posted by indepe View Post
                      I don't have access yet; I would need to read the full article before responding to it. Does it have something that is not in the texts we have posted links to here?
                      The article itself is mostly about proposed changes to the futex system call itself that are disconnected from this particular patch/issue, but I saw that the patch author commented there, and he might be a far better person to discuss this with than us random dudes on the Internet.

                      Originally posted by indepe View Post
                      There is no single user space road where you could claim something like... been there, done that. There are hundreds of ways of doing something like that.

                      In the software I'm working on, there are several different ways (more than 3) in which threads wait for multiple events, all very performant in my opinion, and all in user space except for contention. However I don't know all the requirements and difficulties potentially encountered specifically in WINE.
                      True, I was perhaps a bit quick to jump to that conclusion. It sounds like they have tried out a few different solutions, but then again we have not seen the source for any of them except the eventfd one.

                      My main interest in all of this is being able to wait efficiently on multiple futexes for a lock-less bounded queue. Due to the very nature of such things, the multiple-producer version is so much slower than the single-producer one that performance is much better with multiple queues instead of just one, which means a consumer may have to listen to more than one producer. The consumer waits on a futex once the queue has been determined empty for a while, so as not to waste CPU during off-hours while still keeping latency low. Messing around with dispatcher threads issuing semaphores does not cut it for my use case (the queue pushes roughly 500M TPS per core and the latency requirements are in the sub-microsecond range), which is why I used a futex in the first place.
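
                      Roughly, the consumer side of that pattern looks like the sketch below: a stripped-down single-producer/single-consumer illustration (everything zero-initialized, everything at seq_cst for clarity), not my actual queue; a tuned version would relax the memory orderings carefully:

                      #include <stdatomic.h>
                      #include <stdint.h>
                      #include <unistd.h>
                      #include <sys/syscall.h>
                      #include <linux/futex.h>

                      #define QSIZE 1024            /* power of two */
                      #define SPIN_ROUNDS 4096      /* how long to spin before sleeping (tunable) */

                      struct spsc_queue {
                          uint64_t buf[QSIZE];
                          atomic_ulong head, tail;  /* consumer advances head, producer advances tail */
                          atomic_int sleeping;      /* consumer sets this before futex-waiting */
                      };

                      /* Producer: push, and wake the consumer only if it announced it is asleep. */
                      void push(struct spsc_queue *q, uint64_t v)
                      {
                          unsigned long t = atomic_load(&q->tail);
                          while (t - atomic_load(&q->head) == QSIZE)
                              ;                     /* queue full: real code would back off or yield */
                          q->buf[t % QSIZE] = v;
                          atomic_store(&q->tail, t + 1);
                          if (atomic_exchange(&q->sleeping, 0))
                              syscall(SYS_futex, &q->sleeping, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
                      }

                      /* Consumer: spin first; only if the queue stays empty, sleep on the futex. */
                      uint64_t pop(struct spsc_queue *q)
                      {
                          unsigned long h = atomic_load(&q->head);
                          for (;;) {
                              for (int i = 0; i < SPIN_ROUNDS; i++) {
                                  if (atomic_load(&q->tail) != h) {
                                      uint64_t v = q->buf[h % QSIZE];
                                      atomic_store(&q->head, h + 1);
                                      return v;
                                  }
                              }
                              atomic_store(&q->sleeping, 1);
                              /* Re-check after announcing the sleep so a concurrent push is not missed. */
                              if (atomic_load(&q->tail) != h) {
                                  atomic_store(&q->sleeping, 0);
                                  continue;
                              }
                              syscall(SYS_futex, &q->sleeping, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
                          }
                      }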

