Futex2 Proposed In Latest Effort For Linux Kernel Optimization That Can Benefit Gamers


  • #61
    Originally posted by F.Ultra View Post
    My main interest in all of this is to be able to efficiently wait on multiple futexes for a lock-less bounded queue. By the very nature of such things, the multiple-producer version is so much slower than the single-producer one that performance is much better with multiple queues instead of just one for consumers that have to listen to more than one producer. The consumer waits on a futex once the queue has been empty for a while, so as not to waste CPU during off-hours while still keeping latency low; messing around with dispatcher threads issuing semaphores does not cut it for my use case (the queue pushes roughly 500M TPS per core and the latency requirements are in the sub-microsecond range), which is why I used a futex in the first place.
    Sounds about right for an SPSC queue; with an MPSC queue it's more than 200M/s on my system. In most cases that's more than enough. Of course, if you only block off-hours, a semaphore wouldn't be too bad.

    Assuming that an SPSC queue actually running at 500M TPS cannot use RMW operations, how would it know when to issue a FUTEX_WAKE call? How does it know reliably (without a race condition) whether the consumer has decided to block? After all, modifying the memory location isn't enough. It could check at certain time intervals, but then you don't need a wait-for-multiple-futexes API either. Besides, that would be the end of low latency.

    EDIT: And also: Whenever you make a FUTEX_WAKE call, a little dispatching won't hurt.
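    For reference, the race-free handshake usually looks roughly like this: the consumer publishes its intent to sleep before the final emptiness check, and the producer wakes only when that flag is visible. This is just a sketch with invented names (`waiter_flag`, `consumer_block`, `producer_after_push`), Linux-only and with error handling omitted:

    ```c
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical name. 0 = consumer running, 1 = consumer blocked. */
    static _Atomic uint32_t waiter_flag;

    static long futex_op(_Atomic uint32_t *addr, int op, uint32_t val)
    {
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    /* Consumer: publish intent to sleep *before* the final emptiness
     * check, so a concurrent producer either sees the flag or we see
     * its item -- never neither. */
    static void consumer_block(int (*queue_is_empty)(void))
    {
        atomic_store_explicit(&waiter_flag, 1, memory_order_seq_cst);
        if (!queue_is_empty()) {
            /* An item raced in after we raised the flag: don't sleep. */
            atomic_store_explicit(&waiter_flag, 0, memory_order_relaxed);
            return;
        }
        /* Sleeps only while the flag is still 1; a racing wake is not lost. */
        futex_op(&waiter_flag, FUTEX_WAIT_PRIVATE, 1);
        atomic_store_explicit(&waiter_flag, 0, memory_order_relaxed);
    }

    /* Producer: after pushing an item, wake only if a waiter announced
     * itself -- the common fast path stays syscall-free. */
    static void producer_after_push(void)
    {
        if (atomic_load_explicit(&waiter_flag, memory_order_seq_cst))
            futex_op(&waiter_flag, FUTEX_WAKE_PRIVATE, 1);
    }
    ```

    The point is the ordering: the producer's store of the item and the consumer's store of the flag are both sequentially consistent, so at least one side is guaranteed to observe the other.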
    Last edited by indepe; 19 June 2020, 05:17 PM.



    • #62
      Originally posted by indepe View Post
      Assuming that a SPSC queue actually running at 500M TPS cannot use RMW operations, how would it know when to issue a FUTEX_WAKE call, how does it know reliably (without race condition) if the consumer has decided to block? After all, modifying the memory location isn't enough. It could check in certain time intervals, but then you don't need a wait-for-multiple-futexes API either. Besides, that would be the end of latency.
      What you do is that when the consumer determines that the queue is empty, it does not block on the futex right away. Instead (and here be dragons; note that I don't do this for general-purpose software, so this is designed to work on a specific piece of hardware with a specific use case) you call SYS_futex with the smallest amount of wait time a few times in a loop, until you are "sure" that the queue has been empty long enough for any producer to also notice this and thus know that it should wake the futex after its next write.
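      In code, that idle path looks roughly like this. A sketch only: the word layout, the `QUEUE_EMPTY` sentinel, and the retry count are all invented, and real code needs to check the syscall's return value:

      ```c
      #include <linux/futex.h>
      #include <stdatomic.h>
      #include <stdint.h>
      #include <sys/syscall.h>
      #include <time.h>
      #include <unistd.h>

      #define QUEUE_EMPTY 0u   /* invented sentinel for "nothing to read" */

      /* Consumer idle path: a few very short timed waits first, so any
       * producer gets a window to observe the empty queue and switch to
       * wake-after-write mode; only then commit to an indefinite sleep. */
      static void consumer_idle_wait(_Atomic uint32_t *word)
      {
          const struct timespec tiny = { .tv_sec = 0, .tv_nsec = 1000 }; /* ~1 us */

          for (int i = 0; i < 4; i++) {   /* retry count is arbitrary */
              uint32_t v = atomic_load_explicit(word, memory_order_acquire);
              if (v != QUEUE_EMPTY)
                  return;                 /* data arrived: back to consuming */
              syscall(SYS_futex, word, FUTEX_WAIT_PRIVATE, v, &tiny, NULL, 0);
          }
          /* Empty long enough: producers now know they must wake us. */
          syscall(SYS_futex, word, FUTEX_WAIT_PRIVATE, QUEUE_EMPTY, NULL, NULL, 0);
      }
      ```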

      And yes, this increases the latency to double-digit microseconds in this specific case, since SYS_futex obviously has a much higher overhead than an atomic_load_explicit(memory_order_acquire). But for my use case the system is basically at 0% utilization during some hours of the day and then close to 100%, and taking a small latency hit once per day when the traffic goes from 0 to 100 is a price I'm willing to pay over having to busy-wait at zero load, wasting energy and heat.



      • #63
        Originally posted by F.Ultra View Post

        What you do is that when the consumer determines that the queue is empty, it does not block on the futex right away. Instead (and here be dragons; note that I don't do this for general-purpose software, so this is designed to work on a specific piece of hardware with a specific use case) you call SYS_futex with the smallest amount of wait time a few times in a loop, until you are "sure" that the queue has been empty long enough for any producer to also notice this and thus know that it should wake the futex after its next write.

        And yes, this increases the latency to double-digit microseconds in this specific case, since SYS_futex obviously has a much higher overhead than an atomic_load_explicit(memory_order_acquire). But for my use case the system is basically at 0% utilization during some hours of the day and then close to 100%, and taking a small latency hit once per day when the traffic goes from 0 to 100 is a price I'm willing to pay over having to busy-wait at zero load, wasting energy and heat.
        This sounds like a good solution if you prefer using multiple SPSC queues over a single MPSC queue. At the same time it seems that even then, using a single futex is more performant than multiple futexes would be. Apparently you have in fact, as I wrote, separated the producer's FUTEX_WAKE logic from the queue writing itself. Insofar as you have described the situation, different producers should have no difficulty waking the same futex. So the interest in using multiple futexes per consumer seems separate and still opaque. If I had to guess, a wake-multiple API might be of more interest than a wait-multiple API.
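        For what it's worth, many producers waking one consumer through a single futex can be as simple as a shared "doorbell" word. Again just a sketch with invented names (`doorbell`, `producer_ring`, `consumer_sleep_if_idle`), Linux-only, error handling omitted:

        ```c
        #include <linux/futex.h>
        #include <stdatomic.h>
        #include <stdint.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Invented name: one doorbell word shared by all producers. */
        static _Atomic uint32_t doorbell;

        /* Any producer, whichever SPSC queue it feeds: bump the doorbell so
         * a concurrently-arming waiter's expected value goes stale, then wake. */
        static void producer_ring(void)
        {
            atomic_fetch_add_explicit(&doorbell, 1, memory_order_release);
            syscall(SYS_futex, &doorbell, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
        }

        /* Consumer: snapshot the doorbell, scan every queue, and sleep only
         * if the snapshot is still current -- a ring in between makes
         * FUTEX_WAIT return EAGAIN instead of losing the wakeup. */
        static void consumer_sleep_if_idle(int (*all_queues_empty)(void))
        {
            uint32_t seen = atomic_load_explicit(&doorbell, memory_order_acquire);
            if (!all_queues_empty())
                return;
            syscall(SYS_futex, &doorbell, FUTEX_WAIT_PRIVATE, seen, NULL, NULL, 0);
        }
        ```

        The consumer still has to rescan all of its queues after waking, since the doorbell doesn't say which producer rang; a wait-multiple API would tell it that, which is presumably where futex2 comes in.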



        • #64
          Linux 5.9.0-gentoo
          Is it implemented in kernel 3.9.0?