FUTEX2 Spun Up A Fifth Time For This Linux Interface To Help Windows Games

  • indepe
    replied
    Originally posted by coder View Post
    Why couldn't the kernel just keep a timeslice counter (if it doesn't already) in userspace memory that you could read? So, if the timeslice counter changed, then you'd know one or more preemptions occurred.
    When I said thread-local memory, I did mean user-space memory (like "errno"). So I think it could (if it doesn't already). However, this would be the first use case that I'm aware of; I don't know whether this question has been explored before.
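
    Side note: Linux does already keep a per-thread count of involuntary context switches, just not in plain user-space memory; getrusage(RUSAGE_THREAD) reports it in ru_nivcsw, at the cost of a syscall. A minimal example of checking whether a preemption happened across a region:

        #define _GNU_SOURCE          /* for RUSAGE_THREAD (Linux-specific) */
        #include <stdio.h>
        #include <sys/resource.h>

        /* Number of times the calling thread has been involuntarily preempted so far. */
        static long involuntary_switches(void)
        {
            struct rusage ru;
            getrusage(RUSAGE_THREAD, &ru);
            return ru.ru_nivcsw;
        }

        int main(void)
        {
            long before = involuntary_switches();
            /* ... critical section or measured region goes here ... */
            long after = involuntary_switches();
            printf("preempted %ld time(s) in between\n", after - before);
            return 0;
        }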

    Originally posted by coder View Post
    If I had time to spend optimizing multithreading performance, I think workstealing is an area ripe for improvements. There seems to be an ever-increasing number of libraries that each spin up their own worker threads, which compete with each other & other processes for CPU time on all of the cores. We'd ideally have kernel support for this sort of thing.
    Workstealing does look like a valid concept to me, depending on the use case, probably especially if there are many similar work items of each kind. Personally I find that dedicated (and named) threads result in a workable mental model of the application structure. Insofar as they are available, lock-free synchronization algorithms reduce, or completely eliminate, the blocking caused when another thread is preempted. However, they are usually more specialized and have a steep learning curve. I expect that to improve within the next 10 years, such that more of them will be available ready for use, with implementation details hidden from the user.
    Last edited by indepe; 13 July 2021, 10:16 PM.



  • coder
    replied
    Originally posted by indepe View Post
    For benchmark purposes, within a tight test loop, you might be able to keep track of preemption by detecting larger gaps in the TSC,
    Why couldn't the kernel just keep a timeslice counter (if it doesn't already) in userspace memory that you could read? So, if the timeslice counter changed, then you'd know one or more preemptions occurred.

    Originally posted by indepe View Post
    (And ideally somehow ignore any occasional interrupt.)
    Interrupts could be handled similarly, but they're necessarily a lot more sensitive to overhead than context switches. So, I'd just count on there not being enough interrupts to interfere much with the lock ownership stats.

    Originally posted by indepe View Post
    although I find optimizing locks very interesting, it currently isn't really important to my own situation anymore.
    If I had time to spend optimizing multithreading performance, I think workstealing is an area ripe for improvements. There seems to be an ever-increasing number of libraries that each spin up their own worker threads, which compete with each other & other processes for CPU time on all of the cores. We'd ideally have kernel support for this sort of thing.



  • indepe
    replied
    Originally posted by coder View Post
    Some of the hints in the other direction could operate on the basis of a high-resolution timestamp indicating the beginning of the timeslice. I'm sure the kernel already has this information, so it just needs to put it where userspace can read it. Then, subtract it from the current value of the TSC register and you can tell how deep you are into the timeslice. Of course, there's the potential for an interrupt to come along and invalidate your calculations, but that should be rare.
    That may be worth thinking about (the thought occurred to me as well). I guess the kernel would already know the end of the time slice and could store that in a thread-local variable. However, I think the first thing most developers check is the single-threaded, uncontended performance of an almost empty lock, and RDTSC would approximately double that time. So it would take some convincing, and benchmarks, to show the value it may have.

    For benchmark purposes, within a tight test loop, you might be able to keep track of preemption by detecting larger gaps in the TSC, and remembering the common duration between the gaps. (And ideally somehow ignore any occasional interrupt.) And then use that info to occasionally re-initiate the time slice by calling sched_yield() or something similar. That way you might be able to find out whether it is worth additional exploration.
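
    Roughly the kind of loop I have in mind, assuming x86 with an invariant TSC (the gap threshold is arbitrary and would need tuning per machine):

        #include <sched.h>       /* sched_yield() */
        #include <stdint.h>
        #include <stdio.h>
        #include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

        #define GAP_THRESHOLD 50000ULL   /* in TSC ticks; arbitrary, tune per machine */

        int main(void)
        {
            uint64_t prev = __rdtsc();
            unsigned long gaps = 0;

            for (long i = 0; i < 100000000L; i++) {
                /* ... the operation being benchmarked goes here ... */

                uint64_t now = __rdtsc();
                if (now - prev > GAP_THRESHOLD) {
                    gaps++;           /* most likely a preemption (or a long interrupt) */
                    sched_yield();    /* give up the rest of the slice, start a fresh one */
                    now = __rdtsc();
                }
                prev = now;
            }
            printf("large gaps seen: %lu\n", gaps);
            return 0;
        }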

    As I said elsewhere, personally I have the long-term intention to replace as many locks as possible with other synchronization techniques, especially where performance matters. Currently that is looking very good, though it might just be luck with my current specific situation. So although I find optimizing locks very interesting, it currently isn't really important to my own situation anymore.



  • coder
    replied
    Originally posted by F.Ultra View Post
    I don't think it does; CRITICAL_SECTION in Windows works just like an adaptive mutex in Linux
    I'm not surprised, but I think we agree that there's some opportunity to offer the kernel hints about preemption. The trick is just to do it in a way that doesn't create potential for more bugs or performance pitfalls.

    Some of the hints in the other direction could operate on the basis of a high-resolution timestamp indicating the beginning of the timeslice. I'm sure the kernel already has this information, so it just needs to put it where userspace can read it. Then, subtract it from the current value of the TSC register and you can tell how deep you are into the timeslice. Of course, there's the potential for an interrupt to come along and invalidate your calculations, but that should be rare.
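
    Roughly what the userspace side could look like. To be clear, timeslice_start_tsc below is a hypothetical per-thread field the kernel would have to maintain; nothing like it exists today, so it's declared as an ordinary variable just so the sketch compiles:

        #include <stdint.h>
        #include <stdio.h>
        #include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

        /* Hypothetical: the kernel would write the TSC value at which the current
           timeslice began into a per-thread location like this (vDSO/rseq-style).
           Here it is only a placeholder so the example compiles. */
        static __thread volatile uint64_t timeslice_start_tsc;

        /* How many TSC ticks of the current timeslice have already elapsed. */
        static inline uint64_t timeslice_depth(void)
        {
            return __rdtsc() - timeslice_start_tsc;
        }

        int main(void)
        {
            timeslice_start_tsc = __rdtsc();   /* stand-in for the kernel's update */
            /* ... a spinlock could consult timeslice_depth() before spinning ... */
            printf("ticks into the timeslice: %llu\n",
                   (unsigned long long)timeslice_depth());
            return 0;
        }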



  • F.Ultra
    replied
    Originally posted by coder View Post
    What's weird is that it's process-global. What you really want is customization on a per-instance basis. If you think about when it makes any sense to do a short spinlock, it would be for mutexes that tend to be in fairly high contention and held for a very short amount of time, such as those used to serialize access to data structures.

    BTW, something I was never entirely clear about is whether Windows' CRITICAL_SECTION actually does anything to defer preemption. Because, that's something else you'd want, when using a spinlock. Obviously, it couldn't entirely defeat preemption, but maybe it could just borrow a little time from future timeslices in order to try and delay it until the critical section is exited.
    I don't think it does; CRITICAL_SECTION in Windows works just like an adaptive mutex in Linux, in that it spins for a while (4000 cycles if I'm not mistaken) before it enters the kernel and puts the thread to sleep, so there is no need to defer preemption.
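
    For reference, the spin count on Windows is set per critical section rather than process-wide; the 4000 above is the figure commonly cited for the heap's critical section, so take the exact default with a grain of salt. A minimal example:

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            CRITICAL_SECTION cs;

            /* Spin up to 4000 times on contention before sleeping in the kernel. */
            InitializeCriticalSectionAndSpinCount(&cs, 4000);

            EnterCriticalSection(&cs);
            puts("inside the critical section");
            LeaveCriticalSection(&cs);

            DeleteCriticalSection(&cs);
            return 0;
        }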



  • indepe
    replied
    Originally posted by coder View Post
    What's weird is that it's process-global. What you really want is customization on a per-instance basis. If you think about when it makes any sense to do a short spinlock, it would be for mutexes that tend to be in fairly high contention and held for a very short amount of time, such as those used to serialize access to data structures.
    In some cases, that's what I do with semaphores, although so far it expresses more the priority of the thread and the willingness to sacrifice CPU usage at that specific point, so to speak.

    Originally posted by coder View Post
    BTW, something I was never entirely clear about is whether Windows' CRITICAL_SECTION actually does anything to defer preemption. Because, that's something else you'd want, when using a spinlock. Obviously, it couldn't entirely defeat preemption, but maybe it could just borrow a little time from future timeslices in order to try and delay it until the critical section is exited.
    Or perhaps a function that says: "If this thread is close to the end of its time slice, better preempt it now". Of course you'd also want that with locks in general, since it's generally not good if a thread gets preempted while holding a lock. But then, such a function would consume some time in itself. Maybe it's something you would do with a low-priority thread to prevent it from blocking a thread with higher priority. Maybe that's a case where it would make sense to call sched_yield() before taking a lock. <--joke.

    Originally posted by coder View Post
    Right. Time the typical ownership period of the specific mutex. If it's shorter than a syscall, use it to calibrate the mutex's spinlock time. If it's much longer than a syscall, don't even bother spinning.
    Yes, something like that. For many applications it won't make much of a difference (unless the value is far too high), while for some it may be very worthwhile to have such options in specific places. Maybe a very small value would be good in general, I don't know.



  • coder
    replied
    Originally posted by indepe View Post
    Yes, adaptive mutexes are in a sense a combination of spinlocks and mutexes, where even under contention the kernel isn't called until that spin count has been exhausted.
    What's weird is that it's process-global. What you really want is customization on a per-instance basis. If you think about when it makes any sense to do a short spinlock, it would be for mutexes that tend to be in fairly high contention and held for a very short amount of time, such as those used to serialize access to data structures.

    BTW, something I was never entirely clear about is whether Windows' CRITICAL_SECTION actually does anything to defer preemption. Because, that's something else you'd want, when using a spinlock. Obviously, it couldn't entirely defeat preemption, but maybe it could just borrow a little time from future timeslices in order to try and delay it until the critical section is exited.

    Originally posted by indepe View Post
    Using something you might call "adaptive semaphores", my experience has been that the spin count shouldn't be too high, since it can eat enormously into CPU usage.
    Right. Time the typical ownership period of the specific mutex. If it's shorter than a syscall, use it to calibrate the mutex's spinlock time. If it's much longer than a syscall, don't even bother spinning.

    A further enhancement would be to spin only if you knew the owner was executing. Otherwise, just go ahead and context switch right away.
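
    A minimal sketch of what a per-instance spin budget could look like over a raw futex. The struct and names here are made up for illustration; the calibration itself, and the "only spin if the owner is running" check, are left out:

        #define _GNU_SOURCE
        #include <linux/futex.h>
        #include <stdatomic.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        struct adaptive_lock {
            atomic_int state;      /* 0 = free, 1 = locked, 2 = locked, waiters possible */
            int spin_limit;        /* per-instance spin budget */
        };

        static void futex_wait(atomic_int *addr, int expected)
        {
            syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
        }

        static void futex_wake_one(atomic_int *addr)
        {
            syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
        }

        static void adaptive_lock_acquire(struct adaptive_lock *l)
        {
            for (int i = 0; i <= l->spin_limit; i++) {   /* bounded spin phase */
                int expected = 0;
                if (atomic_compare_exchange_strong(&l->state, &expected, 1))
                    return;                              /* got the lock while spinning */
            }
            /* Slow path: mark the lock contended and sleep until woken.
               (Kept simple; this may issue an unnecessary wake on release.) */
            while (atomic_exchange(&l->state, 2) != 0)
                futex_wait(&l->state, 2);
        }

        static void adaptive_lock_release(struct adaptive_lock *l)
        {
            if (atomic_exchange(&l->state, 0) == 2)      /* there may be sleepers */
                futex_wake_one(&l->state);
        }

        int main(void)
        {
            struct adaptive_lock l;
            atomic_init(&l.state, 0);
            l.spin_limit = 100;       /* e.g. glibc's adaptive default, but per instance */

            adaptive_lock_acquire(&l);
            /* ... short critical section ... */
            adaptive_lock_release(&l);
            return 0;
        }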



  • indepe
    replied
    Originally posted by F.Ultra View Post
    [...]; however, if you set PTHREAD_MUTEX_ADAPTIVE_NP then the mutex will spin a number of times (100 by default) on contention before calling the futex syscall, so depending on your specific needs this can increase or decrease performance, but it will never be as bad as those idiotic spinlocks in userspace code from the blog post.
    Yes, adaptive mutexes are in a sense a combination of spinlocks and mutexes, where even under contention the kernel isn't called until that spin count has been exhausted.

    Using something you might call "adaptive semaphores", my experience has been that the spin count shouldn't be too high, since it can eat enormously into CPU usage.



  • indepe
    replied
    Originally posted by F.Ultra View Post
    edit: looks like Microsoft patented futexes in 2013 long after they appeared in Linux (2003), oboy.
    Apparently the latest is that Microsoft joined the OIN in 2018, so you can forget that....

    I suppose the patent shouldn't have been granted in the first place.



  • F.Ultra
    replied
    Originally posted by indepe View Post
    Agreed. However, I'd like to add that even mutexes (at least on Linux) usually execute purely in user space as long as there is no contention, for performance reasons. The kernel is called only when there is contention.
    Yes, sorry that I didn't mention that; I wasn't sure how deep I should go. If there is no contention then the scheduler does not have to be involved, since that means there is no idle thread waiting on the lock; however, if you set PTHREAD_MUTEX_ADAPTIVE_NP then the mutex will spin a number of times (100 by default) on contention before calling the futex syscall, so depending on your specific needs this can increase or decrease performance, but it will never be as bad as those idiotic spinlocks in userspace code from the blog post.
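
    For anyone who wants to try it: the type is a glibc extension (the _NP suffix means non-portable), set through the normal mutex attribute API. A minimal example, assuming Linux/glibc and compiling with -pthread:

        #define _GNU_SOURCE   /* for PTHREAD_MUTEX_ADAPTIVE_NP */
        #include <pthread.h>
        #include <stdio.h>

        int main(void)
        {
            pthread_mutexattr_t attr;
            pthread_mutex_t mutex;

            pthread_mutexattr_init(&attr);
            /* glibc extension: spin briefly on contention before futex-waiting. */
            pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
            pthread_mutex_init(&mutex, &attr);
            pthread_mutexattr_destroy(&attr);

            pthread_mutex_lock(&mutex);
            puts("holding the adaptive mutex");
            pthread_mutex_unlock(&mutex);

            pthread_mutex_destroy(&mutex);
            return 0;
        }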

    edit: looks like Microsoft patented futexes in 2013 long after they appeared in Linux (2003), oboy.
    Last edited by F.Ultra; 11 July 2021, 08:39 PM.

