Futex2 Proposed In Latest Effort For Linux Kernel Optimization That Can Benefit Gamers
-
Last edited by indepe; 16 June 2020, 03:29 AM.
-
Originally posted by ryao View Post
Do you want to send a patch fixing it for them, or shall I when I find time? I had been thinking of modifying their code to implement spin locks the right way, but there really is no reason to do that when pthread’s spin lock implementation is available.
By the way, that atomic_load_relaxed() call will turn into a pause instruction on Intel/AMD processors. The pause instruction will stop the hardware thread from executing for a short period during which the other hardware thread sharing the core will see a performance boost as all execution resources become available to it.
It still lacks a back-off algorithm in the version I'm looking at: "/* TODO Back-off. */"
Comment
-
Originally posted by indepe View Post
In pthread’s spin lock implementation, "atomic_spin_nop ();" is the pause call. "atomic_load_relaxed()" is probably just a MOV on x86.
It still lacks a back-off algorithm in the version I'm looking at: "/* TODO Back-off. */"
Comment
-
Originally posted by F.Ultra View Post
Actually, on my machine their custom version was slower (2643M cycles vs 1844M cycles for 100M rounds of lock+unlock), which is kind of strange. However, since this is a test of 100% non-contention, each round makes one call to both atomic_load_explicit() and atomic_exchange_explicit() in their custom code, which means two loads and one store per lock. If the pthread version uses only the CAS, it does only one load and one store in that case, which makes the uncontended case faster but the contended case slower (I have not looked at how pthread implements its spinlocks).
edit: looked up the actual glibc-code and it's actually quite clever here:
Code:
int pthread_spin_lock (pthread_spinlock_t *lock)
{
  int val = 0;

  if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
    return 0;

  do
    {
      do
        {
          atomic_spin_nop ();
          val = atomic_load_relaxed (lock);
        }
      while (val != 0);
    }
  while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));

  return 0;
}
Comment
-
Originally posted by indepe View Post
Although I was wondering about that, I saw no indication of it. While I lack practical experience with using futexes across processes, in the abstract I would assume (or at least hope) that the combination of shared memory and the existing Futex API would fulfill this requirement as well. Shared memory would contain the information necessary to dispatch "multiple" signalling calls to "individual" futexes. I don't know if passing that information would make things more complicated, but some information would have to be passed anyway. On the plus side, I would hope that having this information in shared memory makes the non-blocking cases much faster.
Comment
-
It should eventually fall back to a context-switching call (such as nanosleep) in case the thread in the critical section is preempted on the same CPU.
The complexity is hinted at here (quick search for an example):
https://www.boost.org/doc/libs/1_63_.../tweaking.html
For my own use, I'm experimenting with a performance-optimized adaptive mutex, which requires a different back-off.
Comment
-
Originally posted by ryao View Post
I spoke to the developer that wrote that spinlock code. According to him, the loops calling the lock are so tight that function call overhead causes a slowdown. Furthermore, the spin locks virtually never spin because there is almost zero contention. This seems to be one of the few use cases where a spinlock in user space makes sense. The lack of PAUSE instructions should not be an issue given how the spinlock is said to be used.
Comment
-
Originally posted by ryao View Post
You might want to talk to the wine developer who has been doing the esync work about that. He seems to have been trying everything he could possibly do to make this work in a sane way. I don’t remember why, but I vaguely recall hearing that shared memory was not a workable solution for his use case.
EDIT:
It turns out esync is using shared memory itself. It seems to me that at least the basic principle of the event mechanism can be implemented efficiently using the existing Futex API, though being completely certain about it would require much more information. As best I can tell, the existing Futex API is more than flexible enough, and its shared memory support carries that flexibility across processes.
Last edited by indepe; 16 June 2020, 06:28 PM.
Comment