
Wine Developers Are Working On A New Linux Kernel Sync API To Succeed ESYNC/FSYNC


  • #31
    Originally posted by oiaohm View Post

    What you are missing is that there are sections of the Windows API/NT API that are only the slow path. Items like CMPXCHG cannot be used because multiple values in the object/handle structure need to change, and not with the values the threads wanted there, but with values the kernel wanted there. We are dealing with sections of the Windows NT design that predate the 486, predate CMPXCHG existing, and are still in active use by new applications on Windows 10. So there are structures in the Windows NT design, still present in modern Windows, that make no sense if you are thinking about performance, but Windows applications expect the behaviour.

    The system uses objects and handles to regulate access to system resources for two main reasons.


    It gets more wacky when you find that some of the old stuff has you checking ACLs on handles and objects to work out whether a program is in fact allowed to take out a lock. You cannot use CMPXCHG or an equivalent to implement this correctly either, because the ACL value is allowed to change while the program is running: a program may be allowed to take out the lock once, and the next time it attempts to take the lock it gets permission denied.

    It is really easy to think: hey, we have all this modern stuff that can do all the same things, and then miss that you need to duplicate the old behaviour from before modern times, because that is what applications expect.

    I guess you can count yourself lucky that Windows isn't actually running...

    I think what you are saying would perhaps mean that, in the extreme, you need a global lock at the entry point of each emulation of related Windows API calls, which would be unfortunate insofar as it prevents parallel execution. That way, any Windows API call would only ever encounter the completed state of another API call. Instead of the "big kernel lock" that you mention.

    However, a global lock would also simplify the implementation *a lot*, so maybe it's not that bad overall. And by the way, even a global lock requires a syscall only when there is contention.
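
    For illustration, a minimal sketch of such a global lock built on the existing futex(2) syscall (essentially the three-state mutex from Ulrich Drepper's "Futexes Are Tricky"; the helper names here are mine, not Wine's code). Both uncontended paths are a single atomic instruction; the kernel is entered only under contention:

        #include <stdatomic.h>
        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* state: 0 = free, 1 = locked, 2 = locked with waiters */
        static long futex(atomic_int *uaddr, int op, int val)
        {
            return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
        }

        void lock(atomic_int *l)
        {
            int c = 0;
            if (atomic_compare_exchange_strong(l, &c, 1))
                return;                        /* fast path: no syscall */
            if (c != 2)
                c = atomic_exchange(l, 2);     /* advertise a waiter */
            while (c != 0) {
                futex(l, FUTEX_WAIT, 2);       /* sleep until woken */
                c = atomic_exchange(l, 2);
            }
        }

        void unlock(atomic_int *l)
        {
            /* fast path: state was 1, nobody waiting, no syscall */
            if (atomic_exchange(l, 0) == 2)
                futex(l, FUTEX_WAKE, 1);       /* wake one waiter */
        }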

    And why would you want the code, handling all that mess, to go inside the Linux kernel? A recipe for disaster...

    I think I'm going to watch a video about eBPF now; I just read that it got atomic operations added...



    • #32
      Originally posted by indepe View Post
      I think what you are saying would perhaps mean that, in the extreme, you need a global lock at the entry point of each emulation of related Windows API calls, which would be unfortunate insofar as it prevents parallel execution. That way, any Windows API call would only ever encounter the completed state of another API call. Instead of the "big kernel lock" that you mention.

      However, a global lock would also simplify the implementation *a lot*, so maybe it's not that bad overall. And by the way, even a global lock requires a syscall only when there is contention.
      Except that does not work; it's the data structure nightmare. You take a lock on an NT object, and information in that object's structure is updated to include your current process information. This is a case where the process should not be setting this itself.

      Really, what is needed is warped: something goes to take out lock X, and code has to run to change the data structures before the lock is given to the process. This is horrible stuff that was thought a good idea before atomic locking, but it does have its uses at times.



      • #33
        Originally posted by oiaohm View Post

        Except that does not work; it's the data structure nightmare. You take a lock on an NT object, and information in that object's structure is updated to include your current process information. This is a case where the process should not be setting this itself.
        What? Which "current process information"? Why would that be a problem?

        If taking a global lock on each API call does not work, which seems the most extreme of measures, then what is left to do?

        Originally posted by oiaohm View Post
        Really, what is needed is warped: something goes to take out lock X, and code has to run to change the data structures before the lock is given to the process. This is horrible stuff that was thought a good idea before atomic locking, but it does have its uses at times.
        "Before atomic locking"? With all due respect, what are you talking about?
        Last edited by indepe; 20 January 2021, 08:21 AM.



        • #34
          Originally posted by Cybmax View Post

          Yeah, I am sorry for my bad semantics. "Futex wait multiple" or whatever.

          Let me ask this then:
          Does (Wine-)Proton (with the fsync patchset) work with BOTH of these patchsets (not at the same time, but separately)?
          1. https://github.com/sirlucjan/kernel-...ev-patches-sep
          2. https://github.com/sirlucjan/kernel-...unk-patches-v2

          The "futex dev patches" is the one that popped up a year or whatnot ago, and the "futex2-trunk" is the "new" patchset i have yet to try.
          I was kinda under the impression they were not the same, and the reason i ask is i wonder if they DO the same.

          It is not automatic for me to understand that "futex wait multiple" is exactly the same as "futex2". (Because if you CALL a bloody patch futex2, it is bloody well futex2 that I am going to call it.)
          Well, to answer myself (maybe): the two DIFFERENT patchsets - futex wait multiple and futex2 - are not directly interchangeable as I see it. (And from a wee bit of reading, there is also a kernel config option to enable futex2 in the latter.)

          This seems to require a patched fsync version for Proton/Wine: https://github.com/Frogging-Family/c...futex2.mypatch

          I find this interesting, so I'll try to hack something together and do my own tests, as it does not seem to be in widespread use. To the best of my knowledge and understanding, I would venture a guess that you cannot patch the kernel with BOTH "futex wait multiple" and "futex2"? Thus, if you patch ONLY with futex2, it will not work with Proton's fsync usage.
          Probably not game-breaking in itself, but most people tend to use pre-packaged Steam, and not compile their own. (The Steam runtime is, after all, a much heavier beast to self-compile than Wine.)

          Unless of course anyone has any tips/experiences regarding this.



          • #35
            Originally posted by Linuxxx View Post
            I'd be surprised if this even gets a reply from any of the kernel developers; the chances of it materializing any time soon are even slimmer...

            Anyway, for anyone wondering about what is wrong with ESYNC or FSYNC (futex2), here's a quote from Zebediah Figura:
            However, "esync" has its problems. There are some areas where eventfd just doesn't provide the necessary interfaces for NT kernel APIs, and we have to badly emulate them. As a result there are some applications that simply don't work with it. It also relies on shared memory being mapped read/write into every application; as a result object state can easily be corrupted by a misbehaving process. These problems have led to its remaining out of tree. There are also some operations that need more than one system call and hence could theoretically be improved performance-wise. I later developed a second out-of-tree patch set, the awfully named "fsync", which uses futexes instead of eventfds. It was developed by request, on the grounds that futexes can be faster than eventfds (since an uncontended object needs no syscalls). In practice, there is sometimes a positive performance difference, sometimes a negative one, and often no measurable difference at all; it also sometimes makes performance much less consistent. It shares all of the same problems as "esync", including (to a degree) some inefficiencies when executing contended waits.
            For those interested, this sounds half-baked to me. Putting the problem of corruptible memory aside for a moment, the combination of shared memory, the *existing* FUTEX API, and atomic operations should allow achieving optimal performance and the fullest possible feature set. If it doesn't, then the performance issues and operational problems probably haven't been sufficiently researched yet. It should be uniformly faster than using eventfd, unless there is something wrong in the kernel's implementation of the existing FUTEX mechanisms or the resulting scheduling (which should then be fixed). I somehow doubt that this is down to the inability to do multiple FUTEX_WAKEs with a single syscall, although that might be one possible improvement.
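
            To make concrete what I mean (a sketch under my own assumptions, not the actual "fsync" code): an NT-style manual-reset event kept in shared memory, where an already-signaled wait needs no syscall at all, and the kernel is entered only to sleep or to wake sleepers:

                #include <stdatomic.h>
                #include <limits.h>
                #include <linux/futex.h>
                #include <sys/syscall.h>
                #include <unistd.h>

                struct nt_event { atomic_int signaled; };  /* lives in shared memory */

                static long futex(atomic_int *uaddr, int op, int val)
                {
                    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
                }

                void set_event(struct nt_event *ev)
                {
                    atomic_store(&ev->signaled, 1);
                    /* One FUTEX_WAKE covers all sleepers; with a waiter count
                       in shared memory, even this syscall could be skipped
                       when nobody is sleeping. */
                    futex(&ev->signaled, FUTEX_WAKE, INT_MAX);
                }

                void wait_event(struct nt_event *ev)
                {
                    while (atomic_load(&ev->signaled) == 0)   /* fast path: no syscall */
                        futex(&ev->signaled, FUTEX_WAIT, 0);  /* sleep while unsignaled */
                }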



            • #36
              Originally posted by F.Ultra View Post

              Yes, these are (several) syscalls called by Windows applications, several of which can be used for inter-process synchronisation. I think the major one is WaitForMultipleObjects, which Windows applications use to wait on up to 64 objects at the same time, where the objects can be (among others) mutexes, semaphores, file descriptors and sockets - so unfortunately not something that can be replaced 1:1 by select/poll/epoll. The mail linked in the article explains why the current syscalls in Linux do not really fit the bill here.
              Well, that sounds like a bad API, clubbing together multiple distinct things as objects. They could've just used some ReactiveX implementation to do that instead...
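
              For reference, this is roughly what that call looks like on the Windows side; note that one blocking call mixes object types, which is exactly what select/poll/epoll has no direct equivalent for:

                  #include <windows.h>

                  void example(void)
                  {
                      HANDLE h[3];
                      h[0] = CreateEventW(NULL, TRUE, FALSE, NULL);  /* manual-reset event */
                      h[1] = CreateSemaphoreW(NULL, 0, 10, NULL);    /* semaphore */
                      h[2] = CreateMutexW(NULL, FALSE, NULL);        /* mutex */

                      /* Block until ANY handle is signaled (bWaitAll = FALSE);
                         up to MAXIMUM_WAIT_OBJECTS (64) handles per call. */
                      DWORD r = WaitForMultipleObjects(3, h, FALSE, INFINITE);
                      if (r < WAIT_OBJECT_0 + 3) {
                          /* h[r - WAIT_OBJECT_0] is the object that fired */
                      }
                  }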



              • #37
                Originally posted by indepe View Post
                What? Which "current process information"? Why would that be a problem?

                If taking a global lock on each API call does not work, which seems the most extreme of measures, then what is left to do?
                Because the locking is not that style. You have what I call a "lock mugging" in the Windows NT design. This is where you find you cannot get the lock, but you can find out which process/thread has the lock, have it terminated, and then get the lock. This is not something atomic locking is built to do.

                Originally posted by indepe View Post
                "Before atomic locking"? With all due respect, what are you talking about?
                It's exactly what I said: locking solutions from before atomic locking are found in the old sections of the Windows NT design, and also in VMS, which the lead developer of NT came from. Some of them don't map onto atomic locking at all. The ability to perform a mugging to get a lock is something you find in different solutions that predate the existence of atomic locking. Being able to perform the mugging means you need to write, securely, which process/thread currently holds the lock, so the information exists for the right process to be killed to get the lock. In reality this means a syscall of some form is not really avoidable, because you need protected code writing this information so it cannot be spoofed.

                indepe, you are thinking this has to be fast. The problem is that this part of the NT design is not about being fast, but about being able to kill off threads that hold a lock while doing less important things. Yes, this costs performance on one hand, but it can improve responsiveness on the other.

                indepe, look at all the different locks in Linux: none of them are really designed to be mugged by another thread with higher privilege/priority inside the process.

                Really, it's the methods by which you can acquire the lock that make the difference. Atomic locking is not designed around acquiring the lock by brute force, and in some places the ability to acquire a lock by brute force is useful, even if it is a slower path to implement.
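
                To illustrate the idea (a hypothetical sketch, not NT's real object layout): a lock word that stores the owner's thread id, so another thread can find out who holds the lock, have that thread terminated, and take the lock over. Note that nothing stops a misbehaving process from spoofing the owner field unless protected code writes it, which is why a syscall is hard to avoid here:

                    #include <stdatomic.h>
                    #include <stdbool.h>
                    #include <sys/types.h>

                    typedef atomic_int nt_lock;   /* 0 = free, otherwise the owner's tid */

                    bool try_lock(nt_lock *l, pid_t self)
                    {
                        int expected = 0;
                        return atomic_compare_exchange_strong(l, &expected, (int)self);
                    }

                    pid_t lock_owner(nt_lock *l)
                    {
                        return (pid_t)atomic_load(l);   /* who would be "mugged" */
                    }

                    bool mug_lock(nt_lock *l, pid_t victim, pid_t self)
                    {
                        /* After having `victim` terminated (not shown), steal the
                           lock; succeeds only if the victim still appears as owner. */
                        int expected = (int)victim;
                        return atomic_compare_exchange_strong(l, &expected, (int)self);
                    }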



                • #38
                  Originally posted by oiaohm View Post

                  Because the locking is not that style. You have what I call a "lock mugging" in the Windows NT design. This is where you find you cannot get the lock, but you can find out which process/thread has the lock, have it terminated, and then get the lock. This is not something atomic locking is built to do.

                  It's exactly what I said: locking solutions from before atomic locking are found in the old sections of the Windows NT design, and also in VMS, which the lead developer of NT came from. Some of them don't map onto atomic locking at all. The ability to perform a mugging to get a lock is something you find in different solutions that predate the existence of atomic locking. Being able to perform the mugging means you need to write, securely, which process/thread currently holds the lock, so the information exists for the right process to be killed to get the lock. In reality this means a syscall of some form is not really avoidable, because you need protected code writing this information so it cannot be spoofed.

                  indepe, you are thinking this has to be fast. The problem is that this part of the NT design is not about being fast, but about being able to kill off threads that hold a lock while doing less important things. Yes, this costs performance on one hand, but it can improve responsiveness on the other.

                  indepe, look at all the different locks in Linux: none of them are really designed to be mugged by another thread with higher privilege/priority inside the process.

                  Really, it's the methods by which you can acquire the lock that make the difference. Atomic locking is not designed around acquiring the lock by brute force, and in some places the ability to acquire a lock by brute force is useful, even if it is a slower path to implement.
                  First of all, all locking on x86 is atomic, be it inside any kernel or in user space. And I have no idea what else it could be on VMS. What kind of CPU instructions would it use?

                  I can generally understand the desire to have a lock's data in protected memory, though, especially for inter-process locks. However, that is likely to cost a lot of performance if you also want it on the so-called "fast path" in the absence of contention. Video games in particular, a major use case of Wine, usually want the fastest possible implementation. (Although some of these use cases might use separate APIs in places.)

                  The ability to perform what you call "mugging" is surely an emergency feature, if it terminates those threads, not a performance feature or something used during common operation. In any case, it can in principle also be implemented in a user-space structure, and maybe some of the pthread locks actually do so, or use a thread id for other purposes.

                  I actually said that before, in a previous response to you: "(The information which thread is holding a lock can very well be maintained in user space.)"

                  What I am saying is that this is not a question of being "atomic", just a question of whether you want the data protection (which you might also want without a "mugging" feature). You generally need to decide whether you want optimal performance or protection.
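
                  In fact pthreads already has a lock along these lines: a robust process-shared mutex records its owner's thread id, and the kernel's robust-futex list hands the lock to the next locker with EOWNERDEAD if the owner dies. A minimal sketch:

                      #include <pthread.h>
                      #include <errno.h>

                      void demo(pthread_mutex_t *m)   /* m placed in shared memory */
                      {
                          pthread_mutexattr_t a;
                          pthread_mutexattr_init(&a);
                          pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
                          pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
                          pthread_mutex_init(m, &a);

                          if (pthread_mutex_lock(m) == EOWNERDEAD) {
                              /* previous owner died holding the lock: repair shared
                                 state, then mark the mutex usable again */
                              pthread_mutex_consistent(m);
                          }
                          /* ... critical section ... */
                          pthread_mutex_unlock(m);
                      }
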
                  Last edited by indepe; 20 January 2021, 12:38 PM.



                  • #39
                    Originally posted by indepe View Post
                    Not directly, however it can be implemented on top of the existing futex syscall. Maybe that's not obvious to everyone.
                    No you can't - well, not "fully" - without introducing race conditions or performance degradation. The problem is that some of those things need to be atomic, or one syscall (for performance), because that's how Windows is designed. And obviously you can't do that from userspace; it is something that needs to be available in the kernel. "Emulating" it in userspace requires many more syscalls than one, especially for multiple objects. And that's the problem.

                    BTW I suggest you stop taking what oiaohm says seriously. He's just spouting buzzwords and technobabble, he literally has no idea what he's talking about. I've argued with him about locks, futexes and multi-threading in the past and it was obvious as fuck. With what he said I don't think he even codes (low-level) software tbh.



                    • #40
                      Originally posted by Weasel View Post
                      No you can't - well, not "fully" - without introducing race conditions or performance degradation. The problem is that some of those things need to be atomic, or one syscall (for performance), because that's how Windows is designed. And obviously you can't do that from userspace; it is something that needs to be available in the kernel. "Emulating" it in userspace requires many more syscalls than one, especially for multiple objects. And that's the problem.
                      Why would it be a problem if things need to be atomic? I don't know all the Windows specifics you might be thinking of, but waiting for multiple objects in general does not require special non-existent syscalls. In previous proposals, only a "FUTEX_WAIT_MULTIPLE" syscall was mentioned as needed. This functionality, however, can definitely be implemented on top of the existing FUTEX facilities.

                      As long as you are willing to use shared memory (which we have already talked about many times), what else do you think can be done inside the kernel, yet not in or from user space?
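
                      As a sketch of what I mean (my illustration, under the shared-memory assumption): "wait for any of N" can be built from the existing futex syscall by giving each waiter a wake word to sleep on (shared here for simplicity; per-waiter registration not shown), which every signaler bumps and wakes:

                          #include <stdatomic.h>
                          #include <limits.h>
                          #include <linux/futex.h>
                          #include <sys/syscall.h>
                          #include <unistd.h>

                          static long futex(atomic_int *uaddr, int op, int val)
                          {
                              return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
                          }

                          /* Waiter: returns the index of a signaled object. */
                          int wait_any(atomic_int *objs[], int n, atomic_int *wake_word)
                          {
                              for (;;) {
                                  int seq = atomic_load(wake_word);
                                  for (int i = 0; i < n; i++)     /* already signaled? */
                                      if (atomic_load(objs[i]) > 0)
                                          return i;
                                  /* If a signaler bumped wake_word after the check,
                                     FUTEX_WAIT returns EAGAIN and we recheck. */
                                  futex(wake_word, FUTEX_WAIT, seq);
                              }
                          }

                          void signal_obj(atomic_int *obj, atomic_int *wake_word)
                          {
                              atomic_fetch_add(obj, 1);        /* mark object signaled */
                              atomic_fetch_add(wake_word, 1);  /* invalidate sleepers' seq */
                              futex(wake_word, FUTEX_WAKE, INT_MAX);
                          }

                      The price is extra wakes and rechecks under contention, which is what a native FUTEX_WAIT_MULTIPLE would save, but it works with the futex interface as it exists today.
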
                      Last edited by indepe; 20 January 2021, 02:18 PM.

