Wine Developers Are Working On A New Linux Kernel Sync API To Succeed ESYNC/FSYNC

  • #41
    Originally posted by Weasel View Post
    "Emulating" it in userspaces requires many more syscalls than one, especially for multiple objects. And that's the problem.
    I didn't respond to this part yet. Again, unless there are some surprising Windows specifics that need to be emulated on top of the main functionality, things should generally not require many syscalls. Functions such as SET_EVENT should require a syscall only if one or more threads need a WAKE call, or if there is contention during the execution of the function. And functions such as WAIT_ANY/WAIT_ALL should require a syscall only if they actually need to wait (or if there is contention). Otherwise, such functions can often execute without any syscall at all.
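
    For illustration, here is a rough sketch of that fast-path/slow-path split for an auto-reset event built on a plain Linux futex. All names here are my own, not Wine's code, and error handling is omitted:

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct event {
        atomic_int state;    /* 0 = not set, 1 = set */
        atomic_int waiters;  /* threads currently blocked in the kernel */
    };

    static long futex(atomic_int *addr, int op, int val)
    {
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    void set_event(struct event *ev)
    {
        atomic_store(&ev->state, 1);
        /* Syscall only if somebody is actually waiting. */
        if (atomic_load(&ev->waiters) > 0)
            futex(&ev->state, FUTEX_WAKE, 1);
    }

    void wait_event(struct event *ev)
    {
        for (;;) {
            int expected = 1;
            /* Fast path: consume the event without entering the kernel. */
            if (atomic_compare_exchange_strong(&ev->state, &expected, 0))
                return;
            /* Slow path: FUTEX_WAIT rechecks state == 0 atomically in the
             * kernel, which closes the race against a concurrent set_event. */
            atomic_fetch_add(&ev->waiters, 1);
            futex(&ev->state, FUTEX_WAIT, 0);
            atomic_fetch_sub(&ev->waiters, 1);
        }
    }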



    • #42
      Originally posted by indepe View Post
      Why would it be a problem if things need to be atomic? I don't know all the Windows specifics that you might be thinking of, but waiting for multiple objects in general does not require special non-existent syscalls. In previous proposals, only a "FUTEX_WAIT_MULTIPLE" syscall was mentioned as needed. This functionality, however, can definitely be implemented on top of the existing FUTEX facilities.
      Yeah, that's fsync, not esync. esync can't do it, because its objects are eventfds, not futexes.

      This isn't just about simple futexes; those can work with "critical sections" on Windows anyway. Sometimes you need an operation to be atomic to avoid race conditions. NtPulseEvent is an example: it wakes the threads waiting for the event without (re)setting the event. esync emulates this, but the emulation has race conditions; esync is known to break some games for this reason. It does it in the name of performance, but it's not perfect.

      Here's a list from the mailing list:
      * A blocking operation, like poll() that optionally consumes things, or like read() on a vectored set of file descriptors. This doesn't necessarily mean we have to replicate the manual/auto distinction in the kernel; we can handle that in user space. This by itself doesn't actually seem all that unreasonable, but...

      * A blocking operation like the above, but corresponding to "wait-all"; i.e. which atomically reads from all descriptors. Just from skimming the code surrounding things like read() and poll(), this seems very ugly to implement.

      * A way to atomically write() to an eventfd and retrieve its current count [for semaphore and event operations].

      * A way to signal an eventfd, such that waiters are woken, but without changing its current count [i.e. an operation corresponding to NtPulseEvent].

      * A way to read the current count of an eventfd without changing it [for NtQuerySemaphore and NtQueryMutant; for NtQueryEvent we can use poll.]
      And before you say it: no, writing and THEN reading is not atomic; it's inherently prone to race conditions, AND it's two syscalls!
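
      To make the race concrete, this is roughly the shape of such a write-then-read emulation with the stock eventfd API (illustrative only, not esync's actual code):

      #include <stdint.h>
      #include <sys/eventfd.h>
      #include <unistd.h>

      void pulse_event_emulated(int efd)   /* efd from eventfd(0, 0) */
      {
          uint64_t val = 1;
          write(efd, &val, sizeof(val));   /* syscall 1: post, waking waiters */
          /* <-- race window: a thread that was NOT waiting yet can arrive
           *     here and see the event as signaled; or a waiter can consume
           *     the count first, making the read below block instead. */
          read(efd, &val, sizeof(val));    /* syscall 2: take the count back */
      }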



      • #43
        Originally posted by indepe View Post
        I didn't respond to this part yet. Again, unless there are some surprising Windows specifics that need to be emulated on top of the main functionality, things should generally not require many syscalls. Functions such as SET_EVENT should require a syscall only if one or more threads need a WAKE call, or if there is contention during the execution of the function. And functions such as WAIT_ANY/WAIT_ALL should require a syscall only if they actually need to wait (or if there is contention). Otherwise, such functions can often execute without any syscall at all.
        Yeah, contention is the performance problem here, obviously.



        • #44
          Originally posted by indepe View Post

          That would be WaitForMultipleObjects.
          Yeah, that was just a small typo :-(



          • #45
            Originally posted by Weasel View Post
            Yeah, that's fsync, not esync. esync can't do it, because its objects are eventfds, not futexes.

            This isn't just about simple futexes; those can work with "critical sections" on Windows anyway. Sometimes you need an operation to be atomic to avoid race conditions. NtPulseEvent is an example: it wakes the threads waiting for the event without (re)setting the event. esync emulates this, but the emulation has race conditions; esync is known to break some games for this reason. It does it in the name of performance, but it's not perfect.

            Here's a list from the mailing list: [...] And before you say it: no, writing and THEN reading is not atomic; it's inherently prone to race conditions, AND it's two syscalls!
            Those are problems with using eventfd, which doesn't have anything to do with what I am talking about: using futex.

            Originally posted by Weasel View Post
            Yeah, contention is the performance problem here, obviously.
            You are using my own words without any indication that you understand them. In this case, even contention will usually cost just a few spins and not require a syscall, and it would be unlikely to go any better if it were done inside the kernel. (However, you need to be careful to put an upper limit on spinning when outside the kernel; a detail that probably doesn't make much sense to you, and isn't very relevant to this discussion.)

            EDIT: And this is nothing in comparison to needing a syscall even when there is no contention.
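
            To sketch the spin-then-wait idea (my own illustration of a classic bounded-spin futex lock, not anyone's actual implementation; lock states: 0 = free, 1 = locked, 2 = locked with possible waiters):

            #include <stdatomic.h>
            #include <linux/futex.h>
            #include <sys/syscall.h>
            #include <unistd.h>

            #define SPIN_LIMIT 100  /* spinning must be bounded in user space:
                                     * the lock holder may have been preempted */

            static long futex(atomic_int *addr, int op, int val)
            {
                return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
            }

            void lock(atomic_int *l)
            {
                /* Contention usually resolves within a few spins: no syscall. */
                for (int i = 0; i < SPIN_LIMIT; i++) {
                    int expected = 0;
                    if (atomic_compare_exchange_weak(l, &expected, 1))
                        return;
                }
                /* Slow path: mark the lock contended and sleep in the kernel. */
                while (atomic_exchange(l, 2) != 0)
                    futex(l, FUTEX_WAIT, 2);
            }

            void unlock(atomic_int *l)
            {
                /* Syscall only if a waiter may have gone to sleep. */
                if (atomic_exchange(l, 0) == 2)
                    futex(l, FUTEX_WAKE, 1);
            }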
            Last edited by indepe; 20 January 2021, 08:13 PM.



            • #46
              Originally posted by indepe View Post
              First of all, all locking on x86 is atomic, be it inside any kernel or in user space. And I have no idea what else it could be on VMS. What kind of CPU instructions would it use?
              This is your first major mistake: not all locking on x86 is atomic. The i286 does not have the machine instructions to actually do atomic locking. Windows NT 3.1 targeted the DEC Alpha and MIPS (R4000 and R4400) CPUs as well as x86.
              https://devblogs.microsoft.com/oldne...17-00/?p=96835

              The DEC Alpha truly does lack the instructions to safely perform atomic operations, and the MIPS CPUs of that time frame are just as bad. So early Windows NT has a lot of pre-atomic locking methods that use the kernel as the lock master.

              Your problem here, indepe, is that you are thinking in terms of modern locking methods, while parts of the Windows NT design that are still in Windows 10 are pre-atomic-instruction methods.

              The pre-atomic locking methods work on a CPU that supports atomics, just not at ideal efficiency; but atomic methods on a CPU that does not support atomics are a problem child.

              It is really easy to miss that the core Windows NT design is not 100% modern locking, because Windows NT 3.1 supported platforms that did not support modern locking methods. The horrible part is that atomic methods cannot emulate all these pre-atomic methods without massive overhead. This is just the way it is.



              • #47
                Originally posted by oiaohm View Post

                This is your first major mistake: not all locking on x86 is atomic. The i286 does not have the machine instructions to actually do atomic locking. Windows NT 3.1 targeted the DEC Alpha and MIPS (R4000 and R4400) CPUs as well as x86.
                https://devblogs.microsoft.com/oldne...17-00/?p=96835

                The DEC Alpha truly does lack the instructions to safely perform atomic operations, and the MIPS CPUs of that time frame are just as bad. So early Windows NT has a lot of pre-atomic locking methods that use the kernel as the lock master.

                Your problem here, indepe, is that you are thinking in terms of modern locking methods, while parts of the Windows NT design that are still in Windows 10 are pre-atomic-instruction methods.

                The pre-atomic locking methods work on a CPU that supports atomics, just not at ideal efficiency; but atomic methods on a CPU that does not support atomics are a problem child.

                It is really easy to miss that the core Windows NT design is not 100% modern locking, because Windows NT 3.1 supported platforms that did not support modern locking methods. The horrible part is that atomic methods cannot emulate all these pre-atomic methods without massive overhead. This is just the way it is.
                I'm not going to spend a lot of time on this. A quick search found this:


                The 286 always asserts LOCK during an XCHG with memory operands.
                This means that XCHG is an atomic instruction.
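
                For what it's worth, this is how XCHG serves as an atomic test-and-set, with no LOCK prefix in sight (a sketch in GCC-style inline assembly; my own illustration):

                /* returns the previous value: 0 means we took the lock */
                static inline int test_and_set(volatile int *lock)
                {
                    int old = 1;
                    /* XCHG with a memory operand asserts LOCK# implicitly */
                    __asm__ __volatile__("xchg %0, %1"
                                         : "+r"(old), "+m"(*lock)
                                         :
                                         : "memory");
                    return old;
                }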



                • #48
                  Originally posted by indepe View Post

                  I'm not going to spend a lot of time on this. A quick search found this:

                  [...]
                  This means that XCHG is an atomic instruction.
                  No, it's not atomic on the 286. Multiple threads on a 286 using XCHG like that could blow up in your face. The out-of-order protection was added on the 386+; read the page you quoted a little closer, there is a 386+ note there for a reason.



                  • #49
                    Originally posted by oiaohm View Post

                    No, it's not atomic on the 286. Multiple threads on a 286 using XCHG like that could blow up in your face. The out-of-order protection was added on the 386+; read the page you quoted a little closer, there is a 386+ note there for a reason.
                    It says that XCHG always asserts a lock, which means even without a LOCK prefix. The 386 note is about the prefix.



                    • #50
                      Originally posted by indepe View Post
                      Those are problems with using eventfd, which doesn't have anything to do with what I am talking about: using futex.
                      No, futex solves ONLY the wait-multiple problem compared to esync, but none of the other items listed.
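
                      For reference, wait-multiple itself looks roughly like this with the futex_waitv() syscall that later landed in Linux 5.16 (a raw-syscall sketch assuming 5.16+ headers, since glibc has no wrapper; error handling omitted):

                      #include <stdint.h>
                      #include <linux/futex.h>   /* struct futex_waitv, FUTEX_32 */
                      #include <sys/syscall.h>
                      #include <unistd.h>
                      #include <time.h>

                      /* Wait until any one of n futexes (kernel caps n at 128)
                       * changes from its expected value: a WAIT_ANY across many
                       * objects in ONE syscall. Returns the index that woke us. */
                      long wait_any(uint32_t *uaddrs[], uint32_t expected[], int n)
                      {
                          struct futex_waitv wv[128] = {0};
                          for (int i = 0; i < n; i++) {
                              wv[i].val   = expected[i];
                              wv[i].uaddr = (uintptr_t)uaddrs[i];
                              wv[i].flags = FUTEX_32;
                          }
                          return syscall(SYS_futex_waitv, wv, n, 0,
                                         NULL, CLOCK_MONOTONIC);
                      }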

                      Originally posted by indepe View Post
                      You are using my own words without any indication that you understand them. In this case, even contention will usually just be a few spins and not require a syscall, and would unlikely be any better if it were done inside the kernel. (However you need to be careful to limit spinning to an upper limit when outside the kernel. Which is a detail that probably doesn't make much sense to you, and isn't very relevant to this discussion.)

                      EDIT: And this is nothing in comparison to needing a syscall even when there is no contention.
                      "usually" doesn't mean "always". If you don't understand the problem that contention is the reason for the performance loss then feel free to live in your imaginary world.

                      Some games work well with esync or fsync, but not every game. Let's say 99% of games work well. That's your "usually". But we don't care about those games in this thread; this isn't about them. We care about the other 1%: the games that use too many threads and have too much contention. And that's the whole point of this thing. Going from 100 fps to 80 fps, for example.

                      We're not talking about the "average" here. We're talking about this specific situation, where certain games suffer from it due to whatever design they have. Call it wrong, call it buggy, I don't care. The games are built that way, and there's nothing you or Wine can do about it, other than emulate Windows better (since they run better on Windows), which means one atomic syscall for these operations.

