Wine Developers Are Working On A New Linux Kernel Sync API To Succeed ESYNC/FSYNC


  • #51
    Originally posted by oiaohm View Post

    This is your first major mistake. Not all locking on x86 is atomic. The i286 does not have the machine code/assembly in the CPU to actually do atomic locking. Windows NT 3.1 targeted DEC Alpha and MIPS (R4000 and R4400) CPUs as well as x86.
    https://devblogs.microsoft.com/oldne...17-00/?p=96835

    The DEC Alpha truly does lack the instructions to safely perform atomic operations, and the MIPS CPUs of that time frame are just as bad. So early Windows NT has a lot of pre-atomic locking methods that use the kernel as the lock master.

    indepe, the problem here is that parts of the Windows NT design that are still in Windows 10 are pre-atomic-instruction methods.

    The pre-atomic locking methods work on a CPU supporting atomics, just not at ideal efficiency, but the atomic methods on a CPU that does not support atomics are a problem child.

    It is really simple to miss that the Windows NT design at its core is not 100% modern locking, because Windows NT 3.1 supported platforms that did not support modern locking methods. The horrible part here is that atomic methods cannot be used to emulate all these pre-atomic methods without massive overhead. This is just the way it is.
    Damn dude I didn't realize you were still using a 286 with Windows 10. Can you sell me your magic?

    Comment


    • #52
      Originally posted by Weasel View Post
      No, futex solves ONLY the wait-multiple problem compared to esync, but nothing else listed.
      I don't know if you perhaps think "futex" == "fsync". However, I am not talking about fsync, which I don't know much about except that I think it tried to use a kernel patch referred to as "futex2".

      I am talking about what is possible with the existing futex kernel API and atomic operations, not about some existing specific implementation like fsync.

      Most of those problems in that list explicitly referred to "eventfd", which I am not talking about either.

      So I don't know what principal problem you think there would be in implementing the necessary functions using the existing kernel API.

      Or why you would think that.

      Originally posted by Weasel View Post
      "usually" doesn't mean "always". If you don't understand that contention is the reason for the performance loss, then feel free to live in your imaginary world.

      Some games work well with esync or fsync, but not every game. Let's say 99% of games work well. That's your "usually". But we don't care about those games in this thread; this isn't about them. We care about the other 1%. The games that use too many threads and have too much contention don't. And that's the whole point of this thing. Going from 100 fps to 80 fps for example.

      We're not talking about the "average" here. We're talking about this specific situation where certain games suffer from it, due to whatever design they have. Call it wrong, call it buggy, I don't care. The games are built that way and there's nothing you or Wine can do about it, other than emulate Windows better (since they run better on Windows), which is one atomic syscall for these operations.
      This part is why I think you might assume I am talking about fsync: "Some games work well with esync or fsync, but not every game."

      However, I am not talking about fsync at all.
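
A minimal sketch of what the existing futex kernel API plus atomic operations makes possible, as referred to above: a hand-rolled manual-reset event. Everything except the raw syscall is invented here for illustration; this is not Wine's or fsync's code.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrappers over the raw futex syscall (glibc has no wrapper for it). */
static long futex_wait(_Atomic uint32_t *addr, uint32_t expected)
{
    /* Sleeps only if *addr still equals `expected`; the check and the
       sleep are atomic with respect to FUTEX_WAKE on the same address. */
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long futex_wake(_Atomic uint32_t *addr, int nwaiters)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}

/* A manual-reset event built only on an atomic word: 0 = unset, 1 = set. */
static void event_set(_Atomic uint32_t *ev)
{
    atomic_store(ev, 1);
    futex_wake(ev, INT32_MAX);   /* wake every waiter */
}

static void event_wait(_Atomic uint32_t *ev)
{
    while (atomic_load(ev) == 0)
        futex_wait(ev, 0);       /* returns at once if set meanwhile */
}
```

The point is that the kernel only gets involved to sleep and to wake; all the state lives in userspace atomics.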

      Comment


      • #53
        Originally posted by Weasel View Post
        Damn dude I didn't realize you were still using a 286 with Windows 10. Can you sell me your magic?
        Obviously it is the magic of "pre atomic methods", which "atomic methods cannot be used to emulate". Which is why it would be no problem to run WINE on DEC Alpha. It has that magic.

        Comment


        • #54
          Originally posted by Weasel View Post
          The games are built that way and there's nothing you or Wine can do about it, other than emulate Windows better (since they run better on Windows), which is one atomic syscall for these operations.
          Just noticed there is again an indication that you seem to suggest that syscalls are needed to make things atomic. As was suggested before when you wrote: "Sometimes you need an operation to be atomic to avoid race conditions."

          That isn't true in two ways:

          a) You don't need a syscall to make multiple operations atomic. For example, a spinlock doesn't use syscalls at all (not even under contention) and will do just fine. That's not a recommendation for spinlocks, though.

          b) Just using a syscall doesn't make things atomic. You still need a lock inside the syscall, or other atomic operations.
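
Point (a) as a sketch, using C11 atomics and no syscalls anywhere (type and function names are invented for illustration):

```c
#include <stdatomic.h>

/* A spinlock that never enters the kernel, not even under contention. */
typedef struct { atomic_flag held; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_acquire(spinlock_t *l)
{
    /* test-and-set is a single atomic instruction on x86 (lock xchg);
       contended threads just burn CPU instead of sleeping in a syscall. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* spin */
}

static void spin_release(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```
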

          Comment


          • #55
            Originally posted by indepe View Post
            Just noticed there is again an indication that you seem to suggest that syscalls are needed to make things atomic. As was suggested before when you wrote: "Sometimes you need an operation to be atomic to avoid race conditions."

            That isn't true in two ways:

            a) You don't need a syscall to make multiple operations atomic. For example, a spinlock doesn't use syscalls at all (not even under contention) and will do just fine. That's not a recommendation for spinlocks, though.

            b) Just using a syscall doesn't make things atomic. You still need a lock inside the syscall, or other atomic operations.
            Yeah of course you don't need syscalls to make things atomic. I actually love atomic instructions that can be used for simple locks in userspace. I'm a fan of userspace synchronization, but unfortunately that can work reliably only on apps that I actually develop, not 3rd party (like the Windows games).

            The syscall needs to do everything because, if you use only syscalls to signal whatever sync operations you need, each one must "finish" before userspace gets a chance to be scheduled. The kernel can control this scheduling, but userspace can't. I'll give a practical example below.

            Ok so, I doubt what you mean can actually work, but I'm open to being proven wrong and always like to learn new stuff; maybe you have some genius idea. So let's just look at one example that the current esync/fsync patches can't handle properly: PulseEvent.

            Basically, what it does is wake up any threads waiting for the event, without changing the event's state. Sounds simple, right? (Please don't start about it being a "badly designed API". I'm fully aware of it, and by NO means do I advocate using it! Unfortunately, existing applications and games DO use it, and that's why it must be implemented.)

            One way you can badly emulate it is by setting the event (which wakes up a waiting thread) followed by resetting the event. Obviously, this is a race condition, because you have TWO syscalls here, and there's no guarantee that your "reset event" takes effect before any other thread uses the event, even though it should. In fact, your thread could get completely halted right after the first syscall for ages.

            So, you can protect it with a lock, right?
            Code:
            PulseEvent()
            {
              acquire_lock();  // <-- spinlock to avoid another syscall
              set_event_and_wake_thread();
              reset_event();
              release_lock();
            }
            Let's ignore the fact that we have two syscalls here; this is still a major problem for contention performance. You see, if another thread uses PulseEvent on the same event while we have the lock acquired, it will have to wait as much as TWO SYSCALLS' worth of time before it even begins (the first thread must finish).

            If the whole thing was just ONE syscall (something that fully implements PulseEvent), none of this would be an issue. No thread would ever wait, there would be at most one syscall, etc.

            So I fail to understand what you mean. Can you show a simple pseudo-code example for PulseEvent, just so I can see what you're trying to say better?

            Again, I understand you can do almost any (or even any) synchronization with just futexes or spinlocks, but that's only when you control the design of the application yourself, and this isn't such a case.

            Comment


            • #56
              Originally posted by Weasel View Post
              Damn dude I didn't realize you were still using a 286 with Windows 10. Can you sell me your magic?
              I was referring to how old the method is.

              Originally posted by indepe View Post
              Obviously it is the magic of "pre atomic methods", which "atomic methods cannot be used to emulate". Which is why it would be no problem to run WINE on DEC Alpha. It has that magic.
              No, that would not work, because modern-day Windows applications contain pre-atomic methods alongside other code that uses atomic methods. Anyone porting code to DEC Alpha back in the Windows NT 4.0 days used to run into that problem as well: code areas that were using atomic methods, which were fine on an i386 or better, were now totally screwed.

              Pre-atomic methods for doing locking can be used on a CPU that supports atomic methods, but atomic methods cannot replace them in all cases.

              Yes, pre-atomic CPUs have their fair share of fun emulating atomic methods, with lots of oddball corner cases.

              This is one of those "if it's not broken, don't fix it" paths. Pre-atomic locking methods work just as well on CPUs that support atomic locking as they did on pre-atomic CPUs, with the same level of performance problems.

              The hard case is pre-atomic locks that are muggable locks. This is where one process takes out a lock, and another process that wants the lock uses information about the lock to kill the process holding it, so it can take or free the lock. How do you safely record which process holds the lock without a syscall?

              https://www.ryadel.com/en/unlock-fil...ocess-windows/

              The unlocker tool here exploits the muggable locks.

              Modern atomic locking is designed around acquire and release; most modern courses on locking teach only these two. When you go back to pre-atomic locking, there is an extra operation called Take. Take means your process will get the lock no matter what, even if another process currently holds it.

              Pre-atomic locking model:
              Take: acquire the lock by force, including killing whoever holds it if required. This also means you will want to check ACLs and other security settings to see whether a process is allowed to brute-force its way onto a lock.
              Acquire: wait until the lock can be obtained.
              Release: let go of the lock.
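
A sketch of that Take/Acquire/Release model as an owner-recording lock word (all names here are invented; Linux robust futexes use a similar TID-in-the-word convention, and the ACL checks described above would need kernel help — nothing in userspace can enforce them):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Lock word: 0 = free, otherwise the holder's thread/process ID. */
typedef _Atomic uint32_t ownerlock_t;

static int try_acquire(ownerlock_t *l, uint32_t my_tid)
{
    uint32_t expected = 0;
    /* Succeeds only if the lock is free; records us as the owner. */
    return atomic_compare_exchange_strong(l, &expected, my_tid);
}

static void release(ownerlock_t *l, uint32_t my_tid)
{
    uint32_t expected = my_tid;
    atomic_compare_exchange_strong(l, &expected, 0); /* only holder frees */
}

/* Take: seize the lock unconditionally and report who held it, so the
   caller can apply policy (kill, suspend, or hand the lock back later). */
static uint32_t take(ownerlock_t *l, uint32_t my_tid)
{
    return atomic_exchange(l, my_tid);   /* previous owner's TID, 0 if free */
}
```
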

              This pre-atomic locking model is inside Windows in different areas. Atomic locking was designed to be nicely high performing, but it was not designed to be brute-forced with take-lock methods.

              There are even subsets inside take-lock methods.

              For example, you take a lock by force from a process, and the process is not killed, just suspended until the higher-privilege process releases the lock. Yes, there are a few places inside Windows where you can do this perfectly validly, like anti-virus software scanning a locked file.

              indepe, this is a horrible one to consider given that pre-atomic locking supports it: how are you going to do a take-lock operation with atomic locking that does not kill the process holding the lock, when a higher-privilege process takes the lock for its own use and is meant to return it to the lower-privilege process later, without the lower-privilege process ever knowing that the lock was pickpocketed from it and put back?

              This is the problem with the idea that all locking is atomic locking: people doing computer courses are taught that, and it is not true when you get into the trickier sections of Windows, because Windows contains pre-atomic locking in places, and pre-atomic locking allows the horrible take-a-lock-by-force with different levels of sleight of hand.

              Comment


              • #57
                Originally posted by Weasel View Post
                Yeah of course you don't need syscalls to make things atomic. I actually love atomic instructions that can be used for simple locks in userspace. I'm a fan of userspace synchronization, but unfortunately that can work reliably only on apps that I actually develop, not 3rd party (like the Windows games).

                The syscall needs to do everything because, if you use only syscalls to signal whatever sync operations you need, each one must "finish" before userspace gets a chance to be scheduled. The kernel can control this scheduling, but userspace can't. I'll give a practical example below.

                Ok so, I doubt what you mean can actually work, but I'm open to being proven wrong and always like to learn new stuff; maybe you have some genius idea. So let's just look at one example that the current esync/fsync patches can't handle properly: PulseEvent.

                Basically, what it does is wake up any threads waiting for the event, without changing the event's state. Sounds simple, right? (Please don't start about it being a "badly designed API". I'm fully aware of it, and by NO means do I advocate using it! Unfortunately, existing applications and games DO use it, and that's why it must be implemented.)

                One way you can badly emulate it is by setting the event (which wakes up a waiting thread) followed by resetting the event. Obviously, this is a race condition, because you have TWO syscalls here, and there's no guarantee that your "reset event" takes effect before any other thread uses the event, even though it should. In fact, your thread could get completely halted right after the first syscall for ages.

                So, you can protect it with a lock, right?
                Code:
                PulseEvent()
                {
                  acquire_lock();  // <-- spinlock to avoid another syscall
                  set_event_and_wake_thread();
                  reset_event();
                  release_lock();
                }
                Let's ignore the fact that we have two syscalls here; this is still a major problem for contention performance. You see, if another thread uses PulseEvent on the same event while we have the lock acquired, it will have to wait as much as TWO SYSCALLS' worth of time before it even begins (the first thread must finish).

                If the whole thing was just ONE syscall (something that fully implements PulseEvent), none of this would be an issue. No thread would ever wait, there would be at most one syscall, etc.

                So I fail to understand what you mean. Can you show a simple pseudo-code example for PulseEvent, just so I can see what you're trying to say better?

                Again, I understand you can do almost any (or even any) synchronization with just futexes or spinlocks, but that's only when you control the design of the application yourself, and this isn't such a case.
                Maybe looking at a practical example is a good way to go. However, it might take some work to get on the same page, looking at what you wrote. At first it indeed sounds simple, too simple, so I wonder, and that may be because I'm not familiar with the Windows side of things. However, it appeared that esync and fsync were able to translate the logic and operations which they support without requiring special syscalls, other than WAIT_MULTIPLE in the case of fsync. And there was the claim that these things also make sense when translated into the Linux world.

                So maybe PulseEvent poses special problems that are hidden from plain view. For example, why do you think there need to be 2 syscalls unless the whole thing is a single syscall?

                In the absence of knowing about any Windows specifics, I would think of something like:

                Code:
                {
                   acquire_lock_of_event(e);
                   thread t = thread_waiting_for_event(e);
                   clear_waitlist_of_event(e);
                   reset_event(e);
                   release_lock_of_event(e);
                   if (t != NULL) {
                      wake_thread( t );
                   }
                }
                Only wake_thread would invoke a syscall, and only if there is actually a thread waiting.
                (EDIT: This is of course a simplification. For example instead of the "thread t", it would be something like "wait_entry_for_thread".)
                (EDIT 2: Note that the potential syscall in wake_thread is outside the lock.)
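
One hedged way to flesh out the pseudo-code above with real futexes: an eventcount-style pulse where the only syscall is the wake, issued only when someone is actually waiting. All helper names are invented; this is a sketch, not Wine's implementation.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct {
    _Atomic uint32_t seq;      /* generation counter; also the futex word */
    _Atomic int      waiters;  /* threads currently blocked in pulse_wait */
} pulse_event_t;

static long sys_futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Block until the next pulse (PulseEvent semantics: no sticky state). */
static void pulse_wait(pulse_event_t *e)
{
    uint32_t seq = atomic_load(&e->seq);
    atomic_fetch_add(&e->waiters, 1);
    /* FUTEX_WAIT re-checks the word in the kernel: if a pulse already
       bumped seq, this returns immediately instead of sleeping. */
    sys_futex(&e->seq, FUTEX_WAIT, seq);
    atomic_fetch_sub(&e->waiters, 1);
}

/* Wake everyone currently waiting without leaving the event signaled. */
static void pulse_event(pulse_event_t *e)
{
    atomic_fetch_add(&e->seq, 1);                   /* invalidate stale waits */
    if (atomic_load(&e->waiters) > 0)
        sys_futex(&e->seq, FUTEX_WAKE, INT32_MAX);  /* the only syscall */
}
```

Note the same property as the pseudo-code: the uncontended pulse path (no waiters) makes no syscall at all.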
                Last edited by indepe; 21 January 2021, 06:43 PM.

                Comment


                • #58
                  Originally posted by oiaohm View Post
                  No, that would not work, because modern-day Windows applications contain pre-atomic methods alongside other code that uses atomic methods. Anyone porting code to DEC Alpha back in the Windows NT 4.0 days used to run into that problem as well: code areas that were using atomic methods, which were fine on an i386 or better, were now totally screwed.

                  Pre-atomic methods for doing locking can be used on a CPU that supports atomic methods, but atomic methods cannot replace them in all cases.
                  WTF is a pre-atomic method? Give actual examples.

                  Comment


                  • #59
                    Originally posted by indepe View Post
                    So maybe PulseEvent poses special problems that are hidden from plain view. For example, why do you think there need to be 2 syscalls unless the whole thing is a single syscall?

                    In the absence of knowing about any Windows specifics, I would think of something like:

                    Code:
                    {
                       acquire_lock_of_event(e);
                       thread t = thread_waiting_for_event(e);
                       clear_waitlist_of_event(e);
                       reset_event(e);
                       release_lock_of_event(e);
                       if (t != NULL) {
                          wake_thread( t );
                       }
                    }
                    Only wake_thread would invoke a syscall, and only if there is actually a thread waiting.
                    (EDIT: This is of course a simplification. For example instead of the "thread t", it would be something like "wait_entry_for_thread".)
                    (EDIT 2: Note that the potential syscall in wake_thread is outside the lock.)
                    Ok, I understand what you're trying to say now. You want to implement the whole thing in userspace, just with locks to protect the code from races.

                    Tbh, it sounds pretty nice in practice, but I'm guessing this approach suffers from contention or some other thing they measured. I don't know—they seem to be specifically looking for "kernel options". BTW esync currently emulates some of these things pretty badly, not just in terms of performance, but having race conditions. That's why some weird games don't even work with esync on.

                    Anyway you gave me some ideas to try for my "lockless" (as in, syscalls, not atomic locks) design of an app I have.

                    Comment


                    • #60
                      Originally posted by Weasel View Post
                      Ok, I understand what you're trying to say now. You want to implement the whole thing in userspace, just with locks to protect the code from races.
                      Sounds about right!

                      Originally posted by Weasel View Post
                      Tbh, it sounds pretty nice in practice, but I'm guessing this approach suffers from contention or some other thing they measured. I don't know—they seem to be specifically looking for "kernel options". BTW esync currently emulates some of these things pretty badly, not just in terms of performance, but having race conditions. That's why some weird games don't even work with esync on.
                      For an efficient implementation, I think it will be most important to recycle any dynamic memory, instead of de-allocating and re-allocating it, and to do even that and everything else outside the locked sections as much as possible. (And perhaps in other specific places use lockfree atomics whenever possible.)
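
One way that recycling idea could look: a Treiber-style free list of wait entries, so the hot path never calls malloc/free while an event lock is held. All names are invented; as written, the pop is safe only with a single consumer (a fully concurrent pop would need ABA protection, e.g. tagged pointers).

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct wait_entry {
    struct wait_entry *next;
    /* ...per-waiter fields (thread id, futex word, ...) would go here */
} wait_entry_t;

static _Atomic(wait_entry_t *) freelist = NULL;

/* Lock-free push: return an entry to the pool instead of freeing it. */
static void entry_recycle(wait_entry_t *e)
{
    e->next = atomic_load(&freelist);
    while (!atomic_compare_exchange_weak(&freelist, &e->next, e))
        ;   /* failed CAS reloads the current head into e->next */
}

/* Pop a recycled entry, or allocate a fresh one on a pool miss. */
static wait_entry_t *entry_get(void)
{
    wait_entry_t *e = atomic_load(&freelist);
    while (e && !atomic_compare_exchange_weak(&freelist, &e, e->next))
        ;   /* e is refreshed with the current head on failure */
    return e ? e : calloc(1, sizeof *e);
}
```
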

                      Originally posted by Weasel View Post
                      Anyway you gave me some ideas to try for my "lockless" (as in, syscalls, not atomic locks) design of an app I have.
                      Yep....

                      Comment
