FUTEX2 Spun Up A Fifth Time For This Linux Interface To Help Windows Games


  • #41
    Originally posted by F.Ultra View Post
    I don't think it does, CRITICAL_SECTION in Windows works just like an adaptive mutex in Linux
    I'm not surprised, but I think we agree that there's some opportunity to offer the kernel hints about preemption. The trick is just to do it in a way that doesn't create potential for more bugs or performance pitfalls.

    Some of the hints in the other direction could operate on the basis of a high-resolution timestamp indicating the beginning of the timeslice. I'm sure the kernel already has this information, so it just needs to put it where userspace can read it. Then, subtract it from the current value of the TSC register and you can tell how deep you are into the timeslice. Of course, there's the potential for an interrupt to come along and invalidate your calculations, but that should be rare.
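
    A minimal sketch of what reading such a timestamp could look like from userspace, assuming a hypothetical per-thread field slice_start_tsc that the kernel would update whenever it schedules the thread (no such interface exists today, so the name and mechanism are assumptions):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    /* Hypothetical: in a real interface the kernel would write the TSC value
     * at which the current timeslice began into this per-thread variable.
     * Here it is only a placeholder so the sketch compiles. */
    static __thread uint64_t slice_start_tsc;

    /* Rough estimate of how many cycles of the timeslice are already used.
     * An interrupt between the kernel's write and this read can skew the
     * result, but that should be rare. */
    static inline uint64_t cycles_into_timeslice(void)
    {
        return __rdtsc() - slice_start_tsc;
    }

    /* Example policy: only spin on a contended lock if we are presumably
     * still early in the timeslice (the threshold is an arbitrary guess). */
    static inline int worth_spinning(uint64_t approx_slice_cycles)
    {
        return cycles_into_timeslice() < approx_slice_cycles / 2;
    }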

    Comment


    • #42
      Originally posted by coder View Post
      Some of the hints in the other direction could operate on the basis of a high-resolution timestamp indicating the beginning of the timeslice. I'm sure the kernel already has this information, so it just needs to put it where userspace can read it. Then, subtract it from the current value of the TSC register and you can tell how deep you are into the timeslice. Of course, there's the potential for an interrupt to come along and invalidate your calculations, but that should be rare.
      That may be worth thinking about (the thought occurred to me as well). I guess the kernel would already know the end of the time slice and could store that in a thread-local variable. I think the first thing most developers check is the single-threaded, uncontended performance of an almost-empty lock, and RDTSC would roughly double that time. So it would take some convincing, and benchmarks, to show the value it may have.
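
      For reference, the kind of micro-benchmark meant here might look roughly like the sketch below: an uncontended pthread mutex as the "almost empty lock", timed once plain and once with one extra RDTSC per acquisition, to see what the timestamp read alone would add (numbers will of course vary by CPU):

      #include <pthread.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <x86intrin.h>

      #define ITERS 10000000ULL

      int main(void)
      {
          pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
          volatile uint64_t sink = 0;

          /* Baseline: uncontended lock/unlock. */
          uint64_t t0 = __rdtsc();
          for (uint64_t i = 0; i < ITERS; i++) {
              pthread_mutex_lock(&m);
              pthread_mutex_unlock(&m);
          }
          uint64_t t1 = __rdtsc();

          /* Same loop, plus one RDTSC per acquisition, as a
           * timeslice-aware lock would need. */
          for (uint64_t i = 0; i < ITERS; i++) {
              pthread_mutex_lock(&m);
              sink += __rdtsc();
              pthread_mutex_unlock(&m);
          }
          uint64_t t2 = __rdtsc();

          printf("plain:      %.1f cycles/iter\n", (double)(t1 - t0) / ITERS);
          printf("with rdtsc: %.1f cycles/iter\n", (double)(t2 - t1) / ITERS);
          (void)sink;
          return 0;
      }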

      For benchmark purposes, within a tight test loop, you might be able to keep track of preemption by detecting larger gaps in the TSC, and remember the common duration between the gaps. (And ideally somehow ignore any occasional interrupt.) And then use that info to occasionally re-initiate the time slice by calling sched_yield() or similar. That way you might be able to find out whether it is worth further exploration.
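
      A rough sketch of that idea, with the gap threshold as an arbitrary guess that would need tuning (interrupts will also show up as gaps, so this overcounts somewhat):

      #include <sched.h>      /* sched_yield() */
      #include <stdint.h>
      #include <x86intrin.h>  /* __rdtsc() */

      /* Spin in a tight loop; a TSC gap much larger than one loop iteration
       * is taken as evidence that we were preempted (or interrupted). */
      static uint64_t count_gaps(uint64_t iterations, uint64_t gap_threshold)
      {
          uint64_t gaps = 0;
          uint64_t prev = __rdtsc();

          for (uint64_t i = 0; i < iterations; i++) {
              uint64_t now = __rdtsc();
              if (now - prev > gap_threshold) {
                  gaps++;
                  /* Optionally give up the CPU here so the next measurement
                   * starts near the beginning of a fresh timeslice. */
                  sched_yield();
                  now = __rdtsc();
              }
              prev = now;
          }
          return gaps;
      }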

      As I said elsewhere, I personally have the long-term intention to replace as many locks as possible with other synchronization techniques, especially where performance matters. Currently that looks very promising, though it might just be luck with my current specific situation. So although I find optimizing locks very interesting, it currently isn't really important to my own situation anymore.

      Comment


      • #43
        Originally posted by indepe View Post
        For benchmark purposes, within a tight test loop, you might be able to keep track of preemption by detecting larger gaps in the TSC,
        Why couldn't the kernel just keep a timeslice counter (if it doesn't already) in userspace memory that you could read? So, if the timeslice counter changed, then you'd know one or more preemptions occurred.
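
        If such a counter existed (it is purely hypothetical here; Linux doesn't export one in this form today), using it could be as simple as:

        #include <stdint.h>

        /* Hypothetical per-thread counter the kernel would increment every
         * time this thread is preempted, exposed read-only to userspace.
         * Defined here only as a placeholder so the sketch compiles. */
        static __thread volatile uint32_t preempt_count;

        /* Run fn() and report whether the thread was preempted meanwhile. */
        static int ran_without_preemption(void (*fn)(void *), void *arg)
        {
            uint32_t before = preempt_count;
            fn(arg);
            return preempt_count == before;
        }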

        Originally posted by indepe View Post
        (And ideally somehow ignore any occasional interrupt.)
        Interrupts could be handled similarly, but they're necessarily a lot more sensitive to overhead than context switches. So, I'd just count on there not being enough interrupts to interfere much with the lock ownership stats.

        Originally posted by indepe View Post
        although I find optimizing locks very interesting, it currently isn't really important to my own situation anymore.
        If I had time to spend optimizing multithreading performance, I think workstealing is an area ripe for improvements. There seems to be an ever-increasing number of libraries that each spin up their own worker threads, which compete with each other & other processes for CPU time on all of the cores. We'd ideally have kernel support for this sort of thing.

        Comment


        • #44
          Originally posted by coder View Post
          Why couldn't the kernel just keep a timeslice counter (if it doesn't already) in userspace memory that you could read? So, if the timeslice counter changed, then you'd know one or more preemptions occurred.
          When I said thread-local memory, that does mean user space memory (like "errno"). So I think it could (if it doesn't already). However, this would be the first use case that I'm aware of. I don't know if this question has been explored before.

          Originally posted by coder View Post
          If I had time to spend optimizing multithreading performance, I think workstealing is an area ripe for improvements. There seems to be an ever-increasing number of libraries that each spin up their own worker threads, which compete with each other & other processes for CPU time on all of the cores. We'd ideally have kernel support for this sort of thing.
          Workstealing does look like a valid concept to me, depending on the use case, probably especially if there are many similar work items of each kind. Personally I find that dedicated (and named) threads result in a workable mental model of the application structure. Insofar as they are available, lock-free synchronization algorithms reduce, or completely eliminate, blocking caused by preemption of another thread. However, they are usually more specialized and have a steep learning curve. I expect that to improve within the next 10 years, such that more of them will be available ready for use, with implementation details hidden from the user.
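
          As a tiny illustration of what such a lock-free structure looks like in practice (a sketch, not production code): a push onto a Treiber stack using C11 atomics. No thread ever holds a lock, so a thread being preempted mid-push cannot block the others; they just retry the compare-and-swap. A complete stack also needs a pop, which has to deal with the ABA problem and memory reclamation, which is part of the learning curve mentioned above.

          #include <stdatomic.h>
          #include <stdlib.h>

          struct node {
              struct node *next;
              int value;
          };

          static _Atomic(struct node *) top = NULL;

          static void push(int value)
          {
              struct node *n = malloc(sizeof(*n));
              n->value = value;
              /* Publish the new node with a CAS loop; on failure, n->next is
               * refreshed to the current top and we simply try again. */
              n->next = atomic_load_explicit(&top, memory_order_relaxed);
              while (!atomic_compare_exchange_weak_explicit(
                         &top, &n->next, n,
                         memory_order_release, memory_order_relaxed))
                  ;
          }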
          Last edited by indepe; 13 July 2021, 10:16 PM.

          Comment
