**mdedetrich** · 27 May 2022, 06:37 AM

Originally posted by oiaohm

Depends on which regression test failed. Some regression tests are there because some application has required that feature. So if it is one of those tests that fails, you have in fact broken backwards compatibility. In this case Nvidia did fail tests that exist because someone screwed up and broke userspace the same way in the past, and the kernel had to be fixed.

If you want to be technically correct, breaking backwards compatibility has a very specific meaning in software development. For example, changing the signature of a C function (let's say by adding a parameter) along with its C header is breaking backwards compatibility, because it means that currently existing code won't work with it (i.e. linking) and it forces all code that uses that C function to modify its C source and recompile against it.

This is different from a change of behavior or just basic regressions in tests. If backwards compatibility were broken, the tests wouldn't even run.
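
To make that distinction concrete, here is a rough sketch. It is in Python rather than C purely for brevity, and every name in it (`blend_v1`, `classify` and so on) is hypothetical. In C the signature change would normally surface when recompiling against the new header; in Python it surfaces as a `TypeError` at call time, but the shape of the argument is the same: an old call site either cannot run at all, or runs and gives a different answer.

```python
def blend_v1(a, b):
    """Original API: average two values."""
    return (a + b) / 2


def blend_v2_behavior_change(a, b):
    """Same signature, different result: a behavioural regression."""
    return (a + b) // 2


def blend_v2_signature_change(a, b, weight):
    """New required parameter: existing call sites no longer work at all."""
    return a * weight + b * (1 - weight)


def classify(candidate):
    """Run an unmodified old call site against `candidate` and report what happens."""
    try:
        result = candidate(1, 2)  # the call as existing code wrote it
    except TypeError:
        return "api-break"        # the call itself is no longer valid
    return "ok" if result == 1.5 else "regression"


if __name__ == "__main__":
    for fn in (blend_v1, blend_v2_behavior_change, blend_v2_signature_change):
        print(fn.__name__, "->", classify(fn))
```

Only the last case is a compatibility break in the strict sense used above; the middle one is what a failing regression test looks like.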

Another important thing to note is that the tests themselves could be partially invalid, which can happen legitimately for many reasons (e.g. incorrect assumptions, newly written alpha/beta software that isn't complete yet, etc.). Considering we are talking about code paths which no one is really using (because they currently don't work irrespective of the current change), this isn't that surprising, and it's the exact same thing that piotrj3 was talking about earlier.

There is also the concept of breaking userspace, which is again different. This does sometimes happen (even outside the context of Nvidia), and sometimes it can legitimately be done, typically for security reasons or if something is so broken that barely anyone uses it anyway. While the kernel is typically really strict about not breaking userspace, there have been exceptions, and what we are talking about right now is one of those exceptions because currently it doesn't even work.

**piotrj3** · 27 May 2022, 08:34 AM

Originally posted by oiaohm

dma-buf: Add an API for exporting sync files (v12) [LWN.net]

https://lwn.net/Articles/859290/

This is not exactly a hack; it exposes the sync file, the explicit sync behind the implicit sync, to userspace. This made DMABUF compatible with Vulkan without any over-synchronization.

Vulkan was designed for userspace, not for OS kernels. This is really like the historical pthread solutions that ran in userspace as well. Explicit sync in graphics has the same problem as the historic pthread implementations that ran in userspace: wasting CPU cycles and context switches on processes that cannot do anything because what they were waiting on is not ready.

The reality here is that both implicit sync and explicit sync have their downsides. What we need is the graphical equivalent of a futex. A futex can be used as implicit sync, explicit sync, or a hybrid. Remember that a futex in its best-performing mode only uses the syscall when the application needs to wait, and the kernel only wakes the application up again when the process can proceed. Of course, you can code using futex functions in a way that is not best-performing and replicate implicit sync and explicit sync.

We have multiple levels of NIH here. Nvidia and those who created Vulkan wanted it to be OS neutral, so they totally ignored what was required for best performance of the OS and instead focused on what was required for the GPU side. Since you are designing to be OS neutral, you avoid anything that needs a kernel modification. This ends up designing something that does not ideally fit into the Linux kernel.

The reality here is you need to stop ignoring the explicit sync downsides, because otherwise you sound exactly like the people who said userspace pthread was perfect.

Implicit sync and explicit sync functionality are both required. A futex in your normal locking is basically both of those glued into one. So that the OS scheduler can allocate CPU time to tasks that can make progress, you need the implicit sync feature on the kernel side. So that userspace code does not have to syscall/context switch all the time, you need explicit sync as well. This is the problem: you need both behaviours, like it or not, for the operating system to perform well.

There is one very big problem with this claim: VkFence sits straight on top of the dma_fence structure that Linux kernel implicit sync uses and that is exposed by the sync file patch above. So in reality there is no difference in guarantee if your driver is implemented using the Linux kernel dma fence structures. Wait, with Nvidia's cross-platform driver they have done their own unique pile of crud here, right.

There is a problem with attempting to make code too generic between OSes. You end up coding in incompatible designs.

The open source drivers in the mainline kernel, using the Linux kernel dma fence structure, put them in the right order of operations right off the start line. This is a problem Nvidia is running into because they are trying to be as cross-platform as possible in their driver code. So they need to convert from dma fence structs in the Linux kernel to their own and back. Nvidia has attempted to make their sync code way too generic. Notice AMD, Intel and everyone else are not doing these conversions; their drivers support DMABUF natively. Yes, these conversions of sync structures end up with a sync issue between sync structures, making your life hell.

This is the problem: Nvidia is not right in their objective of explicit sync only, because explicit sync itself is flawed. Implicit sync is like kernel-based locking, and explicit sync is like userspace pthread. Yes, the defects of both are absolutely the same. Yes, it took two decades for someone to come up with a futex that took the best of both for CPU-only workloads.

Nvidia is not proposing a graphical equivalent to a futex. AMD, Intel and others are slowly working in the direction of a graphical futex. The correct solution will be some form of graphical futex for sync.

I am not talking about VkFence. I am talking about VkSemaphore; that one is hacky in OSS because no .

But regardless, I don't think you understand what you wrote.

1st) Jason Ekstrand's patch is not the hack; it is a patch to remove the need for a lot of hacks in the Vulkan driver in the OSS stack.

2nd) Jason Ekstrand proposes DMA_BUF_IOCTL_SYNC. What exactly did Nvidia complain about? About the lack of an ioctl call that latches the synchronization on the dma_buf. At the moment it is pretty much the same thing.

3rd) The patch is not yet merged. So all the hacks to implement Vulkan quirks are still there; Jason's patch will allow dropping them.

The only concern I have is whether point 2) is compatible with Nvidia's issue of:

There's no clear command that signals the GL driver to "latch" the implicit sync fence from the dma-buf by using the ioctl() that converts it to an explicit sync FD, so we just have to continuously monitor all textures/images/whatever in use by any command buffer and try extracting an implicit fence from any of them that are backed by imported dmabufs every time we want to submit commands down to the hardware, in every application, not just Xwayland+glamor, because Xwayland+glamor use the GL driver just like any other app does.

I have the impression it is, although some additional work will probably have to be done on the Nvidia driver side/Xwayland, but I think it is exactly what we want. And again, Nvidia is willing to do that work. So again, I don't think implicit sync is needed:

- no userspace breakage (unless we combine something weird like an outdated Nvidia driver supporting Wayland with an updated Xwayland and a kernel lacking DMA_BUF_IOCTL_SYNC?),

- moving entirely to explicit is an option (if they don't have bugs in other cases of implicit -> explicit conversion),

- it gives the OSS stack new options to evolve.

Explicit sync in graphics has the same problem as the historic pthread implementations that ran in userspace: wasting CPU cycles and context switches on processes that cannot do anything because what they were waiting on is not ready.

And how is implicit sync better than that? In the no-locking scenario it is of course similar, but in the locking scenario:

- implicit sync - you have to poll the DMA_BUF to wait until it is done. Is it done? No (wait some time, context switching). Is it done? No (wait again, context switching). (Many context switches later) Yes, done.
- explicit sync - operations are ordered, so the wait and context switch happen one time.
- futex - the same waiting by the process (and context switching) until the condition is true. When the condition is true you wake up the process that was waiting on it. Same context switch as explicit sync, with the compatibility of implicit sync.

If you need implicit sync compatibility, a futex is great. And Nvidia would also be happy with futex changes, because essentially it would be 1-1 with explicit sync. But there is no benefit of a futex over explicit sync, except that one is more from the perspective of function calls and the other from the perspective of state that gets transferred around, which is what DMA_BUF_IOCTL_SYNC will do.
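
For readers who have not looked at how a futex-style lock behaves, the following is a toy model of the fast path / slow path split being argued about. It does not call the real Linux `futex` syscall; a `threading.Lock` stands in for the atomic compare-and-swap and a `threading.Condition` stands in for the kernel wait queue. All names (`ToyFutexLock`, `slow_path_entries`, the two demo functions) are made up for illustration.

```python
import threading
import time


class ToyFutexLock:
    """Toy model of a futex-style lock. It does NOT use the real futex syscall."""

    def __init__(self):
        self._held = False
        # Stand-in for the atomic compare-and-swap on a shared integer.
        self._guard = threading.Lock()
        # Stand-in for the kernel wait queue behind FUTEX_WAIT / FUTEX_WAKE.
        self._wait_queue = threading.Condition(self._guard)
        # How often a real futex would have had to enter the kernel to wait.
        self.slow_path_entries = 0

    def acquire(self):
        with self._guard:
            if not self._held:
                self._held = True        # fast path: uncontended, no waiting
                return
            self.slow_path_entries += 1  # slow path: where FUTEX_WAIT would go
            while self._held:
                self._wait_queue.wait()  # sleep until release() wakes us
            self._held = True

    def release(self):
        with self._guard:
            self._held = False
            self._wait_queue.notify()    # where FUTEX_WAKE would go (if waiters exist)


def uncontended_demo(rounds=1000):
    """Lock and unlock with no competition; returns slow-path entries (expected 0)."""
    lock = ToyFutexLock()
    for _ in range(rounds):
        lock.acquire()
        lock.release()
    return lock.slow_path_entries


def contended_demo(timeout_s=5.0):
    """One waiter parks once and is woken once; returns slow-path entries."""
    lock = ToyFutexLock()
    lock.acquire()                       # main thread holds the lock
    worker = threading.Thread(target=lambda: (lock.acquire(), lock.release()))
    worker.start()
    deadline = time.monotonic() + timeout_s
    while lock.slow_path_entries == 0 and time.monotonic() < deadline:
        time.sleep(0.001)                # wait until the worker is parked
    lock.release()                       # a single wake-up lets the worker proceed
    worker.join(timeout_s)
    return lock.slow_path_entries
```

In this model the uncontended case never touches the slow path, and the contended case enters it exactly once per blocked waiter, which is the property both sides of the argument are appealing to.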

**oiaohm** · 27 May 2022, 08:44 AM

Originally posted by mdedetrich

There is also the concept of breaking userspace, which is again different. This does sometimes happen (even outside the context of Nvidia), and sometimes it can legitimately be done, typically for security reasons or if something is so broken that barely anyone uses it anyway. While the kernel is typically really strict about not breaking userspace, there have been exceptions, and what we are talking about right now is one of those exceptions because currently it doesn't even work.

Sorry, the "does not even work" argument only applies if the code is mainline. Third-party drivers like Nvidia's cannot be used with the Linux kernel to make the "it does not even work" argument as a reason not to implement a feature.

The reality is that to merge into mainline, Nvidia will have to pass the existing regression tests. If there is some fault in the regression tests, you must go back to the application the regression test was based on and prove that it was wrong with the Linux kernel.

Something interesting about the Linux kernel userspace ABI is that even security reasons fail to be enough reason to break it. There are places in the Linux kernel where a buffer overflow has to be supported because that was what applications expected at the time the syscall was first made.

Remember cgroupv1 is horribly busted, but those doing cgroupv2 had to work out how to support all the applications using cgroupv1 when required.

The reality is that AMD and Intel, in mainline drivers, get implicit sync and explicit sync to in fact work with each other. Glamor working with AMD, Intel and other mainline drivers basically kills Nvidia's argument by the Linux kernel rules. There is a very high price for not being mainline with the Linux kernel. A big one is that your requests for kernel changes can be vetoed outright by anyone who has mainline code.

**MorrisS.** · 27 May 2022, 08:47 AM

I've found this article: Wayland supports explicit synchronization, or eventually could.
[Attached image: Immagine 2022-05-27 144607.png]

**MorrisS.** · 27 May 2022, 08:59 AM

Linux explicit synchronization (dma-fence) protocol | Wayland Explorer

**piotrj3** · 27 May 2022, 09:13 AM

Originally posted by MorrisS.

Linux explicit synchronization (dma-fence) protocol | Wayland Explorer

Jason Ekstrand tells you exactly why it is not useful.

From my perspective, as a Vulkan driver developer, I have to deal with the fact that Vulkan is an explicit sync API but Wayland and X11 aren't. Unfortunately, the Wayland extension solves zero problems for me because I can't really use it unless it's implemented in all of the compositors. Until every Wayland compositor I care about my users being able to use (which is basically all of them) supports the extension, I have to continue to carry around my pile of hacks to keep implicit sync and Vulkan working nicely together.

From the perspective of a Wayland compositor (I used to play in this space), they'd love to implement the new explicit sync extension but can't. Sure, they could wire up the extension, but the moment they go to flip a client buffer to the screen directly, they discover that KMS doesn't support any explicit sync APIs. So, yes, they can technically implement the extension assuming the EGL stack they're running on has the sync_file extensions, but any client buffers which come in using the explicit sync Wayland extension have to be composited and can't be scanned out directly. As a 3D driver developer, I absolutely don't want compositors doing that because my users will complain about performance issues due to the extra blit.

Ok, so let's say we get KMS wired up with implicit sync. That solves all our problems, right? It does, right up until someone decides that they want to screen capture their Wayland session via some hardware media encoder that doesn't support explicit sync. Now we have to plumb it all the way through the media stack, gstreamer, etc. Great, so let's do that! Oh, but gstreamer won't want to plumb it through until they're guaranteed that they can use explicit sync when displaying on X11 or Wayland. Are you seeing the problem?

Do you grasp now the scale of changes needed to move entirely to explicit sync? Nvidia in some cases guesses the synchronization fences properly, so a Wayland program on Wayland works; the issue is more about Xwayland and interactions between Wayland and, for example, screen capture. So yes, the problem is way bigger than I suggested above if you want to properly solve *all* issues. But many of those issues are also plaguing OSS drivers, or were plaguing them in the past. Keep in mind Nvidia is bleeding edge here, while Intel/AMD have spent more than 8 years ironing out Wayland bugs.

**MorrisS.** · 27 May 2022, 09:17 AM

Originally posted by piotrj3

Jason Ekstrand tells you exactly why it is not useful.

In your opinion, why were Wayland and its compositors not developed around explicit sync from the beginning?

**oiaohm** · 27 May 2022, 09:32 AM

Originally posted by piotrj3

1st) Jason Ekstrand's patch is not the hack; it is a patch to remove the need for a lot of hacks in the Vulkan driver in the OSS stack.

2nd) Jason Ekstrand proposes DMA_BUF_IOCTL_SYNC. What exactly did Nvidia complain about? About the lack of an ioctl call that latches the synchronization on the dma_buf. At the moment it is pretty much the same thing.

3rd) The patch is not yet merged. So all the hacks to implement Vulkan quirks are still there; Jason's patch will allow dropping them.

Buffer Sharing and Synchronization — The Linux Kernel documentation

https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html

DMA_BUF_IOCTL_SYNC was in fact merged quite a while back. Yes, it is in the 5.9 documentation. Most of Jason's later patch is test suite and documentation for what was already in the kernel. Yes, it is the horrible Linux kernel problem: a feature gets created and nobody documents how to use it. DMA_BUF_IOCTL_SYNC traces back to Android and its explicit sync.

Originally posted by piotrj3

There's no clear command that signals the GL driver to "latch" the implicit sync fence from the dma-buf by using the ioctl() that converts it to an explicit sync FD, so we just have to continuously monitor all textures/images/whatever in use by any command buffer and try extracting an implicit fence from any of them that are backed by imported dmabufs every time we want to submit commands down to the hardware, in every application, not just Xwayland+glamor, because Xwayland+glamor use the GL driver just like any other app does.

Android, being explicit sync, would have to solve the same problem when using DMA-BUF, right?

Originally posted by piotrj3

- implicit sync - you have to poll the DMA_BUF to wait until it is done. Is it done? No (wait some time, context switching). Is it done? No (wait again, context switching). (Many context switches later) Yes, done.

Buffer Sharing and Synchronization (dma-buf) — The Linux Kernel documentation

https://docs.kernel.org/driver-api/dma-buf.html#implicit-fence-poll-support

Implicit Fence Poll Support

To support cross-device and cross-driver synchronization of buffer access implicit fences (represented internally in the kernel with struct dma_fence) can be attached to a dma_buf. The glue for that and a few related things are provided in the dma_resv structure.
Userspace can query the state of these implicitly tracked fences using poll() and related system calls:

- Checking for EPOLLIN, i.e. read access, can be used to query the state of the most recent write or exclusive fence.
- Checking for EPOLLOUT, i.e. write access, can be used to query the state of all attached fences, shared and exclusive ones.

Note that this only signals the completion of the respective fences, i.e. the DMA transfers are complete. Cache flushing and any other necessary preparations before CPU access can begin still need to happen.

You need to read this. Note the last sentence: the first bit is clear that it only signals on completion, so poll only returns when it's done. The way implicit sync is implemented here, you don't do many context switches: when you call poll you lose your application's CPU slice and get a new CPU slice when the implicit fence is resolved. Your process does not get CPU slices in between, so it is not doing many context switches. What you just wrote is not how DMABUF implicit sync works.

There are no multiple queries and waits in the DMABUF implicit sync implementation; this is why it can deadlock.
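
The difference between the two descriptions of waiting can be sketched like this. It is only an illustration: a pipe stands in for the dma-buf file descriptor (a real one would come from a GPU driver), the function names are made up, and `select.poll` is only available on Unix-like systems. `wait_blocking` is the single sleeping `poll()` described in the kernel documentation quoted above; `wait_busy` is the "is it done? is it done?" loop.

```python
import os
import select
import threading
import time


def wait_blocking(fd, timeout_ms=5000):
    """Sleep inside a single poll() call until fd becomes readable."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)  # the kernel wakes us when fd is ready
    return events, 1                  # exactly one poll() call was made


def wait_busy(fd, deadline_s=5.0):
    """Ask "is it done?" repeatedly with a zero timeout (wasteful)."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    calls = 0
    give_up = time.monotonic() + deadline_s
    while time.monotonic() < give_up:
        calls += 1
        events = poller.poll(0)       # returns immediately, ready or not
        if events:
            return events, calls
    return [], calls


def run(waiter, delay_s=0.1):
    """Make a pipe readable after delay_s and report how `waiter` behaved."""
    read_fd, write_fd = os.pipe()
    timer = threading.Timer(delay_s, os.write, args=(write_fd, b"x"))
    timer.start()
    try:
        return waiter(read_fd)
    finally:
        timer.join()
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("blocking poll() calls:", run(wait_blocking)[1])
    print("busy poll() calls:", run(wait_busy)[1])
```

The blocking variant makes one call and is descheduled until the descriptor is ready; the busy variant keeps getting CPU time and makes many calls over the same interval.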

Originally posted by piotrj3

- explicit sync - operations are ordered, so the wait and context switch happen one time.

Sync files, the Linux explicit sync, are not linked to context switches. So at worst you might spin waiting on a sync file, consuming an insane amount of CPU processing power.

Originally posted by piotrj3

- futex - the same waiting by the process (and context switching) until the condition is true. When the condition is true you wake up the process that was waiting on it. Same context switch as explicit sync, with the compatibility of implicit sync.

If you need implicit sync compatibility, a futex is great. And Nvidia would also be happy with futex changes, because essentially it would be 1-1 with explicit sync. But there is no benefit of a futex over explicit sync, except that one is more from the perspective of function calls and the other from the perspective of state that gets transferred around, which is what DMA_BUF_IOCTL_SYNC will do.

There is a benefit: with a futex that is explicit sync compatible, while waiting on sync you are not giving the application time slice after time slice of CPU time that it basically spends spinning on the explicit sync, waiting for it to complete. This is exactly what DMABUF implicit sync is designed to prevent. DMABUF implicit sync, due to using the poll syscalls, is basically integrated with the Linux kernel CPU scheduler.

This is the problem: doing implicit sync on top of Nvidia's explicit sync is not going to emulate DMABUF implicit sync, because you are missing the CPU scheduler integration.

**piotrj3** · 27 May 2022, 09:35 AM

Originally posted by MorrisS.

In your opinion, why were Wayland and its compositors not developed around explicit sync?

To be honest, the problem has four parts:

- Wayland by itself is the smallest part of the problem, because it actually is supposed to support explicit sync, so someone thought about that; it is more laziness: we have one route, so why make another,

- Wayland and the open source stack around it were developed before DX12/Vulkan. I don't believe any sane developer would go the implicit sync route when DX12/Vulkan was around,

- Linux and the principle "Everything is a file": a file with locks on it is an implicit-sync-like structure. DMA_BUF and GBM are file-like structures, so maybe it felt natural; KMS is implicit too,

- maybe there were existing hardware/drivers that were only made with implicit sync in mind?

**MorrisS.** · 27 May 2022, 09:50 AM

Originally posted by piotrj3

To be honest, the problem has four parts:

- Wayland by itself is the smallest part of the problem, because it actually is supposed to support explicit sync, so someone thought about that; it is more laziness: we have one route, so why make another,

- Wayland and the open source stack around it were developed before DX12/Vulkan. I don't believe any sane developer would go the implicit sync route when DX12/Vulkan was around,

- Linux and the principle "Everything is a file": a file with locks on it is an implicit-sync-like structure. DMA_BUF and GBM are file-like structures, so maybe it felt natural; KMS is implicit too,

- maybe there were existing hardware/drivers that were only made with implicit sync in mind?

So the two environments are structurally different. At this point an OS cannot become something that goes against its nature. If Nvidia cannot implement a solution and Linux cannot switch to explicit sync because of its nature, as you state ("why can Android?"), the only solution is that Linux users don't use Nvidia video cards.

NVIDIA's List Of Known Wayland Issues From SLI To VDPAU, VR & More
