"Guilty" API Proposed For Better Communicating Why Radeon GPUs Hang/Reset
André Almeida of consulting firm Igalia, which works with Valve and other parties, has sent out draft patches introducing a new "guilty" ioctl for communicating information about GPU hang/reset events. The patch series proposes the AMDGPU_INFO_GUILTY_APP ioctl to report why a GPU reset occurred, along with the related resources that induced the GPU hang.
Currently, tracking down GPU hangs with the open-source driver stack can be challenging, especially when trying to gather information after the fact, once a hang/reset has already occurred.
Over in user-space there is this Mesa merge request for implementing "get_guilty_info()" within the Mesa Radeon Vulkan driver RADV. Almeida commented in that MR:
The goal of this draft is to gather feedback about the proposed DRM interface and the dumped information. As stated in the commit description, the goal is to make it easier for Mesa devs to figure out why the GPU has crashed with less overhead, so Mesa will run umr -di [vmid@]address length on the exact IB that caused the hang if the app is the guilty one.
...
Currently, when an app crashes on Mesa there's not much information available for the bug report. Users can run the app with RADV_DEBUG=hang in the hope of getting more information, but this option has some overhead and the dumped information doesn't have much context, making debugging as hard as finding a needle in a haystack.
To solve both issues, introduce a new query function to ask the kernel for information when a hang is caused by the guilty app. This means that innocent apps won't have overhead and now we can dump just the guilty indirect buffer, making it easier for developers to find the offending instruction that hung the GPU.
The only information available so far is the IB address, its size, and the VM id.
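As a rough illustration of how the three reported values could map onto the umr invocation quoted above, here is a minimal Python sketch. It is not the code from the kernel patches or the Mesa merge request: the GuiltyInfo class, its field names, and the umr_dump_ib_args() helper are hypothetical, and the hexadecimal address formatting and the unit of the length argument are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuiltyInfo:
    """Hypothetical container for the values the quoted comment lists.

    The field names are illustrative only; the real kernel/Mesa
    structure may be laid out and named differently.
    """
    ib_addr: int  # GPU virtual address of the guilty indirect buffer (IB)
    ib_size: int  # IB size, passed through unchanged as umr's "length"
    vmid: int     # VM id the IB belongs to


def umr_dump_ib_args(info: GuiltyInfo) -> list[str]:
    """Build the argument list for `umr -di [vmid@]address length`.

    Assumption: the address is written in hexadecimal with a 0x prefix.
    """
    return ["umr", "-di", f"{info.vmid}@{hex(info.ib_addr)}", str(info.ib_size)]


if __name__ == "__main__":
    # Made-up example values, not taken from a real hang report.
    example = GuiltyInfo(ib_addr=0x800100000, ib_size=64, vmid=3)
    print(" ".join(umr_dump_ib_args(example)))
```

Building the command as an argument list rather than a single string means it could be handed to a process launcher without shell quoting concerns. The sketch only constructs the arguments and does not run umr.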
We'll see where this work leads in helping to track down the culprits behind AMD Radeon GPU hangs/resets under Linux.