"Guilty" API Proposed For Better Communicating Why Radeon GPUs Hang/Reset

Written by Michael Larabel in Radeon on 2 May 2023 at 09:00 AM EDT.
A set of patches to the AMDGPU Linux kernel driver and Mesa's RADV Vulkan driver would make it easier to relay why a GPU hang/reset occurred, so that user-space software can be better informed about such issues.

André Almeida of consulting firm Igalia, which works with Valve and other parties, has sent out draft patches introducing a new "guilty" ioctl for communicating information about GPU hang/reset events. The patch series proposes the AMDGPU_INFO_GUILTY_APP ioctl to report why a GPU reset occurred, along with the related resources that induced the GPU hang.

Currently, tracking down GPU hangs with the open-source driver stack can be challenging, especially when trying to gather information after the fact of encountering a hang/reset.

Over in user-space there is this Mesa merge request for implementing "get_guilty_info()" within the Mesa Radeon Vulkan driver RADV. Almeida commented in that MR:
The goal of this draft is to gather feedback about the proposed DRM interface and the dumped information. As stated in the commit description, the goal is to make it easier for Mesa devs to figure out why the GPU has crashed with less overhead, so Mesa will run umr -di [vmid@]address length on the exact IB that caused the hang if the app is the guilty one.
Currently, when an app crashes on Mesa there's not much information available for the bug report. Users can run the app with RADV_DEBUG=hang in the hope of getting more information, but this option has some overhead and the dumped information doesn't have much context, making debugging as hard as finding a needle in a haystack.

To solve both issues, introduce a new query function to ask the kernel for information when a hang is caused by the guilty app. This means that innocent apps won't have overheads and now we can dump just the guilty indirect buffer, making it easier for developers to find the offending instruction that hung the GPU.

The only information available so far is the IB address, its size and the VM id.
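To make the flow described in the merge request more concrete, here is a minimal sketch of how the three reported values could be turned into the quoted umr invocation. This is not the Mesa or kernel code: the GuiltyInfo class, its field names, and the umr_dump_ib_args helper are hypothetical names made up for illustration. The hexadecimal address formatting and passing the size through unchanged are also assumptions, since the units and the layout of the proposed interface are not stated above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GuiltyInfo:
    """Hypothetical container for the three values the draft is said to report."""
    ib_address: int  # GPU virtual address of the offending indirect buffer
    ib_size: int     # size of that buffer, in whatever unit the kernel reports
    vmid: int        # VM id the buffer belongs to


def umr_dump_ib_args(info: Optional[GuiltyInfo]) -> Optional[List[str]]:
    """Build the `umr -di [vmid@]address length` argument list.

    Returns None when no guilty info is available, modelling the idea
    that innocent apps do no extra work.
    """
    if info is None:
        return None
    # Assumption: the address is written in hex and the size is passed as-is.
    target = f"{info.vmid}@{info.ib_address:#x}"
    return ["umr", "-di", target, str(info.ib_size)]


if __name__ == "__main__":
    example = GuiltyInfo(ib_address=0x800100000, ib_size=64, vmid=3)
    print(" ".join(umr_dump_ib_args(example)))
```

The None branch mirrors the design point from the quote: only the app that the kernel reports as guilty would build and run the dump command, so other apps take no extra cost.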

We'll see where this work leads for helping to track down the culprits for AMD Radeon GPU hangs/resets under Linux.