Alibaba Engineers Work To Address Suspend/Resume Bugs With The AMD Graphics Driver

Alibaba engineers uncovered a number of resource-tracking bugs in the AMDGPU kernel driver's C code, including double buffer frees, use-after-free errors, and unbalanced IRQ reference counts. Thanks to the driver being open-source, they set about working through the bugs.
This patch series from Alibaba's Jiang Liu aims to enhance the device state machine so that the AMDGPU driver handles suspend/resume cycles more robustly and no longer trips over the bugs that were uncovered.
"Recently we were testing suspend/resume functionality with AMD GPUs, we have encountered several resource tracking related bugs, such as double buffer free, use after free and unbalanced irq reference count.
We have tried to solve these issues case by case, but found that may not be the right way. Especially about the unbalanced irq reference count, there will be new issues appear once we fixed the current known issues. After analyzing related source code, we found that there may be some fundamental implementation flaws behind these resource tracking issues.
...
So we try to fix those issues by two enhancements/refinements to current device management state machines.
...
Then we try to refine each subsystem, such as nbio, asic etc, to follow the new design. Currently we have only taken the nbio and asic as examples to show the proposed changes. Once we have confirmed that's the right way to go, we will handle the lefting subsystems.
This is in early stage and requesting for comments, any comments and suggestions are welcomed!"
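To illustrate the general idea behind tracking state per subsystem, here is a minimal C sketch. It is not code from the patch series or from the AMDGPU driver: the `blk` structure, the `blk_state` values, and the `blk_resume()`/`blk_suspend()` helpers are all hypothetical names made up for this example. The assumption is simply that each block records how far it has been brought up, so that teardown only undoes steps that actually happened and an IRQ-style reference count stays balanced.

```c
#include <assert.h>

/* Hypothetical lifecycle states; not taken from the AMDGPU driver. */
enum blk_state { BLK_DOWN, BLK_HW_UP, BLK_IRQ_ON };

/* Hypothetical per-subsystem bookkeeping. */
struct blk {
    enum blk_state state;
    int irq_refs;            /* stand-in for an IRQ reference count */
};

/* Bring the block up one step at a time; repeated calls are no-ops. */
void blk_resume(struct blk *b)
{
    if (b->state == BLK_DOWN)
        b->state = BLK_HW_UP;        /* pretend hardware init happened */
    if (b->state == BLK_HW_UP) {
        b->irq_refs++;               /* take exactly one IRQ reference */
        b->state = BLK_IRQ_ON;
    }
}

/* Tear down only what the recorded state says was set up. */
void blk_suspend(struct blk *b)
{
    if (b->state == BLK_IRQ_ON) {
        b->irq_refs--;               /* drop the reference taken in resume */
        assert(b->irq_refs >= 0);    /* would catch an unbalanced put */
        b->state = BLK_HW_UP;
    }
    if (b->state == BLK_HW_UP)
        b->state = BLK_DOWN;         /* pretend hardware teardown happened */
}
```

With this kind of guard, calling `blk_suspend()` twice in a row, or before any `blk_resume()`, leaves `irq_refs` at zero instead of driving it negative, which is the sort of imbalance the quoted message describes.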
Alibaba's open-source contribution is now under review, and hopefully the patches will be worked through to complete this enhanced device state machine for better suspend/resume handling in the AMD Linux graphics driver.