GPU Lockup Recovery For The Nouveau Driver

The work by Marcin Slusarz detects lock-ups by watching for time-outs (VM flush / fence), returns -EIO, handles that error at the ioctl level, resets the GPU, and repeats the last ioctl. The actual Nouveau GPU reset is done by putting the NVIDIA graphics processor through its suspend / resume cycle, but with CPU-only buffer object eviction, VM flush / fence time-outs ignored, and shortened waits.
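The detect, reset, and retry flow described above can be illustrated with a small user-space simulation. This is only a sketch of the general pattern, not the Nouveau patches themselves: `FakeGpu`, `submit`, `reset`, `ioctl_with_recovery`, and the `max_resets` limit are all hypothetical names invented for illustration, and the real work lives in kernel C code.

```python
import errno


class FakeGpu:
    """Toy stand-in for a GPU (hypothetical, not Nouveau code)."""

    def __init__(self, hang_on_calls):
        # Call numbers on which the simulated GPU locks up.
        self.hang_on_calls = set(hang_on_calls)
        self.calls = 0
        self.locked_up = False
        self.resets = 0

    def submit(self, work):
        """Pretend to run work; return 0 on success or -EIO on a lock-up."""
        self.calls += 1
        if self.calls in self.hang_on_calls:
            self.locked_up = True
        if self.locked_up:
            # Stands in for a fence / VM flush wait that timed out.
            return -errno.EIO
        return 0

    def reset(self):
        """Stands in for the suspend / resume cycle used as the reset."""
        self.resets += 1
        self.locked_up = False


def ioctl_with_recovery(gpu, work, max_resets=1):
    """Run the request; on -EIO reset the GPU and repeat the same request.

    max_resets is an assumption of this sketch, added so that a GPU which
    keeps hanging cannot make the loop spin forever.
    """
    ret = gpu.submit(work)
    resets = 0
    while ret == -errno.EIO and resets < max_resets:
        gpu.reset()
        resets += 1
        ret = gpu.submit(work)
    return ret


if __name__ == "__main__":
    gpu = FakeGpu(hang_on_calls=[2])
    print(ioctl_with_recovery(gpu, "first"), gpu.resets)
    print(ioctl_with_recovery(gpu, "second"), gpu.resets)
```

The point of handling the error at the ioctl boundary is that the caller never sees the failure when the reset succeeds: the same request is simply issued again after the reset.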
The patches can currently be found on the Nouveau mailing list under an "RFC" (Request For Comments) tag, but hopefully they will be able to move into the mainline Nouveau kernel DRM in the near future. Depending upon the hardware, lock-ups can be a fairly common occurrence when using this open-source NVIDIA Linux graphics driver. This work comes at a time when the Nouveau driver is finally approaching a stable state after being in development for more than half a decade.
Over in the open-source Radeon GPU driver camp, developers are currently discussing how to rework the GPU reset logic. Those discussions can be found on dri-devel. There are patches there that address multi-ring lock-ups and GPU resets, plus overall improvements for when something goes badly wrong with the Radeon DRM driver. That work was done by AMD's Christian König.
Meanwhile, the Intel DRM driver continues to recover from GPU lock-ups gracefully. I've only hit a handful of lock-ups so far on Intel's about-to-be-launched Ivy Bridge hardware, but each time under Linux the graphics were quickly and properly restored.