Intel's Gallium3D Driver Will Now Try To Recover From GPU Hangs

  • Intel's Gallium3D Driver Will Now Try To Recover From GPU Hangs

    Phoronix: Intel's Gallium3D Driver Will Now Try To Recover From GPU Hangs

    Performance of Intel's new open-source Gallium3D OpenGL driver for Linux is now in good shape compared to the "classic" Mesa driver, but there are still various other features to be ironed out before this "Iris" driver can become the new default. One of the items now crossed off the list is GPU hang recovery...

    http://www.phoronix.com/scan.php?pag...-Hang-Recovery

  • DoMiNeLa10
    replied
    Originally posted by dkasak

    Agreed. I always get lockups running SETI@home under BOINC on my AMD GPU. The other BOINC GPU tasks are fine. It would be nice if I didn't have to power-cycle for these; it makes OpenCL on AMD pretty risky.
    I hope that if it's implemented, it won't be as bad as on Sandy Bridge, which would recover from a hang only to hang again, with the interval between hangs seeming to shrink exponentially. Still, hang recovery is only a stopgap that makes the hardware more usable; it doesn't address the underlying problem. I hope someone fixes the root cause this time, unlike with Sandy Bridge, which seems destined to stay pretty unstable.

  • Kayden
    replied
    KWin does support robustness and properly recovers from GPU hangs, actually. Mutter / GNOME Shell on X (but not Wayland) detects hangs and restarts the process, too. Hooking reset detection up for glamor / X would be a great idea.

  • agd5f
    replied
    Originally posted by totoz
    What is supposed to happen when recovering from a GPU hang? Will the current GL application still run or crash? Or maybe the whole X server will require a restart? I'm curious.
    Any application using the GPU needs to use the robustness mechanisms its graphics API exposes (the robustness extensions for GL, and the context-lost handling for Vulkan). When the GPU hangs and is reset, the application's data may have been lost (e.g., because the execution blocks were reset while working on that data, or because state was lost as a result of the reset, such as VRAM contents if the GPU's memory controller is reset). When the context is lost, the application needs to rebuild its context (buffer and command state, etc.).
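
As a rough illustration of the pattern described above, here is a minimal Python sketch of an application polling for a reset and rebuilding its context. It is a simulation only: `FakeContext`, `build_context`, and `render_frames` are hypothetical names made up for this example and do not belong to any real GL binding. The status constants are the values defined by the GL robustness extensions, and `get_graphics_reset_status` merely stands in for the role of `glGetGraphicsResetStatus()`.

```python
from dataclasses import dataclass, field

# Reset-status values as defined by the GL robustness extensions.
GL_NO_ERROR = 0
GL_GUILTY_CONTEXT_RESET = 0x8253
GL_INNOCENT_CONTEXT_RESET = 0x8254
GL_UNKNOWN_CONTEXT_RESET = 0x8255


@dataclass
class FakeContext:
    """Hypothetical stand-in for a GL context (assumption, not a real API)."""
    pending_status: int = GL_NO_ERROR
    buffers: dict = field(default_factory=dict)

    def get_graphics_reset_status(self) -> int:
        # Plays the role of glGetGraphicsResetStatus(): reports whether
        # this context was affected by a GPU reset.
        return self.pending_status


def build_context(scene: dict) -> FakeContext:
    """Create a fresh context and re-upload all data from CPU-side copies."""
    ctx = FakeContext()
    ctx.buffers = dict(scene)
    return ctx


def render_frames(ctx: FakeContext, scene: dict, frames: int):
    """Run a frame loop that rebuilds the context whenever a reset is seen."""
    rebuilds = 0
    for _ in range(frames):
        if ctx.get_graphics_reset_status() != GL_NO_ERROR:
            # The old context and its GPU-side data are gone; start over.
            ctx = build_context(scene)
            rebuilds += 1
        # Drawing with ctx.buffers would happen here.
    return ctx, rebuilds
```

The point of the sketch is that the application keeps its own copy of everything it uploaded (`scene` here), since nothing on the GPU side can be trusted after a reset.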

    Support for these robustness extensions would need to be added to glamor and the compositors for them to recover gracefully from a GPU reset. On Windows, Microsoft has support for the DX context lost API in their desktop manager, so when there is a GPU reset, the desktop manager properly rebuilds its context. Linux needs similar work.

    In the early days of computer graphics, it was considered acceptable to reset the GPU silently, and users could accept some garbage on the screen, flicker, etc., because graphics data is mostly thrown away as soon as a frame is rendered. This is not the case with compute, where getting bad data in your results can ruin your entire run. As such, you really need to have applications properly catch and deal with issues. This is not limited to GPU resets. You have to deal with other kinds of issues, like ECC errors in VRAM, that could also result in data corruption. From a data integrity point of view, it is better to die than to corrupt data.

    So today on Linux, since neither glamor nor any of the compositors support context lost APIs, you have to restart your display server to get your desktop back after a GPU reset.

  • totoz
    replied
    What is supposed to happen when recovering from a GPU hang? Will the current GL application still run or crash? Or maybe the whole X server will require a restart? I'm curious.

  • dkasak
    replied
    Originally posted by Lanz
    AMD's GPU hang recovery could use some work.
    Agreed. I always get lockups running SETI@home under BOINC on my AMD GPU. The other BOINC GPU tasks are fine. It would be nice if I didn't have to power-cycle for these; it makes OpenCL on AMD pretty risky.

  • Kayden
    replied
    FWIW, this support should work a lot better with kernel 5.2+.

  • Lanz
    replied
    AMD's GPU hang recovery could use some work.

  • aufkrawall
    replied
    The question is whether the X server or other display servers survive the driver reset.

  • DoMiNeLa10
    replied
    This is great. GPU hangs can be nasty if the driver doesn't attempt to recover from them. I remember running into these issues with nouveau back in the day; it required power cycling the machine to get the GPU to work properly again. I assume that some GPU hangs will happen with hardware supported by these drivers.
