Intel's Gallium3D Driver Will Now Try To Recover From GPU Hangs

  • Intel's Gallium3D Driver Will Now Try To Recover From GPU Hangs

    Phoronix: Intel's Gallium3D Driver Will Now Try To Recover From GPU Hangs

    Performance of Intel's new open-source Gallium3D OpenGL driver for Linux is now in good shape compared to the "classic" Mesa driver, but various other features still need to be ironed out before this "Iris" driver can become the new default. One of the items now crossed off the list is GPU hang recovery...

  • #2
    This is great. GPU hangs can be nasty if the driver doesn't attempt to recover from them. I remember running into these issues with nouveau back in the day; it would require power cycling the machine to get the GPU working properly again. I assume that some GPU hangs will still happen with hardware supported by these drivers.

    • #3
      The question is whether the X server or other display servers survive the driver reset.

      • #4
        AMD's GPU hang recovery could use some work.

        • #5
          FWIW, this support should work a lot better with Kernel 5.2+.
          Free Software Developer .:. Mesa and Xorg
          Opinions expressed in these forum posts are my own.

          • #6
            Originally posted by Lanz
            AMD's GPU hang recovery could use some work.
            Agreed. I always get lockups running SETI@home under BOINC on my AMD GPU. The other BOINC GPU tasks are fine. It would be nice if I didn't have to power-cycle for these; it makes OpenCL on AMD pretty risky.

            • #7
              What is supposed to happen when recovering from a GPU hang? Will the current GL application still run or crash? Or maybe the whole X server will require a restart? I'm curious.

              • #8
                Originally posted by totoz
                What is supposed to happen when recovering from a GPU hang? Will the current GL application still run or crash? Or maybe the whole X server will require a restart? I'm curious.
                Any application using the GPU needs to use the robustness mechanisms its graphics API exposes (the robustness extensions for GL, and the device-lost handling for Vulkan). When there is a GPU reset, the application's data may have been lost: for example, the execution blocks may have been reset while working on data, or state such as VRAM contents may be gone if the GPU's memory controller was reset. Once its context is lost, the application needs to rebuild that context (buffer and command state, etc.).
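
                As a rough illustration of the poll-and-rebuild pattern described above, here is a minimal Python sketch. The numeric constants are the reset-status values from the GL robustness extensions; everything else (`FakeContext`, `create_context`, `run`) is a hypothetical stand-in so the loop can run without a GPU, and is not a real GL binding. A real application would call `glGetGraphicsResetStatus()` on a context created with robustness enabled.

```python
# Reset-status values defined by the GL robustness extensions / OpenGL 4.5.
GL_NO_ERROR = 0
GL_GUILTY_CONTEXT_RESET = 0x8253
GL_INNOCENT_CONTEXT_RESET = 0x8254
GL_UNKNOWN_CONTEXT_RESET = 0x8255


class FakeContext:
    """Hypothetical stand-in for a robust GL context (not a real binding)."""

    def __init__(self, reset_on_frame=None):
        # Frame count at which this fake context pretends the GPU was reset.
        self.reset_on_frame = reset_on_frame
        self.frames_drawn = 0
        self.resources_uploaded = False

    def get_graphics_reset_status(self):
        # A real application would call glGetGraphicsResetStatus() here.
        if self.reset_on_frame is not None and self.frames_drawn == self.reset_on_frame:
            return GL_INNOCENT_CONTEXT_RESET
        return GL_NO_ERROR

    def upload_resources(self):
        # Stands in for re-creating buffers, textures, shaders, and state.
        self.resources_uploaded = True

    def draw_frame(self):
        self.frames_drawn += 1


def create_context(reset_on_frame=None):
    """Create a context and upload everything the renderer needs."""
    ctx = FakeContext(reset_on_frame)
    ctx.upload_resources()
    return ctx


def run(frames, reset_on_frame):
    """Draw `frames` frames, rebuilding the context whenever it is lost."""
    ctx = create_context(reset_on_frame)
    rebuilds = 0
    drawn = 0
    while drawn < frames:
        if ctx.get_graphics_reset_status() != GL_NO_ERROR:
            # The old context is unusable: throw it away and rebuild from scratch.
            ctx = create_context()
            rebuilds += 1
            continue
        ctx.draw_frame()
        drawn += 1
    return drawn, rebuilds
```

                With this fake context, `run(5, 2)` returns `(5, 1)`: the reset is noticed before the third frame, the context is rebuilt once, and rendering continues.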

                Support for these robustness extensions would need to be added to glamor and the compositors for them to recover gracefully from a GPU reset. On Windows, Microsoft supports the DX context-lost API in its desktop manager, so when there is a GPU reset, the desktop manager properly rebuilds its context. Linux needs similar work.

                In the early days of computer graphics, it was considered acceptable to reset the GPU silently, and users could accept some garbage on the screen, flicker, etc., because graphics data is mostly thrown away as soon as a frame is rendered. This is not the case with compute, where bad data in your results can ruin your entire run. As such, applications really need to catch and deal with these issues properly. This is not limited to GPU resets: other problems, like ECC errors in VRAM, can also result in data corruption. From a data integrity point of view, it is better to die than to corrupt data.
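
                The "better to die than to corrupt data" rule for compute can be sketched as a fail-fast loop. This is only an illustration under assumptions: `process_chunk` and `device_was_reset` are hypothetical callables standing in for a real kernel launch and a real reset-status query, and `DeviceResetError` is a name invented for this example.

```python
class DeviceResetError(RuntimeError):
    """Raised when the (hypothetical) device reports a reset mid-job."""


def run_compute_job(chunks, process_chunk, device_was_reset):
    """Process chunks in order; abort instead of returning suspect results.

    process_chunk and device_was_reset are hypothetical callables supplied
    by the caller, not part of any real compute API.
    """
    results = []
    for index, chunk in enumerate(chunks):
        output = process_chunk(chunk)
        # Check after every chunk: output produced around a reset is untrusted.
        if device_was_reset():
            raise DeviceResetError(f"device reset detected at chunk {index}")
        results.append(output)
    return results


if __name__ == "__main__":
    # Healthy run: no reset is ever reported, so all results are returned.
    print(run_compute_job([1, 2, 3], lambda x: x * x, lambda: False))
```

                The design choice is that a reset discards the whole run rather than keeping the chunks computed so far, since the caller cannot tell which earlier outputs depended on state that was lost.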

                So today on Linux, since neither glamor nor any of the compositors support context lost APIs, you have to restart your display server to get your desktop back after a GPU reset.

                • #9
                  KWin does support robustness and properly recovers from GPU hangs, actually. Mutter / GNOME Shell on X (but not Wayland) detects hangs and restarts the process, too. Hooking reset detection up for glamor / X would be a great idea.

                  • #10
                    Originally posted by dkasak

                    Agreed. I always get lockups running SETI@home under BOINC on my AMD GPU. The other BOINC GPU tasks are fine. It would be nice if I didn't have to power-cycle for these; it makes OpenCL on AMD pretty risky.
                    I hope that if it's implemented, it won't be as bad as on Sandy Bridge, which would recover from a hang only to hang again, with the interval between hangs seeming to shrink exponentially. Still, hang recovery is just a quick attempt to make the hardware more usable; it doesn't address the actual problem. I hope someone fixes the underlying hangs, unlike with Sandy Bridge, which seems destined to stay pretty unstable.
