Better Hang Detection For The RADV Vulkan Driver


  • Better Hang Detection For The RADV Vulkan Driver

    Phoronix: Better Hang Detection For The RADV Vulkan Driver

    Samuel Pitoiset of Valve's latest work on the open-source Radeon driver stack has been figuring out better GPU hang detection for the RADV Vulkan driver...

    http://www.phoronix.com/scan.php?pag...Hang-Detection

  • #2
    Good work! Maybe not directly related to this GPU hang detection, but I have always wondered how easily a GPU driver hang can bring down the entire Linux kernel, where often the only option is to reset the whole machine. Isn't there anything the kernel can do to reset the GPU or prevent this? How is this done on Windows? It seems more stable there.


    • #3
      Originally posted by R41N3R View Post
      Good work! Maybe not directly related to this GPU hang detection, but I have always wondered how easily a GPU driver hang can bring down the entire Linux kernel, where often the only option is to reset the whole machine. Isn't there anything the kernel can do to reset the GPU or prevent this? How is this done on Windows? It seems more stable there.
      Windows has been able to recover the graphics driver without a reboot since Windows 7, I think, maybe even Vista?


      • #4
        From my experience, graphics drivers lock up the kernel rather rarely: in most cases, you can still ssh in. What they do break, however, is the VT the GUI is launched on (including the ability to switch to another VT).

        Aren't there some watchdogs in place to prevent this?


        • #5
          Originally posted by Brisse View Post
          Windows has been able to recover the graphics driver without a reboot since Windows 7, I think, maybe even Vista?
          Since Vista, but only in some cases.

          Windows 10 still does BSOD because of graphics issues (seen mostly on laptops with dual graphics).


          • #6
            Originally posted by M@yeulC View Post
            From my experience, graphics drivers lock up the kernel rather rarely: in most cases, you can still ssh in. What they do break, however, is the VT the GUI is launched on (including the ability to switch to another VT).

            Aren't there some watchdogs in place to prevent this?
            You are right, I can usually SSH into the machine, but a reset is faster if my other machine is not running. This is really annoying; at least switching to another TTY would be good (doing it blindly would be OK), but I cannot even do this.

            Just as an example, games using Unreal Engine 4.16 can lock up the kernel with Mesa on AMD. Witcher 3 in Wine was another game I remember. I have seen it way too often already, so I wouldn't say it is rare :-) But I am not complaining, as I use mesa-git; I just want to learn more about how other systems prevent this.


            • #7
              Originally posted by R41N3R View Post

              You are right, I can usually SSH into the machine, but a reset is faster if my other machine is not running. This is really annoying; at least switching to another TTY would be good (doing it blindly would be OK), but I cannot even do this.

              Just as an example, games using Unreal Engine 4.16 can lock up the kernel with Mesa on AMD. Witcher 3 in Wine was another game I remember. I have seen it way too often already, so I wouldn't say it is rare :-) But I am not complaining, as I use mesa-git; I just want to learn more about how other systems prevent this.
              I think this is related to the rusty, old and so far unkillable VT system in the kernel. That code is so old and barely readable that I don't think anyone in the last 10 years has managed anything more than keeping it working somehow, so I'm fairly confident it has no way to handle failure recovery (after all, it was designed with TTY modems in mind, not GPUs).

              Additionally, I'm not sure the kernel has an actual failure protection system, or whether drivers use it if it exists. Outside of GPUs, lots of drivers can panic the kernel, and even through SSH it is not possible to unload or reload them; they simply enter a panic loop and never time out.

              Note: yes, I mean drivers compiled as modules and not included in the initramfs or initrd.


              • #8
                Originally posted by R41N3R View Post

                You are right, I can usually SSH into the machine, but a reset is faster if my other machine is not running. This is really annoying; at least switching to another TTY would be good (doing it blindly would be OK), but I cannot even do this.

                Just as an example, games using Unreal Engine 4.16 can lock up the kernel with Mesa on AMD. Witcher 3 in Wine was another game I remember. I have seen it way too often already, so I wouldn't say it is rare :-) But I am not complaining, as I use mesa-git; I just want to learn more about how other systems prevent this.
                Next time I get such a hang, I will try the following:
                • sysrq + R, then change to VT 2 or so (with Alt+F2)
                • via ssh, run sudo chvt 2
                • compile a program that does:


                Code:
                  
                  /* tty_fd: an open descriptor for the console; puts it back into text mode */
                  ioctl(tty_fd, KDSETMODE, KD_TEXT);
                (source)
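
For anyone wanting to try the last two steps from an ssh session, here is a minimal Linux-only sketch of what such a program might look like. The function names `set_console_text_mode` and `switch_vt` are made up for illustration, and the console path (for example /dev/tty0 or /dev/console) is an assumption about the system; the calls will normally need root. The second helper does roughly what `chvt` does, using the `VT_ACTIVATE` and `VT_WAITACTIVE` ioctls.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/kd.h>   /* KDSETMODE, KD_TEXT */
#include <linux/vt.h>   /* VT_ACTIVATE, VT_WAITACTIVE */
#include <stddef.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the console behind `path` to return to text mode.
 * Returns 0 on success, -1 on failure (the reason is printed to stderr). */
int set_console_text_mode(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    int rc = ioctl(fd, KDSETMODE, KD_TEXT);
    if (rc < 0)
        perror("ioctl(KDSETMODE)");

    close(fd);
    return rc < 0 ? -1 : 0;
}

/* Roughly what `chvt <vt>` does: activate the VT, then wait until the
 * switch has completed. Returns 0 on success, -1 on failure. */
int switch_vt(const char *path, int vt)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    int rc = 0;
    if (ioctl(fd, VT_ACTIVATE, vt) < 0) {
        perror("ioctl(VT_ACTIVATE)");
        rc = -1;
    } else if (ioctl(fd, VT_WAITACTIVE, vt) < 0) {
        perror("ioctl(VT_WAITACTIVE)");
        rc = -1;
    }

    close(fd);
    return rc;
}
```

A small `main` could call `set_console_text_mode("/dev/tty0")` followed by `switch_vt("/dev/tty0", 2)`. Whether this actually recovers the display after a GPU hang is untested here; if the driver itself is wedged, the ioctls may fail or block.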


                • #9
                  Windows has GPU reset plumbed through the OS. There is a DX API for GPU reset, and the Windows desktop manager uses it. It gets signalled when there is a GPU reset, so it knows that its context and buffers are lost and that it should rebuild its state and buffers. OpenGL has a robustness API, but very few if any apps use it. At a minimum for X, glamor would need to support the GL robustness extensions so that it properly handles a GPU reset.
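
The "notice the reset, then rebuild" pattern described here can be sketched without any graphics API. Everything in the block below is hypothetical: `enum reset_status` stands in for whatever a robustness query (such as `glGetGraphicsResetStatus` from the GL robustness extensions) would return, and `rebuild_resources` is only a placeholder for recreating the context and re-uploading buffers. It is an illustration of the control flow, not how any particular compositor does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the status a robustness query would report. */
enum reset_status { RESET_NONE, RESET_GUILTY, RESET_INNOCENT, RESET_UNKNOWN };

/* Hypothetical application state: are the GPU-side resources usable? */
struct renderer {
    bool resources_valid;
    unsigned rebuild_count;
};

/* Placeholder: a real application would recreate its context and
 * re-upload textures, buffers and shaders here. */
static void rebuild_resources(struct renderer *r)
{
    r->rebuild_count++;
    r->resources_valid = true;
}

/* Call once per frame with the polled status.
 * Returns true if a reset was seen and the state was rebuilt. */
bool handle_reset_status(struct renderer *r, enum reset_status st)
{
    assert(r != NULL);

    if (st == RESET_NONE)
        return false;               /* nothing lost, keep drawing */

    /* After a reset, everything that lived on the GPU is considered gone. */
    r->resources_valid = false;
    fprintf(stderr, "GPU reset reported (status %d), rebuilding state\n", (int)st);
    rebuild_resources(r);
    return true;
}
```

The point of the pattern is that the application has to poll and cooperate: an app that never checks the status keeps using a dead context, which matches the remark that few apps use the robustness API.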


                  • #10
                    radeon was able to successfully reset my laptop GPU after most GPU hangs I encountered. amdgpu (at least on Polaris) only very rarely resets successfully. I believe you still have to manually enable automatic GPU reset by booting with amdgpu.lockup_timeout=10000 so that it tries to reset 10 seconds after detecting a hang, but you'll usually get bad results, ranging from the monitor no longer receiving any signal, to a very colorful distorted image, to outright null pointer dereferences in the amdgpu kernel module.
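
To check whether the boot parameter actually took effect, one option is to read the value back from sysfs. The sketch below assumes the parameter is exposed at /sys/module/amdgpu/parameters/lockup_timeout, which may not hold on every kernel version; `read_module_param` is a made-up helper name that just reads the first line of a file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a module parameter file into buf, without the
 * trailing newline. Returns 0 on success, -1 if the file cannot be read. */
int read_module_param(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    if (len == 0)
        return -1;

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';  /* strip the newline, if any */
    return 0;
}
```

Calling `read_module_param("/sys/module/amdgpu/parameters/lockup_timeout", buf, sizeof buf)` would then fill `buf` with the current setting, under the path assumption above. A return value of -1 most likely means the module is not loaded or the parameter is not exposed there.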
