AMD Working On Better Page Fault Handling For Navi / Vega GPUs

  • AMD Working On Better Page Fault Handling For Navi / Vega GPUs

    Phoronix: AMD Working On Better Page Fault Handling For Navi / Vega GPUs

    Longtime open-source AMD Linux driver developer Christian König on Wednesday sent out a set of patches providing "graceful" page fault handling support for Navi and Vega graphics processors...

    http://www.phoronix.com/scan.php?pag...er-PF-Handling

  • #2
    Great. Can this be used to gracefully kill the X server in case of a hang? (without having to reboot the machine)


    • #3
      Oh, nice! I wonder if this is gonna fix some of the problems I have (some games just hang and then take the system down with them after a short while).


      • #4
        Stories like that make me worry about whether I should buy an AMD card when my GeForce GTX750 dies. I think the last time I ever had the nVidia drivers do anything like that was back when I was on a GeForce 7600GS back around 2009.

        (And I'm big on desktop session uptime. It's the #1 reason I won't be moving to Wayland until they standardize some kind of compositor crash recovery protocol.)


        • #5
          I read "Vega10", so this could concern APUs like Raven Ridge as well, couldn't it?
          I don't know how different memory management is between dGPUs and iGPUs.


          • #6
            Originally posted by ssokolow
            Stories like that make me worry about whether I should buy an AMD card when my GeForce GTX750 dies. I think the last time I ever had the nVidia drivers do anything like that was back when I was on a GeForce 7600GS back around 2009.

            (And I'm big on desktop session uptime. It's the #1 reason I won't be moving to Wayland until they standardize some kind of compositor crash recovery protocol.)
            It's a bit older now, but Polaris is pretty rock solid. I have few to no problems with my RX 580 outside of vsync issues with a few Wine/Proton games. For up to 1080p on Linux, the 580 is one of the best GPUs to have.

            The only GPU-related loss of uptime I've had was a week ago, when I went to update llvm-git and forgot that I'd been manually editing the PKGBUILDs to use the 9.0 branch instead of master. That pulled in llvm10, which, well, didn't go very well: games would hang until I force-rebooted, and desktop compositing had bad tearing and freezing until I downgraded back to llvm9. I can't blame the GPU there, since I was running llvm from git for mesa-aco.


            • #7
              Originally posted by tildearrow
              Great. Can this be used to gracefully kill the X server in case of a hang? (without having to reboot the machine)
              This already works quite reliably. If a GPU hang occurs, the GPU is automatically reset (after some timeout). However, you have to restart the desktop session and login manager, as all GPU contexts are invalidated. I hope this can be automated. It would be even better if the display server were able to recreate contexts and continue without a restart.
              Last edited by brent; 09-05-2019, 08:28 AM.


              • #8
                Originally posted by brent

                This already works quite reliably. If a GPU hang occurs, the GPU is automatically reset (after some timeout). However, you have to restart the desktop session and login manager, as all GPU contexts are invalidated. I hope this can be automated. It would be even better if the display server were able to recreate contexts and continue without a restart.
                No, it does not. It is totally unreliable.
                When my X server hangs and the speedometer increases to maximum (due to a Mesa hang), the GPU reset mechanism never triggers, and forcing a reset does not work because it thinks there is no hang.

                I mean, why the heck did they have to make it so that when you run Xonotic, StepMania, or any other application with vblank_mode=0 and then begin recording the screen using my application (which uses VA-API), it HANGS after like 2 seconds?!?!
                This is why I went back to Mesa 19.0 and I am not upgrading. I don't understand what the heck is going on inside that causes the GPU to hang for some magic reason, FORCING ME TO REBOOT, AKA losing my work and having to re-open every damn thing!!!
                I remember there was a time when one of the Mesa developers disabled a lot of compiler warnings, so I suspect this may be related, and that there's a piece of erroneous code causing a race condition which yields the hang...

                You know what? If it does hang again, I'm done with this crap. I will be forced to install Gentoo and debug the thing myself.
                Last edited by tildearrow; 09-05-2019, 02:32 PM.


                • #9
                  Originally posted by tildearrow
                  No, it does not. It is totally unreliable.
                  When my X server hangs and the speedometer increases to maximum (due to a Mesa hang), the GPU reset mechanism never triggers, and forcing a reset does not work because it thinks there is no hang.
                  That's unfortunate, I can only claim "works for me".

                  Maybe it depends on GPU generation, which is Polaris here. I can trigger hangs, and reset and recovery work every time.

                  Edit: GPU resets are only enabled on newer generations. You may need to enable resets if you have an older GPU. There's an "amdgpu" module parameter for this.
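
                  For anyone who wants to check this on their own machine, here is a minimal sketch. It assumes the parameter in question is `gpu_recovery` (the post does not name it) and that loaded-module parameters are exposed under /sys/module/amdgpu/parameters/. The helper names are made up for illustration.

```python
from pathlib import Path

# Assumption: the parameter meant above is "gpu_recovery", exposed via sysfs.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/gpu_recovery")

# Value meanings as given in the module parameter description.
MEANINGS = {
    -1: "auto (the driver decides per GPU generation)",
    0: "disabled",
    1: "enabled",
}


def describe_gpu_recovery(raw: str) -> str:
    """Translate the raw parameter value into a readable description."""
    text = raw.strip()
    try:
        value = int(text)
    except ValueError:
        return f"unrecognized value {text!r}"
    return MEANINGS.get(value, f"unrecognized value {value}")


def read_gpu_recovery(path: Path = PARAM_PATH) -> str:
    """Read the parameter file, or report that it could not be read."""
    try:
        return describe_gpu_recovery(path.read_text())
    except OSError:
        return "not available (amdgpu module not loaded?)"


if __name__ == "__main__":
    print("amdgpu gpu_recovery:", read_gpu_recovery())
```

                  If it reports "disabled" or "auto" and resets never trigger, booting with amdgpu.gpu_recovery=1 on the kernel command line should force it on, assuming that really is the parameter at play here.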
                  Last edited by brent; 09-05-2019, 04:23 PM.


                  • #10
                    Originally posted by brent

                    That's unfortunate, I can only claim "works for me".

                    Maybe it depends on GPU generation, which is Polaris here. I can trigger hangs, and reset and recovery work every time.

                    Edit: GPU resets are only enabled on newer generations. You may need to enable resets if you have an older GPU. There's an "amdgpu" module parameter for this.
                    Of course, Polaris is super stable because it has all the attention. Nobody cares about Vega. (edit: this is not the case.)

                    Note that my graphics card isn't that old, so I doubt I have to enable such a kernel param.
                    Last edited by tildearrow; 09-05-2019, 08:43 PM.
