AMD devs: *ERROR* ring gfx timeout


  • #11
    Thanks for the input. I guess gpu_recovery is going to be a WIP for a while.
    As far as my particular problem goes, do you think it is more likely to be a kernel/drm problem, a hardware problem, or a problem with the particular game?
    Is amdgpu.job_hang_limit a possible solution?

    I'm just not sure which direction I should take from here. Ruling out hardware 100% would be ideal.

    When I search Google for "vega ring gfx timeout" I find a number of results, but no real solutions.
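
Before changing anything, it helps to know what the driver is currently using. A minimal sketch, assuming the loaded amdgpu module exposes its parameters under /sys/module/amdgpu/parameters (the usual sysfs location for module parameters); the helper name read_amdgpu_params is made up for illustration:

```python
from pathlib import Path


def read_amdgpu_params(names, base="/sys/module/amdgpu/parameters"):
    """Return {name: value} for each parameter, or None if it cannot be read."""
    values = {}
    for name in names:
        path = Path(base) / name
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # Module not loaded, parameter missing, or no read permission.
            values[name] = None
    return values


# The two parameters mentioned in this post.
print(read_amdgpu_params(["gpu_recovery", "job_hang_limit"]))
```

A value of None only means the file could not be read on this machine; it says nothing about what the driver would do with that parameter.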
    Last edited by Soul_keeper; 26 September 2018, 06:30 PM.



    • #12
      There is no documentation on this, but it appears to be working:

      amdgpu.disable_cu=0.0.14,0.0.15,1.0.14,1.0.15,2.0.14,2.0.15,3.0.14,3.0.15

      This should basically turn a Vega 64/FE into a Vega 56.
      I'm going to try to rule out hardware stability by playing with this for a while ...
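
Typing these lists by hand is error prone, so here is a rough sketch that builds the string. It assumes each entry is an se.sh.cu triple (shader engine, shader array, compute unit), which is what the values above look like; the function name disable_cu_arg is hypothetical:

```python
def disable_cu_arg(shader_engines, shader_arrays, compute_units):
    """Build an amdgpu.disable_cu= value from SE, SH and CU index lists.

    Assumes every entry has the form se.sh.cu, as in the post above.
    """
    triples = [
        f"{se}.{sh}.{cu}"
        for se in shader_engines
        for sh in shader_arrays
        for cu in compute_units
    ]
    return "amdgpu.disable_cu=" + ",".join(triples)


# CUs 14 and 15 in shader array 0 of each of the four shader engines.
print(disable_cu_arg(range(4), [0], [14, 15]))
```

With those arguments it reproduces the parameter string above; other index lists give other combinations to test.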
      Last edited by Soul_keeper; 27 September 2018, 02:06 AM. Reason: edit, tested



      • #13
        I've discovered another way to trigger the ring gfx timeout and force me to reboot.

        amdgpu.disable_cu=3.0.0,3.0.1,3.0.2,3.0.3,3.0.4,3.0.5,3.0.6,3.0.7,3.0.8,3.0.9,3.0.10,3.0.11,3.0.12,3.0.13,3.0.14,3.0.15

        startx and I get a black screen with no keyboard.
        ssh in from another box, grab the dmesg, and reboot.

        Apparently disabling all the CUs in an SE will cause this to trigger when X starts.

        Any ideas why this might trigger it?

        Here are the runs that did pass:
        https://openbenchmarking.org/result/...RA-1809275RA81
        Strangely there seems to be no performance benefit to disabling the CUs in a staggered fashion. I'd think that the L1 cache sharing or other SE units would be affected ... unless this is somehow virtual.
        I'd be really interested to see what disabling an entire SE, or two, would do.
        Last edited by Soul_keeper; 27 September 2018, 07:59 AM.



        • #14
          The Vega design as a whole just runs at high temps. After applying better thermal compound, adding more fans, and even underclocking/undervolting, I haven't been able to increase stability. So I think that rules out temperature. So it's either DOA or software, and I'm leaning towards software now.

          debianxfce: Did you say you have a Vega?
          Well, with any GCN card you could try disabling an SE with amdgpu.disable_cu; I bet that'd trigger a ring gfx timeout. Can you play dota2 completely stable?



          • #15
            This issue still exists. No progress.



            • #16
              Originally posted by debianxfce

              What if you run it with lower engine clocks? You can use my simple python app for that: https://www.phoronix.com/forums/foru...justing-clocks

              Lowering clocks/voltages/temps doesn't seem to fix the issue.

              ...
              [61988.103214] [drm:0xffffffffa04012ed] *ERROR* ctx 00000000bbf8582b is still alive
              [61988.103214] [drm:0xffffffffa04012ed] *ERROR* ctx 00000000bbf8582b is still alive
              [61988.103215] [drm:0xffffffffa040137e] *ERROR* ctx 00000000bbf8582b is still alive
              [72134.598433] [drm:0xffffffffa04929af] *ERROR* ring gfx timeout, signaled seq=4188416, emitted seq=4188418
              [72134.598435] [drm] GPU recovery disabled.
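
For anyone collecting these over ssh, a small sketch that pulls the ring name and sequence numbers out of saved dmesg text. The regex is modeled only on the line format shown above, and the helper name find_ring_timeouts is made up; the last field is simply emitted minus signaled, which presumably counts submissions that were emitted but never signaled as done:

```python
import re

# Pattern modeled on the "ring gfx timeout" line in the log above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(dmesg_text):
    """Return (ring, signaled, emitted, emitted - signaled) per timeout line."""
    results = []
    for line in dmesg_text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match:
            signaled = int(match["signaled"])
            emitted = int(match["emitted"])
            results.append((match["ring"], signaled, emitted, emitted - signaled))
    return results


sample = (
    "[72134.598433] [drm:0xffffffffa04929af] *ERROR* ring gfx timeout, "
    "signaled seq=4188416, emitted seq=4188418\n"
    "[72134.598435] [drm] GPU recovery disabled."
)
print(find_ring_timeouts(sample))  # [('gfx', 4188416, 4188418, 2)]
```

Feeding it the full saved dmesg from each lockup makes it easier to compare hangs across kernels and firmware versions.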

              System Lockup


              Short of buying a new/different video card, I guess it's impossible to get amdgpu stable with a Vega FE in 3D.



              • #17
                The latest kernel-firmware-20181215_211de16-noarch-1 causes my system to not boot. Screen turns black and it goes into some kind of power cycle loop ...
                Reverting to kernel-firmware-20181026_1baa348-noarch-1 allows it to boot again.

                What did you guys break now?



                • #18
                  Originally posted by debianxfce
                  You did remove the heatsink from vega and have temp problems. I think your gpu card is broken based on the image you posted. GPU video memory gets corrupted when the temperature rises.

                  Just to be clear here, I "thought" I had temp issues; now I'm certain temperature isn't, and likely never was, a factor in anything.
                  So that means either the software (kernel/mesa/games) is buggy, or I've had a DOA card from day 1 and, because I panicked and replaced the thermal paste, I just have to eat a $1000 loss due to the voided warranty (and having owned the card for so long).

                  It's a shame I have no friends, or I'd loan the card to a Windows gamer and let them play with it for a few weeks to potentially rule out Linux as being the problem.



                  • #19
                    I haven't booted a Windows system in over 15 years.



                    • #20
                      Does anyone else have this problem?
                      It's not going away ...
