AMDGPU Reset Recovery To Be Flipped On By Default For Newer Radeon GPUs

  • #41
    I get the same *ERROR* ring gfx timeout error/lockup all the time with my Vega FE ...
    If it were heat, the underclocking/undervolting and maxing out the fans would help, but it doesn't.

    Dota 2 is great for exposing it: just try to finish a complete game against bots with details maxed at 4K (maybe lesser settings do it too).
    Typically it'll lock up halfway through the game, just when the big fights are starting.

    I don't really have the time to care right now or to test/report, as I'm not a driver dev. All I can do is be patient and wait another year, I guess ...

    Now if reset worked ... I'd probably never notice. It would just be a dmesg warning that happens once or twice per game session, with maybe a slight stutter in game. That would be far easier to deal with as I wait for the drivers to mature.
    Last edited by Soul_keeper; 31 October 2018, 03:07 AM.



    • #42
      I get a consistent GPU hang on my RX 570, I also reported a bug here: https://bugs.freedesktop.org/show_bug.cgi?id=108493 - there are some tweaks that can improve the stability, so that it doesn't hang immediately but after 10-15 minutes.

      But if you google for "ring gfx timeout" or "ring sdma0 timeout" you'll find a bunch of other similar issues. Most of the time it is also unclear whether the problem is in amdgpu (the kernel driver) itself or in radeonsi (the mesa driver). It has been stated in a few bug reports that mesa can indeed trigger a GPU hang if it behaves badly.

      I guess I'm looking forward to seeing kernel 4.20-rc1 to test these fantastic new changes and see if the problem is fixed (or at least maybe it can now properly reset the GPU if not).



      • #43
        Originally posted by debianxfce
        You are using an external PCIe converter. That is not in AMD's interest, and I am sure they do not have the same hardware that you have. I am sure that when I install my new ASUS Radeon RX 570 4GB Expedition OC in the desktop computer, it will work without problems. I am waiting for the PCIe 8-pin power adapter to arrive.
        That has nothing to do with it. Seriously, search for "ring gfx timeout" on Google and you will find many very, very similar issues. People experience those issues with their desktop computers.



        • #44
          Originally posted by debianxfce
          Many people, including you, use an old and buggy OS and drivers.
          I use the latest kernel and mesa releases (4.19 and 18.2). Are you suggesting that there is an AMD kernel somewhere that I can try? If so, please give me a link and I will try it.



          • #45
            Originally posted by debianxfce

            Michael is a huge Fedora fan. He uses Ubuntu and Padoka PPA Mesa git when testing with games. Use a rolling-release OS like Debian testing with Xfce. Fedora is not popular for gaming: https://www.gamingonlinux.com/users/statistics


            Here is the WIP kernel; clone it on Saturday:
            https://cgit.freedesktop.org/~agd5f/...-next-4.21-wip

            You need to have latest Mesa git and firmware too.
            https://launchpad.net/~oibaf/+archiv...aphics-drivers

            I have all of these in my distribution: https://www.youtube.com/watch?v=fKJ-IatUfis
            Thanks man. Since you say you already have all of it in your distro, is there an easy way to get a live image of your distro with all this very latest stuff included? That would make it pretty easy for me to test.

            (If not, I will just try that AMD kernel on Fedora and see where it goes.)



            • #46
              Originally posted by mercurio

              Hi,

              I get the same issue on my Gigabyte Radeon RX VEGA 64 GAMING OC 8G, see below

              28 22:23:41 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2945324, emitted seq=2945326
              28 22:23:41 kernel: [drm] GPU recovery disabled.
              --> Had to hold power button
              -- Reboot --
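For anyone collecting these messages for a bug report, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of a timeout line. The pattern is modelled only on the two log lines quoted above, so other kernel versions may word the message differently. The function names are hypothetical, and reading the difference between the emitted and signaled numbers as "work submitted but not yet completed" is an assumption on my part.

```python
import re

# Pattern modelled on the log lines quoted in this post; other kernel
# versions may format the message differently (assumption).
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, gap) or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    # gap = emitted - signaled; interpreted here as outstanding work (assumption).
    return match.group("ring"), signaled, emitted, emitted - signaled


def recovery_disabled(line):
    """True if the line is the 'GPU recovery disabled' notice quoted above."""
    return "GPU recovery disabled" in line


sample_lines = [
    "28 22:23:41 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=2945324, emitted seq=2945326",
    "28 22:23:41 kernel: [drm] GPU recovery disabled.",
]

for entry in sample_lines:
    parsed = parse_ring_timeout(entry)
    if parsed is not None:
        print("timeout on ring %s: signaled=%d emitted=%d gap=%d" % parsed)
    elif recovery_disabled(entry):
        print("recovery was disabled, so the hang was not reset")
```

Because the ring name is matched generically, the same sketch should also pick up the "ring sdma0 timeout" variant mentioned earlier in the thread, provided the rest of the line has the same shape.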

              I have been logging the GPU temperature via lm-sensors. The critical threshold is set to 89.0°C and I get at most 72°C while playing DOOM 2016 in 4K, so I guess the GPU is not overheating, as someone suggested. The PC is brand new, so there is no settled dust or anything like that.
              There is a review at Tom's Hardware where they measured 74-75°C:
              https://www.tomshardware.com/reviews...ew,5441-5.html
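The temperature argument above boils down to simple arithmetic on logged samples. A minimal Python sketch follows, assuming the samples are in millidegrees Celsius (the unit the Linux hwmon sysfs files such as `temp1_input` use). The function name and the sample list are made up for illustration; only the 72°C peak and the 89.0°C critical value come from the post.

```python
def thermal_headroom(samples_millideg, crit_c):
    """Return (peak temperature in °C, margin below the critical threshold in °C)."""
    peak_c = max(samples_millideg) / 1000.0
    return peak_c, crit_c - peak_c


# Hypothetical log: only the 72 °C peak and 89.0 °C critical value are from the post.
logged = [65000, 72000, 70000]
peak, margin = thermal_headroom(logged, 89.0)
print("peak %.1f °C, %.1f °C below critical" % (peak, margin))
```

With the figures quoted in the post, the card stays 17°C under its critical threshold, which supports the view that the hangs are not thermal.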

              Other interesting kernel messages I get:

              30 17:54:49 kernel: [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx 000000003349f739 is still alive
              30 17:54:49 kernel: [drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx 000000003349f739 is still alive
              --> freeze while running command "shutdown -h now"

              I am using Xubuntu 18.04.1 LTS, kernel 4.19 mainline from kernel.ubuntu.com with the Padoka Stable PPA repo.

              Gigabyte released the F2 VGA BIOS for my card, and I am going to give it a try.
              http://download.gigabyte.eu/FileList...A64_8GD_F2.zip

              Hi, it seems that the new kernel 4.19.1

              http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.1/

              and the latest firmware have solved this issue.

              https://bugs.debian.org/cgi-bin/bugr...cgi?bug=906526
              https://git.kernel.org/pub/scm/linux...e657faef7f7f45

              I have also switched to Padoka PPA - unstable

              https://launchpad.net/~paulo-miguel-...ve/ubuntu/mesa

              I am also using amdgpu.gpu_recovery=1
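To double-check that the option actually reached the kernel, one approach is to look for it on the boot command line (on Linux this is exposed in `/proc/cmdline`). Below is a minimal Python sketch that parses such a string. The function name and the sample command line are hypothetical; only the `amdgpu.gpu_recovery=1` token comes from the post.

```python
def gpu_recovery_from_cmdline(cmdline):
    """Return the integer given to amdgpu.gpu_recovery, or None if absent or malformed."""
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "amdgpu.gpu_recovery" and sep:
            try:
                value = int(raw)  # if the option is repeated, the last one wins here
            except ValueError:
                value = None
    return value


# Hypothetical command line; on a real system read the text from /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet amdgpu.gpu_recovery=1"
print(gpu_recovery_from_cmdline(example))
```

This only shows what was requested at boot. It does not prove that a reset will succeed on a given card.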

              Cheers



              • #47
                Originally posted by tildearrow

                Not anymore. With one exception, but this problem also happens on Windows: My card sometimes due to AC voltage variance, cosmic fluctuations or lack of ground, randomly self-resets (goes to a "0MHz" power clock, turns the screen off and resets the fan speed). I don't know how to plug it in again, which is most likely impossible
                I'm having a similar problem only it doesn't shut my screen off, it just goes black. It's still powered up and back-lit. The system is halted, with full power going to hardware. I know this because the load meter goes up on my UPS, the graphics card has been hot (in the 60's immediately after the hard boot and quickly cools down) and the machine doesn't respond to input (including ACPI power button) or ping... it's halted. Nothing is logged.

                I do sometimes notice line voltage fluctuation when this happens, but no matter how good the voltmeter on my UPS is, it's not going to react quickly enough to see an instantaneous drop. What I thought was that it happened when the voltage was high and suddenly dropped low; in other words, it's the delta that something is reacting to when the card is in low clock mode. I happened to see it once: the line voltage was about 127V (yes, that's real, I've compared the voltmeter on my UPS's OSD with a multitester) and dropped down to about 113V and came back up to about 124V.

                It has never happened when the card is in high clock mode, for example while playing games. It's more likely to happen when reading a web page, or especially while doing nothing. I have also never come back and found my system dead when display power management has shut the screen off, though I have come back after about 2 minutes (refill my coffee) to find it dead, with the screen black and back-lit. Nothing going on, all programs except desktop environment closed.

                I am using Plasma 5 on both my Kubuntu (for games) and Arch (for work and life) now but this has also happened with XFCE in a very simple Crux distro.

                I roll my own kernels, as I have been doing for decades.

                I have ASPM disabled (I have always just enabled it in kernel and set "Performance" mode) but I have been using pcie_aspm=off for some time now. It's something you have to enable in kernel or you can't disable it :-)

                It happened to me in Windows once, I had just clicked a menu in Firefox. HOWEVER, I don't use Windows very often anymore and while I'm there I'd be very unlikely to be idle (I'd be playing a game).

                The problem could occur 3 times in one day, or go a week without happening. This is why I think it's external factors, for example line conditions.

                My guess would be AMD drivers and firmware blobs reacting adversely.

                Driver recovery has NEVER really worked properly for me since I bought this card. I used to have assloads of problems in Windows with early AMD drivers. It got to the point where I just set Windows TDR ("time out detection and recovery") to blue screen stop error (BugCheck on timeout) instead of recovery. It's marginally better than a hard power off, as it at least syncs buffers. I don't seem to have that particular problem in Windows anymore, but I am using Windows 7 and sticking with AMD drivers that have no problems, not upgrading them for the sake of it. I don't really have those kinds of game crashes to invoke TDR.

                My graphics card is an MSI-brand R9 380.

