AMD devs: *ERROR* ring gfx timeout


  • #21
    It's been 3 months with this thread and nobody with Vega hardware has responded.
    Either nobody uses Vega on Linux for 3D or nobody has stability issues.
    The former seems far more likely due to unstable software.

    Someone please set this straight if you actually own the hardware.
    Can somebody with Vega try to finish a complete game of Dota 2 and see if it locks up?



    • #22
      It seems that if I switch to Vulkan it's fully stable in Dota 2.
      So OpenGL is unstable ... definitely not faulty hardware, it's the other guy's fault.
      I guess Vulkan gets all the attention to the detriment of OpenGL these days.



      • #23
        Originally posted by Soul_keeper
        It's been 3 months with this thread and nobody with Vega hardware has responded.
        Either nobody uses Vega on Linux for 3D or nobody has stability issues.
        The former seems far more likely due to unstable software.

        Someone please set this straight if you actually own the hardware.
        Can somebody with Vega try to finish a complete game of Dota 2 and see if it locks up?
        I don't own Dota 2 but I stumbled across this thread from a search for the ring gfx timeout error. I have a Vega 56 and a WX7100 running on Fedora 29 (multi-monitor setup) and in the last week the Vega has gotten bad about doing this. It doesn't have to be doing anything 3D for it to happen. I replaced the firmware that the Fedora kernel ships with the more up-to-date versions from the linux-firmware git repo hosted on kernel.org and so far it hasn't done it again, well, not yet anyway. I had a GTX970 in my previous machine that had a habit of falling off the PCI-E bus and causing a similar crash, and that turned out to be a hardware issue. My Vega is the XFX version with the double fans so it's not a reference design.

        The WX7100 I've had for quite a while (ran it in my old machine after the 970 made me angry enough) and it's rock solid. This Vega is maybe 3-4 months old at this point. Kind of tempted to just RMA it before it's out of warranty. I've got an 850W PSU which should be plenty stout for this setup. No overclocking or anything. Anyone else still having this? From the rarity of the issue after looking on Google I'm starting to think it's just bad hardware.



        • #24
          Originally posted by debianxfce

          Lower the engine clock of the problematic GPU to check if your PSU is the cause. I avoid XFX cards: after touching the grounding shield of the HDMI cable I got an electric shock and the XFX RX460 card lost its graphics engine. I got a Gigabyte RX 460 for free as a replacement. I buy expensive PC parts only from Asus, after flashing the latest BIOS to a new Gigabyte B350 Gaming 3 motherboard and it did not boot anymore. I got my money back and bought an ASUSTeK model: PRIME B350M-K. This is my second Ryzen 5 1600 & PRIME B350M-K PC build. I sold the first build for the same amount of money that I used when I wanted a larger SSD.
          I'm running a 2950X and haven't checked what it pulls from the wall since adding a second GPU (I was running one until November), so it could be the PSU. But I'm not overclocking and the WX7100 isn't exactly a power hog. Most of the time it's just displaying text; I only stuck it in as a way to drive some extra displays I had. With the math I did it seemed like 850W should keep me covered. I don't even have to be doing anything strenuous for it to happen.

          So far I haven't had the issue since replacing the kernel firmware F29 has in place with what's in the newest kernel's git repo. Maybe a bug there. Although it usually only does it pretty soon after a cold boot, once it's been up and running for a while it feels like it doesn't happen as often. No scientific measurement there just anecdote. I'll see what I get tonight when I get home as it's been off since this morning.
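For anyone wanting to try the "lower the engine clock" test quoted above without a GUI tool, here is a rough sketch, not a verified procedure. It assumes the amdgpu driver's `power_dpm_force_performance_level` sysfs file, that the problematic GPU is `card0` (an assumption, check `/sys/class/drm/` on your box), and that you run it as root. The function name is made up for illustration. Writing `low` should pin the card to its lowest clocks; `auto` hands control back to the driver.

```python
from pathlib import Path

# Assumption: the problematic GPU is card0. On a two-GPU box it may be card1.
DEFAULT_LEVEL_FILE = Path(
    "/sys/class/drm/card0/device/power_dpm_force_performance_level"
)

# Only the basic levels are used here; the driver accepts others as well.
VALID_LEVELS = ("auto", "low", "high")


def set_performance_level(level, level_file=DEFAULT_LEVEL_FILE):
    """Write a performance level and return the level that was set before."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    level_file = Path(level_file)
    previous = level_file.read_text().strip()
    level_file.write_text(level + "\n")
    return previous


# Typical use (as root):
#   old = set_performance_level("low")   # force lowest clocks, then try to reproduce
#   set_performance_level(old)           # restore the previous setting afterwards
```

If the timeouts still show up with the card pinned to `low`, that would point away from the PSU, though it is not proof either way.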



          • #25
            The watts aren't that important. The important thing is the build quality of the power supply. My overclocked RX 570 @ 1,400 MHz runs perfectly on my 9-year-old 450W power supply.



            • #26
              Originally posted by HD7950
              The watts aren't that important. The important thing is the build quality of the power supply. My overclocked RX 570 @ 1,400 MHz runs perfectly on my 9-year-old 450W power supply.
              Got you beat there, one of my Debian machines is running an old, power-hungry (by today's standards) Athlon 64 X2 5400+ and an R9 270 off a 12-year-old TR 430W. Stable as can be. Efficiency counts as well, especially considering how much it can drop when things get hot.

              My 2950X system is running a Corsair HX850 which isn't an awful quality PSU. Definitely not one of these "says 850W on the box but more like 650W on a good day and don't measure the current or you're going to be sad" deal-of-the-day specials. I mainly bought it because of the long warranty, high efficiency rating and lack of RGB shenanigans. Good grief that stuff is getting hard to avoid ... finding a main-board without any lighting is impossible.



              • #27
                Well, it's been running fine for the last few days, but I didn't get a chance to look at it or shut it down before I left for the office this morning. I tried to remote in for a shutdown and it's apparently totally crashed. No response, etc. Looks like it fell off the network not long after I went to bed for the night. Going to see if there was anything in the logs when I get home. Nothing in the UPS log so it wasn't a power failure or anything like that.

                There was a kernel update last night but I hadn't rebooted into it yet. Need to get some monitoring/remote logging/statistics on this thing. At any rate I'm about 99% sure I've got some sort of hardware issue at this point.
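As a stopgap until proper remote logging is in place, something like the following could pull the kernel messages from the boot that crashed and flag the interesting lines. This is only a sketch: it assumes systemd's `journalctl` with a persistent journal, and the function names and the list of strings to look for are my own starting guesses, not anything official.

```python
import subprocess

# Hypothetical starter list of substrings worth flagging; extend as needed.
SUSPICIOUS = ("*ERROR*", "blocked for more than")


def previous_boot_kernel_log():
    """Return kernel messages from the previous boot as a list of lines.

    -k limits output to kernel messages, -b -1 selects the previous boot.
    This only works if the journal is stored persistently.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def suspicious_lines(lines, needles=SUSPICIOUS):
    """Keep only the lines containing at least one of the needles."""
    return [line for line in lines if any(n in line for n in needles)]


# Typical use after rebooting from a hang:
#   for line in suspicious_lines(previous_boot_kernel_log()):
#       print(line)
```

If the machine locked up hard enough that nothing was flushed to disk, the previous boot's log may simply end without any error, which is where real remote logging would still be needed.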



                • #28
                  Originally posted by debianxfce

                  Fedora is like Windows 8 (the UI), with automatic updates that then crash the system. There are better operating systems. I wonder why US people do not believe their scientists about what is good.
                  http://allinfo.space/2017/01/14/nasa...ops-by-debian/
                  I've run Debian everywhere for more than a decade, and Fedora on and off since they split it from Red Hat, along with Slackware and Red Hat/Mandrake before that. That's just my desktop systems. I didn't want to fool around with Sid or Testing on a production system and there's not great support for this newer gear in Stable yet. I don't care for Ubuntu or derivative distributions in general but that's more of a personal preference thing. I moved from XFCE a few years ago because of HiDPI, moving my multimedia stuff to Linux from OS X, and needing color profiling support. They're all fully capable of breaking and leaving you hanging.

                  The FUD about dnf is unfounded (honestly I thought that had all blown over at this point) and I've had apt take a dump on me many times in the past too. Everything has its problems. Not sure what the GUI remark was about; KDE Fedora has Discover, which AFAIK is pretty standard among Plasma 5 distros these days. But I just use dnf at the command line for package management anyway. Fedora has been great and honestly if you're running a KDE desktop it's one of the better ones out there. I highly doubt a kernel update that I had not yet rebooted into caused any problems. If it did, I would have many more broken boxes waiting on their reboot while I roll through a few hundred of them this morning. :P A mix of Debian, RH and Fedora systems as well!

                  Back to the matter at hand. I did some perusing of the FreeDesktop bug system and found a few others reporting the same errors. At least one person reported it was fixed by replacing a faulty GPU. I've had other failed GPUs in the past that caused similar issues (mostly no video) too. Considering my Polaris-based WX7100 isn't having issues I'd bet this Vega card is just DOA. I'm going to yank it out tonight and see if the box is stable on the WX7100. If so it's under warranty and back to the mothership it goes! Man, quality control on parts seems to have really gone downhill since the 90s.

                  TL;DR: am old fart, been running Linux a long while, they all break and make you angry from time to time.

                  EDIT: I might throw the Vega in another box and see if it hangs in there. I forgot I had access to an old i7 with PCI express.
                  Last edited by lhutton; 17 January 2019, 12:40 PM.



                  • #29
                    I have the exact same problem with my Vega 64 and it has been happening for as long as I've had the card. I previously used the card on Windows before going Linux, and it was 100% stable on Windows, so I can rule out a hardware issue. Interestingly, the issue only seems to happen in Vulkan games for me. Some games seem to be completely stable (World of Warcraft via Wine with DXVK, Final Fantasy XII via Proton with DXVK), others exhibit the issue every now and then, like once every few hours on average (GTA V via Wine with DXVK), and some crash really quickly, within seconds or minutes on the menu screen (Mario Party 9 running on Dolphin with Vulkan rendering).

                    You might want to play around with different LLVM and Mesa/RADV versions to see if the latest SVN/git versions work better, it seems to have helped for some people. I haven't found a complete fix yet unfortunately. Or try out a completely different Vulkan driver (AMDVLK, AMDGPU-PRO) (haven't tried that myself yet). If the issue is in the AMDGPU part in the kernel, that might not actually fix the issue. But since only Vulkan seems to be affected, it might actually be worth a try.

                    Please keep me posted. Really curious if you manage to find something interesting.



                    • #30
                      I couldn't find an edit button for my previous post, but that might also be due to the fact it hasn't been approved yet. So I am going to post this in a separate post. I managed to get a stack trace of my latest crash by this issue. Maybe this is helpful for debugging purposes? I am not sure on which bugtracker I should open an issue though. Any thoughts from you guys where we should report this?

                      Code:
                      [  858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=160177, emitted seq=160179
                      [  858.970205] [drm] GPU recovery disabled.
                      [  982.906053] INFO: task kworker/u32:6:398 blocked for more than 120 seconds.
                      [  982.906055]       Not tainted 4.20.3-arch1-1-ARCH #1
                      [  982.906056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
                      [  982.906057] kworker/u32:6   D    0   398      2 0x80000000
                      [  982.906068] Workqueue: events_unbound commit_work [drm_kms_helper]
                      [  982.906069] Call Trace:
                      [  982.906075]  ? __schedule+0x29b/0x8b0
                      [  982.906077]  ? __switch_to_asm+0x40/0x70
                      [  982.906079]  schedule+0x32/0x90
                      [  982.906080]  schedule_timeout+0x311/0x4a0
                      [  982.906126]  ? dce120_timing_generator_get_crtc_position+0x5b/0x70 [amdgpu]
                      [  982.906167]  ? dce120_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
                      [  982.906170]  dma_fence_default_wait+0x204/0x280
                      [  982.906172]  ? dma_fence_wait_timeout+0x120/0x120
                      [  982.906173]  dma_fence_wait_timeout+0x105/0x120
                      [  982.906175]  reservation_object_wait_timeout_rcu+0x1f2/0x370
                      [  982.906178]  ? preempt_count_add+0x79/0xb0
                      [  982.906221]  amdgpu_dm_do_flip+0x10d/0x370 [amdgpu]
                      [  982.906265]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
                      [  982.906267]  ? _raw_spin_lock_irq+0x1a/0x40
                      [  982.906268]  ? wait_for_common+0x113/0x190
                      [  982.906269]  ? __switch_to_asm+0x34/0x70
                      [  982.906275]  commit_tail+0x3d/0x70 [drm_kms_helper]
                      [  982.906278]  process_one_work+0x1eb/0x410
                      [  982.906280]  worker_thread+0x2d/0x3d0
                      [  982.906282]  ? process_one_work+0x410/0x410
                      [  982.906283]  kthread+0x112/0x130
                      [  982.906284]  ? kthread_park+0x80/0x80
                      [  982.906286]  ret_from_fork+0x22/0x40
                      [  982.906290] INFO: task kworker/u32:8:404 blocked for more than 120 seconds.
                      [  982.906290]       Not tainted 4.20.3-arch1-1-ARCH #1
                      [  982.906291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
                      [  982.906291] kworker/u32:8   D    0   404      2 0x80000000
                      [  982.906297] Workqueue: events_unbound commit_work [drm_kms_helper]
                      [  982.906298] Call Trace:
                      [  982.906300]  ? __schedule+0x29b/0x8b0
                      [  982.906301]  schedule+0x32/0x90
                      [  982.906302]  schedule_preempt_disabled+0x14/0x20
                      [  982.906303]  __ww_mutex_lock.isra.2+0x413/0x7f0
                      [  982.906329]  ? amdgpu_get_vblank_counter_kms+0x110/0x160 [amdgpu]
                      [  982.906370]  amdgpu_dm_do_flip+0xd2/0x370 [amdgpu]
                      [  982.906412]  amdgpu_dm_atomic_commit_tail+0x6c4/0xd20 [amdgpu]
                      [  982.906414]  ? _raw_spin_lock_irq+0x1a/0x40
                      [  982.906415]  ? wait_for_common+0x113/0x190
                      [  982.906416]  ? __switch_to_asm+0x34/0x70
                      [  982.906422]  commit_tail+0x3d/0x70 [drm_kms_helper]
                      [  982.906424]  process_one_work+0x1eb/0x410
                      [  982.906425]  worker_thread+0x2d/0x3d0
                      [  982.906427]  ? process_one_work+0x410/0x410
                      [  982.906428]  kthread+0x112/0x130
                      [  982.906429]  ? kthread_park+0x80/0x80
                      [  982.906431]  ret_from_fork+0x22/0x40
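To make it easier to compare crashes across machines in a bug report, the timeout line at the top of the trace above can be picked out of a saved dmesg dump with a small script. This is only a sketch: the regex is modelled on the single sample line in this post, so other kernel versions may word it differently, and the function and field names are made up. The `pending` value is just emitted minus signaled, which I read as the number of fences submitted but never completed; treat that interpretation as an assumption.

```python
import re

# Pattern modelled on the one sample line in this thread (kernel 4.20.3).
RING_TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout \[amdgpu\]\] "
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text):
    """Return one dict per ring timeout line found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match is None:
            continue
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        events.append(
            {
                "timestamp": float(match.group("ts")),
                "ring": match.group("ring"),
                "signaled": signaled,
                "emitted": emitted,
                # Assumed meaning: work submitted but not yet completed.
                "pending": emitted - signaled,
            }
        )
    return events


# The first two lines of the trace posted above, used as sample input.
SAMPLE = (
    "[  858.970202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=160177, emitted seq=160179\n"
    "[  858.970205] [drm] GPU recovery disabled.\n"
)

for event in find_ring_timeouts(SAMPLE):
    print(event["timestamp"], event["ring"], event["pending"])
```

Feeding it the output of `dmesg` saved to a file should give one entry per timeout, which is handy when attaching a summary to a bug report alongside the full log.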

