**Soul_keeper** · 26 September 2018, 06:25 PM

Thanks for the input. I guess gpu_recovery is going to be a WIP for a while.
As far as my particular problem goes, do you think it is more likely to be a kernel/drm problem, a hardware problem, or a problem with the particular game ?
Is amdgpu.job_hang_limit a possible solution ?

I'm just not sure which direction I should take from here. Ruling out hardware 100% would be ideal.

When I search Google for "vega ring gfx timeout" I find a number of results, but no real solutions.

**Soul_keeper** · 26 September 2018, 11:07 PM

There is no documentation on this, but it appears to be working:

amdgpu.disable_cu=0.0.14,0.0.15,1.0.14,1.0.15,2.0.14,2.0.15,3.0.14,3.0.15

This should basically turn a vega64/fe into a vega56.
I'm going to try to rule out hardware stability by playing with this for a while ...
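
For anyone wanting to reproduce this, the list can be generated rather than typed by hand. A minimal Python sketch, assuming each entry is written as se.sh.cu like in the parameter above; the function name `disable_cu_arg` and its arguments are made up for illustration and are not part of any amdgpu tool:

```python
def disable_cu_arg(shader_engines, cus, shader_array=0):
    """Build an amdgpu.disable_cu= string from SE and CU indices.

    Assumption: entries use the se.sh.cu form seen in this thread,
    with a single shader array (sh = 0) per SE.
    """
    entries = [
        f"{se}.{shader_array}.{cu}"
        for se in shader_engines
        for cu in cus
    ]
    return "amdgpu.disable_cu=" + ",".join(entries)


# Disable CUs 14 and 15 in each of the four SEs (the vega64 -> vega56 case)
print(disable_cu_arg(range(4), [14, 15]))
```

Generating the string this way also avoids stray spaces creeping into the list when it gets copied around.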

**Soul_keeper** · 27 September 2018, 07:50 AM

I've discovered another way to trigger the ring gfx timeout and force me to reboot.

amdgpu.disable_cu=3.0.0,3.0.1,3.0.2,3.0.3,3.0.4,3.0.5,3.0.6,3.0.7,3.0.8,3.0.9,3.0.10,3.0.11,3.0.12,3.0.13,3.0.14,3.0.15

startx and I get a black screen with no keyboard.
ssh in from another box, grab the dmesg, and reboot.

Apparently disabling all the CUs in an SE will cause this to trigger when X starts.

Any ideas why this might trigger it ?
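
A quick way to check whether a given setting falls into this case is to count the disabled CUs per SE. A rough Python sketch; the helper name `fully_disabled_ses` is hypothetical, and the default of 16 CUs per SE is an assumption taken from the 3.0.0 to 3.0.15 list above:

```python
from collections import defaultdict


def fully_disabled_ses(arg, cus_per_se=16):
    """Return the SE indices in which every CU is listed as disabled.

    Assumption: entries are se.sh.cu and each SE has cus_per_se CUs.
    """
    # Accept either the bare list or the full "amdgpu.disable_cu=..." form
    value = arg.split("=", 1)[-1]
    disabled = defaultdict(set)
    for entry in value.split(","):
        se, _sh, cu = (int(part) for part in entry.strip().split("."))
        disabled[se].add(cu)
    return sorted(se for se, cus in disabled.items() if len(cus) >= cus_per_se)


whole_se3 = "amdgpu.disable_cu=" + ",".join(f"3.0.{cu}" for cu in range(16))
print(fully_disabled_ses(whole_se3))
```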

Here are the ones that did pass:
https://openbenchmarking.org/result/...RA-1809275RA81
Strangely there seems to be no performance benefit to disabling the CUs in a staggered fashion. I'd think that the L1 cache sharing or other SE units would be affected ... unless this is somehow virtual.
I'd be really interested to see what disabling an entire SE, or two, would do.

**Soul_keeper** · 30 September 2018, 03:31 AM

The vega design as a whole just has high temps. After applying better thermal compound, adding more fans, and even underclocking/undervolting, I haven't been able to improve stability, so I think that rules out temperature. That leaves either a DOA card or software; I'm leaning towards software now.

debianxfce: Did you say you have a vega ?
Well, on any GCN card you could try disabling an SE with amdgpu.disable_cu; I bet that'd trigger a ring gfx timeout. Can you play dota2 completely stable ?

**Soul_keeper** · 16 December 2018, 07:36 AM

This issue still exists. No progress.

**Soul_keeper** · 17 December 2018, 07:12 AM

Originally posted by debianxfce

What if you run it with lower engine clocks. You can use my simple python app for that: https://www.phoronix.com/forums/foru...justing-clocks

Changing clocks/voltages/temps doesn't seem to fix the issue.

...
[61988.103214] [drm:0xffffffffa04012ed] *ERROR* ctx 00000000bbf8582b is still alive
[61988.103214] [drm:0xffffffffa04012ed] *ERROR* ctx 00000000bbf8582b is still alive
[61988.103215] [drm:0xffffffffa040137e] *ERROR* ctx 00000000bbf8582b is still alive
[72134.598433] [drm:0xffffffffa04929af] *ERROR* ring gfx timeout, signaled seq=4188416, emitted seq=4188418
[72134.598435] [drm] GPU recovery disabled.

System Lockup

Short of buying a new/different video card, I guess it's impossible to get amdgpu stable with a vega FE in 3D.
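
For comparing lockups across runs, the timeout line can be pulled out of a saved dmesg with a small script. A minimal Python sketch, assuming the message format shown in the log above; `parse_ring_timeout` is a made-up helper name, and reading "emitted minus signaled" as the number of fences still outstanding is my interpretation of the two sequence numbers:

```python
import re

# Matches e.g. "ring gfx timeout, signaled seq=4188416, emitted seq=4188418"
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return ring name and sequence numbers from a timeout line, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match["signaled"])
    emitted = int(match["emitted"])
    return {
        "ring": match["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Assumed meaning: fences emitted but not yet signaled
        "pending": emitted - signaled,
    }


line = ("[72134.598433] [drm:0xffffffffa04929af] *ERROR* ring gfx timeout, "
        "signaled seq=4188416, emitted seq=4188418")
print(parse_ring_timeout(line))
```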

**Soul_keeper** · 17 December 2018, 08:14 AM

The latest kernel-firmware-20181215_211de16-noarch-1 causes my system to not boot. Screen turns black and it goes into some kind of power cycle loop ...
Reverting to kernel-firmware-20181026_1baa348-noarch-1 allows it to boot again.

What did you guys break now ?

**Soul_keeper** · 17 December 2018, 08:31 AM

Originally posted by debianxfce

You did remove the heatsink from vega and have temp problems. I think your gpu card is broken based on the image you posted. GPU video memory gets corrupted when the temperature rises.

Just to be clear here: I "thought" I had temp issues; now I'm certain temperature isn't, and likely never was, a factor in anything.
So either the software (kernel/mesa/games) is buggy, or I've had a DOA card from day 1 and, because I panicked and replaced the thermal paste, I just have to eat a $1000 loss due to the voided warranty (and having owned the card for so long).

It's a shame I have no friends, or I'd loan the card to a Windows gamer and let them play with it for a few weeks to potentially rule out Linux as being the problem.

**Soul_keeper** · 17 December 2018, 07:24 PM

I haven't booted a Windows system in over 15 years.

**Soul_keeper** · 28 December 2018, 09:46 PM

Does anyone else have this problem ?
It's not going away ...

AMD devs: ERROR ring gfx timeout