AMDGPU Reset Recovery To Be Flipped On By Default For Newer Radeon GPUs

Grogan replied

27 February 2019, 06:15 PM
Originally posted by tildearrow

Not anymore. With one exception, but this problem also happens on Windows: my card sometimes, due to AC voltage variance, cosmic fluctuations or lack of ground, randomly self-resets (goes to a "0MHz" power clock, turns the screen off and resets the fan speed). I don't know how to plug it in again, which is most likely impossible.

I'm having a similar problem only it doesn't shut my screen off, it just goes black. It's still powered up and back-lit. The system is halted, with full power going to hardware. I know this because the load meter goes up on my UPS, the graphics card has been hot (in the 60's immediately after the hard boot and quickly cools down) and the machine doesn't respond to input (including ACPI power button) or ping... it's halted. Nothing is logged.

I do sometimes notice line voltage fluctuation when this happens, but no matter how good the voltmeter on my UPS is, it's not going to react quickly enough to see an instantaneous drop. What I thought was that it happened when the voltage was high and suddenly dropped low; in other words, it's the delta that something is reacting to when the card is in low clock mode. I happened to see it once: the line voltage was about 127V (yes, that's real, I've compared the voltmeter on my UPS's OSD with a multitester voltmeter), dropped down to about 113V and came back up to about 124V.

It has never happened when the card is in high clock mode, for example while playing games. It's more likely to happen when reading a web page, or especially while doing nothing. I have also never come back and found my system dead when display power management has shut the screen off, though I have come back after about 2 minutes (refill my coffee) to find it dead, with the screen black and back-lit. Nothing going on, all programs except desktop environment closed.

I am using Plasma 5 on both my Kubuntu (for games) and Arch (for work and life) now but this has also happened with XFCE in a very simple Crux distro.

I roll my own kernels, as I have been doing for decades.

I have ASPM disabled (I have always just enabled it in kernel and set "Performance" mode) but I have been using pcie_aspm=off for some time now. It's something you have to enable in kernel or you can't disable it :-)
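
For anyone who wants to confirm that an option like pcie_aspm=off actually reached the running kernel, here is a minimal Python sketch. It assumes the usual /proc/cmdline interface; the sample command line and the helper name kernel_param are made up for illustration and are not from my system.

```python
from pathlib import Path


def kernel_param(cmdline, name):
    """Look up `name` on a kernel command line string.

    Returns the value for name=value, "" for a bare flag, and None if
    the parameter is absent. If it appears more than once, this helper
    returns the last occurrence.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


# Hypothetical command line, used only to show the behaviour.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet pcie_aspm=off"
print(kernel_param(sample, "pcie_aspm"))

if __name__ == "__main__":
    # On a real Linux system, read the command line the kernel booted with.
    try:
        live = Path("/proc/cmdline").read_text()
        print(kernel_param(live, "pcie_aspm"))
    except OSError:
        print("could not read /proc/cmdline")
```

The same helper works for dotted module options such as amdgpu.gpu_recovery, since it only splits on whitespace and the first "=".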

It happened to me in Windows once, I had just clicked a menu in Firefox. HOWEVER, I don't use Windows very often anymore and while I'm there I'd be very unlikely to be idle (I'd be playing a game).

The problem could occur 3 times in one day, or go a week without happening. This is why I think it's external factors, for example line conditions.

My guess would be AMD drivers and firmware blobs reacting adversely.

Driver recovery has NEVER really worked properly for me since I bought this card. I used to have assloads of problems in Windows with early AMD drivers. It got to the point where I just set Windows TDR ("time out detection and recovery") to blue screen stop error (BugCheck on timeout) instead of recovery. It's marginally better than a hard power off, as it at least syncs buffers. I don't seem to have that particular problem in Windows anymore, but I am using Windows 7 and sticking with AMD drivers that have no problems, not upgrading them for the sake of it. I don't really have those kinds of game crashes to invoke TDR.

My graphics card is an MSI-brand R9 380.
mercurio replied

04 November 2018, 04:54 PM
Originally posted by mercurio

Hi,

I get the same issue on my Gigabyte Radeon RX VEGA 64 GAMING OC 8G, see below

28 22:23:41 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2945324, emitted seq=2945326
28 22:23:41 kernel: [drm] GPU recovery disabled.
--> Had to hold power button
-- Reboot --

I have been logging GPU temperature via lm-sensors. The critical threshold is set to 89.0°C and I get a maximum of 72°C while playing DOOM 2016 in 4K, so I guess the GPU is not overheating as someone mentioned. The PC is brand new, so there is no settled dust or anything like that.
There is a review at Tom's Hardware; they measured 74-75°C.

Gigabyte Radeon RX Vega 64 Gaming OC 8G Review

https://www.tomshardware.com/reviews/gigabyte-rx-vega-64-gaming-oc-review,5441-5.html


Other interesting kernel messages I get:

30 17:54:49 kernel: [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx 000000003349f739 is still alive
30 17:54:49 kernel: [drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx 000000003349f739 is still alive
--> freeze while running command "shutdown -h now"

I am using Xubuntu 18.04.1 LTS, kernel 4.19 mainline from kernel.ubuntu.com with the Padoka Stable PPA repo.

Gigabyte released an F2 VGA BIOS for my card; I am going to give it a try.

http://download.gigabyte.eu/FileList/BIOS/vga_bios_RXVEGA64_8GD_F2.zip

Hi, it seems that the new kernel 4.19.1


http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.1/

and latest firmware have solved this issue.

#906526 - firmware-amd-graphics: New updatream version available; includes important system hang fix. - Debian Bug report logs

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=906526

amdgpu: update vega10 firmware to 18.40 - kernel/git/firmware/linux-firmware.git - Repository of firmware blobs for use with the Linux kernel

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/amdgpu?id=ac5f8bdd1f02612cf397c6fdfde657faef7f7f45

I have also switched to Padoka PPA - unstable

padoka PPA : Paulo Dias

https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa


I am also using amdgpu.gpu_recovery=1
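
A minimal Python sketch for checking whether the option took effect after boot. It assumes the loaded module exposes the setting under /sys/module/amdgpu/parameters/gpu_recovery and that the values mean 1 = enabled, 0 = disabled, -1 = driver default; both are my understanding of the module parameter rather than something verified in this thread, and the function names are made up.

```python
from pathlib import Path

# Assumed sysfs location of the amdgpu module parameter.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/gpu_recovery")


def read_gpu_recovery(path=PARAM_PATH):
    """Return the parameter as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(value):
    """Translate the raw value into a short description (assumed meanings)."""
    meanings = {1: "enabled", 0: "disabled", -1: "auto (driver default)"}
    if value is None:
        return "unknown (amdgpu not loaded or parameter not readable)"
    return meanings.get(value, "unexpected value %d" % value)


print(describe(read_gpu_recovery()))
```

If the file is missing (for example because amdgpu is not loaded), the sketch reports "unknown" instead of raising an error.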

Cheers
Venemo replied

31 October 2018, 06:44 AM
Originally posted by debianxfce

Michael is a huge Fedora fan. He uses Ubuntu and Padoka PPA Mesa git when testing with games. Use a rolling release OS like Debian testing Xfce. Fedora is not popular for gaming: https://www.gamingonlinux.com/users/statistics

Here is the WIP kernel, clone it on Saturday:
https://cgit.freedesktop.org/~agd5f/...-next-4.21-wip

You need to have the latest Mesa git and firmware too.
https://launchpad.net/~oibaf/+archiv...aphics-drivers

I have all of these in my distribution: https://www.youtube.com/watch?v=fKJ-IatUfis

Thanks man. Since you say you already got all of it in your distro, is there an easy way to get a live image of your distro with all this very latest stuff included? That would make it pretty easy for me to test.

(If not, I will just try that AMD kernel on Fedora and see where it goes.)
Venemo replied

31 October 2018, 06:05 AM
Originally posted by debianxfce

Many people, including you, use an old and buggy OS and drivers.

I use the latest kernel and mesa releases (4.19 and 18.2). Are you suggesting that there is an AMD kernel somewhere that I can try? If so, please give me a link and I will try it.
Venemo replied

31 October 2018, 05:54 AM
Originally posted by debianxfce

You are using an external pci-e converter. That is not in AMD's interest and I am sure they do not have the same hardware that you have. I am sure that when I install my new ASUS Radeon RX 570 4GB Expedition OC in the desktop computer it will work without problems. I am waiting for the pci-e 8 pin power adapter to arrive.

That has nothing to do with it. Seriously, search for "ring gfx timeout" on google and you will find many very very similar issues. People experience those issues with their desktop computers.
Venemo replied

31 October 2018, 04:46 AM
I get a consistent GPU hang on my RX 570, I also reported a bug here: https://bugs.freedesktop.org/show_bug.cgi?id=108493 - there are some tweaks that can improve the stability, so that it doesn't hang immediately but after 10-15 minutes.

But if you google for "ring gfx timeout" or "ring sdma0 timeout" you'll find a bunch of other similar issues. Most of the time it is also unclear whether the problem is in amdgpu (the kernel driver) itself or in radeonsi (the mesa driver). It has been stated in a few bug reports that mesa can indeed trigger a GPU hang if it behaves badly.

I guess I'm looking forward to seeing kernel 4.20-rc1 to test these fantastic new changes and see if the problem is fixed (or at least maybe it can now properly reset the GPU if not).
Soul_keeper replied

31 October 2018, 03:01 AM
I get the same *ERROR* ring gfx timeout error/lockup all the time with my Vega FE ...
If it were heat, underclocking/undervolting and maxing out the fans would help, but they don't.

dota2 is great for exposing it, just try to finish a complete game against bots with details maxed at 4K (maybe lesser settings do it too).
typically half way thru the game it'll lock up, just when the big fights are starting.

I don't really have the time to care right now or test/report as i'm not a driver dev. All I can do is be patient and wait another year I guess ...

Now if reset worked ... i'd prolly never notice. It would just be a dmesg warning that happens once or twice per game session, with maybe a slight stutter in game. That would be far easier to deal with as I wait for the drivers to mature.

Last edited by Soul_keeper; 31 October 2018, 03:07 AM.
mercurio replied

30 October 2018, 02:01 PM
Originally posted by Med_

This is good news. I consistently get hangs with games. Typically once every few hours. I do not bother reporting them as I cannot reproduce on demand and the bug tracker is full of them with similar logs ([drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=145101, last emitted seq=145103). I have activated the option, we will see whether that at least prevents the power button treatment.

Hi,

I get the same issue on my Gigabyte Radeon RX VEGA 64 GAMING OC 8G, see below

28 22:23:41 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2945324, emitted seq=2945326
28 22:23:41 kernel: [drm] GPU recovery disabled.
--> Had to hold power button
-- Reboot --
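
For comparing reports like the ones in this thread, here is a small Python sketch that pulls the ring name and the two sequence numbers out of such a line. Reading the difference as "submissions emitted but not yet signaled" is an assumption on my part, not something the driver message states; the function name is made up.

```python
import re

# Matches both variants quoted in this thread:
# "signaled seq=..., emitted seq=..." and
# "last signaled seq=..., last emitted seq=...".
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"(?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # pending: assumed to be the number of submissions still outstanding
    return m.group("ring"), signaled, emitted, emitted - signaled


line = (
    "28 22:23:41 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, signaled seq=2945324, emitted seq=2945326"
)
print(parse_ring_timeout(line))  # ('gfx', 2945324, 2945326, 2)
```

Piping the output of dmesg or journalctl through this line by line would give a quick list of which ring timed out and how far apart the two counters were.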

I have been logging GPU temperature via lm-sensors. The critical threshold is set to 89.0°C and I get a maximum of 72°C while playing DOOM 2016 in 4K, so I guess the GPU is not overheating as someone mentioned. The PC is brand new, so there is no settled dust or anything like that.
There is a review at Tom's Hardware; they measured 74-75°C.

Gigabyte Radeon RX Vega 64 Gaming OC 8G Review

https://www.tomshardware.com/reviews/gigabyte-rx-vega-64-gaming-oc-review,5441-5.html


Other interesting kernel messages I get:

30 17:54:49 kernel: [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx 000000003349f739 is still alive
30 17:54:49 kernel: [drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx 000000003349f739 is still alive
--> freeze while running command "shutdown -h now"

I am using Xubuntu 18.04.1 LTS, kernel 4.19 mainline from kernel.ubuntu.com with the Padoka Stable PPA repo.

Gigabyte released an F2 VGA BIOS for my card; I am going to give it a try.

http://download.gigabyte.eu/FileList/BIOS/vga_bios_RXVEGA64_8GD_F2.zip
DarkFoss replied

29 October 2018, 12:30 PM
Originally posted by tildearrow

I can't encode in H.264... (the encoding slice is now present but is useless)

Just curious, do you have this error in your dmesg?
[drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
FireBurn replied

29 October 2018, 08:09 AM
Originally posted by Brisse

That's not a fix. A fix is to fix the overheating problem. Overheating is not a normal operating condition for a computer or pretty much for any system electric or otherwise. That's why it's called OVERheating.

Either way, the graphics card going titsup shouldn't take out the whole system
