AMDGPU Reset Recovery To Be Flipped On By Default For Newer Radeon GPUs

  • #21
    Originally posted by Med_
    It would be worth a try. How do you alter that with AMDGPU?
    There are:
    https://github.com/grmat/amdgpu-fancontrol and
    https://github.com/marazmista/radeon-profile

    Comment


    • #22
      Originally posted by Med_

      It would be worth a try. How do you alter that with AMDGPU?
      Here is my little shell script that does a simple fan curve:

      Code:
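      #!/bin/bash
      # Simple linear fan curve for an AMDGPU card; run as root.
      # The glob resolves the hwmon directory, whose number can change between boots.
      cardpath=$(echo /sys/bus/pci/devices/0000:03:00.0/hwmon/hwmon*)
      echo "$cardpath"
      # Switch the fan to manual control (1 = manual, 2 = automatic);
      # some kernels reject writes to pwm1 otherwise.
      echo 1 > "$cardpath/pwm1_enable"
      while true
        do temp=$(cat "$cardpath/temp1_input")   # millidegrees Celsius
           # PWM 60 at 30°C, plus 3 PWM steps per degree above that
           temp=$((60+((temp/1000)-30)*3))
           echo "$temp" > "$cardpath/pwm1"
           sleep 0.5
        done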
      (edit the PCI device location if necessary)

      (it won't work well if your card goes under 10°C but this is unlikely)
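
The 10°C caveat comes from the formula going negative below that point, and it also exceeds 255 (the maximum that pwm1 accepts) above 95°C. A minimal sketch of the same curve with clamping is below. The function name fan_pwm_for_temp is hypothetical and not part of the script above.

```shell
#!/bin/bash
# Map a temperature in millidegrees Celsius (the unit used by temp1_input)
# to a PWM value with the same linear curve as the script above,
# clamped to the 0..255 range that pwm1 accepts.
# fan_pwm_for_temp is a hypothetical helper name used for illustration.
fan_pwm_for_temp() {
  local millideg=$1
  local pwm=$(( 60 + (millideg / 1000 - 30) * 3 ))
  if [ "$pwm" -lt 0 ]; then
    pwm=0
  fi
  if [ "$pwm" -gt 255 ]; then
    pwm=255
  fi
  echo "$pwm"
}
```

Inside the loop this would replace the arithmetic line, for example: echo "$(fan_pwm_for_temp "$temp")" > "$cardpath/pwm1". With this curve 30°C maps to 60, and anything at or above 95°C is held at 255.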

      Comment


      • #23
        Originally posted by debianxfce
        I did laugh when I saw GPU reset patches for the Intel GPUs. The AMD drm-next-4.21-wip kernel started to hang the system when waking up from monitor blanking and sleeping after 30.9.2018 (4.19.rc5->rc6). The GPU reset patch has no effect with the RX560, and the system must be rebooted with the power button. My distribution has the latest WIP kernel available with Synaptic.
        That's what happens when you use a WIP (work in progress) kernel, an unstable distro and Ubuntu PPAs on Debian.

        (Though this does make it easier for you to bisect and file a bug report.)

        Comment


        • #24
          [140122.327535] WARNING: CPU: 7 PID: 31433 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn10/dcn10_hw_sequencer.c:845 dcn10_verify_allow_pstate_change_high+0x25/0x240 [amdgpu]
          [.....]
          [140122.328573] RIP: 0010:dcn10_verify_allow_pstate_change_high+0x25/0x240 [amdgpu]
          [140122.328615] Code: 00 00 00 00 00 0f 1f 44 00 00 55 53 48 8b 87 38 01 00 00 48 89 fb 48 8b b8 e0 01 00 00 e8 73 0f 01 00 84 c0 0f 85 16 02 00 00 <0f> 0b 80 bb b9 00 00 00 00 0f 84 07 02 00 00 48 8b 83 38 01 00 00
          [140122.328714] RSP: 0018:ffffa0208c76bb78 EFLAGS: 00010246
          [140122.328745] RAX: 0000000000000000 RBX: ffff91abf9ae7000 RCX: 0000000000000000
          [140122.328785] RDX: 0000000000000000 RSI: ffff91ac0ebd5548 RDI: ffff91ac0ebd5548
          [140122.328825] RBP: ffff91a904480000 R08: 0000000000000000 R09: 0000000000aaaaaa
          [140122.328865] R10: 0000000000000000 R11: ffffa020a1248220 R12: 0000000000000001
          [140122.328904] R13: ffffa0208c76bbc0 R14: ffff91a900094000 R15: 0000000000000000
          [140122.328945] FS: 0000000000000000(0000) GS:ffff91ac0ebc0000(0000) knlGS:0000000000000000
          [140122.328989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [140122.329022] CR2: 00007fc3fd172594 CR3: 00000003ebb16000 CR4: 00000000003406e0
          [140122.329061] Call Trace:
          [140122.329165] dcn10_set_bandwidth+0xad/0xc0 [amdgpu]
          [140122.329271] dc_commit_state+0x420/0x550 [amdgpu]
          [140122.329384] amdgpu_dm_atomic_commit_tail+0x388/0xdb0 [amdgpu]
          [140122.329422] ? __wake_up_common_lock+0x89/0xc0
          [140122.329451] ? _cond_resched+0x15/0x30
          [140122.329475] ? wait_for_completion_timeout+0x3a/0x180
          [140122.329505] ? wait_for_completion_interruptible+0x35/0x1b0
          [140122.329619] ? amdgpu_dm_atomic_commit_tail+0xdb0/0xdb0 [amdgpu]
          [140122.329664] commit_tail+0x3d/0x70 [drm_kms_helper]
          [140122.329702] drm_atomic_helper_commit+0x103/0x110 [drm_kms_helper]
          [140122.329745] restore_fbdev_mode_atomic+0x1c4/0x1e0 [drm_kms_helper]
          [140122.329790] drm_fb_helper_restore_fbdev_mode_unlocked+0x45/0x90 [drm_kms_helper]
          [140122.329839] drm_fb_helper_set_par+0x29/0x50 [drm_kms_helper]
          [140122.329880] drm_fb_helper_hotplug_event.part.35+0x90/0xb0 [drm_kms_helper]
          [140122.329927] drm_kms_helper_hotplug_event+0x26/0x30 [drm_kms_helper]
          [140122.330046] handle_hpd_irq+0xd9/0x100 [amdgpu]
          [140122.330156] dm_irq_work_func+0x4e/0x60 [amdgpu]
          [140122.330186] process_one_work+0x19b/0x390
          [140122.330212] worker_thread+0x30/0x370
          [140122.330236] ? rescuer_thread+0x320/0x320
          [140122.330261] kthread+0x112/0x130
          [140122.330282] ? kthread_create_worker_on_cpu+0x70/0x70
          [140122.330313] ret_from_fork+0x22/0x40
          [140122.330337] ---[ end trace 44fa92c7d8e7ba0d ]---


          What did AMD mean by this?

          Comment


          • #25
            Interesting, as my Ryzen Mobile does hang randomly but infrequently. I’m not even sure it is a GPU hang, though it certainly feels like it.

            For those well versed in Linux, what is the best way to turn this on with a new distro like Fedora 29?

            The frustrating thing with the hangs is that they happen when doing something that could hardly be described as complicated or GPU-demanding. Basically totally random. As others have pointed out, the latest kernel and Mesa do perform much better.
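
The option discussed in this thread is the amdgpu.gpu_recovery kernel parameter (1 enables it, 0 disables it, -1 leaves the choice to the driver). As a rough sketch, the helper below reports what a given kernel command line sets it to. The function name gpu_recovery_from_cmdline is hypothetical, and the grubby command in the comments assumes a Fedora install where grubby manages the boot entries.

```shell
#!/bin/bash
# Print the value of amdgpu.gpu_recovery found in a kernel command line
# string, or "unset" if the parameter is absent. If the parameter appears
# more than once, the last occurrence is reported.
# gpu_recovery_from_cmdline is a hypothetical helper name.
# The body runs in a subshell so that "set -f" (no glob expansion while
# splitting the string into words) does not leak into the caller.
gpu_recovery_from_cmdline() (
  set -f
  value="unset"
  for word in $1; do
    case $word in
      amdgpu.gpu_recovery=*) value=${word#amdgpu.gpu_recovery=} ;;
    esac
  done
  echo "$value"
)

# Typical use on a running system:
#   gpu_recovery_from_cmdline "$(cat /proc/cmdline)"
# To add the parameter on Fedora (assumption: grubby manages boot entries):
#   sudo grubby --update-kernel=ALL --args="amdgpu.gpu_recovery=1"
```

After a reboot, running the helper against /proc/cmdline should confirm whether the parameter was picked up.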

            Comment


            • #26
              Originally posted by tildearrow

              That's what happens when you use a WIP (work in progress) kernel, an unstable distro and Ubuntu PPAs on Debian.

              (Though this does make it easier for you to bisect and file a bug report.)
              LOL. You are responding to a troll who tells casual users to run WIP kernels/Mesa because stable software is supposedly incomplete.

              Comment


              • #27
                Originally posted by xiando
                What did AMD mean by this?
                If you haven't already, submit a bug report and keep it updated.

                Comment


                • #28
                  I often experience crashes with the game EVERSPACE.

                  They happen at random times; it usually takes less than 1h30 for one to occur. I thought it was related to overclocking, but it seems this is the only game that crashes, and it takes the whole system down with it. A reset is required. Sometimes using the reset button still leaves anomalies, such as things in the OS that do not work properly afterwards. So when that game crashed my system, I used the power supply switch to "reset" instead.

                  I do not play that game anymore because of these crashes, which is unfortunate as it was one of my favorites that has a Linux version on Steam.

                  GPU: RX480 @ 1500ish MHz, 1.35 V (it's not the overclocking; it also happens with stock clocks/voltage)
                  CPU: Ryzen 7 2700X @ 4.3 GHz (it also happened with my R7 1700, overclocked or not, but less often than with the 2700X)
                  Kernel: usually the latest, release or git; it varies
                  Mesa: latest git, updated regularly

                  The GPU is watercooled with a custom loop and does not usually reach 65°C.
                  No other game takes my system down, at least for now.

                  Comment


                  • #29
                    Originally posted by Med_
                    This is good news. I consistently get hangs with games. Typically once every few hours. I do not bother reporting them as I cannot reproduce on demand and the bug tracker is full of them with similar logs ([drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=145101, last emitted seq=145103). I have activated the option, we will see whether that at least prevents the power button treatment.
                    Like you, I'm having the same issues and have to hard reboot, and it happens on the same card, an RX480. I have tested everything; the hangs are getting less frequent with kernel updates. The main problem, as you said, is that it cannot be reproduced: when it happens you hard reboot, and then the card is fine. I hope this option arrives soon; I would like to be able to reset the GPU without freezing the system.
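
Since these hangs cannot be reproduced on demand, one way to keep track of them is to count the ring timeout messages quoted above in the kernel log. The sketch below does only that. The function name count_ring_timeouts is hypothetical, and the pattern is based solely on the single log line quoted in this thread, so other timeout messages may look different.

```shell
#!/bin/bash
# Count amdgpu ring timeout lines in kernel log text read from stdin.
# count_ring_timeouts is a hypothetical helper name; the pattern is modeled
# on the one message quoted in the thread.
count_ring_timeouts() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # there are none, so "|| true" keeps the function's status at 0.
  grep -c 'amdgpu_job_timedout.*ring .* timeout' || true
}

# Typical use (assumptions: dmesg is readable by the current user, and
# journalctl only has the previous boot if the journal is persistent):
#   dmesg | count_ring_timeouts
#   journalctl -k -b -1 | count_ring_timeouts
```

Checking the previous boot's log is usually the more useful of the two, because after a hard reboot the current dmesg no longer contains the hang.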

                    Comment


                    • #30
                      I tried amdgpu.gpu_recovery=1 in the past, but for the system crashes I experience when using amdgpu.dc=1 it makes no difference whether amdgpu.gpu_recovery is set to 0 or 1 - the system crashes hard either way.

                      Comment
