AMD devs: *ERROR* ring gfx timeout

  • #31
    File bugs here: https://bugs.freedesktop.org


    • #32
      Originally posted by agd5f
      Certainly, once I rule out a hardware issue. Don't want to fill up your issue tracker with my broken stuff that isn't your software's problem.

      As far as this issue goes, I haven't had much time to mess with it this weekend. I added amdgpu.gpu_recovery=1 to my kernel parameters and the desktop has not locked up on me since. I'm still getting amdgpu errors in the kernel that are bad enough to set off ABRT, but the system keeps going. MCE is also flagging some hardware issues, but from what I understand mcelog doesn't work on Ryzen? Could be wrong there. journalctl just shows it complaining about not knowing about this CPU, although on a couple of boots ABRT was set off by a hardware issue:

      Code:
      mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor.  Please use the edac_mce_amd module >
      mcelog[1149]: CPU is unsupported
      I've got rasdaemon running now and it hasn't logged any errors. I checked my BIOS and noticed that my 2400 MHz RAM was detected at 2133 MHz, so I set it appropriately and no more hardware errors have been flagged on boot. But it's been the other way for months with no issues, so I'm betting that's a fluke. I plan on running memtest86+ just to be sure, but it passed when I built the machine in October.
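
For anyone wanting to confirm that the amdgpu.gpu_recovery=1 parameter mentioned above actually reached the running kernel, here is a minimal sketch. The function names are made up for illustration, and the parsing is deliberately naive: it splits on whitespace, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        # "amdgpu.gpu_recovery=1" -> ("amdgpu.gpu_recovery", "1")
        name, _, value = token.partition("=")
        params[name] = value
    return params


def gpu_recovery_setting(cmdline):
    """Return the value given for amdgpu.gpu_recovery, or None if absent."""
    return parse_cmdline(cmdline).get("amdgpu.gpu_recovery")
```

On a live system the input would be the contents of /proc/cmdline, e.g. gpu_recovery_setting(open("/proc/cmdline").read()). A result of None only means the option was not passed at boot; the driver's default still applies.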

      Right now this is the stuff setting off ABRT and filling up my logs. I'm a bit reluctant to file a bug report until I've ruled out hardware problems, though. Even with this, the machine has been stable since adjusting the RAM frequency to what it should have been and adding the GPU recovery option to grub.

      Code:
      [85353.172060] WARNING: CPU: 5 PID: 1330 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:254 generic_reg_wait+0xe7/0x160 [amdgpu]
      [85353.172062] Modules linked in: ipheth md4 sha512_ssse3 sha512_generic nls_utf8 cifs ccm dns_resolver fscache rfcomm cmac bnep sunrpc vfat fat raid1 arc4 edac_mce_amd kvm_amd kvm iwlmvm mxm_wmi wmi_bmof irqbypass mac80211 crct10dif_pclmul snd_hda_codec_realtek crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_intel snd_hda_codec iwlwifi snd_hda_core btusb btrtl snd_hwdep btbcm btintel snd_seq snd_seq_device bluetooth cfg80211 snd_pcm snd_timer snd ecdh_generic sp5100_tco rfkill ccp soundcore k10temp i2c_piix4 wmi gpio_amdpt gpio_generic pcc_cpufreq acpi_cpufreq amdkfd amd_iommu_v2 amdgpu hid_logitech_hidpp chash gpu_sched drm_kms_helper igb ttm nvme dca drm crc32c_intel serio_raw nvme_core atlantic i2c_algo_bit uas usb_storage hid_logitech_dj pinctrl_amd
      [85353.172105] CPU: 5 PID: 1330 Comm: Xorg Tainted: G        W         4.19.15-300.fc29.x86_64 #1
      [85353.172107] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
      [85353.172169] RIP: 0010:generic_reg_wait+0xe7/0x160 [amdgpu]
      [85353.172171] Code: 44 24 58 8b 54 24 48 89 de 44 89 4c 24 08 48 8b 4c 24 50 48 c7 c7 40 af 89 c0 e8 44 c4 cd ff 83 7d 18 01 44 8b 4c 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 0f
      [85353.172172] RSP: 0018:ffffa70690bc3878 EFLAGS: 00010297
      [85353.172174] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000000
      [85353.172175] RDX: 0000000000000000 RSI: ffff9acc7cf56868 RDI: ffff9acc7cf56868
      [85353.172176] RBP: ffff9acc55b2b080 R08: 0000000000000084 R09: 0000000000010200
      [85353.172176] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000bb9
      [85353.172177] R13: 0000000000004fa4 R14: 0000000000010000 R15: 0000000000000000
      [85353.172179] FS:  00007f2490fa6ac0(0000) GS:ffff9acc7cf40000(0000) knlGS:0000000000000000
      [85353.172180] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [85353.172181] CR2: 00007f259c002158 CR3: 0000000ff1328000 CR4: 00000000003406e0
      [85353.172182] Call Trace:
      [85353.172257]  dce110_stream_encoder_dp_blank+0x12c/0x1a0 [amdgpu]
      [85353.172322]  core_link_disable_stream+0x54/0x220 [amdgpu]
      [85353.172386]  dce110_reset_hw_ctx_wrap+0xc1/0x1e0 [amdgpu]
      [85353.172451]  dce110_apply_ctx_to_hw+0x45/0x650 [amdgpu]
      [85353.172520]  ? dm_pp_apply_display_requirements+0x191/0x1a0 [amdgpu]
      [85353.172583]  ? dce110_set_bandwidth+0x20b/0x230 [amdgpu]
      [85353.172646]  dc_commit_state+0x2dc/0x550 [amdgpu]
      [85353.172716]  amdgpu_dm_atomic_commit_tail+0x388/0xdb0 [amdgpu]
      [85353.172721]  ? __wake_up_common_lock+0x89/0xc0
      [85353.172725]  ? _cond_resched+0x15/0x30
      [85353.172727]  ? wait_for_completion_timeout+0x3a/0x190
      [85353.172729]  ? wait_for_completion_interruptible+0x35/0x1d0
      [85353.172738]  commit_tail+0x3d/0x70 [drm_kms_helper]
      [85353.172747]  drm_atomic_helper_commit+0x103/0x110 [drm_kms_helper]
      [85353.172764]  drm_mode_atomic_ioctl+0x81b/0x940 [drm]
      [85353.172768]  ? unix_stream_sendmsg+0x37f/0x3b0
      [85353.172785]  ? drm_atomic_set_property+0x690/0x690 [drm]
      [85353.172798]  drm_ioctl_kernel+0xa1/0xf0 [drm]
      [85353.172813]  drm_ioctl+0x206/0x3a0 [drm]
      [85353.172829]  ? drm_atomic_set_property+0x690/0x690 [drm]
      [85353.172831]  ? _cond_resched+0x15/0x30
      [85353.172881]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
      [85353.172886]  do_vfs_ioctl+0xa4/0x630
      [85353.172889]  ksys_ioctl+0x60/0x90
      [85353.172891]  ? ksys_read+0x9c/0xb0
      [85353.172893]  __x64_sys_ioctl+0x16/0x20
      [85353.172896]  do_syscall_64+0x5b/0x160
      [85353.172899]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [85353.172901] RIP: 0033:0x7f24914ce09b
      [85353.172904] Code: 0f 1e fa 48 8b 05 ed bd 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd bd 0c 00 f7 d8 64 89 01 48
      [85353.172905] RSP: 002b:00007ffc2d88c558 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
      [85353.172906] RAX: ffffffffffffffda RBX: 000055c443bcb0d0 RCX: 00007f24914ce09b
      [85353.172907] RDX: 00007ffc2d88c5a0 RSI: 00000000c03864bc RDI: 0000000000000016
      [85353.172908] RBP: 00007ffc2d88c5a0 R08: 000055c4439b2d90 R09: 000000000000000d
      [85353.172909] R10: 000000000000000d R11: 0000000000003246 R12: 00000000c03864bc
      [85353.172910] R13: 0000000000000016 R14: 0000000000000000 R15: 000055c4434712b0
      [85353.172912] ---[ end trace af3cf32b9038afa4 ]---
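
A trace like the one above is easier to compare across boots (or paste into a bug report) once it is boiled down to the warning site and the definite call-trace frames. The following is a rough sketch, not an official tool: the function name and regexes are my own, written against the line layout seen in this log, and frames the kernel prefixes with "?" are dropped because they are only stack guesses.

```python
import re

# Matches e.g. "WARNING: CPU: 5 PID: 1330 at path/file.c:254 func+0x.. [module]"
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at (?P<where>\S+) "
    r"(?P<func>\S+) \[(?P<module>\w+)\]"
)
# Matches a call-trace frame after the timestamp, e.g. "]  sym+0x12c/0x1a0"
FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_warnings(log_text):
    """Return one dict per WARNING with its location and reliable frames."""
    warnings = []
    in_trace = False
    for line in log_text.splitlines():
        hit = WARNING_RE.search(line)
        if hit:
            warnings.append({**hit.groupdict(), "frames": []})
            in_trace = False
            continue
        if not warnings:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame is None:
                in_trace = False  # first non-frame line ends the trace
            elif not frame.group("guess"):
                warnings[-1]["frames"].append(frame.group("sym"))
    return warnings
```

Fed the output of dmesg or journalctl -k, this should yield one entry per warning, which makes it quick to see whether every occurrence goes through the same display-commit path.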



      • #33
        Originally posted by debianxfce:
        "You are using buggy Fedora with old drivers. The mainline kernel is at version 5.0-rc3. Use Debian distributions with the Xfce desktop and Oibaf PPA Mesa."

        You are literally suggesting two things warned against in "Don't Break Debian."

        If you want a stable system, those are both big "do nots."


        • #34
          Had an rx64 for a year and a half, had one ringbus lockup on kernel 4.14 iirc. Not had an issue since. I've used mine for all sorts too, not just gaming.

          I do run the AMD staging kernel on Antergos though.


          • #35
            Originally posted by debianxfce:
            "The AMD staging kernel is totally different from mainline kernels because it receives some of the latest AMD patches. It uses the 4.20-rc3 kernel now, so it misses a lot of other new kernel features. The AMD wip kernel receives a lot more AMD patches and it uses the buggy 5.0-rc1 kernel now."

            Why are you telling me something I already know?

            I've used both mainline and staging with no issues, so that point is moot anyway.

            I was pointing out the fact that I don't get those issues, bar the one mentioned.


            • #36
              I still have the same issue on Vega 56. The problem is most likely related to mclk level 0 switching on Linux, as the GPU is perfectly stable on Windows.

              Vega at least is affected by the issue, but it looks like at least some Navi and Polaris cards are affected by it too. I really hope someone can do something about it, as random crashes are annoying af.
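
To check whether a card really is sitting at mclk level 0 when things go wrong, the amdgpu sysfs file pp_dpm_mclk can be read and parsed. This is only a sketch for observing the state, not a fix. It assumes the usual "index: clock" layout with "*" marking the active level, and the helper names are hypothetical.

```python
import re

# One level per line, e.g. "0: 167Mhz *" (the "*" marks the active level)
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Parse pp_dpm_mclk-style text into a list of level dicts."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match:
            levels.append({
                "level": int(match.group(1)),
                "mhz": int(match.group(2)),
                "active": match.group(3) is not None,
            })
    return levels


def active_level(text):
    """Return the index of the active memory clock level, or None."""
    for entry in parse_dpm_levels(text):
        if entry["active"]:
            return entry["level"]
    return None
```

The text would typically come from /sys/class/drm/card0/device/pp_dpm_mclk, assuming the GPU is card0 on your system. Logging active_level() periodically alongside the kernel log could help show whether the hangs line up with drops to level 0.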
