Ryzen freeze or AMDGPU+GCN 1.1 bug?

  • Ryzen freeze or AMDGPU+GCN 1.1 bug?

    I've been suffering from hard freezes since I upgraded my system to a Ryzen 1800X. Most of the time the system didn't even respond to the REISUB emergency key sequence. However, I managed to get the kernel logs, and from the stack trace it actually looks like an AMDGPU bug.

    Symptoms:
    The system hard-freezes when I leave it unattended for more than half an hour. If I come back after an hour it doesn't even respond to REISUB, but REISUB still works if I come back soon enough.
    I thought it was because the system was going idle, but I left it with "stress -c6" running in the background and it froze anyway.

    My config:
    I tried Ubuntu 17.10 through 18.04, and every stable kernel version from the official one that ships with 17.10 up to the latest, 4.16.1 (always installed from the Ubuntu kernel PPA).
    CPU: Ryzen 1800X - 32GB RAM
    Motherboard: Gigabyte AX370 Gaming K7
    GPU: AMD 390X 8GB
    I didn't get any freezes before the upgrade, and I didn't reinstall the OS.

    I thought it was one of these bugs, but my stack trace doesn't look anything like theirs:
    https://bugzilla.kernel.org/show_bug.cgi?id=196683
    https://bugzilla.redhat.com/show_bug.cgi?id=1562530
    https://bugs.launchpad.net/ubuntu/+s...x/+bug/1690085

    Stack trace:
    Code:
    Apr  9 09:43:11 net kernel: [ 3902.972311] ------------[ cut here ]------------
    Apr  9 09:43:11 net kernel: [ 3902.972313] kernel BUG at /home/kernel/COD/linux/mm/slub.c:296!
    Apr  9 09:43:11 net kernel: [ 3902.972320] invalid opcode: 0000 [#1] SMP NOPTI
    Apr  9 09:43:11 net kernel: [ 3902.972321] Modules linked in: pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) edac_mce_amd kvm_amd snd_hda_codec_realtek kvm snd_hda_codec_generic irqbypass snd_hda_codec_hdmi crct10dif_pclmul snd_hda_intel crc32_pclmul snd_seq_midi ghash_clmulni_intel snd_seq_midi_event pcbc snd_rawmidi snd_hda_codec aesni_intel snd_seq snd_hda_core snd_hwdep snd_seq_device aes_x86_64 snd_pcm crypto_simd glue_helper joydev input_leds cryptd wmi_bmof k10temp snd_timer ccp i2c_piix4 snd soundcore shpchp mac_hid binfmt_misc parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid uas usb_storage amdkfd amd_iommu_v2 amdgpu chash gpu_sched radeon ttm igb drm_kms_helper syscopyarea dca sysfillrect sysimgblt ptp fb_sys_fops mxm_wmi alx pps_core drm i2c_algo_bit mdio ahci libahci wmi
    Apr  9 09:43:11 net kernel: [ 3902.972355]  gpio_amdpt gpio_generic
    Apr  9 09:43:11 net kernel: [ 3902.972358] CPU: 6 PID: 1361 Comm: Xorg Tainted: G           OE    4.16.1-041601-generic #201804081334
    Apr  9 09:43:11 net kernel: [ 3902.972359] Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming K7/AX370-Gaming K7, BIOS F22 03/15/2018
    Apr  9 09:43:11 net kernel: [ 3902.972363] RIP: 0010:__slab_free+0x17a/0x2c0
    Apr  9 09:43:11 net kernel: [ 3902.972365] RSP: 0018:ffffb3fd8927b980 EFLAGS: 00010246
    Apr  9 09:43:11 net kernel: [ 3902.972366] RAX: ffff9b89929ac800 RBX: ffff9b89929ac800 RCX: 0000000180200017
    Apr  9 09:43:11 net kernel: [ 3902.972367] RDX: ffff9b89929ac800 RSI: ffffd88a204a6a00 RDI: ffff9b899e806e80
    Apr  9 09:43:11 net kernel: [ 3902.972368] RBP: ffffb3fd8927ba20 R08: 0000000000000001 R09: ffffffffc07468e4
    Apr  9 09:43:11 net kernel: [ 3902.972369] R10: ffffb3fd8927ba40 R11: ffff9b899413e000 R12: ffff9b899e806e80
    Apr  9 09:43:11 net kernel: [ 3902.972370] R13: ffffd88a204a6a00 R14: ffff9b89929ac800 R15: ffff9b899413f800
    Apr  9 09:43:11 net kernel: [ 3902.972372] FS:  00007f7b14886500(0000) GS:ffff9b899ed80000(0000) knlGS:0000000000000000
    Apr  9 09:43:11 net kernel: [ 3902.972373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr  9 09:43:11 net kernel: [ 3902.972374] CR2: 00007fad30a30000 CR3: 00000008181d6000 CR4: 00000000003406e0
    Apr  9 09:43:11 net kernel: [ 3902.972375] Call Trace:
    Apr  9 09:43:11 net kernel: [ 3902.972407]  ? dc_sink_free+0x34/0x40 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972409]  kfree+0x166/0x180
    Apr  9 09:43:11 net kernel: [ 3902.972411]  ? kfree+0x166/0x180
    Apr  9 09:43:11 net kernel: [ 3902.972438]  dc_sink_free+0x34/0x40 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972464]  dc_sink_release+0x24/0x30 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972490]  dc_stream_free+0x22/0x50 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972515]  dc_stream_release+0x2c/0x30 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972544]  dm_update_crtcs_state+0x126/0x370 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972571]  amdgpu_dm_atomic_check+0x2ad/0x4d0 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972580]  drm_atomic_check_only+0x389/0x550 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972588]  drm_atomic_commit+0x18/0x60 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972596]  drm_atomic_connector_commit_dpms+0xef/0x100 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972603]  drm_mode_obj_set_property_ioctl+0x176/0x280 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972611]  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972618]  drm_mode_connector_property_set_ioctl+0x3f/0x60 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972625]  drm_ioctl_kernel+0x5f/0xb0 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972631]  drm_ioctl+0x31b/0x3d0 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972638]  ? drm_mode_connector_set_obj_prop+0x80/0x80 [drm]
    Apr  9 09:43:11 net kernel: [ 3902.972640]  ? __check_object_size+0xac/0x1a0
    Apr  9 09:43:11 net kernel: [ 3902.972658]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
    Apr  9 09:43:11 net kernel: [ 3902.972661]  do_vfs_ioctl+0xa8/0x620
    Apr  9 09:43:11 net kernel: [ 3902.972663]  ? vfs_read+0x115/0x130
    Apr  9 09:43:11 net kernel: [ 3902.972665]  SyS_ioctl+0x79/0x90
    Apr  9 09:43:11 net kernel: [ 3902.972668]  do_syscall_64+0x73/0x130
    Apr  9 09:43:11 net kernel: [ 3902.972670]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    Apr  9 09:43:11 net kernel: [ 3902.972672] RIP: 0033:0x7f7b11cdfef7
    Apr  9 09:43:11 net kernel: [ 3902.972673] RSP: 002b:00007ffc354544b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    Apr  9 09:43:11 net kernel: [ 3902.972675] RAX: ffffffffffffffda RBX: 000055800f2d87e0 RCX: 00007f7b11cdfef7
    Apr  9 09:43:11 net kernel: [ 3902.972676] RDX: 00007ffc354544f0 RSI: 00000000c01064ab RDI: 000000000000000d
    Apr  9 09:43:11 net kernel: [ 3902.972677] RBP: 00007ffc354544f0 R08: 0000000000000001 R09: 0000000000000000
    Apr  9 09:43:11 net kernel: [ 3902.972678] R10: 00007f7b11d64280 R11: 0000000000000246 R12: 00000000c01064ab
    Apr  9 09:43:11 net kernel: [ 3902.972679] R13: 000000000000000d R14: 000055800f2d81a0 R15: 000055800dde6e01
    Apr  9 09:43:11 net kernel: [ 3902.972680] Code: 0f 84 ee fe ff ff 44 0f b6 7d 8b 80 7d ab 00 79 05 45 84 ff 74 61 48 83 c4 70 5b 41 5a 41 5c 41 5d 41 5e 41 5f 5d 49 8d 62 f8 c3 <0f> 0b 4c 89 d0 4c 89 d7 45 89 fa 48 85 c0 44 0f b6 7d 8b 74 cb
    Apr  9 09:43:11 net kernel: [ 3902.972701] RIP: __slab_free+0x17a/0x2c0 RSP: ffffb3fd8927b980
    Apr  9 09:43:11 net kernel: [ 3902.972703] ---[ end trace f87e1d03970b7d09 ]---

    Stack trace 2, one day later:
    Code:
    Apr 10 08:30:11 net kernel: [ 1877.879005] ------------[ cut here ]------------
    Apr 10 08:30:11 net kernel: [ 1877.879007] kernel BUG at /home/kernel/COD/linux/mm/slub.c:296!
    Apr 10 08:30:11 net kernel: [ 1877.879016] invalid opcode: 0000 [#1] SMP NOPTI
    Apr 10 08:30:11 net kernel: [ 1877.879018] Modules linked in: snd_opl3_synth snd_seq_midi_emul snd_cmipci snd_mpu401_uart snd_opl3_lib snd_hwdep gameport snd_pcm edac_mce_amd snd_seq_midi snd_seq_midi_event kvm_amd snd_rawmidi kvm irqbypass crct10dif_pclmul snd_seq crc32_pclmul ghash_clmulni_intel pcbc snd_seq_device snd_timer aesni_intel aes_x86_64 crypto_simd glue_helper input_leds joydev cryptd snd wmi_bmof soundcore k10temp ccp mac_hid shpchp binfmt_misc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu chash gpu_sched radeon ttm drm_kms_helper syscopyarea igb sysfillrect sysimgblt fb_sys_fops dca ptp mxm_wmi drm i2c_piix4 alx pps_core ahci i2c_algo_bit mdio libahci gpio_amdpt gpio_generic wmi
    Apr 10 08:30:11 net kernel: [ 1877.879061] CPU: 14 PID: 1140 Comm: Xorg Not tainted 4.16.1-041601-generic #201804081334
    Apr 10 08:30:11 net kernel: [ 1877.879063] Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming K7/AX370-Gaming K7, BIOS F22 03/15/2018
    Apr 10 08:30:11 net kernel: [ 1877.879068] RIP: 0010:kfree+0x16b/0x180
    Apr 10 08:30:11 net kernel: [ 1877.879070] RSP: 0018:ffffbc7dc85abab0 EFLAGS: 00010246
    Apr 10 08:30:11 net kernel: [ 1877.879072] RAX: ffff9e6b3331f000 RBX: ffff9e6b3331f000 RCX: ffff9e6b3331f000
    Apr 10 08:30:11 net kernel: [ 1877.879074] RDX: 000000000002260d RSI: ffff9e6b5efa7160 RDI: ffff9e6b5e806e80
    Apr 10 08:30:11 net kernel: [ 1877.879075] RBP: ffffbc7dc85abac8 R08: 00000000000024ad R09: ffffffffc06b08e4
    Apr 10 08:30:11 net kernel: [ 1877.879077] R10: ffffedfa9fccc600 R11: 0000000000000000 R12: ffff9e6b3331f000
    Apr 10 08:30:11 net kernel: [ 1877.879078] R13: ffffffffc06b08e4 R14: ffff9e6b2e3d0000 R15: ffff9e6b3029d900
    Apr 10 08:30:11 net kernel: [ 1877.879081] FS:  00007f53e37f3580(0000) GS:ffff9e6b5ef80000(0000) knlGS:0000000000000000
    Apr 10 08:30:11 net kernel: [ 1877.879082] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr 10 08:30:11 net kernel: [ 1877.879084] CR2: 000055acfea32210 CR3: 00000007efb8e000 CR4: 00000000003406e0
    Apr 10 08:30:11 net kernel: [ 1877.879085] Call Trace:
    Apr 10 08:30:11 net kernel: [ 1877.879132]  dc_sink_free+0x34/0x40 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1877.879173]  dc_sink_release+0x24/0x30 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1877.879212]  dc_stream_free+0x22/0x50 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1877.879251]  dc_stream_release+0x2c/0x30 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1877.879294]  amdgpu_dm_connector_mode_valid+0xd1/0x240 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1877.879307]  ? drm_mode_connector_list_update+0xec/0x180 [drm]
    Apr 10 08:30:11 net kernel: [ 1877.879314]  drm_helper_probe_single_connector_modes+0x418/0x710 [drm_kms_helper]
    Apr 10 08:30:11 net kernel: [ 1877.879326]  drm_mode_getconnector+0x15d/0x340 [drm]
    Apr 10 08:30:11 net kernel: [ 1877.879330]  ? netlink_recvmsg+0x244/0x420
    Apr 10 08:30:11 net kernel: [ 1877.879341]  ? drm_mode_connector_property_set_ioctl+0x60/0x60 [drm]
    Apr 10 08:30:11 net kernel: [ 1877.879343] [drm] SADs count is: -2, don't need to read it
    Apr 10 08:30:11 net kernel: [ 1877.879352]  drm_ioctl_kernel+0x5f/0xb0 [drm]
    Apr 10 08:30:11 net kernel: [ 1877.879362]  drm_ioctl+0x31b/0x3d0 [drm]
    Apr 10 08:30:11 net kernel: [ 1877.879372]  ? drm_mode_connector_property_set_ioctl+0x60/0x60 [drm]
    Apr 10 08:30:11 net kernel: [ 1877.879400]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1877.879404]  do_vfs_ioctl+0xa8/0x620
    Apr 10 08:30:11 net kernel: [ 1877.879408]  ? handle_mm_fault+0xe3/0x220
    Apr 10 08:30:11 net kernel: [ 1877.879411]  ? __do_page_fault+0x270/0x4d0
    Apr 10 08:30:11 net kernel: [ 1877.879414]  SyS_ioctl+0x79/0x90
    Apr 10 08:30:11 net kernel: [ 1877.879417]  do_syscall_64+0x73/0x130
    Apr 10 08:30:11 net kernel: [ 1877.879421]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    Apr 10 08:30:11 net kernel: [ 1877.879424] RIP: 0033:0x7f53e0bf45d7
    Apr 10 08:30:11 net kernel: [ 1877.879425] RSP: 002b:00007fff82f5b1a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    Apr 10 08:30:11 net kernel: [ 1877.879427] RAX: ffffffffffffffda RBX: 000055acfe8bdf10 RCX: 00007f53e0bf45d7
    Apr 10 08:30:11 net kernel: [ 1877.879429] RDX: 00007fff82f5b1e0 RSI: 00000000c05064a7 RDI: 000000000000000d
    Apr 10 08:30:11 net kernel: [ 1877.879430] RBP: 00007fff82f5b1e0 R08: 0000000000000008 R09: 0000000000000008
    Apr 10 08:30:11 net kernel: [ 1877.879432] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000c05064a7
    Apr 10 08:30:11 net kernel: [ 1877.879433] R13: 000000000000000d R14: 000000000000000d R15: 00007fff82f5b1e0
    Apr 10 08:30:11 net kernel: [ 1877.879435] Code: 80 74 05 41 0f b6 72 69 4c 89 d7 e8 00 1e f9 ff eb 85 41 b8 01 00 00 00 48 89 d9 48 89 da 4c 89 d6 e8 9a f6 ff ff e9 6c ff ff ff <0f> 0b 48 8b 3d 7c 97 1c 01 e9 c8 fe ff ff 0f 1f 80 00 00 00 00
    Apr 10 08:30:11 net kernel: [ 1877.879466] RIP: kfree+0x16b/0x180 RSP: ffffbc7dc85abab0
    Apr 10 08:30:11 net kernel: [ 1877.879468] ---[ end trace 480d2dfde7a7e9da ]---
    Apr 10 08:30:11 net kernel: [ 1878.326140] BUG: unable to handle kernel paging request at ffffbc7dc85abbe0
    Apr 10 08:30:11 net kernel: [ 1878.326151] IP: __ww_mutex_lock.isra.3+0x283/0x670
    Apr 10 08:30:11 net kernel: [ 1878.326153] PGD 81e87e067 P4D 81e87e067 PUD 81e87f067 PMD 7ee091067 PTE 0
    Apr 10 08:30:11 net kernel: [ 1878.326159] Oops: 0000 [#2] SMP NOPTI
    Apr 10 08:30:11 net kernel: [ 1878.326162] Modules linked in: snd_opl3_synth snd_seq_midi_emul snd_cmipci snd_mpu401_uart snd_opl3_lib snd_hwdep gameport snd_pcm edac_mce_amd snd_seq_midi snd_seq_midi_event kvm_amd snd_rawmidi kvm irqbypass crct10dif_pclmul snd_seq crc32_pclmul ghash_clmulni_intel pcbc snd_seq_device snd_timer aesni_intel aes_x86_64 crypto_simd glue_helper input_leds joydev cryptd snd wmi_bmof soundcore k10temp ccp mac_hid shpchp binfmt_misc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu chash gpu_sched radeon ttm drm_kms_helper syscopyarea igb sysfillrect sysimgblt fb_sys_fops dca ptp mxm_wmi drm i2c_piix4 alx pps_core ahci i2c_algo_bit mdio libahci gpio_amdpt gpio_generic wmi
    Apr 10 08:30:11 net kernel: [ 1878.326209] CPU: 12 PID: 1318 Comm: InputThread Tainted: G      D          4.16.1-041601-generic #201804081334
    Apr 10 08:30:11 net kernel: [ 1878.326211] Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming K7/AX370-Gaming K7, BIOS F22 03/15/2018
    Apr 10 08:30:11 net kernel: [ 1878.326214] RIP: 0010:__ww_mutex_lock.isra.3+0x283/0x670
    Apr 10 08:30:11 net kernel: [ 1878.326216] RSP: 0018:ffffbc7dc90ab870 EFLAGS: 00010286
    Apr 10 08:30:11 net kernel: [ 1878.326218] RAX: ffffbc7dc85abbd8 RBX: ffff9e6b3b116a20 RCX: 000000000001c1be
    Apr 10 08:30:11 net kernel: [ 1878.326220] RDX: ffff9e6b347a1701 RSI: ffff9e6b3601dc00 RDI: ffffbc7dc90ab890
    Apr 10 08:30:11 net kernel: [ 1878.326222] RBP: ffffbc7dc90ab8f0 R08: 0000000000000000 R09: ffff9e6aafdf0100
    Apr 10 08:30:11 net kernel: [ 1878.326223] R10: ffffbc7dc90ab908 R11: 0000000000000000 R12: 0000000000000001
    Apr 10 08:30:11 net kernel: [ 1878.326225] R13: ffff9e6b3b116a18 R14: ffffbc7dc90abc48 R15: 0000000000000000
    Apr 10 08:30:11 net kernel: [ 1878.326228] FS:  00007f53b3fff700(0000) GS:ffff9e6b5ef00000(0000) knlGS:0000000000000000
    Apr 10 08:30:11 net kernel: [ 1878.326230] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr 10 08:30:11 net kernel: [ 1878.326231] CR2: ffffbc7dc85abbe0 CR3: 00000007efb8e000 CR4: 00000000003406e0
    Apr 10 08:30:11 net kernel: [ 1878.326233] Call Trace:
    Apr 10 08:30:11 net kernel: [ 1878.326238]  __ww_mutex_lock_interruptible_slowpath+0x16/0x20
    Apr 10 08:30:11 net kernel: [ 1878.326241]  ? __ww_mutex_lock_interruptible_slowpath+0x16/0x20
    Apr 10 08:30:11 net kernel: [ 1878.326244]  ww_mutex_lock_interruptible+0x5a/0x70
    Apr 10 08:30:11 net kernel: [ 1878.326261]  drm_modeset_lock+0x9a/0xb0 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326274]  drm_modeset_lock_all_ctx+0x24/0xb0 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326327]  amdgpu_dm_atomic_check+0x39f/0x4d0 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1878.326340]  drm_atomic_check_only+0x389/0x550 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326352]  drm_atomic_commit+0x18/0x60 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326361]  drm_atomic_helper_update_plane+0xe9/0x100 [drm_kms_helper]
    Apr 10 08:30:11 net kernel: [ 1878.326374]  __setplane_internal+0x1e5/0x270 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326386]  drm_mode_cursor_universal+0xfd/0x210 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326399]  drm_mode_cursor_common+0x187/0x200 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326411]  ? drm_mode_setplane+0x240/0x240 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326422]  drm_mode_cursor_ioctl+0x4a/0x60 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326432]  drm_ioctl_kernel+0x5f/0xb0 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326443]  drm_ioctl+0x31b/0x3d0 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326454]  ? drm_mode_setplane+0x240/0x240 [drm]
    Apr 10 08:30:11 net kernel: [ 1878.326485]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
    Apr 10 08:30:11 net kernel: [ 1878.326490]  do_vfs_ioctl+0xa8/0x620
    Apr 10 08:30:11 net kernel: [ 1878.326493]  ? vfs_read+0x8e/0x130
    Apr 10 08:30:11 net kernel: [ 1878.326496]  SyS_ioctl+0x79/0x90
    Apr 10 08:30:11 net kernel: [ 1878.326500]  do_syscall_64+0x73/0x130
    Apr 10 08:30:11 net kernel: [ 1878.326504]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    Apr 10 08:30:11 net kernel: [ 1878.326506] RIP: 0033:0x7f53e0bf45d7
    Apr 10 08:30:11 net kernel: [ 1878.326508] RSP: 002b:00007f53b3ffd318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    Apr 10 08:30:11 net kernel: [ 1878.326510] RAX: ffffffffffffffda RBX: 000055acfea35170 RCX: 00007f53e0bf45d7
    Apr 10 08:30:11 net kernel: [ 1878.326512] RDX: 00007f53b3ffd350 RSI: 00000000c01c64a3 RDI: 000000000000000d
    Apr 10 08:30:11 net kernel: [ 1878.326514] RBP: 00007f53b3ffd350 R08: 000055acfea35d30 R09: 0000000000000780
    Apr 10 08:30:11 net kernel: [ 1878.326515] R10: 000055acfea6ce20 R11: 0000000000000246 R12: 00000000c01c64a3
    Apr 10 08:30:11 net kernel: [ 1878.326517] R13: 000000000000000d R14: 00000000000009ef R15: 0000000000000001
    Apr 10 08:30:11 net kernel: [ 1878.326519] Code: cd 03 00 00 45 84 ff 0f 85 43 02 00 00 48 89 df e8 43 34 00 00 e9 39 ff ff ff 49 8b 45 20 48 85 c0 0f 84 38 03 00 00 49 8b 4e 08 <48> 8b 50 08 48 39 d1 0f 88 27 03 00 00 48 39 d1 75 09 49 39 c6
    Apr 10 08:30:11 net kernel: [ 1878.326554] RIP: __ww_mutex_lock.isra.3+0x283/0x670 RSP: ffffbc7dc90ab870
    Apr 10 08:30:11 net kernel: [ 1878.326555] CR2: ffffbc7dc85abbe0
    Apr 10 08:30:11 net kernel: [ 1878.326558] ---[ end trace 480d2dfde7a7e9db ]---
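
One way to see why these traces point at AMDGPU is to count which module each call-trace frame belongs to. The following is a minimal Python sketch, not part of the original report; the function name `frames_by_module` and the regular expression are illustrative assumptions about the log format shown above.

```python
import re
from collections import Counter

# Matches call-trace frames such as
#   "[ 3902.972438]  dc_sink_free+0x34/0x40 [amdgpu]"
# Frames without a trailing "[module]" belong to the core kernel.
FRAME_RE = re.compile(
    r"\]\s+(\?\s)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def frames_by_module(log_text):
    """Count reliable call-trace frames per kernel module.

    Frames prefixed with "?" are skipped: the kernel marks them as
    unreliable (they may be stale values left on the stack).
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # register dumps, "Call Trace:", other log lines
        unreliable, _symbol, module = match.groups()
        if unreliable:
            continue
        counts[module or "kernel"] += 1
    return counts

# Usage (hypothetical file name holding one of the traces above):
#     with open("trace.log") as f:
#         print(frames_by_module(f.read()).most_common())
```

In both traces above, the frame that calls kfree is dc_sink_free in [amdgpu], which is why the crash looks like a driver problem rather than a generic Ryzen lockup.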

    AMDGPU vs Radeon:
    For now I have blacklisted amdgpu and reverted my kernel parameters to use radeon instead. I also updated the initramfs to make sure amdgpu is not loaded. lsmod confirms that radeon is now in use and that amdgpu was correctly blacklisted. I will test this new configuration today to see if I can finally get a stable system, and I'll update this thread if it works.
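
The lsmod check described above can also be scripted. This is a minimal sketch, assuming the standard /proc/modules format (module name in the first column), which is the file lsmod reads; the function names are hypothetical.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def gpu_driver_status(proc_modules_text):
    """Report whether the amdgpu and radeon modules are loaded."""
    mods = loaded_modules(proc_modules_text)
    return {name: name in mods for name in ("amdgpu", "radeon")}

# Usage on a live Linux system:
#     with open("/proc/modules") as f:
#         print(gpu_driver_status(f.read()))
```

After a successful blacklist, the expectation is that amdgpu is reported as not loaded and radeon as loaded.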


    The question is: should I file this as an AMDGPU bug or as a Ryzen bug? And where?


    ############
    Update: I left the PC for a few hours and it didn't freeze, and I no longer get kernel panics. I'll test it a little more, but it looks like the problem is related to AMDGPU after all, so I'll report the bug later.
    ############
    Last edited by dc740; 04-10-2018, 08:55 PM.

  • #2
    To prevent random kernel lockups with Ryzen, enable RCU_NOCB_CPU and boot the kernel with the rcu_nocbs=0-X command-line parameter, where X is the CPU thread count minus 1.

    Or disable C6 in the BIOS.
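
As a worked example of the "X is the thread count minus 1" rule, the sketch below builds the parameter string. The helper name `rcu_nocbs_param` is an assumption made for illustration; `os.cpu_count()` is used as the number of logical CPUs (threads) visible to the OS.

```python
import os


def rcu_nocbs_param(thread_count):
    """Build the rcu_nocbs=0-X parameter, where X is thread_count - 1."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    return f"rcu_nocbs=0-{thread_count - 1}"


if __name__ == "__main__":
    # os.cpu_count() may return None if the count cannot be determined.
    print(rcu_nocbs_param(os.cpu_count() or 1))
```

For a 16-thread CPU such as the Ryzen 1800X, `rcu_nocbs_param(16)` gives `rcu_nocbs=0-15`.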


    • #3
      Originally posted by debianxfce
      To prevent random kernel lockups with Ryzen, enable RCU_NOCB_CPU and boot the kernel with the rcu_nocbs=0-X command-line parameter, where X is the CPU thread count minus 1.

      Or disable C6 in the BIOS.
      I actually didn't try this suggestion when I first saw it because the trace looked so different, but I don't really like using the radeon module, so I reverted to AMDGPU again.

      In summary:
      rcu_nocbs=0-15 seems to be working OK (I only tested this for a couple of hours). I'll update this thread in a couple of days in case I get new freezes. I prefer to test the kernel parameter with AMDGPU rather than my other alternative, which was sticking with the radeon module and blacklisting amdgpu.

      Thank you.

      #################
      UPDATE:
      I added the kernel parameter, but the system hangs anyway. The logs keep pointing to AMDGPU. I'm (sadly) going to go back to using "radeon" and blacklisting "amdgpu", which seems to be my only hope (it's the one thing I haven't tested for long, but at least it didn't freeze).
      Last edited by dc740; 04-10-2018, 06:32 PM.


      • #4
        Originally posted by dc740

        I added the kernel parameter, but the system hangs anyway.

        The kernel command-line parameter has no effect unless RCU_NOCB_CPU is enabled in the kernel configuration.
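
Whether the running kernel was built with this option can be checked before relying on rcu_nocbs. A minimal sketch, assuming the distribution installs the build configuration as /boot/config-<release> (Ubuntu and Debian do; other systems may expose /proc/config.gz instead); the function name is hypothetical.

```python
import platform


def config_option_enabled(config_text, option):
    """Return True if a kernel config listing sets `option` to y."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def default_config_path():
    """Assumed location of the running kernel's build config."""
    return f"/boot/config-{platform.release()}"

# Usage on a live system:
#     with open(default_config_path()) as f:
#         print(config_option_enabled(f.read(), "CONFIG_RCU_NOCB_CPU"))
```

A disabled option appears in the config file as a comment of the form "# CONFIG_RCU_NOCB_CPU is not set", which this check reports as False.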


        • #5
          Originally posted by debianxfce

          The kernel command-line parameter has no effect unless RCU_NOCB_CPU is enabled in the kernel configuration.
          Oh. Yeah, I see that many people are doing that now: https://blog.programster.org/ubuntu-...rnel-for-ryzen
          They also disable features in the BIOS.
          I'll stick to blacklisting amdgpu until they get that fixed on the mainstream kernels. I definitely don't have the time to compile a custom kernel anymore, and the blacklist workaround seems to be working perfectly so far. I only tried a couple of hours of idle and it seems to be stable, or at least it doesn't hang after 20 minutes like it used to, so I can live with the radeon driver until someone else fixes it.

          Thanks anyway.


          • #6
            Originally posted by dc740

            I definitely don't have the time to compile a custom kernel anymore.
            With Ryzen, the kernel compiles in under a minute. You can use my distribution, which has a custom kernel with features for Ryzen and AMD graphics, or make your own custom kernel (see the second video). Stock distribution mainline kernels are slow and buggy, as you can see.

            1. https://www.youtube.com/watch?v=fKJ-IatUfis

            2. https://www.youtube.com/watch?v=G3AxgH2bbsE


            • #7
              Hi folks

              I'm planning to buy an AMD 2700X to use as a continuous integration server running Debian Testing (basically to run the tests of a big app written in Ruby), so it won't have a desktop environment. Do you think this bug could affect me too? Should I disable C6?

              Thanks!


              • #8
                Originally posted by 007lva
                Hi folks

                I'm planning to buy an AMD 2700X to use as a continuous integration server running Debian Testing (basically to run the tests of a big app written in Ruby), so it won't have a desktop environment. Do you think this bug could affect me too? Should I disable C6?

                Thanks!
                You will see. It is not a big job to disable C6 in the BIOS or to configure the kernel.


                • #9
                  Originally posted by 007lva
                  Hi folks

                  I'm planning to buy an AMD 2700X to use as a continuous integration server running Debian Testing (basically to run the tests of a big app written in Ruby), so it won't have a desktop environment. Do you think this bug could affect me too? Should I disable C6?

                  Thanks!
                  Yes, it may affect you. I haven't recompiled the kernel yet, but the problem seems to be quite random. The bug reports show that some people still have instability even with C6 disabled. As a developer myself, I have to admit it's hard to tell whether it's related to Ryzen or to something else in each user's case.

                  Many people report success with one method, but that method may not work for other users with a different motherboard or other hardware. Adding the kernel parameter worked for debianxfce, but there are cases where the kernel parameters didn't help (I have to assume those users didn't know they had to recompile the kernel to enable them).

                  For me, I "fixed" it by disabling AMDGPU, and now the system is rock solid. I've been using it intensively for the last 5 days without any problem, and I used to have at least one freeze a day. For future reference, I have the latest BIOS installed (F22) and iommu=soft in the kernel parameters (because the BIOS implementation is buggy and it didn't like my sound card using the hardware IOMMU).
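
Since several fixes in this thread depend on kernel parameters, it can help to confirm what the running kernel was actually booted with. A minimal sketch, assuming the parameters are read from /proc/cmdline; the function name `parse_cmdline` is hypothetical.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} mapping.

    Flags without "=" (such as "ro" or "quiet") map to None. Only the
    first "=" separates name and value, so "root=UUID=abc" keeps
    "UUID=abc" as the value.
    """
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

# Usage on a live Linux system:
#     with open("/proc/cmdline") as f:
#         params = parse_cmdline(f.read())
#     print(params.get("iommu"), params.get("rcu_nocbs"))
```

A missing key means the parameter never reached the kernel, for example because the bootloader configuration was not regenerated after editing it.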
                  Last edited by dc740; 04-16-2018, 09:01 AM.


                  • #10
                    Originally posted by dc740
                    I have the latest BIOS installed (F22) and iommu=soft in the kernel parameters (because the BIOS implementation is buggy and it didn't like my sound card using the hardware IOMMU).
                    Buy Asus mobos only. I returned the Gigabyte B350 Gaming 3 mobo after I updated the "fail safe" BIOS and the board did not boot anymore. The Asus Prime B350 series is good.
