amdgpu driver problems on new AMD Ryzen 4750U on new kernels (5.8.X, 5.9.X)

    Dear all,

    My new ThinkPad T14 AMD Gen 1 has some problems with the amdgpu driver. In particular, the following kernel error messages sometimes appear in the logs:

    Code:
    2020-12-07T18:34:22.81574 kern.err: [ 10.030223] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.1 (-110).
    2020-12-07T18:34:23.84077 kern.err: [ 11.055240] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.1.1 (-110).
    2020-12-07T18:34:24.86475 kern.err: [ 12.078264] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.2.1 (-110).
    2020-12-07T18:34:25.88978 kern.err: [ 13.102275] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.3.1 (-110).
    2020-12-07T18:34:25.90081 kern.err: [ 13.113430] [drm:process_one_work] *ERROR* ib ring test failed (-110).
    When this happens the machine boots and X seems to work, but the machine eventually crashes under high load (typically with I/O tasks). Normally it produces a fault in the kernel (a general protection fault in the trace below). Strangely, the trace is always in some modules related to zfs:

    Code:
    2020-12-11T11:18:49.53924 kern.warn: [10080.745031] general protection fault, probably for non-canonical address 0xa00800000000: 0000 [#1] SMP NOPTI
    2020-12-11T11:18:49.54070 kern.warn: [10080.745035] CPU: 2 PID: 32597 Comm: gsqz2.x Tainted: P O 5.9.13_1 #1
    2020-12-11T11:18:49.54073 kern.warn: [10080.745036] Hardware name: LENOVO 20UDCTO1WW/20UDCTO1WW, BIOS R1BET58W(1.27 ) 10/20/2020
    2020-12-11T11:18:49.54076 kern.warn: [10080.745055] RIP: 0010:dbuf_find+0x86/0x1a0 [zfs]
    2020-12-11T11:18:49.54078 kern.warn: [10080.745056] Code: 7b 01 00 49 89 57 28 4a 8b 04 f0 48 85 c0 0f 84 bb 00 00 00 48 8b 0c 24 49 89 d6 eb 0d 48 8b 40 38 48 85 c0 0f 84 a5 00 00 00 <48> 39 18 75 ee 48 39 68 20 75 e8 44 38 68 68 75 e2 48 39 48 58 75
    2020-12-11T11:18:49.54082 kern.warn: [10080.745057] RSP: 0018:ffffbbea19897b38 EFLAGS: 00010206
    2020-12-11T11:18:49.54085 kern.warn: [10080.745058] RAX: 0000a00800000000 RBX: 000000000000082c RCX: 000000000000cad9
    2020-12-11T11:18:49.54088 kern.warn: [10080.745059] RDX: ffff91888dc7cd80 RSI: 000000000000082c RDI: ffffffffc137ce70
    2020-12-11T11:18:49.54090 kern.warn: [10080.745059] RBP: ffff918988e91800 R08: 11cebcd6982c231b R09: 9ae16a3b2f90404f
    2020-12-11T11:18:49.54094 kern.warn: [10080.745060] R10: ffffbbea19897ed8 R11: ffff91899dd40000 R12: 0000000000024f60
    2020-12-11T11:18:49.54097 kern.warn: [10080.745061] R13: 0000000000000000 R14: ffff91888dc7cd80 R15: ffffffffc137ce70
    2020-12-11T11:18:49.54101 kern.warn: [10080.745062] FS: 00007f177ab4c740(0000) GS:ffff9189cee80000(0000) knlGS:0000000000000000
    2020-12-11T11:18:49.54105 kern.warn: [10080.745062] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    2020-12-11T11:18:49.54107 kern.warn: [10080.745063] CR2: 000000000eaffef8 CR3: 000000063c17a000 CR4: 0000000000350ee0
    2020-12-11T11:18:49.54111 kern.warn: [10080.745064] Call Trace:
    2020-12-11T11:18:49.54114 kern.warn: [10080.745083] ? zio_create+0x45b/0x530 [zfs]
    2020-12-11T11:18:49.54116 kern.warn: [10080.745096] dbuf_hold_impl+0x60/0x5f0 [zfs]
    2020-12-11T11:18:49.54119 kern.warn: [10080.745110] dbuf_hold+0x2c/0x60 [zfs]
    2020-12-11T11:18:49.54121 kern.warn: [10080.745124] dmu_buf_hold_array_by_dnode+0xd8/0x490 [zfs]
    2020-12-11T11:18:49.54123 kern.warn: [10080.745127] ? __kmalloc_node+0x1d2/0x3b0
    2020-12-11T11:18:49.54125 kern.warn: [10080.745141] dmu_read_uio_dnode+0x47/0xf0 [zfs]
    2020-12-11T11:18:49.54129 kern.warn: [10080.745157] ? zfs_rangelock_enter_impl+0x266/0x560 [zfs]
    2020-12-11T11:18:49.54131 kern.warn: [10080.745170] dmu_read_uio_dbuf+0x42/0x60 [zfs]
    2020-12-11T11:18:49.54133 kern.warn: [10080.745184] zfs_read+0x123/0x4b0 [zfs]
    2020-12-11T11:18:49.54135 kern.warn: [10080.745199] zpl_read_common_iovec+0xa2/0xf0 [zfs]
    2020-12-11T11:18:49.54137 kern.warn: [10080.745213] zpl_iter_read+0x109/0x180 [zfs]
    2020-12-11T11:18:49.54140 kern.warn: [10080.745215] new_sync_read+0x114/0x1a0
    2020-12-11T11:18:49.54142 kern.warn: [10080.745216] vfs_read+0xf6/0x180
    2020-12-11T11:18:49.54144 kern.warn: [10080.745218] ksys_read+0x5f/0xe0
    2020-12-11T11:18:49.54146 kern.warn: [10080.745220] do_syscall_64+0x33/0x40
    2020-12-11T11:18:49.54149 kern.warn: [10080.745223] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    2020-12-11T11:18:49.54151 kern.warn: [10080.745224] RIP: 0033:0x7f177ac3d1ee
    2020-12-11T11:18:49.54153 kern.warn: [10080.745225] Code: c0 e9 c6 fe ff ff 50 48 8d 3d e6 ef 09 00 e8 39 e6 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
    2020-12-11T11:18:49.54156 kern.warn: [10080.745226] RSP: 002b:00007ffee20b7f98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    2020-12-11T11:18:49.54158 kern.warn: [10080.745227] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f177ac3d1ee
    2020-12-11T11:18:49.54161 kern.warn: [10080.745228] RDX: 0000000000020000 RSI: 0000000000eb6610 RDI: 0000000000000005
    2020-12-11T11:18:49.54163 kern.warn: [10080.745228] RBP: 0000000000ef1200 R08: 00007f177b106290 R09: 0000000000000001
    2020-12-11T11:18:49.54165 kern.warn: [10080.745229] R10: 0000000000000000 R11: 0000000000000246 R12: 000000007ffff000
    2020-12-11T11:18:49.54167 kern.warn: [10080.745230] R13: 0000000000020000 R14: 0000000000eb6610 R15: 00007ffee20b80d4
    2020-12-11T11:18:49.54169 kern.warn: [10080.745231] Modules linked in: acpi_call(O) 8021q garp mrp stp llc cdc_ether usbnet snd_usb_audio snd_usbmidi_lib r8152 snd_rawmidi uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev btusb btrtl mc btbcm btintel joydev snd_soc_dmic snd_acp3x_pdm_dma snd_acp3x_rn snd_soc_core snd_compress snd_pcm_dmaengine ac97_bus tps6598x typec roles edac_mce_amd iwlmvm kvm irqbypass crct10dif_pclmul ghash_clmulni_intel mac80211 snd_hda_codec_realtek aesni_intel crypto_simd cryptd snd_hda_codec_generic snd_hda_codec_hdmi glue_helper libarc4 rapl snd_hda_intel snd_intel_dspcfg psmouse input_leds pcspkr snd_hda_codec sp5100_tco snd_rn_pci_acp3x wmi_bmof iwlwifi i2c_piix4 snd_pci_acp3x k10temp snd_hda_core snd_hwdep r8169 ipmi_devintf tpm_crb cfg80211 thinkpad_acpi snd_pcm ccp ipmi_msghandler realtek evdev ledtrig_audio ac mac_hid tpm_tis i2c_multi_instantiate tpm_tis_core tpm i2c_scmi rng_core i2c_designware_platform tiny_power_button acpi_cpufreq i2c_designware_core
    2020-12-11T11:18:49.54175 kern.warn: [10080.745258] snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost tap vhost_iotlb uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse hid_logitech_hidpp hid_logitech_dj hid_generic usbmouse usbkbd usbhid hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) amdgpu iommu_v2 gpu_sched i2c_algo_bit ttm rtsx_pci_sdmmc mmc_core drm_kms_helper syscopyarea sysfillrect sysimgblt zcommon(PO) fb_sys_fops cec znvpair(PO) rc_core xhci_pci spl(O) xhci_pci_renesas xhci_hcd ehci_pci drm ehci_hcd crc32_pclmul crc32c_intel serio_raw usbcore agpgart rtsx_pci wmi button battery video dm_mirror dm_region_hash dm_log dm_mod
    2020-12-11T11:18:49.54178 kern.warn: [10080.745288] ---[ end trace 5d30cf1275d6e0a0 ]---
    2020-12-11T11:18:49.54181 kern.warn: [10080.745301] RIP: 0010:dbuf_find+0x86/0x1a0 [zfs]
    2020-12-11T11:18:49.54184 kern.warn: [10080.745301] Code: 7b 01 00 49 89 57 28 4a 8b 04 f0 48 85 c0 0f 84 bb 00 00 00 48 8b 0c 24 49 89 d6 eb 0d 48 8b 40 38 48 85 c0 0f 84 a5 00 00 00 <48> 39 18 75 ee 48 39 68 20 75 e8 44 38 68 68 75 e2 48 39 48 58 75
    2020-12-11T11:18:49.54186 kern.warn: [10080.745302] RSP: 0018:ffffbbea19897b38 EFLAGS: 00010206
    2020-12-11T11:18:49.54189 kern.warn: [10080.745303] RAX: 0000a00800000000 RBX: 000000000000082c RCX: 000000000000cad9
    2020-12-11T11:18:49.54191 kern.warn: [10080.745304] RDX: ffff91888dc7cd80 RSI: 000000000000082c RDI: ffffffffc137ce70
    2020-12-11T11:18:49.54193 kern.warn: [10080.745304] RBP: ffff918988e91800 R08: 11cebcd6982c231b R09: 9ae16a3b2f90404f
    2020-12-11T11:18:49.54196 kern.warn: [10080.745305] R10: ffffbbea19897ed8 R11: ffff91899dd40000 R12: 0000000000024f60
    2020-12-11T11:18:49.54198 kern.warn: [10080.745305] R13: 0000000000000000 R14: ffff91888dc7cd80 R15: ffffffffc137ce70
    2020-12-11T11:18:49.54201 kern.warn: [10080.745306] FS: 00007f177ab4c740(0000) GS:ffff9189cee80000(0000) knlGS:0000000000000000
    2020-12-11T11:18:49.54204 kern.warn: [10080.745307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    2020-12-11T11:18:49.54206 kern.warn: [10080.745308] CR2: 000000000eaffef8 CR3: 000000063c17a000 CR4: 0000000000350ee0
    This is the most common outcome, but not the only one: it happens in about 8 out of 10 boots. In some other boots the machine shows the "IB ring test failed" messages and, on top of that, I get errors of the type "failed to write reg XXX". In these cases the machine also hangs, but much sooner.
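
    For reference, the "non-canonical address" wording in the trace can be checked by hand. The following is only a minimal sketch, assuming 48-bit virtual addresses (4-level paging); the function name is_canonical is made up for illustration. An address is canonical when bits 63 down to 47 are all equal, and the value in RAX at the time of the fault does not satisfy that:

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of a 64-bit address are all equal.

    va_bits=48 is an assumption (4-level paging), not something read from the machine.
    """
    top = addr >> (va_bits - 1)                 # bits 63..47 for 48-bit addressing
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits for 48-bit addressing
    return top == 0 or top == all_ones


fault_addr = 0x0000A00800000000   # RAX in the trace above
valid_ptr = 0xFFFF91888DC7CD80    # RDX in the trace above, a normal kernel pointer

print(hex(fault_addr), is_canonical(fault_addr))  # non-canonical: bit 47 set, bits 48..63 clear
print(hex(valid_ptr), is_canonical(valid_ptr))    # canonical
```

    This only confirms that the pointer being dereferenced in dbuf_find was garbage; it does not say what corrupted it.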

    Finally, in about 1 out of 10 boots, the machine comes up without any error messages and then works perfectly.
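
    To check whether a given boot is one of the bad ones, the "IB test failed" lines can be pulled out of the kernel log. This is a rough sketch, not a polished tool: the helper name ib_failures and the regular expression are my own and are only matched against the message format quoted above. On Linux, error code 110 is ETIMEDOUT, so the (-110) in these messages means the test timed out.

```python
import errno
import re

# Matches e.g. "*ERROR* IB test failed on comp_1.0.1 (-110)."
IB_FAIL = re.compile(r"\*ERROR\* IB test failed on (\S+) \((-?\d+)\)")


def ib_failures(lines):
    """Return a list of (ring_name, errno_name) for each 'IB test failed' line."""
    found = []
    for line in lines:
        m = IB_FAIL.search(line)
        if m:
            code = abs(int(m.group(2)))
            # Fall back to the bare number if this platform has no name for the code.
            found.append((m.group(1), errno.errorcode.get(code, str(code))))
    return found


# Two lines copied from the log above; in practice the input would be dmesg output.
sample = [
    "[ 10.030223] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.1 (-110).",
    "[ 11.055240] amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.1.1 (-110).",
]

for ring, name in ib_failures(sample):
    print(ring, name)
```

    An empty result would correspond to the clean boots; it does not detect the "failed to write reg" case, since I have not quoted that message in full here.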

    I have seen these errors with kernels 5.8.16-5.8.18 and 5.9.8-5.9.13, with both OpenZFS 2.0 and zfs 0.8.5. I have also updated the BIOS to the latest version and tried the latest firmware from the Linux kernel site. I boot the kernel with the command line parameters "amd_iommu=on iommu=soft" in order to avoid other errors, but this does not seem to make a difference. I have tried several other kernel command line options without success. I am using Void Linux.

    This hardware is supposed to be directly supported by the kernel... Does anyone have any idea what is happening? Is this a hardware error?