
Debian 10 random freezes

  • Debian 10 random freezes

    Hello,
    I am experiencing random freezes on Debian 10. They happen at random: sometimes I leave the system idle and the whole OS is frozen after some time and nothing works; sometimes I am using Firefox + Qt Creator + Rhythmbox and it freezes. Sometimes it happens 10 minutes after boot, sometimes after 3 hours.
    Sometimes mouse, keyboard and audio still work, sometimes everything just stops responding.

    Kernel 5.4.0-0.bpo.4-amd64, NVIDIA proprietary driver, Xorg, GNOME.

    Spec:
    CPU: Threadripper 3960X
    Motherboard: Asus Zenith II Extreme
    GPU: MSI GTX 1080 Ti
    RAM: G.Skill at 3600 CL14 (the kit is rated 3800 CL14, down-clocked so the Infinity Fabric clock matches) - memory tested, works fine
    Boot SSD: NVMe Samsung 2TB 960 Pro + multiple 4TB Samsung 860 Evo + 512/256G Samsung SSD

    Tried booting with mce=off; it doesn't help.

    Any advice is more than welcome.

  • #2
    Some progress on this issue:

    With pti=off things got better: I went 2 days without a freeze, but unfortunately today I still got a random freeze.
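When trying boot parameters such as mce=off or pti=off, it can help to confirm that the running kernel actually received them. A minimal Python sketch, assuming a Linux system where /proc/cmdline is readable; the sample command line below is hypothetical, and quoted parameters containing spaces are not handled:

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel (Linux only)."""
    return parse_cmdline(Path(path).read_text())


# Hypothetical sample, not taken from the affected machine.
sample = "BOOT_IMAGE=/boot/vmlinuz-5.4.0-0.bpo.4-amd64 ro quiet pti=off mce=off"
params = parse_cmdline(sample)
for name in ("pti", "mce", "mitigations"):
    print(name, "->", params.get(name, "<not set>"))
```

On the real system, read_running_cmdline() would be used instead of the sample string.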

    This time around I was able to collect some logs:

    Code:
    Apr 1 18:28:41 User kernel: [ 5079.936816] ------------[ cut here ]------------
    Apr 1 18:28:41 User kernel: [ 5079.936822] invalid opcode: 0000 [#1] SMP NOPTI
    Apr 1 18:28:41 User kernel: [ 5079.936824] CPU: 37 PID: 35452 Comm: nvidia-settings Tainted: P OE 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1
    Apr 1 18:28:41 User kernel: [ 5079.936825] Hardware name: System manufacturer System Product Name/ROG ZENITH II EXTREME, BIOS 0902 03/17/2020
    Apr 1 18:28:41 User kernel: [ 5079.936829] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c
    Apr 1 18:28:41 User kernel: [ 5079.936830] Code: a6 ae ad e8 fc 3b ce ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 90 a6 ae ad e8 e8 3b ce ff 0f 0b 48 c7 c7 40 a7 ae ad e8 da 3b ce ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 00 a7 ae ad e8 c6 3b ce ff 0f 0b
    Apr 1 18:28:41 User kernel: [ 5079.936832] RSP: 0000:ffffa8c60722fc40 EFLAGS: 00010046
    Apr 1 18:28:41 User kernel: [ 5079.936833] RAX: 0000000000000054 RBX: fffff3e056a36408 RCX: 0000000000000000
    Apr 1 18:28:41 User kernel: [ 5079.936834] RDX: 0000000000000000 RSI: ffff98316d757688 RDI: ffff98316d757688
    Apr 1 18:28:41 User kernel: [ 5079.936835] RBP: fffff3e05b430048 R08: 00000000000006c7 R09: 0000000000000063
    Apr 1 18:28:41 User kernel: [ 5079.936836] R10: 0000000000000000 R11: ffffa8c60722faf0 R12: fffff3e056a36400
    Apr 1 18:28:41 User kernel: [ 5079.936836] R13: fffff3e05b430040 R14: 0000000000000000 R15: ffff98318f1fbb80
    Apr 1 18:28:41 User kernel: [ 5079.936838] FS: 00007f1c5e3d0740(0000) GS:ffff98316d740000(0000) knlGS:0000000000000000
    Apr 1 18:28:41 User kernel: [ 5079.936839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr 1 18:28:41 User kernel: [ 5079.936839] CR2: 00007f1c53f61000 CR3: 0000000561d40000 CR4: 0000000000340ea0
    Apr 1 18:28:41 User kernel: [ 5079.936840] Call Trace:
    Apr 1 18:28:41 User kernel: [ 5079.936844] get_page_from_freelist+0x1b3/0x1260
    Apr 1 18:28:41 User kernel: [ 5079.936847] __alloc_pages_nodemask+0x163/0x310
    Apr 1 18:28:41 User kernel: [ 5079.936850] alloc_pages_vma+0x74/0x1f0
    Apr 1 18:28:41 User kernel: [ 5079.936852] __handle_mm_fault+0x48a/0x1280
    Apr 1 18:28:41 User kernel: [ 5079.936854] handle_mm_fault+0xc2/0x1f0
    Apr 1 18:28:41 User kernel: [ 5079.936856] __do_page_fault+0x23e/0x4f0
    Apr 1 18:28:41 User kernel: [ 5079.936859] page_fault+0x34/0x40
    Apr 1 18:28:41 User kernel: [ 5079.936861] RIP: 0033:0x7f1c5ef17154
    Apr 1 18:28:41 User kernel: [ 5079.936863] Code: 1f 80 00 00 00 00 48 8b 08 8b 50 08 4c 01 f9 48 83 fa 26 74 0a 48 83 fa 08 0f 85 1d 10 00 00 48 8b 50 10 48 83 c0 18 4c 01 fa <48> 89 11 48 39 c3 77 d4 4d 8b b2 d0 01 00 00 4d 85 f6 0f 85 7e fa
    Apr 1 18:28:41 User kernel: [ 5079.936863] RSP: 002b:00007ffd487bce30 EFLAGS: 00010206
    Apr 1 18:28:41 User kernel: [ 5079.936864] RAX: 00007f1c52587de8 RBX: 00007f1c526c15b8 RCX: 00007f1c53f61000
    Apr 1 18:28:41 User kernel: [ 5079.936865] RDX: 00007f1c5290e650 RSI: 00007f1c52465cf8 RDI: 00007f1c5ef339f0
    Apr 1 18:28:41 User kernel: [ 5079.936866] RBP: 00007ffd487bcf30 R08: 00007f1c526c2830 R09: 0000000000000000
    Apr 1 18:28:41 User kernel: [ 5079.936867] R10: 0000000000889540 R11: 0000000000889540 R12: 0000000000000000
    Apr 1 18:28:41 User kernel: [ 5079.936867] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f1c52461000
    Apr 1 18:28:41 User kernel: [ 5079.936869] Modules linked in: rfcomm vmnet(OE) ppdev parport_pc parport vmw_vsock_vmci_transport vsock joydev snd_usb_audio vmw_vmci snd_usbmidi_lib snd_rawmidi snd_seq_device mc vmmon(OE) snd_hda_codec_hdmi bnep edac_mce_amd kvm_amd kvm irqbypass btusb btrtl btbcm btintel bluetooth crct10dif_pclmul crc32_pclmul nls_ascii nls_cp437 nvidia_drm(POE) vfat nvidia_modeset(POE) fat ghash_clmulni_intel fuse nvidia(POE) drbg eeepc_wmi asus_wmi ansi_cprng battery aesni_intel snd_hda_intel ecdh_generic snd_intel_nhlt efi_pstore ecc sparse_keymap crypto_simd snd_hda_codec rfkill cryptd glue_helper wmi_bmof sg video drm_kms_helper snd_hda_core snd_hwdep drm snd_pcm efivars pcspkr snd_timer ccp snd ipmi_devintf sp5100_tco ipmi_msghandler soundcore rng_core watchdog k10temp evdev acpi_cpufreq nct6775 hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid sd_mod ahci libahci xhci_pci igb xhci_hcd libata mxm_wmi crc32c_intel nvme usbcore dca scsi_mod
    Apr 1 18:28:41 User kernel: [ 5079.936897] ptp nvme_core pps_core i2c_algo_bit i2c_piix4 usb_common wmi i2c_designware_platform button i2c_designware_core
    Apr 1 18:28:41 User kernel: [ 5079.936903] ---[ end trace 29229a2768299bfc ]---
    And today, again with pti=off: I was browsing Google Maps in 3D and the OS hung. I tried switching to a different TTY to restart GDM but ended up doing a hard reset:

    Code:
    Apr 3 17:56:56 User kernel: [ 1004.665416] ------------[ cut here ]------------
    Apr 3 17:56:56 User kernel: [ 1004.665427] invalid opcode: 0000 [#12] SMP NOPTI
    Apr 3 17:56:56 User kernel: [ 1004.665432] CPU: 37 PID: 6049 Comm: (realmd) Tainted: P D OE 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1
    Apr 3 17:56:56 User kernel: [ 1004.665436] Hardware name: System manufacturer System Product Name/ROG ZENITH II EXTREME, BIOS 0902 03/17/2020
    Apr 3 17:56:56 User kernel: [ 1004.665440] RIP: 0010:__list_add_valid.cold.0+0x12/0x28
    Apr 3 17:56:56 User kernel: [ 1004.665444] Code: 85 5d 00 00 00 48 8b 50 08 48 39 f2 0f 85 42 00 00 00 b8 01 00 00 00 c3 48 89 d1 48 c7 c7 a8 a5 ee 81 48 89 c2 e8 10 3c ce ff <0f> 0b 48 89 c1 4c 89 c6 48 c7 c7 00 a6 ee 81 e8 fc 3b ce ff 0f 0b
    Apr 3 17:56:56 User kernel: [ 1004.665448] RSP: 0018:ffff9f72c6aefcd0 EFLAGS: 00010046
    Apr 3 17:56:56 User kernel: [ 1004.665454] RAX: 0000000000000075 RBX: ffff8a3b2d76ed18 RCX: 0000000000000000
    Apr 3 17:56:56 User kernel: [ 1004.665457] RDX: 0000000000000000 RSI: ffff8a3b2d757688 RDI: ffff8a3b2d757688
    Apr 3 17:56:56 User kernel: [ 1004.665461] RBP: ffff8a3b4f1fbb80 R08: 0000000000000924 R09: 0000000000000084
    Apr 3 17:56:56 User kernel: [ 1004.665464] R10: 0000000000000000 R11: ffff9f72c6aefb80 R12: ffff8a3b2d76ed38
    Apr 3 17:56:56 User kernel: [ 1004.665467] R13: fffff6115eb65ac0 R14: fffff6115eb65ac8 R15: fffff6115c478fc8
    Apr 3 17:56:56 User kernel: [ 1004.665474] FS: 0000000000000000(0000) GS:ffff8a3b2d740000(0000) knlGS:0000000000000000
    Apr 3 17:56:56 User kernel: [ 1004.665477] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr 3 17:56:56 User kernel: [ 1004.665481] CR2: 0000000000000000 CR3: 00000006a1a2c000 CR4: 0000000000340ea0
    Apr 3 17:56:56 User kernel: [ 1004.665484] Call Trace:
    Apr 3 17:56:56 User kernel: [ 1004.665487] free_unref_page_commit+0x91/0x100
    Apr 3 17:56:56 User kernel: [ 1004.665491] free_unref_page_list+0x111/0x190
    Apr 3 17:56:56 User kernel: [ 1004.665493] release_pages+0x207/0x430
    Apr 3 17:56:56 User kernel: [ 1004.665496] ? lru_deactivate_fn+0x2b0/0x2b0
    Apr 3 17:56:56 User kernel: [ 1004.665499] pagevec_lru_move_fn+0xb8/0xd0
    Apr 3 17:56:56 User kernel: [ 1004.665501] lru_add_drain+0x11/0x20
    Apr 3 17:56:56 User kernel: [ 1004.665503] exit_mmap+0x80/0x180
    Apr 3 17:56:56 User kernel: [ 1004.665504] ? futex_cleanup+0xbc/0x460
    Apr 3 17:56:56 User kernel: [ 1004.665507] ? __khugepaged_exit+0x108/0x120
    Apr 3 17:56:56 User kernel: [ 1004.665509] ? kmem_cache_free+0x287/0x2a0
    Apr 3 17:56:56 User kernel: [ 1004.665511] mmput+0x54/0x130
    Apr 3 17:56:56 User kernel: [ 1004.665514] do_exit+0x30f/0xb10
    Apr 3 17:56:56 User kernel: [ 1004.665516] ? __x64_sys_execve+0x34/0x40
    Apr 3 17:56:56 User kernel: [ 1004.665518] rewind_stack_do_exit+0x17/0x20
    Apr 3 17:56:56 User kernel: [ 1004.665520] Modules linked in: rfcomm vmnet(OE) ppdev parport_pc parport vmw_vsock_vmci_transport vsock vmw_vmci vmmon(OE) bnep btusb btrtl btbcm btintel bluetooth snd_usb_audio drbg snd_usbmidi_lib snd_rawmidi ansi_cprng snd_seq_device ecdh_generic mc ecc sg joydev snd_hda_codec_hdmi edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul nls_ascii nls_cp437 vfat nvidia_drm(POE) nvidia_modeset(POE) fat fuse ghash_clmulni_intel aesni_intel nvidia(POE) crypto_simd cryptd glue_helper snd_hda_intel eeepc_wmi efi_pstore snd_intel_nhlt asus_wmi snd_hda_codec battery drm_kms_helper sparse_keymap rfkill snd_hda_core video efivars pcspkr wmi_bmof snd_hwdep drm snd_pcm snd_timer snd ccp ipmi_devintf sp5100_tco rng_core soundcore ipmi_msghandler k10temp watchdog evdev acpi_cpufreq nct6775 hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid sd_mod ahci libahci libata xhci_pci xhci_hcd mxm_wmi igb crc32c_intel scsi_mod nvme usbcore dca
    Apr 3 17:56:56 User kernel: [ 1004.665549] nvme_core ptp pps_core i2c_algo_bit i2c_piix4 usb_common wmi i2c_designware_platform i2c_designware_core button
    Apr 3 17:56:56 User kernel: [ 1004.665555] ---[ end trace 769a9af4a5db8c2c ]---
    As you can see, these bugs happen at random and I can't reproduce them easily.
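Both traces above report CPU: 37 and stop inside list sanity-check helpers (__list_del_entry_valid and __list_add_valid). Two samples prove nothing, but if more oopses are collected, a small script can tabulate them. A rough Python sketch, assuming syslog-style lines like the ones pasted above; the function and field names are made up for illustration:

```python
import re
from collections import Counter

CPU_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
# Code segment 0010 marks a kernel-mode RIP; user-mode RIPs (0033) are skipped.
RIP_RE = re.compile(r"RIP: 0010:(\S+)")


def summarize_oopses(lines):
    """Return one dict per oops, pairing each 'CPU:' line with the kernel RIP after it."""
    oopses = []
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            oopses.append({"cpu": int(m.group(1)), "pid": int(m.group(2)),
                           "comm": m.group(3), "rip": None})
            continue
        m = RIP_RE.search(line)
        if m and oopses and oopses[-1]["rip"] is None:
            oopses[-1]["rip"] = m.group(1)
    return oopses


# Lines copied from the two logs in this post.
sample = [
    "Apr 1 18:28:41 User kernel: [ 5079.936824] CPU: 37 PID: 35452 Comm: nvidia-settings Tainted: P OE 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1",
    "Apr 1 18:28:41 User kernel: [ 5079.936829] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x4c",
    "Apr 1 18:28:41 User kernel: [ 5079.936861] RIP: 0033:0x7f1c5ef17154",
    "Apr 3 17:56:56 User kernel: [ 1004.665432] CPU: 37 PID: 6049 Comm: (realmd) Tainted: P D OE 5.4.0-0.bpo.4-amd64 #1 Debian 5.4.19-1~bpo10+1",
    "Apr 3 17:56:56 User kernel: [ 1004.665440] RIP: 0010:__list_add_valid.cold.0+0x12/0x28",
]

oopses = summarize_oopses(sample)
for o in oopses:
    print(o)
print(Counter(o["cpu"] for o in oopses))
```

Feeding it the full syslog instead of the sample would show whether the faults keep landing on the same core or are spread across the CPU.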

    Now I'm testing with mitigations=off, but I can already see that it won't help much, because:
    I run a gcc compilation of my project in a loop (100 iterations), mostly to test stability. It is very useful when testing RAM OC stability, as it errors out much sooner than e.g. MemTest86, and nowadays the loop just fails to finish. It's random: sometimes it errors out at iteration 15, sometimes at 35, and sometimes it passes!
    The error that I get is:

    corrupted size vs. prev_size in fastbins
    So this is definitely a memory issue.
    I use G.Skill memory and tested it on Windows 10 Pro with the memory test that comes with Ryzen DRAM Calculator: no errors. My memory is G.Skill Trident Z Neo DDR4-3800 CL14-16-16-36 1.50V 32GB, running at 3600MHz CL14-14-14-34 @ 1.43V (and yes, a complete waste of money on those sticks). My other sticks, 3200MHz CL14-14-14-34 @ 1.35V, are dead, as I killed them overclocking at 1.5V. Well, they error out on XMP but work fine, with no errors, on JEDEC (waiting in the RMA queue due to the current global situation).
    On Windows there is no issue: compilations, Blender, FurMark (CPU & GPU at 100% at the same time) etc. run with no crashes or freezes. But that's no indication of anything: in the past I did get a "stable" memory OC on Windows that then failed on Debian with the compilation loop.

    To my knowledge (please correct me if I'm wrong), the most likely culprit is the RAM and an incorrect configuration. I will fiddle with the memory some more and see if that helps.
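The compile loop mentioned above can be sketched roughly like this. This is not the exact script used here, just a minimal Python version of the idea: run the same build repeatedly and stop at the first failing iteration. The build command in the comment is a hypothetical placeholder; the demo uses a harmless stand-in so the sketch runs anywhere.

```python
import subprocess
import sys


def run_loop(command, iterations=100):
    """Run `command` up to `iterations` times.

    Returns the 1-based index of the first failing iteration, or None if all pass.
    """
    for i in range(1, iterations + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Show the last few stderr lines, where a compiler crash would appear.
            tail = result.stderr.strip().splitlines()[-5:]
            print(f"iteration {i} failed (exit code {result.returncode})")
            print("\n".join(tail))
            return i
    return None


# Stand-in command so the example is harmless; for a real test replace it with
# a full rebuild of your own project, e.g. ["make", "-B", "-j48"] (hypothetical).
stand_in = [sys.executable, "-c", "pass"]
failed_at = run_loop(stand_in, iterations=3)
print("all iterations passed" if failed_at is None else f"first failure at {failed_at}")
```

On stable hardware a deterministic build should pass every iteration, so a failure at a different iteration on each run points at hardware or settings rather than the project.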



    • #3
      [ISSUE RESOLVED]

      So I found the root of this issue, but was waiting to validate it.
      It's an unstable undervolt of the SoC and/or CPU.
      Removing the negative offset fixed everything: I started testing right after I made the post above, and Debian has been rock solid since.
      The strange part is that on Windows it all worked, so I think the only conclusion is that Windows doesn't utilize the HW to its fullest potential, and hence I didn't notice the issue there.

      Just to be clear: after applying the undervolt I stress tested, using the compile loop among other tests, on both Windows and Debian, and all worked fine. Since then I updated the BIOS twice, and apparently ASUS changed voltages as part of their "well described change log" of "performance improvements", so the issue appeared, and because in the past I spent less time on Debian I just didn't notice it in time.
      Last edited by Noname; 04-16-2020, 03:04 PM.
