Some FreeBSD Users Are Still Running Into Random Lock-Ups With Ryzen

  • Some FreeBSD Users Are Still Running Into Random Lock-Ups With Ryzen

    Phoronix: Some FreeBSD Users Are Still Running Into Random Lock-Ups With Ryzen

    While Linux has been playing happily with Ryzen CPUs as long as you weren't affected by the performance marginality problem where you had to swap out for a newer CPU (and Threadripper and EPYC CPUs have been running splendidly in all of my testing without any worries), it seems the BSDs (at least FreeBSD) still have some quirks to address...

    http://www.phoronix.com/scan.php?pag...-Lock-Ups-2018

  • #2
    I don't think this is Linux only; on a default kernel on Arch Linux the system can freeze as well when idle or under low load. I had to use the linux-ryzen-git kernel to avoid the random lock-ups.

    • #3
      I used to have these lockups at idle with Arch/Antergos on a Ryzen 7 1700 until I downclocked my memory to 3k from 3.2k.

      It didn't cause much of a performance loss in gaming etc., so I'm a happy camper. I wonder whether my case is specific or whether it somehow points to the underlying issue.

      Also very disappointed in AMD for ignoring this issue and not admitting a HW failure.

      • #4
        Makes me wonder which CPU I would pick today... Intel or AMD... More vulnerable, or a little (but annoyingly) unstable...

        • #5
          Originally posted by Phoronix
          While Linux has been playing happily with Ryzen CPUs as long as you weren't affected by the performance marginality problem
          This is untrue.

          • #6
            Originally posted by Hi-Angel
            This looks really bad.

            • #7
              Well, I’m pretty sure that processor bug hunting is going to be a big thing in the next few months, so Intel and AMD will need to start fixing things like this...

              • #8
                Originally posted by Leopard

                This looks really bad.
                The bug itself is by far not the worst part. It's just a bug, that's it. Bugs just happen, regardless of OS or hardware. The worst part is that AMD doesn't write about it; it looks like nobody is working on it and nobody cares. It's a MAJOR problem on AMD's side, and AMD does nothing about it.

                You know, you can do lots of good stuff, but then a single fault may undo everything you did well. For AMD, this bug is that fault.

                • #9
                  Originally posted by phoronix
                  Phoronix: Some FreeBSD Users Are Still Running Into Random Lock-Ups With Ryzen

                  While Linux has been playing happily with Ryzen CPUs [...]

                  http://www.phoronix.com/scan.php?pag...-Lock-Ups-2018
                  What? For the record, here's my attempt at getting some attention on the instability issues that can still plague Ryzen CPUs on GNU/Linux, which I copy here, verbatim, for all to see:

                  Hello Michael,

                  I thought you might want to share some information regarding AMD Ryzen CPUs as it is way too hard to find on the web and diluted with the former instability problems that plagued Ryzen (segfaults).
                  Our experience (3 different systems with Ryzen 7 1700X and Ryzen 7 1800X) has proven they can still be unstable (even after going through an RMA to get chips made after week 30 of last year) when under very light load (mostly idling).

                  The RMA fixed the segfault issues that were triggered under heavy load.

                  Some crashes persisted though, as shown by the RCU stall warnings in the Linux kernel log:

                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 0-...: (0 ticks this GP) idle=748/0/0 softirq=1243126/1243126 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 1-...: (43 GPs behind) idle=314/0/0 softirq=205373/205373 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 2-...: (1 ticks this GP) idle=adc/0/0 softirq=1642967/1642967 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 3-...: (8 GPs behind) idle=dec/0/0 softirq=177729/177729 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 6-...: (0 ticks this GP) idle=de0/0/0 softirq=1205409/1205409 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 7-...: (9 GPs behind) idle=af8/0/0 softirq=405383/405383 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 9-...: (0 ticks this GP) idle=3c4/0/0 softirq=24711/24711 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 12-...: (47 GPs behind) idle=038/0/0 softirq=139639/139640 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 13-...: (126 GPs behind) idle=bec/0/0 softirq=24216/24216 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 14-...: (0 ticks this GP) idle=228/0/0 softirq=215446/215446 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: 15-...: (90 GPs behind) idle=ce4/0/0 softirq=27427/27427 fqs=0
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: (detected by 4, t=5322 jiffies, g=1719621, c=1719620, q=645)
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: Sending NMI from CPU 4 to CPUs 0:
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: Sending NMI from CPU 4 to CPUs 1:
                  Dec 26 21:16:24 ring-prod-mtl-02.mtl.sfl kernel: usb 3-2.1: USB disconnect, device number 124
                  Dec 26 21:17:01 ring-prod-mtl-02.mtl.sfl CRON[27616]: pam_unix(cron:session): session opened for user root by (uid=0)
                  Dec 26 21:17:13 ring-prod-mtl-02.mtl.sfl kernel: Sending NMI from CPU 4 to CPUs 2:
                  Dec 26 21:17:13 ring-prod-mtl-02.mtl.sfl kernel: Sending NMI from CPU 4 to CPUs 3:
                  Dec 26 21:17:13 ring-prod-mtl-02.mtl.sfl kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 14

                  Some readers of Phoronix seem to be affected as well: https://www.phoronix.com/forums/foru...u-owners/page4

                  These stalls freeze the whole machine, which must then be power cycled. This occurred about once per day.

                  After spending an awful lot of time investigating the issue, it seems that enabling RCU callback offloading using rcuo kernel threads is a valid workaround.

                  Basically one needs to compile the Linux kernel with the "CONFIG_RCU_NOCB_CPU" flag enabled, and then pass this kernel boot parameter: rcu_nocbs=0-15 (this value needs to be adapted based on the number of 'logical' cores of the processor).
                  This article details the procedure: http://blog.programster.org/ubuntu-1...rnel-for-ryzen

                  It affects Debian (and derivatives) as well as Fedora, at least with 4.13.x kernels.

                  The best explanation/guess of the workaround/problem that I could find was this comment: https://bugzilla.kernel.org/show_bug.cgi?id=196683#c53
                  That bug report seems to track the problem, although it hasn't gained traction with AMD kernel developers so far.

                  It'd be nice if you could prepare an article bringing this latest Ryzen instability problem to light so that (hopefully) it attracts enough attention and gets fixed!

                  Thanks,

                  Maxim
                  This mail was sent to [email protected] on the 8th of January. AMD will swap parts but is otherwise silent and unhelpful about any problems and resolutions...
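
                  For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that pulls the per-CPU entries out of a stall report like the one in the mail. The regular expression is only inferred from that excerpt, and the names STALL_LINE and stalled_cpus are made up for illustration; other kernel versions may format these lines differently.

```python
import re

# Pattern inferred from the log excerpt in the mail (an assumption, not a
# documented format): "<cpu>-<3 flag chars>: (<n> GPs behind)" or
# "<cpu>-<3 flag chars>: (<n> ticks this GP)".
STALL_LINE = re.compile(
    r"kernel:\s+(\d+)-.{3}:\s+\((\d+) (GPs behind|ticks this GP)\)"
)


def stalled_cpus(log_text):
    """Map each CPU number named in a stall report to (count, unit)."""
    cpus = {}
    for line in log_text.splitlines():
        match = STALL_LINE.search(line)
        if match:
            cpu = int(match.group(1))
            cpus[cpu] = (int(match.group(2)), match.group(3))
    return cpus
```

                  Feeding it the text of a saved kernel log returns a dict keyed by CPU number, which makes it easy to compare several freezes against each other.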

                  Since I wrote this email, I discovered that compiling with CONFIG_RCU_NOCB_CPU is not enough; I also had to disable C-states, otherwise it would still crash.
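
                  Regarding the rcu_nocbs value from the mail: it has to cover every logical CPU, so 0-15 only fits a 16-thread chip. A minimal Python sketch of how the value could be derived follows. The helper name rcu_nocbs_param is hypothetical, and it assumes os.cpu_count() reports the logical CPU count of the machine the kernel will boot on.

```python
import os


def rcu_nocbs_param(logical_cpus=None):
    """Build an rcu_nocbs= boot parameter covering all logical CPUs.

    Hypothetical helper: when no count is given, it falls back to
    os.cpu_count(), which reports logical CPUs on the current machine.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    if logical_cpus == 1:
        return "rcu_nocbs=0"
    # CPU lists are zero-based, so 16 logical CPUs become 0-15.
    return f"rcu_nocbs=0-{logical_cpus - 1}"
```

                  For a Ryzen 7 1700X (16 threads), rcu_nocbs_param(16) gives the same rcu_nocbs=0-15 string as in the mail. The parameter only has an effect on kernels built with CONFIG_RCU_NOCB_CPU.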

                  • #10
                    As my original Ryzen CPU had shown the random segfault issue, I got a replacement chip via AMD RMA. Since then there have been no more segfaults, but like other users, I've had random freezes when the system was idling.
                    The CONFIG_RCU_NOCB_CPU option didn't have any effect on my machine, but disabling C6 states via the firmware settings made the computer run rock solid. Not a single freeze since I changed this four months ago.
