Some FreeBSD Users Are Still Running Into Random Lock-Ups With Ryzen


  • #51
    Originally posted by vortex

    I am sure if someone can find a way to replicate the issue, they will be able to fix it.
    This also doesn't happen on Windows AFAIK, so it is a Linux bug of some type.

    If you can't replicate it since it is too random, then it is a guessing game, and not much progress can be done.

    So, if you have this issue, then try to find a way to replicate it, then report the bugs.
    Running the dhtcluster service from OpenDHT triggers it easily (https://github.com/savoirfairelinux/opendht). FYI, this is the tech used by GNU Ring for its distributed network. Running this service seems to trigger about one crash per day. Note that if you run anything intensive, the CPU won't lock up. The system needs to be fairly idle for the crash to happen, so I'd suggest leaving that service running alone.

    I've offered AMD an Ansible script I have that deploys this service on a Debian 9 system, in case they'd like to reproduce it on their side, but they have just ignored it.

    Comment


    • #52
      Originally posted by shmerl
      Ryzen is very sensitive to power stability. In short, you need a UPS to avoid problems.
      Probably true, but having a UPS doesn't fix the Ryzen soft lock problem. I tested 3 different systems equipped with 1800X and 1700X CPUs, two of which are plugged into a commercial-grade UPS, and they crash all the same.

      Comment


      • #53
        Originally posted by monraaf

        This was fixed in 4.13rc-something. These soft-lockups have been visible on large SPARC machines as well. This issue is not specific to Ryzen/Threadripper.

        My office has one Threadripper machine which had this issue as well. With the updated kernel, the problem went away.

        The machine is running 24/7 and is used for heavy number crunching.
        The 4.13rc-something resolution might apply to Threadripper, but it doesn't apply to Ryzen 1700X and 1800X: they still have the CPU soft lock issue on 4.14 (all 3 different systems).

        Comment


        • #54
          Originally posted by typerrrrrrrr
          Forcing off the C6 states (https://github.com/r4m0n/ZenStates-Linux) with a script in /etc/rc.local has changed my setup from an average uptime of 1 to 1.5 days to 50+ days (generally I've rebooted for other reasons, so I haven't gone beyond that number). Every other attempted fix (RCU, randomizing VA space, etc.) had nearly zero effect.
          That is very interesting indeed, many thanks for sharing it.
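
For anyone who wants to see which idle states their kernel exposes before or after trying this, here is a minimal Python sketch that reads the standard Linux cpuidle sysfs files (`name` and `disable` under `/sys/devices/system/cpu/cpuN/cpuidle/stateM/`). Note the assumption: this only shows the kernel's cpuidle view. It is not what the ZenStates-Linux script does, and it may not reflect the C6 setting that script changes. The function name `list_idle_states` is made up for this example.

```python
from pathlib import Path


def list_idle_states(cpu_dir):
    """Return (state, name, disabled) tuples for one CPU's cpuidle directory.

    cpu_dir is a path such as /sys/devices/system/cpu/cpu0. An empty list
    is returned when the kernel exposes no cpuidle directory there.
    """
    cpuidle = Path(cpu_dir) / "cpuidle"
    if not cpuidle.is_dir():
        return []

    states = []
    # Sort numerically so that state10 comes after state9.
    for state_dir in sorted(cpuidle.glob("state[0-9]*"),
                            key=lambda p: int(p.name[len("state"):])):
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        # The "disable" file holds 0 when the state may be used.
        disabled = (disable_file.exists()
                    and disable_file.read_text().strip() != "0")
        states.append((state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    for state, name, disabled in list_idle_states("/sys/devices/system/cpu/cpu0"):
        print(f"{state}: {name} ({'disabled' if disabled else 'enabled'})")
```

Running it as a script prints one line per idle state of cpu0. Taking the directory as a parameter keeps the function usable on a copied sysfs tree as well.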

          Comment


          • #55
            I get the random lockup issue more often with the Nouveau driver than with nVidia drivers. I think Nouveau does something that tends to provoke it.

            Comment


            • #56
              Originally posted by Apteryx

              Probably true, but having a UPS doesn't fix the Ryzen soft lock problem. I tested 3 different systems equipped with 1800X and 1700X CPUs, two of which are plugged into a commercial-grade UPS, and they crash all the same.
              I guess. I've been bugged by random freezes and MCE errors on Linux despite RMA-ing the CPU, motherboard and even RAM. And I suspect it's power related.

              Comment


              • #57
                Originally posted by Apteryx

                Are you also running with C-State disabled in UEFI? I also still had crashes with the CONFIG_RCU_NOCB_CPU workaround (thanks for discovering it). But it doesn't mean that C-State alone is the cure: I also had crashes with C-State off but *without* the CONFIG_RCU_NOCB_CPU hack. At this point I'm throwing everything I can at the problem, so ASLR is off, C-State is off, opcache is off and using CONFIG_RCU_NOCB_CPU. It seems to be holding that way -- touching wood.
                I did run with C-State disabled instead for a while, but I heard that this might disable boost. I'm still not sure about that, but it also uses more power than CONFIG_RCU_NOCB_CPU. I haven't measured it, but others have. I think only these two fixes make any difference. ASLR is almost certainly not related to this, but I don't think either fix is perfect. It may be that this affects some users more than others. I boot in legacy mode rather than UEFI; I doubt that makes any difference, but you never know.
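
When several workarounds are stacked like this, it helps to record which ones are actually active on a given box. Below is a rough Python sketch that checks three of the things mentioned above: whether the kernel config sets `CONFIG_RCU_NOCB_CPU=y`, whether an `rcu_nocbs=` parameter is on the kernel command line, and the ASLR mode from `/proc/sys/kernel/randomize_va_space`. The helper names are invented for this example, and the `/boot/config-<release>` location is an assumption (Debian-style); other distributions may keep the config elsewhere.

```python
import platform
from pathlib import Path


def rcu_nocb_enabled(config_text):
    """True if the kernel config text contains CONFIG_RCU_NOCB_CPU=y."""
    return any(line.strip() == "CONFIG_RCU_NOCB_CPU=y"
               for line in config_text.splitlines())


def rcu_nocbs_arg(cmdline):
    """Return the value of rcu_nocbs= on the command line, or None."""
    for token in cmdline.split():
        if token.startswith("rcu_nocbs="):
            return token.split("=", 1)[1]
    return None


def aslr_mode(value):
    """Map the randomize_va_space value to a readable label."""
    modes = {"0": "off", "1": "partial", "2": "full"}
    return modes.get(value.strip(), "unknown")


def read_or_none(path):
    """Read a text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    # Assumed config location; adjust for your distribution.
    config = read_or_none(f"/boot/config-{platform.release()}")
    cmdline = read_or_none("/proc/cmdline")
    aslr = read_or_none("/proc/sys/kernel/randomize_va_space")

    print("kernel:", platform.release())
    print("CONFIG_RCU_NOCB_CPU=y:",
          "unknown" if config is None else rcu_nocb_enabled(config))
    print("rcu_nocbs=:",
          "unknown" if cmdline is None else rcu_nocbs_arg(cmdline))
    print("ASLR:", "unknown" if aslr is None else aslr_mode(aslr))
```

The parsing functions take plain strings, so they can also be run against a config file or command line pasted from another machine.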

                Comment


                • #58

                  Is the "Ryzen lock-up issue" (the one that gets fixed by disabling C-state) still widespread?
                  Is the same problem still present in the Ryzen 2+Vega combo series of chips as well?

                  Comment


                  • #59
                    Originally posted by Shinobi
                    Is the "Ryzen lock-up issue" (the one that gets fixed by disabling C-state) still widespread?
                    Is the same problem still present in the Ryzen 2+Vega combo series of chips as well?
                    I didn't have issues on later kernel releases, whereas I had lockups on older ones, while 4.15 was in rc state.

                    Comment


                    • #60
                      Originally posted by partizann

                      I didn't have issues on later kernel releases, whereas I had lockups on older ones, while 4.15 was in rc state.
                      That is nice to hear. Would you mind copy-pasting the exact full kernel version that works for you?

                      Comment
