Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs

  • #71
    Does this fail with the low-end Ryzens as well? They should leave more room for adequate power supply etc.


    • #72
      Originally posted by Beherit
      And it's also a compatibility layer.

      See?

      You'll also find WSL listed as an example of a compatibility layer a bit further down the article.

      Still not convinced?
      I do think that the term "compatibility layer" is used in an imprecise fashion in the Wikipedia articles.
      The Windows Subsystem for Linux is no more a compatibility layer than the Win32 subsystem, or the now-defunct OS/2 and POSIX (Interix/SUA) subsystems.


      Originally posted by GreatEmerald
      Does this fail with the low-end Ryzens as well? They should leave more room for adequate power supply etc.
      The low-end Ryzens do not support SMT, so they are mostly not affected.


      • #73
        It would seem as if kill-ryzen doesn't kill my ryzen at all (see above). I guess it must be either memory timings or AMD's kernel.

        Michael: Would you maybe try AMD's ROCm kernel?


        • #74
          Originally posted by GreatEmerald
          Does this fail with the low-end Ryzens as well? They should leave more room for adequate power supply etc.
          Yes, it does. I've tested three Ryzen machines (8 cores, 6 cores, 4 cores), and all of them show the same issue (namely, a segfault during parallel compilation). One of the machines (the 8-core) was ordered, assembled and installed by me; the other two are computers at our office, ordered, assembled and installed by our office IT (and they certainly are not, and never were, overclocked).

          That's also why I'm convinced that a significant number of Ryzen CPUs are affected by it...


          • #75
            1800x, crosshair vi hero, 16G corsair (manually set timings and voltage in uefi), 4.12.4-1-ARCH-x86_64, microcode: 0x8001126.
            Ran the ryzen test for just over 30mins with no crash.


            • #76
              R5 1600X, MSI B350 Tomahawk, 16G Corsair, 4.12.4-1-ARCH-x86_64, microcode: 0x0800111c

              Segfault after about 2 hours.
              Last edited by vein; 08-05-2017, 07:09 AM.


              • #77
                Originally posted by storma
                1800x, crosshair vi hero, 16G corsair (manually set timings and voltage in uefi), 4.12.4-1-ARCH-x86_64, microcode: 0x8001126.
                Ran the ryzen test for just over 30mins with no crash.
                Not seeing a segfault for 30 minutes is nowhere near enough to conclude that the system is not showing the problem. In my experience the average time until the first segfault occurs is somewhere around 2 hours. Sometimes much less, sometimes much more, even on the same hardware.
                I wouldn't rule out that a system is affected by the segfault bug unless it has been running continuous stress testing for at least 48 hours.
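
To put rough numbers on why a short clean run says so little, here is a minimal sketch. It assumes, purely for illustration, that the time to the first segfault is exponentially distributed with the roughly 2-hour mean mentioned above. Both the exponential model and the function name `prob_no_segfault` are assumptions, not something measured in this thread. Under that model an affected system would still pass a 30-minute run about 78% of the time.

```python
import math


def prob_no_segfault(run_hours, mean_hours_to_failure=2.0):
    """Chance that an affected system survives run_hours without a segfault.

    Assumption (hypothetical model): failure times are exponentially
    distributed, so the survival probability is exp(-t / mean).
    """
    return math.exp(-run_hours / mean_hours_to_failure)


# Compare a short run with the longer runs suggested in the thread.
for hours in (0.5, 2, 24, 48):
    print(f"{hours:>4} h clean run on an affected system: "
          f"probability {prob_no_segfault(hours):.2e}")
```

Real failure times need not follow this model (several posters report that the rate varies a lot between boots and distributions), so the numbers only indicate the order of magnitude.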


                • #78
                  Originally posted by oleid
                  Edit: No problem after 1.5h++
                  You are using Arch, right? I have an unstable system in Ubuntu and Fedora 26, but in Fedora 26 it takes much longer to fail. Antergos (which is Arch-based) also seems to be much better at avoiding failure. I am still double-checking this by running more tests on my system. But with Antergos, kill_ryzen.sh took at least 10,000 seconds to fail (which is almost 3 hours) in the first two tries. Under Ubuntu it takes a couple of minutes, tops. My theory is that GCC 6.3 generates code that triggers the bug easily, while GCC 7.1.1 (which seems to be the default compiler in the Arch toolchain) is better, generating code that is less likely to trigger the bug. Since Linux and BSD are compiled using GCC, this may explain why it is easier to see the problem under those systems. And WSL also runs lots of code compiled by GCC under Windows.

                  Note that I am not saying that GCC is the culprit here. It is probably generating good and valid code, since it runs well on other CPUs. But if we could find out what this code path is, maybe we can help AMD engineers find a workaround (even if that means asking GCC and other compilers to avoid that code path).

                  P.S.: AMD, if this helps you I want a flagship processor every year :-)

                  Edit: If you want to make sure that your system is good you should run kill_ryzen.sh for at least 24 hours. I've been there: after tweaking my system (and turning off SMT) I could get 5 hours, and then bang, another segfault.
                  Last edited by pjssilva; 08-05-2017, 07:42 AM.
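
For anyone who wants to log the time to first failure over a long run like the 24 hours suggested above, the following is a minimal sketch of a wrapper. It is not the kill_ryzen.sh script itself; the helper names `run_until_failure` and `describe` are hypothetical, and treating any nonzero exit as a failure is an assumption about how the wrapped command reports problems.

```python
import signal
import subprocess
import time


def run_until_failure(cmd, max_seconds):
    """Re-run cmd until it fails or max_seconds have elapsed.

    Returns (elapsed_seconds, returncode) for the first failing run,
    or None if every run started within the time budget succeeded.
    The budget is only checked between runs, so one long-running
    command can overshoot it.
    """
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return time.monotonic() - start, result.returncode
    return None


def describe(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative code means the process died from that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exit status %d" % returncode
```

A call such as `run_until_failure(["./kill_ryzen.sh"], 24 * 3600)` would then give the elapsed time and exit code of the first failure; the script path here is an assumption about where the script lives on your system.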


                  • #79
                    We've had access to an EPYC 7601 system at work and ran highly parallel HPC workloads (HPL, HPCG, Stream etc.) on it. Not a single segfault in three days. I also went and compiled the 4.12 Linux kernel with GCC 7.1.0 with a parallelization factor of up to 256 (make -j256) multiple times, completely stable.

                    On the other hand I have a Ryzen 5 1600X on an Asus Prime B350M-A with 16 GB of DDR4 memory running at 2933 MHz at home, with the default 4.12.2 Arch Linux kernel. I just checked out darktable from GitHub and did a make -j12, and immediately had one of the GCC instances crash with a segfault.


                    • #80
                      Another hint: once the system starts failing, it looks like it sometimes enters a state where failure is much more likely. To get back to "normal", I have the feeling that you have to at least turn off the computer and wait a little. Maybe even clear the CMOS (but I am not sure about that). My system was taking 10,000 s to fail with kill_ryzen.sh under Antergos. After a failed test I rebooted the system and tried again, and it failed in 80 s. Unfortunately I am not close to my system now; I am accessing it over ssh, so I cannot turn it off, wait 10 minutes and try again.
