Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs

  • #71
    R5 1600X, MSI B350 Tomahawk, 16G Corsair, 4.12.4-1-ARCH-x86_64, microcode: 0x0800111c

    Seg fault after about 2 hours
    Last edited by vein; 05 August 2017, 07:09 AM.



    • #72
      Originally posted by storma View Post
      1800x, crosshair vi hero, 16G corsair (manually set timings and voltage in uefi), 4.12.4-1-ARCH-x86_64, microcode: 0x8001126.
      Ran the ryzen test for just over 30mins with no crash.
      Not seeing a segfault for 30 minutes is far from enough to conclude that the system is unaffected. In my experience the average time until the first segfault is around 2 hours, sometimes much less, sometimes much more, even on the same hardware.
      I wouldn't rule out that a system is affected by the segfault bug unless it has been running continuous stress testing for at least 48 hours.
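
The long-running stress test described above can be sketched as a small harness. This is a hypothetical stand-in, not the actual kill_ryzen.sh; the job command and worker count are placeholder parameters. It keeps launching batches of parallel jobs and reports how long it took until the first one died:

```python
import subprocess
import time


def stress_until_failure(cmd, workers=16, max_seconds=48 * 3600):
    """Run `workers` copies of `cmd` in a loop; return the elapsed seconds
    at the first nonzero exit, or None if max_seconds passes cleanly.
    `cmd` is a placeholder for a real compile job (e.g. a GCC build)."""
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:
        # Launch one batch of parallel jobs, discarding their output.
        procs = [subprocess.Popen(cmd, shell=True,
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
                 for _ in range(workers)]
        for p in procs:
            # A child killed by SIGSEGV returns -11 here.
            if p.wait() != 0:
                return time.monotonic() - start
    return None
```

In the real script each job is a full GCC build running in a loop, which is what makes the load heavy enough to trigger the fault; a segfaulted child shows up as a negative return code (-11 for SIGSEGV).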



      • #73
        Originally posted by oleid View Post
        Edit: No problem after 1.5h++
        You are using Arch, right? I have an unstable system on Ubuntu and Fedora 26, but on Fedora 26 it takes much longer to fail. Antergos (which is Arch based) also seems much better at avoiding the failure. I am still double-checking this by running more tests on my system, but under Antergos kill_ryzen.sh took at least 10,000 seconds to fail (almost 3 hours) in the first two tries, while under Ubuntu it takes a couple of minutes at most.

        My theory is that GCC 6.3 generates code that triggers the bug easily, while GCC 7.1.1 (which seems to be the default compiler in the Arch toolchain) generates code that is less likely to trigger it. Since Linux and the BSDs are compiled with GCC, this may explain why it is easier to see the problem on those systems. WSL also runs lots of GCC-compiled code under Windows.

        Note that I am not saying that GCC is the culprit here. It is probably generating good, valid code, since the same code runs fine on other CPUs. But if we could find out what this code path is, maybe we could help AMD engineers find a workaround (even if that means asking GCC and other compilers to avoid it).

        P.S.: AMD, if this helps you, I want a flagship processor every year :-)

        Edit: If you want to make sure that your system is good, you should run kill_ryzen.sh for at least 24 hours. I've been there: after tweaking my system (and turning off SMT) I got 5 hours, and then bang, another segfault.
        Last edited by pjssilva; 05 August 2017, 07:42 AM.



        • #74
          We've had access to an EPYC 7601 system at work and ran highly parallel HPC workloads (HPL, HPCG, Stream etc.) on it. Not a single segfault in three days. I also went and compiled the 4.12 Linux kernel with GCC 7.1.0 with a parallelization factor of up to 256 (make -j256) multiple times, completely stable.

          On the other hand I have a Ryzen 7 1600X on an Asus Prime B350M-A with 16 GB of DDR4 memory running at 2933 MHz at home, with the default 4.12.2 Arch Linux kernel. I just checked out darktable from GitHub and did a make -j12, and immediately had one of the GCC instances crash with a segfault.



          • #75
            Another hint: once the system starts failing, it sometimes seems to enter a state where failure is much more likely. To get back to "normal" I have the feeling that you have to at least power the computer off and wait a little, maybe even clear the CMOS (though I am not sure about that). My system was taking 10,000 s to fail with kill_ryzen.sh under Antergos; after a failed test I rebooted and tried again, and it failed in 80 s. Unfortunately I am not close to my system now, I am accessing it over ssh, so I cannot turn it off, wait 10 minutes and try again.



            • #76
              Originally posted by sturmflut View Post
              We've had access to an EPYC 7601 system at work and ran highly parallel HPC workloads (HPL, HPCG, Stream etc.) on it. Not a single segfault in three days. I also went and compiled the 4.12 Linux kernel with GCC 7.1.0 with a parallelization factor of up to 256 (make -j256) multiple times, completely stable.

              On the other hand I have a Ryzen 7 1600X on an Asus Prime B350M-A with 16 GB of DDR4 memory running at 2933 MHz at home, with the default 4.12.2 Arch Linux kernel. I just checked out darktable from GitHub and did a make -j12, immediately had one of the GCC instances crashing with a segfault.
              Can you run kill_ryzen.sh on this EPYC system? There is already a report of a failing system on Reddit:

              https://www.reddit.com/r/Amd/comment...g_performance/

              Edit: A real test takes many hours; 24 hours seems best.



              • #77
                I've also been able to reproduce it with my own test: 17 parallel compilations of the it87 module (https://github.com/groeck/it87).
                It takes longer to fail, though (about 15 minutes).

                Interestingly, it fails in grep and not in gcc!
                xargs: grep: terminated by signal 11

                It seems related to a very heavy load that starts and stops many processes.

                Edit: line from dmesg
                [ 5451.347874] grep[31531]: segfault at 7fdc9ac952e0 ip 00007fdc9aa65c20 sp 00007ffe0c142038 error 4 in libc-2.23.so[7fdc9aa3b000+1c0000]

                For reference, here are the lines I get from kill-ryzen.sh:
                [ 6819.611864] bash[3513]: segfault at 47ab00 ip 00007fa323b5ed13 sp 00007fffdcdfe160 error 7 in libc-2.23.so[7fa3239f8000+1c0000]
                [ 6840.225315] traps: bash[31022] trap invalid opcode ip:48db90 sp:7ffe38799968 error:0 in bash[400000+f4000]
                [ 7040.246725] bash[12027]: segfault at 64 ip 00000000004b8a20 sp 00007fff9377a438 error 4 in bash[400000+f4000]
                [ 7040.247237] bash[11960]: segfault at 64 ip 00000000004b8a20 sp 00007ffc34f03d18 error 4 in bash[400000+f4000]
                [ 7164.667555] traps: bash[27611] trap invalid opcode ip:48db90 sp:7ffdc8269018 error:0 in bash[400000+f4000]
                [ 7197.126370] bash[28846]: segfault at 0 ip 00007fa06a237746 sp 00007fff416b8950 error 4 in libc-2.23.so[7fa06a1ac000+1c0000]
                [ 7224.309228] bash[21097]: segfault at e86 ip 00000000004b8a20 sp 00007fff49d9a958 error 4 in bash[400000+f4000]
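
As an aside, the "error N" value in those dmesg lines is the x86 page-fault error code, a small bit field the kernel prints with every segfault. A minimal decoder (the line format is taken from the logs above):

```python
import re

# Low bits of the x86 page-fault error code as reported by the kernel.
PRESENT = 1  # bit 0: page was present (protection fault) vs. not mapped
WRITE = 2    # bit 1: faulting access was a write (else a read)
USER = 4     # bit 2: fault happened in user mode (else kernel mode)


def decode_segfault(dmesg_line):
    """Extract and decode the error code from a kernel segfault line."""
    m = re.search(r"error (\d+)", dmesg_line)
    if not m:
        return None
    code = int(m.group(1))
    return {
        "user_mode": bool(code & USER),
        "write": bool(code & WRITE),
        "page_present": bool(code & PRESENT),
    }
```

So "error 4" in the grep crash is a user-mode read of an unmapped address, while "error 7" in the kill-ryzen.sh lines is a user-mode write to a page that is present but protected.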

                Edit2: tested again with everything at stock except disabling core boost, SVM (virtualization), global C-states and the IOMMU.
                No dice, it failed exactly the same.

                Note: in all cases I was running the AMD staging 4.11 kernel compiled with GCC 5.4 on Ubuntu.
                Last edited by gurv; 05 August 2017, 10:33 AM.



                • #78
                  Originally posted by nightmarex View Post
                  Will ask, stay tuned.

                  I can't type without sounding pissed off, I'm sorry, but "will ask, stay tuned" sounded like you would report back on the state of affairs there. Just letting you know people are still waiting for any response.
                  My apologies, you were at least partially right. I did say I would ask for confirmation on whether filing support tickets etc... was still the right thing to do.



                  • #79
                    Originally posted by bridgman View Post

                    My apologies, you were at least partially right. I did say I would ask for confirmation on whether filing support tickets etc... was still the right thing to do.
                    Yeah, it would be nice to have any kind of update on this problem, especially seeing as EPYC also exhibits it: https://www.reddit.com/r/Amd/comment...g_performance/

                    I've not had any problems whatsoever in my day-to-day usage, but I must admit I am a bit worried that it could cause issues with other workloads at some point in the future.
                    Or it could lower the resale value of this B1 stepping.



                    • #80
                      I used to see this bug on one system used for highly parallel loads (compiling and video encoding), but after a BIOS update with AGESA 1.0.0.6a the problem seems to be gone.

