Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs

  • #81
    Originally posted by sturmflut
    We've had access to an EPYC 7601 system at work and ran highly parallel HPC workloads (HPL, HPCG, Stream etc.) on it. Not a single segfault in three days. I also went and compiled the 4.12 Linux kernel with GCC 7.1.0 with a parallelization factor of up to 256 (make -j256) multiple times, completely stable.

    On the other hand I have a Ryzen 5 1600X on an Asus Prime B350M-A with 16 GB of DDR4 memory running at 2933 MHz at home, with the default 4.12.2 Arch Linux kernel. I just checked out darktable from GitHub and did a make -j12, immediately had one of the GCC instances crashing with a segfault.
    Can you run kill_ryzen.sh on this Epyc system? There is already a report of a failed system on Reddit:

    https://www.reddit.com/r/Amd/comment...g_performance/

    Edit: A real test takes many hours; 24 hours seems best.



    • #82
      I've also been able to reproduce it with my own test: 17 parallel compilations of the it87 module (https://github.com/groeck/it87).
      It takes longer to fail, though (about 15 minutes).

      Interestingly it fails in grep and not gcc!
      xargs: grep: terminated by signal 11

      It seems related to a very heavy load that starts and stops many processes.
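A minimal sketch of that kind of load, in case anyone wants to script it: run the same command in many parallel workers, round after round, and count how many copies die from signal 11. The names `run_once` and `stress` are made up for illustration, and this is not how kill-ryzen.sh works internally. It only sees the direct child, so a segfault in a grandchild (e.g. grep under xargs under make) still has to be found in dmesg.

```python
import signal
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run one command; a negative return code means it was killed by that signal."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def stress(cmd, jobs, rounds):
    """Run `cmd` in `jobs` parallel copies for `rounds` rounds.

    Returns how many copies were terminated by SIGSEGV (signal 11).
    """
    segfaults = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for _ in range(rounds):
            # Each worker thread just blocks on its own child process.
            for status in pool.map(run_once, [cmd] * jobs):
                if status == -signal.SIGSEGV:
                    segfaults += 1
    return segfaults
```

Hypothetical usage would be something like `stress(["make", "-C", "build", "clean", "all"], jobs=17, rounds=100)`, with the assumption that each worker gets its own build directory in a real run so the compilations do not trample each other.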

      Edit: line from dmesg
      [ 5451.347874] grep[31531]: segfault at 7fdc9ac952e0 ip 00007fdc9aa65c20 sp 00007ffe0c142038 error 4 in libc-2.23.so[7fdc9aa3b000+1c0000]

      For reference, these are the lines I get from kill-ryzen.sh:
      [ 6819.611864] bash[3513]: segfault at 47ab00 ip 00007fa323b5ed13 sp 00007fffdcdfe160 error 7 in libc-2.23.so[7fa3239f8000+1c0000]
      [ 6840.225315] traps: bash[31022] trap invalid opcode ip:48db90 sp:7ffe38799968 error:0 in bash[400000+f4000]
      [ 7040.246725] bash[12027]: segfault at 64 ip 00000000004b8a20 sp 00007fff9377a438 error 4 in bash[400000+f4000]
      [ 7040.247237] bash[11960]: segfault at 64 ip 00000000004b8a20 sp 00007ffc34f03d18 error 4 in bash[400000+f4000]
      [ 7164.667555] traps: bash[27611] trap invalid opcode ip:48db90 sp:7ffdc8269018 error:0 in bash[400000+f4000]
      [ 7197.126370] bash[28846]: segfault at 0 ip 00007fa06a237746 sp 00007fff416b8950 error 4 in libc-2.23.so[7fa06a1ac000+1c0000]
      [ 7224.309228] bash[21097]: segfault at e86 ip 00000000004b8a20 sp 00007fff49d9a958 error 4 in bash[400000+f4000]
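For anyone comparing logs, here is a rough sketch of how these segfault lines can be decoded. It assumes the usual x86 page-fault error code layout (bit 0 = protection violation rather than not-present page, bit 1 = write, bit 2 = user mode) and that the kernel prints the error code in hex. The function name `decode` and the dict keys are my own; the "trap invalid opcode" lines have a different format and are simply skipped.

```python
import re

# Matches e.g. "grep[31531]: segfault at ... ip ... sp ... error 4 in libc-2.23.so[base+size]"
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode(line):
    """Decode one dmesg segfault line into a dict, or return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    info = {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "addr": int(m["addr"], 16),  # faulting data address
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "object": m["obj"],
        "offset": None,
    }
    if m["base"] is not None:
        # Offset of the faulting instruction inside the mapped object;
        # unlike the raw ip, this stays comparable across runs with ASLR.
        info["offset"] = int(m["ip"], 16) - int(m["base"], 16)
    return info
```

Read this way, "error 4" is a user-mode read of a page that is not mapped, and "error 7" is a user-mode write that hit a protection violation.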

      Edit2: Tested again with everything at stock, except that I disabled core boost, SVM (virtualization), global C-state control and IOMMU.
      No dice, failed exactly the same.

      Note: in all cases I was running the AMD staging 4.11 kernel compiled with GCC 5.4 on Ubuntu.
      Last edited by gurv; 08-05-2017, 10:33 AM.



      • #83
        Originally posted by nightmarex
        "Will ask, stay tuned."

        I can't type without sounding pissed off, I'm sorry, however "will ask, stay tuned" seems like you would provide information back as to the state of affairs there. Just letting you know they keep waiting for any response.
        My apologies, you were at least partially right. I did say I would ask for confirmation on whether filing support tickets etc... was still the right thing to do.



        • #84
          Originally posted by bridgman

          My apologies, you were at least partially right. I did say I would ask for confirmation on whether filing support tickets etc... was still the right thing to do.
          Yeah it would be nice to have any kind of update on this problem, especially seeing as Epyc also exhibits it: https://www.reddit.com/r/Amd/comment...g_performance/

          I've not had any problem whatsoever in my day to day usage but I must admit I am a bit worried that it could cause issues with other workloads at some point in the future.
          Or it could also lower the resale value of this B1 stepping.



          • #85
            I used to see this bug on one system used for highly parallel loads (compiling and video encoding), however after a BIOS update w/ AGESA 1.0.0.6a, the problem seems to be gone.



            • #86
              The Epyc report looks like a red herring - it only shows conftest segfaults when running PTS and you can get those on any CPU AFAICS.

              Here's what I have so far from running the same test on Kaveri:

              Code:
              [135426.023713] conftest[7321]: segfault at 0 ip 00000000004005c0 sp 00007ffc50e9c100 error 4 in conftest[400000+1000]
              [135426.380393] conftest[7350]: segfault at 0 ip 00007f182564ea56 sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]
              [135820.308320] conftest[904]: segfault at 0 ip 00000000004005c0 sp 00007ffda06058c0 error 4 in conftest[400000+1000]
              [135820.705215] conftest[1101]: segfault at 0 ip 00007f75fcebea56 sp 00007fff4d29efc8 error 4 in libc-2.19.so[7f75fcd7b000+1be000]
              [136234.276660] conftest[31854]: segfault at 0 ip 00000000004005c0 sp 00007ffc41f98ba0 error 4 in conftest[400000+1000]
              [136235.639014] conftest[31934]: segfault at 0 ip 00007f0848a1ea56 sp 00007ffca99e0888 error 4 in libc-2.19.so[7f08488db000+1be000]
              [136646.822395] conftest[27283]: segfault at 0 ip 00000000004005c0 sp 00007ffe6758ea60 error 4 in conftest[400000+1000]
              [136648.542144] conftest[27819]: segfault at 0 ip 00007ff88dc66a56 sp 00007ffc327e03f8 error 4 in libc-2.19.so[7ff88db23000+1be000]
              [137037.665595] conftest[22331]: segfault at 0 ip 00000000004005c0 sp 00007ffc9f4fade0 error 4 in conftest[400000+1000]
              [137038.893662] conftest[22518]: segfault at 0 ip 00007f91f542ea56 sp 00007ffeb61ca558 error 4 in libc-2.19.so[7f91f52eb000+1be000]
              [137436.512426] conftest[6036]: segfault at 0 ip 00000000004005c0 sp 00007ffeb6ee6b90 error 4 in conftest[400000+1000]
              [137437.066146] conftest[6247]: segfault at 0 ip 00007f324f026a56 sp 00007fffa59a6838 error 4 in libc-2.19.so[7f324eee3000+1be000]
              [137909.062105] conftest[3874]: segfault at 0 ip 00000000004005c0 sp 00007fffafeffee0 error 4 in conftest[400000+1000]
              [137910.077744] conftest[3993]: segfault at 0 ip 00007f13b1b76a56 sp 00007ffe03d113a8 error 4 in libc-2.19.so[7f13b1a33000+1be000]
              [138275.255002] conftest[1637]: segfault at 0 ip 00000000004005c0 sp 00007fff2281f870 error 4 in conftest[400000+1000]
              [138276.942921] conftest[2564]: segfault at 0 ip 00007f57f415ea56 sp 00007ffc69761cc8 error 4 in libc-2.19.so[7f57f401b000+1be000]
              [138687.993744] conftest[23730]: segfault at 0 ip 00000000004005c0 sp 00007ffe8d08c350 error 4 in conftest[400000+1000]
              [138688.968782] conftest[23944]: segfault at 0 ip 00007f55bc4cea56 sp 00007fffe432cb98 error 4 in libc-2.19.so[7f55bc38b000+1be000]
              [139104.852573] conftest[9705]: segfault at 0 ip 00000000004005c0 sp 00007ffcac10a4a0 error 4 in conftest[400000+1000]
              [139105.563900] conftest[10188]: segfault at 0 ip 00007fad16276a56 sp 00007ffe907a3698 error 4 in libc-2.19.so[7fad16133000+1be000]
              EDIT - I also get the same conftest segfault messages on an Intel 2600K at the office.

              Going back to yard work
              Last edited by bridgman; 08-05-2017, 03:41 PM.



              • #87
                Originally posted by DanielG
                Looks like FreeBSD and DragonflyBSD developers have debugged and fixed it in their kernels: https://svnweb.freebsd.org/base?view...evision=321899

                Might be worth mentioning in the newspost
                Yikes!!! I certainly hope this report isn't true. If it's really a microcode bug preventing interrupt returns someone has screwed up big time. But it doesn't make any sense. This type of locking would show up easily and quickly in any type of thorough thread testing. Heck it would probably show up fairly quickly in random fault coverage tests. In any case I certainly hope it can be fixed quickly, as it's important for everyone that the Ryzen architecture succeeds.



                • #88
                  Originally posted by muncrief

                  Yikes!!! I certainly hope this report isn't true. If it's really a microcode bug preventing interrupt returns someone has screwed up big time. But it doesn't make any sense. This type of locking would show up easily and quickly in any type of thorough thread testing. Heck it would probably show up fairly quickly in random fault coverage tests. In any case I certainly hope it can be fixed quickly, as it's important for everyone that the Ryzen architecture succeeds.
                  I hope it can be fixed too, but it's a CPU, so a fix might compromise performance. Worst case, it can't be fixed.



                  • #89
                    Originally posted by DanielG
                    Looks like FreeBSD and DragonflyBSD developers have debugged and fixed it in their kernels: https://svnweb.freebsd.org/base?view...evision=321899

                    Might be worth mentioning in the newspost
                    Matthew Dillon was very specific when he wrote that the workaround (not a fix) only makes the bug appear less frequently.



                    • #90
                      Originally posted by bridgman
                      The Epyc report looks like a red herring - it only shows conftest segfaults when running PTS and you can get those on any CPU AFAICS.

                      Here's what I have so far from running the same test on Kaveri:

                      Code:
                      [135426.023713] conftest[7321]: segfault at 0 ip 00000000004005c0 sp 00007ffc50e9c100 error 4 in conftest[400000+1000]
                      [135426.380393] conftest[7350]: segfault at 0 ip 00007f182564ea56 sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]
                      [135820.308320] conftest[904]: segfault at 0 ip 00000000004005c0 sp 00007ffda06058c0 error 4 in conftest[400000+1000]
                      [135820.705215] conftest[1101]: segfault at 0 ip 00007f75fcebea56 sp 00007fff4d29efc8 error 4 in libc-2.19.so[7f75fcd7b000+1be000]
                      [136234.276660] conftest[31854]: segfault at 0 ip 00000000004005c0 sp 00007ffc41f98ba0 error 4 in conftest[400000+1000]
                      [136235.639014] conftest[31934]: segfault at 0 ip 00007f0848a1ea56 sp 00007ffca99e0888 error 4 in libc-2.19.so[7f08488db000+1be000]
                      [136646.822395] conftest[27283]: segfault at 0 ip 00000000004005c0 sp 00007ffe6758ea60 error 4 in conftest[400000+1000]
                      [136648.542144] conftest[27819]: segfault at 0 ip 00007ff88dc66a56 sp 00007ffc327e03f8 error 4 in libc-2.19.so[7ff88db23000+1be000]
                      [137037.665595] conftest[22331]: segfault at 0 ip 00000000004005c0 sp 00007ffc9f4fade0 error 4 in conftest[400000+1000]
                      [137038.893662] conftest[22518]: segfault at 0 ip 00007f91f542ea56 sp 00007ffeb61ca558 error 4 in libc-2.19.so[7f91f52eb000+1be000]
                      [137436.512426] conftest[6036]: segfault at 0 ip 00000000004005c0 sp 00007ffeb6ee6b90 error 4 in conftest[400000+1000]
                      [137437.066146] conftest[6247]: segfault at 0 ip 00007f324f026a56 sp 00007fffa59a6838 error 4 in libc-2.19.so[7f324eee3000+1be000]
                      [137909.062105] conftest[3874]: segfault at 0 ip 00000000004005c0 sp 00007fffafeffee0 error 4 in conftest[400000+1000]
                      [137910.077744] conftest[3993]: segfault at 0 ip 00007f13b1b76a56 sp 00007ffe03d113a8 error 4 in libc-2.19.so[7f13b1a33000+1be000]
                      [138275.255002] conftest[1637]: segfault at 0 ip 00000000004005c0 sp 00007fff2281f870 error 4 in conftest[400000+1000]
                      [138276.942921] conftest[2564]: segfault at 0 ip 00007f57f415ea56 sp 00007ffc69761cc8 error 4 in libc-2.19.so[7f57f401b000+1be000]
                      [138687.993744] conftest[23730]: segfault at 0 ip 00000000004005c0 sp 00007ffe8d08c350 error 4 in conftest[400000+1000]
                      [138688.968782] conftest[23944]: segfault at 0 ip 00007f55bc4cea56 sp 00007fffe432cb98 error 4 in libc-2.19.so[7f55bc38b000+1be000]
                      [139104.852573] conftest[9705]: segfault at 0 ip 00000000004005c0 sp 00007ffcac10a4a0 error 4 in conftest[400000+1000]
                      [139105.563900] conftest[10188]: segfault at 0 ip 00007fad16276a56 sp 00007ffe907a3698 error 4 in libc-2.19.so[7fad16133000+1be000]
                      EDIT - I also get the same conftest segfault messages on an Intel 2600K at the office.

                      Going back to yard work
                      Yep, this conftest nonsense is likely unrelated to the actual reports of gcc and friends crashing. Some configure conftests are simply badly coded and always segfault, e.g. BSD tests crashing on Linux, or defective code that accidentally worked on an old glibc or Linux kernel and now fails due to a different memory layout, address space randomization, etc., or kernel bug fixes.
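One way to check the "always segfault" point from the logs themselves is to count how often each crash site repeats. This is only a sketch: `fault_sites` is a made-up helper, and the heuristic behind it is an assumption, not a proof. If every crash lands at the same offset in the same binary with the same fault address, it looks like a deterministic software bug; crash sites that wander around are consistent with flaky hardware but do not prove it.

```python
import re
from collections import Counter

# Pulls the fault address, instruction pointer, mapped object and its base
# address out of a dmesg "segfault at ..." line.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) .*? in "
    r"(?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+[0-9a-f]+\]"
)


def fault_sites(lines):
    """Count segfaults per (object, instruction offset, fault address)."""
    sites = Counter()
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if m:
            # Subtract the mapping base so ASLR does not hide repeated sites.
            offset = int(m["ip"], 16) - int(m["base"], 16)
            sites[(m["obj"], offset, int(m["addr"], 16))] += 1
    return sites
```

Fed the conftest lines quoted above, the first two pairs collapse into just two sites, one in conftest itself and one in libc-2.19.so, both faulting at address 0, which fits a NULL dereference in the test program rather than random corruption.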
