Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs

  • #81
    The Epyc report looks like a red herring - it only shows conftest segfaults when running PTS and you can get those on any CPU AFAICS.

    Here's what I have so far from running the same test on Kaveri:

    Code:
    [135426.023713] conftest[7321]: segfault at 0 ip 00000000004005c0 sp 00007ffc50e9c100 error 4 in conftest[400000+1000]
    [135426.380393] conftest[7350]: segfault at 0 ip 00007f182564ea56 sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]
    [135820.308320] conftest[904]: segfault at 0 ip 00000000004005c0 sp 00007ffda06058c0 error 4 in conftest[400000+1000]
    [135820.705215] conftest[1101]: segfault at 0 ip 00007f75fcebea56 sp 00007fff4d29efc8 error 4 in libc-2.19.so[7f75fcd7b000+1be000]
    [136234.276660] conftest[31854]: segfault at 0 ip 00000000004005c0 sp 00007ffc41f98ba0 error 4 in conftest[400000+1000]
    [136235.639014] conftest[31934]: segfault at 0 ip 00007f0848a1ea56 sp 00007ffca99e0888 error 4 in libc-2.19.so[7f08488db000+1be000]
    [136646.822395] conftest[27283]: segfault at 0 ip 00000000004005c0 sp 00007ffe6758ea60 error 4 in conftest[400000+1000]
    [136648.542144] conftest[27819]: segfault at 0 ip 00007ff88dc66a56 sp 00007ffc327e03f8 error 4 in libc-2.19.so[7ff88db23000+1be000]
    [137037.665595] conftest[22331]: segfault at 0 ip 00000000004005c0 sp 00007ffc9f4fade0 error 4 in conftest[400000+1000]
    [137038.893662] conftest[22518]: segfault at 0 ip 00007f91f542ea56 sp 00007ffeb61ca558 error 4 in libc-2.19.so[7f91f52eb000+1be000]
    [137436.512426] conftest[6036]: segfault at 0 ip 00000000004005c0 sp 00007ffeb6ee6b90 error 4 in conftest[400000+1000]
    [137437.066146] conftest[6247]: segfault at 0 ip 00007f324f026a56 sp 00007fffa59a6838 error 4 in libc-2.19.so[7f324eee3000+1be000]
    [137909.062105] conftest[3874]: segfault at 0 ip 00000000004005c0 sp 00007fffafeffee0 error 4 in conftest[400000+1000]
    [137910.077744] conftest[3993]: segfault at 0 ip 00007f13b1b76a56 sp 00007ffe03d113a8 error 4 in libc-2.19.so[7f13b1a33000+1be000]
    [138275.255002] conftest[1637]: segfault at 0 ip 00000000004005c0 sp 00007fff2281f870 error 4 in conftest[400000+1000]
    [138276.942921] conftest[2564]: segfault at 0 ip 00007f57f415ea56 sp 00007ffc69761cc8 error 4 in libc-2.19.so[7f57f401b000+1be000]
    [138687.993744] conftest[23730]: segfault at 0 ip 00000000004005c0 sp 00007ffe8d08c350 error 4 in conftest[400000+1000]
    [138688.968782] conftest[23944]: segfault at 0 ip 00007f55bc4cea56 sp 00007fffe432cb98 error 4 in libc-2.19.so[7f55bc38b000+1be000]
    [139104.852573] conftest[9705]: segfault at 0 ip 00000000004005c0 sp 00007ffcac10a4a0 error 4 in conftest[400000+1000]
    [139105.563900] conftest[10188]: segfault at 0 ip 00007fad16276a56 sp 00007ffe907a3698 error 4 in libc-2.19.so[7fad16133000+1be000]
    EDIT - I also get the same conftest segfault messages on an Intel 2600K at the office.
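
For anyone unfamiliar with the format, here is a minimal Python sketch (not part of the original post) that pulls the fields out of one of these kernel log lines and decodes the "error" value using the x86 page-fault error-code bits. The names parse_segfault and decode_error are made up for illustration, and the pattern assumes the x86 Linux log format shown above. For these lines it points to a user-mode read of address 0, which looks like a plain NULL pointer dereference rather than anything CPU-specific.

```python
import re

# Pattern for the x86 Linux "segfault at ..." log lines shown above.
# Every numeric field except the PID is printed in hex.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def parse_segfault(line):
    """Return the fields of one segfault log line as a dict, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        "sp": int(m.group("sp"), 16),
        "err": int(m.group("err"), 16),
    }


def decode_error(code):
    """Decode the low bits of an x86 page-fault error code."""
    cause = "protection violation" if code & 0x1 else "page not present"
    access = "write" if code & 0x2 else "read"
    if code & 0x10:
        access = "instruction fetch"
    mode = "user" if code & 0x4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


line = ("[135426.023713] conftest[7321]: segfault at 0 ip 00000000004005c0 "
        "sp 00007ffc50e9c100 error 4 in conftest[400000+1000]")
info = parse_segfault(line)
print(info["comm"], hex(info["addr"]), decode_error(info["err"]))
```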

    Going back to yard work
    Last edited by bridgman; 05 August 2017, 03:41 PM.
    Test signature


    • #82
      Originally posted by DanielG View Post
      Looks like FreeBSD and DragonflyBSD developers have debugged and fixed it in their kernels: https://svnweb.freebsd.org/base?view...evision=321899

      Might be worth mentioning in the newspost
      Yikes!!! I certainly hope this report isn't true. If it's really a microcode bug preventing interrupt returns someone has screwed up big time. But it doesn't make any sense. This type of locking would show up easily and quickly in any type of thorough thread testing. Heck it would probably show up fairly quickly in random fault coverage tests. In any case I certainly hope it can be fixed quickly, as it's important for everyone that the Ryzen architecture succeeds.


      • #83
        Originally posted by muncrief View Post

        Yikes!!! I certainly hope this report isn't true. If it's really a microcode bug preventing interrupt returns someone has screwed up big time. But it doesn't make any sense. This type of locking would show up easily and quickly in any type of thorough thread testing. Heck it would probably show up fairly quickly in random fault coverage tests. In any case I certainly hope it can be fixed quickly, as it's important for everyone that the Ryzen architecture succeeds.
        I hope it can be fixed too, but it's a CPU, so a fix might compromise performance. Worst case, it can't be fixed.


        • #84
          Originally posted by DanielG View Post
          Looks like FreeBSD and DragonflyBSD developers have debugged and fixed it in their kernels: https://svnweb.freebsd.org/base?view...evision=321899

          Might be worth mentioning in the newspost
          Matthew Dillon was very specific when he wrote that the workaround (not a fix) only makes the bug appear less frequently.


          • #85
            Originally posted by bridgman View Post
            The Epyc report looks like a red herring - it only shows conftest segfaults when running PTS and you can get those on any CPU AFAICS.

            Here's what I have so far from running the same test on Kaveri:

            Code:
            [135426.023713] conftest[7321]: segfault at 0 ip 00000000004005c0 sp 00007ffc50e9c100 error 4 in conftest[400000+1000]
            [135426.380393] conftest[7350]: segfault at 0 ip 00007f182564ea56 sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]
            [135820.308320] conftest[904]: segfault at 0 ip 00000000004005c0 sp 00007ffda06058c0 error 4 in conftest[400000+1000]
            [135820.705215] conftest[1101]: segfault at 0 ip 00007f75fcebea56 sp 00007fff4d29efc8 error 4 in libc-2.19.so[7f75fcd7b000+1be000]
            [136234.276660] conftest[31854]: segfault at 0 ip 00000000004005c0 sp 00007ffc41f98ba0 error 4 in conftest[400000+1000]
            [136235.639014] conftest[31934]: segfault at 0 ip 00007f0848a1ea56 sp 00007ffca99e0888 error 4 in libc-2.19.so[7f08488db000+1be000]
            [136646.822395] conftest[27283]: segfault at 0 ip 00000000004005c0 sp 00007ffe6758ea60 error 4 in conftest[400000+1000]
            [136648.542144] conftest[27819]: segfault at 0 ip 00007ff88dc66a56 sp 00007ffc327e03f8 error 4 in libc-2.19.so[7ff88db23000+1be000]
            [137037.665595] conftest[22331]: segfault at 0 ip 00000000004005c0 sp 00007ffc9f4fade0 error 4 in conftest[400000+1000]
            [137038.893662] conftest[22518]: segfault at 0 ip 00007f91f542ea56 sp 00007ffeb61ca558 error 4 in libc-2.19.so[7f91f52eb000+1be000]
            [137436.512426] conftest[6036]: segfault at 0 ip 00000000004005c0 sp 00007ffeb6ee6b90 error 4 in conftest[400000+1000]
            [137437.066146] conftest[6247]: segfault at 0 ip 00007f324f026a56 sp 00007fffa59a6838 error 4 in libc-2.19.so[7f324eee3000+1be000]
            [137909.062105] conftest[3874]: segfault at 0 ip 00000000004005c0 sp 00007fffafeffee0 error 4 in conftest[400000+1000]
            [137910.077744] conftest[3993]: segfault at 0 ip 00007f13b1b76a56 sp 00007ffe03d113a8 error 4 in libc-2.19.so[7f13b1a33000+1be000]
            [138275.255002] conftest[1637]: segfault at 0 ip 00000000004005c0 sp 00007fff2281f870 error 4 in conftest[400000+1000]
            [138276.942921] conftest[2564]: segfault at 0 ip 00007f57f415ea56 sp 00007ffc69761cc8 error 4 in libc-2.19.so[7f57f401b000+1be000]
            [138687.993744] conftest[23730]: segfault at 0 ip 00000000004005c0 sp 00007ffe8d08c350 error 4 in conftest[400000+1000]
            [138688.968782] conftest[23944]: segfault at 0 ip 00007f55bc4cea56 sp 00007fffe432cb98 error 4 in libc-2.19.so[7f55bc38b000+1be000]
            [139104.852573] conftest[9705]: segfault at 0 ip 00000000004005c0 sp 00007ffcac10a4a0 error 4 in conftest[400000+1000]
            [139105.563900] conftest[10188]: segfault at 0 ip 00007fad16276a56 sp 00007ffe907a3698 error 4 in libc-2.19.so[7fad16133000+1be000]
            EDIT - I also get the same conftest segfault messages on an Intel 2600K at the office.

            Going back to yard work
            Yep, this conftest nonsense is likely unrelated to the actual report of gcc and friends crashing. Some configure conftests are simply badly coded and always segfault, e.g. BSD tests crashing on Linux, or defective code that accidentally worked on an old glibc or Linux kernel and now fails due to a different memory layout, address space randomization, etc., or kernel bug fixes.
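
One way to check that these are the same deterministic crash every time rather than random faults is to subtract the mapping base (the first number in the brackets) from the ip and compare across runs. A rough Python sketch, assuming the log format quoted above; fault_signature and count_signatures are hypothetical helper names. With address space randomization the libc addresses differ per run, but the offset into the mapping should stay the same.

```python
import re
from collections import Counter

# Captures the faulting ip and the "object[base+size]" suffix of a segfault line.
FAULT_RE = re.compile(
    r"ip (?P<ip>[0-9a-f]+) .* in (?P<obj>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+[0-9a-f]+\]"
)


def fault_signature(line):
    """Return (object name, offset of ip inside its mapping), or None."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    offset = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return m.group("obj"), offset


def count_signatures(lines):
    """Count how often each (object, offset) pair occurs."""
    sigs = (fault_signature(line) for line in lines)
    return Counter(sig for sig in sigs if sig is not None)


# Four lines copied from the quoted log (two separate runs).
log = [
    "conftest[7321]: segfault at 0 ip 00000000004005c0 sp 00007ffc50e9c100 error 4 in conftest[400000+1000]",
    "conftest[7350]: segfault at 0 ip 00007f182564ea56 sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]",
    "conftest[904]: segfault at 0 ip 00000000004005c0 sp 00007ffda06058c0 error 4 in conftest[400000+1000]",
    "conftest[1101]: segfault at 0 ip 00007f75fcebea56 sp 00007fff4d29efc8 error 4 in libc-2.19.so[7f75fcd7b000+1be000]",
]
counts = count_signatures(log)
for (obj, offset), n in counts.items():
    print(f"{obj}+{offset:#x}: {n}")
```

For the lines quoted above this yields only two signatures, conftest+0x5c0 and libc-2.19.so+0x143a56, which fits the "always segfaults in the same place" reading.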


            • #86
              What, aside from the CPU, is common to the problems? A common motherboard? Overclocking? Power supply voltage jitter due to workload? Regarding CPUs, is it only the X series, like the 1800X and 1700X, or does it include the 65 watt CPUs as well?
              Crosstalk can occur if motherboard signal traces are too close to each other.

              I recall in some of my ancient high frequency circuit designs that we had to put a ground trace between pairs of signal lines. Could that be the cause of the random abort problem?
              Last edited by lsatenstein; 06 August 2017, 04:16 PM.


              • #87
                Originally posted by garegin View Post
                Can someone tell me why this doesn't happen in windows?
                My answer to you: simply because most Linux and Windows users do not run 8 or 9 large compiles concurrently and repeatedly.


                • #88
                  I do not wish to blame a vendor, but from the statistics I have seen reported on the web, including configurations, the common motherboard vendor is MSI.
                  In an earlier posting, the Asus motherboard user reported no errors.

                  I posted an idea about crosstalk between adjacent lines laid out too close together on the motherboard. It may be that they need to lay a ground line between each pair of lines.
                  I think the problem is not the CPU chip, but the hardware layout of the motherboard.


                  • #89
                    Linux is just busted again.. what else is new.. you can work around this by setting make -j1

                    Edit: With a Ryzen 7 1700 on FreeBSD 12-CURRENT, building gcc-6.4.0 from ports with default options except:
                    FORCE_MAKE_JOBS=YES MAKE_JOBS_NUMBER=17
                    CPU load shows all 16 threads at 100%. I built it 3 times with no segfaults; the build takes about 20 minutes. Ryzen is fine.. Linux, well.. hmm.. ya idk.. Oh wait, maybe I hit it with Mesa. Testing.. looks like maybe: 1 success, 1 fail. Just more prevalent on Linux? hmm..
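
The build-it-several-times check described here can be scripted. A hedged Python sketch (this is not the actual Ryzen-Test or Stress-Run script): run a command a number of times and tally clean exits, non-zero exits, and deaths by signal. The name run_repeatedly is hypothetical.

```python
import subprocess


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times; return (ok, failed, signalled) counts."""
    ok = failed = signalled = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            ok += 1
        elif result.returncode < 0:
            # On POSIX a negative return code means the process
            # itself was killed by a signal (e.g. -11 for SIGSEGV).
            signalled += 1
        else:
            failed += 1
    return ok, failed, signalled
```

As an example only, run_repeatedly(["make", "-j16"], 3) in a build tree would repeat a parallel build three times. A compiler segfault under make normally shows up as a non-zero exit from make rather than as a signal, so the build log or dmesg is still needed for details, and cleaning the tree between runs is left out of this sketch.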

                    The Gentoo people are still fucking crazy, they should be building with generic, not the intel cpu options and znver1 doesn't even work yet.
                    Last edited by k1e0x; 07 August 2017, 09:53 AM.


                    • #90
                      Originally posted by lsatenstein View Post
                      I do not wish to blame a vendor, but from the statistics I have seen reported on the web, including configurations, the common motherboard vendor is MSI.
                      In an earlier posting, the Asus motherboard user reported no errors.

                      I posted an idea about crosstalk between adjacent lines laid out too close together on the motherboard. It may be that they need to lay a ground line between each pair of lines.
                      I think the problem is not the CPU chip, but the hardware layout of the motherboard.
                      Nope, it's not only MSI: https://docs.google.com/spreadsheets...#gid=950983791
