Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs

  • kemalihsan
    replied
    Guys, artificial torture tests like these can make any CPU fail. Let's not panic.
    The "situation" may also be temporary, depending on updates to the BIOS, kernel, gcc, libc, etc.
    By the way, different Linux distributions give different results, so could it also be scheduler-related? Are people posting which scheduler they use?
    How about other compilers and interpreters? Python, Java, ...?
    You may try compiling the ns3 (http://www.nsnam.org/) network simulator, which is one of the toughest compile jobs I have seen so far!
    I would love to see a "compilation of responses" from AMD and from the motherboard manufacturers. Also from the kernel guys!

    All the best...


  • soulsource
    replied
    Originally posted by lsatenstein View Post
    I do not wish to blame a vendor, but from the statistics I have seen, and reported on the web, including configurations, the common motherboard vendor is MSI.
    In an earlier posting, the Asus motherboard user reported no errors.

    I posted an idea about crosstalk between adjacent lines laid out too close together on the motherboard. It may be that they need to lay a ground line between each pair of lines.
    I think the problem is not the CPU chip, but the hardware layout of the motherboard.
    Nope, it's not only MSI: https://docs.google.com/spreadsheets...#gid=950983791


  • k1e0x
    replied
    Linux is just busted again... what else is new. You can work around this by setting make -j1.

    Edit: With a Ryzen 7 1700 on FreeBSD 12-CURRENT, building gcc-6.4.0 from ports with default options except:
    FORCE_MAKE_JOBS=YES MAKE_JOBS_NUMBER=17
    CPU load shows all 16 threads at 100%. I built it 3 times, no segfaults. The build takes about 20 minutes. Ryzen is fine... Linux, well... hmm... yeah, idk. Oh wait, maybe I hit it with Mesa. Testing... looks like maybe: 1 success, 1 fail. Just more prevalent on Linux? Hmm...

    The Gentoo people are still fucking crazy; they should be building with generic, not the Intel CPU options, and znver1 doesn't even work yet.
    Last edited by k1e0x; 08-07-2017, 09:53 AM.


  • lsatenstein
    replied
    I do not wish to blame a vendor, but from the statistics I have seen, and reported on the web, including configurations, the common motherboard vendor is MSI.
    In an earlier posting, the Asus motherboard user reported no errors.

    I posted an idea about crosstalk between adjacent lines laid out too close together on the motherboard. It may be that they need to lay a ground line between each pair of lines.
    I think the problem is not the CPU chip, but the hardware layout of the motherboard.


  • lsatenstein
    replied
    Originally posted by garegin View Post
    Can someone tell me why this doesn't happen in windows?
    My answer to you: simply because most Linux and Windows users do not run 8 or 9 large compiles concurrently and repeatedly.
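
    For anyone who wants to see what "concurrently and repeatedly" looks like in practice, here is a minimal sketch of such a stress loop in Python. It is not the Ryzen-Test or Stress-Run script; the function names `run_once` and `stress` and the placeholder workload are made up for illustration, and it assumes a POSIX system where a process killed by a signal reports a negative exit status.

```python
import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run one job; a negative return code means it was killed by a signal (POSIX)."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode


def stress(cmd, jobs=8, rounds=3):
    """Run `jobs` copies of `cmd` at once, `rounds` times, and count SIGSEGV deaths."""
    segfaults = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for _ in range(rounds):
            codes = list(pool.map(run_once, [cmd] * jobs))
            segfaults += sum(1 for code in codes if code == -signal.SIGSEGV)
    return segfaults


if __name__ == "__main__":
    # Hypothetical placeholder workload: swap in a real compile command to load the CPU.
    demo_cmd = [sys.executable, "-c", "pass"]
    print("segfaults:", stress(demo_cmd, jobs=8, rounds=3))
```

    Note that this only counts segfaults of the command it launches directly. If the command is a wrapper such as make, a crashing compiler underneath shows up as an ordinary non-zero exit status instead, so you would have to check the build log or dmesg as well.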


  • lsatenstein
    replied
    What, aside from the CPU, is common to the problems? A common motherboard? Overclocking? Power supply voltage jitter due to workload? Regarding CPUs, is it only the X series, like the 1800X and 1700X, or does it include the 65 watt CPUs as well?
    Crosstalk can occur if motherboard signal traces are too close to each other.

    I recall that in some of my ancient high-frequency circuit designs we had to put a ground trace between pairs of signal lines. Could that be the cause of the random abort problem?
    Last edited by lsatenstein; 08-06-2017, 04:16 PM.


  • Qaridarium
    replied
    Originally posted by debianxfce View Post

    Cool down you open source believer. It is very expensive to fix errors in silicon. It is cheap to fix software bugs.
    Dude, you even call yourself "debian"+"xfce", both open source with copyleft licenses...

    In fact, this topic of hardware bugs is all about closed-source solutions, which proves the point that ISA+BIOS+microcode+firmware should be open source so that bugs like this can be found.


  • rene
    replied
    Originally posted by bridgman View Post
    The Epyc report looks like a red herring - it only shows conftest segfaults when running PTS and you can get those on any CPU AFAICS.

    Here's what I have so far from running the same test on Kaveri:

    Code:
    [135426.023713] conftest[7321]: segfault at 0 ip 00000000004005c0 sp 00007ffc50e9c100 error 4 in conftest[400000+1000]
    [135426.380393] conftest[7350]: segfault at 0 ip 00007f182564ea56 sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]
    [135820.308320] conftest[904]: segfault at 0 ip 00000000004005c0 sp 00007ffda06058c0 error 4 in conftest[400000+1000]
    [135820.705215] conftest[1101]: segfault at 0 ip 00007f75fcebea56 sp 00007fff4d29efc8 error 4 in libc-2.19.so[7f75fcd7b000+1be000]
    [136234.276660] conftest[31854]: segfault at 0 ip 00000000004005c0 sp 00007ffc41f98ba0 error 4 in conftest[400000+1000]
    [136235.639014] conftest[31934]: segfault at 0 ip 00007f0848a1ea56 sp 00007ffca99e0888 error 4 in libc-2.19.so[7f08488db000+1be000]
    [136646.822395] conftest[27283]: segfault at 0 ip 00000000004005c0 sp 00007ffe6758ea60 error 4 in conftest[400000+1000]
    [136648.542144] conftest[27819]: segfault at 0 ip 00007ff88dc66a56 sp 00007ffc327e03f8 error 4 in libc-2.19.so[7ff88db23000+1be000]
    [137037.665595] conftest[22331]: segfault at 0 ip 00000000004005c0 sp 00007ffc9f4fade0 error 4 in conftest[400000+1000]
    [137038.893662] conftest[22518]: segfault at 0 ip 00007f91f542ea56 sp 00007ffeb61ca558 error 4 in libc-2.19.so[7f91f52eb000+1be000]
    [137436.512426] conftest[6036]: segfault at 0 ip 00000000004005c0 sp 00007ffeb6ee6b90 error 4 in conftest[400000+1000]
    [137437.066146] conftest[6247]: segfault at 0 ip 00007f324f026a56 sp 00007fffa59a6838 error 4 in libc-2.19.so[7f324eee3000+1be000]
    [137909.062105] conftest[3874]: segfault at 0 ip 00000000004005c0 sp 00007fffafeffee0 error 4 in conftest[400000+1000]
    [137910.077744] conftest[3993]: segfault at 0 ip 00007f13b1b76a56 sp 00007ffe03d113a8 error 4 in libc-2.19.so[7f13b1a33000+1be000]
    [138275.255002] conftest[1637]: segfault at 0 ip 00000000004005c0 sp 00007fff2281f870 error 4 in conftest[400000+1000]
    [138276.942921] conftest[2564]: segfault at 0 ip 00007f57f415ea56 sp 00007ffc69761cc8 error 4 in libc-2.19.so[7f57f401b000+1be000]
    [138687.993744] conftest[23730]: segfault at 0 ip 00000000004005c0 sp 00007ffe8d08c350 error 4 in conftest[400000+1000]
    [138688.968782] conftest[23944]: segfault at 0 ip 00007f55bc4cea56 sp 00007fffe432cb98 error 4 in libc-2.19.so[7f55bc38b000+1be000]
    [139104.852573] conftest[9705]: segfault at 0 ip 00000000004005c0 sp 00007ffcac10a4a0 error 4 in conftest[400000+1000]
    [139105.563900] conftest[10188]: segfault at 0 ip 00007fad16276a56 sp 00007ffe907a3698 error 4 in libc-2.19.so[7fad16133000+1be000]
    EDIT - I also get the same conftest segfault messages on an Intel 2600K at the office.

    Going back to yard work
    Yep, this conftest nonsense is likely unrelated to the actual reports of gcc and friends crashing. Some configure conftests are simply badly coded and always segfault: either BSD tests crashing on Linux, or simply defective code that accidentally worked on an old glibc or Linux kernel and now fails due to a different memory layout, address space randomization, etc., or kernel bug fixes.
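
    One way to check that these conftest crashes are deterministic is to subtract the mapping base from the instruction pointer in each kernel log line. The sketch below does that in Python; the regex and the `parse_segfault` helper are illustrative assumptions written against the log format quoted above, not part of any tool mentioned in this thread. The error field is decoded using the x86 page-fault error code bits (bit 0: page present, bit 1: write access, bit 2: user mode).

```python
import re

# Matches kernel lines like:
# conftest[7350]: segfault at 0 ip ... sp ... error 4 in libc-2.19.so[base+size]
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    return {
        "comm": m["comm"],
        "addr": int(m["addr"], 16),  # faulting address (0 means NULL)
        "object": m["obj"],
        # Offset of the faulting instruction inside the mapped object.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    sample = (
        "[135426.380393] conftest[7350]: segfault at 0 ip 00007f182564ea56 "
        "sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]"
    )
    info = parse_segfault(sample)
    print(info["object"], hex(info["offset"]), info)
```

    For the lines quoted above, the libc offset comes out the same even though the load address differs per run, and "error 4" at address 0 decodes to a user-mode read of a NULL pointer. That suggests the same instruction faults every time, which fits a badly coded test far better than a random hardware fault.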


  • Beherit
    replied
    Originally posted by DanielG View Post
    Looks like FreeBSD and DragonflyBSD developers have debugged and fixed it in their kernels: https://svnweb.freebsd.org/base?view...evision=321899

    Might be worth mentioning in the newspost
    Matthew Dillon was very specific when he wrote that the workaround (not a fix) only makes the bug appear less frequently.


  • tjukken
    replied
    Originally posted by muncrief View Post

    Yikes!!! I certainly hope this report isn't true. If it's really a microcode bug preventing interrupt returns someone has screwed up big time. But it doesn't make any sense. This type of locking would show up easily and quickly in any type of thorough thread testing. Heck it would probably show up fairly quickly in random fault coverage tests. In any case I certainly hope it can be fixed quickly, as it's important for everyone that the Ryzen architecture succeeds.
    I hope it can be fixed too, but it's a CPU, so a fix might compromise performance. Worst case, it can't be fixed.
