Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs


  • muncrief
    replied
    Originally posted by DanielG View Post
    Looks like FreeBSD and DragonflyBSD developers have debugged and fixed it in their kernels: https://svnweb.freebsd.org/base?view...evision=321899

    Might be worth mentioning in the newspost
    Yikes!!! I certainly hope this report isn't true. If it's really a microcode bug preventing interrupt returns, someone has screwed up big time. But it doesn't make any sense: this type of lockup would show up easily and quickly in any kind of thorough thread testing. Heck, it would probably show up fairly quickly in random fault-coverage tests. In any case, I certainly hope it can be fixed quickly, as it's important for everyone that the Ryzen architecture succeeds.



  • bridgman
    replied
    The Epyc report looks like a red herring - it only shows conftest segfaults when running PTS and you can get those on any CPU AFAICS.

    Here's what I have so far from running the same test on Kaveri:

    Code:
    [135426.023713] conftest[7321]: segfault at 0 ip 00000000004005c0 sp 00007ffc50e9c100 error 4 in conftest[400000+1000]
    [135426.380393] conftest[7350]: segfault at 0 ip 00007f182564ea56 sp 00007ffce0ef89d8 error 4 in libc-2.19.so[7f182550b000+1be000]
    [135820.308320] conftest[904]: segfault at 0 ip 00000000004005c0 sp 00007ffda06058c0 error 4 in conftest[400000+1000]
    [135820.705215] conftest[1101]: segfault at 0 ip 00007f75fcebea56 sp 00007fff4d29efc8 error 4 in libc-2.19.so[7f75fcd7b000+1be000]
    [136234.276660] conftest[31854]: segfault at 0 ip 00000000004005c0 sp 00007ffc41f98ba0 error 4 in conftest[400000+1000]
    [136235.639014] conftest[31934]: segfault at 0 ip 00007f0848a1ea56 sp 00007ffca99e0888 error 4 in libc-2.19.so[7f08488db000+1be000]
    [136646.822395] conftest[27283]: segfault at 0 ip 00000000004005c0 sp 00007ffe6758ea60 error 4 in conftest[400000+1000]
    [136648.542144] conftest[27819]: segfault at 0 ip 00007ff88dc66a56 sp 00007ffc327e03f8 error 4 in libc-2.19.so[7ff88db23000+1be000]
    [137037.665595] conftest[22331]: segfault at 0 ip 00000000004005c0 sp 00007ffc9f4fade0 error 4 in conftest[400000+1000]
    [137038.893662] conftest[22518]: segfault at 0 ip 00007f91f542ea56 sp 00007ffeb61ca558 error 4 in libc-2.19.so[7f91f52eb000+1be000]
    [137436.512426] conftest[6036]: segfault at 0 ip 00000000004005c0 sp 00007ffeb6ee6b90 error 4 in conftest[400000+1000]
    [137437.066146] conftest[6247]: segfault at 0 ip 00007f324f026a56 sp 00007fffa59a6838 error 4 in libc-2.19.so[7f324eee3000+1be000]
    [137909.062105] conftest[3874]: segfault at 0 ip 00000000004005c0 sp 00007fffafeffee0 error 4 in conftest[400000+1000]
    [137910.077744] conftest[3993]: segfault at 0 ip 00007f13b1b76a56 sp 00007ffe03d113a8 error 4 in libc-2.19.so[7f13b1a33000+1be000]
    [138275.255002] conftest[1637]: segfault at 0 ip 00000000004005c0 sp 00007fff2281f870 error 4 in conftest[400000+1000]
    [138276.942921] conftest[2564]: segfault at 0 ip 00007f57f415ea56 sp 00007ffc69761cc8 error 4 in libc-2.19.so[7f57f401b000+1be000]
    [138687.993744] conftest[23730]: segfault at 0 ip 00000000004005c0 sp 00007ffe8d08c350 error 4 in conftest[400000+1000]
    [138688.968782] conftest[23944]: segfault at 0 ip 00007f55bc4cea56 sp 00007fffe432cb98 error 4 in libc-2.19.so[7f55bc38b000+1be000]
    [139104.852573] conftest[9705]: segfault at 0 ip 00000000004005c0 sp 00007ffcac10a4a0 error 4 in conftest[400000+1000]
    [139105.563900] conftest[10188]: segfault at 0 ip 00007fad16276a56 sp 00007ffe907a3698 error 4 in libc-2.19.so[7fad16133000+1be000]
    EDIT - I also get the same conftest segfault messages on an Intel 2600K at the office.

    Going back to yard work
    Last edited by bridgman; 08-05-2017, 03:41 PM.
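
    A note on why conftest segfaults are expected everywhere: conftest binaries come from autoconf feature probes that configure compiles and runs, and some probes are designed to crash so configure can record the failure. A minimal sketch of the idea (illustrative only, not the actual probe PTS runs):

    ```c
    /* Sketch of why "conftest" segfaults show up on any CPU: autoconf
     * run-tests compile a tiny probe program, execute it, and record
     * whether it exits cleanly or dies with a signal. Some probes are
     * expected to crash. Here the crash happens in a forked child so
     * the demo itself exits cleanly. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Returns 1 if the probe child died with SIGSEGV, 0 otherwise. */
    int probe_segfaults(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: a probe that dereferences address 0, like the
             * "segfault at 0" lines in the dmesg output above */
            volatile char *p = NULL;
            _exit((int)strlen((const char *)p)); /* faults before _exit */
        }
        int status = 0;
        waitpid(pid, &status, 0);
        return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
    }

    int main(void)
    {
        puts(probe_segfaults()
                 ? "probe died with SIGSEGV (expected on any CPU)"
                 : "probe exited cleanly");
        return 0;
    }
    ```

    Every run of such a probe leaves a "conftest ... segfault" line in dmesg even on a perfectly healthy CPU, which matches the Kaveri and 2600K logs above.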



  • vito
    replied
    I used to see this bug on one system used for highly parallel loads (compiling and video encoding); however, after a BIOS update with AGESA 1.0.0.6a, the problem seems to be gone.



  • gurv
    replied
    Originally posted by bridgman View Post

    My apologies, you were at least partially right. I did say I would ask for confirmation on whether filing support tickets etc... was still the right thing to do.
    Yeah it would be nice to have any kind of update on this problem, especially seeing as Epyc also exhibits it: https://www.reddit.com/r/Amd/comment...g_performance/

    I've not had any problem whatsoever in my day-to-day usage, but I must admit I am a bit worried that it could cause issues with other workloads at some point in the future.
    Or it could also lower the resale value of this B1 stepping.



  • bridgman
    replied
    Originally posted by nightmarex View Post
    "Will ask, stay tuned."

    I can't type this without sounding pissed off, I'm sorry; however, "will ask, stay tuned" suggested you would report back on the state of affairs there. Just letting you know that people are still waiting for any response.
    My apologies, you were at least partially right. I did say I would ask for confirmation on whether filing support tickets etc... was still the right thing to do.



  • gurv
    replied
    I've also been able to reproduce it with my own test: 17 parallel compilations of the it87 module (https://github.com/groeck/it87).
    Though it takes longer to fail (about 15 minutes).

    Interestingly, it fails in grep and not gcc!
    xargs: grep: terminated by signal 11

    It seems related to a very heavy load that starts and stops many processes.

    Edit: line from dmesg
    [ 5451.347874] grep[31531]: segfault at 7fdc9ac952e0 ip 00007fdc9aa65c20 sp 00007ffe0c142038 error 4 in libc-2.23.so[7fdc9aa3b000+1c0000]

    For reference lines I get from kill-ryzen.sh:
    [ 6819.611864] bash[3513]: segfault at 47ab00 ip 00007fa323b5ed13 sp 00007fffdcdfe160 error 7 in libc-2.23.so[7fa3239f8000+1c0000]
    [ 6840.225315] traps: bash[31022] trap invalid opcode ip:48db90 sp:7ffe38799968 error:0 in bash[400000+f4000]
    [ 7040.246725] bash[12027]: segfault at 64 ip 00000000004b8a20 sp 00007fff9377a438 error 4 in bash[400000+f4000]
    [ 7040.247237] bash[11960]: segfault at 64 ip 00000000004b8a20 sp 00007ffc34f03d18 error 4 in bash[400000+f4000]
    [ 7164.667555] traps: bash[27611] trap invalid opcode ip:48db90 sp:7ffdc8269018 error:0 in bash[400000+f4000]
    [ 7197.126370] bash[28846]: segfault at 0 ip 00007fa06a237746 sp 00007fff416b8950 error 4 in libc-2.23.so[7fa06a1ac000+1c0000]
    [ 7224.309228] bash[21097]: segfault at e86 ip 00000000004b8a20 sp 00007fff49d9a958 error 4 in bash[400000+f4000]

    Edit2: tested again with everything at stock except disabling core boost, SVM (virtualization), global C-states, and the IOMMU.
    No dice, it failed in exactly the same way.

    Note: in all cases I was running the AMD staging 4.11 kernel compiled with GCC 5.4 on Ubuntu.
    Last edited by gurv; 08-05-2017, 10:33 AM.
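
    gurv's reproducer boils down to hammering the scheduler with many short-lived compiler processes. A self-contained sketch of that pattern (the job count mirrors his 17 parallel builds; the generated one-line C file stands in for the it87 module, and the two-round cap is only so the demo terminates — a real stress run would loop indefinitely):

    ```shell
    #!/bin/sh
    # Sketch of the stress pattern: launch many short-lived compiler
    # processes in parallel, in a loop, and stop as soon as any of
    # them dies (on an affected Ryzen this is typically signal 11
    # from gcc or grep).
    JOBS=17
    WORK=$(mktemp -d)
    trap 'rm -rf "$WORK"' EXIT

    printf 'int main(void) { return 0; }\n' > "$WORK/unit.c"

    round=0
    while :; do
        round=$((round + 1))
        # run $JOBS compiles in parallel; xargs exits non-zero if any fails
        seq "$JOBS" | xargs -P "$JOBS" -I{} \
            gcc -O2 -o "$WORK/unit.{}" "$WORK/unit.c" || {
            echo "FAILURE in round $round -- check dmesg for segfaults"
            exit 1
        }
        echo "round $round OK"
        [ "$round" -ge 2 ] && break
    done
    ```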



  • pjssilva
    replied
    Originally posted by sturmflut View Post
    We've had access to an EPYC 7601 system at work and ran highly parallel HPC workloads (HPL, HPCG, Stream etc.) on it. Not a single segfault in three days. I also went and compiled the 4.12 Linux kernel with GCC 7.1.0 with a parallelization factor of up to 256 (make -j256) multiple times, completely stable.

    On the other hand I have a Ryzen 7 1600X on an Asus Prime B350M-A with 16 GB of DDR4 memory running at 2933 MHz at home, with the default 4.12.2 Arch Linux kernel. I just checked out darktable from GitHub and did a make -j12, immediately had one of the GCC instances crashing with a segfault.
    Can you run kill_ryzen.sh on this Epyc system? There is already a report of a failing system on Reddit:

    https://www.reddit.com/r/Amd/comment...g_performance/

    Edit: A real test takes many hours; 24 hours seems best.



  • pjssilva
    replied
    Another hint: once the system starts failing, it sometimes seems to enter a state where failure is much more likely. To get back to "normal" I have the feeling that you have to at least turn off the computer and wait a little, maybe even clear the CMOS (but I am not sure about that). My system was taking 10,000 seconds to fail with kill_ryzen.sh under Antergos; after a failed test I rebooted and tried again, and it failed in 80 seconds. Unfortunately I am not close to my system now, I am accessing it over ssh, so I cannot turn it off, wait 10 minutes, and try again.



  • sturmflut
    replied
    We've had access to an EPYC 7601 system at work and ran highly parallel HPC workloads (HPL, HPCG, Stream etc.) on it. Not a single segfault in three days. I also went and compiled the 4.12 Linux kernel with GCC 7.1.0 with a parallelization factor of up to 256 (make -j256) multiple times, completely stable.

    On the other hand I have a Ryzen 7 1600X on an Asus Prime B350M-A with 16 GB of DDR4 memory running at 2933 MHz at home, with the default 4.12.2 Arch Linux kernel. I just checked out darktable from GitHub and did a make -j12, immediately had one of the GCC instances crashing with a segfault.



  • pjssilva
    replied
    Originally posted by oleid View Post
    Edit: No problem after 1.5h++
    You are using Arch, right? I have an unstable system under Ubuntu and Fedora 26, but under Fedora 26 it takes much longer to fail. Antergos (which is Arch based) also seems much better at avoiding failure. I am still double-checking this by running more tests on my system, but with Antergos kill_ryzen.sh took at least 10,000 seconds (almost 3 hours) to fail in the first two tries, while under Ubuntu it takes a couple of minutes tops. My theory is that GCC 6.3 generates code that triggers the bug easily, while GCC 7.1.1 (which seems to be the default compiler in the Arch toolchain) generates code that is less likely to trigger it. Since Linux and the BSDs are compiled with GCC, this may explain why it is easier to see the problem under those systems. And WSL also runs lots of GCC-compiled code under Windows.

    Note that I am not saying GCC is the culprit here. It is probably generating good and valid code, since the same code runs well on other CPUs. But if we could find out what this code path is, maybe we could help AMD's engineers find a workaround (even by asking GCC and other compilers to avoid that code path).

    Note: AMD, if this helps you, I want a flagship processor every year :-)

    Edit: If you want to make sure that your system is good, you should run kill_ryzen.sh for at least 24 hours. I've been there: tweaking my system (and turning off SMT) I could get 5 hours, and then bang, another segfault.
    Last edited by pjssilva; 08-05-2017, 07:42 AM.
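
    For a run that long, it helps to scan the kernel log for the fault signatures automatically rather than watching dmesg by hand. A small sketch, assuming the log has been captured to a file (the helper name and patterns are illustrative, not part of kill_ryzen.sh):

    ```shell
    #!/bin/sh
    # Sketch of a log check for long stress runs: scan a captured
    # kernel log for the fault signatures seen in this thread. On a
    # live run you would feed it the output of `dmesg`.

    # check_log FILE -> prints a verdict; returns 1 if faults were found
    check_log() {
        if grep -Eq 'segfault at|trap invalid opcode' "$1"; then
            echo "FAIL: CPU-suspect faults in $1:"
            grep -E 'segfault at|trap invalid opcode' "$1"
            return 1
        fi
        echo "OK: no segfaults logged in $1"
    }

    # demo with a sample line from this thread
    tmp=$(mktemp)
    echo '[ 6840.225315] traps: bash[31022] trap invalid opcode ip:48db90' > "$tmp"
    check_log "$tmp" || true
    rm -f "$tmp"
    ```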

