Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs

  • #31
    EDIT: Correction: it seems I was a bit quick on the trigger in saying that all my systems have the problem. After reading the post from Debianxfce I rebooted into a custom kernel that I use often, which is a bit slimmer and trimmed for Debian following his advice, but in which I forgot to include BT support. With it there have been no bugs so far. It has been running on the 1700X for 5 min with no bugs, which is a new record.

    I get the same error "segfault at 0 ip 00000000004005b0 sp 00007fffa1423b20 error 4 in conftest[400000+1000]" when running the bench test, but no errors when running heavy FreeSurfer loads on all cores 24/7. 1600X on an ASUS PRIME X370-PRO, no difference with reg. memory.
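For anyone comparing these kernel log lines across machines, here is a minimal sketch of how the fields in the quoted message can be pulled apart. It assumes the x86 Linux "segfault at ... ip ... sp ... error ..." layout shown above and the usual x86 page-fault error-code bits; the names `SEGFAULT_RE` and `decode_segfault` are made up for this illustration.

```python
import re

# Matches the x86 Linux "segfault at ..." kernel log line quoted above.
# All numeric fields in that line are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<image>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing one segfault log line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    info = {
        "address": int(m.group("addr"), 16),   # faulting address
        "ip": int(m.group("ip"), 16),          # instruction pointer
        "sp": int(m.group("sp"), 16),          # stack pointer
        "image": m.group("image"),             # mapped file, if reported
        # x86 page-fault error-code bits:
        "protection_violation": bool(err & 0x1),  # False: page not present
        "write": bool(err & 0x2),                 # False: read access
        "user_mode": bool(err & 0x4),
        "instruction_fetch": bool(err & 0x10),
    }
    if m.group("base") is not None:
        # Offset of the faulting instruction inside the mapped image.
        info["offset_in_image"] = info["ip"] - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    sample = ("segfault at 0 ip 00000000004005b0 sp 00007fffa1423b20 "
              "error 4 in conftest[400000+1000]")
    print(decode_segfault(sample))
```

Read this way, "segfault at 0 ... error 4" is a user-mode read of an unmapped page at address 0, i.e. a null-pointer read inside the `conftest` binary. That alone does not say whether the cause is the program or the CPU.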

    Ubuntu 17.04

    Will try the old Intel system I have as well as an old 8350. Will also check the other Debian system with the same custom kernel.

    Kind regards

    B.

    Last edited by Brutalix; 04 August 2017, 12:51 PM. Reason: Correction of previous statement.


    • #32
      Michael, could you try this on all of your Ryzen systems? Some people can reproduce it and some can't. It would be nice to know the likelihood of getting a bad CPU. Also, some people have been getting random reboots/freezes when leaving their systems at idle. Could you try leaving a system idle for a day or more to see if you can reproduce that issue?


      • #33
        Originally posted by chuckula

        I got a better idea: Instead of requiring Linux users to turn off an important security feature that apparently works with Intel chips, ARM chips, POWER chips, MIPS chips, all AMD chips other than Ryzen, etc., why doesn't AMD figure out what's going on and fix their product?
        This problem happens on Intel chips as well; the "kill_ryzen" script was ridiculed for throwing segfaults on every CPU it was tested on. So can we keep things in the realm of reality?

        AMD knows of this issue. They've been silent as more evidence has been brought to light; however, this bug doesn't seem to affect everyone, and it hits those it does affect to varying degrees. Truth is, there are several issues: stock voltage seems too low to sustain proper function, cooling is inadequate (brief temp spikes), higher registers seem to trigger CPU lockups, P-states will make a system unstable when idle and cause crashes, and there is some basic data corruption from a race condition in what appears to be the fabric.

        I am not bashing AMD here, other than for the silence, which is seriously uncool (John, there are threads you promised to drop an update in and haven't; please do so). However, AMD is still responding to tickets and trying to work with people in these use cases, which is okay I suppose.

        I actually don't have the same problems, as I have an Arctic 360, tons of voltage, and custom state management that curbs 99% of those problems. I just get random memory corruption here and there, which is probably due to dual-rank memory and its speed. Sigh, early adoption.


        • #34
          Originally posted by fuzz

          which kernel option was that?
          CONFIG_RCU_NOCB_CPU_ALL
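To see how a given kernel was built with respect to that option, something like the following sketch can help. It assumes the option is spelled the usual Kconfig way, `CONFIG_RCU_NOCB_CPU_ALL`, and that the kernel's build configuration is available as plain text (on many distributions as `/boot/config-<kernel release>`, which is an assumption about your system). The helper name `kernel_option_state` and the sample text are invented for illustration.

```python
import re


def kernel_option_state(config_text, option):
    """Report how `option` appears in a kernel .config-style text.

    Returns the value after '=' (typically 'y' or 'm'), the string
    'not set' for a '# OPTION is not set' line, or None if the option
    does not appear at all.
    """
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_line = "# %s is not set" % option
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == unset_line:
            return "not set"
        m = set_re.match(line)
        if m:
            return m.group(1)
    return None


if __name__ == "__main__":
    # Hypothetical excerpt of a kernel config file, for illustration only.
    sample = "\n".join([
        "CONFIG_RCU_NOCB_CPU=y",
        "# CONFIG_RCU_NOCB_CPU_NONE is not set",
        "CONFIG_RCU_NOCB_CPU_ALL=y",
    ])
    print(kernel_option_state(sample, "CONFIG_RCU_NOCB_CPU_ALL"))
```

To check a real system, read the config file into a string and pass it in place of `sample`. If the option is absent, that kernel version may not offer it at all.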


          • #35
            Originally posted by chuckula

            I got a better idea: Instead of requiring Linux users to turn off an important security feature that apparently works with Intel chips, ARM chips, POWER chips, MIPS chips, all AMD chips other than Ryzen, etc., why doesn't AMD figure out what's going on and fix their product?
            I think this is more about finding the cause so it can be fixed. If these things stop when this "important" security feature is disabled, people can look into why it fails there, so it can be fixed.


            • #36
              Thanks, thanks a lot, Michael. I knew we were not just complaining in vain in that AMD forum. I was certain that you would be able to reproduce it, and I really like that you can do it with your stress test suite, with different kinds of workloads and not only with single compilation sessions! Very, very good job indeed!


              • #37
                Originally posted by Qaridarium

                and they still try to claim that we do not need an open-source hardware ISA, open-source firmware and an open-source BIOS...
                and then closed-source microcode causes massive problems and they still downplay the role of open-source solutions.
                Can we postpone this debate for now? This has been discussed so many times concerning so many types of hardware, and in this particular case people are actually reporting ways to circumvent or even fix (if it proves permanent) the problem. Please don't drift off into another political debate...


                • #38
                  Might this problem be caused by the 0x40 address shift erratum?

                  It was investigated by a Japanese user.

                  Two problems concerning Ryzen - Memo (in Japanese) http://satoru-takeuchi.hatenablog.co...7/04/24/135914
                  Diary (late June 2017) (in Japanese) http://www.e-hdk.com/diary/d201706c.html#20-2


                  • #39
                    This is anecdotal evidence, but I undervolted my A10-7870K to 0.2V below normal, and while everything seemed fine in Windows, in Linux the system had L1 parity errors that were corrected most of the time. At -0.1V the problem did not appear, so I'm not surprised that a voltage problem could lead to what people see with Ryzen.

                    Also, in another thread on Phoronix I explained how to undervolt the RX 470/480 by patching the amdgpu-pro kernel source. After a few months it appears the voltage I chose (820mV) is too low, as it sometimes leads to page faults in the GPU. When that happens, only a reboot helps.


                    • #40
                      Originally posted by nightmarex

                      This problem happens on Intel chips as well; the "kill_ryzen" script was ridiculed for throwing segfaults on every CPU it was tested on.
                      Incorrect. kill_ryzen would only lead to segfaults on Ivy Bridge, and this is a known hardware defect that newer generations don't have (or that is already fixed in microcode).
