Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads


  • #61
    Since switching to AGESA 1.0.0.6 this morning, I've already had one freeze under a "Generic-x86-64" kernel despite that optimisation level working for 14 hours on 1.0.0.4 yesterday. Either I was lucky yesterday or I'm battling multiple issues. I've just switched to GCC 7.1 in the hope that helps but I can't see how it would if you're not using -march. I'm only using it to build the kernel. You can do that by passing CC=gcc-7.1.0.

    What I find interesting is that I haven't seen any segfaults. Every incident has been an entire system lockup. I've only rebuilt a small handful of packages and most of my system is still built against -march=nehalem from my old system.

    I have been running the netconsole kernel module to send console messages over UDP. I usually do get some output following the freeze but there doesn't appear to be any pattern to the stack traces. It seems like something quite fundamental is failing.
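For anyone wanting to capture netconsole output on a second machine without setting up a syslog daemon, here is a minimal sketch of a UDP receiver. It assumes the sender targets port 6666, which is netconsole's default target port; the function names are made up for illustration and this is not necessarily how the output above was collected.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is netconsole's default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_lines(sock, max_datagrams, timeout=5.0):
    """Collect up to max_datagrams datagrams, decoded as text.

    Stops early if no datagram arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    lines = []
    for _ in range(max_datagrams):
        try:
            data, _addr = sock.recvfrom(65535)
        except socket.timeout:
            break
        lines.append(data.decode("utf-8", errors="replace").rstrip("\n"))
    return lines


def main():
    """Print kernel messages as they arrive, until interrupted."""
    listener = open_listener()
    while True:
        for line in receive_lines(listener, max_datagrams=1, timeout=60.0):
            print(line, flush=True)
```

Calling `main()` on the receiving machine and redirecting its output to a file keeps whatever the frozen box managed to send before it locked up.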

    I did try disabling XMP early on but it didn't help. I'll try it again now that I've updated the BIOS. I haven't tweaked any other BIOS settings but I'll look into that. I could also try running Fedora off a stick for a while to see how long that holds up and I might even try borrowing a Fedora kernel for my Gentoo system.
    Last edited by Chewi; 06-03-2017, 08:34 AM.



    • #62
      Yea, to me it sounds like motherboard issues, likely not enough voltage to drive stock clocks. There sure have been enough motherboard issues so far, so it wouldn't be very surprising.

      I wonder if anyone tried using the same processor in two motherboards, one known to work and one known not to.



      • #63
        What about fake chips?

        That has often been the case with AMD:
        “Only” 60 Thousands Fake AMD Chips Arrested, Million Already Shipped

        KingFish
        5 Jan, 2005

        “Based on tips provided by Advanced Micro Devices (AMD) Taiwan, the police Friday raided an electronics company located in Tainan, southern Taiwan, and seized a total of 60 000 suspect AMD CPUs. The suspect AMD CPUs, including K7 [AMD Athlon XP] and K8 [AMD Athlon 64] models, were defective CPUs that would normally have been destroyed,” claims an article posted on Taiwan-based web-site DigiTimes.

        http://icrontic.com/article/2005-onl...lready_shipped


        Then the waste company sold the AMD chips to China and Germany instead of destroying them:
        Amazon accidentally ships counterfeit AMD APUs


        https://www.extremetech.com/computin...rfeit-amd-apus


        or


        The Counterfeit Electronics Problem


        In the Fall of 2003, AMD conducted some raids in Europe, where some of its low speed, low priced microprocessors were being relabeled as high speed, high priced chips. On investigation it was found that some resellers in Shenzhen, China were performing the remarking. AMD also purchased some microprocessors from the resellers and found them to be fakes (Takahash,2004).


        In January 2005, Advanced Micro Devices (AMD), working in cooperation with Taiwanese authorities, seized a total of 60,000 counterfeit AMD microprocessors worth US $9.46 mil ....


        https://file.scirp.org/pdf/JSS_2013121215153599.pdf



        • #64
          I've been following this problem on Gentoo's forum and it's almost impossible to know what works and what doesn't. The crashes are nondeterministic, so users might change a BIOS setting and then get lucky for a while, so they assume the BIOS change solved the problem. There could also be several separate problems with the same symptoms but requiring separate fixes.

          Originally posted by Beherit
          DragonFly BSD developer Matt Dillon wrote a workaround for a hardware bug in Ryzen: http://gitweb.dragonflybsd.org/drago...d301557fd9ac20
          Can someone clarify this? Is it saying that returning from an interrupt to a "high user %rip address near the end of the user address space (top of user stack)" sometimes crashes? Does DragonFly run programs that execute code from the stack? Or is it saying that a crash might happen when the return pointer is read from the stack when the stack is near the end of the user address space?

          If there is a hardware bug that depends on a specific address in the user address space, it would make sense that compile jobs trigger it. Linux uses address space layout randomization to put memory segments at different addresses on each run. A compile job forks a lot of processes, each with its own layout. Try compiling without ASLR: "echo 0 > /proc/sys/kernel/randomize_va_space".
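Before and after flipping that switch it helps to confirm what the kernel is actually using. The sketch below reads the same sysctl file and translates its value; the function names are invented for illustration, and the descriptions are a short summary of the kernel's documented modes rather than an exhaustive one.

```python
from pathlib import Path

ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Summary of the documented values of kernel.randomize_va_space.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and VDSO randomized",
    2: "as 1, plus heap (brk) randomized",
}


def describe_aslr(raw):
    """Translate the sysctl's text value into a short description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def current_aslr(path=ASLR_PATH):
    """Read the live setting; returns None if the file is not available."""
    try:
        return describe_aslr(path.read_text())
    except OSError:
        return None
```

A narrower alternative to the system-wide switch is util-linux's `setarch "$(uname -m)" -R <command>`, which turns randomization off only for the command it launches and leaves the rest of the system untouched.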



          • #65
            That's interesting but users aren't just seeing issues under load. Mine freezes while doing practically nothing at all. As I said before though, there may be multiple issues at play.

            Also be aware that I saw weird segfaults from Java under ppc64 when ASLR was disabled.
            Last edited by Chewi; 06-03-2017, 10:17 AM.



            • #66
              Originally posted by GreatEmerald
              Yea, to me it sounds like motherboard issues, likely not enough voltage to drive stock clocks. There sure have been enough motherboard issues so far, so it wouldn't be very surprising.

              I wonder if anyone tried using the same processor in two motherboards, one known to work and one known not to.
              The segfault error reported here is very different from "just" random crashes or reboots. Voltage surges or undervoltage during load would trigger errors during any heavy load, regardless of OS or activity. The workloads described here are too specific compared to the randomness of a dodgy motherboard.

              I, for one, would find it very surprising indeed if this particular issue when compiling were caused by motherboard problems so mysterious that they only occur when using the gcc compiler in Linux. There are no reports of this amongst Windows developers; I've not been able to find a single one.

              Digging through forums and googling, those reporting this are 80% Gentoo users, 5% Fedora, 5% Debian and 10% Ubuntu. Here's the earliest report of this I was able to find, courtesy of the Australian edition of Linux Format, dated Apr 11.

              FreeBSD has one report which may be related to this, but it has yet to be verified. DragonFly BSD confirms using two workarounds on the Ryzen platform: one to fix CPU clock rate detection (so it has nothing to do with this), and a second for a Ryzen hardware bug. I'd like details on how that bug is triggered and what happens when it is.

              I'm curious if this fix, by recompiling bash, works for everyone. And also if segfaults occur when using LLVM/Clang on the same source code as with gcc.



              • #67
                Well when I was messing with overclocking I got this problem. Also, I could run one Cinebench, but not two in a row.



                • #68
                  Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ.

                  The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it.

                  When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

                  The bug is completely unrelated to overclocking. It is deterministically reproducible.

                  I sent a full test case off to AMD in April.

                  I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so it's certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update.

                  -Matt
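Since the behaviour described above depends on which two logical CPUs share a core, anyone trying to observe it on Linux first needs to know the sibling pairs. The sketch below reads them from sysfs. This is not the test case sent to AMD, which is not shown in the thread; the function names are illustrative, and it assumes the usual Linux `thread_siblings_list` files are present.

```python
import glob


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,8" or "0-1" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def sibling_groups(
    pattern="/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list",
):
    """Return the distinct groups of logical CPUs that share a physical core."""
    groups = set()
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                groups.add(tuple(parse_cpu_list(f.read())))
        except OSError:
            continue  # CPU may be offline or the file unreadable
    return sorted(groups)
```

Given a pair `(a, b)` from `sibling_groups()`, `os.sched_setaffinity(0, {a})` pins the calling process to logical CPU `a`. Running a busy loop there while a signal-heavy program is pinned to `b` would roughly approximate the conditions described, though whether that is enough to trigger anything on Linux is an open question.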



                  • #69
                    Well, this could be somewhat worrying if it's not just a software bug... From what I've understood, Ryzen is supposed to do some pretty aggressive runtime optimization, with stuff like heavy-handed out-of-order execution. If turning SMT off really fixes this bug and it's present on all modern versions of GCC, my first guess would be that out-of-order execution is clashing with SMT and generating segmentation faults when memory reads and writes are put out of sequence when they shouldn't be.

                    Then again it's not like Intel has never had chip bugs in their hardware...



                    • #70
                      If it's a hardware bug, what would the repercussions be? Could this be fixed in later Ryzen batches? Well, I am jumping ahead here.
