Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads


  • #61
    Yea, to me it sounds like motherboard issues, likely not enough voltage to drive stock clocks. There sure have been enough motherboard issues so far, so it wouldn't be very surprising.

    I wonder if anyone tried using the same processor in two motherboards, one known to work and one known not to.



    • #62
      What about fake chips?

      That has often been the case with AMD: “Only” 60 Thousands Fake AMD Chips Arrested, Million Already Shipped

      KingFish 5 Jan, 2005

      “Based on tips provided by Advanced Micro Devices (AMD) Taiwan, the police Friday raided an electronics company located in Tainan, southern Taiwan, and seized a total of 60 000 suspect AMD CPUs. The suspect AMD CPUs, including K7 [AMD Athlon XP] and K8 [AMD Athlon 64] models, were defective CPUs that would normally have been destroyed,” claims an article posted on Taiwan-based web-site DigiTimes.




      Then a waste company sold the AMD chips to China and Germany instead of destroying them.
      Amazon accidentally ships counterfeit AMD APUs


      Amazon has reportedly got a bit of egg on its face after shipping counterfeit AMD CPUs from its own warehouses. Both companies are reportedly working to track down how false hardware was injected into the supply chain.



      or


      The Counterfeit Electronics Problem


      In the Fall of 2003, AMD conducted some raids in Europe, where some of its low speed, low priced microprocessors were being relabeled as high speed, high priced chips. On investigation it was found that some resellers in Shenzhen, China were performing the remarking. AMD also purchased some microprocessors from the resellers and found them to be fakes (Takahash, 2004).

      In January 2005, Advanced Micro Devices (AMD), working in cooperation with Taiwanese authorities, seized a total of 60,000 counterfeit AMD microprocessors worth US $9.46 mil ....



      • #63
        I've been following this problem on Gentoo's forum and it's almost impossible to know what works and what doesn't. The crashes are nondeterministic, so users might change a BIOS setting and then get lucky for a while, so they assume the BIOS change solved the problem. There could also be several separate problems with the same symptoms but requiring separate fixes.
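To put a rough number on "getting lucky", here is a back-of-the-envelope sketch. It assumes each build fails independently with a fixed probability `p_fail`, which is purely an assumption of this example; nobody in the thread has measured such a rate, and the function name is made up for illustration.

```python
import math


def clean_runs_needed(p_fail: float, confidence: float) -> int:
    """Smallest n such that a still-broken system would have failed at
    least once in n builds with probability >= confidence.

    Assumes independent builds with a constant failure probability,
    which is a simplification and not something measured in this thread.
    """
    if not 0.0 < p_fail < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    # P(n clean runs while still broken) = (1 - p_fail) ** n <= 1 - confidence
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.02):
        runs = clean_runs_needed(p, 0.95)
        print(f"p_fail={p}: {runs} clean builds needed for 95% confidence")
```

Under that model, a hypothetical 10% failure rate per build would need 29 clean builds in a row before a "fix" looks 95% credible, which is why a handful of successful compiles after a BIOS change says very little.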

        Originally posted by Beherit
        DragonBSD developer Matt Dillon wrote a workaround for a hardware bug in Ryzen: http://gitweb.dragonflybsd.org/drago...d301557fd9ac20
        Can someone clarify this? Is it saying that returning from an interrupt to a "high user %rip address near the end of the user address space (top of user stack)" sometimes crashes? Does DragonFly run programs that execute code from the stack? Or is it saying that a crash might happen when the return pointer is read from the stack when the stack is near the end of the user address space?

        If there is a hardware bug that depends on a specific address in the user address space, it would make sense that compile jobs trigger it. Linux uses address space layout randomization (ASLR) to put memory segments at different addresses on each run. A compile job forks a lot of processes, each with its own layout. Try compiling without ASLR: "echo 0 > /proc/sys/kernel/randomize_va_space".
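Before and after flipping that sysctl, it can help to confirm what the kernel is actually doing. The following is a minimal sketch, assuming a Linux system with `/proc` mounted; the helper names are invented for this example. It reads the current setting and starts a few fresh processes to show where each one's stack ends up.

```python
import subprocess
import sys
from pathlib import Path

ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the sysctl values as described in the kernel documentation.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and vDSO randomized",
    2: "full randomization (mode 1 plus heap)",
}

# Run in each child: print the address range of its own [stack] mapping.
_CHILD_SNIPPET = (
    "import sys\n"
    "for line in open('/proc/self/maps'):\n"
    "    if '[stack]' in line:\n"
    "        sys.stdout.write(line.split()[0])\n"
)


def describe_aslr(raw: str) -> str:
    """Turn the sysctl's text content into a human-readable description."""
    return ASLR_MODES.get(int(raw.strip()), "unknown")


def child_stack_ranges(runs: int = 3) -> list[str]:
    """Start fresh interpreters and collect each one's stack address range."""
    ranges = []
    for _ in range(runs):
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD_SNIPPET],
            capture_output=True,
            text=True,
        )
        ranges.append(proc.stdout.strip())
    return ranges


if __name__ == "__main__" and ASLR_SYSCTL.exists():
    print("ASLR:", describe_aslr(ASLR_SYSCTL.read_text()))
    for addr in child_stack_ranges():
        print("child stack mapped at", addr)
```

With randomization enabled the printed ranges should differ from child to child; with it disabled they should normally be identical. Writing to the sysctl requires root and affects the whole system. To limit the change to a single build, `setarch "$(uname -m)" -R <command>` disables randomization for just that command and its children.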



        • #64
          That's interesting but users aren't just seeing issues under load. Mine freezes while doing practically nothing at all. As I said before though, there may be multiple issues at play.

          Also be aware that I saw weird segfaults from Java under ppc64 when ASLR was disabled.
          Last edited by Chewi; 03 June 2017, 10:17 AM.



          • #65
            Originally posted by GreatEmerald View Post
            Yea, to me it sounds like motherboard issues, likely not enough voltage to drive stock clocks. There sure have been enough motherboard issues so far, so it wouldn't be very surprising.

            I wonder if anyone tried using the same processor in two motherboards, one known to work and one known not to.
            The segfault error reported here is very different from "just" random crashes or reboots. Voltage surges or undervoltage during load would trigger errors during any heavy load, regardless of OS or activity. The workloads described here are too specific compared to the randomness of a dodgy motherboard.

            I, for one, would find it very surprising indeed if this particular issue when compiling is caused by motherboard problems so mysterious that they only occur when using the gcc compiler in Linux. There are no reports of this amongst Windows developers; I've not been able to find a single one.

            Digging through forums and googling, those reporting this are 80% Gentoo users, 5% Fedora, 5% Debian and 10% Ubuntu. Here's the earliest report of this I was able to find, courtesy of the Australian edition of Linux Format, dated Apr 11.

            FreeBSD has one report which may be related to this, but it has yet to be verified. DragonFly BSD confirms using two workarounds on the Ryzen platform. One fixes CPU clock rate detection (and thus has nothing to do with this), and the second is for a Ryzen hardware bug; I'd like details on how it's triggered and what happens when it is.

            I'm curious if this fix, by recompiling bash, works for everyone. And also if segfaults occur when using LLVM/Clang on the same source code as with gcc.
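Since the crashes are intermittent, comparing the two compilers means running the same build many times and counting outcomes. Here is a rough sketch of such a harness. The function names, the source file `big_file.c` and the flags are placeholders, and treating any stderr mention of "segmentation fault" as a crash is a heuristic assumed for this example, not something the compilers guarantee.

```python
import signal
import subprocess
from collections import Counter


def classify_run(returncode: int, stderr: str) -> str:
    """Bucket one build attempt as ok, segfault, or other-failure."""
    if returncode == 0:
        return "ok"
    # A negative return code means the process itself died from a signal.
    # A compiler driver may instead exit non-zero and only report the crash
    # of a subprocess on stderr, so the text is checked as well (heuristic).
    if returncode == -signal.SIGSEGV or "segmentation fault" in stderr.lower():
        return "segfault"
    return "other-failure"


def repeat_build(cmd: list[str], runs: int) -> Counter:
    """Run the same command several times and tally the outcomes."""
    results: Counter = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace"
        )
        results[classify_run(proc.returncode, proc.stderr)] += 1
    return results


# Example (hypothetical source file and flags):
#   print(repeat_build(["gcc", "-O2", "-c", "big_file.c", "-o", "/dev/null"], runs=50))
#   print(repeat_build(["clang", "-O2", "-c", "big_file.c", "-o", "/dev/null"], runs=50))
```

Running both command lines on the same machine, with the same source and a similar level of parallel load, would show whether the failures follow gcc specifically or any heavy compile job.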



            • #66
              Well when I was messing with overclocking I got this problem. Also, I could run one Cinebench, but not two in a row.



              • #67
                Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ.

                The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it.

                When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.
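For anyone who wants to experiment along these lines on Linux, the first step is knowing which logical CPUs form a hyperthread pair. The sketch below is not the DragonFly test case mentioned in this post; it only lists SMT sibling groups. It assumes the usual Linux sysfs layout, and the function names are invented for this example.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Parse the kernel's CPU list format, e.g. "0,8", "0-1" or "0-1,8-9"."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_sibling_groups(
    sysfs_root: str = "/sys/devices/system/cpu",
) -> list[tuple[int, ...]]:
    """Return the distinct groups of logical CPUs that share a physical core."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    groups = set()
    for path in root.glob("cpu[0-9]*/topology/thread_siblings_list"):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


if __name__ == "__main__":
    for group in smt_sibling_groups():
        print("SMT siblings:", group)
```

With a pair in hand, one process running a busy loop can be pinned to the first sibling and a signal-heavy process to the second, for example with `os.sched_setaffinity(0, {cpu})` or `taskset -c`. Whether that provokes the stall described here on Linux is untested speculation.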

                The bug is completely unrelated to overclocking. It is deterministically reproducible.

                I sent a full test case off to AMD in April.

                I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so it's certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update.
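If a microcode update does appear, reports from different users are only comparable when they state which revision they are running. A small sketch for reading it on Linux follows, assuming the `microcode` field that x86 kernels expose in `/proc/cpuinfo`. The function name is invented here, and nothing in this thread says which revision, if any, contains a fix.

```python
import re
from pathlib import Path

# Matches lines such as "microcode<TAB>: 0x..." in /proc/cpuinfo.
_MICROCODE_RE = re.compile(r"^microcode\s*:\s*(\S+)", re.MULTILINE)


def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Collect the distinct microcode revisions reported across all CPUs."""
    return set(_MICROCODE_RE.findall(cpuinfo_text))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        revisions = microcode_revisions(cpuinfo.read_text())
        print(revisions or "no microcode field reported")
```

Including that value alongside the motherboard model and BIOS version in a bug report makes it easier to tell whether two reports describe the same configuration.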

                -Matt



                • #68
                  Well, this could be somewhat worrying if it's not just a software bug... From what I've understood, Ryzen is supposed to do some pretty aggressive runtime optimization with stuff like heavy-handed out-of-order execution. If turning SMT off really fixes this bug and it's present on all modern versions of GCC, my first guess would be that out-of-order execution is clashing with SMT and generating segmentation faults when memory reads and writes are put out of sequence when they shouldn't be.

                  Then again it's not like Intel has never had chip bugs in their hardware...
                  "Why should I want to make anything up? Life's bad enough as it is without wanting to invent any more of it."



                  • #69
                    If it's a hardware bug, what would be the repercussions of that? Could this be fixed in later Ryzen batches? Well, I am jumping ahead here.



                    • #70
                      Originally posted by sarfarazahmad
                      If it's a hardware bug, what would be the repercussions of that? Could this be fixed in later Ryzen batches? Well, I am jumping ahead here.
                      I would assume that every chip has "hardware bugs" and that their fixes are implemented in BIOS/microcode. I seriously doubt that this is a hardware bug: no matter how non-deterministic it might be, it would happen for every user, and that is clearly not the case, so it must be a software bug (BIOS or higher-level software).
