AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR

  • Originally posted by shmerl

    The problem is, it's not deterministic, so it's hard to narrow it down. No one reliably reproduced it yet as far as I know.
    From: https://www.phoronix.com/forums/foru...on-loads/page2 (I marked the relevant part in bold):

    Originally posted by dillon
    Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

    The bug is completely unrelated to overclocking. It is deterministically reproducible.

    I sent a full test case off to AMD in April.

    Comment


    • Originally posted by Jumbotron
      BRAVO...BRAVO...BRAVO !! FANTASTIC WORK MICHAEL !!! THIS is why I subscribed to you ! If you are not a subscriber yet....please do so NOW !! Michael's work here on Phoronix and his Phoromatic Test Suite is ESSENTIAL !! At the very least TIP the son of a gun !! We NEED Michael !!
      His work largely was clowning around without an understanding of which segfaults are the "right" segfaults, and the real 'impact' seems to be caused by bridgman not wanting to look like a heartless meanie here on Phoronix, where he has 10000(!) posts, while on AMD's own support forum, ignoring users' problems for three months is more like a sport.
      It must be painful for a customer to only get "We're investigating the problem! Please create some nice tickets. Oh wait, don't create tickets. Oh hey, we can replace your defective CPU with another defective CPU, just RMA it." and a lot of radio silence.

      On a side note, declaring crashes a "performance marginality" is quite far in the realm of lies.

      Comment


      • Originally posted by luckz
        It must be painful for a customer to only get "We're investigating the problem! Please create some nice tickets. Oh wait, don't create tickets. Oh hey, we can replace your defective CPU with another defective CPU, just RMA it." and a lot of radio silence.
        Actually the idea is to replace the CPU with one which works as expected, but since we started replacing CPUs while still investigating we didn't always get it right at first.

        Comment


        • Originally posted by Beherit
          From: https://www.phoronix.com/forums/foru...on-loads/page2 (I marked the relevant part in bold):
          It's not certain yet whether the segfault issue and the issue found by Matt Dillon are actually the same. The findings of mcl00 here point towards them being different issues.

          Comment


          • Originally posted by shmerl

            That's a good question. I didn't quite understand that, since there was supposedly no new stepping of regular (non server) Ryzen CPUs.
            Then it's probably a binning issue with the initially mass-produced models. A few 1800X's should probably have become 1700's instead.

            Comment


            • Or rather "1700LE" octa-cores with SMT disabled...

              Comment


              • Originally posted by bridgman

                Actually the idea is to replace the CPU with one which works as expected, but since we started replacing CPUs while still investigating we didn't always get it right at first.
                Have you managed to narrow down the test case enough for you to find the root cause yet?

                In any case, will you at least consider communicating the root cause clearly and unambiguously along with a solid repro test case once you *do* get to the bottom of this?

                FWIW, I fully understand that designing and validating CPUs is both time-consuming and difficult.

                EDIT: The only thing I've seen so far is this:

                Originally posted by amdmatt

                We have been working closely with a small but important subset of Linux users that have experienced segment faults when running heavy or looping compilations on their Ryzen CPU-based systems. The results of our testing and analysis indicate that segment faults can be caused by memory allocation, system configurations, thermal environments, and system marginality when running looping or heavy Linux compile workloads. The marginality is stimulated with very heavy workloads and when the system environment is not ideal. AMD is working with individual users to diagnose the issues.

                We are confident that we can help each of you identify the source of the marginality and eliminate the segment faults. We encourage all of our Linux users who are experiencing segment faults under compile workloads to continue working with AMD Customer Care. We are committed to solving this issue for all of you.

                source (AMD community post)
                Last edited by ermo; 08-09-2017, 04:53 AM.

                Comment


                • What's the UA code on your chip, Michael? (It's on the IHS.)

                  The good chips people are getting back from RMA seem to be UA1725SUS, which was supposedly built around the 25th week of 2017. The Threadripper chips the reviewers got have a timestamp of UA1727SUS, so they are newer (and AMD says Threadripper isn't affected).

                  The following build dates have instances of marginal processors:

                  UA1707PGT (reported by fujii@amdcommunityforums)
                  UA1716PGT (reported by fujii@amdcommunityforums)
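
For anyone collecting these codes, here is a rough Python sketch of how they could be decoded and compared. It assumes the layout suggested by the codes quoted in this thread (two letters, a two-digit year, a two-digit week, three trailing letters); that layout is an inference from the examples here, not documented AMD behaviour, and the function names are made up for illustration. A code being older or newer than another says nothing by itself about whether a given chip is affected.

```python
import re

# Assumed layout, inferred only from the codes quoted in this thread:
# two letters, two-digit year, two-digit week, three trailing letters.
DATE_CODE = re.compile(r"^([A-Z]{2})(\d{2})(\d{2})([A-Z]{3})$")


def parse_date_code(code):
    """Return (year, week) for a code such as 'UA1725SUS'."""
    match = DATE_CODE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised date code: {code!r}")
    year = 2000 + int(match.group(2))
    week = int(match.group(3))
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in code: {code!r}")
    return year, week


def is_older_than(code, reference):
    """True if `code` was built in an earlier (year, week) than `reference`."""
    return parse_date_code(code) < parse_date_code(reference)


if __name__ == "__main__":
    for code in ("UA1707PGT", "UA1716PGT", "UA1725SUS", "UA1727SUS"):
        year, week = parse_date_code(code)
        print(f"{code}: week {week} of {year}")
```

With the codes from this thread, `parse_date_code("UA1725SUS")` gives `(2017, 25)`, and `is_older_than("UA1716PGT", "UA1725SUS")` is true.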

                  Comment


                  • Originally posted by Funks
                    What's the UA code on your chip, Michael? (It's on the IHS.)

                    The good chips people are getting back from RMA seem to be UA1725SUS, which was supposedly built around the 25th week of 2017. The Threadripper chips the reviewers got have a timestamp of UA1727SUS, so they are newer (and AMD says Threadripper isn't affected).

                    The following build dates have instances of marginal processors:

                    UA1707PGT (reported by fujii@amdcommunityforums)
                    UA1716PGT (reported by fujii@amdcommunityforums)
                    Is there another way to read this information without removing the CPU cooler, such as the barcode on the box or the CPU serial number?

                    Comment


                    • Originally posted by scorpio810

                      Is there another way to read this information without removing the CPU cooler, such as the barcode on the box or the CPU serial number?
                      I don't see such a number anywhere on the box. The box does have a serial number, but searching for it turns up nothing. I'd also need to take off a heavy cooler to see the actual CPU.

                      Comment
