AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR


  • #61
    Originally posted by RussianNeuroMancer
    Yep, hopefully other solution besides RMA will be found, and hopefully it will be rolled out to users as soon as possible.
    Yup. That being said, I disagree with those who think it's horrible that AMD was cautious about investigating this issue, and that it's a reason to swear off AMD forever (at least until Intel does something as bad, or worse... again).

    The issue will be dealt with, and until then, I'm still happy with my purchase. No new platform/chip is perfect. Different errata have differing severities, but all processors have them. I still got a great chip at a good price, and I knew what I was signing up for when I bought it.

    Comment


    • #62
      Originally posted by Veerappan
      The issue will be dealt with, and until then, I'm still happy with my purchase. No new platform/chip is perfect. Different errata have differing severities, but all processors have them. I still got a great chip at a good price, and I knew what I was signing up for when I bought it.


      Wait a minute... is he implying there are no issues with the SoC, just too many faulty systems out there? For real?

      Comment


      • #63
        Okay, I'm not exactly sure what it was or how it happened, but I solved this problem on my machine. Maybe someone here can get something from my experience, so I'll explain.

        I bought a Ryzen 1700x with an MSI B350 Tomahawk and 16GB of Crucial Ballistix DDR4 clocked at 2400MHz back in early June. I'm a Gentoo user, so I immediately started experiencing this problem. It actually manifested in both forms, the full system lockups and the segfaults while compiling. Naturally, I started digging and came upon the multiple threads discussing the issue.

        I tried everything. First, I replaced the cheap GT 210 GPU that I bought for graphical output with an RX 560, thinking there was a conflict with Nouveau, like some suggested. That didn't work. Then, I started fiddling with the motherboard. I upgraded to the latest beta BIOS for that board, AGESA 1.0.0.6, and that didn't help. From there, I started trying every combination of voltages for the CPU, RAM, and NB. The frequency of the failures changed slightly, but never stopped entirely. I tried disabling ASLR. Again, it altered the frequency of the failures, but never fully solved the problem. Finally, I resolved that there was a RAM problem, so I replaced the RAM with G.Skill Trident Z clocked at 3200MHz (keep in mind that I tried altering the frequencies of both sets of RAM). That didn't solve anything either.
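
For anyone repeating the ASLR experiment, here is a minimal sketch for checking the current setting before and after a change. It is not part of the original post. It assumes a Linux system that exposes the standard `randomize_va_space` sysctl under `/proc`; the helper name `describe_aslr` and the wording of the mode descriptions are illustrative only.

```python
from pathlib import Path

# Standard Linux sysctl file that holds the ASLR mode.
ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Short summaries of the kernel's documented values (wording is ours).
ASLR_MODES = {
    "0": "disabled",
    "1": "partial: stack, mmap base, and VDSO randomized",
    "2": "full: partial plus heap (brk) randomization",
}


def describe_aslr(raw_value):
    """Translate the raw sysctl contents into a readable description."""
    return ASLR_MODES.get(raw_value.strip(), "unknown")


if __name__ == "__main__":
    if ASLR_PATH.exists():
        print("ASLR:", describe_aslr(ASLR_PATH.read_text()))
    else:
        print("ASLR sysctl not found (not a Linux system?)")
```

Recording this value alongside each test run makes it easier to tell whether a change in failure frequency lines up with the ASLR setting or with something else that changed at the same time.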

        Along the line, I contacted AMD, and they offered an RMA. I accepted, replaced the CPU, and absolutely nothing changed.

        After about a month of testing, the only conclusion that I could ever draw was that ASLR and the voltages played some role in causing/preventing the issue. It was around that time that I started hearing reports of individuals with higher end X370 boards fixing the issue or not experiencing it with the latest BIOS. I figured that it was time for one last-ditch effort, and I replaced the motherboard with an ASUS Crosshair VI Hero.

        Before I even booted into the OS, I updated the BIOS to the latest release. In my initial tests at stock settings, the issue presented even more frequently than with the previous board. This prompted me to check the BIOS settings to find that the automatic voltage values were seriously off base. I decided to treat the board as though I was going to do a run-of-the-mill overclock, and replaced all of the automatic values with reasonable ones for a Ryzen overclock. I also switched off the automatic power regulation features, like C6 and Cool and Quiet, that you would not want interfering with an overclocked CPU.

        When I rebooted and tested the machine, the issue had vanished entirely. I tested it multiple times, and although the compile times seemed to have increased by a couple of seconds, it would not fail. The machine has been running now since about the second week of July at an overclock of 3.8GHz with the RAM at 3200MHz, completely stable.
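
A minimal sketch of how this kind of repeated testing could be automated. It is an assumption about a reasonable procedure, not the exact method used above. The names `classify` and `run_repeatedly` are hypothetical, and the workload shown is a harmless placeholder to be swapped for a real parallel build.

```python
import signal
import subprocess
import sys


def classify(returncode):
    """Map a subprocess return code to 'ok', 'segfault', or 'error'."""
    if returncode == 0:
        return "ok"
    # On POSIX, subprocess reports death-by-signal as a negative return code.
    if returncode == -signal.SIGSEGV:
        return "segfault"
    return "error"


def run_repeatedly(command, iterations):
    """Run `command` `iterations` times and count each kind of outcome."""
    counts = {"ok": 0, "segfault": 0, "error": 0}
    for _ in range(iterations):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        counts[classify(result.returncode)] += 1
    return counts


if __name__ == "__main__":
    # Placeholder workload; replace with a real build such as ["make", "-j16"].
    workload = [sys.executable, "-c", "pass"]
    print(run_repeatedly(workload, iterations=5))
```

Note that only a crash of the directly launched process is counted as a segfault. If a compiler crashes inside a build tool, the build tool normally exits with its own non-zero status, so that run shows up as "error" and the build log has to be checked for the actual cause.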

        My Theory: The issue has nothing to do with the CPUs themselves. There is a problem with voltage regulation on the motherboards, and the components probably require more voltage than what is being specified. It works fine under normal loads, but the stress of tasks like compilation causes the voltage to drop off just enough to trigger the problem. On lower end boards, even BIOS fixes aren't solving the problem because the hardware construction just isn't there to support what's needed. On higher end boards, you can manually set the voltages and use the latest BIOS to rein in the issue. That would also explain why Epyc and Threadripper aren't affected: the boards for both platforms use different chipsets and much more robust construction. It's not an architectural issue with Ryzen at all, or even the Linux kernel.

        Obviously, this is just a theory, and it's only based on my experience alone, but it seems to make sense.

        Comment


        • #64
          Originally posted by ZombieNo7
          Before I even booted into the OS, I updated the BIOS to the latest release. In my initial tests at stock settings, the issue presented even more frequently than with the previous board. This prompted me to check the BIOS settings to find that the automatic voltage values were seriously off base. I decided to treat the board as though I was going to do a run-of-the-mill overclock, and replaced all of the automatic values with reasonable ones for a Ryzen overclock. I also switched off the automatic power regulation features, like C6 and Cool and Quiet, that you would not want interfering with an overclocked CPU.
          May I ask: What are those reasonable voltages you are using? It'd be nice to use them as starting values for my own experiments.

          Comment


          • #65
            Originally posted by k1e0x
            heh, I think you guys need to resize how small of a subset you are. Linux/BSD users that are compiling their systems from scratch that also own Ryzens.. what 0.001% of the computer industry or less?
            How do you figure this is about Linux or BSD "compiling ... systems from scratch"? The cause has not been found yet, you don't know how many problems this is responsible for. Compiling on Linux is just the only known, reproducible way to trigger it.

            Even if that were the case, with the number of Linux servers out there running JIT compilation almost all the time, there's a considerable number of servers that could potentially be affected. Hopefully those don't run on commodity hardware, but then again people do stupid things.

            Comment


            • #66
              Originally posted by ZombieNo7
              Okay, I'm not exactly sure what it was or how it happened, but I solved this problem on my machine. Maybe someone here can get something from my experience, so I'll explain.

              I bought a Ryzen 1700x with an MSI B350 Tomahawk and 16GB of Crucial Ballistix DDR4 clocked at 2400MHz back in early June. I'm a Gentoo user, so I immediately started experiencing this problem. It actually manifested in both forms, the full system lockups and the segfaults while compiling. Naturally, I started digging and came upon the multiple threads discussing the issue.

              I tried everything. First, I replaced the cheap GT 210 GPU that I bought for graphical output with an RX 560, thinking there was a conflict with Nouveau, like some suggested. That didn't work. Then, I started fiddling with the motherboard. I upgraded to the latest beta BIOS for that board, AGESA 1.0.0.6, and that didn't help. From there, I started trying every combination of voltages for the CPU, RAM, and NB. The frequency of the failures changed slightly, but never stopped entirely. I tried disabling ASLR. Again, it altered the frequency of the failures, but never fully solved the problem. Finally, I resolved that there was a RAM problem, so I replaced the RAM with G.Skill Trident Z clocked at 3200MHz (keep in mind that I tried altering the frequencies of both sets of RAM). That didn't solve anything either.

              Along the line, I contacted AMD, and they offered an RMA. I accepted, replaced the CPU, and absolutely nothing changed.

              After about a month of testing, the only conclusion that I could ever draw was that ASLR and the voltages played some role in causing/preventing the issue. It was around that time that I started hearing reports of individuals with higher end X370 boards fixing the issue or not experiencing it with the latest BIOS. I figured that it was time for one last-ditch effort, and I replaced the motherboard with an ASUS Crosshair VI Hero.

              Before I even booted into the OS, I updated the BIOS to the latest release. In my initial tests at stock settings, the issue presented even more frequently than with the previous board. This prompted me to check the BIOS settings to find that the automatic voltage values were seriously off base. I decided to treat the board as though I was going to do a run-of-the-mill overclock, and replaced all of the automatic values with reasonable ones for a Ryzen overclock. I also switched off the automatic power regulation features, like C6 and Cool and Quiet, that you would not want interfering with an overclocked CPU.

              When I rebooted and tested the machine, the issue had vanished entirely. I tested it multiple times, and although the compile times seemed to have increased by a couple of seconds, it would not fail. The machine has been running now since about the second week of July at an overclock of 3.8GHz with the RAM at 3200MHz, completely stable.

              My Theory: The issue has nothing to do with the CPUs themselves. There is a problem with voltage regulation on the motherboards, and the components probably require more voltage than what is being specified. It works fine under normal loads, but the stress of tasks like compilation causes the voltage to drop off just enough to trigger the problem. On lower end boards, even BIOS fixes aren't solving the problem because the hardware construction just isn't there to support what's needed. On higher end boards, you can manually set the voltages and use the latest BIOS to rein in the issue. That would also explain why Epyc and Threadripper aren't affected: the boards for both platforms use different chipsets and much more robust construction. It's not an architectural issue with Ryzen at all, or even the Linux kernel.

              Obviously, this is just a theory, and it's only based on my experience alone, but it seems to make sense.
              Negative.

              Seriously just don't. It's great that you want to help but this is an extremely hard problem and smart people are on it. Ya know.. about 10% of the computer industry has smart people that actually do the bulk of the work.. the rest are just there.

              The best lowdown on what's actually going on is here: https://bugs.freebsd.org/bugzilla/sh....cgi?id=219399

              What I would like to see someone test is GRSec Gentoo kernel with the emulation of GCC trampoline pages turned on.
              PAX_EMUTRAMP=yes

              Comment


              • #67
                Originally posted by k1e0x
                The best lowdown on what's actually going on is here: https://bugs.freebsd.org/bugzilla/sh....cgi?id=219399
                This FreeBSD bug is a completely different story. Please don't mix unrelated issues.

                Originally posted by k1e0x
                What I would like to see someone test is GRSec Gentoo kernel with the emulation of GCC trampoline pages turned on.
                PAX_EMUTRAMP=yes
                Any process can crash on Ryzen during heavy compilation: bash or even a process not participating in compilation. So I don't understand why you are trying to find some issue with gcc itself.

                Comment


                • #68
                  Originally posted by k1e0x
                  Negative.

                  Seriously just don't. It's great that you want to help but this is an extremely hard problem and smart people are on it. Ya know.. about 10% of the computer industry has smart people that actually do the bulk of the work.. the rest are just there.

                  The best lowdown on what's actually going on is here: https://bugs.freebsd.org/bugzilla/sh....cgi?id=219399

                  What I would like to see someone test is GRSec Gentoo kernel with the emulation of GCC trampoline pages turned on.
                  PAX_EMUTRAMP=yes
                  Nice how you manage to call people stupid and then link to a thread where similar people try to figure out issues.

                  Comment


                  • #69
                    Originally posted by bug77
                    Hopefully those don't run on commodity hardware, but then again people do stupid things.
                    Remember the good old days? http://www.pcworld.com/article/112891/article.html
                    By using commodity PC hardware, which is similar to that of home PCs, Google buys cheap and builds high levels of redundancy into its system
                    No shame in using commodity hardware if you know what you are doing.

                    Comment


                    • #70
                      The good news is that AMD is aware and does call it a problem. You know, it's not like Intel hasn't had these sorts of "oopsie daisies".

                      Hoping for a good fix from AMD (the sooner the better of course!)

                      Comment
