AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR


  • Originally posted by Funks
    Gotta pull the Heat sink to see it..

    The Good Ryzen chips people are getting back are 2017 - WEEK25, the ThreadRipper Chips are 2017 - WEEK27

    GoodRyzenChips = UA1725SUS
    ThreadRipper = UA1727SUT

    there might be a lot of "Marginal" ryzen chips older than 1725.
    Does that mean AMD knew about the bug well before Gentoo users first found it, and started shipping good chips after week 25?
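
For anyone comparing markings, here is a minimal sketch of how the codes quoted above could be read. It assumes the layout implied by those two examples (two letters, two-digit year, two-digit week, then a letter suffix). The function names are made up for illustration, and whether week 25 is really a cutoff is only speculation in this thread.

```shell
#!/bin/sh
# Hypothetical helper: split a marking such as UA1725SUS into year and week.
# Assumed layout (from the two examples in this thread): LL YY WW suffix.
decode_date_code() {
    digits=$(printf '%s\n' "$1" |
        sed -n 's/^[A-Z][A-Z]\([0-9][0-9]\)\([0-9][0-9]\)[A-Z]*$/\1 \2/p')
    if [ -z "$digits" ]; then
        echo "unrecognised code: $1" >&2
        return 1
    fi
    set -- $digits            # $1 = YY, $2 = WW
    echo "20$1 week $2"
}

# Hypothetical check: is the code from 2017 week 25 or later?
is_week25_or_later() {
    yyww=$(printf '%s\n' "$1" |
        sed -n 's/^[A-Z][A-Z]\([0-9][0-9][0-9][0-9]\)[A-Z]*$/\1/p')
    [ -n "$yyww" ] && [ "$yyww" -ge 1725 ]
}

decode_date_code UA1725SUS
decode_date_code UA1727SUT
```

A chip that passes `is_week25_or_later` is not thereby known to be good; the check only mirrors the week-25 observation reported here.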



    • Originally posted by sdack
      How much memory do you have and can you see how much memory is being used during the kernel compilation?

      Compiling a source tree as large as the kernel with an unrestrained make -j will spawn lots of parallel processes, far more than your CPU has cores, and each process requires memory, so you'll probably need 32GB or so. I know 16GB isn't enough for me, so I have to restrain it to -j64. Once it exceeds the memory limit it will start swapping, and this can occasionally lead to kernel panics, oopses or lock-ups. You might find a message in the system log.

      To reproduce the problem reported here, you should use only the Phoronix Test Suite and follow the instructions for reproducing the conditions that trigger it. Otherwise you are only adding more guesswork than is really necessary. Stick to what is currently the accepted method for reproducing the issue.
      I have 8GB of RAM. So, it could easily just be system instability. I ran the PTS command and the Ryzen killer script for several minutes but didn't trigger any crashes or segfaults. Maybe I need to run it for longer.
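
To rule out plain memory exhaustion before blaming the CPU, the job count can be capped by available memory as well as by cores. The following is only a sketch: `pick_jobs` is a made-up helper, and the 512 MiB-per-job figure is a guess, not a measured value for kernel builds.

```shell
#!/bin/sh
# Hypothetical helper: choose a make -j value limited by cores and by memory.
# $1 = available memory in MiB, $2 = core count, $3 = assumed MiB per job.
pick_jobs() {
    mem_mib=$1
    cores=$2
    per_job_mib=$3
    by_mem=$((mem_mib / per_job_mib))
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1
    fi
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}

# MemAvailable in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    avail_mib=$(awk '/^MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo)
    # 512 MiB per compile job is an assumption; adjust after watching real usage.
    echo "suggested: make -j$(pick_jobs "${avail_mib:-0}" "$(nproc)" 512)"
fi
```

Running `vmstat 5` in a second terminal during the build shows whether the machine starts swapping (the si/so columns).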

      A few times my computer has restarted itself without warning, seemingly after a period of idling. It's rare, so it's hard to get any data. It's also impossible to find anything in dmesg because my keyboard and mouse constantly spam it with evbug messages.
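
The evbug noise can be filtered out so that anything else in the kernel log becomes visible. This is a rough sketch: `drop_evbug` is a made-up name, the two sample lines are invented for the demonstration, and the commented commands assume evbug is built as a module and that a persistent systemd journal is available.

```shell
#!/bin/sh
# Hypothetical filter: drop evbug input-event lines from kernel log text on stdin.
drop_evbug() {
    # grep exits non-zero when every line was filtered out; that is not an error here.
    grep -v 'evbug' || true
}

# Typical use (reading the kernel log may require root on some systems):
#   dmesg | drop_evbug | tail -n 50
# If evbug is a loadable module, unloading it stops the spam at the source:
#   sudo modprobe -r evbug
# After an unexpected restart, the previous boot's kernel log (if kept):
#   journalctl -k -b -1 | drop_evbug

# Demonstration with two invented log lines:
printf '%s\n' \
    'evbug: Event. Dev: input3, Type: 2, Code: 0, Value: 1' \
    'mce: [Hardware Error]: Machine check events logged' | drop_evbug
```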



      • Originally posted by rk17
        Does that mean AMD knew about the bug well before Gentoo users first found it, and started shipping good chips after week 25?
        No. It only means that newer chips don't show this issue.

        Remind yourself that you are only looking at symptoms. A CPU can be perfectly fine but have a narrower tolerance to environmental factors than a newer one. That doesn't per se make it a faulty CPU. To find the exact cause you would first have to narrow it down to a specific unit within the CPU and then monitor voltages, currents, frequencies, temperatures, signals, etc. to see what specifically causes that unit to fail (when it otherwise appears to work just fine). Only then could you say, for instance, that it's caused by a drop in voltage or too high a temperature. The cause doesn't need to lie within the CPU; it can also be an outside component on the mainboard, like a resistor or a capacitor. In that scenario the proper fix would be to unsolder the part and replace it with a different one, but since that's not really an option for most people, it is easier to replace the CPU with a more tolerant one, because it is a component that can be taken out easily. In case you don't know, even parts as simple as a resistor are never exactly the same and have their own tolerances, which can affect logic components, because all logic signals are analogue signals that we interpret as 0s and 1s. So all components have tolerances and in turn create the environment in which other components have to operate. For this to be a fault within the CPU, you would need to show that it's caused by components within it, which is practically impossible if you're not an AMD engineer.

        If AMD fixed the issue before it was even detected, then they likely didn't know about it. The fix is more likely the result of general improvements in the chip design and/or manufacturing process, made to increase the output of high-end Ryzen 7 CPUs like the 1800X and perhaps to pave the way for a future, faster Ryzen 7 1900X or 2000X. Again, you'd have to ask an AMD engineer to know the truth.
        Last edited by sdack; 10 August 2017, 11:38 AM.



        • I've never heard of a hardware bug fixed by accident…

          If AMD have a new stepping with the problem resolved, then they knew about it >5 months before, which means they knew about the defect when launching Ryzen 7, which means there will be lawsuits.

          But it seems that Sweclockers got the B1 stepping, which would just make these heavily binned Ryzen chips.



          • Originally posted by efikkan
            I've never heard of a hardware bug fixed by accident…
            If you can believe a bug to be caused by an accident then you shouldn't have a problem in believing the opposite, of a bug being fixed by an accident.

            Or do you believe that everything people do is purely intentional, purposeful, deliberate and conscious, and that we are spawns of the devil driven only by mischief?



            • Originally posted by sdack
              If you can believe a bug to be caused by an accident then you shouldn't have a problem in believing the opposite, of a bug being fixed by an accident.
              Then you know nothing about development or entropy.
              Bugs are created by accident all the time. The problem in Ryzen is a tricky synchronization issue, and the chances of fixing it without knowing about it are astronomically small. Simply put: if it is fixed in a new stepping, then they knew about it.



              • Originally posted by efikkan
                Then you know nothing about development or entropy.
                Bugs are created by accident all the time. The problem in Ryzen is a tricky synchronization issue, and the chances of fixing it without knowing about it are astronomically small. Simply put: if it is fixed in a new stepping, then they knew about it.
                I agree with you for the most part. People are going to get their processors replaced, so I think the biggest problem was that AMD never communicated the problem to their tech support/customer service reps, and the result was months of zero support. A smaller but still significant problem is that they didn't issue a press release acknowledging it to the rest of the world as soon as they became aware that it was public knowledge.



                • Originally posted by bridgman

                  Actually the idea is to replace the CPU with one which works as expected, but since we started replacing CPUs while still investigating we didn't always get it right at first.
                  I have been holding back on buying a new PC and wanted to go with a Ryzen 7 1800X, but as I am using Gentoo I needed this issue acknowledged. Thanks to your posts I feel safe to go for it now. So thank you for your support here.

                  Also, if the experience with my personal build goes well, maybe the next round of desktop upgrades at work will go the Ryzen route, too.

                  Also, thank you Michael for further raising awareness and for more testing. This is why I have Premium.



                  • Originally posted by duby229

                    I agree with you for the most part. People are going to get their processors replaced, so I think the biggest problem was that AMD never communicated the problem to their tech support/customer service reps, and the result was months of zero support. A smaller but still significant problem is that they didn't issue a press release acknowledging it to the rest of the world as soon as they became aware that it was public knowledge.
                    People say the Linux subsystem on Windows 10 also suffers from the same problem. Is there any data from using the VC++ compiler or other compilers in Visual Studio (running on native Windows 10) to stress-test Ryzen the way it was done on Linux, showing that it ran successfully for at least 20+ hours?

                    There are no Ryzen-based CPUs/APUs for laptops yet, but suppose AMD had released laptop APUs the same day as Ryzen 7 and those APUs were also affected by this bug. If, say, an Asus laptop with a Ryzen APU is officially supported by Asus only on Windows 10, how would a user RMA the laptop after removing Windows 10 and running only Linux? If you ask AMD for help, they say to ask the laptop maker, and if you ask Asus, they say to show the crash in Windows 10.




                    • So I've run some more stress testing overnight, and I've found that the single biggest improvement in stability under extremely high load came from disabling ASLR.

                      For those of you that don't know about ASLR you can read about it here: https://en.wikipedia.org/wiki/Addres..._randomization

                      You can temporarily disable it via the following command:

                      sudo sysctl kernel.randomize_va_space=0

                      You can have this setting survive a reboot via the following command:

                      echo 'kernel.randomize_va_space=0' | sudo tee -a /etc/sysctl.conf

                      Does anyone know the downsides of disabling ASLR? I know it's a security feature.

                      After doing this I was able to run my 1700 @ 3.6 GHz on STOCK voltage with LLC set to mode 1 overnight using Michael's stress test. Please also note that this was with SMT ENABLED. No reboots or segfaults. Temperatures reached up to 70 C but never higher using the stock cooler.

                      I'd be curious if disabling ASLR helps others the way that it's helped me.
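
On the downside question: ASLR makes memory addresses unpredictable, so turning it off system-wide makes memory-corruption bugs easier to exploit. A less invasive way to test the same idea may be to disable it only for the stress-test process. The sketch below assumes util-linux's `setarch` is installed; `describe_aslr` is a made-up helper that only translates the sysctl value.

```shell
#!/bin/sh
# Hypothetical helper: translate a kernel.randomize_va_space value into words.
describe_aslr() {
    case $1 in
        0) echo "disabled" ;;
        1) echo "partial" ;;   # stack, mmap base and VDSO are randomized
        2) echo "full" ;;      # additionally randomizes the heap (brk)
        *) echo "unknown" ;;
    esac
}

# Report the current system-wide setting without changing it.
if [ -r /proc/sys/kernel/randomize_va_space ]; then
    echo "ASLR: $(describe_aslr "$(cat /proc/sys/kernel/randomize_va_space)")"
fi

# Run a single command (and its children) with ASLR off, leaving the rest of
# the system untouched. The make invocation is only an example workload:
#   setarch "$(uname -m)" -R make -j16
```

If the segfaults also disappear under `setarch -R`, that would support the observation above without lowering security for everything else on the machine.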
