AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR

  • Not true!
    Adding norandmaps to the GRUB kernel command line doesn't help any more on my custom kernels 4.12.3 and 4.12.4 (kernel.org).
    https://www.phoronix.com/forums/foru...566#post966566
    Last edited by scorpio810; 10 August 2017, 02:03 PM.


    • Originally posted by efikkan
      Then you know nothing about development or entropy.
      Bugs are created by accident all the time. The problem in Ryzen is a tricky synchronization issue, and the chances of fixing it without knowing about it are astronomically small. Simply put: if it is fixed in a new stepping, then they knew about it.
      No. Take Address Space Layout Randomization (ASLR) in the Linux kernel, for example. You cannot conclude from the existence of ASLR that the kernel devs know about every single attack of the past, present or future. It's simply a prevention mechanism; otherwise it doesn't fix anything or bring other advantages.

      Or a simpler example... When you find your car tyres to be noisy and decide to go with a different brand of tyres, the change can additionally result in lower fuel consumption or a higher top speed. It doesn't have to, and you won't know for sure beforehand unless somebody points it out to you or you notice it yourself. It also doesn't mean your original motivation for changing the tyres has changed when the new ones turn out to have further advantages. It's simply a coincidence. You still only wanted to reduce the noise, but got lower fuel consumption or a higher top speed as well. There's nothing unusual or astronomically rare about it. It can, however, be surprising when one didn't know about it.

      In the same way, a new stepping or chip revision can resolve issues without anyone first knowing about them. Most of the time, new revisions are created in order to make more money, because a revision also costs extra money. Not to mention that AMD no longer owns the production, but lets GlobalFoundries produce its chips. So they'll fix known issues when that warrants the extra cost, but they will also do it to improve yield (more usable dies per wafer, less garbage) and to get faster chips.

      And from what has been said here on the forum, AMD has been replacing chips before they knew how to help their customers. So for me this excludes any "evil doing" on the side of AMD. Rather, it tells me they're committed to resolving the issue beyond what one could consider reasonable, and have replaced chips purely on chance and in the hope that it's going to help.
      Last edited by sdack; 10 August 2017, 02:51 PM.


      • Originally posted by rk17

        People say the Linux subsystem on Windows 10 also suffers from the same problem. Is there any data on using the VC++ compiler or other compilers in Visual Studio (running on native Windows 10) to stress test Ryzen the way it was done on Linux, showing that it ran successfully for at least 20+ hours?

        Although there are no Ryzen-based CPUs/APUs for laptops so far, assume AMD had released laptop APUs the same day as Ryzen 7 and that the APUs were also affected by this bug. Say an Asus laptop with a Ryzen APU is officially supported by Asus only with Windows 10: how do you RMA the laptop if the user removes Windows 10 and uses only Linux? If you ask AMD for help, they say ask the laptop maker, and if I ask Asus, they say show the crash in Windows 10.

        Windows has a totally different type of threading model. Threads in Windows are fatter than threads in Linux. I don't think it's possible to make Windows work as hard as you can make Linux work.


        • Originally posted by rstrube
          So I've run some more stress testing overnight and I've found that the single biggest impact on stability under extremely high loads comes from disabling ASLR.

          Does anyone know the downsides of disabling ASLR? I know it's a security feature.
          ASLR is the protection against data leaks from buffer overflows etc. It's super important for servers, but less so for desktops. (But I would still prefer to have it enabled; your web browser and more are exposed to the Internet.) Performance-wise it can only have a negative impact, and it does add extra stress on the prefetcher, which is why it affects the number of incidents from the Ryzen bug. The Ryzen bug is a synchronization bug in the prefetcher. Disabling ASLR greatly reduces the incidents, but it does not eliminate them. Disabling the micro-op cache in the BIOS will also help, but that has a performance impact. But all of these are only acceptable measures until AMD creates a new stepping without the bug.
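For anyone who wants to confirm whether ASLR is actually on or off before a stress run, here is a minimal Python sketch. It assumes a Linux system that exposes `/proc/sys/kernel/randomize_va_space`; the helper names `describe_aslr` and `current_aslr` are made up for this example. Booting with `norandmaps` should show up here as mode 0.

```python
from pathlib import Path

# Meanings of /proc/sys/kernel/randomize_va_space as described in the
# Linux kernel sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "conservative: stack, mmap base, shared libraries and vDSO are randomized",
    2: "full: as mode 1, plus the heap (brk) is randomized",
}


def describe_aslr(raw: str) -> str:
    """Translate the sysctl value (as text) into a readable description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def current_aslr(path: str = "/proc/sys/kernel/randomize_va_space") -> str:
    """Read the live setting; returns 'unavailable' if the file cannot be read."""
    try:
        return describe_aslr(Path(path).read_text())
    except (OSError, ValueError):
        return "unavailable"


if __name__ == "__main__":
    print("ASLR:", current_aslr())
```

This only reports the setting. Changing it needs root (writing to the same file, or the `norandmaps` boot parameter) and trades away the security benefit discussed above.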


          • Originally posted by efikkan
            ASLR is the protection against data leaks from buffer overflows etc. ...
            That's not what it does. What it does is randomise the address space layout of executable code. This means that in one instance of an executable a function f() is found at address A, while the same function f() in another instance is found at address B. This makes it pretty hard for malicious code to call function f() and abuse it, because the malicious code can no longer rely on f() always being located at the same address. ASLR turns the code in an executable into a "moving target" for a hacker, so to speak. For this reason executables also need to be compiled as position independent code (PIC), so that they themselves no longer use fixed addressing but only relative addressing; otherwise they cannot benefit from ASLR and ASLR will be disabled for them.
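The point about f() living at address A in one instance and B in another, while the code keeps working through relative addressing, can be shown with a toy model. This is only a sketch and not how the kernel picks addresses: the page size, the address range and the offsets of the hypothetical functions `f` and `g` are all invented for the example.

```python
import random

PAGE = 4096  # assumed page size; load bases are page-aligned in this toy model

# Hypothetical offsets of two functions inside one executable image.
OFFSETS = {"f": 0x1130, "g": 0x1190}


def load_image(rng: random.Random) -> dict:
    """Pick a random page-aligned base and return absolute addresses."""
    base = rng.randrange(0x1000, 0x100000) * PAGE
    return {name: base + off for name, off in OFFSETS.items()}


def distance(image: dict) -> int:
    """Relative distance between f and g, which is all that PIC code relies on."""
    return image["g"] - image["f"]


if __name__ == "__main__":
    rng = random.Random()
    a, b = load_image(rng), load_image(rng)
    print(hex(a["f"]), hex(b["f"]))    # absolute addresses normally differ
    print(distance(a) == distance(b))  # the relative layout is unchanged
```

An attacker who hard-codes the absolute address of `f` from one run will usually miss it in the next, whereas code that only uses the distance between `f` and `g` is unaffected by the base.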

            Malicious code is often injected with buffer overflow attacks, where a program gets input from an unknown source (e.g. from the Internet) but assumes the sender will send only a limited amount of data. An attacker can exploit the naivety of the programmer by sending more data than the programmer expected, thereby causing a buffer in the program to overrun, which then begins to overwrite other data, e.g. the stack and with it the saved program counter. To make this short, it allows an attacker to inject code into a program and have it executed. However, for the injected code to do really bad things it needs to know where to find certain functions, like those for reading files or opening an Internet connection.

            Malicious code can then, once an attack is successful, do all sorts of things. Leaking data is just one of the many things. ASLR is however not a, or the, protection against data leaks. Data can get leaked in several other ways and ASLR doesn't protect against these. A lot of programs aren't executable code, but are scripts run by interpreters such as php, tcl/tk, perl, etc, where ASLR can only protect the interpreter, but cannot prevent an attack on the scripts themselves.

            Now, what ASLR has to do with the prefetcher and the micro-op cache can only be speculated about. However, a randomised address space should cause instruction caches to have more misses. So when a compile job spawns a compiler 100 times and the instances no longer share the same executable layout, because each instance has its layout randomised, the CPU won't "see" the code of one executable being executed 100 times in parallel, but instead "sees" 100 different processes all running in parallel. This will lead to an increased number of misses in the instruction cache and with it the micro-op cache. Whether that's a bug inside the CPU or whether it's caused by an outside component is still uncertain. The fault can be within one of the caches, or in a nearby unit that is affected by the increased heat. It can also be a remote unit that isn't getting enough power in this situation. A fix could be as simple as increasing the size of a capacitor on the mainboards to provide more power to one of the 1331 pins on the AM4 socket, or as complex as revising the entire CPU design. Maybe it's never going to get fixed, because it no longer shows with newer chips. That's entirely for AMD to say, but not for you or me.
            Last edited by sdack; 12 August 2017, 03:38 AM.


            • Hi guys,
              does anybody know why save-ryzen.sh does not reveal the error, while kill-ryzen.sh does?

              The only difference is that save-ryzen.sh calls taskset -c and assigns the jobs to the cores in the right order 0, 1, 2, ...

              If we take Ryzen 7, then 0..7 are independent cores and 8..15 their HT siblings, right? So first putting the load on the independent cores and only then on their siblings does not produce the error?

              I think the Linux scheduler is actually doing the opposite. If you have a lot of single-threaded tasks, it will put a new task on a core whose sibling is already busy. Why is that? To save power? It definitely reduces the performance.

              Conversely, one multi-threaded task is loaded first onto cores whose siblings are idle. So for a multi-threaded load (think of make -j8) the scheduler behaves as expected.

              Is there a way (a switch?) to tune the Linux scheduler to behave the same for many single threaded loads (think of 8x make -j1)? I.e. at first not loading the cores, whose siblings are busy?
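Whether 0..7 and 8..15 really are the core/sibling split depends on how the kernel enumerates the CPUs, so it is safer to read the topology than to assume it. The sketch below reads the sibling lists that Linux exposes under `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` and derives a pinning order that uses one thread of every core before touching any sibling. The function names are made up for this example; it does not change scheduler behaviour, it only computes the order one could pass to `taskset -c` or `os.sched_setaffinity`.

```python
import glob


def parse_cpu_list(text: str) -> list:
    """Expand a sysfs CPU list such as '0,8' or '0-1' into [0, 8] or [0, 1]."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def spread_order(sibling_groups: list) -> list:
    """Order CPUs so every core gets one task before any sibling is used."""
    # Each core reports the same group once per thread, so deduplicate first.
    groups = sorted({tuple(sorted(g)) for g in sibling_groups})
    depth = max((len(g) for g in groups), default=0)
    return [g[i] for i in range(depth) for g in groups if i < len(g)]


def read_sibling_groups() -> list:
    """Read the SMT sibling groups that Linux exposes in sysfs."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    groups = []
    for path in glob.glob(pattern):
        with open(path) as fh:
            groups.append(parse_cpu_list(fh.read()))
    return groups


if __name__ == "__main__":
    print("pin tasks in this order:", spread_order(read_sibling_groups()))
```

Starting the n-th single-threaded job with `taskset -c` on the n-th entry of that list would approximate the "idle cores first" placement asked about above, under the assumption that nothing else is competing for those CPUs.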


              • It seems I'm also affected, with a Ryzen 1800X. Yesterday I contacted AMD support, but no automatic confirmation e-mail was sent. I will wait a couple of days for their response, but AMD should send an automatic confirmation e-mail.

                Is there any other solution in the works instead of RMA? Any BIOS update or something like that?

                Thanks,


                • Yes, disable opcache control in your BIOS.
                  What is your motherboard?
                  Last edited by scorpio810; 11 August 2017, 08:22 AM.


                  • Thanks scorpio810, my motherboard is an Asus PRIME X370-PRO updated to BIOS 805. I see that today they released a new BIOS, 807, with the description: "Add CBS item". When I get home later I will try it. I don't know where the opcache option is on my motherboard, but in some screenshots of MSI motherboard BIOSes I can see that the opcache is under the "amd cbs" options, so maybe this new BIOS has it.

                    Thanks,


                    • Originally posted by Khudsa
                      Thanks scorpio810, my motherboard is an Asus PRIME X370-PRO updated to BIOS 805. I see that today they released a new BIOS, 807, with the description: "Add CBS item". When I get home later I will try it. I don't know where the opcache option is on my motherboard, but in some screenshots of MSI motherboard BIOSes I can see that the opcache is under the "amd cbs" options, so maybe this new BIOS has it.

                      Thanks,
                      It's not a fix though, it just reduces the occurrences (and it has a performance impact, not sure how bad yet). The only real fix I've seen so far is RMA. AMD has known-good batches and is offering them through its tech support services.
                      Last edited by duby229; 11 August 2017, 09:32 AM.
