AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR

  • Originally posted by efikkan
    Disabling the micro-op cache in the BIOS will also help, but that has a performance impact. All of these are only acceptable as stopgap measures until AMD creates a new stepping without the bug.
    A new stepping is probably not required: the good RMA'd chips coming back have a build week of 25 (2017). Threadrippers have a build week of 27 (2017).


    • Maybe now fully fixed in DragonFly BSD? http://lists.dragonflybsd.org/piperm...st/626190.html


      • Originally posted by Funks
        On Ryzen, it would be weird if zram had issues; if it did, the Linux disk block cache would have problems as well. Compiling on zram is a good thing for these tests, as it actually makes the CPU do even more work.
        The only issues I am aware of with zram on Ryzen are (a) the script sets the maximum size to 64GB and it fills up with 16 threads (leading to out-of-space errors and build failures), and (b) zram consumes enough memory that the remaining RAM is insufficient and the OOM killer starts killing things.

        The second one (running out of physical RAM) happens with 16GB and 16 threads - not sure how much having a swap partition helps.
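The second failure mode is just arithmetic, so here is a rough sketch of it. The function name, the per-job memory figure, and the compression ratio are all assumptions for illustration; none of them come from the script or from measurements in this thread.

```python
def fits_in_ram(ram_gib, zram_used_gib, jobs, per_job_gib, compression_ratio=1.0):
    """Estimate whether zram contents plus compiler working memory fit in RAM.

    All inputs are rough guesses supplied by the caller (hypothetical helper):
      ram_gib           physical RAM in GiB
      zram_used_gib     uncompressed data currently stored in zram
      jobs              number of parallel build loops
      per_job_gib       assumed peak memory of one build loop
      compression_ratio uncompressed/compressed size achieved by zram
    Returns (fits, needed_gib).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    # zram pages live in RAM in compressed form
    zram_resident_gib = zram_used_gib / compression_ratio
    needed_gib = zram_resident_gib + jobs * per_job_gib
    return needed_gib <= ram_gib, needed_gib


# Illustrative numbers only: 16 GiB RAM, 16 loops, 0.5 GiB per loop,
# 12 GiB of build trees in zram with no compression gain.
print(fits_in_ram(16, 12, 16, 0.5))
```

With those made-up inputs the estimate does not fit, which is the situation where the OOM killer would be expected to step in. A swap partition adds headroom to the right-hand side of the comparison but does not change how much the builds themselves need.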


        • Originally posted by rene
          Maybe now fully fixed in DragonFly BSD? http://lists.dragonflybsd.org/piperm...st/626190.html
          Yep, that looks good.


          • Originally posted by Khudsa
            ...
            edit: does this also mean that some cache is damaged on the CPU?
            It depends on which answer you like better...

            1.) Yes, your cache is broken and you can go and watch Game of Thrones now.

            2.) No. It doesn't mean the cache is the cause, only that turning it off is a solution. Others have reported that disabling ASLR in the kernel can also help. That, too, doesn't mean ASLR is broken. One can only speculate as to what the exact cause is. If a cache were broken, it would likely be easier to reproduce, I'd imagine. It's possible that turning off the cache is just the easiest solution, with only a minimal impact on performance and without requiring customers to get the soldering iron out. It only requires a BIOS update, but it doesn't necessarily mean the cache itself is the cause. It could be a thermal issue that only builds up after many cache misses have occurred, heating up an area of the chip and causing the cache or a nearby unit to fail. It could also mean a unit isn't getting enough power due to the heat (heat increases resistances and leakage), or that it's using too much power, leaving less available for other parts.

            The Phoronix article states that AMD found the problem to be very complex. You'll have to accept this as the only true answer.

            Soon, however, people will flock to the simplest, but not necessarily most accurate, answer. People will remember "the thing with the cache", and this will then also get remembered as the cause. Although it's incorrect, it won't matter, because most people are only interested in a solution and don't necessarily want to understand the cause first.
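For anyone who wants to note the ASLR state alongside their test results, here is a minimal sketch. It assumes a Linux system exposing the standard `/proc/sys/kernel/randomize_va_space` sysctl; the function names are made up for illustration, and the mode descriptions paraphrase the kernel documentation.

```python
from pathlib import Path

# Standard Linux sysctl file for ASLR; absent on other operating systems.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

ASLR_MODES = {
    0: "disabled",
    1: "conservative randomization (stack, mmap base, VDSO)",
    2: "full randomization (also randomizes the heap/brk)",
}


def describe_aslr(raw):
    """Turn the sysctl's text content (e.g. '2\\n') into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def current_aslr(path=ASLR_SYSCTL):
    """Return a description of the running kernel's ASLR mode, or None if unreadable."""
    try:
        return describe_aslr(path.read_text())
    except OSError:
        return None
```

Calling `current_aslr()` before each run and logging the result next to the time-to-fail figures makes it easier to compare reports. As said above, a run that survives with ASLR off does not show that ASLR is at fault.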
            Last edited by sdack; 12 August 2017, 08:26 AM.


            • I'm running kill-ryzen.sh and get this:

              [loop-6] Fri Aug 11 21:07:10 CDT 2017 start 0
              [loop-7] Fri Aug 11 21:07:11 CDT 2017 start 0
              [loop-6] Fri Aug 11 21:08:57 CDT 2017 build failed
              [loop-6] TIME TO FAIL: 113 s
              [loop-4] Fri Aug 11 21:09:14 CDT 2017 build failed
              [loop-2] Fri Aug 11 21:09:14 CDT 2017 build failed
              [loop-4] TIME TO FAIL: 130 s
              [loop-2] TIME TO FAIL: 130 s

              So perhaps my Ryzen 5 1500X is not immune to the problem. I will try disabling opcache if I can figure out which is the correct option in the BIOS.

              On starting the script it took up about 2GB of memory and started 8 threads. Memory is still not maxed out and I have Chromium running right now. Do I really need 16GB to run this? I only have 8GB.

              BTW, when my kernel build failed it also took 2 Chromium tabs with it, so maybe in that case it was more of a general system instability issue.
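To compare runs between machines, it helps to pull the "TIME TO FAIL" figures out of the script output. The sketch below assumes the log format shown in the output above; the function name is made up and is not part of kill-ryzen.sh.

```python
import re

# Matches lines such as "[loop-6] TIME TO FAIL: 113 s"
FAIL_RE = re.compile(r"^\[(loop-\d+)\] TIME TO FAIL: (\d+) s$")


def times_to_fail(log_text):
    """Map each failed loop name to its reported time to fail in seconds."""
    result = {}
    for line in log_text.splitlines():
        match = FAIL_RE.match(line.strip())
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


sample = """\
[loop-6] Fri Aug 11 21:07:10 CDT 2017 start 0
[loop-7] Fri Aug 11 21:07:11 CDT 2017 start 0
[loop-6] Fri Aug 11 21:08:57 CDT 2017 build failed
[loop-6] TIME TO FAIL: 113 s
[loop-4] Fri Aug 11 21:09:14 CDT 2017 build failed
[loop-2] Fri Aug 11 21:09:14 CDT 2017 build failed
[loop-4] TIME TO FAIL: 130 s
[loop-2] TIME TO FAIL: 130 s
"""

failures = times_to_fail(sample)
print(failures)                 # {'loop-6': 113, 'loop-4': 130, 'loop-2': 130}
print(min(failures.values()))   # 113
```

The earliest failure is the most useful single number to report, since later failures may be knock-on effects of memory pressure or other loops dying.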


              • bridgman: Disabling opcache prevents errors on kill-ryzen.sh for me. Who do I contact for an RMA?

                Is it better to RMA with AMD directly or go through the store you got the CPU from (Newegg)?


                • Originally posted by rene
                  Maybe now fully fixed in DragonFly BSD? http://lists.dragonflybsd.org/piperm...st/626190.html
                  This is not a fix. It is a workaround. Only AMD can fix this bug. But AMD says it is a Linux-only bug ...


                  • Originally posted by drSeehas
                    This is not a fix. It is a workaround. Only AMD can fix this bug. But AMD says it is a Linux-only bug ...
                    Yes, I should indeed have written "worked around", but technically, if this fixes the issue, it is also a fix ;-) Just like the dozens of Intel errata workarounds glued into the Linux and BSD kernels ... :-/ TL;DR: modern CPUs have become too complex, unfortunately.

                    Maybe it is time to think about an architecture simpler than x86, without the millions of legacy features and exceptions, and instead with more lightweight but more numerous cores for modern, parallel loads. In principle I also liked the Transmeta idea: a simple, massively parallel VLIW with software glue for the actual memory protection implementation. With the VLIW architecture being open (Transmeta's was closed), we could implement the actual "code morphing" virtualization / memory protection layer on top of it, and fix such bugs ourselves ...


                    • I tried changing my RAM sticks with Opcache enabled (auto), custom kernel 4.12.5, ASLR on.

                      First I tried other RAM sticks: F4-2400C15-16GVR -> BLS8G4D26BFSC.16FBR2.
                      No luck ...
                      With the F4-2400C15-16GVR sticks (not in the QVL) and manual settings (A-XMP profile 2, 2T CR, ProcODT 53.3 ohms, VRAM raised from 1.2V to 1.35V), kill-ryzen ran for ~4 hours with no errors, but was then killed because it ran out of memory.

                      Now with the BLS8G4D26BFSC.16FBR2 sticks (in the QVL), A-XMP profile 2 on auto and nothing else changed [1T CR (@1.2V)]:

                      Code:
                      [loop-0] Sat Aug 12 12:48:36 CEST 2017 start 0
                      [loop-1] Sat Aug 12 12:48:37 CEST 2017 start 0
                      [loop-2] Sat Aug 12 12:48:38 CEST 2017 start 0
                      [loop-3] Sat Aug 12 12:48:39 CEST 2017 start 0
                      [loop-4] Sat Aug 12 12:48:40 CEST 2017 start 0
                      [loop-5] Sat Aug 12 12:48:41 CEST 2017 start 0
                      [loop-6] Sat Aug 12 12:48:42 CEST 2017 start 0
                      [loop-7] Sat Aug 12 12:48:43 CEST 2017 start 0
                      [loop-8] Sat Aug 12 12:48:44 CEST 2017 start 0
                      [loop-9] Sat Aug 12 12:48:45 CEST 2017 start 0
                      [loop-10] Sat Aug 12 12:48:46 CEST 2017 start 0
                      [loop-11] Sat Aug 12 12:48:47 CEST 2017 start 0
                      [loop-12] Sat Aug 12 12:48:48 CEST 2017 start 0
                      [loop-13] Sat Aug 12 12:48:49 CEST 2017 start 0
                      [loop-14] Sat Aug 12 12:48:50 CEST 2017 start 0
                      [loop-15] Sat Aug 12 12:48:51 CEST 2017 start 0
                      [loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
                      [loop-2] TIME TO FAIL: 1744 s
                      [loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
                      [loop-4] TIME TO FAIL: 2992 s

                      sudo dmesg -wH:

                      Code:
                      [août12 12:35] logitech-hidpp-device 0003:046D:400A.0007: HID++ 2.0 device connected.
                      [août12 13:38] bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]

                      ryzen-test-master/buildloop.d/loop-4/build.log:

                      Code:
                      Makefile:414: recipe for target 'librandom.la' failed
                      make[5]: *** [librandom.la] Segmentation fault
                      make[5]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4/gmp/rand'
                      Makefile:954: recipe for target 'all-recursive' failed
                      make[4]: *** [all-recursive] Error 1
                      make[4]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4/gmp'
                      Makefile:773: recipe for target 'all' failed
                      make[3]: *** [all] Error 2
                      make[3]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4/gmp'
                      Makefile:5562: recipe for target 'all-stage2-gmp' failed
                      make[2]: *** [all-stage2-gmp] Error 2
                      make[2]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4'
                      Makefile:27236: recipe for target 'stage2-bubble' failed
                      make[1]: *** [stage2-bubble] Error 2
                      make[1]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4'
                      Makefile:941: recipe for target 'all' failed
                      make: *** [all] Error 2

                      I'm continuing to run the kill-ryzen script.
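The dmesg segfault line can be decoded a bit further. This is a minimal sketch, assuming the usual x86 Linux message format (all numeric fields in hex, error code being the page-fault bits); the function name and the returned keys are made up for illustration.

```python
import re

SEGV_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the faulting object, code offset and fault type from a dmesg line."""
    match = SEGV_RE.search(line)
    if match is None:
        return None
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    err = int(match["err"], 16)
    return {
        "object": match["obj"],
        "offset": ip - base,              # instruction offset inside the mapping
        "page_present": bool(err & 1),    # 0 = page not present, 1 = protection fault
        "write": bool(err & 2),           # 0 = read access, 1 = write access
        "user_mode": bool(err & 4),       # 1 = fault happened in user space
    }


line = ("bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 "
        "sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]")
info = decode_segfault(line)
print(info["object"], hex(info["offset"]))   # libc-2.24.so 0x129330
```

For the line above this reports a user-mode read of a non-present page at offset 0x129330 inside libc-2.24.so. With matching debug symbols that offset could be looked up with a tool such as addr2line, but a crash inside bash and libc rather than in the code being compiled already fits the pattern of seemingly random failures reported in this thread.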
                      Last edited by scorpio810; 12 August 2017, 08:41 AM.
