AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR


  • I tried swapping my RAM sticks, with Opcache enabled (Auto), on a custom 4.12.5 kernel with ASLR on.

    I swapped the F4-2400C15-16GVR sticks for BLS8G4D26BFSC.16FBR2.
    No luck ...
    The previous sticks, F4-2400C15-16GVR (not in the QVL), with manual settings (A-XMP profile 2, 2T command rate, ProcODT 53.3 ohms, RAM voltage raised from 1.2 V to 1.35 V), ran kill-ryzen for ~4 hours with no errors, until the script was killed because the system ran out of memory.

    Now with BLS8G4D26BFSC.16FBR2 (in the QVL), A-XMP profile 2 on Auto and nothing else changed [1T command rate, 1.2 V]:

    Code:
    [loop-0] Sat Aug 12 12:48:36 CEST 2017 start 0
    [loop-1] Sat Aug 12 12:48:37 CEST 2017 start 0
    [loop-2] Sat Aug 12 12:48:38 CEST 2017 start 0
    [loop-3] Sat Aug 12 12:48:39 CEST 2017 start 0
    [loop-4] Sat Aug 12 12:48:40 CEST 2017 start 0
    [loop-5] Sat Aug 12 12:48:41 CEST 2017 start 0
    [loop-6] Sat Aug 12 12:48:42 CEST 2017 start 0
    [loop-7] Sat Aug 12 12:48:43 CEST 2017 start 0
    [loop-8] Sat Aug 12 12:48:44 CEST 2017 start 0
    [loop-9] Sat Aug 12 12:48:45 CEST 2017 start 0
    [loop-10] Sat Aug 12 12:48:46 CEST 2017 start 0
    [loop-11] Sat Aug 12 12:48:47 CEST 2017 start 0
    [loop-12] Sat Aug 12 12:48:48 CEST 2017 start 0
    [loop-13] Sat Aug 12 12:48:49 CEST 2017 start 0
    [loop-14] Sat Aug 12 12:48:50 CEST 2017 start 0
    [loop-15] Sat Aug 12 12:48:51 CEST 2017 start 0
    [loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
    [loop-2] TIME TO FAIL: 1744 s
    [loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
    [loop-4] TIME TO FAIL: 2992 s
    Output of sudo dmesg -wH:

    Code:
    [août12 12:35] logitech-hidpp-device 0003:046D:400A.0007: HID++ 2.0 device connected.
    [août12 13:38] bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]
    From ryzen-test-master/buildloop.d/loop-4/build.log:

    Code:
    Makefile:414: recipe for target 'librandom.la' failed
    make[5]: *** [librandom.la] Segmentation fault
    make[5]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4/gmp/rand'
    Makefile:954: recipe for target 'all-recursive' failed
    make[4]: *** [all-recursive] Error 1
    make[4]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4/gmp'
    Makefile:773: recipe for target 'all' failed
    make[3]: *** [all] Error 2
    make[3]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4/gmp'
    Makefile:5562: recipe for target 'all-stage2-gmp' failed
    make[2]: *** [all-stage2-gmp] Error 2
    make[2]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4'
    Makefile:27236: recipe for target 'stage2-bubble' failed
    make[1]: *** [stage2-bubble] Error 2
    make[1]: Leaving directory '/media/backup6/download/ryzen-test-master/buildloop.d/loop-4'
    Makefile:941: recipe for target 'all' failed
    make: *** [all] Error 2
    I am letting the kill-ryzen script continue.
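
For anyone comparing runs, the failure times can be pulled out of a log like the one above with a few lines of Python. This is only a minimal sketch: it assumes the "TIME TO FAIL" lines have exactly the format shown in this post, and the function name is made up for illustration.

```python
import re

# Matches lines such as "[loop-4] TIME TO FAIL: 2992 s" (format taken from the log above).
FAIL_RE = re.compile(r"^\[loop-(\d+)\] TIME TO FAIL: (\d+) s$")


def times_to_fail(log_text):
    """Return {loop_number: seconds_until_failure} for every loop that failed."""
    result = {}
    for line in log_text.splitlines():
        match = FAIL_RE.match(line.strip())
        if match:
            result[int(match.group(1))] = int(match.group(2))
    return result


# The two failures reported in this post.
SAMPLE = """\
[loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
[loop-2] TIME TO FAIL: 1744 s
[loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
[loop-4] TIME TO FAIL: 2992 s
"""

if __name__ == "__main__":
    for loop, secs in sorted(times_to_fail(SAMPLE).items()):
        print(f"loop-{loop}: failed after {secs // 60} min {secs % 60} s")
```

Loops that never fail simply do not appear in the result, so an empty dictionary means a clean run so far.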
    Last edited by scorpio810; 08-12-2017, 08:41 AM.


    • Originally posted by rene View Post

      Yes, I should indeed have written "work around'ed", but technically, if this fixes the issue, it is also a fix ;-) Just like the dozens of pieces of Intel errata glue in the Linux or BSD kernels ... :-/ TL;DR: modern CPUs have become too complex, unfortunately.

      Maybe it is time to think about an architecture simpler than x86, without the millions of legacy quirks and exceptions: more lightweight, but with even more cores for modern, parallel loads. In principle I also liked the Transmeta idea: a simple, massively parallel VLIW with software glue for the actual memory protection implementation. With an open VLIW architecture (Transmeta's was closed), we could implement the actual "code morphing" virtualization / memory protection layer on top of it, and fix such bugs ourselves ...
      VLIW was never the right way to go, just too much overhead in translation. But other than them choosing the obviously wrong architecture, I definitely agree with you.


      • Originally posted by keantoken View Post
        bridgman Disabling opcache prevents errors on kill-ryzen.sh for me. Who do I contact for RMA?

        Is it better to RMA with AMD directly or go through the store you got the CPU from (newegg)?
        Same for me: when I disable the opcache option, it works without errors. I have never done an RMA. How does the process work? Do I first contact AMD (already done, waiting for an answer) and then contact the store (pccomponentes, the official store in Spain for the Ryzen release) with AMD's reply?

        Thanks


        • I think my pre-ordered 1700X, bought on 2 April 2017, segfaults only on CPU core 5? The script has continued without further errors for the moment ...

          Edit: I spoke too soon ... krkrkr

          Code:
          [août12 12:35] logitech-hidpp-device 0003:046D:400A.0007: HID++ 2.0 device connected.
          [août12 13:38] bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]
          [août12 15:44] bash[10807]: segfault at 7fba26df87e8 ip 00007fba26b14330 sp 00007ffd63026c28 error 4 in libc-2.24.so[7fba269eb000+193000]
          [août12 15:45] bash[30145]: segfault at 12 ip 0000000000435d7e sp 00007ffcdc40fc30 error 6 in bash[400000+100000]
          Code:
          [loop-0] Sat Aug 12 12:48:36 CEST 2017 start 0
          [loop-1] Sat Aug 12 12:48:37 CEST 2017 start 0
          [loop-2] Sat Aug 12 12:48:38 CEST 2017 start 0
          [loop-3] Sat Aug 12 12:48:39 CEST 2017 start 0
          [loop-4] Sat Aug 12 12:48:40 CEST 2017 start 0
          [loop-5] Sat Aug 12 12:48:41 CEST 2017 start 0
          [loop-6] Sat Aug 12 12:48:42 CEST 2017 start 0
          [loop-7] Sat Aug 12 12:48:43 CEST 2017 start 0
          [loop-8] Sat Aug 12 12:48:44 CEST 2017 start 0
          [loop-9] Sat Aug 12 12:48:45 CEST 2017 start 0
          [loop-10] Sat Aug 12 12:48:46 CEST 2017 start 0
          [loop-11] Sat Aug 12 12:48:47 CEST 2017 start 0
          [loop-12] Sat Aug 12 12:48:48 CEST 2017 start 0
          [loop-13] Sat Aug 12 12:48:49 CEST 2017 start 0
          [loop-14] Sat Aug 12 12:48:50 CEST 2017 start 0
          [loop-15] Sat Aug 12 12:48:51 CEST 2017 start 0
          [loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
          [loop-2] TIME TO FAIL: 1744 s
          [loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
          [loop-4] TIME TO FAIL: 2992 s
          [loop-10] Sat Aug 12 15:44:59 CEST 2017 build failed
          [loop-10] TIME TO FAIL: 10583 s
          [loop-1] Sat Aug 12 15:45:08 CEST 2017 build failed
          [loop-1] TIME TO FAIL: 10592 s
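
The dmesg segfault lines above can be decoded a little further. The sketch below is a rough illustration, not a diagnostic tool: it assumes the kernel's usual x86 line format (all numbers in hex, including the error code), and the function name is invented for this example. It computes the faulting instruction's offset inside the mapped object and splits the page-fault error code into its three low bits.

```python
import re

# Layout of a kernel "segfault at ..." line as seen in the dmesg output above.
SEGV_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing one segfault line, or None if the line does not match."""
    match = SEGV_RE.search(line)
    if match is None:
        return None
    err = int(match.group("err"), 16)
    return {
        "object": match.group("obj"),
        # Offset of the faulting instruction from the start of the mapping.
        "offset": int(match.group("ip"), 16) - int(match.group("base"), 16),
        "fault_addr": int(match.group("addr"), 16),
        # x86 page-fault error code bits.
        "protection_fault": bool(err & 1),  # 0 means the page was not present
        "write": bool(err & 2),             # 0 means the access was a read
        "user_mode": bool(err & 4),
    }


LINES = [
    "bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]",
    "bash[10807]: segfault at 7fba26df87e8 ip 00007fba26b14330 sp 00007ffd63026c28 error 4 in libc-2.24.so[7fba269eb000+193000]",
    "bash[30145]: segfault at 12 ip 0000000000435d7e sp 00007ffcdc40fc30 error 6 in bash[400000+100000]",
]

if __name__ == "__main__":
    for line in LINES:
        info = decode_segfault(line)
        print(info["object"], hex(info["offset"]), "write" if info["write"] else "read")
```

Comparing offsets across crashes shows whether the faults land on the same instruction despite ASLR moving the library base between processes.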
          Last edited by scorpio810; 08-12-2017, 09:50 AM.


          • Originally posted by drSeehas View Post
            This is not a fix. It is a workaround. Only AMD can fix this bug. But AMD says it is a Linux-only bug ...
            You guys may be mixing issues here. Linux (and Windows AFAIK) have had similar code for years (leaving an unused "guard page" at the top of user process address space) to deal with errata in other processors... BSD had workarounds in place as well, but they used a smaller guard region (less than a page) which was sufficient for previous processors. Not a "better or worse" thing, just a different design decision.

            Ryzen needs a full guard page at the top rather than just a guard region, and so the BSD devs have updated their code accordingly. Linux and Windows "got lucky" in this case because the guard page added for errata in previous CPUs also worked for Ryzen.

            We are checking to make sure that the combination of address space randomization and transparent huge page migration will never replace a collection of 4K pages at the top of user memory with a single 2M page (which would effectively remove the guard page). We don't think it will happen because of the unused guard page and the OS-managed write protection of the vsyscall page but need to be sure.
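
To make the 4K versus 2M concern concrete, here is the address arithmetic as a small Python sketch. It is an illustration under stated assumptions, not a description of what the kernel actually does: it assumes x86-64 with 4-level paging (user half of the address space ending at 2**47), and all names are made up for the example.

```python
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024

# Assumption: x86-64 with 4-level paging, where the user half of the
# canonical address space ends at 2**47.
USER_SPACE_END = 1 << 47


def guard_page(end=USER_SPACE_END):
    """Return (start, end) of the last 4K page below `end`, i.e. the unused guard page."""
    return (end - PAGE_4K, end)


def huge_page_containing(addr):
    """Return (start, end) of the 2M-aligned region that contains `addr`."""
    start = addr & ~(PAGE_2M - 1)
    return (start, start + PAGE_2M)


def huge_page_would_swallow_guard(end=USER_SPACE_END):
    """True if a 2M page covering the last usable byte would also cover the guard page."""
    g_start, g_end = guard_page(end)
    h_start, h_end = huge_page_containing(g_start - 1)  # g_start - 1 is the last usable byte
    return h_start <= g_start and g_end <= h_end


if __name__ == "__main__":
    g_start, g_end = guard_page()
    print(f"guard page: {g_start:#x} .. {g_end:#x}")
    print("2M page at the top would include it:", huge_page_would_swallow_guard())
```

The point of the arithmetic is only that the topmost 2M-aligned region necessarily overlaps the guard page, which is why a huge-page replacement there would matter.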
            Last edited by bridgman; 08-12-2017, 02:25 PM.


            • Originally posted by bridgman View Post
              ... Ryzen needs a full guard page at the top rather than just a guard region, ...
              Was this ever documented/communicated and when?

              ... and so the BSD devs have updated their code accordingly. Linux and Windows "got lucky" in this case because the guard page added for errata in previous CPUs also worked for Ryzen. ...
              So there are two "bugs", but only one of them shows up on Linux and Windows?


              • Originally posted by drSeehas View Post
                Was this ever documented/communicated and when?
                There is an errata document but I'm not sure if it has been published yet. I asked about status last week.

                Originally posted by drSeehas View Post
                So there are two "bugs", but only one of them shows up on Linux and Windows?
                IIRC a typical modern CPU has somewhere between 10 and 100 errata (check the revision guides for any recent CPU).

                I do not expect this specific one to cause problems on Linux or Windows but as I said we are checking to make sure that transparent huge page logic (THP) in Linux does not need an additional tweak.
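
For readers who want to see which THP and ASLR settings their own system is running with, a minimal Python sketch follows. It reads the standard Linux files /sys/kernel/mm/transparent_hugepage/enabled and /proc/sys/kernel/randomize_va_space; the helper names are invented for this example, and it only reports settings, it says nothing about whether a system is affected.

```python
import re
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")
ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")


def active_thp_mode(text):
    """Extract the bracketed choice from a line such as 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_or_none(path):
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    thp = read_or_none(THP_PATH)
    aslr = read_or_none(ASLR_PATH)
    print("THP mode:", active_thp_mode(thp) if thp else "unavailable")
    print("randomize_va_space:", aslr if aslr is not None else "unavailable")
```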
                Last edited by bridgman; 08-12-2017, 03:37 PM.


                • Originally posted by Khudsa View Post

                  Same for me, when I disable the opcache option it works without errors. ...
                  I can confirm that disabling opcache eliminates segfault reports generated by kill-ryzen when opcache is set at Auto (at least for the period of 2 hours before I stopped testing). Impact of the opcache disable varies. Unigine Valley and Superposition scores and average frames per second are effectively unchanged. Latency and read rate per Intel MLC are effectively unchanged. Blender rendering, however, does take a few percent longer for the Ryzen logo and the Blender home page 'Classroom' renders. My measurements are not statistically significant, however, so YMMV.
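
One way to judge whether a "few percent" difference stands out from run-to-run noise is to repeat each render several times and compare the mean slowdown with the spread. The sketch below is only an illustration: the function names are invented, and the timings in the example are placeholders, not measurements from this thread.

```python
from statistics import mean, stdev


def slowdown_percent(baseline_times, test_times):
    """Mean slowdown of test_times relative to baseline_times, in percent."""
    return (mean(test_times) / mean(baseline_times) - 1.0) * 100.0


def spread_percent(times):
    """Sample standard deviation as a percentage of the mean (needs >= 2 runs)."""
    return stdev(times) / mean(times) * 100.0


if __name__ == "__main__":
    # Placeholder render times in seconds, purely hypothetical.
    opcache_auto = [100.0, 101.0, 99.0]
    opcache_off = [103.0, 104.0, 102.0]
    print(f"slowdown: {slowdown_percent(opcache_auto, opcache_off):.1f} %")
    print(f"noise (auto): {spread_percent(opcache_auto):.1f} %")
    print(f"noise (off):  {spread_percent(opcache_off):.1f} %")
```

If the slowdown is not clearly larger than the noise figures, more runs are needed before drawing a conclusion.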


                  • Originally posted by kaseki View Post

                    I can confirm that disabling opcache eliminates segfault reports generated by kill-ryzen when opcache is set at Auto (at least for the period of 2 hours before I stopped testing). Impact of the opcache disable varies. Unigine Valley and Superposition scores and average frames per second are effectively unchanged. Latency and read rate per Intel MLC are effectively unchanged. Blender rendering, however, does take a few percent longer for the Ryzen logo and the Blender home page 'Classroom' renders. My measurements are not statistically significant, however, so YMMV.
                    Thank you for your benchmarks.


                    • Originally posted by Khudsa View Post

                      Same for me: when I disable the opcache option, it works without errors. I have never done an RMA. How does the process work? Do I first contact AMD (already done, waiting for an answer) and then contact the store (pccomponentes, the official store in Spain for the Ryzen release) with AMD's reply?

                      Thanks
                      At this point, nobody knows whether disabling opcache truly fixes the problem or just delays it. It may be best to RMA your chip; people seem to have better luck with replacement chips made in week 25 of 2017 or later. That said, one person reported that the UA1725 chip they got back from their first RMA had Machine Check Exception issues, while the second UA1725 chip was okay. It seems like AMD's QA process needs more work.
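
For checking a chip against the "week 25 or later" rule of thumb, a small Python sketch follows. It rests on an assumption drawn from how the code is used in this thread, namely that a marking like "UA1725" is a two-letter prefix followed by a two-digit year and a two-digit week; the function names are made up for illustration.

```python
import re

# Assumption: two capital letters, then YY and WW, e.g. "UA1725" = 2017, week 25.
DATE_CODE_RE = re.compile(r"\b[A-Z]{2}\s?(\d{2})(\d{2})")


def parse_date_code(marking):
    """Return (year, week) from a marking such as 'UA1725', or None if not found."""
    match = DATE_CODE_RE.search(marking)
    if match is None:
        return None
    return (2000 + int(match.group(1)), int(match.group(2)))


def is_week_25_or_later(marking):
    """True if the marking decodes to 2017 week 25 or any later date."""
    parsed = parse_date_code(marking)
    return parsed is not None and parsed >= (2017, 25)


if __name__ == "__main__":
    print(parse_date_code("UA1725"), is_week_25_or_later("UA1725"))
```

As the report above shows, a late date code is no guarantee, so this is only a first filter before running kill-ryzen on the replacement.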
