AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR


  • Originally posted by rene View Post

    Yes, I should indeed have written "work around'ed", but technically, if this fixes the issue, it is also a fix ;-) Just like the dozens of pieces of Intel errata glue in the Linux and BSD kernels ... :-/ TL;DR: modern CPUs have become too complex, unfortunately.

    Maybe it is time to think about an architecture simpler than x86, without the million legacy features and exceptions: more lightweight, but with even more cores for modern parallel loads. In principle I also liked the Transmeta idea: a simple, massively parallel VLIW with software glue for the actual memory protection implementation. With the VLIW architecture being open (Transmeta's was closed), we could implement the actual "code morphing" virtualization / memory protection layer on top of it, and fix such bugs ourselves ...
    VLIW was never the right way to go, just too much overhead in translation. But other than them choosing the obviously wrong architecture, I definitely agree with you.

    Comment


    • Originally posted by keantoken View Post
      bridgman Disabling opcache prevents errors on kill-ryzen.sh for me. Who do I contact for RMA?

      Is it better to RMA with AMD directly or go through the store you got the CPU from (newegg)?
      Same for me: when I disable the opcache option, it works without errors. I have never done an RMA. How does the process work? Do I first contact AMD (already contacted, waiting for an answer) and then contact the store (PcComponentes, an official store in Spain for the Ryzen release) with AMD's reply?

      Thanks

      Comment


      • I think my pre-ordered 1700X, bought on 2 April 2017, segfaults only on CPU core 5? The script has continued without error for the moment ...

        Edit: I spoke too soon ... krkrkr

        Code:
        [août12 12:35] logitech-hidpp-device 0003:046D:400A.0007: HID++ 2.0 device connected.
        [août12 13:38] bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]
        [août12 15:44] bash[10807]: segfault at 7fba26df87e8 ip 00007fba26b14330 sp 00007ffd63026c28 error 4 in libc-2.24.so[7fba269eb000+193000]
        [août12 15:45] bash[30145]: segfault at 12 ip 0000000000435d7e sp 00007ffcdc40fc30 error 6 in bash[400000+100000]
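
One way to compare segfault lines like these is to subtract the mapping base (the number before the `+` in the trailing brackets) from the faulting instruction pointer, which gives an offset inside the binary or library that stays the same across runs with different load addresses. The sketch below is only an illustration, not part of kill-ryzen: the helper name `parse_segfault` and the regular expression are assumptions, and the timestamps are left out of the sample lines. The error code is decoded using the x86 page-fault bits (1 = protection violation, 2 = write, 4 = user mode).

```python
import re

# Assumed pattern for the x86 kernel segfault report format shown in the log.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a small summary dict for one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"])
    return {
        "comm": m["comm"],
        "object": m["obj"],
        # Offset of the faulting instruction inside the mapped object.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "protection": bool(err & 1),  # False means the page was not present
        "write": bool(err & 2),       # False means the access was a read
        "user": bool(err & 4),        # True means the fault came from user mode
    }


# The three segfault lines from the post, timestamps omitted.
LINES = [
    "bash[27367]: segfault at 7f814d3857e8 ip 00007f814d0a1330 "
    "sp 00007ffeb781b898 error 4 in libc-2.24.so[7f814cf78000+193000]",
    "bash[10807]: segfault at 7fba26df87e8 ip 00007fba26b14330 "
    "sp 00007ffd63026c28 error 4 in libc-2.24.so[7fba269eb000+193000]",
    "bash[30145]: segfault at 12 ip 0000000000435d7e "
    "sp 00007ffcdc40fc30 error 6 in bash[400000+100000]",
]

for line in LINES:
    info = parse_segfault(line)
    access = "write" if info["write"] else "read"
    print(info["object"], hex(info["offset"]), access)
```

For the two libc faults above, the subtraction gives the same offset (0x129330) in both processes, even though the library was loaded at different addresses.
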
        Code:
        [loop-0] Sat Aug 12 12:48:36 CEST 2017 start 0
        [loop-1] Sat Aug 12 12:48:37 CEST 2017 start 0
        [loop-2] Sat Aug 12 12:48:38 CEST 2017 start 0
        [loop-3] Sat Aug 12 12:48:39 CEST 2017 start 0
        [loop-4] Sat Aug 12 12:48:40 CEST 2017 start 0
        [loop-5] Sat Aug 12 12:48:41 CEST 2017 start 0
        [loop-6] Sat Aug 12 12:48:42 CEST 2017 start 0
        [loop-7] Sat Aug 12 12:48:43 CEST 2017 start 0
        [loop-8] Sat Aug 12 12:48:44 CEST 2017 start 0
        [loop-9] Sat Aug 12 12:48:45 CEST 2017 start 0
        [loop-10] Sat Aug 12 12:48:46 CEST 2017 start 0
        [loop-11] Sat Aug 12 12:48:47 CEST 2017 start 0
        [loop-12] Sat Aug 12 12:48:48 CEST 2017 start 0
        [loop-13] Sat Aug 12 12:48:49 CEST 2017 start 0
        [loop-14] Sat Aug 12 12:48:50 CEST 2017 start 0
        [loop-15] Sat Aug 12 12:48:51 CEST 2017 start 0
        [loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
        [loop-2] TIME TO FAIL: 1744 s
        [loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
        [loop-4] TIME TO FAIL: 2992 s
        [loop-10] Sat Aug 12 15:44:59 CEST 2017 build failed
        [loop-10] TIME TO FAIL: 10583 s
        [loop-1] Sat Aug 12 15:45:08 CEST 2017 build failed
        [loop-1] TIME TO FAIL: 10592 s
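
For anyone collecting these logs from several runs, the failure lines can be pulled out mechanically. This is a rough sketch that assumes the `[loop-N] TIME TO FAIL: <seconds> s` format shown above; the function name `times_to_fail` is made up for the example and is not part of the kill-ryzen script.

```python
import re

# Assumed line format, taken from the log excerpt above.
FAIL_RE = re.compile(r"^\[(loop-\d+)\] TIME TO FAIL: (\d+) s$")


def times_to_fail(log_text):
    """Map each failed loop name to its reported time to fail in seconds."""
    result = {}
    for line in log_text.splitlines():
        m = FAIL_RE.match(line.strip())
        if m:
            result[m.group(1)] = int(m.group(2))
    return result


# Failure lines copied from the post.
LOG = """\
[loop-2] Sat Aug 12 13:17:40 CEST 2017 build failed
[loop-2] TIME TO FAIL: 1744 s
[loop-4] Sat Aug 12 13:38:28 CEST 2017 build failed
[loop-4] TIME TO FAIL: 2992 s
[loop-10] Sat Aug 12 15:44:59 CEST 2017 build failed
[loop-10] TIME TO FAIL: 10583 s
[loop-1] Sat Aug 12 15:45:08 CEST 2017 build failed
[loop-1] TIME TO FAIL: 10592 s
"""

failures = times_to_fail(LOG)
first = min(failures, key=failures.get)
print(f"{len(failures)} loops failed; first was {first} "
      f"after {failures[first] / 60:.1f} min")
```
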
        Last edited by scorpio810; 12 August 2017, 09:50 AM.

        Comment


        • Originally posted by drSeehas View Post
          This is not a fix. It is a workaround. Only AMD can fix this bug. But AMD says it is a Linux-only bug ...
          You guys may be mixing issues here. Linux (and Windows AFAIK) have had similar code for years (leaving an unused "guard page" at the top of user process address space) to deal with errata in other processors... BSD had workarounds in place as well, but they used a smaller guard region (less than a page) which was sufficient for previous processors. Not a "better or worse" thing, just a different design decision.

          Ryzen needs a full guard page at the top rather than just a guard region, and so the BSD devs have updated their code accordingly. Linux and Windows "got lucky" in this case because the guard page added for errata in previous CPUs also worked for Ryzen.

          We are checking to make sure that the combination of address space randomization and transparent huge page migration will never replace a collection of 4K pages at the top of user memory with a single 2M page (which would effectively remove the guard page). We don't think it will happen because of the unused guard page and the OS-managed write protection of the vsyscall page but need to be sure.
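
The address arithmetic behind that concern can be sketched as follows. This only illustrates the alignment issue and does not model the kernel's THP logic. It assumes a 47-bit user address space with the topmost 4K page left unmapped as the guard page; all constants and function names are chosen for the example.

```python
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024

# Assumption: 47-bit user address space, topmost 4K page reserved as guard page.
USER_SPACE_END = 1 << 47
GUARD_PAGE = USER_SPACE_END - PAGE_4K
LAST_USABLE_PAGE = GUARD_PAGE - PAGE_4K


def huge_page_base(addr):
    """Start of the 2M-aligned region that contains addr."""
    return addr & ~(PAGE_2M - 1)


def huge_page_would_cover_guard(addr):
    """True if a single 2M page backing addr would also map the guard page."""
    base = huge_page_base(addr)
    return base <= GUARD_PAGE < base + PAGE_2M


print(hex(GUARD_PAGE))
print(huge_page_would_cover_guard(LAST_USABLE_PAGE))
print(huge_page_would_cover_guard(LAST_USABLE_PAGE - PAGE_2M))
```

Under these assumptions, any 4K page in the last 2M-aligned region shares that region with the guard page, so replacing the region with one 2M mapping would map the guard page as well. Pages one region lower are unaffected.
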
          Last edited by bridgman; 12 August 2017, 02:25 PM.
          Test signature

          Comment


          • Originally posted by bridgman View Post
            ... Ryzen needs a full guard page at the top rather than just a guard region, ...
            Was this ever documented/communicated and when?

            ... and so the BSD devs have updated their code accordingly. Linux and Windows "got lucky" in this case because the guard page added for errata in previous CPUs also worked for Ryzen. ...
            So there are two "bugs", but only one of them shows up on Linux and Windows?

            Comment


            • Originally posted by drSeehas View Post
              Was this ever documented/communicated and when?
              There is an errata document but I'm not sure if it has been published yet. I asked about status last week.

              Originally posted by drSeehas View Post
              So there are two "bugs", but only one of them shows up on Linux and Windows?
              IIRC a typical modern CPU has somewhere between 10 and 100 errata (check the revision guides for any recent CPU).

              I do not expect this specific one to cause problems on Linux or Windows but as I said we are checking to make sure that transparent huge page logic (THP) in Linux does not need an additional tweak.
              Last edited by bridgman; 12 August 2017, 03:37 PM.
              Test signature

              Comment


              • Originally posted by Khudsa View Post

                Same for me, when I disable the opcache option it works without errors. ...
                I can confirm that disabling opcache eliminates segfault reports generated by kill-ryzen when opcache is set at Auto (at least for the period of 2 hours before I stopped testing). Impact of the opcache disable varies. Unigine Valley and Superposition scores and average frames per second are effectively unchanged. Latency and read rate per Intel MLC are effectively unchanged. Blender rendering, however, does take a few percent longer for the Ryzen logo and the Blender home page 'Classroom' renders. My measurements are not statistically significant, however, so YMMV.
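
To judge whether a slowdown of a few percent stands out from run-to-run noise, one option is to repeat each render several times per setting and compare the two sets of timings. The sketch below computes the relative slowdown and a Welch t statistic. The timing values are hypothetical placeholders, not measurements from this thread, and the function names are made up for the example.

```python
from math import sqrt
from statistics import mean, stdev


def slowdown_percent(baseline, changed):
    """Relative change of the mean render time, in percent."""
    return (mean(changed) / mean(baseline) - 1.0) * 100.0


def welch_t(baseline, changed):
    """Welch's t statistic for the difference of the two sample means."""
    var_b = stdev(baseline) ** 2 / len(baseline)
    var_c = stdev(changed) ** 2 / len(changed)
    return (mean(changed) - mean(baseline)) / sqrt(var_b + var_c)


# Hypothetical placeholder timings in seconds, NOT real benchmark data.
opcache_auto = [100.0, 102.0, 98.0]
opcache_disabled = [103.0, 105.0, 101.0]

print(f"slowdown: {slowdown_percent(opcache_auto, opcache_disabled):.1f} %")
print(f"Welch t:  {welch_t(opcache_auto, opcache_disabled):.2f}")
```

With only a handful of runs per setting, a small t value means the difference could easily be noise, which is the situation described above.
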

                Comment


                • Originally posted by kaseki View Post

                  I can confirm that disabling opcache eliminates segfault reports generated by kill-ryzen when opcache is set at Auto (at least for the period of 2 hours before I stopped testing). Impact of the opcache disable varies. Unigine Valley and Superposition scores and average frames per second are effectively unchanged. Latency and read rate per Intel MLC are effectively unchanged. Blender rendering, however, does take a few percent longer for the Ryzen logo and the Blender home page 'Classroom' renders. My measurements are not statistically significant, however, so YMMV.
                  Thank you for your benchmarks.

                  Comment


                  • Originally posted by Khudsa View Post

                    Same for me: when I disable the opcache option, it works without errors. I have never done an RMA. How does the process work? Do I first contact AMD (already contacted, waiting for an answer) and then contact the store (PcComponentes, an official store in Spain for the Ryzen release) with AMD's reply?

                    Thanks
                    At this point, nobody knows if it truly fixes the problem or just delays it. It may be best to RMA your chip; people seem to be having better luck when they get back RMA chips from 2017 week 25 or later. Nevertheless, one person reported that the first UA1725 chip they got back from RMA had Machine Check Exception issues, while the second UA1725 chip was okay. It seems like AMD's QA process needs more work.

                    Comment


                    • Originally posted by bridgman View Post

                      There is an errata document but I'm not sure if it has been published yet. I asked about status last week.



                      IIRC a typical modern CPU has somewhere between 10 and 100 errata (check the revision guides for any recent CPU).

                      I do not expect this specific one to cause problems on Linux or Windows but as I said we are checking to make sure that transparent huge page logic (THP) in Linux does not need an additional tweak.
                      How did AMD come to the conclusion that ThreadRipper isn't affected by these issues, given that it's running the same stepping as the Ryzen chips (B1)? Does it just have better QA in general? A manufacturing fix on 2017 week 25+ chips? A microcode fix?

                      Sounds to me like there's no microcode fix (if ThreadRipper had one, I'm pretty sure Ryzen would have it by now as well). From a die perspective, there's no new stepping (TR is supposedly using the same B1 Zeppelin dies used in Ryzen). Sounds like a lot of voodoo BS coming out of AMD: how are they able to claim that TR is unaffected if Linux is still being looked at?

                      Will the announced Ryzen Pro desktop processors have the same level of QA-binning as TR? Will the consumer line as well? Or will we play ongoing SEGV Silicon Lottery going forward on the consumer line?

                      AMD needs to give us a better explanation and a fast-track RMA process for the people affected, instead of making us take pictures of our BIOS settings and our case. (I'm a bit pissed, as I have two systems affected and AMD support takes several days to respond in an ongoing support conversation.)

                      A fast-track path, IMO, would be: make us set the BIOS to defaults (to rule out overclocking) and provide us with a testing tool (something we can execute); if it fails, generate an RMA number. The way it's happening now, it takes over a week before one gets an RMA number, because they want pictures of the BIOS and of the case, and AMD support doesn't respond quickly enough. Taking shipping into account, this process ends up taking a couple of weeks (crossing one's fingers that the first CPU sent back will be a good copy).
                      Last edited by Funks; 13 August 2017, 03:30 AM.

                      Comment
