AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR


  • #71
    Originally posted by puleglot View Post
    This freebsd bug is a completely different story. Please don't mix unrelated issues.


    Any process can crash on Ryzen during heavy compilation: bash or even a process not participating in compilation. So I don't understand why you are trying to find some issue with gcc itself.
    Fine, whatever, fart around with your voltage settings and swapping your thermal grease and your ricer gentoo cflags all day then.

    Comment


    • #72
      Love it when "my answer is better than your answer" flares up.

      Time for someone to print up tshirts that say "I was there when the Ryzen had segmentation faults in 2017". Then you can claim to anyone 10 years later when "Pyzen" comes out that you had the answer way back when.

      Comment


      • #73
        Originally posted by puleglot View Post
        This freebsd bug is a completely different story.
        partly true - this bug report first contained two issues: freezes/reboots + compilation failures (what you're referring to, I suppose)

        After a while I split the compilation failure issue out into a separate bug report...

        Comment


        • #74
          Originally posted by soulsource View Post

          May I ask: What are those reasonable voltages you are using? It'd be nice to use them as starting values for my own experiments.
          The voltages are:

          VCore = 1.35v
          SoC = 1.2v
          DRAM = 1.375v

          Oddly enough, it looks like a power outage the other day knocked out the other settings when it triggered the BIOS setup screen. Everything still works, so that stuff must not have been important.

          Originally posted by k1e0x View Post

          Negative.

          Seriously just don't. It's great that you want to help but this is an extremely hard problem and smart people are on it. Ya know.. about 10% of the computer industry has smart people that actually do the bulk of the work.. the rest are just there.
          I get that I'm not an engineer. I'm just a guy who has built and overclocked plenty of computers, so yeah, maybe I'm seeing what I expect to see, but it's information, and it could be useful to someone who does have more in-depth technical knowledge than myself.

          Comment


          • #75
            I think it would really help the whole bug investigation process if people would provide observations only from non-overclocked, default-setting systems.

            All this talk about this-and-that-tweak of BIOS settings is just a distraction, and it makes some people (both at AMD and elsewhere) believe that all the weird symptoms might just be caused by people running weird settings.

            Fact is: People experience the bug on completely default-configured, non-overclocked systems. So it makes no sense at all to try all kinds of weird voltage or clock settings in the hope of making it go away.

            Comment


            • #76
              Thanks Michael, for raising awareness of this issue.

              As evident in the bug discussion over at AMD dating back to 2017-05-08, other bug reports, discussions over at FreeBSD and several threads here in the forums, there are two well documented symptoms:

              1 - "Segfaults"
              This has been well researched by several users. Under load pointers may get corrupted, which results in undefined behavior. This is why compilation fails "randomly".

              2 - uOP Cache
              The kernel detects errors (and sometimes corrections) in the uOP Cache.

              Some examples I've grabbed from the thread over at AMD:
              Example Linux:
              "mce: [Hardware Error]: Machine check events logged"
              "mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108"
              "mce: [Hardware Error]: TSC 0 ADDR 1ffffa94be452 MISC d012000101000000 SYND 4d000000 IPID 500b000000000"
              "mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1500732880 SOCKET 0 APIC 2 microcode 8001126"

              Example BSD:
              MCA: Bank 1, Status 0x90200000000b0151
              MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
              MCA: Vendor "AuthenticAMD", ID 0x800f11, APIC ID 14
              MCA: CPU 14 COR ICACHE L1 IRD error
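
A rough sketch of how the two status words above can be read, assuming the standard x86 machine-check status layout (valid, overflow, uncorrected, enabled, misc-valid, addr-valid and context-corrupt flags in bits 63 down to 57, error code in the low 16 bits). The function and dictionary names are made up for illustration; this is not a replacement for a real decoder such as the kernel's own.

```python
# Flag bits of an x86 machine-check status register (MCi_STATUS).
# Assumption: only the architectural flag bits are decoded here; the
# vendor-specific fields in between are ignored.
FLAG_BITS = {
    "valid": 63,
    "overflow": 62,
    "uncorrected": 61,
    "enabled": 60,
    "misc_valid": 59,
    "addr_valid": 58,
    "context_corrupt": 57,
}


def decode_mca_status(status: int) -> dict:
    """Return the architectural flags and the 16-bit error code of a status word."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    flags["error_code"] = status & 0xFFFF
    return flags


# The two status words quoted in the log excerpts.
linux_event = decode_mca_status(0xBEA0000000000108)
bsd_event = decode_mca_status(0x90200000000B0151)

for label, event in (("Linux", linux_event), ("BSD", bsd_event)):
    kind = "uncorrected" if event["uncorrected"] else "corrected"
    print(f"{label}: {kind}, error code 0x{event['error_code']:04x}")
```

Read this way, the Linux event has the uncorrected bit set, while the BSD event does not, which matches the "COR" in its last log line.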

              -----

              Our friends over there have done extensive testing and identified these symptoms, which are clear evidence of a hardware defect. But it remains unclear if these are symptoms of the same problem, or two unrelated problems. This will be up to AMD's engineers to research.

              As for anyone claiming this is a software bug; when you have a piece of code that's proven to be correct, and then you run it repeatedly and it causes pointer corruption and segfaults, and eventually freezes the operating system, you have a hardware bug.
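
The "run it repeatedly" idea can be sketched roughly like this. It is only an illustration for POSIX systems, not the script used in the bug reports: `run_repeatedly` is a hypothetical helper, and the workload here is a harmless placeholder that you would swap for your own known-good build command. Note that it only sees a segfault of the direct child process; a build tool that catches a crashing compiler and exits with an ordinary error code would land in the "other_failure" bucket.

```python
import signal
import subprocess
import sys


def run_repeatedly(command, runs):
    """Run `command` `runs` times and tally how each run ended."""
    tally = {"ok": 0, "segfault": 0, "other_failure": 0}
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            tally["ok"] += 1
        elif result.returncode == -signal.SIGSEGV:
            # On POSIX a negative return code means "killed by that signal".
            tally["segfault"] += 1
        else:
            tally["other_failure"] += 1
    return tally


if __name__ == "__main__":
    # Placeholder workload (assumption): replace with a real, deterministic job.
    print(run_repeatedly([sys.executable, "-c", "pass"], runs=5))
```

If the same deterministic job passes most of the time and occasionally dies with a segfault on one machine but never on another, the software is the less likely suspect.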

              So far, their testing has eliminated OS kernels, since it's reproduced on Linux, BSD and Windows Subsystem for Linux (WSL). Both gcc and llvm have been tested, and the problems have been reproduced during compilation of gcc, mesa, chromium, thunderbird, libreoffice, ffmpeg, linux kernel, bsd kernel and more. Memory configurations and timings have been eliminated as a cause.

              As stated in the article:
              AMD engineers found the problem to be very complex and characterize it as a performance marginality problem exclusive to certain workloads on Linux
              This is completely incorrect. This is not a performance problem at all, this is a data corruption issue. And the problem has nothing to do with Linux.

              The hardware defect is not related to compilation. It just happens that the problem is most easily reproduced with compilation, since it stresses the right parts of the core. This is why these chips have no problem with Prime95 or Cinebench: the FPUs and ALUs work just fine. And who does a lot of heavy compilation? Linux developers, especially Gentoo users.

              The defect is present in at least the B1 stepping of Ryzen, but as with all microprocessors, the risk of the bug occurring is dependent on the quality of each sample. A proper solution would probably require a new stepping, but hopefully AMD can manage to either tweak some parameters or disable some features in firmware to make these systems reliable (likely at a performance penalty). Anyone wanting to buy these for development or other productive work should hold off until the situation has been resolved.

              Edit:
              Improved sentence.
              Last edited by efikkan; 07 August 2017, 06:56 PM.

              Comment


              • #77
                Originally posted by ZombieNo7 View Post
                The voltages are: ...
                I think it would really help the whole investigation into this issue if people restricted symptom reporting to observations made on default-configured, non-overclocked systems.

                The whole voltage and clock setting voodoo is just a distraction, which makes some people at AMD and elsewhere think that the whole issue might be caused just by some weird settings.

                But the fact is: the bug also occurs on default-configured, non-overclocked systems. It will not go away by just tweaking some voltage levels or clocks.

                Comment


                • #78
                  Originally posted by ZombieNo7 View Post

                  The voltages are:

                  VCore = 1.35v
                  SoC = 1.2v
                  DRAM = 1.375v

                  Oddly enough, it looks like a power outage the other day knocked out the other settings when it triggered the BIOS setup screen. Everything still works, so that stuff must not have been important.



                  I get that I'm not an engineer. I'm just a guy who has built and overclocked plenty of computers, so yeah, maybe I'm seeing what I expect to see, but it's information, and it could be useful to someone who does have more in-depth technical knowledge than myself.
                  I'm not calling you dumb or anything of the sort, it's just.. we all have our areas where we are strong. Even understanding the problem, much less developing a test to repeatably encounter it, is difficult.. and for the record it's WAY over my head as a sysadmin so I'm no better off.

                  Keep in mind this could still be a case where AMD got it right and the compilers have just been doing the wrong thing for the past 20 years.
                  Last edited by k1e0x; 07 August 2017, 06:44 PM.

                  Comment


                  • #79
                    Originally posted by k1e0x View Post
                    Someone is going to have to help me here but I think it's the process of the execution of the non-executable memory pages using GCC trampolines that it's showing up in.
                    The bug reproduction using gcc has _nothing_ to do with executing "freshly compiled code". It is the compiler itself that seg-faults, not the executables it generates.

                    GCC is just a kind of software that executes a lot of complex (compiler-)code in parallel where any kind of "wrong intermediate result" is much more likely to result in a (visible) segmentation fault than most other massively parallel software. If you run ffmpeg in 16 threads, chances are that any kind of wrong intermediate result just means some distorted pixels in a frame.
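
A toy model of that difference, with everything in it made up for illustration (the lists, the bit positions and the `walk` helper are assumptions, not taken from gcc or ffmpeg): one flipped bit in a pointer-like index aborts the whole traversal, while one flipped bit in a pixel value just gives a slightly different pixel.

```python
def flip_bit(value, bit):
    """Return `value` with a single bit inverted, simulating a wrong result."""
    return value ^ (1 << bit)


def walk(next_index, start=0):
    """Follow index 'pointers' until the None terminator; return the step count."""
    count, node = 0, start
    while node is not None:
        node = next_index[node]
        count += 1
    return count


links = [1, 2, 3, 4, 5, 6, 7, None]   # pointer-heavy data, like a compiler's trees
pixels = [128] * 8                    # plain numeric data, like a video frame

links[3] = flip_bit(links[3], 20)     # one corrupted "pointer"
pixels[3] = flip_bit(pixels[3], 2)    # one corrupted pixel value

try:
    walk(links)
    outcome = "completed"
except IndexError:
    # The Python analogue of dereferencing a wild pointer.
    outcome = "crashed"

print("pointer walk:", outcome)
print("pixel value after flip:", pixels[3])
```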

                    Comment


                    • #80
                      Originally posted by ZombieNo7 View Post
                      I get that I'm not an engineer. I'm just a guy who has built and overclocked plenty of computers, so yeah, maybe I'm seeing what I expect to see, but it's information, and it could be useful to someone who does have more in-depth technical knowledge than myself.
                      If the article is correct that the AMD engineers have successfully reproduced the problem, I doubt there's anything you can do to contribute. There's already far too much noise and confusion around this issue... at this point, the engineers no longer need help figuring out how to trigger the failures on their own hardware, and they certainly don't need the distraction of a few hundred uninformed people on the internet offering theories of how to fix it...

                      Comment
