Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads


  • Originally posted by Beherit
    Hardware bugs aren't necessarily triggered by all kernels.

    Other than undocumented hearsay that it happened to one Ryzen user running NetBSD, and Matt Dillon's (DragonFly BSD) report, the bug is only triggered under Linux. Still not a single Windows user has reported this.
    You're wrong:

    > Yet windows does not trigger it?

    As I have reported several times, Windows Subsystem for Linux (WSL), the so-called "Bash on Ubuntu on Windows",
    triggered this kind of problem (see my past report for details). WSL is a Linux userland on the Windows kernel
    (more precisely, it consists of a Linux emulation layer and the NT kernel). NetBSD also triggered a very similar problem.


    • So. Do we have any more information about this problem?


      • Originally posted by Zucca
        So. Do we have any more information about this problem?
        Some. A subset of users seems to benefit from raising the SoC voltage from ~0.945 V (auto) to about 1.05-1.1 V. I'm fairly sure it's mostly an electrical issue, as exchanging processors doesn't seem to make any difference, but it hasn't yet led to AMD announcing any kind of fix.


        • I think that even after raising the voltage it was still necessary to disable ASLR to reach stability, at least in most cases.
          Note that this is my rough impression from following the AMD forum thread; I didn't keep track of who did what exactly ;p
          Last edited by timon37; 06 July 2017, 05:12 AM.
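
For readers who want to check whether ASLR is currently enabled before or after trying this, here is a minimal sketch. It assumes a Linux system exposing the standard `/proc/sys/kernel/randomize_va_space` file; the helper names `describe_aslr` and `read_aslr` are made up for illustration and are not part of any tool mentioned in the thread.

```python
from pathlib import Path

# Standard Linux procfs location of the ASLR setting.
ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the values as documented for the kernel sysctl.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base, vDSO randomized)",
    2: "full (also randomizes the heap/brk)",
}


def describe_aslr(raw: str) -> str:
    """Map the raw contents of randomize_va_space to a readable description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def read_aslr(path: Path = ASLR_PATH) -> str:
    """Return the current ASLR mode, or a note if the file cannot be read."""
    try:
        return describe_aslr(path.read_text())
    except OSError:
        return "unavailable (not Linux, or procfs not mounted)"


if __name__ == "__main__":
    print("ASLR:", read_aslr())
```

Changing the setting itself is normally done as root with `sysctl kernel.randomize_va_space=0`, or for a single process with `setarch -R`; the sketch above only reads the value.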


          • Originally posted by foppe

            Some. A subset of users seems to benefit from raising the SoC voltage from ~0.945 V (auto) to about 1.05-1.1 V. I'm fairly sure it's mostly an electrical issue, as exchanging processors doesn't seem to make any difference, but it hasn't yet led to AMD announcing any kind of fix.
            Based on this, the problem could be circumvented by a BIOS/UEFI update. Although, as I understand it, raising the operating voltage of the cores also raises the power consumption, which might lead to more problems with certain CPU+MB combos. Also, the Ryzen 7 1700 then wouldn't be a 65 W TDP part anymore.

            I've been planning to upgrade my server from an Opteron 3380 to a Ryzen 7 1700 (MB and RAM as well, of course). I guess I'll still wait. The Opteron 3380 is fairly capable for what I need, but sometimes when encoding videos it takes looooong. But since it's my home server it can do all the work at night. So no hurry.


            • Originally posted by Zucca
              Based on this, the problem could be circumvented by a BIOS/UEFI update. Although, as I understand it, raising the operating voltage of the cores also raises the power consumption, which might lead to more problems with certain CPU+MB combos. Also, the Ryzen 7 1700 then wouldn't be a 65 W TDP part anymore.

              I've been planning to upgrade my server from an Opteron 3380 to a Ryzen 7 1700 (MB and RAM as well, of course). I guess I'll still wait. The Opteron 3380 is fairly capable for what I need, but sometimes when encoding videos it takes looooong. But since it's my home server it can do all the work at night. So no hurry.
              Not sure if that's necessarily true. Some BIOSes seem to set vSoC to 0.9 V, while mine was at 0.945 V, and vSoC drives the chipset/memory controller, not the CPU, so I doubt it really affects overall power draw all that much.
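
To put a rough number on the voltages discussed here, a back-of-the-envelope sketch follows. It assumes the usual CMOS approximation that dynamic power scales with the square of the supply voltage at fixed frequency, ignores leakage, and applies only to whatever is fed by the SoC rail, not to the cores. The function name `relative_dynamic_power` is invented for illustration.

```python
def relative_dynamic_power(v_old: float, v_new: float) -> float:
    """Ratio of dynamic power after/before a voltage change.

    Assumes P_dynamic ~ C * V^2 * f with capacitance and frequency unchanged;
    leakage power is ignored, so treat the result as a rough estimate only.
    """
    return (v_new / v_old) ** 2


if __name__ == "__main__":
    # 0.945 V is the "auto" vSoC value quoted in the thread.
    for v_new in (1.05, 1.10):
        ratio = relative_dynamic_power(0.945, v_new)
        extra = (ratio - 1) * 100
        print(f"0.945 V -> {v_new:.2f} V: ~{extra:.0f}% more dynamic power on that rail")
```

Under those assumptions the increase is roughly a quarter to a third, but only on the SoC rail, which is consistent with the point that the overall package power should not change dramatically.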


              • Originally posted by foppe
                vSoC drives the chipset/memory controller, not the CPU, so I doubt it really affects overall power draw all that much.
                Oh. That's better news then, I guess. At least from a power/heat generation standpoint.


                • I am one of the users with an affected Ryzen in that AMD Community thread. There we are wondering how common the problem is. Some people have already received a new CPU through RMA, but the problem persists with the new CPU. We also have examples of people who have exchanged all major components in the system (processor, motherboard, memory, PSU, graphics card) and still face the segfault under heavy compilation.

                  A special characteristic is that the problem is not that easy to trigger. You may compile many things and never see it, so a person may have an affected processor without noticing. Fortunately, some smart people have created a simple script that always shows the problem on my system and on the systems of the other people in the thread. The script can be found in the suaefar/ryzen-test repository ("Tools to reproduce randomly crashing processes under load on AMD Ryzen processors on Linux").


                  You just have to clone the repository, move to the ryzen-test directory and run ./kill_ryzen.sh. It is a very simple script: it downloads the gcc-7.1 source code onto a RAM disk and starts as many simultaneous compilations of it as there are processors. If any compilation fails, it writes a message to the console saying how long it took to hit the failure. On my system the build fails after a few minutes unless I turn off SMT (I am also increasing the SoC and memory voltages, but I am not completely sure this is necessary).

                  Now, I would like to ask fellow readers for a hand. If you have a Ryzen system, can you test it with the kill_ryzen.sh script? Let's say for one or two hours. After that, post the result here even if no failures happen. This may be a nice way to find out how common the problem is. That is why it is important to have both kinds of reports: failures and successful builds.

                  Note: kill_ryzen.sh is an infinite build loop; the easiest way to stop it is to reboot.
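
To illustrate the idea behind such a stress loop (this is not the actual kill_ryzen.sh, just a minimal sketch of the same principle), the following runs one copy of a command per logical CPU, repeats, and reports how long it took until any copy exited with an error. The names `run_once` and `time_to_first_failure`, the `max_rounds` limit, and the placeholder workload are all assumptions made for the example. Unlike the description above, this sketch runs in lock-step rounds and can be given an upper bound, so it does not need a reboot to stop.

```python
import os
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Sequence


def run_once(cmd: Sequence[str]) -> int:
    """Run one copy of the command and return its exit status."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode


def time_to_first_failure(
    cmd: Sequence[str],
    jobs: Optional[int] = None,
    max_rounds: Optional[int] = None,
) -> Optional[float]:
    """Run `jobs` parallel copies of `cmd` per round until one fails.

    Returns the elapsed seconds at the end of the first round in which any
    copy exited non-zero, or None if `max_rounds` rounds passed cleanly.
    With max_rounds=None the loop only ends on a failure.
    """
    jobs = jobs or os.cpu_count() or 1
    start = time.monotonic()
    rounds = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while max_rounds is None or rounds < max_rounds:
            statuses = list(pool.map(run_once, [cmd] * jobs))
            if any(status != 0 for status in statuses):
                return time.monotonic() - start
            rounds += 1
    return None


if __name__ == "__main__":
    # Placeholder workload; substitute a real build command to stress the CPU.
    workload = [sys.executable, "-c", "pass"]
    elapsed = time_to_first_failure(workload, max_rounds=3)
    if elapsed is None:
        print("no failure observed")
    else:
        print(f"first failure after {elapsed:.1f} s")
```

A trivial workload like the placeholder will not reproduce the segfaults; the reports in the thread involve long, heavy compilations running on all threads at once.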


                  • Very disturbing that there has been no official AMD communication on the issue for a very long time. There was a "we're working on it" message on the AMD community forum many weeks ago. Do they intend to just sweep this under the rug? This sort of problem makes my intended use (24x7 Linux server) infeasible, and leaves a very bad impression of AMD as a CPU vendor.


                    • stevea: Actually, a few days ago an AMD representative posted in the thread saying "we are still reading every post AFAIK and this is definitely being looked at. Please continue to file customer tickets as amdmatt suggested.". So the current situation is that AMD is looking into it but is not sharing any internal information. We do not know, for example, whether they are able to reproduce our problems themselves. Many of us have opened personal technical support requests, myself included. Some have exchanged the CPU through an RMA. In the thread we found some people saying that the new CPU solved their problem, but also some saying that the problem remained the same. These last cases scared many of us, since we started to think about how unlikely it would be to get a second faulty CPU. This may suggest that the problem is more common than we initially thought.

                      This was the main reason that led me to write the message here. We would like people who do not think they are affected to test their systems. If we get many responses saying that the systems are OK, then our best option is to ask for an RMA. If many people discover problems in their systems, then the bug might be widespread. So, once again, if some fellow readers could test their systems and post a follow-up here (just saying whether everything is OK or not), we would appreciate it.
