Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads


  • Originally posted by foppe View Post

    Some. A subset of users seems to benefit from upping the SoC voltage from ~0.945 V (auto) to 1.05-1.1 V. I'm fairly sure it's mostly an electrical issue, as exchanging processors doesn't seem to make any difference, but it hasn't yet led to AMD announcing any kind of fix.
    Based on this, the problem could be circumvented by a BIOS/UEFI update. Although, in my understanding, when raising the operating voltage of the cores you also raise the power consumption, which might lead to more problems with certain CPU+MB combos. And then the R7 1700 wouldn't be 65 W TDP anymore.

    I've been planning to upgrade my Opteron 3380 to an R7 1700 on my server (MB and RAM as well, of course). I guess I'll still wait. The Opteron 3380 is fairly capable for what I need, but sometimes encoding videos takes a long time. But being my home server it can do all the work at night. So no hurry.

    Comment


    • Originally posted by Zucca View Post
      Based on this, the problem could be circumvented by a BIOS/UEFI update. Although, in my understanding, when raising the operating voltage of the cores you also raise the power consumption, which might lead to more problems with certain CPU+MB combos. And then the R7 1700 wouldn't be 65 W TDP anymore.

      I've been planning to upgrade my Opteron 3380 to an R7 1700 on my server (MB and RAM as well, of course). I guess I'll still wait. The Opteron 3380 is fairly capable for what I need, but sometimes encoding videos takes a long time. But being my home server it can do all the work at night. So no hurry.
      Not sure if that's necessarily true. Some BIOSes seem to set vSoC at 0.9 V, while mine was at 0.945 V, and the vSoC drives the chipset/memory controller, not the CPU, so I doubt it really affects overall power draw all that much.

      Comment


      • Originally posted by foppe View Post
        the vSoC drives the chipset/memory controller, not the CPU, so I doubt it really affects overall power draw all that much.
        Oh. That's better news then, I guess. At least from a power/heat generation standpoint.

        Comment


        • I am one of the users with an affected Ryzen in that AMD Community thread. There we are wondering how common the problem is. Some people have already got a new CPU through RMA, but the problem persists with the new CPU. We also have examples of people who have exchanged all major components in the system (processor, motherboard, memory, PSU, graphics card) and still face the segfaults under heavy compilation.

          A special characteristic is that the problem is not easy to trigger. You may compile many things and never see it, so a person may have an affected processor and not notice. Fortunately some smart people have created a simple script that always reproduces the problem on my system and on the systems of the other people in the thread. The script can be found at

          https://github.com/suaefar/ryzen-test

          You just have to clone the repository, move to the ryzen-test directory and run ./kill_ryzen.sh. It is a very simple script: it downloads the gcc-7.1 source code onto a RAM disk and starts one compilation of it per processor, all running in parallel. If any compilation fails it writes a message in the console saying how long it took to reach the failure. After a few minutes the build fails on my system unless I turn off SMT (I am also increasing the SoC and memory voltages, but I am not completely sure that is necessary).
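
          For anyone who wants to give it a go, the whole sequence described above is just the following (no options or extra steps beyond what the script itself handles):

              git clone https://github.com/suaefar/ryzen-test
              cd ryzen-test
              ./kill_ryzen.sh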

          Now, I would like to ask a hand from the fellow readers. If you have a Ryzen system, can you test it with the kill_ryzen.sh script? Let's say for one or two hours. After that, post the result here even if no failures happen. This may be a nice way to find out how common the problem is. That is why it is important to have both kinds of reports: failures and successful builds.

          Note: kill_ryzen.sh is an infinite build loop; the easiest way to stop it is rebooting.
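
          If a reboot is inconvenient, something along these lines may also work to shut it down; I have not verified this against the script, and the process patterns below are my own guesses, so check the script first:

              pkill -f kill_ryzen.sh      # stop the outer build loops started by the script
              pkill -f 'cc1|cc1plus'      # kill any gcc compile jobs that are still running
              # afterwards, unmount/free the RAM disk the script set up (the mount point is in the script)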

          Comment


          • Very disturbing that there has been no official AMD communication on the issue for a very long time. There was a "we're working on it" message on the AMD community forum many weeks ago. Do they intend to just sweep this under the rug? This sort of problem makes my intended use (24x7 Linux server) infeasible, and leaves a very bad impression of AMD as a CPU vendor.

            Comment


            • stevea: Actually, a few days ago an AMD representative posted in the thread saying "we are still reading every post AFAIK and this is definitely being looked at. Please continue to file customer tickets as amdmatt suggested." So the current situation is that AMD is looking at it but not sharing any internal information; we do not know, for example, whether they are able to replicate our problems on their side. Many of us opened personal technical support requests, myself included, and some have exchanged the CPU through an RMA.

              In the thread we found some people saying that the new CPU solved their problem, but others saying that the problem remained the same. These last cases scared many of us, since we started to wonder how unlikely it would be to get a second faulty CPU. This may suggest that the problem is more common than we initially thought, and it was the main reason that led me to post here.

              We would like people who do not think they are affected to test their systems. If we get many responses saying that the systems are OK, then our best option is to ask for an RMA; if many people discover problems in their systems, then the bug might be widespread. So, once again, if some fellow readers could test their systems and post a follow-up here (just saying whether everything is OK or not), we would appreciate it.

              Comment


              • pjssilva: Is that from Reddit? Can you paste a link to the thread?

                Comment


                • There are some discussions on Reddit:

                  https://www.reddit.com/r/programming...ausing_random/

                  There is also an active bug report in FreeBSD and DragonFlyBSD, with developers looking for a workaround. In the AMD forum we already have some cases of people with multiple machines affected. I am more and more convinced that this is a real and common bug. Unfortunately I could not convince people here to test their systems; not a single report. Come on people, try the kill_ryzen.sh script for some hours (let it run overnight). It would be great to get independent confirmation from people outside the AMD thread.

                  I feel sorry for AMD: if this bug is widespread, even if hard to trigger, it could be a disaster for them. I hope they find a solution via microcode. But first they need to acknowledge the problem.

                  Comment


                  • Originally posted by pjssilva View Post
                    There are some discussions on Reddit:

                    https://www.reddit.com/r/programming...ausing_random/

                    There is also an active bug report in FreeBSD and DragonFlyBSD, with developers looking for a workaround. In the AMD forum we already have some cases of people with multiple machines affected. I am more and more convinced that this is a real and common bug. Unfortunately I could not convince people here to test their systems; not a single report. Come on people, try the kill_ryzen.sh script for some hours (let it run overnight). It would be great to get independent confirmation from people outside the AMD thread.

                    I feel sorry for AMD: if this bug is widespread, even if hard to trigger, it could be a disaster for them. I hope they find a solution via microcode. But first they need to acknowledge the problem.
                    I am on it, started it now.
                    Asus B350M-A
                    R7 1700 @ 3.8 GHz
                    Corsair LPX 2666 32 GB (2x16 GB)

                    Kernel 4.11.11-041111-generic on Ubuntu.

                    Comment


                    • That was quick!
                      I'd edit my own post but I haven't made enough of them yet, it seems :-)

                      2017 x86_64 x86_64 x86_64 GNU/Linux
                      cat /proc/sys/kernel/randomize_va_space
                      2
                      Using 16 parallel processes
                      [KERN] -- Logs begin at on. 2017-08-02 00:48:25 CEST. --
                      [KERN] aug. 02 00:50:41 oleUbuntu kernel: userif-3: sent link up event.
                      [KERN] aug. 02 00:50:44 oleUbuntu kernel: userif-3: sent link down event.
                      [KERN] aug. 02 00:50:44 oleUbuntu kernel: userif-3: sent link up event.
                      [KERN] aug. 02 00:50:52 oleUbuntu kernel: zram: Cannot change disksize for initialized device
                      [KERN] aug. 02 00:52:49 oleUbuntu kernel: zram: Cannot change disksize for initialized device
                      [KERN] aug. 02 00:53:27 oleUbuntu kernel: zram: Cannot change disksize for initialized device
                      [KERN] aug. 02 00:55:03 oleUbuntu kernel: zram0: detected capacity change from 68719476736 to 0
                      [KERN] aug. 02 00:56:08 oleUbuntu kernel: zram0: detected capacity change from 0 to 68719476736
                      [KERN] aug. 02 00:56:10 oleUbuntu kernel: EXT4-fs (zram0): mounting ext2 file system using the ext4 subsystem
                      [KERN] aug. 02 00:56:10 oleUbuntu kernel: EXT4-fs (zram0): mounted filesystem without journal. Opts: discard
                      [loop-0] on. 02. aug. 00:57:02 +0200 2017 start 0
                      [loop-1] on. 02. aug. 00:57:03 +0200 2017 start 0
                      [loop-2] on. 02. aug. 00:57:04 +0200 2017 start 0
                      [loop-3] on. 02. aug. 00:57:05 +0200 2017 start 0
                      [loop-4] on. 02. aug. 00:57:06 +0200 2017 start 0
                      [loop-5] on. 02. aug. 00:57:07 +0200 2017 start 0
                      [loop-6] on. 02. aug. 00:57:08 +0200 2017 start 0
                      [loop-7] on. 02. aug. 00:57:09 +0200 2017 start 0
                      [loop-8] on. 02. aug. 00:57:10 +0200 2017 start 0
                      [loop-9] on. 02. aug. 00:57:11 +0200 2017 start 0
                      [loop-10] on. 02. aug. 00:57:12 +0200 2017 start 0
                      [loop-11] on. 02. aug. 00:57:13 +0200 2017 start 0
                      [loop-12] on. 02. aug. 00:57:14 +0200 2017 start 0
                      [loop-13] on. 02. aug. 00:57:15 +0200 2017 start 0
                      [loop-14] on. 02. aug. 00:57:16 +0200 2017 start 0
                      [loop-15] on. 02. aug. 00:57:17 +0200 2017 start 0
                      [loop-2] on. 02. aug. 00:57:49 +0200 2017 build failed
                      [loop-2] TIME TO FAIL: 47 s
                      [loop-13] on. 02. aug. 00:58:00 +0200 2017 build failed
                      [loop-13] TIME TO FAIL: 58 s
                      [KERN] aug. 02 00:58:00 oleUbuntu kernel: bash[23093]: segfault at 7fff2c45f69c ip 00007fff2c45f69c sp 00007fff2c45f4f8 error 15

                      Comment
