Some Ryzen Linux Users Are Facing Issues With Heavy Compilation Loads


  • #91
    Originally posted by tuxd3v
    This problem has existed for a long time in AMD processors.
    I have an [AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G], and it happens at least 5-10 times a week.
    From what I see described here, the error seems to be the same.

    It's a mess, they never fixed it!
    The processor halts, and you have to press reset and boot again...
    But it's not at compile time, it's during application execution, like browsing or video, and sometimes, though less often, it even happens in a console session... a nightmare.
    What you are describing sounds more like a problem I used to see on Kaveri with early graphics drivers. Don't think it has anything to do with the CPU other than maybe a cooling/power problem.

    Which graphics driver versions are you running?
    Last edited by bridgman; 06-04-2017, 03:43 PM.


    • #92
      Hello everyone, I decided to sign up to the forum to post my experience as well.

      I am currently running a 1700X on the ASUS Prime X370-Pro with the latest firmware (PRIME X370-PRO BIOS 0612). The system is configured with 64 GB of G.Skill Flare X RAM (F4-2400C15Q-64GFX, Ryzen supported) running at 2400MHz at stock timings (15-15-15-39). I have all the AI Tweaker settings set to AUTO (as I can't seem to be able to shut them off), so everything is stock and not overclocked. The current OS is Fedora 25 on kernel 4.10.15; any kernel newer than the 4.10.x series results in consistent Xorg crashes in XFCE.

      I have tried a number of suggestions from this thread (C-states, disabling SMT and, most recently, disabling ASLR). Since day 1 I have had no luck compiling a kernel in a RAM drive: it consistently segfaults at random points in the compile. My compile command line is pretty straightforward:
      make INSTALL_MOD_STRIP=1 rpm
      I am able to compile this kernel with the same settings on an 8350 without any issues.

      Hopefully there is a resolution soon... and I am willing to test whatever is needed to help push this forward...
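
For anyone who wants to put a number on how often a build fails this way, here is a minimal Python sketch. It assumes a compiler crash shows up as one of two marker strings in the build output; the marker list and the names `looks_like_crash` and `run_builds` are made up for illustration and are not part of any tool mentioned in this thread.

```python
import subprocess
from typing import List

# Substrings treated as signs of a compiler crash. This list is an
# assumption for illustration, not an exhaustive catalogue.
CRASH_MARKERS = ("Segmentation fault", "internal compiler error")


def looks_like_crash(output: str) -> bool:
    """Return True if the build output contains any crash marker."""
    return any(marker in output for marker in CRASH_MARKERS)


def run_builds(command: List[str], runs: int) -> List[bool]:
    """Run `command` `runs` times and return one crash flag per run."""
    results = []
    for _ in range(runs):
        # Capture stdout and stderr as text so both can be scanned.
        proc = subprocess.run(command, capture_output=True, text=True)
        results.append(looks_like_crash(proc.stdout + proc.stderr))
    return results
```

A hypothetical call would be `run_builds(["make", "-j16"], runs=5)`, with a clean step between runs so that each build does real work. Counting the `True` entries gives a rough failure rate to compare before and after changing a BIOS or kernel setting.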


      • #93
        Originally posted by bridgman

        What you are describing sounds more like a problem I used to see on Kaveri with early graphics drivers. Don't think it has anything to do with the CPU other than maybe a cooling/power problem.

        Which graphics driver versions are you running?
        Hi,
        I am using the Mesa 10.3.2 drivers. I had used the AMD drivers in the past, but they seemed even worse, so I switched to the open-source ones.
        I am on Linux Mint LMDE 2, a Debian derivative.
        What do you recommend to solve this problem?


        • #94
          Originally posted by dillon
          Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a *LOT*. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.
          Could you tell me why you think this problem is HT-specific?

          If the problem you pointed out is the same as mine, the SEGVs should disappear when I disable SMT. However, when I did so,
          the SEGVs still happened. On the other hand, when I disabled ASLR, which should reduce the probability of the problem
          you described, the SEGVs disappeared completely.
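
To confirm which ASLR mode a machine is actually running in before and after such a test, a small Python sketch like the one below may help. It assumes a Linux system exposing `/proc/sys/kernel/randomize_va_space`; the helper names `describe_aslr` and `read_aslr` are illustrative only.

```python
from pathlib import Path

# Meaning of each value of the kernel.randomize_va_space sysctl.
ASLR_MODES = {
    "0": "disabled",
    "1": "stack, mmap base and VDSO randomized",
    "2": "as 1, plus heap (brk) randomized",
}


def describe_aslr(raw: str) -> str:
    """Map the raw contents of randomize_va_space to a description."""
    return ASLR_MODES.get(raw.strip(), "unknown value")


def read_aslr(path: str = "/proc/sys/kernel/randomize_va_space") -> str:
    """Read and describe the current setting (Linux-specific path)."""
    return describe_aslr(Path(path).read_text())
```

Calling `read_aslr()` on a system where ASLR has been turned off should report `disabled`; the setting itself is changed as root, for example with `sysctl -w kernel.randomize_va_space=0`.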


          • #95
            Kaveri has worked perfectly for years for me, there are no specific stability issues known.

            As for Intel being the "safe" choice, note that Intel also had serious stability issues with Skylake, and with Haswell they had to disable the TSX feature because it was broken. (How did they ever get away with this? They permanently disabled a feature that the CPU was advertised with!)


            • #96
              Originally posted by brent
              ...

              (How did they ever get away with this? They permanently disabled a feature that the CPU was advertised with!)
              The same way Sony got away with getting rid of OtherOS on the PS3: they have a lot of money (and, thus, lawyers), and the minority of users who even understand it, let alone care, don't*.

              * Those folks who had the money to throw around to build PS3 clusters using it...spent it all on said clusters. :P Or just never updated the firmware.


              • #97
                Hi guys, I'm one of the Ceph developers. I'm now seeing this pretty regularly on my 1700X on an ASUS Taichi X370 with Ubuntu 17.04. This chip has never been overclocked. It has 16GB of DDR4-3200 running at 2100 in 2 DIMMs. Definitely not overclocking related, and there is no indication the memory is bad. I can verify that once it starts happening, the system appears to destabilize pretty quickly.


                • #98
                  Nite_Hawk, disable the OPCache Code in the BIOS, as that reduced the segfaults a lot. I have just started testing with randomize_va_space set to 0. My machine hasn't segfaulted in a while, but I have not really been stressing it. I did have problems with unexpected ZFS crashes leading to complete freezes, but I haven't seen one of these since the BIOS upgrade.


                  • #99
                    Fedora 26 Alpha held up 25 hours for me so I booted into Gentoo with the Fedora kernel and that held up too. Now I've built my own kernel using the same configuration so I'll see how that goes. It's starting to look like there's some crucial element I didn't enable so I just need to work out which one. I'm wondering if it's IOMMU-related because I got a bunch of errors unless I booted with iommu=pt before. Now I don't need to do that.
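
When comparing kernels like this, it can help to record exactly which boot parameters (such as `iommu=pt`) each run used. Below is a minimal Python sketch that splits a kernel command line into parameters; the function name `parse_cmdline` is illustrative, and quoted values containing spaces are deliberately not handled.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value}.

    Flags without '=' map to None. Quoted values containing spaces
    are not handled; this is a simplification.
    """
    params = {}
    for token in cmdline.split():
        # partition() splits on the first '=' only, so values that
        # themselves contain '=' (e.g. root=UUID=...) stay intact.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params
```

On Linux the running kernel's command line is in `/proc/cmdline`, so `parse_cmdline(open("/proc/cmdline").read()).get("iommu")` would show whether an `iommu=` setting is in effect for the current boot.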


                    • Originally posted by Chewi
                      Fedora 26 Alpha held up 25 hours for me so I booted into Gentoo with the Fedora kernel and that held up too. Now I've built my own kernel using the same configuration so I'll see how that goes. It's starting to look like there's some crucial element I didn't enable so I just need to work out which one. I'm wondering if it's IOMMU-related because I got a bunch of errors unless I booted with iommu=pt before. Now I don't need to do that.
                      Have you tried disabling the boost frequency, checking that Vcore is 1.2 V, and lowering the operating frequency below the default (for example, 3.0GHz when the default frequency is 3.2GHz, as on the Ryzen 5 1600)?
