Bricked RX 560

  • Bricked RX 560

    I started playing Firewatch using my rather new RX 560 and got a system freeze within minutes (keyboard dead, couldn't switch console). So I reset the computer and tried again, and again got a freeze within minutes. Tried again, this time with lower graphics settings, and it seemed to work fine - for 15 minutes or so. Got another freeze, reset the computer again, played another 10-15 minutes, got a freeze and decided to give up for the night.

    Next day when starting the computer I get no graphics signal whatsoever, not even on POST. I do get a BIOS error beep. Reseated the card, reconnected the power supply cable and the DVI connector and tried again - same result.

    The computer works fine with an old Radeon 3870.

    Has anyone seen similar behaviour? Could this be a driver bug or is it just a manufacturing problem? The card has worked fine playing DiRT Rally for a couple of hours, and started crashing within minutes on Firewatch, which makes me a bit suspicious - but maybe Firewatch stresses the card more and caused some heat related problem?

    I'm running Linux Mint 18.1 with kernel 4.10.0-26 (kernel command line radeon.dpm=1) and the ubuntu-x-swat/updates PPA (so libdrm-2.4.80-1, mesa-17.1.2-1ubuntu-1).

  • #2
    If it's not even getting through POST then drivers wouldn't be involved yet, ie seems more like HW issue.

    If it was getting through POST then my first thought would be to try a newer kernel. I believe the 560 draws a bit less power than the 3870 did so power supply doesn't seem likely to be an issue. Your initial description makes me think about overheating.

    There is a frequently reported Linux issue with Firewatch which causes mouse inputs to stop being recognized, but that doesn't sound like your problem. There is a patch for that available from the game dev BTW.
    Last edited by bridgman; 01 July 2017, 07:48 PM.
    Test signature


    • #3
      Originally posted by bitnick
      The computer works fine with an old Radeon 3870.
      Well, if the computer is fine with the 3870 and the problem only appears with the new RX, then it could be incompatible hardware or a BIOS issue... if the motherboard is old, new hardware sometimes does not work. Keep in mind also that it suddenly can't POST, as you mentioned, etc...

      Maybe it is a problematic card; if it happens on Windows as well, just return it.
      Last edited by dungeon; 02 July 2017, 02:31 AM.


      • #4
        Originally posted by bridgman
        If it's not even getting through POST then drivers wouldn't be involved yet, ie seems more like HW issue.
        Hi bridgman, thanks for your answer. Well, yes, obviously the card has a hardware issue now (and I will RMA it). My concern is--what caused it? Do I dare play Firewatch with the replacement?

        Originally posted by bridgman
        If it was getting through POST then my first thought would be to try a newer kernel. I believe the 560 draws a bit less power than the 3870 did so power supply doesn't seem likely to be an issue. Your initial description makes me think about overheating.
        Yes, the 560 draws about 10 W less, both when idle and at max, than the 3870, on my system (measured with a power meter at the socket). So as you say, power supply shouldn't be an issue, and neither should ambient heat (inside the case). (I have a 400 W Corsair power supply and max consumption is 120-130 W.)

        Is it even possible to overheat a card like this using stock cooling and stock settings (i.e. no overclocking, if that's even possible in linux)? Shouldn't it throttle down if it gets too hot?

        I forgot to write in my first post that I've also played several hours of Left4Dead 2 at max graphics settings, as well as DiRT Rally, and in both those games the card was completely stable. And within minutes of Firewatch play, the crashes started. This is what makes me suspect the drivers (i.e. that they do something/allow something that causes the card to overheat).

        But I guess it's really impossible to answer like this, and I'll just have to try again with the replacement card, right?


        • #5
          Originally posted by bitnick
          Is it even possible to overheat a card like this using stock cooling and stock settings (i.e. no overclocking, if that's even possible in linux)? Shouldn't it throttle down if it gets too hot?
          Two part answer... thermal, power & clock management are all tied together these days so I expect it would throttle, plus IIRC there is a hard thermal limit where the chip will power down. That said, if there was something wrong with the cooling solution on the card (eg thermal paste missing) then it's possible the chip could heat up too fast for any of the controls to kick in. Not saying that was the case, just trying to be complete.

          Originally posted by bitnick
          I forgot to write in my first post that I've also played several hours of Left4Dead 2 at max graphics settings, as well as DiRT Rally, and in both those games the card was completely stable. And within minutes of Firewatch play, the crashes started. This is what makes me suspect the drivers (i.e. that they do something/allow something that causes the card to overheat). But I guess it's really impossible to answer like this, and I'll just have to try again with the replacement card, right?
          There were some kernel driver changes related to power management for the 5xx polaris parts, but AFAICS they were all picked up in the 4.10 kernel release, so I don't see anything obviously missing. It wouldn't hurt to start with a newer kernel (Michael did his RX 560 testing with WIP 4.12) but at first glance 4.10 should have been sufficient.
          Test signature


          • #6
            Thanks bridgman, that's what I thought.

            It's on its way back to the reseller now for a replacement. I will try again with linux-4.10 when I get the new one. It's the latest supported kernel for the distro on my "play" machine (Linux Mint) and I really don't want to take work home with me (I do kernel development for a living and want things to pretty much just work on this machine...).


            • #7
              So I got the replacement card today and installed it. After a while it crashed just like the previous one - but this time it broke completely on the first crash. So now I have another bricked card. Which will be more difficult to get replaced, probably...

              HOWEVER, before I turned the computer off this time, I just made a sanity check to see that the card's fan was running - and it wasn't! And of course the card was very hot.

              How can it be that
              1) The fan isn't running on (probably) two cards in a row?
              2) The card just continues going and kills itself with heat? I mean, it had a working heatsink with quite a lot of thermal mass, so it should have had time to react, right?

              I *do* have fancontrol installed for the system & CPU fans, but this shouldn't interfere with the graphics card fan -- right? And surely it wouldn't just turn it off?


              • #8
                Maybe I should add that the card is a Sapphire RX 560 Pulse. It's supposed to have something called "Intelligent Fan Control III" that should turn the fan on at 54 °C. After the crash, the edges of the fins of the heatsink were too hot to touch (but not sizzling hot), so I guess they were somewhere over 50 °C. Which means the chip should have been a fair bit above 54 °C. Of course, the fan might have been spinning before the crash...


                • #9
                  OK, that's weird. Was the original card the same make & model ? Probably is but just checking...

                  Looks like there are 2GB and 4GB models at two different power ratings ("unspecified" and "45W") - which do you have ?

                  Fan control on dGPUs is supposed to be independent of the driver, set up at boot by VBIOS. Factory website info and anecdotal comments from Sapphire 560 Linux users both seem to support that. Drivers can potentially over-ride the default temp/fanspeed breakpoints but I have never seen a case where they are not set up by VBIOS before the driver even loads. Sometimes they are set up inappropriately (a bit too fast or too slow for the conditions, generally when vendor changes cooler but doesn't update fan table) but I have never heard of a board that doesn't set up any fan control values. I'll ask around to see if anyone has heard of this though.
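For anyone who wants to check this from a running system, here is a minimal Python sketch that reads the temperature and fan PWM state a GPU driver exposes through hwmon in sysfs. The file names (temp1_input, pwm1, pwm1_enable) follow the generic hwmon convention; whether amdgpu on a given kernel exposes all of them for this card is an assumption, and the card0 path is only a typical location, not something confirmed in this thread.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs-style file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def gpu_fan_state(hwmon_dir):
    """Collect temperature and PWM readings from one hwmon directory.

    temp1_input is reported in millidegrees Celsius by hwmon convention.
    Missing files simply show up as None.
    """
    d = Path(hwmon_dir)
    temp = read_int(d / "temp1_input")
    return {
        "temp_c": None if temp is None else temp / 1000.0,
        "pwm": read_int(d / "pwm1"),                # 0..255 duty cycle
        "pwm_enable": read_int(d / "pwm1_enable"),  # control mode
    }


if __name__ == "__main__":
    # Assumed location; the card index and hwmon number vary per system.
    base = Path("/sys/class/drm/card0/device/hwmon")
    for d in sorted(base.glob("hwmon*")):
        print(d, gpu_fan_state(d))
```

A pwm value that stays at 0 while temp_c keeps climbing would match the "fan not running" symptom. As far as I know the amdgpu documentation describes pwm1_enable as 0 = no fan control, 1 = manual, 2 = automatic, but check the documentation for your kernel before relying on that.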

                  What kind of workloads were you running during the initial test of the new card ?

                  Guessing you don't have the ability to test with Windows ?

                  Couple more things...

                  1. The Polaris parts use the amdgpu kernel driver not radeon, so you can remove the radeon.dpm boot parm. Is there anything else in your boot string worth reviewing ?

                  2. Can you boot enough to interact with the system and pastebin your dmesg output ? When you say "bricked" what does and does not seem to work ?

                  3. I found a number of reviews/comments by people running the Sapphire RX 560 Pulse with the open source drivers without problems (including Michael as it turns out)...

                  http://www.phoronix.com/scan.php?pag...n-rx-560&num=1

                  ... so starting to get a bit suspicious about the fancontrol utility.

                  4. Most of the descriptions I see only talk about fancontrol being used for CPU fan control but AFAIK it is part of lm-sensors and the framework it runs in has some ability to work with GPU hardware as well... so I would disable it for now.

                  Seems like it would be worth powering up with fancontrol disabled to see if fans start working.

                  5. Guessing you are still running the same 4.10.x kernel as before ? If so then maybe try booting with 4.12 kernel (ideally same setup Michael used) and see if that makes a difference with fans. I didn't see anything obvious in the code history between 4.10 and 4.12 that should make a big difference but that doesn't mean there isn't something...

                  6. Is there anything else unusual about your system configuration ? I was trying to think whether something like a bad wire/connector in a PCIE power cable could let the GPU run but not the fan, however they all run on 12V AFAIK so at first glance that doesn't seem like a likely cause.
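Regarding point 1, a small Python sketch of how the boot string could be reviewed: it pulls out the parameters given for one kernel module and reports which driver is actually bound to the GPU. The helper names are made up for illustration, and the /sys/class/drm/card0/device path is an assumed typical location that may differ on a given machine.

```python
import os


def module_params(cmdline, module):
    """Return {name: value} for 'module.name=value' tokens on a kernel command line."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix):
            name, _, value = token[len(prefix):].partition("=")
            params[name] = value
    return params


def bound_driver(device_dir="/sys/class/drm/card0/device"):
    """Return the name of the kernel driver bound to the device, or None."""
    try:
        return os.path.basename(os.readlink(os.path.join(device_dir, "driver")))
    except OSError:
        return None


if __name__ == "__main__":
    driver = bound_driver()
    print("bound driver:", driver)
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as f:
            cmdline = f.read()
        for module in ("radeon", "amdgpu"):
            params = module_params(cmdline, module)
            if params and module != driver:
                print("parameters for a driver that is not bound:", module, params)
```

With the boot string from the first post, module_params(..., "radeon") would return {"dpm": "1"}, and if the bound driver is amdgpu that parameter has no effect on this card.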
                  Last edited by bridgman; 14 July 2017, 07:45 AM.
                  Test signature


                  • #10
                    Originally posted by bridgman
                    OK, that's weird. Was the original card the same make & model ? Probably is but just checking...

                    Looks like there are 2GB and 4GB models at two different power ratings ("unspecified" and "45W") - which do you have ?
                    It's the 2GB model, "Sapphire Pulse Radeon RX 560 2G", SKU 11267-02, PN 299-1E348-130SA. The original card was identical.

                    Originally posted by bridgman
                    What kind of workloads were you running during the initial test of the new card ?
                    I ran Firewatch with medium settings (it started lagging with high settings). Sorry, I don't have any quantitative info about the load. It did take longer for this card to fail (about an hour) than the first one, probably because I forgot to unplug the internal case fan that gave the old, passively cooled Radeon card a bit of extra air movement. So this card had a very light breeze on its back side, which probably helped a bit.

                    Originally posted by bridgman
                    Guessing you don't have the ability to test with Windows ?
                    Not unless Windows XP is supported and I get another new card...

                    Originally posted by bridgman
                    1. The Polaris parts use the amdgpu kernel driver not radeon, so you can remove the radeon.dpm boot parm. Is there anything else in your boot string worth reviewing ?
                    Ok. That's the only change I've made from the default Linux Mint 18.1 boot string.

                    Originally posted by bridgman
                    2. Can you boot enough to interact with the system and pastebin your dmesg output ? When you say "bricked" what does and does not seem to work ?
                    The card never gives any video signal at all - I get a weird BIOS beep at power on (not a normal error beep code, more like an extra long beep with a few very short interruptions). I think the system is otherwise alive though, so I might be able to log in via ssh. I'll get back about this.
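If the machine does turn out to be reachable over ssh, something like the following Python sketch could narrow the dmesg output down to the graphics-related lines before pasting it. The keyword list is only a guess at what is relevant here, not an official filter, and reading dmesg may require root on some systems.

```python
import re
import subprocess

# Assumed set of interesting keywords; extend as needed.
KEYWORDS = re.compile(r"amdgpu|radeon|\bdrm\b|thermal", re.IGNORECASE)


def gpu_lines(dmesg_text):
    """Return the lines of a kernel log that mention a graphics-related keyword."""
    return [line for line in dmesg_text.splitlines() if KEYWORDS.search(line)]


if __name__ == "__main__":
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
        log = result.stdout
    except OSError:
        log = ""  # dmesg not available on this system
    print("\n".join(gpu_lines(log)))
```

The complete log is still the more useful thing to pastebin; the filter only makes it quicker to see whether amdgpu probed the card at all.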

                    Originally posted by bridgman
                    5. Guessing you are still running the same 4.10.x kernel as before ?
                    Yep, same kernel. I did get a microcode ("linux-firmware") update for Polaris 11 as part of a system update, so the first card ran with ucode dated Dec 2016, the new one with ucode dated March 2017 - that is, if Polaris 11 code is used for this card? There are no firmware files for Polaris 21 installed.

                    Originally posted by bridgman
                    6. Is there anything else unusual about your system configuration ? I was trying to think whether something like a bad wire/connector in a PCIE power cable could let the GPU run but not the fan, however they all run on 12V AFAIK so at first glance that doesn't seem like a likely cause.
                    Well, it's an old system. Motherboard Gigabyte GA-EP43-DS3L (PCI Express 2.0) with a Core2 Duo E8400 CPU. I had to disable "CPU Smart FAN Control" in the BIOS for fancontrol to work, so the CPU fan runs at 100 % from boot until the fancontrol daemon has been loaded.
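To see whether fancontrol has been told to drive anything on the GPU, its configuration can be inspected. Below is a minimal Python sketch, assuming the usual KEY=value layout of /etc/fancontrol as written by pwmconfig (DEVNAME mapping hwmonN to chip names, FCTEMPS listing the controlled PWM outputs). The function names are made up for illustration.

```python
def parse_fancontrol(text):
    """Parse KEY=value lines of a fancontrol-style config into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def controlled_chips(text):
    """Return the chip names whose PWM outputs appear in FCTEMPS."""
    settings = parse_fancontrol(text)
    # DEVNAME looks like: hwmon0=chipA hwmon1=chipB
    names = dict(
        item.split("=", 1)
        for item in settings.get("DEVNAME", "").split()
        if "=" in item
    )
    chips = set()
    # Each FCTEMPS entry looks like: <pwm path>=<temperature path>
    for item in settings.get("FCTEMPS", "").split():
        pwm_path = item.split("=", 1)[0]
        hwmon = pwm_path.split("/", 1)[0]
        chips.add(names.get(hwmon, hwmon))
    return chips


if __name__ == "__main__":
    try:
        with open("/etc/fancontrol") as f:
            print(controlled_chips(f.read()))
    except OSError:
        print("no readable /etc/fancontrol")
```

If the printed set contains amdgpu or radeon, fancontrol is driving the GPU fan; if it only lists the motherboard sensor chip, the config at least does not ask it to touch the card.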
