Vega + dota2 lockups ?

  • Vega + dota2 lockups ?

    Anyone else having problems with system lockups in dota2 ?

    dmesg:
    [141799.403573] [drm:0xffffffffa04b5505] *ERROR* ring gfx timeout, last signaled seq=18041777, last emitted seq=18041778
    [141799.403576] [drm] GPU recovery disabled.

    Then the keyboard stops working, and programs can't be killed even if I ssh in.
    This happens just about every time I try the game, usually about halfway through. Various versions of software over the past year seem to result in similar behavior.
    At first I thought it was the drivers being new and I just had to wait, then I suspected it was heat since I live in southern California.
    DOA card? Too late to return, and applying new thermal paste likely voided the warranty ...
    I can run benchmarks just fine, but trying to complete a full game is basically impossible.
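For anyone comparing their own logs against the dmesg lines above, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of a timeout line. The regex is written only against the message format quoted in this post, and reading "emitted minus signaled" as the number of submissions that never completed is my assumption, not something the log states.

```python
import re

# Pattern written against the "ring ... timeout" line quoted in this post.
RING_TIMEOUT = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), last emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # Assumption: the difference is the number of submissions still outstanding.
    return m.group("ring"), signaled, emitted, emitted - signaled

line = ("[141799.403573] [drm:0xffffffffa04b5505] *ERROR* ring gfx timeout, "
        "last signaled seq=18041777, last emitted seq=18041778")
print(parse_ring_timeout(line))
```

Feeding it the output of dmesg line by line and collecting the non-None results gives a quick count of how often the gfx ring hangs.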

    Any ideas ?

  • #2
    Yeah, I've got two 120mm intake fans and two 120mm exhaust fans.
    I've got them all set to run at max if GPU temp > 60C.

    The cpu/mem/chipset are all cool and stable.
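The fan rule above can be sketched in a few lines of Python. This is only an illustration of the threshold logic: the idle duty of 40% is a made-up placeholder, and the input string stands in for the contents of a hwmon temp*_input file, which reports millidegrees Celsius.

```python
def parse_millidegrees(raw):
    """Convert the text of a hwmon temp*_input file (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0

def fan_duty(temp_c, threshold_c=60.0, idle_duty=40, max_duty=100):
    """Return fan duty in percent: max above the threshold, idle_duty otherwise.

    idle_duty is a hypothetical value; the post only specifies the >60C rule.
    """
    return max_duty if temp_c > threshold_c else idle_duty

for raw in ("59000\n", "60000\n", "61000\n"):
    temp = parse_millidegrees(raw)
    print(temp, fan_duty(temp))
```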
    Last edited by Soul_keeper; 11 September 2018, 06:39 AM.



    • #3
      It's stable when underclocked ... oh well.
      1600MHz at 1.20V is just too insane I guess, or the video card itself is defective.



      • #4
        I installed an extra 120mm side intake fan to blow right on the vid card ...
        so that's 3 intake and 2 exhaust, and the thing is still not stable.

        I wish I hadn't replaced the thermal paste and voided the warranty, I basically got screwed on this deal.
        The card was DOA when I first bought it, and it is still DOA ... no amount of cooling can fix that fact. I just have to take a $1000 hit, with no joy at all from having given $1000 to AMD.



        • #5
          KINDA OFF TOPIC HERE:
          I finally figured out a way that seems to lower the HBM voltage.

          cat /sys/class/drm/card0/device/pp_od_clk_voltage
          OD_SCLK:
          0: 852Mhz 800mV
          1: 927Mhz 825mV
          2: 1002Mhz 870mV
          3: 1402Mhz 930mV
          4: 1452Mhz 970mV
          5: 1502Mhz 1020mV
          6: 1552Mhz 1070mV
          7: 1602Mhz 1125mV
          OD_MCLK:
          0: 167Mhz 800mV
          1: 652Mhz 825mV
          2: 1027Mhz 870mV
          3: 945Mhz 930mV
          OD_RANGE:
          SCLK: 852MHz 2400MHz
          MCLK: 167MHz 1500MHz
          VDDC: 800mV 1250mV
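A table like the one above is easy to turn into numbers for scripting. Below is a rough Python sketch that parses the text into per-section lists of (MHz, mV) pairs. It assumes the layout shown in this post (OD_SCLK / OD_MCLK / OD_RANGE headers followed by "index: clock voltage" lines) and ignores the OD_RANGE section; other kernels or cards may print a different layout.

```python
import re

# Matches state lines such as "7: 1602Mhz 1125mV".
STATE = re.compile(r"^(\d+):\s*(\d+)Mhz\s+(\d+)mV$", re.IGNORECASE)

def parse_od_table(text):
    """Parse pp_od_clk_voltage text into {"OD_SCLK": [(mhz, mv), ...], "OD_MCLK": [...]}."""
    tables = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("OD_") and line.endswith(":"):
            current = line[:-1]
            if current != "OD_RANGE":
                tables[current] = []
            continue
        m = STATE.match(line)
        if m and current in tables:
            tables[current].append((int(m.group(2)), int(m.group(3))))
    return tables

sample = """OD_SCLK:
0: 852Mhz 800mV
1: 927Mhz 825mV
2: 1002Mhz 870mV
3: 1402Mhz 930mV
4: 1452Mhz 970mV
5: 1502Mhz 1020mV
6: 1552Mhz 1070mV
7: 1602Mhz 1125mV
OD_MCLK:
0: 167Mhz 800mV
1: 652Mhz 825mV
2: 1027Mhz 870mV
3: 945Mhz 930mV
OD_RANGE:
SCLK: 852MHz 2400MHz
MCLK: 167MHz 1500MHz
VDDC: 800mV 1250mV
"""

tables = parse_od_table(sample)
print(tables["OD_MCLK"])
```

On a live system the same function could be fed the contents of /sys/class/drm/card0/device/pp_od_clk_voltage instead of the pasted sample.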


          Setting mclk state 2 higher than state 3 causes state 3 to be disabled, and with it its locked higher voltage.
          This results in more thermal headroom and less throttling. 1027MHz locked in this way beats 1077MHz on state 3 in every benchmark.
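To spot this kind of ordering in a clock table, a small Python helper like the one below can flag any state whose clock sits below a lower-numbered state. Whether the driver really disables such a state is only what was observed in this post; the function name shadowed_states is made up for illustration.

```python
def shadowed_states(clocks_mhz):
    """Return indices of states whose clock is below that of some lower-numbered state."""
    shadowed = []
    highest_so_far = None
    for index, mhz in enumerate(clocks_mhz):
        if highest_so_far is not None and mhz < highest_so_far:
            # A lower-numbered state already runs faster than this one.
            shadowed.append(index)
        else:
            highest_so_far = mhz
    return shadowed

tuned_mclk = [167, 652, 1027, 945]    # OD_MCLK clocks from the modified table
default_mclk = [167, 500, 800, 945]   # OD_MCLK clocks from the default table

print(shadowed_states(tuned_mclk))
print(shadowed_states(default_mclk))
```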

          Now the mystery is what exact voltages are locked for the mclk states, since Linux users aren't allowed to see HBM voltage or temps.

          DEFAULT:

          OD_SCLK:
          0: 852Mhz 800mV
          1: 991Mhz 900mV
          2: 1138Mhz 950mV
          3: 1269Mhz 1000mV
          4: 1348Mhz 1050mV
          5: 1440Mhz 1100mV
          6: 1528Mhz 1150mV
          7: 1600Mhz 1200mV
          OD_MCLK:
          0: 167Mhz 800mV
          1: 500Mhz 900mV
          2: 800Mhz 950mV
          3: 945Mhz 1000mV
          OD_RANGE:
          SCLK: 852MHz 2400MHz
          MCLK: 167MHz 1500MHz
          VDDC: 800mV 1250mV
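To make the difference between the two tables easier to see, here is a short Python sketch that prints the per-state change in sclk clock and voltage. The numbers are copied from the default and modified OD_SCLK tables in this post; nothing here talks to the driver.

```python
# (MHz, mV) per state, copied from the two OD_SCLK tables in this post.
DEFAULT_SCLK = [(852, 800), (991, 900), (1138, 950), (1269, 1000),
                (1348, 1050), (1440, 1100), (1528, 1150), (1600, 1200)]
TUNED_SCLK = [(852, 800), (927, 825), (1002, 870), (1402, 930),
              (1452, 970), (1502, 1020), (1552, 1070), (1602, 1125)]

def state_deltas(default, tuned):
    """Return a list of (delta_mhz, delta_mv) for each state, tuned minus default."""
    return [(t_mhz - d_mhz, t_mv - d_mv)
            for (d_mhz, d_mv), (t_mhz, t_mv) in zip(default, tuned)]

for state, (d_mhz, d_mv) in enumerate(state_deltas(DEFAULT_SCLK, TUNED_SCLK)):
    print(f"state {state}: {d_mhz:+d} MHz, {d_mv:+d} mV")
```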

          From what I can gather from random internet forums, the mclk voltage is locked to 1.35v on vega64/fe for state 3 and 1.25v for vega56. I'm not sure what state 2 is locked at, but adjusting the sclk voltages and the mclk "floor" voltages (there are two 1.35v and a lower one) in the pptable has no effect on max clock speed or power usage. They do seem to influence how states are switched. I suspect state 2 is locked at 1.10v, then maybe 1.00v for state 1.

          mclk settings tried:
          0: none
          1: 500 (stable) 550 (stable) 600 (stable) 652 (stable) 700 (unstable)
          2: 800 (stable) 850 (stable) 900 (stable) 952 (stable) 1027 (stable) 1033 (unstable) 1052 (unstable)
          3: 945 (stable) ... 1027 (stable) 1052 (stable) 1077 (stable) 1090 (stable?) 1107 (unstable)
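The results above can be summarized with a few lines of Python that pick the highest clock confirmed stable for each state. The data is copied from the list; the "stable?" entry for 1090 is kept as "unsure" and not counted, and the values elided by "..." in state 3 are left out.

```python
# Results per mclk state, copied from the list above: (MHz, status).
MCLK_TRIALS = {
    1: [(500, "stable"), (550, "stable"), (600, "stable"),
        (652, "stable"), (700, "unstable")],
    2: [(800, "stable"), (850, "stable"), (900, "stable"), (952, "stable"),
        (1027, "stable"), (1033, "unstable"), (1052, "unstable")],
    3: [(945, "stable"), (1027, "stable"), (1052, "stable"), (1077, "stable"),
        (1090, "unsure"), (1107, "unstable")],
}

def highest_stable(trials):
    """Return {state: highest MHz marked 'stable'}, ignoring unsure and unstable runs."""
    return {state: max(mhz for mhz, status in runs if status == "stable")
            for state, runs in trials.items()}

print(highest_stable(MCLK_TRIALS))
```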

          Also there is the matter of the SoC clock. Changing the MHz values of those states in the pptable appears to have an effect, but the voltages for the 0-5 states are a mystery. I suspect they are also linked to the sclk voltages in some way. I tried various values for state 5 here and anything over ~1200 isn't stable. I've noticed that the 3 top clocks are 1MHz less than the 3 top dispclock values shifted down 1 state, i.e.: 1200 1108 1029 = 1107 1028
          Clearly a different voltage is applied to each one, possibly mapped to sclk voltage states 5 4 3 etc.
          Also it is unclear if the actual hbm memory controller is included in the SoC or if it is included in the gpu plane itself.

          https://openbenchmarking.org/result/...FO-1809157FO93
          Last edited by Soul_keeper; 17 September 2018, 03:05 AM.



          • #6
            After figuring out this stuff, and writing my own spreadsheet/scripts for quickly editing all the pptable values and changing what I need, I'm going to test my new settings and see if I can't permanently stay out of a thermal shutdown/lockup in dota2 or other games/benchmarks.



            • #7
              The issue is not thermal/hardware. It is software, and still exists, if anyone reads this for posterity.
              I went 2 months without playing a single game, then updated to the latest mesa/kernel and sure enough system lockup.
              Impossible to play dota2 without constantly rebooting the pc ... frustrating nonsense.

              I guess AMD still hasn't bothered to write the code to handle recovery for vega yet.
