Annoying AMD Linux Graphics Driver Crashes With "Timed Out Fences" Has A Fix Coming

  • #31
    khnazile My memory clocks aren't locked though. The card can decide when to use which mclk/sclk/socclk DPM state at any time. I've also never experienced this with stock settings. Now I'm curious about why this affects some people, but not others…

    • #32
      Originally posted by kiffmet
      khnazile My memory clocks aren't locked though. The card can decide when to use which mclk/sclk/socclk DPM state at any time. I've also never experienced this with stock settings. Now I'm curious about why this affects some people, but not others…
      I believe it's some sort of race condition combined with a chip quality issue. Better chips leave the 'dangerous' state faster, so they are much less likely to hit the issue.

      What made me think this: according to engineers who repair failed mining equipment, AMD chips suffer from those weird failures where the GPU stays partially operational far more often than their Nvidia counterparts. That makes diagnostics much harder, and it's one of the reasons many of them refuse to repair AMD hardware.

      • #33
        The same issue existed on Windows. It happened to me a lot, especially when running a video game on one monitor and a video in a browser on the other: constant timeouts and crashes. They have since fixed the timeout issues, but the game still crashes randomly for me.

        • #34
          I think these patches were the ones from drm-next that tried to improve the situation:
          It can happen that we query the sequence value before the callback had a chance to run. Workaround that by grabbing the fence lock and releasing it again. Should be replaced by hw handling soon.
          Signed-off-by: Christian König
          CC: stable@vger.kernel.org # 5.19+
          Fixes:...
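For anyone wondering why "grab the lock and release it again" helps, here is a toy Python model of that race. All names (ToyFence, ToyVM, read_seq and so on) are hypothetical and this is not the kernel's dma_fence or amdgpu code; it assumes, as the model does, that the signaling path holds the fence lock while it marks the fence signaled and runs its callbacks. A reader that has already seen the fence as signaled can then take and drop the lock to wait out a callback that is still running.

```python
import threading
import time


class ToyFence:
    """Toy fence: signal() marks it signaled and runs callbacks under one lock.
    Hypothetical names; this is not the kernel dma_fence API."""

    def __init__(self):
        self.lock = threading.Lock()
        self.signaled = False
        self.callbacks = []

    def add_callback(self, cb):
        with self.lock:
            self.callbacks.append(cb)

    def signal(self):
        with self.lock:
            self.signaled = True      # visible to lock-free readers right away
            for cb in self.callbacks:
                cb()                  # ...but the callbacks finish only later


class ToyVM:
    """Holds a sequence counter that a fence callback bumps (hypothetical)."""

    def __init__(self, fence):
        self.fence = fence
        self.seq = 0
        fence.add_callback(self._on_signal)

    def _on_signal(self):
        time.sleep(0.05)  # widen the window between "signaled" and "callback done"
        self.seq += 1

    def read_seq_racy(self):
        # May return the old value even though the fence already reports signaled.
        return self.seq

    def read_seq(self):
        # The workaround: grab the fence lock and release it again. If signal()
        # is still running callbacks, this blocks until they are done.
        with self.fence.lock:
            pass
        return self.seq


def demo():
    fence = ToyFence()
    vm = ToyVM(fence)
    signaler = threading.Thread(target=fence.signal)
    signaler.start()
    while not fence.signaled:  # wait until the fence reports signaled
        time.sleep(0.001)
    racy = vm.read_seq_racy()  # usually 0: the callback is still sleeping
    safe = vm.read_seq()       # 1: waited for the callback to finish
    signaler.join()
    return racy, safe


if __name__ == "__main__":
    print(demo())
```

In this model the lock round trip only helps once the reader has already observed the fence as signaled, which fits the commit calling it a workaround to be replaced by hardware handling.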

          Setting this flag on a scheduler fence prevents pipelining of jobs depending on this fence. In other words we always insert a full CPU round trip before dependent jobs are pushed to the pipeline. Signed-off-by: Christian König

          Make sure that we always have a CPU round trip to let the submission code correctly decide if a TLB flush is necessary or not. Signed-off-by: Christian König
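The "no pipelining" flag from the second commit can be pictured with an equally rough toy model. Everything here is hypothetical (ToyScheduler, dont_pipeline, the pt_update job name) and it is not the DRM scheduler API; the "GPU" is just a FIFO list. Normally a dependent job is pushed straight to the hardware queue behind its dependency. When the dependency's fence carries the flag, the scheduler first waits on the CPU until that fence has signaled, which is the full CPU round trip the third commit relies on so the submission code can decide whether a TLB flush is needed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyFence:
    name: str
    dont_pipeline: bool = False  # hypothetical stand-in for the scheduler-fence flag
    signaled: bool = False


@dataclass
class ToyJob:
    name: str
    deps: List[ToyFence] = field(default_factory=list)
    fence: Optional[ToyFence] = None

    def __post_init__(self):
        if self.fence is None:
            self.fence = ToyFence(self.name)


class ToyScheduler:
    """Toy scheduler; the 'GPU' is just a FIFO of jobs."""

    def __init__(self):
        self.hw_queue: List[ToyJob] = []
        self.log: List[str] = []

    def gpu_run_until(self, fence):
        # Simulate the GPU completing queued jobs until `fence` signals.
        while not fence.signaled:
            done = self.hw_queue.pop(0)
            done.fence.signaled = True
            self.log.append(f"GPU finished {done.name}")

    def push(self, job):
        for dep in job.deps:
            if dep.signaled:
                continue
            if dep.dont_pipeline:
                # Full CPU round trip: wait for the dependency before pushing.
                self.log.append(f"CPU waits for {dep.name}")
                self.gpu_run_until(dep)
            else:
                # Pipelining: push now and let the hardware order the jobs.
                self.log.append(f"{job.name} pipelined behind {dep.name}")
        self.hw_queue.append(job)
        self.log.append(f"pushed {job.name}")


def run(flag):
    sched = ToyScheduler()
    update = ToyJob("pt_update")
    update.fence.dont_pipeline = flag
    sched.push(update)
    sched.push(ToyJob("render", deps=[update.fence]))
    return sched.log


if __name__ == "__main__":
    print(run(False))
    print(run(True))
```

Without the flag the log reads `pushed pt_update`, `render pipelined behind pt_update`, `pushed render`; with it, the scheduler logs `CPU waits for pt_update` and `GPU finished pt_update` before `pushed render`. The cost is latency: the dependent job cannot be queued until the GPU has actually finished its dependency.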


          They improve the situation somewhat: plasmashell and kwin, for example, which used to break after a few hours because of this issue, now work fine.

          However, Firefox still freezes, waiting for the card to become available, I think. Sometimes Firefox recovers, and sometimes I just give up, kill Firefox and restart it.

          • #35
            Originally posted by khnazile

            I believe it's some sort of race condition combined with a chip quality issue. Better chips leave the 'dangerous' state faster, so they are much less likely to hit the issue.

            What made me think this: according to engineers who repair failed mining equipment, AMD chips suffer from those weird failures where the GPU stays partially operational far more often than their Nvidia counterparts. That makes diagnostics much harder, and it's one of the reasons many of them refuse to repair AMD hardware.
            From what I've observed, Polaris seems to have far fewer driver issues than every newer generation, on both Windows and Linux. The two OSes use different AMD kernel drivers, unlike Nvidia, which reuses as much driver code as possible across operating systems. I have an RX 580, and the last time I experienced (minor) driver issues on either Windows or Linux was around 2017/2018. Others report similar experiences, and people rarely talk about driver issues on Polaris. As you said, these driver issues may depend on chip quality, which got me wondering whether AMD's manufacturing quality has simply gone downhill since Polaris. I've also seen comments calling the 5700 XT a silicon lottery, because some owners experience driver issues and others don't.

            As for driver issues that show up on Linux but not on Windows, I think this is simply because the Windows WDDM driver infrastructure is more robust than what Linux has. A classic example is incorrect graphics API usage: on Linux it can cause a GPU hang, while on Windows it only crashes the app. Windows is also much better at handling driver resets.
            Last edited by user1; 15 November 2022, 10:58 AM.
