RADV Lands New Extension To Better Debug GPU Hangs


  • #11
    I've not seen many hangs over the years. I'm currently playing Horizon Forbidden West on my 6800M with HDR at 4K with everything maxed out.

    Most of my issues stem from there being no testing of PRIME systems; nearly all my bug reports are for regressions that happened because PRIME wasn't taken into account.

    A new regression slipped into libdrm and took me ages to find. I thought it was something in the kernel, or Mesa, or maybe the vulkan-loader, or gamescope...



    • #12
      Originally posted by Rauros
      I have a 7900 XT. I tried different distros (Arch, Nobara, Fedora, openSUSE, Pop!_OS), different Mesa versions, and kernel versions (from 6.4 to 6.9), and they all have these reset problems. It is not a hardware problem or a distro problem. Look at the links: there are way too many people hitting GCVM_L2_PROTECTION_FAULT_STATUS, "*ERROR* Waiting for fences timed out", "ERROR ring gfx timeout" or "flip_done timed out" errors. I've been running Windows for the last 5 months and have played tons of games. Not a single crash.
      And look at this: https://www.reddit.com/r/linux_gamin...ould_get_right
      I see a similar post in r/linux_gaming or in the r/linux subreddit every week, and every week people downvote it and tell the person that they never had a problem, that it must be a hardware defect, or that it must be specific to some games. Or they say that these things happen with NVIDIA too. I used NVIDIA for 6 years and regularly visited their forums. I remember people having problems with XWayland or missing features, but to be honest I don't remember people having so many hang and reset problems that have gone unsolved for 4 years now.
      If this is a hardware defect, then AMD is doing a really terrible job with their hardware and we should avoid it at all costs, considering how many people are having these issues.
      If this is specific to *some* games, then the Mesa guys are doing a terrible job, but I know that these errors are reproducible with AMDVLK as well.

      First of all, I believe you; it's really not that I don't believe you. Can you please tell me what kind of CPU and mainboard you have?
      I only have a Vega 64, an AMD PRO W7900, and an AMD Ryzen 7600 with Radeon 780G, and all my systems are rock solid without errors or hangs.
      It's no joke: there really are hardware defects that only trigger on one OS but not on another, but that's a rare case.
      In your case I don't think it's a hardware defect.

      What I can see is this: at the start of the AMD open-source initiative, AMD published a lot of open documentation about the ISA of their chips and a lot of hardware documentation. Then AMD switched to another strategy: develop the driver first and proclaim that the driver itself is the documentation. Because of that, AMD developers can reproduce and fix bugs, while outside developers often cannot understand or fix them because of the lack of documentation. If you follow the latest news, AMD did promise to again spend more time and money on documentation to satisfy the AI people; this documentation will also trickle down to the non-AI technical stuff.

      It looks like it's not enough to just push out the open-source driver without top-notch ISA/hardware documentation. This situation makes it impossible for developers outside of AMD to fix bugs like the ones you have.

      Did you ever try to file bug reports somewhere? Please give me links to those as well.

      If you give me more information about what action and software triggered an error or crash, then I can use my AMD PRO W7900 to try to reproduce your bug for further investigation.
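
One way to gather that kind of information is to count how often each of the error strings quoted earlier in this thread shows up in a saved kernel log. The following is only a minimal sketch under stated assumptions: the label names and the function name `count_gpu_errors` are made up for illustration, the log is assumed to be available as plain text, and the `ring gfx\S* timeout` pattern assumes the ring name may carry a suffix in real logs.

```python
import re
import sys
from collections import Counter

# Error strings quoted earlier in this thread. The labels on the left are
# hypothetical names chosen for this sketch, not anything the driver defines.
PATTERNS = {
    "protection_fault": re.compile(r"GCVM_L2_PROTECTION_FAULT_STATUS"),
    "fence_timeout": re.compile(r"Waiting for fences timed out"),
    # Assumption: tolerate an optional suffix directly after "gfx".
    "ring_timeout": re.compile(r"ring gfx\S* timeout"),
    "flip_done_timeout": re.compile(r"flip_done timed out"),
}


def count_gpu_errors(log_text: str) -> Counter:
    """Count log lines that match each known error pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


if __name__ == "__main__":
    # Read the log text from standard input and print one count per label.
    for label, n in count_gpu_errors(sys.stdin.read()).items():
        print(f"{label}: {n}")
```

Saved under a hypothetical name such as `scan_gpu_errors.py`, it could be fed the previous boot's kernel messages with something like `journalctl -k -b -1 | python3 scan_gpu_errors.py`. The counts, together with the game or application that was running, are the sort of detail a bug report can use.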

      Phantom circuit Sequence Reducer Dyslexia



      • #13
        Originally posted by qarium

        First of all, I believe you; it's really not that I don't believe you. Can you please tell me what kind of CPU and mainboard you have?
        I only have a Vega 64, an AMD PRO W7900, and an AMD Ryzen 7600 with Radeon 780G, and all my systems are rock solid without errors or hangs.
        It's no joke: there really are hardware defects that only trigger on one OS but not on another, but that's a rare case.
        In your case I don't think it's a hardware defect.

        What I can see is this: at the start of the AMD open-source initiative, AMD published a lot of open documentation about the ISA of their chips and a lot of hardware documentation. Then AMD switched to another strategy: develop the driver first and proclaim that the driver itself is the documentation. Because of that, AMD developers can reproduce and fix bugs, while outside developers often cannot understand or fix them because of the lack of documentation. If you follow the latest news, AMD did promise to again spend more time and money on documentation to satisfy the AI people; this documentation will also trickle down to the non-AI technical stuff.

        It looks like it's not enough to just push out the open-source driver without top-notch ISA/hardware documentation. This situation makes it impossible for developers outside of AMD to fix bugs like the ones you have.

        Did you ever try to file bug reports somewhere? Please give me links to those as well.

        If you give me more information about what action and software triggered an error or crash, then I can use my AMD PRO W7900 to try to reproduce your bug for further investigation.

        AMD's biggest problem is ROCm itself. It's a galvanic juxtaposition of unlike and dissimilar interfaces that as a whole is incompatible, incomplete and buggy as hell.... It's fglrx all over again...



        • #14
          Originally posted by duby229
          AMD's biggest problem is ROCm itself. It's a galvanic juxtaposition of unlike and dissimilar interfaces that as a whole is incompatible, incomplete and buggy as hell.... It's fglrx all over again...

          Is he even using ROCm? ROCm works very well on my Vega 64 with Blender... ZLUDA also works on the Vega 64.

          Right now the main problem I see with ROCm is that developers outside of AMD simply lack the ISA and hardware documentation and the debugging tools to fix the problems. It looks like it is not enough to develop the driver internally, AMDVLK-style, and drop an open-source version somewhere.

          Also, delivering ROCm outside of the Linux distros is really not a good idea. I could never get the ROCm driver from AMD.com to run, but the distro version from Fedora works...

          "It's a galvanic juxtaposition of unlike and dissimilar interfaces that as a whole is incompatible, incomplete and buggy as hell."

          I really struggle to understand what you want to say.

          Did you know that it is so buggy because they copy CUDA bug by bug? If there is a bug in CUDA, they literally copy it in ROCm/HIP... This is the only reason why ZLUDA works.

          If you want bug-free compute on a GPU, you should use Vulkan (no joke).


          • #15
            Originally posted by qarium

            Is he even using ROCm? ROCm works very well on my Vega 64 with Blender... ZLUDA also works on the Vega 64.

            Right now the main problem I see with ROCm is that developers outside of AMD simply lack the ISA and hardware documentation and the debugging tools to fix the problems. It looks like it is not enough to develop the driver internally, AMDVLK-style, and drop an open-source version somewhere.

            Also, delivering ROCm outside of the Linux distros is really not a good idea. I could never get the ROCm driver from AMD.com to run, but the distro version from Fedora works...

            "It's a galvanic juxtaposition of unlike and dissimilar interfaces that as a whole is incompatible, incomplete and buggy as hell."

            I really struggle to understand what you want to say.

            Did you know that it is so buggy because they copy CUDA bug by bug? If there is a bug in CUDA, they literally copy it in ROCm/HIP... This is the only reason why ZLUDA works.

            If you want bug-free compute on a GPU, you should use Vulkan (no joke).

            I don't really think ROCm works well anywhere on anything. No user of ROCm can resolve the fact that it's composed of incompatible and incomplete interfaces that work against each other. Do you know the word "clusterfuck"? "Galvanic juxtaposition" is a nicer way of saying that...

            EDIT: It means something like "Too many chiefs and not enough indians" or "Too many officers and not enough troops"

            EDIT: Clusterfuck: "The term dates at least as far back as the Vietnam War, as military slang for doomed decisions resulting from the toxic combination of too many high-ranking officers and too little on-the-ground information." Yup, sounds like the -actual- description of ROCm..
            Last edited by duby229; 04 April 2024, 01:18 PM.



            • #16
              Originally posted by qarium

              First of all, I believe you; it's really not that I don't believe you. Can you please tell me what kind of CPU and mainboard you have?
              I only have a Vega 64, an AMD PRO W7900, and an AMD Ryzen 7600 with Radeon 780G, and all my systems are rock solid without errors or hangs.
              It's no joke: there really are hardware defects that only trigger on one OS but not on another, but that's a rare case.
              In your case I don't think it's a hardware defect.

              What I can see is this: at the start of the AMD open-source initiative, AMD published a lot of open documentation about the ISA of their chips and a lot of hardware documentation. Then AMD switched to another strategy: develop the driver first and proclaim that the driver itself is the documentation. Because of that, AMD developers can reproduce and fix bugs, while outside developers often cannot understand or fix them because of the lack of documentation. If you follow the latest news, AMD did promise to again spend more time and money on documentation to satisfy the AI people; this documentation will also trickle down to the non-AI technical stuff.

              It looks like it's not enough to just push out the open-source driver without top-notch ISA/hardware documentation. This situation makes it impossible for developers outside of AMD to fix bugs like the ones you have.

              Did you ever try to file bug reports somewhere? Please give me links to those as well.

              If you give me more information about what action and software triggered an error or crash, then I can use my AMD PRO W7900 to try to reproduce your bug for further investigation.

              I have an MSI B650M motherboard and a Ryzen 7 7700 CPU. All the resets and hangs I have observed have already been reported. For instance, this one https://gitlab.freedesktop.org/drm/amd/-/issues/2156 and this one https://gitlab.freedesktop.org/drm/amd/-/issues/3284 when you are using anything that utilizes VA-API. But this is not a regular thing. I got 2-3 of these over 2-3 weeks of usage.
              I had regular freezes in AC: Valhalla (https://gitlab.freedesktop.org/mesa/mesa/-/issues/5701) and regular "ring gfx timeout" errors (https://gitlab.freedesktop.org/drm/amd/-/issues/1974) in all kinds of different games. The AC: Valhalla resets were more frequent.
              I had texture problems in the Wolfenstein games and Watch Dogs 2, but those are unrelated to driver stability.
              These hangs are totally random. You can play something for a week or two without any crashes and then get a reset randomly.

              One other thing that bothers me is that on Windows my GPU consumes 45 watts when decoding a video, but on Linux with the same settings and configuration it consumes 75-95 watts. This is discussed here: https://gitlab.freedesktop.org/drm/amd/-/issues/3195. Again, it's unrelated to stability, but still bothersome.



              • #17
                Originally posted by duby229
                I don't really think ROCm works well anywhere on anything. No user of ROCm can resolve the fact that it's composed of incompatible and incomplete interfaces that work against each other. Do you know the word "clusterfuck"? "Galvanic juxtaposition" is a nicer way of saying that...
                EDIT: It means something like "Too many chiefs and not enough indians" or "Too many officers and not enough troops"
                EDIT: Clusterfuck: "The term dates at least as far back as the Vietnam War, as military slang for doomed decisions resulting from the toxic combination of too many high-ranking officers and too little on-the-ground information." Yup, sounds like the -actual- description of ROCm..

                It looks like a ROCm problem from the outside, but in reality it's a CUDA problem. CUDA is the clusterfuck, and AMD with ROCm just reimplements CUDA with bug-by-bug compatibility. It's like Wine, which reimplements the Windows API with bug-by-bug compatibility. Something like ZLUDA would not even be possible without bug-by-bug compatibility.

                ""clusterfuck"? "Galvanic juxtaposition" is a nicer way of saying that...
                EDIT: It means something like "Too many chiefs and not enough indians" or "Too many officers and not enough troops""

                Believe it or not, CUDA is exactly this: a clusterfuck with too many chiefs who all cook the same meal, and so on.
                People thought a source-code-level compatibility layer with bug-by-bug compatibility would be a nice idea,
                and on top of that everyone thought a ZLUDA with bug-by-bug compatibility with CUDA would be a nice idea.

                The reality is that it is a clusterfuck... A clean cut, with Vulkan Compute as a real open standard, would be the solution.

                Keep in mind that NVIDIA does not want CUDA to become an open standard; their newest CUDA version has a license that makes ZLUDA against the law.

                I honestly don't understand why people don't have the wisdom to go with Vulkan Compute instead as a real solution.


                • #18
                  Originally posted by Rauros
                  I have an MSI B650M motherboard and a Ryzen 7 7700 CPU. All the resets and hangs I have observed have already been reported. For instance, this one https://gitlab.freedesktop.org/drm/amd/-/issues/2156 and this one https://gitlab.freedesktop.org/drm/amd/-/issues/3284 when you are using anything that utilizes VA-API. But this is not a regular thing. I got 2-3 of these over 2-3 weeks of usage.
                  I had regular freezes in AC: Valhalla (https://gitlab.freedesktop.org/mesa/mesa/-/issues/5701) and regular "ring gfx timeout" errors (https://gitlab.freedesktop.org/drm/amd/-/issues/1974) in all kinds of different games. The AC: Valhalla resets were more frequent.
                  I had texture problems in the Wolfenstein games and Watch Dogs 2, but those are unrelated to driver stability.
                  These hangs are totally random. You can play something for a week or two without any crashes and then get a reset randomly.
                  One other thing that bothers me is that on Windows my GPU consumes 45 watts when decoding a video, but on Linux with the same settings and configuration it consumes 75-95 watts. This is discussed here: https://gitlab.freedesktop.org/drm/amd/-/issues/3195. Again, it's unrelated to stability, but still bothersome.

                  This all really sounds like a serious problem...
                  The only one of these problems whose cause I more or less know is the 45-watt vs. 75-95-watt problem.
                  It's known that the chiplet design causes high power consumption, and after release they developed techniques to mitigate the problem.
                  These techniques have not trickled down to Linux yet... Linux is slow to get the same mitigations because, compared to Windows, it's a low priority for AMD. AMD's bridgman is retired now, but years ago he told me that the Linux market share is too small to justify any driver development, and that they already spend 2-3 times more on Linux than the market share would justify because for them the Linux driver is PR in the computer science field.

                  What do I mean by PR? Well, it's like the NVIDIA RTX 4090: the total market share of such GPUs is very small, and the big business is made with smaller and less expensive GPUs. These top-of-the-line GPUs and CPUs cost AMD, Intel and NVIDIA more than they earn, meaning they are not financed by sales but out of the PR budget (no joke).

                  Linux does not have the market share to justify the driver development through sales; it is more or less financed only to generate good PR... It's not PR for the masses, it's PR for software developers in the HPC and AI fields.

                  The fix for the 95-watt vs. 45-watt problem will come, but Linux gets it much more slowly.

                  As for the rest of the problems: for a long time now, something like 10 years, AMD has no longer released ISA and hardware documentation, which means developers outside AMD cannot fix them. In 2007/2008 this was very different: there was no open-source driver, but AMD did release top-notch ISA and hardware documentation.

                  As Lisa Su said, they will spend more money on releasing ISA and hardware documentation. Overall this will be of great benefit to the driver and to all the developers outside of AMD.

                  But this is more or less a long-term fix, not a fast fix.

                  Right now there is only one real fix: buy graphics cards that are 2-3 generations older. Believe it or not, some people even buy a used Vega 64 or Radeon VII on eBay because stability is very high on such old hardware, since the driver is very mature.

                  This is the result of many years of bug fixes.

                  Your 7900 XT + MSI B650M + Ryzen 7 7700 system looks very new compared to my Vega 64 + Threadripper 1920X system.




                  • #19
                    Originally posted by qarium

                    It looks like a ROCm problem from the outside, but in reality it's a CUDA problem. CUDA is the clusterfuck, and AMD with ROCm just reimplements CUDA with bug-by-bug compatibility. It's like Wine, which reimplements the Windows API with bug-by-bug compatibility. Something like ZLUDA would not even be possible without bug-by-bug compatibility.

                    ""clusterfuck"? "Galvanic juxtaposition" is a nicer way of saying that...
                    EDIT: It means something like "Too many chiefs and not enough indians" or "Too many officers and not enough troops""

                    Believe it or not, CUDA is exactly this: a clusterfuck with too many chiefs who all cook the same meal, and so on.
                    People thought a source-code-level compatibility layer with bug-by-bug compatibility would be a nice idea,
                    and on top of that everyone thought a ZLUDA with bug-by-bug compatibility with CUDA would be a nice idea.

                    The reality is that it is a clusterfuck... A clean cut, with Vulkan Compute as a real open standard, would be the solution.

                    Keep in mind that NVIDIA does not want CUDA to become an open standard; their newest CUDA version has a license that makes ZLUDA against the law.

                    I honestly don't understand why people don't have the wisdom to go with Vulkan Compute instead as a real solution.

                    About the legality of ZLUDA... Developing or using it is not and will never be illegal. The terms of CUDA's EULA only apply to people who have accepted the terms of the license by installing it.

                    Simply don't install CUDA... That simple....

                    EDIT: NVIDIA isn't a government; they can't make laws. The only thing they can do is make license agreements... If you don't accept the terms of their license agreement, then you cannot be bound by it.... Even NVIDIA has to abide by copyright laws...
                    Last edited by duby229; 04 April 2024, 07:56 PM.



                    • #20
                      Originally posted by duby229
                      About the legality of ZLUDA... Developing or using it is not and will never be illegal. The terms of CUDA's EULA only apply to people who have accepted the terms of the license by installing it.
                      Simply don't install CUDA... That simple....
                      EDIT: NVIDIA isn't a government; they can't make laws. The only thing they can do is make license agreements... If you don't accept the terms of their license agreement, then you cannot be bound by it.... Even NVIDIA has to abide by copyright laws...

                      Right, NVIDIA is not the government...

                      https://en.wikipedia.org/wiki/Revolving_door_(politics)

                      ZLUDA needs bug-by-bug compatibility. You will have a hard time checking bug-by-bug compatibility if you cannot install CUDA and accept the EULA on a computer to check against.
