Radeon SDMA Support Is Deemed Too Buggy That It's Dropped From Open-Source Driver

  • Radeon SDMA Support Is Deemed Too Buggy That It's Dropped From Open-Source Driver

    Phoronix: Radeon SDMA Support Is Deemed Too Buggy That It's Dropped From Open-Source Driver

    There have been many workarounds and changes to the SDMA (System DMA) copy support within the RadeonSI Gallium3D driver since the support was introduced years ago, but it's now been outright removed due to too many generations of AMD Radeon hardware having issues with it enabled...

  • #2
    Some simple questions come to my mind: If these issues have been known for some time, why don't they fix them in hardware? Do the driver people talk to the hardware engineers about these issues regularly to get them fixed eventually in later silicon / newer generations? In its current form, this capability seems to be a waste of resources (engineering time and die space).


    • #3
      Originally posted by ms178
      Some simple questions come to my mind: If these issues have been known for some time, why don't they fix them in hardware? Do the driver people talk to the hardware engineers about these issues regularly to get them fixed eventually in later silicon / newer generations? In its current form, this capability seems to be a waste of resources (engineering time and die space).
      If it is so simple, provide the fix.

      Hint: it is not simple to fix or even debug issues like that if they are sporadic. They don't fully know why it is happening; it could be software, it could be hardware, it could be firmware, etc.


      • #4
        Originally posted by baryluk

        If it is so simple, provide the fix.

        Hint: it is not simple to fix or even debug issues like that if they are sporadic. They don't fully know why it is happening; it could be software, it could be hardware, it could be firmware, etc.
        I didn't claim that it would be a simple problem to fix; I just asked two simple questions. They have the simulators and tools, and they have the hardware and software engineers with the abilities to debug this. I don't know every detail about the bring-up of such a complex product, but I'd assume that they have simulations and test chips which could reveal these fundamental issues. I have the impression that problems like this could be found and fixed earlier through better communication and cooperation between the different departments involved within the company (and this is not an AMD-only problem).

        Edit: Another example of such an "engineering disaster" is the NGG implementation in Vega. As it is, it is just "dark silicon", with many transistors left unused. One might think they could have tested and verified its design earlier in the development process.
        Last edited by ms178; 09 December 2020, 08:26 AM.


        • #5
          Originally posted by ms178
          Some simple questions come to my mind: If these issues have been known for some time, why don't they fix them in hardware? Do the driver people talk to the hardware engineers about these issues regularly to get them fixed eventually in later silicon / newer generations? In its current form, this capability seems to be a waste of resources (engineering time and die space).
          My bet is that the Windows drivers didn't use the feature, or used it only in a very specific (i.e. actually tested) way.


          • #6
            Originally posted by ms178
            Some simple questions come to my mind: If these issues have been known for some time, why don't they fix them in hardware?
            Assuming the software actually used the feature and discovered/reported the issues, they probably did try to fix the issues with it. But DMA strikes me as a somewhat difficult feature to unit-test, given that there are potentially different memories, caches, MMUs, apertures, and buses involved. And even if they fixed the bugs in the previous generation, that leaves plenty of room for new ones to creep in.

            Originally posted by ms178
            Do the driver people talk to the hardware engineers about these issues regularly
            I'm sure the firmware and driver people are reporting hardware bugs all the time, and the hardware engineers are probably trying to characterize the bugs so they can plausibly be worked around. But, sometimes, something is so broken that it would hurt performance too much or just isn't reliable in enough cases to actually use.

            If you don't like working around hardware bugs, then definitely don't take a job writing firmware or drivers at a chip company. In my experience, hardware engineers are quick to dismiss issues as something that the software can work around. Since respins are expensive and time-consuming, chips go to market with a lot of known defects. Sure, they probably get fixed in the next gen, but then that will have its share of entirely new bugs. The worst part is when the code becomes so littered with workarounds that it gets fragile, convoluted, and difficult to change. And the biggest tragedy I've seen is when QA misunderstood a feature and the hardware team decided not to fix it to work the way it should, because QA would have to rewrite too many test cases.


            • #7
              Thanks for your insights, coder!

              As a consumer I hope that the companies find a way to make better products in the future without the need to work around performance- or stability-critical issues that often. With FPGAs integrated on-chip down the road, I assume that this will also enhance the possibilities to fix some of these issues which are uneconomical to fix today.


              • #8
                Originally posted by ms178
                Some simple questions come to my mind: If these issues have been known for some time, why don't they fix them in hardware? Do the driver people talk to the hardware engineers about these issues regularly to get them fixed eventually in later silicon / newer generations? In its current form, this capability seems to be a waste of resources (engineering time and die space).
                AFAIK we still make use of SDMA in the kernel driver and these changes only affect SDMA usage by userspace drivers.
                Last edited by bridgman; 09 December 2020, 10:45 AM.


                • #9
                  Originally posted by bridgman

                  AFAIK we still make use of SDMA in the kernel driver, and these changes only affect SDMA usage by userspace drivers.
                  Thanks for clarifying the use of SDMA elsewhere in the software stack.

                  I can also understand that no company wants to discuss such problems around their products publicly (e.g. what went wrong with NGG on Vega). I can only hope that the motivation to sell more products leads to a constant optimization of the internal processes to get better products (with as few defects as possible) out to us consumers in the end.
                  Last edited by ms178; 09 December 2020, 11:49 AM.


                  • #10
                    Typo: The link to the merge request is missing a quote mark so the URL contains text which should have been displayed.

                    Edit: Fixed now. Thanks.
                    Last edited by anth; 09 December 2020, 06:17 PM.
