
Radeon Linux Driver Seeing "MALL" Feature For Big Navi


  • #31
    Originally posted by mdedetrich View Post
    Does NVidia contain a patent for the FLR resetting feature?
    There are patents covering hardware implementations of that feature. I did not say they are Nvidia's patents.

    Originally posted by mdedetrich View Post
    Customers have been calling this out as a legitimate issue on AMD's side for years (these are some of the most voted threads on AMD's reddit, and I am damn sure it's probably the most common complaint given to AMD in server scenarios).
    https://amdgpu-install.readthedocs.i...ll-bugrep.html

    To be correct, Reddit is not where you report bugs. I guess you don't have an AMD Customer Engagement Representative. Those of us working with servers know where to report things. Brats report stuff on Reddit, then complain that it is not fixed, when the reason it is not fixed is that there are no bug reports. Reddit is the marketing department only.

    Major customers with problems in the server space report to their AMD Customer Engagement Representative, who puts them straight in the developers' ears about what their problems are. Minor customers on Linux open a freedesktop.org Bugzilla report, as directed by the documentation.

    https://www.amd.com/en/support/kb/faq/amdbrt
    On Windows you have a graphical tool to submit a bug report.


    Originally posted by mdedetrich View Post
    Yeah because having to physically reboot a server machine because the GPU gets put into an invalid state and cannot be reset properly is loads of fun.... I am sorry but unless you can guarantee that the cards never crash this to me is an excuse.
    For those who don't know: with Nvidia cards you also need to know how to do a PCI hot reset, for the times when FLR fails. Yes, PCI hot reset will sometimes fail on Nvidia cards as well, and then you have to physically reboot the machine. With AMD you start at PCIe hot reset; when that fails you drop to BACO reset, and if BACO reset fails you have to physically reboot the machine. Be it AMD or Nvidia, you have at least one step before you have to physically reboot the machine to fix a GPU issue. With AMD you have 5 steps in total that you can try before a physical reset. With Nvidia you only have two.

    Please note that with modern OpenBMC systems a physical system reset does not require me to go to the machine; it can be performed remotely. One of the reasons for remote resetting through things like OpenBMC is Nvidia cards in servers failing both FLR and PCI hot reset, which forces a full server reset to bring them back. These problems are why I would love to see the PCIe standard extended to support a proper warm reset and cold reset of cards. Yes, I would love to have server boards where I could, per the PCI standard, cut all power to a slot and then apply power again for misbehaving cards. It would be a lot friendlier if I could power cycle misbehaving cards instead of complete systems.

    Please note that I am not saying having to physically reset a machine is a good outcome. The reality is that when you have handled enough Nvidia cards in a render farm, you learn that having to reboot a machine because there is no other choice happens a lot more than you would like. Getting used to managing AMD GPUs is a learning curve at first: you start with PCIe hot reset, and the Linux kernel and most VM solutions don't provide a simple, straightforward interface to trigger one. Once you learn how to perform a PCIe hot reset, it has a positive effect on your management of Nvidia cards, in that you do fewer complete system resets when they screw up beyond what FLR will fix. Of course, with AMD you then need to learn to perform a BACO reset. After that, the times you need a full physical reset are more frequent on Nvidia systems than on AMD ones.

    Sorry to say, a lot of those complaining on Reddit don't know what they are doing. That is partly shown by them complaining on Reddit in the first place instead of opening a proper bug report or talking to an AMD Customer Engagement Representative. An AMD Customer Engagement Representative will in fact provide you with management process guidelines to get the best out of AMD GPUs; that is where I learnt about PCIe hot reset in the first place.

    mdedetrich, you are also complaining about it here. Where is your open bug report about it? You don't have one, right? Before complaining about these problems you need to have spent the time filing bug reports.

    I find AMD's issues only annoying, not big enough for me to bother with a bug report. AMD GPUs are not causing me to reset systems any more often than systems containing Nvidia cards. Mind you, I did take the time to learn how to manage AMD GPUs correctly. If you have not learnt how to reset an AMD card's state then yes, you will be having problems, but they are problems caused by your lack of knowledge, and that lack will be affecting your Nvidia card management as well.
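
For readers following the escalation idea in this post (try the lightest reset first, fall back to heavier ones, reboot only as a last resort), here is a minimal Python sketch. It assumes a Linux host and uses the per-device `reset` attribute under `/sys/bus/pci/devices/`; with that attribute the kernel picks whichever reset mechanism it supports for the function, so it is not necessarily a hot reset. The helper names `request_reset` and `escalate`, and the `sysfs_root` parameter, are hypothetical and exist only for illustration.

```python
from pathlib import Path


def sysfs_reset_path(bdf, sysfs_root="/sys"):
    """Path of the per-device reset attribute for a PCI address like 0000:03:00.0."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "reset"


def request_reset(bdf, sysfs_root="/sys"):
    """Ask the kernel to reset one PCI function by writing "1" to its reset attribute.

    Returns False when the device exposes no reset attribute or the write fails.
    Needs root on a real system; sysfs_root is only a hook for testing.
    """
    path = sysfs_reset_path(bdf, sysfs_root)
    if not path.exists():
        return False
    try:
        path.write_text("1")
    except OSError:
        return False
    return True


def escalate(steps):
    """Run (name, attempt) pairs in order and return the name of the first success.

    Each attempt is a zero-argument callable returning True on success.
    If every step fails, the only option left is a full system reset.
    """
    for name, attempt in steps:
        if attempt():
            return name
    return "full system reset"
```

A ladder would be passed as a list such as `[("function reset", lambda: request_reset("0000:03:00.0")), ...]`, with any vendor-specific steps supplied by the caller; nothing here implements BACO or a hot reset itself.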



    • #32
      Originally posted by mdedetrich View Post
      Right, but the point is that it's the most supported method that is part of the actual PCIe spec. I mean since you work with AMD you would know more than me, but all I see is AMD implementing all these vendor specific APIs/non published workarounds for how to reset the GPUs when the far simplest solution is for AMD to implement FLR.
      The vendor specific implementations are our attempt to add support to existing hardware that we can't go back and change. Without a time machine or magic wand I think you would agree that is our only option.

      Originally posted by mdedetrich View Post
      I understand that FLR is not the only way of resetting the device, but if you are going to implement it why do it in a completely non-standard way? This seems like AMD shooting themselves completely in the foot here.
      Again, I think the only person talking about implementing FLR in a non-standard way is you - we have not proposed or done that as far as I know. Apologies in advance if that is not correct.

      Originally posted by mdedetrich View Post
      If you want to implement a vendor neutral way to reset a graphics card, FLR is the way to do it. It being optional is completely beside the point here. I mean AMD spent effort implementing this "optional" feature in the first place in their own way, so clearly the fact that it's optional is not the relevant point here (in fact it would be more relevant if AMD didn't implement any kind of resetting feature at all).
      My understanding was that hot reset via secondary bus reset was the closest thing to a standard way to reset a graphics card, while FLR was an optional mechanism for resetting a single function on a multi-function card... but I'm the furthest thing from an expert here.
      Last edited by bridgman; 22 October 2020, 07:48 PM.
      Test signature
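
As background for the "hot reset via secondary bus reset" mentioned in this post: on a PCI-to-PCI bridge, the Secondary Bus Reset control is bit 6 of the 16-bit Bridge Control register at offset 0x3E of the bridge's configuration header. The Python sketch below only computes the register values that would be written; the function names are hypothetical, and the hold and settle delays that the specification requires between the two writes are deliberately left out.

```python
BRIDGE_CONTROL_OFFSET = 0x3E   # Bridge Control register in a Type 1 (bridge) header
BRIDGE_CTL_BUS_RESET = 0x40    # bit 6: Secondary Bus Reset


def with_bus_reset(bridge_control, asserted):
    """Return the 16-bit Bridge Control value with the reset bit set or cleared."""
    if asserted:
        return (bridge_control | BRIDGE_CTL_BUS_RESET) & 0xFFFF
    return bridge_control & ~BRIDGE_CTL_BUS_RESET & 0xFFFF


def hot_reset_sequence(bridge_control):
    """Values to write to Bridge Control, in order: assert reset, then release it.

    The caller must wait between the two writes and again afterwards before
    touching the devices behind the bridge; those timings are not modelled here.
    """
    return [
        with_bus_reset(bridge_control, True),
        with_bus_reset(bridge_control, False),
    ]
```

Note that the writes go to the upstream bridge, not to the GPU, which is why every function behind that bridge is reset together.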



      • #33
        Resetting something as complex as a GPU is hard. There are lots of blocks on a GPU and lots of interconnects and, in a lot of cases, multiple endpoints (e.g., audio and USB in addition to the GPU). Often many of those blocks or endpoints share common resources at the hardware level. It's relatively easy to implement something like this for something simple like a USB controller or a NIC, but a GPU is a lot more complex. That is part of the reason why we have several reset methods in the driver. In addition to complexity, there are also reasons why you may not want to reset the entire GPU. More limited resets are useful in a lot of cases. So in AMD hardware, depending on the ASIC, we have:
        1. Mode1 reset. This is the closest thing to what an FLR would be if we supported it. It resets just about the entire GPU, however, there are some hardware bugs on some chips that hit you in certain cases.
        2. Mode2 reset. This is lighter weight than mode1. It resets everything on the GPU except the memory controller. This is the only reset available on APUs because the memory controller is shared with the CPU, so you can't reset that without taking down the CPU with it. On dGPUs this is useful as it doesn't reset memory so you don't lose VRAM; however, if the memory controller is hung, it doesn't help.
        3. Per engine resets. Most blocks on the GPU support their own soft resets. This is convenient for resetting just a particular block, but is also tricky to get working reliably because hangs often involve multiple blocks and at some point, you just end up having to reset everything.
        4. BACO (Bus Active Chip Off). This is actually a power feature for idle power savings. It powers down just about the entire chip. We use this for runtime pm on our dGPUs to save power when they are idle. Conveniently, powering down most of the chip is basically a reset of sorts so we have this as an option as well.

        Depending on the asic and the nature of the hang, the driver uses all of these methods.
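
To make the trade-offs in the list above concrete, here is a small Python sketch that picks candidate reset methods, lightest first, from three facts about the hang. It is an illustration built only from the constraints stated in this post, not the amdgpu driver's actual selection logic. The function name, its parameters, and the ordering of mode1 before BACO are all assumptions.

```python
from enum import Enum


class Reset(Enum):
    PER_ENGINE = "per-engine soft reset"
    MODE2 = "mode2"
    MODE1 = "mode1"
    BACO = "BACO"


def candidate_resets(is_apu, hung_blocks, memory_controller_hung):
    """Return reset methods worth trying, lightest first (illustrative only).

    is_apu: the memory controller is shared with the CPU.
    hung_blocks: set of names of engines believed to be hung.
    memory_controller_hung: mode2 and per-engine resets will not help.
    """
    if is_apu:
        # Taking the post literally: mode2 is the only reset available on APUs.
        return [Reset.MODE2]

    candidates = []
    if len(hung_blocks) == 1 and not memory_controller_hung:
        # A single hung engine may recover with its own soft reset.
        candidates.append(Reset.PER_ENGINE)
    if not memory_controller_hung:
        # Mode2 leaves the memory controller alone, so VRAM contents survive.
        candidates.append(Reset.MODE2)
    # Heavier options reset or power down just about the entire chip.
    candidates.extend([Reset.MODE1, Reset.BACO])
    return candidates
```

For example, a dGPU with only one hung engine gets all four candidates in order, while a hung memory controller skips straight to the whole-chip options.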



        • #34
          Originally posted by mdedetrich View Post

          Right, but the point is that it's the most supported method that is part of the actual PCIe spec. I mean since you work with AMD you would know more than me, but all I see is AMD implementing all these vendor specific APIs/non published workarounds for how to reset the GPUs when the far simplest solution is for AMD to implement FLR. I understand that FLR is not the only way of resetting the device, but if you are going to implement it why do it in a completely non-standard way? This seems like AMD shooting themselves completely in the foot here.
          Simple perhaps from the perspective of a hypervisor that wants a vendor neutral method to reset a device, but not for a driver for complex hardware. FLR is not really that useful from a GPU driver perspective. There's not much flexibility. It also requires access to PCI config space, which is not even available in most guest environments. From the GPU driver's perspective, FLR is a nice optional feature for a different use case. I agree that FLR would be nice for the virtualization use case, but the hang recovery use cases are just as valid. It's not like PCI devices did not support reset before FLR came along.

          Originally posted by mdedetrich View Post
          If you want to implement a vendor neutral way to reset a graphics card, FLR is the way to do it. It being optional is completely beside the point here. I mean AMD spent effort implementing this "optional" feature in the first place in their own way, so clearly the fact that it's optional is not the relevant point here (in fact it would be more relevant if AMD didn't implement any kind of resetting feature at all).
          I agree that FLR is a good solution for a vendor neutral reset mechanism. AMD didn't spend effort implementing an optional feature in our own way. The goal was not to implement FLR in a vendor specific way. The goal was a flexible, reliable way to reset the GPU from the driver in a number of complex scenarios.
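
Since FLR is an optional capability advertised in PCI config space, whether a given function supports it can be checked from a raw config space dump. The Python sketch below walks the capability list to the PCI Express capability (ID 0x10) and tests bit 28 of its Device Capabilities register, which is the Function Level Reset Capability bit. The function names are hypothetical. On Linux the bytes could come from `/sys/bus/pci/devices/<BDF>/config`, but unprivileged reads of that file are normally truncated to the first 64 bytes, so treat that source as an assumption about your setup.

```python
import struct

PCI_STATUS = 0x06               # 16-bit Status register
PCI_STATUS_CAP_LIST = 0x10      # bit 4: a capability list is present
PCI_CAPABILITY_POINTER = 0x34   # offset of the first capability
PCI_CAP_ID_EXP = 0x10           # PCI Express capability ID
PCI_EXP_DEVCAP = 0x04           # Device Capabilities, relative to the capability
PCI_EXP_DEVCAP_FLR = 1 << 28    # Function Level Reset Capability


def find_capability(config, cap_id):
    """Return the offset of the capability with the given ID, or None."""
    if len(config) < 0x40:
        return None
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = config[PCI_CAPABILITY_POINTER] & 0xFC
    seen = set()
    # Follow the linked list, guarding against loops and truncated dumps.
    while pos and pos not in seen and pos + 1 < len(config):
        seen.add(pos)
        if config[pos] == cap_id:
            return pos
        pos = config[pos + 1] & 0xFC
    return None


def supports_flr(config):
    """True if the function advertises FLR in its PCIe Device Capabilities."""
    pos = find_capability(config, PCI_CAP_ID_EXP)
    if pos is None or pos + PCI_EXP_DEVCAP + 4 > len(config):
        return False
    (devcap,) = struct.unpack_from("<I", config, pos + PCI_EXP_DEVCAP)
    return bool(devcap & PCI_EXP_DEVCAP_FLR)
```

This only reports what the device advertises; it says nothing about whether an FLR actually recovers a hung GPU.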



          • #35
            Originally posted by cb88 View Post
            Neither can javascript...WUT?
            because javascript also merely coexists. what i'm after is called compatibility



            • #36
              Originally posted by pal666 View Post
              because javascript also merely coexists. what i'm after is called compatibility
              Then you will never find it. There is only good enough, never perfect... Rust is good enough by any standard, and it is not a hack of a scripting language, it's a systems programming language...



              • #37
                Originally posted by microcode View Post
                Cool. Weird name, but I guess this is about accessing memory directly from an L3 or equivalent cache, so you don't thrash your L1 (or equivalent). If the throughput is sufficient for the specific operation (essentially memory-to-memory or close), the latency will be hidden anyway.
                Yeah, probably what the leakers have been calling Infinity Cache.

