Radeon Linux Driver Seeing "MALL" Feature For Big Navi


  • #21
    Originally posted by pal666 View Post
    that's exactly my point. rust can't include c++ headers
    Neither can javascript...WUT?

    Comment


    • #22
      Originally posted by mdedetrich View Post

      You do realize that FLR (Function Level Reset) is an optional part of the PCIe standard, i.e. it's nothing specific to NVidia.

      AMD is the one that is doing something non standard here/specific to their architecture here, not NVidia.
      *optional* It's not non-standard to not support optional features, otherwise they would not be optional.

      Comment


      • #23
        Originally posted by agd5f View Post

        *optional* It's not non-standard to not support optional features, otherwise they would not be optional.
        Wrong, optional and standard mean completely different things (in the context of software engineering/programming).

        Optional: You can implement this feature (if it makes sense to do so) or not.
        Standard: Following some spec/standard

        Optional feature in a standard/spec: You can choose to implement this feature. If you do so it needs to follow the specification (which in this case is PCIe specification with FLR).

        All I see is an attempt to weasel out of responsibility. The spec covers every PCIe device: if you want to add a way for the device to be reset, then that is what FLR is for. This is not only for graphics cards; NVMe drives can also implement FLR (can you imagine how ridiculous it would be if different NVMe drive manufacturers implemented their own versions of FLR?).

        And in the context of virtualization/pass through, virtualization software expects FLR to exist if for some reason a linked hardware device needs to be reset. KVM is currently filled with exceptions/hacks for AMD graphics cards (which don't even work all of the time) because they can't even implement the PCIe spec correctly.

        And yes, FLR would indeed not be required if AMD graphics cards never crashed, but only gullible fools believe that.

        Comment


        • #24
          Originally posted by microcode View Post
          Cool. Weird name, but I guess this is about accessing memory directly from an L3 or equivalent cache, so you don't thrash your L1 (or equivalent). If the throughput is sufficient for the specific operation (essentially memory-to-memory or close), the latency will be hidden anyway.
          I wondered whether this is anything to do with a rumoured 128MB cache given there's a hardcoded `128 * 1024 * 1024` value next to a TODO about a hardcoded value in one of the patches.
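For scale, the hardcoded expression works out to exactly 128 MiB. A quick check, using Python purely for the arithmetic (the variable name is made up here and does not come from the patches):

```python
# Hypothetical name; only the expression 128 * 1024 * 1024 is from the post.
mall_size_bytes = 128 * 1024 * 1024

print(mall_size_bytes)            # 134217728
print(mall_size_bytes // 2**20)   # 128 (MiB)
```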

          Comment


          • #25
            Originally posted by mdedetrich View Post
            Optional feature in a standard/spec: You can choose to implement this feature. If you do so it needs to follow the specification (which in this case is PCIe specification with FLR).

            All I see is an attempt to weasel out of responsibility. The spec covers every PCIe device: if you want to add a way for the device to be reset, then that is what FLR is for. This is not only for graphics cards; NVMe drives can also implement FLR (can you imagine how ridiculous it would be if different NVMe drive manufacturers implemented their own versions of FLR?).
            With respect, I believe you may be missing the point. The PCIE spec covers a number of different reset options, some of which are mandatory and others which are not.

            Nobody is saying that if we implemented FLR we would do it in a non-standard form - agd5f is saying that FLR is not the only way to reset even within the PCIE spec.
            Test signature

            Comment


            • #26
              Originally posted by mdedetrich View Post

              Wrong, optional and standard mean completely different things (in the context of software engineering/programming).

              Optional: You can implement this feature (if it makes sense to do so) or not.
              Standard: Following some spec/standard

              Optional feature in a standard/spec: You can choose to implement this feature. If you do so it needs to follow the specification (which in this case is PCIe specification with FLR).

              All I see is an attempt to weasel out of responsibility. The spec covers every PCIe device: if you want to add a way for the device to be reset, then that is what FLR is for. This is not only for graphics cards; NVMe drives can also implement FLR (can you imagine how ridiculous it would be if different NVMe drive manufacturers implemented their own versions of FLR?).

              And in the context of virtualization/pass through, virtualization software expects FLR to exist if for some reason a linked hardware device needs to be reset. KVM is currently filled with exceptions/hacks for AMD graphics cards (which don't even work all of the time) because they can't even implement the PCIe spec correctly.

              And yes, FLR would indeed not be required if AMD graphics cards never crashed, but only gullible fools believe that.
              FLR is an optional aspect of the PCIe spec. Any device can implement it, but it doesn't have to. I realize it's a convenient feature for virtualization because it is a device independent method to reset a device, but it's certainly not required for a functional device. It's possible for a device to have its own device specific reset mechanism and that is indeed what AMD GPUs have and use. FLR is not the only way to reset a device. Prior to FLR existing, devices were able to reset themselves. There is no such thing as a vendor specific FLR; either a device supports FLR or it doesn't.

              All of these hacks you talk about in KVM are attempts to work around the fact that some AMD GPUs do not support FLR. Unfortunately, that is what you need to do to reset a device if it does not support a device independent reset mechanism like FLR. Unfortunately, the reset mechanisms are complex and really hard to get working correctly without context for the current hardware state, which is why they are a challenge to make work outside of the GPU driver.

              Moreover, the AMD Linux GPU driver is open source and the developers work on the public mailing lists and many are available on IRC. If you have questions you can ask, but frankly I don't think we've ever had any serious queries about how to implement or debug it on any of the public mailing lists or IRC.
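The "either a device supports FLR or it doesn't" point comes down to one capability bit that software can read. Below is a minimal sketch, not taken from this thread, that walks a PCI configuration-space dump, finds the PCI Express capability (ID 0x10), and tests the Function Level Reset Capability bit (bit 28 of Device Capabilities). The function names and the synthetic `make_config` helper are invented for illustration.

```python
import struct
from typing import Optional

PCI_STATUS = 0x06              # Status register offset
PCI_STATUS_CAP_LIST = 0x10     # Status bit 4: capability list is present
PCI_CAPABILITY_LIST = 0x34     # Pointer to the first capability
PCI_CAP_ID_EXP = 0x10          # PCI Express capability ID
PCI_EXP_DEVCAP = 0x04          # Device Capabilities, relative to the capability
PCI_EXP_DEVCAP_FLR = 1 << 28   # Function Level Reset Capability bit


def find_capability(cfg: bytes, cap_id: int) -> Optional[int]:
    """Walk the capability linked list; return the offset of cap_id or None."""
    status = struct.unpack_from("<H", cfg, PCI_STATUS)[0]
    if not status & PCI_STATUS_CAP_LIST:
        return None
    ptr = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    # Stop on a null/invalid pointer, a loop, or a truncated dump.
    while ptr >= 0x40 and ptr not in seen and ptr + 2 <= len(cfg):
        seen.add(ptr)
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC
    return None


def advertises_flr(cfg: bytes) -> bool:
    """True if the function's Device Capabilities register sets the FLR bit."""
    off = find_capability(cfg, PCI_CAP_ID_EXP)
    if off is None or off + PCI_EXP_DEVCAP + 4 > len(cfg):
        return False
    devcap = struct.unpack_from("<I", cfg, off + PCI_EXP_DEVCAP)[0]
    return bool(devcap & PCI_EXP_DEVCAP_FLR)


def make_config(flr: bool) -> bytes:
    """Build a synthetic 256-byte config space (illustration only)."""
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = 0x50
    cfg[0x50], cfg[0x51] = 0x01, 0x60            # some other capability, then next
    cfg[0x60], cfg[0x61] = PCI_CAP_ID_EXP, 0x00  # PCI Express capability, end of list
    struct.pack_into("<I", cfg, 0x60 + PCI_EXP_DEVCAP,
                     PCI_EXP_DEVCAP_FLR if flr else 0)
    return bytes(cfg)


print(advertises_flr(make_config(True)))   # True
print(advertises_flr(make_config(False)))  # False
```

On a Linux host the same check could be pointed at the bytes of a device's sysfs `config` file. Reading past the first 64 bytes typically needs root, so an unprivileged read would simply report no FLR here.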

              Comment


              • #27
                Originally posted by bridgman View Post

                With respect, I believe you may be missing the point. The PCIE spec covers a number of different reset options, some of which are mandatory and others which are not.

                Nobody is saying that if we implemented FLR we would do it in a non-standard form - agd5f is saying that FLR is not the only way to reset even within the PCIE spec.
                Right, but the point is that it's the most supported method that is part of the actual PCIe spec. I mean since you work with AMD you would know more than me, but all I see is AMD implementing all these vendor specific APIs/non published workarounds for how to reset the GPUs when by far the simplest solution is for AMD to implement FLR. I understand that FLR is not the only way of resetting the device, but if you are going to implement it why do it in a completely non-standard way? This seems like AMD shooting themselves completely in the foot here.

                If you want to implement a vendor neutral way to reset a graphics card, FLR is the way to do it. It being optional is completely beside the point here. I mean AMD spent effort implementing this "optional" feature in the first place in their own way so clearly the fact that it's optional is not the relevant point here (in fact it would be more relevant if AMD didn't implement any kind of resetting feature at all).
                mdedetrich
                Senior Member
                Last edited by mdedetrich; 22 October 2020, 04:56 PM.

                Comment


                • #28
                  Originally posted by mdedetrich View Post
                  You do realize that FLR (Function Level Reset) is an optional part of the PCIe standard, i.e. it's nothing specific to NVidia.
                  There is a horrible catch with the PCIe patent grant. Patents covering how to implement optional parts of the PCIe standard are not included in the grant, so companies may skip optional features because some other party holds the patent. Mandatory features of the PCIe standard are covered. So if FLR were a mandatory feature of the PCIe standard, you would have every right to be annoyed at AMD for not implementing it.

                  Part of your problem here is that you do not understand what optional means in the PCIe standard to those implementing devices. A feature is optional for 2 possible reasons:
                  1: the PCIe standard developers did not consider it something all devices need.
                  2: some party has a patent on that feature, and that patent can be selective as well, affecting a graphics card but not a sound card.

                  If the problem is number 2 you cannot expect vendors to implement it. Instead you need to expect equivalent functionality.

                  Basically stop being a brat mdedetrich and have realistic expectations.

                  I use AMD GPUs with virtualisation all the time. I have a small image that does a BACO reset on cards. Yes, this is more complex than FLR, but I have to do a BACO reset instead of FLR because the card lands in a non-functional mode. So the workflow is different with AMD GPUs than with Nvidia ones. AMD CPUs and motherboards are used a lot in virtualisation thanks to EPYC, and yes, those also have various places lacking functional FLR, so you have to use a vendor reset. Again, you accept that the platform is AMD, change the setup to suit, and get a very functional result.

                  By the way, various Xeon motherboard chipsets from Intel also have parts on them without working FLR support. When you get into large scale virtualisation you are always doing "non standard here/specific to their architecture" alternatives to FLR somewhere.

                  It's better for a vendor not to provide FLR and expect an architecture-specific solution than to implement FLR wrong. There was a batch of Nvidia mining cards that got FLR wrong: issue an FLR and the card would draw 10 times its rated power and fry instead of resetting (yes, a true halt and catch fire issue). Reset functionality is hard to implement, and if a party gets it wrong the level of damage on your hands can be monstrous. So the most important thing to me is a reset method for the device that works, and AMD GPUs do have that in BACO reset. How it is done is secondary.

                  FLR is the icing on the cake to me, not the cake. With the different devices out there that have defective PCIe FLR with halt and catch fire issues, randomly sending devices a PCIe FLR is playing a game of chance that can completely fry server motherboards when a card goes halt and catch fire.

                  mdedetrich, fun, right, that FLR is not even safe to use on random Nvidia cards because you might roll unlucky. The reality that FLR can be defective in different batches of cards means you need other reset routes anyhow. This is why I am not that annoyed with amdgpu's good collection of different card reset options. Nvidia, on the other hand, not so much.

                  By the way, VMs crashing out is not a common event, at least not common enough for having to run a special clean-up to be a major issue, at least to me, who accepts the reality of the hardware.

                  Comment


                  • #29
                    Originally posted by oiaohm View Post

                    There is a horrible catch with the PCIe patent grant. Patents covering how to implement optional parts of the PCIe standard are not included in the grant, so companies may skip optional features because some other party holds the patent. Mandatory features of the PCIe standard are covered. So if FLR were a mandatory feature of the PCIe standard, you would have every right to be annoyed at AMD for not implementing it.
                    Does NVidia hold a patent for the FLR resetting feature?


                    Originally posted by oiaohm View Post
                    Part of your problem here is that you do not understand what optional means in the PCIe standard to those implementing devices. A feature is optional for 2 possible reasons:
                    1: the PCIe standard developers did not consider it something all devices need.
                    2: some party has a patent on that feature, and that patent can be selective as well, affecting a graphics card but not a sound card.
                    I seriously doubt that PCIe made features optional for patent reasons. This seems to be a problem with the US's broken patent system, and trying to draw an equivalence here with the reasons for a feature being optional is a false one.

                    Originally posted by oiaohm View Post
                    Basically stop being a brat mdedetrich and have realistic expectations.

                    I use AMD GPUs with virtualisation all the time. I have a small image that does a BACO reset on cards. Yes, this is more complex than FLR, but I have to do a BACO reset instead of FLR because the card lands in a non-functional mode. So the workflow is different with AMD GPUs than with Nvidia ones. AMD CPUs and motherboards are used a lot in virtualisation thanks to EPYC, and yes, those also have various places lacking functional FLR, so you have to use a vendor reset. Again, you accept that the platform is AMD, change the setup to suit, and get a very functional result.
                    Look, I am not sure why you are calling me a brat, but when you buy a piece of software/hardware you expect it to work to a standard degree, and I have barely scratched the surface when it comes to the virtualization issues. As far as I see it, unless I see what these supposed patent claims are, AMD is making it harder for themselves for no logical reason.

                    Customers have been calling this out as a legitimate issue on AMD's side for years (these are some of the most voted threads on AMD's reddit), and I am damn sure it's probably the most common complaint given to AMD in server scenarios.

                    Originally posted by oiaohm View Post
                    By the way, various Xeon motherboard chipsets from Intel also have parts on them without working FLR support. When you get into large scale virtualisation you are always doing "non standard here/specific to their architecture" alternatives to FLR somewhere.
                    I don't know what issues you are speaking of specifically here, but generally speaking Intel has been much more reliable, and if there are problems they are usually fixed with BIOS updates. Of course Intel has other problems (i.e. ridiculously overpriced and behind AMD when it comes to core count).

                    Originally posted by oiaohm View Post
                    It's better for a vendor not to provide FLR and expect an architecture-specific solution than to implement FLR wrong. There was a batch of Nvidia mining cards that got FLR wrong: issue an FLR and the card would draw 10 times its rated power and fry instead of resetting (yes, a true halt and catch fire issue). Reset functionality is hard to implement, and if a party gets it wrong the level of damage on your hands can be monstrous. So the most important thing to me is a reset method for the device that works, and AMD GPUs do have that in BACO reset. How it is done is secondary.

                    FLR is the icing on the cake to me, not the cake. With the different devices out there that have defective PCIe FLR with halt and catch fire issues, randomly sending devices a PCIe FLR is playing a game of chance that can completely fry server motherboards when a card goes halt and catch fire.

                    mdedetrich, fun, right, that FLR is not even safe to use on random Nvidia cards because you might roll unlucky. The reality that FLR can be defective in different batches of cards means you need other reset routes anyhow. This is why I am not that annoyed with amdgpu's good collection of different card reset options. Nvidia, on the other hand, not so much.
                    All I am hearing is an attempt to come up with some weird justification for not implementing something. Sure, there were some NVidia cards that had broken FLR (both companies have cards that are bad apples for various reasons), but I don't see NVidia completely removing the FLR feature because one series of cards implemented it incorrectly, which seems to be what you are advocating/justifying.

                    Originally posted by oiaohm View Post
                    By the way, VMs crashing out is not a common event, at least not common enough for having to run a special clean-up to be a major issue, at least to me, who accepts the reality of the hardware.
                    Yeah because having to physically reboot a server machine because the GPU gets put into an invalid state and cannot be reset properly is loads of fun.... I am sorry but unless you can guarantee that the cards never crash this to me is an excuse.
                    mdedetrich
                    Senior Member
                    Last edited by mdedetrich; 22 October 2020, 05:48 PM.

                    Comment


                    • #30
                      Originally posted by mdedetrich View Post

                      Right, but the point is that it's the most supported method that is part of the actual PCIe spec. I mean since you work with AMD you would know more than me, but all I see is AMD implementing all these vendor specific APIs/non published workarounds for how to reset the GPUs when by far the simplest solution is for AMD to implement FLR. I understand that FLR is not the only way of resetting the device, but if you are going to implement it why do it in a completely non-standard way? This seems like AMD shooting themselves completely in the foot here.

                      If you want to implement a vendor neutral way to reset a graphics card, FLR is the way to do it. It being optional is completely beside the point here. I mean AMD spent effort implementing this "optional" feature in the first place in their own way so clearly the fact that it's optional is not the relevant point here (in fact it would be more relevant if AMD didn't implement any kind of resetting feature at all).
                      It's about understanding the difference.

                      First you need to look at the Nvidia mining card batch that went stupid. The cause was that the RAM used on the video card needed a longer power down to clear its state. Because the state was not properly cleared, the RAM modules were in effect fuzz testing the GPU, and that found something horribly wrong, resulting in fried motherboards from doing FLR.

                      BACO reset is CPU driven: it commands the power management parts of the card to power the GPU and other parts off and then power them back on. Because it is CPU driven, you can adjust the time between power down and power back on in the driver. FLR is fire and forget.

                      Yes, FLR is kind of vendor neutral, but that is the problem. AMD and Nvidia design the GPU and then farm the production out, and the resulting cards are a mix of vendors' parts. Designing an FLR that will work with every possible combination of chips a card vendor may use is not simple. A BACO-style solution, if found to be wrong, is software patchable.

                      There have been a few different ways PCIe FLR has been screwed up by different vendors. With Nvidia's mining cards it was the RAM being slightly outside specification. With another vendor, the clock on the controller chip doing the FLR was fast, so the reset was attempted too quickly and FLR became a hard lockup of the card; yes, this was a 100G network card hard locking. Then there was the Intel one, on various motherboards, where an FLR of the chipset results in all fans stopping while the FLR is processed, not exactly good if the CPU is at 100 percent load. The list of screwed-up FLR implementations goes on.

                      A lot of these end up with the devices blacklisted from using FLR and using vendor resets like BACO reset, where the CPU is in control.
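The blacklist-and-fall-back pattern described above could look roughly like this. Everything in the sketch is hypothetical: the `Device` fields, the `NO_FLR` denylist with its made-up IDs, and the ordering of the fallbacks are assumptions for illustration, not how any particular kernel or hypervisor actually decides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    vendor: int              # PCI vendor ID
    device: int              # PCI device ID
    advertises_flr: bool     # FLR capability bit is set
    has_vendor_reset: bool   # a device-specific reset (e.g. BACO-style) exists


# Hypothetical denylist of (vendor, device) pairs whose FLR misbehaves.
# The IDs are placeholders, not real hardware.
NO_FLR = {(0x1234, 0x0001)}


def pick_reset(dev: Device) -> str:
    """Choose a reset method, skipping FLR on denylisted devices."""
    flr_usable = dev.advertises_flr and (dev.vendor, dev.device) not in NO_FLR
    if flr_usable:
        return "flr"
    if dev.has_vendor_reset:
        return "vendor"   # CPU-driven, so timing stays under driver control
    return "bus"          # last resort: reset everything behind the bridge


print(pick_reset(Device(0x1234, 0x0001, True, True)))   # vendor
print(pick_reset(Device(0x1234, 0x0002, True, True)))   # flr
```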

                      Also there is a problem with the PCIe standard and FLR.
                      https://alexforencich.com/wiki/en/pcie/hot-reset-linux
                      "A 'function-level reset' (FLR) is a reset that affects only a single function of a PCI express device. "

                      Please note the statement that only a single function of the PCI device should be reset. Nvidia GPU cards do not in fact follow the PCIe standard when you call FLR either, as it resets more than 1 function. Many cards implement FLR in a way that is not to PCIe specification, and this does lead to some quite surprising outcomes.

                      The reality is that when AMD implements something to PCIe specification, it is to PCIe specification, not some hacked extension of what the specification says. So the PCIe specification of FLR could also be another reason why they have not implemented it. There is a high chance that a jammed-up card does not have a single function screwed up but multiple functions screwed up together, which is what stops the card working. PCIe and vendor forms of full card reset are a safer way of getting back to a functional state than FLR implemented as the PCIe specification says.

                      Yes, people have got used to an FLR on Nvidia cards that is not to standard. mdedetrich, this is the problem with saying do the same as Nvidia: not everything they do is to standard or the right way to do it. Even if AMD ends up implementing FLR, don't expect it to be the same as Nvidia's, because AMD will do it to standard. Of course, the result of that is that FLR will not be a magic cure-all for a locked-up card. The PCIe standard defines other resets, as part of the mandatory part of the standard, that are meant to be the cure-all for a locked-up card.

                      The PCIe hot reset is part of the mandatory part of the standard and is meant to bring a card back from a jammed state. This does not always work due to hardware variations, which is the reason for the BACO reset with AMD. AMD started off with the idea that hot reset was the valid solution for their GPUs, and when this proved not dependable enough they implemented a vendor solution.

                      By the way, the issues that cause PCIe hot reset not to work with AMD cards have caused FLR not to work on various Nvidia-based cards: the fun of hardware variation. There are various AMD cards where you never need a BACO reset and can just use PCIe hot reset instead.

                      It's really annoying that the PCIe standard does not define how to do a cold reset or a warm reset, even as an optional part of the standard. It's also annoying that the Linux kernel sysfs does not define an entry to do a PCIe hot reset.
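For reference, a hot reset of everything behind a bridge is normally triggered by pulsing the Secondary Bus Reset bit (bit 6) in the upstream bridge's Bridge Control register at offset 0x3E. The sketch below shows that read-modify-write sequence against any seekable binary file-like object. The function names and the two delay values are placeholder assumptions, not figures from this thread or the spec, and the demo runs on an in-memory buffer, not real hardware.

```python
import io
import struct
import time

PCI_BRIDGE_CONTROL = 0x3E          # Bridge Control register (type 1 header)
PCI_BRIDGE_CTL_BUS_RESET = 0x0040  # bit 6: Secondary Bus Reset


def read16(cfg, offset):
    """Read a little-endian 16-bit register from a config-space file object."""
    cfg.seek(offset)
    return struct.unpack("<H", cfg.read(2))[0]


def write16(cfg, offset, value):
    """Write a little-endian 16-bit register to a config-space file object."""
    cfg.seek(offset)
    cfg.write(struct.pack("<H", value))
    cfg.flush()


def secondary_bus_reset(cfg, assert_s=0.01, settle_s=0.5, sleep=time.sleep):
    """Pulse Secondary Bus Reset on a bridge; the delays are assumed values."""
    original = read16(cfg, PCI_BRIDGE_CONTROL)
    write16(cfg, PCI_BRIDGE_CONTROL, original | PCI_BRIDGE_CTL_BUS_RESET)
    sleep(assert_s)    # hold the reset asserted
    write16(cfg, PCI_BRIDGE_CONTROL,
            original & ~PCI_BRIDGE_CTL_BUS_RESET & 0xFFFF)
    sleep(settle_s)    # give downstream devices time to come back
    return original


# Demo on an in-memory stand-in for a bridge's config space.
fake_bridge = io.BytesIO(bytearray(64))
write16(fake_bridge, PCI_BRIDGE_CONTROL, 0x0003)
secondary_bus_reset(fake_bridge, assert_s=0, settle_s=0)
print(hex(read16(fake_bridge, PCI_BRIDGE_CONTROL)))  # 0x3
```

On a real system the file object would be the bridge's sysfs `config` file opened read-write as root. The device would normally be detached from its driver first and the bus rescanned afterwards. None of that is shown here.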

                      Comment
