Radeon Linux Driver Seeing "MALL" Feature For Big Navi

    andrei_me
    Senior Member

  • andrei_me
    replied
    Originally posted by microcode View Post
    Cool. Weird name, but I guess this is about accessing memory directly from an L3 or equivalent cache, so you don't thrash your L1 (or equivalent). If the throughput is sufficient for the specific operation (essentially memory-to-memory or close), the latency will be hidden anyway.
    Yeah, probably what the leakers have been calling Infinity Cache.


  • cb88
    Senior Member

  • cb88
    replied
    Originally posted by pal666 View Post
    because javascript also merely coexists. what i'm after is called compatibility
    Then you will never find it. There is only good enough, never perfect... Rust is good enough by any standard, and it is not a hack of a scripting language; it's a systems programming language...


  • pal666
    Senior Member

  • pal666
    replied
    Originally posted by cb88 View Post
    Neither can javascript...WUT?
    because javascript also merely coexists. what i'm after is called compatibility


  • agd5f
    X.Org ATI Driver Developer

  • agd5f
    replied
    Originally posted by mdedetrich View Post

    Right, but the point is that it's the most supported method that is part of the actual PCIe spec. I mean, since you work with AMD you would know more than me, but all I see is AMD implementing all these vendor-specific APIs and unpublished workarounds for resetting GPUs when by far the simplest solution is for AMD to implement FLR. I understand that FLR is not the only way of resetting the device, but if you are going to implement it, why do it in a completely non-standard way? This seems like AMD shooting themselves completely in the foot here.
    Simple perhaps from the perspective of a hypervisor that wants a vendor neutral method to reset a device, but not for a driver for complex hardware. FLR is not really that useful from a GPU driver perspective. There's not much flexibility. It also requires access to PCI config space, which is not even available in most guest environments. From the GPU driver's perspective, FLR is a nice optional feature for a different use case. I agree that FLR would be nice for the virtualization use case, but the hang recovery use cases are just as valid. It's not like PCI devices did not support reset before FLR came along.
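To illustrate the config space dependency mentioned above: whether a function advertises FLR is itself recorded in PCI config space, as bit 28 of the Device Capabilities register inside the PCI Express capability. The following is a minimal Python sketch that checks for it, assuming you already have the raw config space bytes (on a Linux host these could come from the device's sysfs `config` file, which normally needs root to read past the first 64 bytes). The function names are made up for this example and are not part of any driver.

```python
import struct
from typing import Optional

PCI_STATUS = 0x06             # Status register offset
PCI_STATUS_CAP_LIST = 0x10    # Status bit 4: capability list is present
PCI_CAPABILITY_LIST = 0x34    # Pointer to the first capability
PCI_CAP_ID_EXP = 0x10         # Capability ID of the PCI Express capability
PCI_EXP_DEVCAP = 0x04         # Device Capabilities, relative to the capability
PCI_EXP_DEVCAP_FLR = 1 << 28  # Function Level Reset capable


def find_capability(cfg: bytes, cap_id: int) -> Optional[int]:
    """Walk the capability list and return the offset of cap_id, or None."""
    if len(cfg) < 64:
        return None
    status = struct.unpack_from("<H", cfg, PCI_STATUS)[0]
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a looping (malformed) list
    while pos >= 0x40 and pos + 2 <= len(cfg) and pos not in seen:
        seen.add(pos)
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return None


def advertises_flr(cfg: bytes) -> bool:
    """True if the function's Device Capabilities register sets the FLR bit."""
    pos = find_capability(cfg, PCI_CAP_ID_EXP)
    if pos is None or pos + PCI_EXP_DEVCAP + 4 > len(cfg):
        return False
    devcap = struct.unpack_from("<I", cfg, pos + PCI_EXP_DEVCAP)[0]
    return bool(devcap & PCI_EXP_DEVCAP_FLR)
```

A guest without config space access cannot even run this check, which is the limitation the reply points at.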

    Originally posted by mdedetrich View Post
    If you want to implement a vendor neutral way to reset a graphics card, FLR is the way to do it. It being optional is completely beside the point here. I mean, AMD spent effort implementing this "optional" feature in the first place in their own way, so clearly the fact that it's optional is not the relevant point here (in fact it would be more relevant if AMD didn't implement any kind of resetting feature at all).
    I agree that FLR is a good solution for a vendor neutral reset mechanism. AMD didn't spend effort implementing an optional feature in our own way. The goal was not to implement FLR in a vendor specific way. The goal was a flexible, reliable way to reset the GPU from the driver in a number of complex scenarios.


  • agd5f
    X.Org ATI Driver Developer

  • agd5f
    replied
    Resetting something as complex as a GPU is hard. There are lots of blocks on a GPU and lots of interconnects and, in a lot of cases, multiple end points (e.g., audio and USB in addition to the GPU). Often many of those blocks or endpoints share common resources at the hardware level. It's relatively easy to implement something like this for something simple like a USB controller or a NIC, but a GPU is a lot more complex. That is part of the reason why we have several reset methods in the driver. In addition to complexity, there are also reasons why you may not want to reset the entire GPU. More limited resets are useful in a lot of cases. So in AMD hardware, depending on the asic we have:
    1. Mode1 reset. This is the closest thing to what an FLR would be if we supported it. It resets just about the entire GPU, however, there are some hardware bugs on some chips that hit you in certain cases.
    2. Mode2 reset. This is lighter weight than mode1. It resets everything on the GPU except the memory controller. This is the only reset available on APUs because the memory controller is shared with the CPU so you can't reset that without taking down the CPU with it. On dGPUs this is useful as it doesn't reset memory so you don't lose vram, however, if the memory controller is hung, it doesn't help.
    3. Per engine resets. Most blocks on the GPU support their own soft resets. This is convenient for resetting just a particular block, but is also tricky to get working reliably because hangs often involve multiple blocks and at some point, you just end up having to reset everything.
    4. BACO (Bus Active Chip Off). This is actually a power feature for idle power savings. It powers down just about the entire chip. We use this for runtime pm on our dGPUs to save power when they are idle. Conveniently, powering down most of the chip is basically a reset of sorts so we have this as an option as well.

    Depending on the asic and the nature of the hang, the driver uses all of these methods.
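The trade-offs in the list above can be sketched as a small selection function. This is only an illustration of the constraints described in the post (mode2 is the full reset on APUs, mode2 does not help when the memory controller is hung, per-engine resets suit a single hung block). The ordering from lightest to heaviest, the parameter names, and the `ResetMethod` type are assumptions made for this example, not the amdgpu driver's actual logic.

```python
from enum import Enum
from typing import List


class ResetMethod(Enum):
    PER_ENGINE = "per-engine soft reset"
    MODE2 = "mode2"
    MODE1 = "mode1"
    BACO = "BACO"


def candidate_resets(is_apu: bool,
                     memory_controller_hung: bool,
                     hung_blocks: int) -> List[ResetMethod]:
    """Return reset methods worth trying, lightest first (assumed ordering)."""
    candidates: List[ResetMethod] = []
    # A per-engine reset only makes sense when a single block is hung.
    if hung_blocks == 1:
        candidates.append(ResetMethod.PER_ENGINE)
    # Mode2 leaves the memory controller alone, so it cannot fix a hung one.
    if not memory_controller_hung:
        candidates.append(ResetMethod.MODE2)
    # Mode1 and BACO take down the memory controller, which an APU shares
    # with the CPU, so they are only offered for discrete GPUs here.
    if not is_apu:
        candidates.extend([ResetMethod.MODE1, ResetMethod.BACO])
    return candidates
```

For example, `candidate_resets(is_apu=True, memory_controller_hung=False, hung_blocks=3)` yields only mode2, while a dGPU with a hung memory controller skips straight to mode1 and BACO.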


  • bridgman
    AMD Linux

  • bridgman
    replied
    Originally posted by mdedetrich View Post
    Right, but the point is that it's the most supported method that is part of the actual PCIe spec. I mean, since you work with AMD you would know more than me, but all I see is AMD implementing all these vendor-specific APIs and unpublished workarounds for resetting GPUs when by far the simplest solution is for AMD to implement FLR.
    The vendor specific implementations are our attempt to add support to existing hardware that we can't go back and change. Without a time machine or a magic wand I think you would agree that is our only option.

    Originally posted by mdedetrich View Post
    I understand that FLR is not the only way of resetting the device, but if you are going to implement it, why do it in a completely non-standard way? This seems like AMD shooting themselves completely in the foot here.
    Again, I think the only person talking about implementing FLR in a non-standard way is you - we have not proposed or done that as far as I know. Apologies in advance if that is not correct.

    Originally posted by mdedetrich View Post
    If you want to implement a vendor neutral way to reset a graphics card, FLR is the way to do it. It being optional is completely beside the point here. I mean, AMD spent effort implementing this "optional" feature in the first place in their own way, so clearly the fact that it's optional is not the relevant point here (in fact it would be more relevant if AMD didn't implement any kind of resetting feature at all).
    My understanding was that hot reset via secondary bus reset was the closest thing to a standard way to reset a graphics card, while FLR was an optional mechanism for resetting a single function on a multi-function card... but I'm the furthest thing from an expert here.
    bridgman
    AMD Linux
    Last edited by bridgman; 22 October 2020, 07:48 PM.


  • oiaohm
    Senior Member

  • oiaohm
    replied
    Originally posted by mdedetrich View Post
    Does NVidia hold a patent for the FLR resetting feature?
    There are patents covering how to implement that feature in hardware. I did not say it is an Nvidia patent.

    Originally posted by mdedetrich View Post
    Customers have been calling this out as a legitimate issue on AMD's side for years (these are some of the most voted threads on AMD's reddit), and I am damn sure it's probably the most common complaint given to AMD in server scenarios.
    https://amdgpu-install.readthedocs.i...ll-bugrep.html

    To be correct, reddit is not where you report bugs. I guess you don't have an AMD Customer Engagement Representative. Those of us working with servers know where to report stuff. Brats report stuff to reddit, then complain about it not being fixed, and the reason it's not fixed is that there are no bug reports. Reddit is the marketing department only.

    Major customers with problems in the server space report to their AMD Customer Engagement Representative, who puts them straight in the developers' ears about what their problems are. Minor customers on Linux open a freedesktop.org Bugzilla entry, as directed by the documentation.

    https://www.amd.com/en/support/kb/faq/amdbrt
    On Windows you have a graphical tool to submit a bug report.


    Originally posted by mdedetrich View Post
    Yeah because having to physically reboot a server machine because the GPU gets put into an invalid state and cannot be reset properly is loads of fun.... I am sorry but unless you can guarantee that the cards never crash this to me is an excuse.
    For those who don't know: with Nvidia cards you also have to know how to do a PCI hot reset, for the times when FLR fails. Yes, PCI hot reset will fail on Nvidia cards as well, and then you have to physically reboot the machine. With AMD you start at PCIe hot reset; when that fails you drop to BACO reset, and if BACO reset fails then you have to physically reboot the machine. Be it AMD or Nvidia, you have at least one step before you have to physically reboot the machine to fix a GPU issue. With AMD you have 5 steps in total you can try before a physical reset. With Nvidia you only have two.
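The escalation described above (try the lightest reset, fall through to the next one, and only then reboot) can be sketched generically. This is a minimal illustration; `escalate` and the step names are hypothetical, and each callable is assumed to return True when the card came back.

```python
from typing import Callable, List, Optional, Tuple

ResetStep = Tuple[str, Callable[[], bool]]


def escalate(steps: List[ResetStep]) -> Optional[str]:
    """Try each reset step in order; return the name of the first success.

    Returns None when every step failed, which in the workflow described
    above is the point where a physical reboot is the only option left.
    """
    for name, attempt in steps:
        try:
            if attempt():
                return name
        except OSError:
            # Treat an I/O error from a step as "this step did not work".
            continue
    return None
```

For the AMD workflow in the post, the list would hold a PCIe hot reset step followed by a BACO reset step; for the Nvidia one, FLR followed by PCIe hot reset.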

    Please note that with modern OpenBMC systems a physical system reset does not require me to go to the machine; it can be performed remotely. One of the reasons for remote resetting through tools like OpenBMC is Nvidia cards in servers failing FLR and PCI hot reset, forcing a full server reset to bring them back. These problems are why I would love the PCIe standard extended to support a proper warm reset and cold reset of cards. Yes, I would love to have server boards where I could, by PCI standards, cut all power to a slot and then apply power again for misbehaving cards. It would be a lot friendlier if I could power cycle misbehaving cards instead of complete systems.

    Please note I am not saying that having to physically reset a machine is a good outcome. The reality is that when you have handled enough Nvidia cards in a render farm, you learn that having to reboot a machine because there is no other choice happens a lot more than you would like. Getting used to how you have to manage AMD GPUs is a learning curve at first: you start with PCIe hot reset, and the Linux kernel and most VM solutions don't provide a simple, straightforward interface to trigger it. Once you learn how to perform a PCIe hot reset, it has a positive effect on your management of Nvidia cards, in that you do fewer complete system resets when they screw up past what FLR will fix. Of course, with AMD you then need to learn to perform a BACO reset. After that, a full physical reset is needed in more cases with an Nvidia system than with an AMD one.

    Sorry to say, a lot of those complaining on reddit don't know what they are doing. That is partly shown by them complaining on reddit to start with instead of opening a proper bug report or talking to an AMD Customer Engagement Representative. The AMD Customer Engagement Representative will in fact provide you with management process guidelines to get the best out of AMD GPUs; that is where I learnt about PCIe hot reset in the first place.

    mdedetrich, you are also complaining here about it. Where is your open bug report about it? You don't have one, right? Before complaining about these problems you need to have spent the time filing bug reports.

    I find the AMD issues only annoying, not big enough for me to bother with a bug report. AMD GPUs are not causing me to reset systems any more often than systems containing Nvidia cards. Mind you, I did take the time to learn how to manage AMD GPUs correctly. If you have not learnt how to reset an AMD card's state then yes, you will be having problems, but those problems are caused by your lack of knowledge, which will be affecting your Nvidia card management as well.


  • oiaohm
    Senior Member

  • oiaohm
    replied
    Originally posted by mdedetrich View Post

    Right, but the point is that it's the most supported method that is part of the actual PCIe spec. I mean, since you work with AMD you would know more than me, but all I see is AMD implementing all these vendor-specific APIs and unpublished workarounds for resetting GPUs when by far the simplest solution is for AMD to implement FLR. I understand that FLR is not the only way of resetting the device, but if you are going to implement it, why do it in a completely non-standard way? This seems like AMD shooting themselves completely in the foot here.

    If you want to implement a vendor neutral way to reset a graphics card, FLR is the way to do it. It being optional is completely beside the point here. I mean, AMD spent effort implementing this "optional" feature in the first place in their own way, so clearly the fact that it's optional is not the relevant point here (in fact it would be more relevant if AMD didn't implement any kind of resetting feature at all).
    It's about understanding the difference.

    First you need to look at the Nvidia mining card batch that went stupid. The cause was that the RAM used on the video card needed a longer power down to clear its state. Because the state was not properly cleared, the RAM modules were in effect fuzz testing the GPU, which then hit something horribly wrong, resulting in fried motherboards from doing FLR.

    BACO reset is CPU driven: the CPU commands the power management parts of the card to power the GPU and other parts off and then power them back on. Being CPU driven, the time between power down and power on can be adjusted in the driver. FLR is fire and forget.

    Yes, FLR is kind of vendor neutral, but that is the problem. AMD and Nvidia design the GPU and then farm the production out, and the resulting cards are a mix of vendors' parts. Designing an FLR that will work with every possible combination of chips a card vendor may use is not simple. If a BACO-style solution is found to be wrong, it is software patchable.
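The "software patchable" point can be made concrete with a small sketch: because the CPU sequences the power-off and power-on, the off time is just a number the driver can override per board. Everything here is hypothetical (the quirk table, the board identifier, the default delay, and the `power_off`/`power_on` callables); it only illustrates the idea and does not reflect values from any real driver.

```python
import time
from typing import Callable, Dict

DEFAULT_OFF_TIME_S = 0.1  # assumed default, not taken from any real driver

# Hypothetical quirk table: boards whose memory needs longer to clear state.
OFF_TIME_QUIRKS_S: Dict[str, float] = {
    "example-board-slow-vram": 0.5,
}


def off_time_for(board_id: str) -> float:
    """Look up how long this board should stay powered down."""
    return OFF_TIME_QUIRKS_S.get(board_id, DEFAULT_OFF_TIME_S)


def power_cycle(power_off: Callable[[], None],
                power_on: Callable[[], None],
                board_id: str,
                sleep: Callable[[float], None] = time.sleep) -> float:
    """CPU-driven reset: power down, wait a board-specific time, power up."""
    off_time = off_time_for(board_id)
    power_off()
    sleep(off_time)
    power_on()
    return off_time
```

Fixing a board that needs a longer power-down then means adding one table entry in software, whereas a fire-and-forget hardware reset offers no such knob.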

    There have been a few different ways PCIe FLR has screwed up for different vendors. With Nvidia's mining cards it was the RAM being slightly outside specification. For another vendor, the clock on the controller chip doing the FLR was fast, so the reset was attempted too quickly and FLR became a hard lockup of the card; yes, this was a 100G network card hard locking. There was the Intel one, on different motherboards, where an FLR of the chipset resulted in all fans stopping while the FLR was processed, which is not exactly good if the CPU is at 100 percent load. The list goes on with different screw-ups of FLR implementations.

    A lot of these end up with the devices blacklisted from using FLR and using vendor resets like BACO reset, where the CPU is in control.

    Also there is a problem with the PCIe standard and FLR.
    https://alexforencich.com/wiki/en/pcie/hot-reset-linux
    "A 'function-level reset' (FLR) is a reset that affects only a single function of a PCI express device. "

    Please note the statement that only a single function of the PCI device should be reset. Nvidia GPU cards are in fact not to the PCIe standard when you call FLR either, as it resets more than one function. There are many cards that implement FLR that are not to PCIe specifications, and this does lead to some quite surprising outcomes.

    The reality is that when AMD implements something from the PCIe specifications, it is to the PCIe specifications, not some hacked extension of what the specification says. So the PCIe specification of FLR could also be another reason why they have not implemented it. There is a high chance that a jammed-up card does not have a single function that is screwed up, but multiple functions screwed up together, causing the card not to work. PCIe and vendor forms of full card reset are safer for getting back to a functional state than FLR implemented as the PCIe specification says.

    Yes, people have got used to using FLR on Nvidia cards that is not to standard. mdedetrich, this is the problem with saying "do the same as Nvidia": not everything they do is to standard or the right way to do it. Even if AMD ends up implementing FLR, don't expect it to be the same as Nvidia's, because AMD will do it to standard; of course, the result of that is that FLR will not be a magic cure-all for locked-up cards. The PCIe standard defines other resets, as part of the mandatory part of the standard, that are meant to be the cure-all for a locked-up card.

    PCIe hot reset is part of the mandatory part of the standard and is meant to bring a card back from a jammed state. This does not always work due to hardware variations, which is the reason for the BACO reset with AMD. AMD started off with the idea that hot reset was the valid solution for their GPUs, and when this was not dependable enough they implemented a vendor solution.

    By the way, the issues that cause PCIe hot reset not to work with AMD cards have also caused FLR not to work on different Nvidia-based cards; the fun of hardware variation. There are different AMD cards where you never need a BACO reset and can just use PCIe hot reset.

    It's really annoying that the PCIe standard does not define how to do a cold reset or a warm reset, even as an optional part of the standard. It's also annoying that the Linux kernel sysfs does not define an entry to do a PCIe hot reset.
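For readers wondering what a manual hot reset involves: it is usually triggered from the upstream bridge by pulsing the Secondary Bus Reset bit (bit 6) of the Bridge Control register at config offset 0x3E. Below is a minimal Python sketch of just that pulse, assuming `bridge_cfg` is a seekable binary file object opened read/write on the bridge's config space. The timing values are assumptions, and the sketch deliberately leaves out the steps a real procedure needs around it (unbinding or removing the device first, then rescanning and restoring its state afterwards).

```python
import struct
import time
from typing import BinaryIO, Callable

PCI_BRIDGE_CONTROL = 0x3E        # Bridge Control register (type 1 header)
PCI_BRIDGE_CTL_BUS_RESET = 0x40  # Secondary Bus Reset bit


def _write_bridge_control(bridge_cfg: BinaryIO, value: int) -> None:
    bridge_cfg.seek(PCI_BRIDGE_CONTROL)
    bridge_cfg.write(struct.pack("<H", value))
    bridge_cfg.flush()


def secondary_bus_reset(bridge_cfg: BinaryIO,
                        assert_time_s: float = 0.01,   # assumed pulse width
                        settle_time_s: float = 0.5,    # assumed recovery time
                        sleep: Callable[[float], None] = time.sleep) -> None:
    """Pulse Secondary Bus Reset on a bridge, hot-resetting what is below it."""
    bridge_cfg.seek(PCI_BRIDGE_CONTROL)
    (ctl,) = struct.unpack("<H", bridge_cfg.read(2))
    # Assert the reset bit, hold it, then restore the original value.
    _write_bridge_control(bridge_cfg, ctl | PCI_BRIDGE_CTL_BUS_RESET)
    sleep(assert_time_s)
    _write_bridge_control(bridge_cfg, ctl & ~PCI_BRIDGE_CTL_BUS_RESET)
    sleep(settle_time_s)
```

On a Linux host the file object could come from opening the bridge's sysfs `config` file in `"r+b"` mode with `buffering=0` as root. Note that this resets every device behind that bridge, not just one function.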


  • mdedetrich
    Senior Member

  • mdedetrich
    replied
    Originally posted by oiaohm View Post

    There is a horrible one with the PCIe patent grant. Any patents covering how to implement optional parts of the PCIe standard are not included in the PCIe patent grant. So companies can end up not implementing optional features of the PCIe standard because some party holds the patent. Mandatory features of the PCIe standard are covered. So if FLR was a mandatory feature of the PCIe standard, you would have every right to be annoyed at AMD for not implementing it.
    Does NVidia hold a patent for the FLR resetting feature?


    Originally posted by oiaohm View Post
    Part of your problem here is that you did not understand what optional means in the PCIe standard to those implementing devices. It's optional for 2 possible reasons:
    1: the PCIe standard developers did not consider that all devices need it.
    2: some party has a patent on that feature, and that patent can be selective as well, in that it affects a graphics card but not a sound card.
    I seriously doubt that PCIe made features optional because of patent reasons. This seems to be a problem with US's broken patent system and trying to create an equivalence here with optional reasons is a false one.

    Originally posted by oiaohm View Post
    Basically stop being a brat mdedetrich and have realistic expectations.

    I use AMD GPUs with virtualisation all the time. I have a small image that does a BACO reset on cards. Yes, this is more complex than FLR, but I have to do a BACO reset instead of FLR because the card lands in a non-functional mode. So the workflow is different using AMD GPUs compared to Nvidia ones. AMD CPUs and motherboards are used a lot in virtualisation due to EPYC, and yes, those also have various other places lacking functional FLR, so you have to use a vendor reset. Again, you accept that the platform is AMD, change the setup to suit, and get a very functional result.
    Look, I am not sure why you are calling me a brat, but when you buy a piece of software/hardware you expect it to work to a standard degree, and I have barely scratched the surface when it comes to the virtualization issues. As far as I see it, unless I see what these supposed patent claims are, AMD is making it harder for themselves for no logical reason.

    Customers have been calling this out as a legitimate issue on AMD's side for years (these are some of the most voted threads on AMD's reddit), and I am damn sure it's probably the most common complaint given to AMD in server scenarios.

    Originally posted by oiaohm View Post
    By the way, various Xeon motherboard chipsets from Intel also have issues with different parts on them not having working FLR support. When you get into large-scale virtualisation you are always doing "non standard here/specific to their architecture" alternatives to FLR somewhere.
    I don't know what issues you are speaking of specifically here, but generally speaking Intel has been much more reliable, and if there are problems they are usually fixed with BIOS updates. Of course Intel has other problems (i.e. ridiculously overpriced and behind AMD when it comes to core count).

    Originally posted by oiaohm View Post
    It's better for a vendor not to provide FLR and expect a solution specific to their architecture than to implement FLR wrong. There was a batch of Nvidia mining cards that had FLR wrong, so that doing an FLR drew 10 times the rated power and fried the card instead of resetting it (yes, a true halt and catch fire issue). Reset functionality is hard to implement, and if a party gets it wrong the level of damage on your hands can be monstrous. So the most important thing to me is a reset method for the device that works; AMD GPUs do have that in BACO reset. How it is done is secondary.

    FLR is the icing on the cake to me, not the cake. With the different devices out there that have defective PCIe FLR with halt and catch fire issues, randomly sending devices a PCIe FLR is playing a random game that can completely fry server motherboards when the card goes halt and catch fire.

    mdedetrich, fun, right, that FLR is not even safe to use on random Nvidia cards because you might roll unlucky. The reality that FLR can be defective in different batches of cards means you need other reset routes anyhow; this is why I am not that annoyed with amdgpu's good collection of different card reset options. Nvidia, on the other hand, not so much.
    All I am hearing is an attempt to come up with some weird justification for not implementing something. Sure, there were some NVidia cards that had broken FLR (both companies have cards that are bad apples for various reasons), but I don't see NVidia completely removing the FLR feature because one series of cards implemented it incorrectly, which seems to be what you are advocating/justifying.

    Originally posted by oiaohm View Post
    By the way, VMs crashing out is not a common event, at least not common enough that having to run a special cleanup is a major issue, at least to me, who accepts the reality of the hardware.
    Yeah because having to physically reboot a server machine because the GPU gets put into an invalid state and cannot be reset properly is loads of fun.... I am sorry but unless you can guarantee that the cards never crash this to me is an excuse.
    mdedetrich
    Senior Member
    Last edited by mdedetrich; 22 October 2020, 05:48 PM.


  • oiaohm
    Senior Member

  • oiaohm
    replied
    Originally posted by mdedetrich View Post
    You do realize that FLR (Function Level Reset) is an optional part of the PCIe standard, i.e. it's nothing specific to NVidia.
    There is a horrible one with the PCIe patent grant. Any patents covering how to implement optional parts of the PCIe standard are not included in the PCIe patent grant. So companies can end up not implementing optional features of the PCIe standard because some party holds the patent. Mandatory features of the PCIe standard are covered. So if FLR was a mandatory feature of the PCIe standard, you would have every right to be annoyed at AMD for not implementing it.

    Part of your problem here is that you did not understand what optional means in the PCIe standard to those implementing devices. It's optional for 2 possible reasons:
    1: the PCIe standard developers did not consider that all devices need it.
    2: some party has a patent on that feature, and that patent can be selective as well, in that it affects a graphics card but not a sound card.

    If the problem is number 2, you cannot expect vendors to implement it; you can only expect equivalent functionality.

    Basically, stop being a brat, mdedetrich, and have realistic expectations.

    I use AMD GPUs with virtualisation all the time. I have a small image that does a BACO reset on cards. Yes, this is more complex than FLR, but I have to do a BACO reset instead of FLR because the card lands in a non-functional mode. So the workflow is different using AMD GPUs compared to Nvidia ones. AMD CPUs and motherboards are used a lot in virtualisation due to EPYC, and yes, those also have various other places lacking functional FLR, so you have to use a vendor reset. Again, you accept that the platform is AMD, change the setup to suit, and get a very functional result.

    By the way, various Xeon motherboard chipsets from Intel also have issues with different parts on them not having working FLR support. When you get into large-scale virtualisation you are always doing "non standard here/specific to their architecture" alternatives to FLR somewhere.

    It's better for a vendor not to provide FLR and expect a solution specific to their architecture than to implement FLR wrong. There was a batch of Nvidia mining cards that had FLR wrong, so that doing an FLR drew 10 times the rated power and fried the card instead of resetting it (yes, a true halt and catch fire issue). Reset functionality is hard to implement, and if a party gets it wrong the level of damage on your hands can be monstrous. So the most important thing to me is a reset method for the device that works; AMD GPUs do have that in BACO reset. How it is done is secondary.

    FLR is the icing on the cake to me, not the cake. With the different devices out there that have defective PCIe FLR with halt and catch fire issues, randomly sending devices a PCIe FLR is playing a random game that can completely fry server motherboards when the card goes halt and catch fire.

    mdedetrich, fun, right, that FLR is not even safe to use on random Nvidia cards because you might roll unlucky. The reality that FLR can be defective in different batches of cards means you need other reset routes anyhow; this is why I am not that annoyed with amdgpu's good collection of different card reset options. Nvidia, on the other hand, not so much.

    By the way, VMs crashing out is not a common event, at least not common enough that having to run a special cleanup is a major issue, at least to me, who accepts the reality of the hardware.
