AMDGPU Linux Driver No Longer Lets You Have Unlimited Control To Lower Your Power Limit


  • #71
    Originally posted by stormcrow View Post
    ...some commenters apparently fell asleep in high school science classes...

    Basic high school electricity. Under voltage beyond tolerance will kill electronics nearly as fast as an over voltage (where you get arcing). Under voltage increases your amperage to meet the basic power levels required by the electronics. Electronics are rated to a certain voltage, but more importantly, to a certain amperage. When that amperage is exceeded Bad Things happen. Ever wondered why high amperage extension cords are much larger and more expensive than low amperage for the same voltage? (Look it up.) Additionally left to your education, find out why weak(ening) PSUs often scorch power traces on connected boards. (Hint: Power (Watts) = V (voltage) x I (current or amperage) )

    Don't expect this to ever be reverted. It was an oversight/bug to begin with.
    You apparently fell asleep in "last 10 years of DVFS literature review". The power limit doesn't change the voltage applied to the chip or increase the current through it for any particular clock frequency. It only controls which frequency/voltage pairs the chip is allowed to choose for a given level of utilization.



    • #72
      Originally posted by yump View Post
      You apparently fell asleep in "last 10 years of DVFS literature review". The power limit doesn't change the voltage applied to the chip or increase the current through it for any particular clock frequency. It only controls which frequency/voltage pairs the chip is allowed to choose for a given level of utilization.
      But he might have been in the same class as some AMD employees; they have no clue about physics or how their products work either.



      • #73
        Originally posted by Anux View Post
        I don't know how to explain it better and also know no good resource that talks about this. Maybe someone else does?
        OK, I'm sorry then, as I said, I'm very unfamiliar with these systems.

        So if I now understood correctly, the driver sets FV greedily to have the lowest capacitive load at any given tdp.

        Is that it?


        If so, all of this seems to me like an XY problem: https://en.wikipedia.org/wiki/XY_problem

        People actually want to constrain their gpus to not go above a FV pair n. Not limit the tdp of FV pair n+1.

        The patch, if I understood correctly from what you showed, limits how low the tdp can be for (in your example) FV7. People want their boards to consume less, so constraining it to FV6 would seem to me (the layman) like a better approach.

        Or where have I gone wrong?



        • #74
          I would love my 7900 XTX to be able to go beyond vanilla TDP. On 4k CP2077 it's completely power limited.



          • #75
            Originally posted by DumbFsck View Post

            OK, I'm sorry then, as I said, I'm very unfamiliar with these systems.

            So if I now understood correctly, the driver sets FV greedily to have the lowest capacitive load at any given tdp.

            Is that it?


            If so, all of this seems to me like an XY problem: https://en.wikipedia.org/wiki/XY_problem

            People actually want to constrain their gpus to not go above a FV pair n. Not limit the tdp of FV pair n+1.

            The patch, if I understood correctly from what you showed, limits how low the tdp can be for (in your example) FV7. People want their boards to consume less, so constraining it to FV6 would seem to me (the layman) like a better approach.

            Or where have I gone wrong?
            You've confused yourself.

            People want their boards not to consume more than a given TDP, but to maximize performance within that envelope. A higher frequency gives better performance but requires a higher voltage; the actual power consumed depends on what the GPU is doing (the workload). If the power (or current) drawn at a certain frequency and voltage is too high (exceeds the TDP), a lower frequency/voltage pair must be used; if there is headroom, a higher F/V pair can be used. The driver typically monitors the GPU load and sets the F/V pair to a value where it can perform the workload while minimizing power consumption. This is the DPM function in the amdgpu driver, but it is constrained by the TDP.
            Last edited by s_j_newbury; 06 March 2024, 02:57 PM. Reason: Consistency
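
For illustration, here is a minimal sketch of the selection described above. It assumes a hypothetical table of F/V pairs and the simplified dynamic-power model P = utilization * C * f * V^2. The table values, the capacitance figure and the function names are made up for the example; none of them come from a real GPU or from the amdgpu driver.

```python
# Hypothetical F/V table: (frequency in MHz, voltage in V). Values are made up.
FV_TABLE = [(500, 0.70), (1000, 0.80), (1500, 0.90), (2000, 1.00), (2500, 1.15)]

# Assumed effective switched capacitance at 100 % utilization, in farads.
C_EFF = 100e-9


def dynamic_power(freq_mhz, volt, utilization):
    """Estimate power in watts with the simplified model P = util * C * f * V^2."""
    return utilization * C_EFF * freq_mhz * 1e6 * volt ** 2


def pick_fv(power_cap_w, utilization):
    """Return the fastest F/V pair whose estimated power fits under the cap.

    Falls back to the slowest pair if nothing fits.
    """
    best = FV_TABLE[0]
    for freq, volt in FV_TABLE:
        if dynamic_power(freq, volt, utilization) <= power_cap_w:
            best = (freq, volt)
    return best
```

With these made-up numbers, lowering the cap from 220 W to 150 W at full utilization moves the choice from the 2000 MHz pair to the 1500 MHz pair. The voltage assigned to each pair never changes; only the set of pairs that fit under the cap does.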



            • #76
              Originally posted by DumbFsck View Post
              OK, I'm sorry then
              You don't need to, most of us are here to gain and spread knowledge. At least I am ... and for the giggles.

              Originally posted by s_j_newbury View Post
              People want their boards not to consume more than a given TDP, but to maximize performance within that envelope. A higher frequency gives better performance but requires a higher voltage; the actual power consumed depends on what the GPU is doing (the workload). If the power (or current) drawn at a certain frequency and voltage is too high (exceeds the TDP), a lower frequency/voltage pair must be used; if there is headroom, a higher F/V pair can be used. The driver typically monitors the GPU load and sets the F/V pair to a value where it can perform the workload while minimizing power consumption. This is the DPM function in the amdgpu driver, but it is constrained by the TDP.
              Yes exactly, people don't care about F/V pairs (or the inner workings of the GPU); they want lower power/heat/noise, and not just the 6% AMD gives them. The selection of F/V pairs is and should be the task of the GPU or firmware power management.
              Working with F/V pairs directly would give vastly different results, from low performance to high power, as my table hopefully visualizes.

              Undervolting can be used in conjunction, but it is not needed and can cause instability.



              • #77
                It's not impossible that the claim about potential hardware damage is true. Newer chips (RDNA1+, I think) use a parametric curve for frequency-voltage instead of a handful of defined pairs. It's conceivable that a parametric curve might produce wild values outside its intended domain of application, which work for low utilization but could be dangerous otherwise. Or a VRM design might assume that low voltages would only correspond to low utilization, and use that to select the number of enabled phases. (But such a design might blow up in Furmark too.) And of course AIB partners are probably only testing the stock values and maybe overclocking.

                But if AMD is going to enforce a minimum power limit for some reason, that reason needs to be publicly stated on the mailing list and in the commit messages in a way that is understandable to someone educated in electrical engineering and explains the nature of the legitimate problem with too-low power limit. It must be clear that 1) this constraint prevents a legitimate problem, and 2) the necessity has been reviewed by the hardware team with awareness that users don't like it.

                What has been posted to the list so far is compatible with them just following documentation and what Windows does. Christian König doesn't seem to have cottoned on that it's about power and not voltage.

                The concern about bugs reported by users running in undefined configurations is legitimate, but there is an established solution to that that should be applied to all overclocking/undervolting interfaces the kernel exposes: taint the kernel. I'd even favor going further and tainting the kernel if XMP is enabled, if it's easy enough to detect that.

                Anux intelfx and DumbFsck

                Undervolting does increase amperage, actually, for chips that typically run against the power limit. Undervolting reduces the amount of current at any given frequency, but the firmware will then choose a higher frequency to get back to the limit. And since P=I*V, I=P/V. Firmware holds power constant, so reducing voltage increases current.
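
Both halves of that statement come down to what is held constant. Here is a small sketch with made-up numbers: the voltages, the 200 W limit and the capacitance are illustrative assumptions, and the fixed-frequency case uses the simplified model I = f * C * V.

```python
C_EFF = 100e-9  # assumed effective switched capacitance in farads (made up)


def current_at_fixed_frequency(freq_hz, volt):
    """I = f * C * V: current drawn at a fixed clock frequency."""
    return freq_hz * C_EFF * volt


def current_at_fixed_power(power_w, volt):
    """I = P / V: current drawn when firmware holds total power constant."""
    return power_w / volt


# Case 1: same 2 GHz clock, voltage lowered from 1.00 V to 0.90 V.
i_stock = current_at_fixed_frequency(2e9, 1.00)
i_undervolted = current_at_fixed_frequency(2e9, 0.90)

# Case 2: firmware raises clocks until the same 200 W limit is reached again.
i_stock_at_limit = current_at_fixed_power(200.0, 1.00)
i_undervolted_at_limit = current_at_fixed_power(200.0, 0.90)
```

With these numbers, the fixed-frequency case drops from about 200 A to about 180 A, while the fixed-power case rises from 200 A to about 222 A. Whether a real chip can actually clock high enough at the lower voltage to reach the limit again is a separate question that this arithmetic does not answer.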

                P=I*V is first principles physics and applies to all devices, semiconductor or otherwise. The f*C*V^2 power formula is just P=IV, with a substitution. The charge on a capacitor is CV, and that amount of charge is drawn through the voltage Vdd every time a gate turns on. Current is charge per second, and f is the number of cycles per second. I=fCV. Some of the energy is dissipated in the on-cycle, and some in the off-cycle, but that doesn't affect the result. (In fact from outside the chip, capacitors in the PDN filter out clock-frequency current variation almost completely, so a big ASIC just looks like a resistor that changes value very quickly with what the workload is doing.)
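
Written out step by step, the substitution looks like this (same symbols as in the text: f is the clock frequency, C the switched capacitance, V the supply voltage Vdd; Q for the charge per cycle is the only symbol added here):

```latex
\begin{align*}
Q &= C V            && \text{charge drawn through } V_{dd} \text{ per cycle} \\
I &= f Q = f C V    && \text{charge per second} \\
P &= I V = f C V^{2}
\end{align*}
```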




                • #78
                  Originally posted by yump View Post
                  It's not impossible that the claim about potential hardware damage is true. Newer chips (RDNA1+, I think) use a parametric curve for frequency-voltage instead of a handful of defined pairs. It's conceivable that a parametric curve might produce wild values outside its intended domain of application, which work for low utilization but could be dangerous otherwise.
                  How so? A continuous curve is nothing more than infinitely many F/V pairs. If any of those F/V pairs can destroy your card, how do they make sure this doesn't happen with some random workload that uses this F/V pair at standard TDP? I could limit my game's FPS from 312 to 1 and achieve any possible TDP far below standard TDP.

                  Or a VRM design might assume that low voltages would only correspond to low utilization, and use that to select the number of enabled phases
                  That would in the worst case freeze my GPU but not destroy it.

                  (But such a design might blow up in Furmark too.)
                  Exactly, there would always be some potential workload that might trigger this behavior.

                  And of course AIB partners are probably only testing the stock values and maybe overclocking.
                  There are many more things that AIBs don't test that we use on standard Linux installations, for example running the card with community-written drivers.
                  And what about booting with amdgpu.ppfeaturemask=0xffffffff or any other combination? If that is allowed without a kernel recompile, there is no argument for limiting the min TDP.

                  But if AMD is going to enforce a minimum power limit for some reason, that reason needs to be publicly stated on the mailing list and in the commit messages in a way that is understandable to someone educated in electrical engineering and explains the nature of the legitimate problem with too-low power limit.
                  Hehe, good luck with that. I think they already gave their fullest deep dive with "it could destroy your card!!1!!1".

                  The concern about bugs reported by users running in undefined configurations is legitimate
                  No, as long as there is amdgpu.ppfeaturemask, overclocking, fancontrol and under volting it's not. Heck I could wrap my card in a towel and call AMD support to tell them about my over heating issues.

                  but there is an established solution ... taint the kernel
                  Yes they can do that, if we get our control back it's totally fine by me. Taint the kernel if there is an altered UEFI setting, taint it if dual boot is detected, taint it on full moon, I couldn't care less.

                  Undervolting does increase amperage
                  No it doesn't. At a fixed frequency, undervolting decreases current. And if power management selects a higher state, then that's not undervolting, and it would still only stay at roughly the same current we had before undervolting. That is how we get higher performance at the same TDP with undervolting.

                  Sorry for being a bit too sarcastic. :/ Until recently there was no doubt that my next GPU would be an AMD card. Now that I know I couldn't use it in Windows (there's no way I'm running a card above 150 W for extended periods) I have lost faith in them and am seriously considering Intel GPUs.
                  I think I got their memo, they hate power efficiency and custom builds and want us to buy competitor cards. Intel seems to have no problem going down to 95 W in Win: https://www.kitguru.net/wp-content/u...-oc-scaled.jpg and I noticed they are really working hard to make their drivers good/usable. Maybe 2 more generations and we won't ever have to think about radeon again.

                  I can only hope this doesn't spread to their CPU lineup.



                  • #79
                    Originally posted by Anux View Post

                    Sorry for being a bit too sarcastic. :/ Until recently there was no doubt that my next GPU would be an AMD card. Now that I know I couldn't use it in Windows (there's no way I'm running a card above 150 W for extended periods) I have lost faith in them and am seriously considering Intel GPUs.
                    I think I got their memo, they hate power efficiency and custom builds and want us to buy competitor cards. Intel seems to have no problem going down to 95 W in Win: https://www.kitguru.net/wp-content/u...-oc-scaled.jpg and I noticed they are really working hard to make their drivers good/usable. Maybe 2 more generations and we won't ever have to think about radeon again.

                    I can only hope this doesn't spread to their CPU lineup.
                    I generally run my Vega64 with a 45W TDP, but have it switch to the stock 220W through "gamemode". That way I use ridiculous amounts of electricity only when I really want to.



                    • #80
                      Originally posted by yump View Post
                      P=I*V is first principles physics and applies to all devices, semiconductor or otherwise. The f*C*V^2 power formula is just P=IV, with a substitution.
                      Correct. I guess what I meant in the first reply is more that I think I never in my life heard anyone talk about amperes in this context. Or, as I stated in a somewhat inflammatory way, we wouldn't say, like he did, that current has to increase; we'd say that at iso-power and lower voltage, f or C is higher, or both (so I guess an example would be that we would see GPU usage being higher at the same tdp and for approx. the same amount of work [which could be frames or hashes or w/e]).


                      But as I said, I'm a layman and the last time I read anything about it was more than a decade ago.

                      But I do have a nagging question that is really not clear. So stop on the paragraph where my misunderstanding is found.

                      There is a point in video card's power consumption where it becomes too inefficient. (requires a greater increase in power for less performance gain than what the user wants)

                      People want to set power1_cap at the point of the efficiency curve at which they are satisfied with the returns, therefore avoiding further diminishing.

                      The way power1_cap_min used to work was to let the gpu do whatever it wanted up to the set wattage, so the user did not have to know the clock, power mode, voltage, etc.
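
As a rough sketch of that interface: the amdgpu hwmon directory exposes power1_cap together with power1_cap_min and power1_cap_max, all in microwatts. The helper names below are hypothetical, the card0 path is an assumption about the local system, and writing the cap needs root. The clamp reflects the expectation that a value below power1_cap_min would be refused after the change discussed in this thread.

```python
import glob
import os


def find_hwmon_dir(card="card0"):
    """Return the first hwmon directory of the given DRM card, or None."""
    matches = sorted(glob.glob(f"/sys/class/drm/{card}/device/hwmon/hwmon*"))
    return matches[0] if matches else None


def read_microwatts(hwmon_dir, name):
    """Read one of the power1_cap* files; values are in microwatts."""
    with open(os.path.join(hwmon_dir, name)) as fh:
        return int(fh.read().strip())


def clamp_cap(requested_w, min_uw, max_uw):
    """Convert watts to microwatts and clamp into the range the driver reports."""
    requested_uw = int(round(requested_w * 1_000_000))
    return max(min_uw, min(max_uw, requested_uw))


def set_power_cap(hwmon_dir, requested_w):
    """Write a clamped cap to power1_cap (needs root on a real system)."""
    min_uw = read_microwatts(hwmon_dir, "power1_cap_min")
    max_uw = read_microwatts(hwmon_dir, "power1_cap_max")
    value = clamp_cap(requested_w, min_uw, max_uw)
    with open(os.path.join(hwmon_dir, "power1_cap"), "w") as fh:
        fh.write(str(value))
    return value
```

For example, if a card reported a minimum of 200 W, clamp_cap(45, 200_000_000, 250_000_000) would return 200000000, which is exactly the situation people in the thread are unhappy about.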


                      The question I have now is: is it still possible to limit tdp, but in a less transparent way?

                      People in the bug report mentioned you can set SCLK. After talking to you guys I looked at the documentation, and apparently you can set the VDD curve and pp tables, activate/deactivate performance levels, and set manual performance levels.

                      A lot of pp_dpm*.
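
To make the pp_dpm* idea concrete, here is a hedged sketch that parses the level list pp_dpm_sclk prints and builds the index string that would keep only levels at or below a chosen clock. The sample text and the function names are made up for illustration, and newer GPUs may list only a few levels.

```python
import re

# Example text in the format pp_dpm_sclk prints; the clock values are made up.
SAMPLE = """0: 300Mhz
1: 800Mhz
2: 1200Mhz *
3: 1600Mhz
"""


def parse_levels(text):
    """Parse 'index: <freq>Mhz [*]' lines into (index, mhz, active) tuples."""
    levels = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s*(\d+)\s*Mhz\s*(\*)?", line, re.IGNORECASE)
        if m:
            levels.append((int(m.group(1)), int(m.group(2)), m.group(3) is not None))
    return levels


def levels_up_to(text, max_mhz):
    """Build the space-separated index list that keeps only levels <= max_mhz."""
    allowed = [str(idx) for idx, mhz, _ in parse_levels(text) if mhz <= max_mhz]
    return " ".join(allowed)
```

On a real system the resulting string (here "0 1 2" for a 1200 MHz ceiling) would be written to pp_dpm_sclk after writing manual to power_dpm_force_performance_level in the same device directory, as root. Note that this caps the clock, not the power, so the draw still depends on the workload.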


                      So isn't the end result the same, just using a different interface?


                      In the bug reports I also read about people using tuxclocker and corectl. For people using these "front ends", would it be possible for the presentation to stay the same, with the software using these other methods to set the desired tdp?

                      BTW, are the *od_* API calls named "od" because of overdrive, with overdrive being some marketing name AMD gave to overclocking? And if you use od, is your kernel already tainted?

                      Cause I see what you're saying about XMP/DOCP or whatever it's called.

                      This whole situation does not add up. Sure, they are setting the pcap to respect what their AIB partners write in their bios. But if you can get the exact same state through other means, this seems like change for the sake of change. And if od does not taint the kernel, it is much easier for people to get undefined behaviour by using it than by using pcapmin.

