AMDGPU Linux Driver No Longer Lets You Have Unlimited Control To Lower Your Power Limit


  • RBilettess
    replied
    For posterity: The fine people at linux-zen already took action:
    Summary The Linux 6.7 kernel introduced a change to the AMDGPU driver, enforcing a lower power limit set by the graphics card BIOS, preventing users from setting power limits below this threshold. ...

    Adding 'amdgpu.ignore_min_pcap=1' as a boot parameter on recent linux-zen kernels restores the old behavior.

  • Panix
    replied
    Originally posted by varikonniemi View Post
    What the f is going on here? At least i would expect an explanation how this would damage the card? Only thing i could think of is it would damage their customer service if someone forgot they set it too low... WTF is this in Linux land. "just make it like windows" ... ????
    AMD (gpus) are getting worse and worse - for productivity, they're pretty useless. The advantages and benefits seem to be evaporating. It seems that all you get is aggravation. Really a shame.

  • vrendtop
    replied
    It is a very clear strategy of artificial segmentation. You take away users control over that part of the hardware, then you can upcharge customers that want any specific power-related behavior.

  • RBilettess
    replied
    Originally posted by david-nk View Post
    The talk about safety is completely inane, I can run a 4090 24/7 at 33% PL and there is still no damage.
    I would be far more concerned about running the card 24/7 at 100% PL.
    The funny thing is, I'm still allowed to set my 303W card to 402W.
    10% down, but 33% up.
    I'm pretty sure, the heatsink wouldn't be able to handle this. Other stuff like voltage regulators, who knows?
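To put numbers on the "10% down, but 33% up" remark, here is a small sketch using the figures quoted in this thread (303 W default, 402 W maximum, and the 272 W minimum reported for the same card). The helper name is made up for illustration.

```python
# Figures quoted in this thread for one card, in watts.
DEFAULT_W = 303   # stock power limit
MIN_CAP_W = 272   # lowest cap the driver now accepts
MAX_CAP_W = 402   # highest cap the driver accepts

def headroom_percent(cap_w: float, default_w: float) -> float:
    """Signed distance of a cap from the default limit, in percent."""
    return (cap_w - default_w) / default_w * 100.0

down = headroom_percent(MIN_CAP_W, DEFAULT_W)
up = headroom_percent(MAX_CAP_W, DEFAULT_W)
print(f"down: {down:.1f}%  up: {up:+.1f}%")
```

So the allowed range is roughly 10% below the default and roughly 33% above it.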

  • david-nk
    replied
    Originally posted by RBilettess View Post
    This hit me today. I ran my GPU at 200W instead of 303W, now the minimum allowed power cap is 272W.
    I've tried:
    Code:
    amdgpu.ppfeaturemask=0xffffffff
    But this only allowed me to raise power cap max, not lower power cap min.
    That's it? Take it or leave it?
    10% lower is all they allow? This is insane considering modern cards reach their maximum efficiency at 60-70% of the normal power limit.
    Instead of trying to catch up to Nvidia, AMD keep shooting themselves in the foot. It's infuriating.

    The talk about safety is completely inane, I can run a 4090 24/7 at 33% PL and there is still no damage.
    I would be far more concerned about running the card 24/7 at 100% PL.
    Last edited by david-nk; 18 March 2024, 08:35 AM.

  • RBilettess
    replied
    This hit me today. I ran my GPU at 200W instead of 303W, now the minimum allowed power cap is 272W.
    I've tried:
    Code:
    amdgpu.ppfeaturemask=0xffffffff
    But this only allowed me to raise power cap max, not lower power cap min.
    That's it? Take it or leave it?
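For anyone who wants to see what range their own card reports, a minimal sketch follows. It assumes the amdgpu hwmon files `power1_cap_min` and `power1_cap_max` (values in microwatts) are exposed under `/sys/class/drm/card*/device/hwmon/hwmon*`; the exact path varies per system, and the helper names are made up for illustration. It only reads the limits and shows what a 200 W request would be clamped to.

```python
import glob
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000

def clamp_cap(requested_w, min_w, max_w):
    """Return the cap closest to the request that lies inside [min_w, max_w]."""
    return max(min_w, min(requested_w, max_w))

def read_cap_range(hwmon_dir):
    """Read the (min, max) power cap in watts from one hwmon directory."""
    def read_watts(name):
        return int(Path(hwmon_dir, name).read_text()) / MICROWATTS_PER_WATT
    return read_watts("power1_cap_min"), read_watts("power1_cap_max")

if __name__ == "__main__":
    wanted_w = 200  # the cap used before the kernel change, per the post above
    # Assumed sysfs location; does nothing on machines without such a device.
    for hwmon_dir in glob.glob("/sys/class/drm/card*/device/hwmon/hwmon*"):
        try:
            min_w, max_w = read_cap_range(hwmon_dir)
        except (OSError, ValueError):
            continue  # file missing or unreadable on this device
        print(hwmon_dir, min_w, max_w, clamp_cap(wanted_w, min_w, max_w))
```

With the numbers from the post (minimum 272 W, maximum 402 W), a 200 W request ends up at 272 W.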

  • yump
    replied
    Originally posted by DumbFsck View Post
    Oh, I was thinking the exact opposite. I guess I'm thinking of during intensive workload and you are talking about idle.

    As in, I thought that due to the longer time that it would take to "fill up" a capacitor with the lower voltage, the C would have to rise at iso f.
    E.g. To game at nnnfps frame limited in game, scenario A with a tdp cap of 100 and scenario B with a tdp cap of 50, if the voltage is dropped for scenario B (as it should, since it has a disproportional impact in tdp) both could still boost to fmax, but A would show 50% utilisation and B would show 100% (I think in gpu_busy)(every number not related to reality, not even their relationships, cause of course I don't think halving power requires doubling activity. It's just an illustration)

    As per your latest comment in response to American locomotive. A larger area of the die would have to be active at any given time for it to reach the max clock.
    No, we're both talking about intensive workloads.

    If the voltage is dropped the chip cannot boost to fmax. There is a minimum voltage required for the chip to compute correctly at any given frequency, running any given sequence of instructions. The minimum voltage required at every frequency for a worst-case workload defines the voltage/frequency curve. The chip has an embedded controller doing power management, and that controller's firmware has a definition of the voltage/frequency curve (either as a lookup table or a parameterized formula).
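A rough sketch of the lookup-table variant described above. The table points are invented for illustration and do not come from any real chip or firmware; the function names are also hypothetical.

```python
from bisect import bisect_left

# Hypothetical (frequency in MHz, minimum voltage in V) points, sorted by frequency.
VF_CURVE = [(500, 0.70), (1000, 0.75), (1500, 0.85), (2000, 1.00), (2500, 1.20)]

def min_voltage(freq_mhz):
    """Minimum voltage for a frequency, linearly interpolated between table points."""
    freqs = [f for f, _ in VF_CURVE]
    if not freqs[0] <= freq_mhz <= freqs[-1]:
        raise ValueError("frequency outside the table")
    i = bisect_left(freqs, freq_mhz)
    if freqs[i] == freq_mhz:
        return VF_CURVE[i][1]
    (f0, v0), (f1, v1) = VF_CURVE[i - 1], VF_CURVE[i]
    return v0 + (v1 - v0) * (freq_mhz - f0) / (f1 - f0)

def max_frequency(voltage):
    """Highest table frequency whose minimum voltage does not exceed `voltage`."""
    allowed = [f for f, v in VF_CURVE if v <= voltage]
    if not allowed:
        raise ValueError("voltage below the lowest table point")
    return max(allowed)
```

In this toy table, dropping the available voltage from 1.20 V to 0.90 V lowers `max_frequency` from 2500 MHz to 1500 MHz, which is the sense in which a chip with reduced voltage cannot boost to fmax.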

    How "heavy" or "light" a workload is depends on how many logic gates it causes to switch at once. That means like, furmark is heavy and games are comparatively light, nothing to do with how many FPS the game gets. (You will see some versions of the power formula that incorporate this, as α*C*f*v^2. α is "activity factor".) The reason activity factor matters is that higher-activity workloads draw more current, which causes more voltage drop in the wires between the VRM and the actual logic gates.

    When the chip is allowed to automatically select voltage to match frequency, energy per frame scales something like frequency squared, at least in the part of the voltage/frequency curve where most chips are designed to operate under load. Turning half the shader cores off (and I don't know if that would actually show up in utilization %) saves no energy at all, because to complete the same work the remaining cores have to run twice as long. (I am assuming here that the uncapped frame rate is at least 2x the capped frame rate. Otherwise the half-disabled chip can't meet the FPS target and isn't doing the same work.)
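The two claims above can be checked with the activity-factor power formula. This is only a sketch: it assumes voltage is exactly proportional to frequency (a simplification of a real voltage/frequency curve), treats a frame as a fixed number of clock cycles, and uses made-up values for activity factor and capacitance.

```python
def dynamic_power(alpha, cap_farads, freq_hz, volts):
    """Dynamic power in watts: alpha * C * f * V^2."""
    return alpha * cap_farads * freq_hz * volts ** 2

def energy_per_frame(alpha, cap_farads, freq_hz, volts, cycles_per_frame):
    """Energy in joules to finish one frame that needs `cycles_per_frame` cycles."""
    seconds = cycles_per_frame / freq_hz
    return dynamic_power(alpha, cap_farads, freq_hz, volts) * seconds

# Hypothetical chip: activity factor 0.2, 100 nF switched capacitance.
ALPHA, CAP = 0.2, 1e-7
CYCLES = 2e7  # cycles one frame needs with all cores enabled

full = energy_per_frame(ALPHA, CAP, 2e9, 1.0, CYCLES)

# Half the frequency, with voltage scaled down in proportion.
half_clock = energy_per_frame(ALPHA, CAP, 1e9, 0.5, CYCLES)

# Half the cores at full clock: half the switched capacitance,
# but the remaining cores need twice as many cycles for the same frame.
half_cores = energy_per_frame(ALPHA, CAP / 2, 2e9, 1.0, 2 * CYCLES)

print(full, half_clock, half_cores)
```

Under these assumptions, halving the clock cuts energy per frame to a quarter (frequency squared), while halving the active cores leaves it unchanged.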

  • DumbFsck
    replied
    I just remembered there's something something OHMIC something about semiconductors that was useful to model current.

    I can't remember what, so sorry for posting here but it is a reminder for me to research and see what it is that I half remember.


    I honestly feel like I could get you guys in a call and listen for hours to your responses to my questions. At the same time I know for a fact most of my questions would be so shallow that it would in the end be a waste of everyone's time lol.


    I'll see if my nephew who's currently attending uni can hook me up with a copy of "Principles of CMOS VLSI Design: A Systems Perspective", I bet it will be very enjoyable

  • DumbFsck
    replied
    Originally posted by yump View Post

    Amperes are very important in the context of reliability, and amperes are what cause stress on the VRM.
    Thank you, this was an interesting read.

    I honestly didn't think it was something to consider with cpu/gpu design, as I thought that due to the infinitesimal area, the effects on velocity would already negate all of it.

    But as I said, all I know about it is dodgy murky misremembering. The other day after reading you guys' comments I realised I didn't even remember wtf was a S. Just for you to see how little remains in my head about the subject.

    Also I can't even remember exactly what velocity entails. And somehow I thought that with semiconductors being in a lattice structure it would mess up a lot of migration.


    Originally posted by yump View Post
    Almost certainly it's f that varies. The power management controller could theoretically tell the scheduler to idle CUs or parts of CUs, but I'm pretty sure reducing frequency (and voltage, because it's a lookup table or parametric curve) saves more power per unit performance lost unless you've set a really low power limit on a large chip.

    Oh, I was thinking the exact opposite. I guess I'm thinking of during intensive workload and you are talking about idle.

    As in, I thought that due to the longer time that it would take to "fill up" a capacitor with the lower voltage, the C would have to rise at iso f.
    E.g. To game at nnnfps frame limited in game, scenario A with a tdp cap of 100 and scenario B with a tdp cap of 50, if the voltage is dropped for scenario B (as it should, since it has a disproportional impact in tdp) both could still boost to fmax, but A would show 50% utilisation and B would show 100% (I think in gpu_busy)(every number not related to reality, not even their relationships, cause of course I don't think halving power requires doubling activity. It's just an illustration)


    As per your latest comment in response to American locomotive. A larger area of the die would have to be active at any given time for it to reach the max clock.

    ____---_____

    s_j_newbury I'd think using dpm to create and set a mode in which the gpu simply can't reach a tdp above what the user wants would be the way they "support". Not guessing internal state from userspace. It shouldn't be needed, right?

  • yump
    replied
    AmericanLocomotive I'm pretty sure the power limit acts over much longer timescales than that, based on Igor's power spike data. https://www.igorslab.de/en/radeon-rx...0xt-review/11/

    There are physical limits to how quickly VRM output voltage can be adjusted, and it would be extremely weird for a power limit to be used as anything other than an outer control loop shrinking the allowed range of frequency selection. If it could extend the range below Vmin, or force unsafe combinations of frequencies from different domains, then you'd have to validate the power limit against all possible workloads. Better to write the firmware so that it does not do that, and you can list the constraints to be obeyed explicitly.
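A toy version of that "outer control loop" idea, to make the constraint concrete. This is not how any real firmware is written: the step size, the cubic power model, and all names are assumptions for illustration. The point is only that the loop moves a frequency ceiling and clamps it to the validated range, so it can never push the chip below its minimum frequency.

```python
def next_frequency_ceiling(ceiling_mhz, measured_w, limit_w,
                           f_min_mhz, f_max_mhz, step_mhz=25):
    """One step of the outer loop: nudge the ceiling, then clamp to [f_min, f_max]."""
    if measured_w > limit_w:
        ceiling_mhz -= step_mhz   # over the limit: shrink the allowed range
    else:
        ceiling_mhz += step_mhz   # under the limit: give frequency back
    return max(f_min_mhz, min(ceiling_mhz, f_max_mhz))

def toy_power_w(freq_mhz):
    """Hypothetical load: 300 W at 2500 MHz, scaling with frequency cubed."""
    return 300.0 * (freq_mhz / 2500.0) ** 3

def settle(limit_w, steps=200, f_min_mhz=500, f_max_mhz=2500):
    """Run the loop against the toy load and return the final ceiling."""
    ceiling = f_max_mhz
    for _ in range(steps):
        ceiling = next_frequency_ceiling(ceiling, toy_power_w(ceiling), limit_w,
                                         f_min_mhz, f_max_mhz)
    return ceiling

print(settle(limit_w=100))  # hovers near the frequency where the toy load draws 100 W
print(settle(limit_w=1))    # unreachable limit: ceiling stops at f_min, never below
```

With an unreachable limit the ceiling simply rests at the minimum frequency and the limit is violated, rather than the loop inventing an unvalidated operating point.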

    Intel CPUs, for example, will violate the power limit to maintain base clock, but they won't violate the thermal limit to do that. And if they get all the way down to the minimum frequency (800 MHz) and are *still* overheating, then they switch to clock modulation instead (forced idle-injection on a duty cycle). But at no point will they attempt to run below fused minimum frequencies and voltages. They'll emergency shutdown first.

    Yeah losing some control over the hardware stinks, but the reality is that being able to adjust GPU power limits at all through software is still a "relatively" new thing. It wasn't that long ago the only thing you could adjust was clocks and voltage - and even that wasn't always a given.
    I don't know about the Nvidia side, but power limits have been software adjustable on AMD GPUs since at least Polaris in 2016, maybe earlier.
