Linux 5.17 Released With AMD P-State Driver, Plenty Of New Hardware Support


  • yump
    replied
    Originally posted by farnz View Post

    Looking at my wall consumption, and using a software decoder, at 3.2 GHz (the clock speed of my system as set by the performance governor), I consume 18.5W. At 3.1 GHz, this drops to 18.4W. At 3.0 GHz, it goes back up to 18.5W. As I drop the frequency further, the wall consumption goes up to a peak of 20W at 1.8 GHz before it starts dropping frames.
    Guessing from the wall power and max frequency, that's some kind of laptop or mini-PC, but you have direct control of the fan voltage? I'm curious what kind of hardware you have there. I hacked up a benchmark for this (takes about an hour to run), and on my machine (i5-4670K with a slight overclock/undervolt; factory max speed 3.8 GHz), power efficiency is flat-ish only up to 3 GHz or so. But somehow schedutil still beats performance even with the frequency set to below the point where it goes pear-shaped, possibly because schedutil is allowed to go all the way down to 800 MHz:

    [attachment: plot.png]
    Sorry about how janky it is. I did run on a quiet system, but... If I find the time, I'll fix it to average the best 2 out of 5 or something.

    The breakpoint looks to be around 3.4 GHz, which is the same frequency where my CPU's VID/Frequency map gets way steeper, and also the start of the "turbo" range:

    [attachment: adapt-1230_offset-100.png]
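    The sweep behind that plot can be sketched in a few lines of shell. This is a minimal sketch, assuming cpupower and turbostat are installed, the workload is already running in another terminal, and the frequency list is adjusted to your CPU (the values here are illustrative):

```shell
#!/bin/sh
# Sweep the maximum allowed frequency and log package power at each step.
# Assumes a video/benchmark workload is already running separately.
for f in 1.0 1.5 2.0 2.5 3.0 3.5; do
    sudo cpupower frequency-set -u "${f}GHz" >/dev/null
    echo "=== max ${f} GHz ==="
    # One 30-second summary sample from turbostat while the workload runs.
    sudo turbostat --quiet --Summary --show Bzy_MHz,PkgWatt \
        --interval 30 --num_iterations 1
done
```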

    turbostat shows the balance here - at 3.0 GHz my core power consumption is lowest, but the package power consumption goes up because turbostat sees RAMWatt consuming more. At 3.1 GHz, I hit the balance point where while the cores consume a bit more, RAMWatt is lower, and PkgWatt goes down.

    Reducing frequency further continues the balancing - CorWatt goes down, but RAMWatt goes up by more than CorWatt falls by, causing PkgWatt to increase.
    RAMWatt is not reported by my CPU. My wall power meter doesn't have any kind of API and only updates once per second, so there's no hope of synchronizing measurements with the video. But eyeballing it, it looks like 84 W at 2 GHz and 93 W at 4 GHz, both with the performance governor.

    If I limit fan voltage to 7V (== less cooling), then the balance point falls to about 2.8 GHz once the chip has reached its new stable temperature, where I have a minimum wall power consumption. Both CorWatt and RAMWatt are higher at 3.1 GHz than at 2.8 GHz with the reduced cooling. And it's still using less power racing to idle at 2.8 GHz than it does at 1.8 GHz where I can't drop lower without dropping frames.

    To make 1.8 GHz do better than race-to-idle, I have to limit my fans to 3.3V. At this point, the reduced cooling is such that the balance point is 1.8 GHz. I can still run at 3.2 GHz, though, it's just that my chip gets a lot hotter than it does when my fans are allowed their full 12V.
    As for temperature effects, with 100% stress -c 4 load on all cores at 4.0 GHz, I get 81 W PkgWatt, 69 W CorWatt at 92°C (minimum fan speed), and 68 W PkgWatt, 57 W CorWatt at 63°C (maximum fan speed). Video playback is not a thermally significant load for my CPU cooler, and in that test, package temperatures ranged from 49-55°C, at minimum fan speed.

    With how low your computer's wall power is, it's possible that the fan motors themselves are significant.

    Just going by physics, my guess is that temperature just makes the efficiency fall-off at high frequency even more nonlinear, because of the vicious cycle of higher temperature -> higher leakage power -> higher temperature, and also needing more voltage for the same frequency at higher temperature, if your CPU's DVFS is fancy enough to do that. I think mine is too old.

    So, based on doing the test you recommend, and extending it to cover different cooling points, I find that race-to-idle is optimal as long as I choose the frequency that is power-optimal for the current thermals. With good cooling, performance is only slightly off optimal; as cooling gets worse, the power-optimal point for my CPU falls, until it drops below the second-lowest frequency I can set.

    And this is where the P-state driver gets potentially interesting - the chip can know its own thermal situation and current power-optimal point, and thus race-to-idle whenever the current power-optimal operating point is higher than the minimum speed to reach the goal.
    I think our main point of disagreement is whether the highest frequency the chip can run at is usually pretty close to the power-optimal frequency. I don't think it is; rather, I think it's well above the optimum.



  • agd5f
    replied
    Originally posted by Linuxxx View Post

    Wouldn't this lead to yet even more bloat in the 'schedutil' governor, thus spending even more resources to make "smart" decisions?

    All of this suggests to me that no matter how many years are spent trying to improve upon 'schedutil', it will forever remain an inferior solution to the 'performance' governor, for the simple reason that the latter needs no computationally expensive decision-making logic at all.

    (Hope that was comprehensible...)
    It depends on what you are looking for. The whole idea behind CPPC is to save power, not necessarily to maximize performance. The silicon is the same. If you want maximum performance, just choose a governor that always picks the max performance state. CPPC gives you finer-grained clock control, which should allow better tuning of the clock to the task. Ideally, the CPU would be operating at the maximum frequency available at the minimum voltage level most of the time.



  • farnz
    replied
    Originally posted by yump View Post

    I don't think that's the kind of performance they would be talking about. That's frequency selection, which is done by the governor (schedutil, conservative, performance, etc.). amd-pstate is a driver, which tells the governor what frequencies are available and conveys the governor's choice to the hardware.

    The supposed superiority of race-to-idle is exaggerated.

    Do your video decode test, and use

    Code:
    cpupower frequency-set -u <limit>GHz
    to turn down the clock frequency to the lowest you can go before the video decoder starts dropping frames.

    Then compare what turbostat says about the power consumption.
    Looking at my wall consumption, and using a software decoder, at 3.2 GHz (the clock speed of my system as set by the performance governor), I consume 18.5W. At 3.1 GHz, this drops to 18.4W. At 3.0 GHz, it goes back up to 18.5W. As I drop the frequency further, the wall consumption goes up to a peak of 20W at 1.8 GHz before it starts dropping frames.

    turbostat shows the balance here - at 3.0 GHz my core power consumption is lowest, but the package power consumption goes up because turbostat sees RAMWatt consuming more. At 3.1 GHz, I hit the balance point where while the cores consume a bit more, RAMWatt is lower, and PkgWatt goes down.

    Reducing frequency further continues the balancing - CorWatt goes down, but RAMWatt goes up by more than CorWatt falls by, causing PkgWatt to increase.

    If I limit fan voltage to 7V (== less cooling), then the balance point falls to about 2.8 GHz once the chip has reached its new stable temperature, where I have a minimum wall power consumption. Both CorWatt and RAMWatt are higher at 3.1 GHz than at 2.8 GHz with the reduced cooling. And it's still using less power racing to idle at 2.8 GHz than it does at 1.8 GHz where I can't drop lower without dropping frames.

    To make 1.8 GHz do better than race-to-idle, I have to limit my fans to 3.3V. At this point, the reduced cooling is such that the balance point is 1.8 GHz. I can still run at 3.2 GHz, though, it's just that my chip gets a lot hotter than it does when my fans are allowed their full 12V.

    So, based on doing the test you recommend, and extending it to cover different cooling points, I find that race-to-idle is optimal as long as I choose the frequency that is power-optimal for the current thermals. With good cooling, performance is only slightly off optimal; as cooling gets worse, the power-optimal point for my CPU falls, until it drops below the second-lowest frequency I can set.

    And this is where the P-state driver gets potentially interesting - the chip can know its own thermal situation and current power-optimal point, and thus race-to-idle whenever the current power-optimal operating point is higher than the minimum speed to reach the goal.



  • yump
    replied
    Originally posted by agd5f View Post

    I think the problem is related to the granularity and latency of switching between clock states. When using the CPPC pstate driver, the governor switches power states a lot more often than the old ACPI pstate driver because there are more states to choose from. The shared memory interface has a higher latency than the MSR interface. I think the proper fix would be to take the state-switch latency into account in the governor.
    --

    Originally posted by shmerl View Post

    How would the governor be able to mitigate the latency though? It sounds like a hardware limitation. Or you mean it can plan switching with some prediction of load and latency taken in account? And does AMD plan to update schedutil itself for that somehow?
    Check out /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us

    It's the only tunable schedutil has.
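    For anyone who wants to experiment with it, a quick sketch (the 10000 is just an illustrative value, in microseconds; on some kernels the file lives under per-policy directories such as policy0 instead):

```shell
# Show the current minimum interval between schedutil frequency updates (us).
cat /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us
# Raise it to 10 ms to make the governor switch less often (illustrative).
echo 10000 | sudo tee /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us
```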

    Personally, I think it ought to have at least a one-bit toggle between "save energy where you can without compromising performance, because I care about global warming/my electricity is metered" and "stay away from the extreme 4+ GHz turbo OPPs unless you absolutely need them, because this computer is battery-powered."



  • yump
    replied
    Originally posted by Linuxxx View Post

    Yes, especially in combination with the default "schedutil" governor, as Michael's benchmarks are going to reveal!
    However, as always, "performance" users will be fine.
    I don't think that's the kind of performance they would be talking about. That's frequency selection, which is done by the governor (schedutil, conservative, performance, etc.). amd-pstate is a driver, which tells the governor what frequencies are available and conveys the governor's choice to the hardware.

    Originally posted by Linuxxx View Post

    You're welcome!
    After all, sharing is caring, you know...

    And yes, the reason why the performance governor is so damn effective yet efficient is that it always goes for the maximum clocks immediately.
    However, even on my i5-6500 with Intel's measly stock cooler, this never causes any problems, because any modern CPU will go into deep-sleep states intermittently, thus saving more power than it would by working longer at reduced clock speeds.

    A good way to test this and see it in action for yourself is to play back a software-decoded video while periodically looking at the output of yet another very useful cpupower command:

    Code:
    sudo cpupower monitor
    Here, you can read the percentages of the different C-states on the right, where the higher the number, the deeper the sleep.

    At least on my Intel system, even though my CPU is under constant load from the video playback, I can observe that it still manages to enter all sleep states successfully, causing only a minor increase in heat output, depending on how heavy the video in question is to decode.

    If you could test this out on your Ryzen system, I would certainly be interested in your observations.

    Take care!
    Originally posted by farnz View Post

    Modern chips power gate when idle, so that significant chunks of the chip stop using power when they're not in use. There's also an optimal performance per Joule operating point, which is affected by the temperature of the chip, but is high on a cold chip.

    The ideal from a power consumption perspective is to "race to idle" at the maximum performance per Joule your chip can run at, and then power gate into a low power state. Performance gives you a close approximation to this, because the highest frequency the chip can run at right now is usually pretty close to the optimal power consumption frequency, whereas running at a lower frequency lowers energy use, but lowers performance by more than it lowers energy use.

    In theory, the AMD P-state driver can do better by tracking the current optimal power consumption frequency and sticking to that - so where performance might choose 3.2 GHz, P-state might choose 3.1 GHz.

    The supposed superiority of race-to-idle is exaggerated.

    Do your video decode test, and use

    Code:
    cpupower frequency-set -u <limit>GHz
    to turn down the clock frequency to the lowest you can go before the video decoder starts dropping frames.

    Then compare what turbostat says about the power consumption.
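    For the comparison step, one possible turbostat invocation (assumes a turbostat recent enough to support --show; which power columns appear depends on the CPU):

```shell
# Print 5-second summary samples of utilization, frequency, and power.
sudo turbostat --quiet --Summary \
    --show Busy%,Bzy_MHz,PkgWatt,CorWatt,RAMWatt --interval 5
```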



  • Linuxxx
    replied
    Originally posted by agd5f View Post

    I think the problem is related to the granularity and latency of switching between clock states. When using the CPPC pstate driver, the governor switches power states a lot more often than the old ACPI pstate driver because there are more states to choose from. The shared memory interface has a higher latency than the MSR interface. I think the proper fix would be to take the state-switch latency into account in the governor.
    Wouldn't this lead to yet even more bloat in the 'schedutil' governor, thus spending even more resources to make "smart" decisions?

    All of this suggests to me that no matter how many years are spent trying to improve upon 'schedutil', it will forever remain an inferior solution to the 'performance' governor, for the simple reason that the latter needs no computationally expensive decision-making logic at all.

    (Hope that was comprehensible...)



  • shmerl
    replied
    Originally posted by agd5f View Post

    I think the problem is related to the granularity and latency of switching between clock states. When using the CPPC pstate driver, the governor switches power states a lot more often than the old ACPI pstate driver because there are more states to choose from. The shared memory interface has a higher latency than the MSR interface. I think the proper fix would be to take the state-switch latency into account in the governor.
    How would the governor be able to mitigate the latency though? It sounds like a hardware limitation. Or you mean it can plan switching with some prediction of load and latency taken in account? And does AMD plan to update schedutil itself for that somehow?



  • agd5f
    replied
    Originally posted by shmerl View Post
    Shouldn't running at performance use more power than necessary to achieve the same task? That's the whole point of schedutil after all, to scale frequency depending on the load. If it's not happening in your case - something is wrong with the driver I assume.

    The whole point of amd-pstate is just to allow more granular control over frequency ranges. So surely you shouldn't be using performance for optimal operation if everything is working as intended. But it sounds like their current implementation still has some problems.
    I think the problem is related to the granularity and latency of switching between clock states. When using the CPPC pstate driver, the governor switches power states a lot more often than the old ACPI pstate driver because there are more states to choose from. The shared memory interface has a higher latency than the MSR interface. I think the proper fix would be to take the state-switch latency into account in the governor.
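    For what it's worth, the switch latency the governor could consult is already exported by cpufreq and can be inspected per CPU (value in nanoseconds):

```shell
# Transition latency reported by the cpufreq driver for CPU 0, in ns.
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency
```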



  • dpanter
    replied
    Originally posted by shmerl View Post
    Is there some easy way to check that P-state driver is being used? I know how to check that UEFI supports it at least.
    Might not fit into the definition of 'easy', but if nothing else it produces clean output.

    Code:
    echo "CPU scaling driver/governor: $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver)/$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"

    Result from my 8700K machine:
    Code:
    CPU scaling driver/governor: intel_pstate/performance
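    If the cpupower utility is installed, it can report the same thing; for example:

```shell
# Ask cpupower which cpufreq kernel driver is in use.
cpupower frequency-info --driver
```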



  • loganj
    replied
    How is the frequency on powersave? Last time I tried this with 5.17-rcX, I had 400 MHz on all cores on powersave. My 5750GE was slow, and the power consumption (at the wall) was the same as on 5.16 with powersave (1400 MHz).
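    For comparison, the frequency the governor has actually selected can be read per core straight from sysfs:

```shell
# Current scaling frequency (in kHz) for every core.
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
```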

