**yump** · 19 July 2022, 08:27 PM

Originally posted by jakubo

Whether you have a bright backlight or are running a headless server doesn't really matter. Take a job that runs for 5 days, after which you can turn the machine completely off and unplug it.
Also... the more power you need to keep the system up, the less impact frequency scaling would have.

The point is that most of the time you aren't using your computer for batch jobs or server tasks, where extra performance would let you power the whole machine off sooner, or not have to run as many machines.

Most of the time you're doing things like watching video, browsing the web, playing games. The system will be powered on for the same number of hours and execute about the same number of total instructions no matter what the performance is, as long as it's enough to keep up with the monitor refresh rate.

So when you race to sleep, the only power you save is what can be saved by putting the CPU cores/package in a higher C-state.
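That accounting can be sketched per display frame. This is only a rough illustration: the function name `frame_energy_mj` and every number below are made up, not measured, and which strategy comes out ahead depends entirely on the inputs.

```shell
#!/bin/bash
# Energy per display frame in millijoules (W x ms = mJ).
# Arguments: active watts, idle watts, busy ms per frame, frame period ms.
frame_energy_mj () {
    awk -v p_active="$1" -v p_idle="$2" -v busy="$3" -v period="$4" \
        'BEGIN { printf "%.1f\n", p_active * busy + p_idle * (period - busy) }'
}

# Hypothetical numbers, for illustration only.
# Racing: 20 W for 4 ms, then idle at 2 W for the rest of a 16 ms frame.
frame_energy_mj 20 2 4 16
# Pacing: 8 W for 8 ms, then the same 2 W idle.
frame_energy_mj 8 2 8 16
```

The frame period is fixed by the display, so the only term racing can shrink is the idle one, and only if the idle power actually drops in the deeper C-state.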

Originally posted by jakubo

Running in high frequencies also would mean that you are not data bound and the CPU runs few idle cycles. So the memory system delivers data close to optimally.

Rather the other way, I should think? Higher clock frequency means more stalled cycles fit in the nanoseconds that the CPU is waiting on data from memory. IIRC, the documentation for Intel's HWP implies it chooses lower frequencies in memory bound code.

Maybe that's what you meant? *If* HWP chooses high frequencies, you are probably not data bound?

Originally posted by jakubo

But we don't have to guess or calculate the power or the time. We measure both directly.

Quite so! See below...

Originally posted by Linuxxx

Again, that's all nice in theory, but when Michael's benchmark proves that Clear Linux with amd-pstate + performance beats all other options by having both the lowest temperatures & lowest power-draw on average, something doesn't seem to properly add up with your expectations:

https://www.phoronix.com/scan.php?pa...ne-linux&num=8

Cross-distro comparisons have way too many extra variables. Particularly, Clear Linux has compiler optimizations that would let it complete the benchmarks in less time and use less energy even if all the distros were locked to the same CPU frequency.

Originally posted by Linuxxx

And as I already said before, I am able to watch a 1440p60 AV1 video on my i7-11700F with intel_cpufreq + performance (Intel energy to performance bias set to 7 in a range of 0-15) without the fan spinning up from its low default state, so clearly race-to-sleep seems to be working as expected here.

"Does the fan spin up" is a wooly measurement, because it's affected by all sorts of things like ambient temperature and fan control strategy. And your CPU is so fast that decoding 1440p60 AV1 is a light load compared to the sort of 100% saturating SIMD batch jobs that should be within the capability of a well-designed CPU cooler.

But we can use turbostat to measure the energy usage directly, and I wrote a script to do this a while back. You'll want to set the path of $testvid, change $mhz_hi to the turbo clock of your CPU, and probably also s/schedutil/powersave/g because something that new will be using intel_pstate in HWP mode. Dependencies are mpv, stress, cpupower, gnuplot, and turbostat (the last is in the "kernel-tools" package in Fedora). Be aware that this needs a completely idle system and will take over your machine for an hour or so.
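Before editing the script it may help to check which cpufreq driver is active and what maximum frequency it reports. A minimal sketch, assuming the standard cpufreq sysfs layout; the helper names `cpufreq_attr` and `khz_to_mhz` are made up for illustration, and on some drivers `cpuinfo_max_freq` may not match the real turbo clock, so treat it as a starting point for $mhz_hi rather than the answer.

```shell
#!/bin/bash
# Print one cpufreq attribute of cpu0, or nothing if it is not readable.
# The optional second argument overrides the sysfs root.
cpufreq_attr () {
    local f="${2:-/sys/devices/system/cpu}/cpu0/cpufreq/$1"
    if [ -r "$f" ]; then cat "$f"; fi
}

# cpufreq reports frequencies in kHz; the script above wants MHz.
khz_to_mhz () {
    echo $(( $1 / 1000 ))
}

driver=$(cpufreq_attr scaling_driver)
max_khz=$(cpufreq_attr cpuinfo_max_freq)
echo "driver: ${driver:-unknown}"
if [ -n "$max_khz" ]; then
    echo "mhz_hi candidate: $(khz_to_mhz "$max_khz")"
fi
```

If the driver turns out to be intel_pstate, the available governors should be performance and powersave, which is why the substitution is needed.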

Code:

#!/bin/bash
# Measure average CPU package power during video playback while sweeping
# the maximum allowed frequency, once per governor.

testvid="/tmp/vid/testvid.webm"
mhz_lo=2000
mhz_hi=4200

# count logical CPUs (anchored so only the per-CPU "processor" lines match)
ncpu=$(grep -c '^processor' /proc/cpuinfo)

# Play 60 seconds of video under turbostat and print the average PkgWatt.
measure_power () {
    stress -q -c "$ncpu" -t 10 # preheat the die/IHS for 10 seconds
    sudo turbostat \
        --quiet --Summary --show PkgWatt --out "/tmp/turbostat.out" \
        -- \
        mpv \
            --no-terminal --ao=null --fs \
            --start=00:00:10 --length=60 "$testvid"
    tail -n1 "/tmp/turbostat.out"
    sudo rm "/tmp/turbostat.out"
}

collect () {
    echo "please make sure screensaver is suspended"
    sudo -v
    # preheat the CPU heatsink for 2 minutes
    sudo cpupower frequency-set -g performance -u 2000MHz
    stress -c "$ncpu" -t 120
    # clear old results
    truncate --size 0 results.dat
    # Test frequencies in random order
    for mhz in $(seq "$mhz_lo" 100 "$mhz_hi" | shuf); do
        sudo cpupower frequency-set -g schedutil -u "${mhz}MHz"
        pow_schedutil="$(measure_power)"
        sudo cpupower frequency-set -g performance -u "${mhz}MHz"
        pow_performance="$(measure_power)"
        printf "%s %s %s\n" "$mhz" "$pow_performance" "$pow_schedutil" \
            >>"results.dat"
    done
    # put results in frequency order (numeric, so 3-digit values sort first)
    sort -n -o "results.dat" "results.dat"
}

plot () {
    gnuplot <<EOF
        set terminal pngcairo size 800,600
        set output "plot.png"
        set xlabel "MHz"
        set ylabel "Watts"
        set autoscale x noextend
        set xtics 200 rotate by -45 offset -1
        plot "results.dat" using 1:2 with linespoints title "performance", \
            "" using 1:3 with linespoints title "schedutil"
EOF
}

collect
plot

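On an i5-4670K, the results look like this:
[Image: plot.png, package power in watts against the frequency cap in MHz, one line each for performance and schedutil]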

**Linuxxx** · 20 July 2022, 11:41 AM

Originally posted by yump

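On an i5-4670K, the results look like this:
[Image: plot.png, package power in watts against the frequency cap in MHz, one line each for performance and schedutil]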

That's genuinely interesting, but I have a question:

How come the difference between schedutil & performance becomes so large when both are hitting the max. frequency?

Since both are able to hit it, shouldn't the power consumption be the same there, too?

**yump** · 20 July 2022, 03:21 PM

Originally posted by Linuxxx

That's genuinely interesting, but I have a question:

How come the difference between schedutil & performance becomes so large when both are hitting the max. frequency?

Since both are able to hit it, shouldn't the power consumption be the same there, too?

The x-axis on the plot (and the argument to cpupower frequency-set -u) is the highest frequency the governor is allowed to use, not the highest frequency it actually uses. For the performance governor those are the same, but mpv video playback is apparently one of the tasks that schedutil handles pretty well, so it (mostly) stays below the max frequency because it doesn't need it.
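The gap between the cap and what the governor actually picks can be seen in sysfs. A minimal sketch, assuming the standard cpufreq policy directories; the function name `show_cap_vs_current` is made up for illustration, and `scaling_cur_freq` is only an instantaneous sample, so it needs to be read repeatedly during playback to mean much.

```shell
#!/bin/bash
# Print "policy cap_MHz current_MHz" for every cpufreq policy.
# The optional argument overrides the sysfs directory.
show_cap_vs_current () {
    local root="${1:-/sys/devices/system/cpu/cpufreq}"
    local p
    for p in "$root"/policy*; do
        # skip anything that lacks the two attributes (or an unmatched glob)
        if [ ! -r "$p/scaling_max_freq" ] || [ ! -r "$p/scaling_cur_freq" ]; then
            continue
        fi
        printf "%s %d %d\n" "$(basename "$p")" \
            "$(( $(cat "$p/scaling_max_freq") / 1000 ))" \
            "$(( $(cat "$p/scaling_cur_freq") / 1000 ))"
    done
}

show_cap_vs_current
```

For an averaged figure, I believe adding Bzy_MHz to turbostat's --show list reports the mean frequency while the cores were busy, which could go into the results next to PkgWatt.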

**jakubo** · 20 July 2022, 05:02 PM

Originally posted by yump

The point is that most of the time you aren't using your computer for batch jobs or server tasks, where extra performance would let you power the whole machine off sooner, or not have to run as many machines.

Most of the time you're doing things like watching video, browsing the web, playing games. The system will be powered on for the same number of hours and execute about the same number of total instructions no matter what the performance is, as long as it's enough to keep up with the monitor refresh rate.

So when you race to sleep, the only power you save is what can be saved by putting the CPU cores/package in a higher C-state.

That can hardly be an argument. So if a task drags itself into infinity or actually stalls completely, you would be happy? According to your metric that would be the case. "Prevent power consumption at all cost, the task wasn't that important anyway. Let's watch some videos instead". Sorry, maybe a bit much sarcasm, but I think you get my point.

Rather the other way, I should think? Higher clock frequency means more stalled cycles fit in the nanoseconds that the CPU is waiting on data from memory. IIRC, the documentation for Intel's HWP implies it chooses lower frequencies in memory bound code.

Maybe that's what you meant? *If* HWP chooses high frequencies, you are probably not data bound?

The last statement, yes. I believe idle cycles are one of the parameters used to decide when to clock a core down (or up). So if the core is clocked up, it means it CAN actually perform and is not waiting for IO, which would result in idle cycles and subsequently in down-clocking (roughly speaking, possibly depending on the size and latency/timing of the awaited IO).

But we can use turbostat to measure the energy usage directly, and I wrote a script to do this a while back. You'll want to set the path of $testvid, change $mhz_hi to the turbo clock of your CPU, and probably also s/schedutil/powersave/g because something that new will be using intel_pstate in HWP mode. Dependencies are mpv, stress, and turbostat ("kernel-tools" package in Fedora). Be aware that this needs a completely idle system and will take over your machine for an hour or so.

Ok, I get your point that tasks that use the highest frequencies will burn through more energy, as power scales roughly cubically with frequency while run time only shrinks linearly. However, there are sweet spots where using cache locality will save a lot of time and energy, and I believe that may be the case here, as this is what these cache attacks, or rather their mitigations, are all about, right? (flushes on context switches, confusing the Russians, making mainframe engineers cry...)
And as said before: I don't think we can have meaningful performance measurements, when we are streaming UHD films or are playing games meanwhile.
It may not be a batch job, but there may be another task you are waiting for that is not parallel but single-threaded and would profit from higher (turbo) frequencies, and those will not kick in because the power or thermal budget is already used up. Really, I don't see how we can debate the "usual case". That might actually be a discussion for a power-saving governor. Otherwise we would have to disable higher frequencies altogether.
I believe my point still stands in some of these cases as we see how much more power is consumed (a little) while the time goes down by a quarter.
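The frequency scaling argument in this post can be written out under the usual first-order CMOS model. This is only a sketch: it assumes dynamic power dominates, that voltage has to rise roughly in proportion to frequency, and that the work is compute bound. The symbols C (switched capacitance), V, f, t and E are introduced here and do not come from the thread.

```latex
\begin{align}
P_{\text{dyn}} &\approx C\,V^{2} f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3} \\
t &\propto \frac{1}{f} \quad \text{(fixed amount of compute-bound work)} \\
E &= P_{\text{dyn}}\, t \;\propto\; f^{2}
\end{align}
```

Under these assumptions power grows with the cube of frequency, while the energy for a fixed amount of work grows with its square. Static power and memory-bound stalls are both left out, and either one can shift the balance.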

**yump** · 21 July 2022, 09:36 AM

Originally posted by jakubo

That can hardly be an argument. So if a task drags itself into infinity or actually stalls completely, you would be happy? According to your metric that would be the case. "Prevent power consumption at all cost, the task wasn't that important anyway. Let's watch some videos instead". Sorry, maybe a bit much sarcasm, but I think you get my point.

That's not what I'm saying. An ideal CPU governor is supposed to slow down the CPU if and only if a task is throttled by external constraints, like if half CPU speed is enough to keep the outgoing network buffer full, or process the video in realtime, or update the GUI in time for the next display refresh, etc. Additionally, a governor that works inside the CPU, like intel_pstate HWP, should be able to identify external constraints like DRAM bandwidth that aren't visible to the kernel scheduler as threads blocking.

That's why benchmarks like this 12900K test are such an embarrassment for schedutil. A video encode is a batch job -- it should be running all out!

Other perf/W optimizations are outside of the responsibility of the CPU governor.

For tasks where you know it's not important to run them as fast as possible, like backups, the administrator or system daemon (Android has this, I think) should tell the kernel so with the utilization clamping framework. For the case where, "this is a laptop running on battery, 4+ GHz is stupid," the user (or something in the default configuration that acts on the user's behalf) should restrict the maximum CPU frequency with the interface in /sys/devices/system/cpu/cpufreq.
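The frequency cap can be applied by hand through that sysfs interface. A minimal sketch, assuming the per-policy `scaling_max_freq` files (which take kHz); the function name `cap_max_mhz` is made up for illustration, and on a real system it has to run as root.

```shell
#!/bin/bash
# Cap every cpufreq policy at the given frequency in MHz.
# The optional second argument overrides the sysfs directory.
cap_max_mhz () {
    local mhz="$1"
    local root="${2:-/sys/devices/system/cpu/cpufreq}"
    local p
    for p in "$root"/policy*; do
        # skip policies we cannot write to (or an unmatched glob)
        if [ ! -w "$p/scaling_max_freq" ]; then
            continue
        fi
        echo "$(( mhz * 1000 ))" > "$p/scaling_max_freq"
    done
}

# Example (as root): limit a laptop on battery to 2.5 GHz
# cap_max_mhz 2500
```

For the utilization clamping side, recent util-linux ships a uclampset tool; something like `uclampset -M 256 <command>` should lower the utilization hint for a background job, assuming a kernel built with uclamp support and a governor such as schedutil that acts on it.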

**jakubo** · 21 July 2022, 10:51 AM

Originally posted by yump

That's not what I'm saying. An ideal CPU governor is supposed to slow down the CPU if and only if a task is throttled by external constraints, like if half CPU speed is enough to keep the outgoing network buffer full, or process the video in realtime, or update the GUI in time for the next display refresh, etc. Additionally, a governor that works inside the CPU, like intel_pstate HWP, should be able to identify external constraints like DRAM bandwidth that aren't visible to the kernel scheduler as threads blocking.

That's why benchmarks like this 12900K test are such an embarrassment for schedutil. A video encode is a batch job -- it should be running all out!

Other perf/W optimizations are outside of the responsibility of the CPU governor.

For tasks where you know it's not important to run them as fast as possible, like backups, the administrator or system daemon (Android has this, I think) should tell the kernel so with the utilization clamping framework. For the case where, "this is a laptop running on battery, 4+ GHz is stupid," the user (or something in the default configuration that acts on the user's behalf) should restrict the maximum CPU frequency with the interface in /sys/devices/system/cpu/cpufreq.

I completely agree. All I was saying was that for some of these benchmarks "race to sleep" (i.e. higher frequencies) was actually beneficial. But that can't be seen in the power figure alone, because for energy one would have to integrate power over time (or timesteps), over t1 for one run and over t2 for the other, with t1 < t2. That's like measuring a car's fuel consumption only by looking at the current consumption, and then stating that one car consumed 10 l/100 km in each time frame and the other 5 l/100 km, without mentioning that the first was actually 3x as fast.
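Integrating over timesteps is simple to do from a log of power samples. A minimal sketch, assuming one reading in watts per line at a fixed sampling interval; the function name `integrate_watts` and the sample values are made up for illustration.

```shell
#!/bin/bash
# Sum power samples (one value in watts per line on stdin) into joules.
# Argument: the sampling interval in seconds. Non-numeric lines, such as
# a column header, are skipped.
integrate_watts () {
    awk -v dt="$1" '
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ { joules += $1 * dt }
        END { printf "%.1f\n", joules }'
}

# Hypothetical samples at 5-second intervals, for illustration only
printf "PkgWatt\n10.0\n20.0\n30.0\n" | integrate_watts 5
```

Two runs of different length can then be compared by total joules instead of average watts. As far as I know turbostat also has a --Joules option that reports energy directly, which would avoid the manual sum.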

An Important Note On The Alder Lake Mobile Power/Performance With Linux 5.19
