Some AMD GPUs Affected By A Nasty Power Regression That Snuck Into Linux 4.18 Stable
A Phoronix reader emailed in to report that since the recent Linux 4.18.10 stable kernel, the power usage on his system has increased by around 50 Watts while idling. That is not an overall AC system power draw of 50 Watts, but an increase of roughly that amount on the latest 4.18 stable point releases. I've now been able to reproduce the issue as well as bisect the cause.
Besides the reader's own experience, he also pointed out some reports on Reddit of the power consumption being much higher on these latest Linux 4.18 point releases but without any bisecting or narrowing down of the problem. For those without power meters, some individuals have reported higher system temperatures with these post-4.18.9 point releases. The issue is also present in the current Linux 4.19 code.
Fortunately, with plenty of hardware around and always an interest in big kernel regressions, I tested a few systems. I quickly realized it was an AMD graphics issue, but not one that applies across the board: with Vega GPUs, for example, I didn't see any change in the idle AC power draw.
One of the affected cards I found was the commonly used Radeon RX 580 graphics card, so with the Phoronix Test Suite I set out to run an automated bisect of the Linux 4.18 stable tree.
I ran the Radeon RX 580 in an AMD Threadripper 2990WX system to speed up the kernel build times. On that system, the newer 4.18 kernel point releases led to an idle power increase of about 20 Watts.
Fortunately, having tracked down several Linux kernel power regressions over the years and having built infrastructure into the Phoronix Test Suite for automating nearly the entire process, bisecting is fairly straightforward when the issue is easy to reproduce and the hardware and time are on hand.
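For readers curious what an automated power bisect boils down to, here is a minimal Python sketch. It is not the Phoronix Test Suite's implementation: the function names, the 10 Watt threshold, and the commit labels are all hypothetical, and it assumes a linear, ordered list of commits where idle power flips from good to bad exactly once. A real `git bisect` also has to handle merges and requires building and booting each kernel, which is left out here.

```python
from statistics import median
from typing import Callable, Sequence


def is_regressed(samples_watts: Sequence[float], baseline_watts: float,
                 threshold_watts: float = 10.0) -> bool:
    """Return True when the median idle draw exceeds the baseline by more
    than the threshold. The median damps one-off spikes in the readings.
    The 10 W default is an arbitrary illustrative value."""
    return median(samples_watts) - baseline_watts > threshold_watts


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Binary-search an ordered commit list for the first bad commit.

    Assumes at least two commits, that commits[0] is known good, that
    commits[-1] is known bad, and that good turns to bad exactly once.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # regression is at mid or earlier
        else:
            good = mid   # regression is after mid
    return commits[bad]


if __name__ == "__main__":
    # Hypothetical idle readings in Watts, keyed by made-up commit labels.
    readings = {
        "c0": [90.0, 89.5, 90.5], "c1": [90.2, 90.0, 89.8],
        "c2": [89.9, 90.1, 90.0], "c3": [90.3, 90.0, 95.0],
        "c4": [90.0, 90.1, 89.7], "c5": [110.0, 109.5, 110.4],
        "c6": [110.2, 110.0, 111.0], "c7": [109.8, 110.1, 110.3],
    }
    order = sorted(readings)
    culprit = first_bad_commit(
        order, lambda c: is_regressed(readings[c], baseline_watts=90.0))
    print(culprit)
```

With eight candidate commits this needs only three measurements rather than eight, which is why bisecting stays practical even when every step means a full kernel build and an idle power run.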
Long story short, the kernel commit causing this notable AMDGPU power regression is this commit, which comes as something of a surprise since it is meant to fix DP/HDMI display issues but touches the clock code for SMU7/SMU8 hardware.
If you enjoy my daily Linux hardware benchmarking, which behind the scenes is almost always the result of 100+ hour work weeks by yours truly, consider showing your support by joining Phoronix Premium and/or making a PayPal tip. At the very least, please do not use any ad-blocker when browsing this site. Any tips coming in over the next few days I'd love to put towards buying some additional USB-interfacing power meters, as currently I have just one for interfacing with the Phoronix Test Suite and that often bottlenecks my ability to deliver power/perf-per-Watt metrics. I could be doing a lot more power tests otherwise.
Presumably this issue will get fixed in short order for Linux 4.18+, but it's unfortunate that such a regression is able to reach the stable tree in the first place. The QA/CI setup for the AMD Linux open-source driver stack doesn't appear to be quite as advanced as the CI systems Intel has put in place over the years or those running internally at NVIDIA, but hopefully that will improve with the AMD open-source graphics driver stack nearing parity with their Windows driver and the green competition.