Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs
For those who missed our earlier article on the matter from early June, heavy workloads can cause problems on Ryzen, in particular segmentation faults, and there have also been reports of some stability problems.
This Google Doc remains among the resources tracking the issue on Linux, while the Gentoo Forums, AMD Forums, and other sites carry more reports of various problems encountered under extreme workloads, such as compiling a ton of code for hours on end, though the problem can also happen in other scenarios.
AMD hasn't publicly commented on the problem, and as of Linux 4.13 the issue is still happening. When the same tests are carried out on Intel CPUs, the segmentation faults do not occur. There is even ryzen-test for easily trying to reproduce the issue. The ryzen-test script builds GCC in parallel loops from a compressed ramdisk in order to stress the CPU. In my day-to-day benchmarking of Ryzen CPUs, however, I haven't hit this problem, nor have I hit it on my main production desktop, which uses a Ryzen 5. The problem seems to come to light only under very heavy and continuous workloads.
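As a rough illustration of the parallel-loop idea behind ryzen-test, here is a minimal Bash sketch. It is not the ryzen-test script itself: run_loops is a hypothetical helper, it repeats whatever command you hand it rather than a GCC build from a ramdisk, and it assumes Bash 4.3 or newer for wait -n.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ryzen-test: repeat a command in N
# parallel loops and report how long it took for the first loop to fail.
run_loops() {
    local loops=$1; shift
    local start=$SECONDS
    local pids=()
    local i
    for ((i = 0; i < loops; i++)); do
        # Each worker reruns the command until it exits non-zero.
        ( while "$@" >/dev/null 2>&1; do :; done ) &
        pids+=("$!")
    done
    # wait -n returns as soon as any one worker has stopped looping.
    wait -n || true
    echo "first failure after $((SECONDS - start)) seconds"
    # Stop the remaining loops; a command already in flight finishes on its own.
    kill "${pids[@]}" 2>/dev/null || true
    wait 2>/dev/null || true
}
```

You would call it with a loop count followed by any command that does a full rebuild each time it runs, so that every iteration keeps the CPU busy.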
I have been running some tests on a Ryzen 7 1800X in the MSI X370 XPOWER TITANIUM GAMING motherboard, using Linux 4.13 and Corsair DDR4-3200 memory. I have been using the latest BIOS, 7A31v17, which is built with AGESA 1.0.0.6.
Sure enough, in the stock configuration of the Ryzen 7 1800X and with the DDR4-3200 speed activated, the first segmentation fault with ryzen-test happened in just 88 seconds.
I also tried setting the memory to its default of DDR4-2133, but the issue still occurred, this time in 83 seconds. With SMT disabled, it takes much longer for the problem to come to a head, but it does eventually happen. Disabling SMT seems to be the closest thing to a workaround that those in the community who hit this problem more often have found, but then you lose half the threads of the CPU.
Besides using ryzen-test to hammer the CPU and easily reproduce the issue, I also decided to use phoronix-test-suite stress-run. The stress-run command within the Phoronix Test Suite has been used by enterprise customers for stress testing / burn-ins of hardware and checking for stability. Rather than benchmarking for performance, stress-run allows executing multiple test profiles in parallel for fully loading the system with whatever workloads you would like. Using PTS_CONCURRENT_TEST_RUNS=4 TOTAL_LOOP_TIME=60 phoronix-test-suite stress-run build-linux-kernel build-php build-apache pgbench apache redis will have the Phoronix Test Suite continually running four different benchmarks simultaneously for a period of 60 minutes. As soon as one test finishes, another is fired up. The stress-run algorithm randomly picks which tests of your set to run, but it does look at the test profiles so that, if the tests cover multiple subsystems, all of those subsystems are kept under stress at all times. The Phoronix Test Suite's stress-run functionality isn't advertised as much as its other features, but it is very useful for loading up a system with plenty of real-world workloads concurrently.
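Written out as a small script, that invocation could look like the following. The environment variables, the stress-run sub-command, and the test profile names are taken directly from the command in this article; the TESTS array and the run_stress function are only illustrative wrappers so the pieces are easier to adjust.

```shell
#!/usr/bin/env bash
# Test profiles named in the article; swap in whatever workloads you like.
TESTS=(build-linux-kernel build-php build-apache pgbench apache redis)

# Illustrative wrapper: nothing runs until run_stress is called.
run_stress() {
    # Four tests running concurrently, looping for 60 minutes.
    PTS_CONCURRENT_TEST_RUNS=4 TOTAL_LOOP_TIME=60 \
        phoronix-test-suite stress-run "${TESTS[@]}"
}
```

Calling run_stress then starts the same mix of compilation, PostgreSQL, Apache, and Redis workloads described above.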
In this case of multiple processes of code compilation, PostgreSQL, Apache, and Redis, Ryzen drops to its knees very quickly.
While with ryzen-test it could take up to half an hour to get a failure reported when SMT was disabled, with the Phoronix Test Suite stress-run on the Ryzen 7 1800X with eight cores and no SMT, I got the first segmentation fault just 229 seconds after the system booted. The segmentation faults then continued every few minutes in this configuration under the immense workloads.
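To put a similar number on your own run, one option is to read the timestamp of the first segfault line in the kernel log, since those timestamps count seconds since boot. This is a minimal sketch, not how the figure above was necessarily obtained: it assumes the crashes are unhandled ones that the kernel logs with its usual "segfault at" message and that dmesg prints its default bracketed timestamps. The names first_segfault_time and count_segfaults are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Both helpers read dmesg-style text on standard input.

# Print the whole seconds since boot of the first "segfault at" line.
first_segfault_time() {
    sed -n 's/^\[ *\([0-9]*\)\.[0-9]*] .*segfault at.*/\1/p' | head -n 1
}

# Print how many "segfault at" lines the log contains (0 if none).
count_segfaults() {
    grep -c 'segfault at' || true
}
```

For example, dmesg | first_segfault_time prints the time of the first logged fault, and dmesg | count_segfaults shows how many have piled up since boot.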
We'll see now if AMD will provide public comments or investigate further, as they now have another reproducible test case that slams the Ryzen chips hard in just a few minutes, even with SMT disabled and running at DDR4-2133. Whether this affects just Ryzen or also Threadripper and Epyc remains unclear. While there are many Windows reviewers out there now with Threadripper, it doesn't look like AMD will be sending any Threadripper samples to Phoronix, at least in the immediate days ahead, but I have asked if I can at least get SSH access to a TR system for a few hours to be able to run some Linux benchmarks. We'll see. For the Epyc server processors as well, no samples are available according to a motherboard vendor that has been trying to get them on my behalf.
Just to reiterate, while this problem is easy to cause under very heavy workloads, under normal Linux desktop workloads and even normal benchmarking, I haven't run into any Ryzen problems. I will be running some more Ryzen stress-tests today.
Update [7 August]: AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR