Continuing To Stress Ryzen
As a direct follow-up to yesterday's article about easily causing segmentation faults on AMD Zen CPUs, I ran another 24 hours of tests and have more to report today on how trivially Ryzen CPUs can be made to hit segmentation faults and, in some cases, system lock-ups.
If you didn't read yesterday's article, be sure to do so for more background information, my initial steps for causing segmentation faults on Ryzen, and my testing of some theories about the faults. So far I haven't been able to find a workaround that completely avoids the problem. This article is mostly about further attempts to nail down the issue, so far to no avail, and it also shows other areas where these segmentation faults can be reproduced on Ryzen.
For this article, all of my testing was done using the Phoronix Test Suite's stress-run functionality. As explained in yesterday's article, the stress-run command within the Phoronix Test Suite has been used by enterprise customers for stress testing and burn-ins of hardware and for checking stability. Rather than benchmarking for performance, stress-run executes multiple test profiles in parallel to fully load the system with whatever workloads you would like. Running PTS_CONCURRENT_TEST_RUNS=4 TOTAL_LOOP_TIME=60 phoronix-test-suite stress-run build-linux-kernel build-apache build-imagemagick will have the Phoronix Test Suite continually run four benchmarks simultaneously for a period of 60 minutes. As soon as one test finishes, another is fired up. The stress-run algorithm randomly picks tests from your set, but it does look at the test profiles: if the tests stress multiple subsystems, it tries to keep all of those subsystems under load at all times. The Phoronix Test Suite's stress-run functionality isn't advertised as much as its other features, but it is very useful for loading up a system with plenty of real-world workloads concurrently. This has been my main means of reproducing the Ryzen bug.
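For anyone wanting to script the same kind of run, here is a minimal sketch in Python that wraps the command quoted above. The two environment variables and the stress-run sub-command are taken from this article; the helper names build_stress_run and run_stress are hypothetical and made up purely for illustration, and the wrapper simply skips the run if phoronix-test-suite isn't found on the PATH.

```python
import os
import shutil
import subprocess


def build_stress_run(tests, concurrent=4, minutes=60):
    """Hypothetical helper: return (extra_env, argv) for a PTS stress-run.

    The environment variable names and the "stress-run" sub-command are the
    ones quoted in the article; everything else is an illustrative assumption.
    """
    if not tests:
        raise ValueError("at least one test profile is required")
    extra_env = {
        # Number of test profiles kept running at the same time.
        "PTS_CONCURRENT_TEST_RUNS": str(concurrent),
        # Total duration of the stress run, in minutes.
        "TOTAL_LOOP_TIME": str(minutes),
    }
    argv = ["phoronix-test-suite", "stress-run", *tests]
    return extra_env, argv


def run_stress(tests, concurrent=4, minutes=60):
    """Hypothetical helper: launch the stress run.

    Returns the process exit code, or None if phoronix-test-suite
    is not installed / not on the PATH.
    """
    extra_env, argv = build_stress_run(tests, concurrent, minutes)
    if shutil.which(argv[0]) is None:
        return None
    # Add the two settings on top of the current environment.
    env = {**os.environ, **extra_env}
    return subprocess.run(argv, env=env).returncode
```

Calling run_stress(["build-linux-kernel", "build-apache", "build-imagemagick"]) would be the equivalent of the one-liner above. Keeping the command construction separate from the launch makes it easy to check what would be executed before committing a machine to an hour of load.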
Update [5 August]: As a result of feedback, I am currently working on some updated results. As some have pointed out, the conftest segmentation faults aren't specific to Ryzen, so I am updating the tests to avoid confusion. One area also being explored now is the Clang segmentation faults shown in the original article, which do not originate from conftest, as well as Clang being able to hang the system hard, leaving it unresponsive with SSH not working. I am also incorporating more Ryzen-Kill tests as outlined in the aforelinked article. As many readers have pointed out, BSD developers have also discovered a Ryzen bug. More details soon.
Update [7 August]: More information in "AMD Confirms Linux Performance Marginality Problem Affecting Some, Doesn't Affect Epyc / TR".