Observations On Long-Term Performance/Regression Testing
Comments like the following were brought up in this ubuntu-devel thread:
"I'm extremely interested in having the various sub-teams come up with standard measurements so that when people make changes affecting performance, we can actually see it across all of these well-defined workloads.
I can think of only one that exists now: the boot-time graphs Scott manages.
Changes to the compiler toolchain, the kernel, etc, all have an impact on everyone's workloads, but most teams haven't actually stepped forward and said 'THIS workload is important to us, here's how to reproduce the measurement, and here's where we're tracking the daily changes to that measurement.'"
"That's something that has been on my mind for server workloads for quite a while. We test non-regression in builds, and we have some coverage for non-regression in features, but we have near-zero coverage for performance regressions. So we get hit (very late) by bugs about how slowly apache2 now serves pages in release N compared to N-1, at a point in the cycle where analysis and drastic solutions are no longer applicable."
Does any of this sound familiar? We can already achieve this with the Phoronix Test Suite and Phoromatic. Heck, we already do this with our Ubuntu Daily Tracker (as well as the Linux Kernel Daily Tracker), where the respective components are benchmarked automatically on a daily basis across multiple systems in an effort to spot regressions. We have been doing this for a number of months and have been trying to collaborate with Canonical.
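To illustrate the kind of daily check such a tracker performs, here is a minimal sketch (a hypothetical example, not Phoromatic's actual algorithm): compare a new benchmark result against the mean of recent runs and flag it when it moves past a chosen threshold.

```python
# Hypothetical sketch of a daily regression check: flag a new benchmark
# result that is worse than the recent baseline by more than a threshold.
# This is illustrative only, not Phoromatic's actual implementation.
from statistics import mean

def is_regression(history, today, threshold=0.05, higher_is_better=True):
    """Return True if today's result is worse than the mean of recent
    runs by more than `threshold` (a fraction, e.g. 0.05 = 5%)."""
    baseline = mean(history)
    change = (today - baseline) / baseline
    # For throughput-style metrics (requests/sec) a drop is a regression;
    # for time-style metrics (boot time) a rise is.
    return (-change if higher_is_better else change) > threshold

# Example: hypothetical Apache requests/sec over the past week,
# followed by a noticeably slower day and then a normal one.
week = [8450, 8510, 8390, 8470, 8430, 8500, 8460]
print(is_regression(week, 7900))  # roughly a 6.6% drop -> True
print(is_regression(week, 8440))  # within normal noise  -> False
```

A real tracker would also account for run-to-run variance (e.g. flagging only deviations beyond a few standard deviations), which is why multiple runs per day and per system matter.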
This Phoronix Test Suite-driven stack is easily extensible, and with Phoronix Test Suite 3.0 and other advancements arriving with Iveland, the possibilities for automated testing will expand even further.
Matthew Tippett has written a document outlining our observations and findings on long-term, automated testing. It can be read as the Phoromatic Long Term Testing Study.