What's happening with the VP8 libvpx benchmark on AMD?
There's a bunch of compute-bound-in-userspace ones (compiling, C-Ray, ffmpeg, x264) where the kernel doesn't matter at all, and that's unsurprising. But starting off that group of tests, there's the VP8 encoding one, where on AMD hardware .. WTF? A factor of two?
Almost exactly two, as though the application mis-guessed how many threads or processes to spawn. But only that one app on that one kernel with that particular hardware. Bug?
So, I guess what we're seeing here with the computation and bandwidth tests is what Con talks about: BFS isn't geared toward them. But the only benchmarks BFS clearly wins are the ones Con wrote himself to test latency. And what is a latency test without load? Chances are, if you're running an audio station you're not just piping data from the audio device to /dev/null. You'll also be loading the CPU with compression and/or effects processing, right? Or at least dumping the data to a block device, which comes back to the bandwidth tests. This problem is severely compounded when dealing with video, because compressing and decompressing video data is all but required when editing and/or recording.
I'm not sure I trust a benchmark that the developer writes, either. I've had some really bad experiences with that in the past when developers want to push new code to a production environment. A benchmark that shows they introduce no new load is invented, and when the code makes it into the real world, their pristine lab conditions are thrown out and their code falls on its face. I'm not trying to say Con doesn't have a good idea here! He clearly understands a particular facet of the scheduling problem, and I think he understands it very well. But BFS is only trying to solve for a small portion of the general problem of task scheduling, and it favors simplicity over complex scheduling algorithms.
There's almost certainly a happy medium between BFS and CFS, but since the falling-out between the kernel maintainers and Con, I don't think we're going to see everyone come together to try to work on it. And that to me is the real story.
The standard deviations on these tests (both for BFS and the stock kernel) tell me that the results are not reliable. This is even less scientific than the usual Phoronix benchmark, because drawing any conclusions from a graph with such a huge standard deviation is difficult, if not impossible. Basically you're pointing at the entire continent of Europe and saying "the criminal is somewhere in there", when proper detectives will at least figure out which town he's in.
If you ran the benchmarks at these standard deviations for a couple dozen iterations and plotted the results on a scatter plot, you'd see a collection of dots that looks something like "white noise" (random/chaotic/atmospheric data, like what you used to see on an analog TV set when it was not getting any reception). You can't draw any conclusions from that.
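To make the point about deviation concrete, here is a rough sketch of the check I have in mind. The run times below are made up for illustration (they are not from the article), and "do the mean +/- one standard deviation ranges overlap" is only a crude rule of thumb, not a proper significance test.

```python
from statistics import mean, stdev

def summarize(samples):
    """Return (mean, sample standard deviation) for a list of run results."""
    return mean(samples), stdev(samples)

def intervals_overlap(a, b):
    """True if the mean +/- one standard deviation ranges of a and b overlap."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    # Two intervals centred on the means overlap when the gap between the
    # means is no larger than the sum of the half-widths.
    return abs(mean_a - mean_b) <= sd_a + sd_b

# Hypothetical run times in seconds, invented for this example.
bfs_runs = [41.0, 55.0, 38.0, 60.0, 47.0]
stock_runs = [44.0, 58.0, 36.0, 52.0, 49.0]

if intervals_overlap(bfs_runs, stock_runs):
    print("Ranges overlap: the runs alone don't separate the two kernels.")
else:
    print("Ranges are separated: the difference is at least worth a closer look.")
```

With spreads this wide relative to the gap between the means, the two ranges sit almost on top of each other, which is the "somewhere in Europe" situation.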
I don't see this as a problem with Phoronix's methodology or even the PTS, but rather a factor of modern software engineering. To expect real engineering results from software is simply not feasible without knowing every single last detail about the system being tested. The kernel is simply too complex to do this, and that doesn't even take into account the hardware itself.
When it comes to tests on Phoronix that have a large deviation, I simply look at that graph as best case, average, worst case. If you were to graph a large enough set of data, I'm quite sure you could plot an average through the scatter graph. The workload isn't completely random, and there's almost certainly a maximum and minimum possible value with everything else falling in between.
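Reading a noisy graph as best case, average, worst case amounts to something like the sketch below. The function name and the sample numbers are my own, purely for illustration; whether lower or higher is "better" depends on the particular test, so that is left as a parameter.

```python
from statistics import mean

def best_avg_worst(samples, lower_is_better=True):
    """Collapse a set of noisy runs into (best, average, worst)."""
    lo, hi = min(samples), max(samples)
    if lower_is_better:
        # e.g. encode time in seconds: the smallest value is the best run
        return lo, mean(samples), hi
    # e.g. frames per second: the largest value is the best run
    return hi, mean(samples), lo

# Hypothetical encode times in seconds, invented for this example.
runs = [41.0, 55.0, 38.0, 60.0, 47.0]
best, avg, worst = best_avg_worst(runs)
print(f"best={best} avg={avg:.1f} worst={worst}")
```

It throws away the shape of the distribution, but for a quick read of a graph with wide error bars it is about as much as the data supports.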