You're right about what should count as real-world usage. Did you notice that "auto-defrag" behaved badly? That matters, because auto-defrag is how it is used today on a desktop system. Why wasn't a fragmenter test run beforehand to simulate disk usage?
But have you noticed a relevant benchmark run here lately, one made with a good methodology? One that is neither sensational journalism, showing off a new hardware feature (like OpenGL 3.0 in a benchmark that depends on driver support) or a software feature (adding LLVM to Mono could speed up some benchmarks), nor too deep into what the results mean?