My big beef here is that you use a *fast* system for your tests. How about running each OS on a netbook, where the slower CPU, more limited memory, and slower disk would all highlight the differences better?
I've also wondered how many times each test is run, and whether you collect meaningful statistics. I'd love to see the mean, median and standard deviation for each test. I think in a lot of cases we'd see that the systems really are just tied, or that one has only a small improvement.
It would also be nice to see how reproducible each sub-test really is, which would tell us a lot about how useful each test really is.
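To make the statistics request concrete, here is a rough Python sketch of the kind of summary I mean. It is only an illustration: the function names `summarize` and `looks_tied` are made up, the "tied" rule (means closer together than the larger standard deviation) is a crude assumption rather than a proper significance test, and any numbers you feed it would have to come from real repeated runs.

```python
import statistics


def summarize(times):
    """Summarize repeated timings of one test (hypothetical helper).

    Returns the run count, mean, median, sample standard deviation, and
    coefficient of variation (stdev / mean), which is one simple way to
    express how reproducible a test is.
    """
    mean = statistics.mean(times)
    median = statistics.median(times)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(times) if len(times) > 1 else 0.0
    cv = stdev / mean if mean else float("nan")
    return {"n": len(times), "mean": mean, "median": median,
            "stdev": stdev, "cv": cv}


def looks_tied(times_a, times_b):
    """Crude assumption: call two systems tied when their means differ
    by no more than the larger of the two standard deviations."""
    a, b = summarize(times_a), summarize(times_b)
    return abs(a["mean"] - b["mean"]) <= max(a["stdev"], b["stdev"])
```

A low coefficient of variation would suggest a sub-test is reproducible; a high one would suggest that small differences between systems on that test are mostly noise.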
Another tweak would be to run each test multiple times, both with and without dropping the VM caches between runs, to see how much the VM and its caching help.
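For the cache comparison, something along these lines is what I have in mind on Linux. This is a sketch under assumptions: it relies on the kernel's `/proc/sys/vm/drop_caches` interface (writing `3` drops the page cache plus dentries and inodes, and needs root), the helper names `drop_caches` and `time_runs` are hypothetical, and other operating systems would need their own way of flushing caches.

```python
import os
import subprocess
import time

# Linux-only interface; writing to it requires root privileges.
DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(path=DROP_CACHES):
    """Flush dirty pages to disk, then ask the kernel to drop clean caches."""
    os.sync()
    with open(path, "w") as f:
        f.write("3\n")  # 3 = page cache + dentries and inodes


def time_runs(cmd, runs=5, cold=False, dropper=drop_caches):
    """Run cmd (an argument list) `runs` times and return wall-clock times.

    With cold=True the caches are dropped before every run; with
    cold=False they are left alone, so later runs benefit from caching.
    """
    times = []
    for _ in range(runs):
        if cold:
            dropper()
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times
```

Comparing `time_runs(cmd, cold=True)` against `time_runs(cmd, cold=False)` for the same command would show roughly how much of a result comes from caching rather than from the disk or filesystem itself.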
I do like these benchmarks, and they're certainly improving over time, but they could be better. More data please!