Statistical Significance In Benchmark Results

  • Statistical Significance In Benchmark Results

    Phoronix: Statistical Significance In Benchmark Results

    For those of you following the developments of Phoronix Test Suite 2.2 (codenamed "Bardu"), some new benchmarking features were pushed into its Git tree this week. The latest Phoronix Test Suite 2.2 code now has better FreeBSD 8.0 compatibility and support for network proxies with network communication, but larger than that is new support for ensuring test results are statistically significant. When any test profile is set to run multiple times, the Phoronix Test Suite is now capable of computing the standard deviation between each of the test runs...

    http://www.phoronix.com/vr.php?view=NzU2MA
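The per-run standard deviation described above can be illustrated with a minimal sketch. This is not the Phoronix Test Suite's actual code; the function name `summarize_runs` and the timings are hypothetical, chosen only to show the calculation.

```python
import statistics


def summarize_runs(results):
    """Return (mean, sample standard deviation, deviation as % of mean)."""
    mean = statistics.mean(results)
    stdev = statistics.stdev(results)  # sample standard deviation (n - 1 denominator)
    return mean, stdev, 100.0 * stdev / mean


# Hypothetical timings (seconds) from three runs of one test profile.
runs = [41.8, 42.3, 42.1]
mean, stdev, relative = summarize_runs(runs)
print(f"mean={mean:.2f}s  stdev={stdev:.2f}s  ({relative:.2f}% of mean)")
```

A large relative deviation is a hint that more runs are needed before the averaged result can be trusted.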

  • #2
    Excellent addition. Will a feature also be added to put error bars on the graphs, so the final standard deviation is visible on the charts?

    • #3
      Originally posted by chaos386
      Excellent addition. Will a feature also be added to put error bars on the graphs, so the final standard deviation is visible on the charts?
      You can already view the spread (and have been able to for the past few months) using "phoronix-test-suite analyze-all-runs <result>". That said, I may end up building support into the Adobe SWF/Flash renderer so that the additional information is embedded in the graph itself and can be displayed on mouse-over, when clicking a button, or by some other means, such as when results are displayed on Phoronix.com.
      Michael Larabel
      http://www.michaellarabel.com/

      • #4
        but larger than that is new support for ensuring test results are statistically significant. When any test profile is set to run multiple times, the Phoronix Test Suite is now capable of computing the standard deviation between each of the test runs...
        I just registered for these forums so I could say: "Thank you!". This can add some real meaning to the Phoronix test results, rather than only giving a feel of what might be going on.

        One thing to be careful of when increasing the number of runs is the difference between statistical significance and practical significance. Given enough runs, almost any real difference, however small, will become statistically significant. But a statistically significant difference of 0.5% is of no practical significance (there's usually not much point in scoring a "win" for an application or device by such a small amount, even if it is a real difference).
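To make the point about run counts concrete, here is a rough sketch under simplifying assumptions: a fixed 0.5% difference between two systems, a known run-to-run standard deviation of 2% shared by both, and a plain z-test (for small run counts a t-test would be more appropriate). All numbers and the helper name `z_statistic` are hypothetical.

```python
import math


def z_statistic(mean_a, mean_b, stdev, n):
    """Two-sample z statistic for two groups of n runs sharing one standard deviation."""
    standard_error = stdev * math.sqrt(2.0 / n)
    return (mean_b - mean_a) / standard_error


# Hypothetical scores: B differs from A by 0.5%; run-to-run stdev is 2% of the mean.
mean_a, mean_b, stdev = 100.0, 100.5, 2.0
for n in (3, 30, 300, 3000):
    z = z_statistic(mean_a, mean_b, stdev, n)
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"runs={n:5d}  z={z:5.2f}  {verdict} at the 5% level")
```

The difference stays at 0.5% throughout; only the number of runs changes whether it clears the significance threshold.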

        Anyway, I'll say thanks again. Winner of best feature award for sure.

        • #5
          Hi Michael.

          Have you considered adding some kind of ANOVA function to the PTS? Having a confidence interval (95% or something) on each graph would be very useful, I think.

          For example in the BFS article, while you imply that BFS is faster for PHP compilation, I suspect that the difference is statistically insignificant, and BFS cannot really be said to be faster with any reasonable confidence.

          For an example of the sort of analysis I mean, see here.
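Independent of the analysis linked above, the 95% confidence interval idea can be sketched as follows. This is a simplified stand-in, not a full ANOVA and not a PTS feature: it computes a Student's t interval per result set and checks whether two intervals overlap, which is only a conservative heuristic for "no clear difference". The timings and function names are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for 1..10 degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def confidence_interval_95(results):
    """Return the (low, high) 95% confidence interval for the mean of the runs."""
    n = len(results)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch only handles 2 to 11 runs")
    mean = statistics.mean(results)
    half_width = T_CRIT_95[n - 1] * statistics.stdev(results) / math.sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical compile times (seconds) under two schedulers, five runs each.
scheduler_a = [61.2, 60.5, 62.0, 61.1, 60.8]
scheduler_b = [60.4, 61.6, 60.1, 61.0, 60.7]
ci_a = confidence_interval_95(scheduler_a)
ci_b = confidence_interval_95(scheduler_b)
print(f"A: {ci_a[0]:.2f}..{ci_a[1]:.2f}  B: {ci_b[0]:.2f}..{ci_b[1]:.2f}")
print("intervals overlap" if intervals_overlap(ci_a, ci_b) else "intervals do not overlap")
```

When the intervals overlap substantially, as with these made-up numbers, declaring one scheduler faster is hard to justify without more runs or a proper significance test.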

          • #6
            Thanks. Great addition.

            • #7
              Thanks for the info!

              As you probably know I'm new to this forum, so please excuse me if the following has been covered or is out of context.

              What I want to know is what safeguards are in place to make sure that the latest Turbo Boost based processors are loaded to the point that thermal throttling is discovered. It is my position that tricking out a system for maximum performance is all well and good, but if those benchmark numbers don't translate into valid figures for normal implementations of a chip, then you haven't done your readers much of a favor.

              So let's say your benchmark runs a series of video encoding tasks, which ought to load the processor across all cores. Initially, for a small file, this may not load the chip to the point that thermal throttling is noticeable. But what happens if we have something less than a high-performance cooler and suboptimal thermal conditions, something that reflects most home-based systems?

              I ask this because of the Intel-based rebuttal to your earlier Lynnfield tests. I'm certain that the BIOS issue was real; after all, this is a brand new product. But I have to wonder about the differences in the results, which really don't make sense. It makes me wonder if the processors might have been sitting under a huge air conditioner, as this would likely keep the cores running at higher clock rates.

              I bring this up because we really haven't had a processor quite like this on the market in the past. Thus it is hard to offer up a clear picture of what one can expect out of Lynnfield given non-optimal conditions. The sad thing is that we are talking about a big difference in performance based on how well the chip can cool itself over time. So it would make sense to test a given processor with a variety of heat removal capabilities to see just how much of a regression we will see with those different heat sinks. A simple question might be: how long does Intel's stock cooler allow a Lynnfield to benefit from Turbo Boost over the course of a long video encoding, in a room free of air conditioning?

              As you can see, I'm puzzled by what sort of performance a person making an average investment in Lynnfield would get. Even with Intel's tests, which in some cases I find bogus, it looks like an AMD chip works just as well.


              Dave
