Linux 2.6.24 Through Linux 2.6.33 Benchmarks


  • #16
    Originally posted by atmartens View Post
    Agreed. If it takes too much time, perhaps someone else out there could chip in - you make the graphs and raise some questions, and someone else, maybe someone who works on these software projects, can explain.
    Benchmarking loss (which is really what we are talking about here) stomps on egos pretty hard. Most times that I have reached out proactively, the response from the developers has been quite painful.

    Realistically, reaching out to developers individually does usually get better traction, but that means going around the project's due process of posting to a list, which raises the bar for getting to the bottom of the problem even further.

    I guess that discussing the likely impacted area is possibly the next increment in analysis, but even then you will still have people turning around and saying "you have no idea what you are talking about, stop spreading FUD".

    Not an easy problem to fix, and *very* costly to make the results valuable.

    Comment


    • #17
      Originally posted by maccam94 View Post
      I appreciate the great job Phoronix does on reporting news in the Linux community, but I find that the benchmarking articles could be much better. I don't need someone to show me a graph and then list the statistics in the text below the graph. The graph shows the statistics already. These articles fail to draw any real conclusions about the results. Rather than saying "these numbers went down, these numbers went up, and these numbers stayed the same," Phoronix should look into *why* changes occur. I'm not saying that you have to research every regression you find, but at least put a little effort into finding a couple of real interesting development notes to provide some solid information along with the figures.
      That's the problem with regressions. If the developers knew a change was going to cause a performance delta, then it shouldn't be a surprise (performance, as expected, went down due to this change). The issue is that most of the time a performance regression (good or bad) is a confluence of other issues which don't always make sense even to the developers working on the component themselves.

      In an environment that is poor in testing and benchmarking (most Open Source projects), getting a hypothesis raised and validated is almost impossible. What makes it even worse is that a lot of people have huge personal investment in a project, and telling them that they have broken it or that it is slow cuts straight through the ego.

      Comment


      • #18
        Originally posted by maccam94 View Post
        I don't need someone to show me a graph and then list the statistics in the text below the graph. The graph shows the statistics already.
        This is standard practice.

        All scientific journals require this - tell the readers in words what the graphs and tables say anyway.

        The benefit is also that the results become searchable through search engines.



        Comment


        • #19
          Originally posted by mtippett View Post
          Benchmarking loss (which is really what we are talking about here) stomps on egos pretty hard. Most times that I have reached out proactively, the response from the developers has been quite painful.

          Realistically, reaching out to developers individually does usually get better traction, but that means going around the project's due process of posting to a list, which raises the bar for getting to the bottom of the problem even further.

          I guess that discussing the likely impacted area is possibly the next increment in analysis, but even then you will still have people turning around and saying "you have no idea what you are talking about, stop spreading FUD".

          Not an easy problem to fix, and *very* costly to make the results valuable.
          A couple of options you could take are:
          1. Correlate tests to the subsystems they stress, then do a quick search through those subsystems' changelogs/bug reports.
          2. Do not focus on the fact that the numbers changed. Ask why the results of a benchmark might have changed with default settings. Developers might be interested in explaining how they found a clever new way to boost performance, or that performance decreased in order to increase safety.
          3. Invite developers to comment on the results before posting them.

          Yes, these options would require some extra work, but I think your readers (and the development community) would really appreciate it.
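
          For the first of those options, even a quick changelog search goes a long way. As a rough sketch in Python - assuming a local kernel git checkout, and with the tag names and file paths only illustrative for kernels of this era:

            # Sketch: list commits touching the scheduler between two kernel releases.
            # Assumes a local Linux git checkout; tag names and paths are illustrative
            # (kernel/sched.c and kernel/sched_fair.c are the 2.6.3x scheduler files).
            import subprocess

            KERNEL_TREE = "/path/to/linux"              # hypothetical checkout location
            RELEASE_RANGE = "v2.6.32..v2.6.33"          # the two kernels being compared
            SUBSYSTEM_PATHS = ["kernel/sched.c", "kernel/sched_fair.c"]

            log = subprocess.run(
                ["git", "log", "--oneline", RELEASE_RANGE, "--"] + SUBSYSTEM_PATHS,
                cwd=KERNEL_TREE, capture_output=True, text=True, check=True,
            ).stdout

            for line in log.splitlines():
                print(line)   # candidate commits to read when a scheduler-heavy test moves

          A dozen one-line commit summaries is not a root-cause analysis, but it gives a short list of suspects to put next to the graphs.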

          Comment


          • #20
            Originally posted by maccam94 View Post
            A couple of options you could take are:
            1. Correlate tests to the subsystems they stress, then do a quick search through those subsystems' changelogs/bug reports.
            2. Do not focus on the fact that the numbers changed. Ask why the results of a benchmark might have changed with default settings. Developers might be interested in explaining how they found a clever new way to boost performance, or that performance decreased in order to increase safety.
            3. Invite developers to comment on the results before posting them.

            Yes, these options would require some extra work, but I think your readers (and the development community) would really appreciate it.
            I'll let Michael make comments on the reporting.

            My view is that the impact of different subsystems is heavily dependent on the interactions between different parts of the system. In a lot of cases the changelogs may give an indication, but it would usually take domain expertise in that subsystem to be able to correlate the two.

            I agree that more information would be useful, but I am not sure how much a detailed analysis of each regression would add. I'd expect the collective intelligence of the forums would have more luck crowd-sourcing the trigger than Michael or myself digging deep.

            The numbers are there, the tests are there and the kernels are there. If anyone is willing to dig deep to understand the difference, I would be very interested to know how far they get. There is no barrier for anyone to reproduce the results and Fight the Good Fight to understand what is going on.

            Any takers?

            Matthew

            Comment


            • #21
              If only Linus used the Phoronix Test Suite...

              I read posts from kernel devs stating that they want test results to see improvements, failures, etc. Well, here is real data.

              It would be nice if you didn't have to recompile in order to change most of the kernel parameters: number of CPUs, high memory, timer frequency, dynamic ticks, CPU architecture.
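
              To spell out which build-time knobs those are, here is a small Python sketch that pulls them out of the running kernel's config - assuming the distro installs /boot/config-<release>, and with option names from the 2.6.x era that vary by architecture:

                # Sketch: print the compile-time options behind "number of CPUs, high
                # memory, timer frequency, dynamic ticks, CPU architecture". Assumes the
                # distro installs /boot/config-<release>; names vary by arch and version.
                import os

                config_path = "/boot/config-" + os.uname().release
                wanted = (
                    "CONFIG_NR_CPUS=",     # number of CPUs
                    "CONFIG_HIGHMEM",      # 32-bit high memory (HIGHMEM4G/HIGHMEM64G)
                    "CONFIG_HZ=",          # timer frequency
                    "CONFIG_NO_HZ",        # dynamic ticks
                    "CONFIG_MCORE2",       # one of the processor-family choices
                    "CONFIG_GENERIC_CPU",  # ...or the generic x86_64 fallback
                )

                with open(config_path) as cfg:
                    for line in cfg:
                        if line.startswith(wanted):
                            print(line.rstrip())   # each of these needs a rebuild to change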

              Check out amd64. I regress

              Comment


              • #22
                Originally posted by mtippett View Post
                The numbers are there, the tests are there and the kernels are there. If anyone is willing to dig deep to understand the difference, I would be very interested to know how far they get.
                Do what you are doing, publish the numbers.

                One thing, though: if only one app sees a regression, it might be that that particular app is doing something wrong. If a number of apps regress on the same kernel, then it may well be a kernel regression.

                As benchmark time is limited, I would use as many PTS benchmarks as possible, but not run each one for long. Instead of 5-minute runs for three applications, one could use 30-second runs for 30 applications; both add up to 900 seconds of benchmarking.
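
                As a sketch of what I mean - assuming PTS's batch-benchmark mode and the FORCE_TIMES_TO_RUN override, both of which should be checked against the PTS version in use, and with the test names only examples:

                  # Sketch: many short runs instead of a few long ones, under a fixed time
                  # budget. Test profile names are examples; batch-benchmark and
                  # FORCE_TIMES_TO_RUN should be verified against the installed PTS version.
                  import os, subprocess

                  TESTS = ["pts/apache", "pts/postmark", "pts/gnupg"]   # ...grow toward ~30 profiles

                  env = dict(os.environ, FORCE_TIMES_TO_RUN="1")        # one short run per test
                  for test in TESTS:
                      subprocess.run(["phoronix-test-suite", "batch-benchmark", test],
                                     env=env, check=True)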

                Is this feasible?

                Comment


                • #23
                  Originally posted by mtippett View Post
                  Skip to the next response in that thread.

                  I turned on apache and played with ab a bit, and yup, ab is a hog, so any fairness hurts it badly. Ergo, running ab on the same box as apache suffers with CFS when NEW_FAIR_SLEEPERS are turned on. Issuing ab bandwidth to match its 1:N pig nature brings throughput right back.


                  http://lkml.indiana.edu/hypermail/li...9.1/02861.html

                  Remember that you can't test everything, and testing the obvious path will usually result in flat lines - since it represents the 95% path.

                  As indicated above, what has been identified is that in some scenarios CFS completely tanks. The ab is just a tool to make this visible.
                  In this scenario with fair sleepers enabled, yes. However, this scenario has little to do with reality unless someone runs apache and ab on the same single machine, which is not recommended.* I think it's natural that the scheduler is tuned to perform well in real situations. So, what's the point of this benchmark?

                  As per usual, if there is any benchmark which you believe provides a suitable equivalent scenario but is more "correct", please tell us.
                  Maybe replace "correct" with "more meaningful". The problem is that I'm not sure what an equivalent scenario to this benchmark would be, and if there is no such scenario, then this benchmark just means: we got different results in an Apache benchmark running on the same machine, which isn't recommended. Btw, what scenario were you talking about? Such as *?

                  Comment


                  • #24
                    Originally posted by mtippett View Post
                    A regression is an unexpected change in behavior. If the kernel developers make a change in one area and are not expecting the behavior to change in other areas, then those areas have regressed.
                    If a developer decided to change the default file system mode to some other one, it's not a regression, because it is an expected change in the file system's behavior (it is also known that it will affect some benchmarks). Michael isn't a dev, is he?

                    I'd like you to expand on your "not done properly" if you could.
                    The recommended way is to run ab on a different machine; that's why I consider that it wasn't done properly - or, if you like, this benchmark is strange in my opinion.

                    Comment


                    • #25
                      Originally posted by sabriah View Post
                      This is standard practice.

                      All scientific journals require this - tell the readers in words what the graphs and tables say anyway.

                      The benefit is also that the results become searchable through search engines.


                      No, they don't.

                      Scientific journals require authors to describe in words what figures and tables show AND to draw a valuable conclusion from those numbers (something that isn't done here, obviously). If you don't, you get your paper rejected.

                      Comment


                      • #26
                        Originally posted by Xheyther View Post
                        No, they don't.

                        Scientific journals require authors to describe in words what figures and tables show AND to draw a valuable conclusion from those numbers (something that isn't done here, obviously). If you don't, you get your paper rejected.
                        I agree with what you say about the AND.

                        BUT, and the but is big, here we are talking about Phoronix's role as a whistleblower. They didn't write the code, and debugging someone else's code is a nightmare, even for Freddy on Elm Street.

                        I never expect them to identify the pivotal change in the code. Heck, even deciding which of several possible layers (e.g. app or kernel) is responsible can be worse than difficult.

                        However, I do think the ones who should draw the valuable conclusions you mention from the numbers presented at Phoronix are the developers. Who else can interpret them with comparatively minimal effort, and fix the problems?

                        Showing the world system-based regressions is one of several important ways to catch bugs, and I applaud Phoronix for taking on this task.

                        I also realize that their use of default settings is a pragmatic choice, not suitable for all practices. But tweaked settings rapidly lead into inescapable permutation hell; in how many ways can you fine-tune web servers and databases?! Which is the least silly setting? Well, the default, because that is the one everyone has access to.



                        Comment


                        • #27
                          Originally posted by kraftman View Post
                          If a developer decided to change the default file system mode to some other one, it's not a regression, because it is an expected change in the file system's behavior (it is also known that it will affect some benchmarks). Michael isn't a dev, is he?
                          It's a game of whack-a-mole. You make a change with an expected mole to be whacked. Once the change is made, three unexpected moles pop up.

                          Industry handles this with a formal environment (testing, QA, etc.) and the metrics to catch the unexpected ones.

                          The recommended way is to run ab on a different machine; that's why I consider that it wasn't done properly - or, if you like, this benchmark is strange in my opinion.
                          For determining the expected performance of apache, yes, I agree that you should have ab and the server on different machines. But remember that we are not testing the apache installation. The component under test is the kernel in this instance, or at the very least different hardware.

                          What we are showing is that there is a synthetic load that is strongly affected by the kernel changes. If we called it "pig-test" the results would still be the same.
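
                          For anyone who wants to poke at it themselves, the experiment is small. A sketch in Python of the kind of A/B run involved - assuming root, a mounted debugfs, apache already serving on localhost, and a 2.6.3x kernel whose scheduler still exposes NEW_FAIR_SLEEPERS in /sys/kernel/debug/sched_features (the feature names move around between versions):

                            # Sketch: compare local ab throughput with the NEW_FAIR_SLEEPERS
                            # scheduler feature on and off. Assumes root, a mounted debugfs, an
                            # apache instance on localhost, and a kernel that still has this flag.
                            import re, subprocess

                            FEATURES = "/sys/kernel/debug/sched_features"

                            def set_feature(name, enabled):
                                # Writing "FOO" enables a feature, "NO_FOO" disables it.
                                with open(FEATURES, "w") as f:
                                    f.write(name if enabled else "NO_" + name)

                            def ab_requests_per_second():
                                out = subprocess.run(
                                    ["ab", "-n", "50000", "-c", "100", "http://127.0.0.1/"],
                                    capture_output=True, text=True, check=True).stdout
                                return float(re.search(r"Requests per second:\s+([\d.]+)", out).group(1))

                            for enabled in (True, False):
                                set_feature("NEW_FAIR_SLEEPERS", enabled)
                                print("NEW_FAIR_SLEEPERS", "on" if enabled else "off",
                                      ab_requests_per_second(), "requests/sec")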

                          Comment


                          • #28
                            Originally posted by mtippett View Post
                            But remember that we are not testing the apache installation. The component under test is the kernel in this instance, or at the very least different hardware.
                            Right. This makes some things clearer.

                            Comment


                            • #29
                              Originally posted by mtippett View Post
                              I'll let Michael make comments on the reporting.

                              My view is that the impact of different subsystems is heavily dependent on the interactions between different parts of the system. In a lot of cases the changelogs may give an indication, but it would usually take domain expertise in that subsystem to be able to correlate the two.
                              Here's a good example:

                              http://airlied.livejournal.com/69074.html

                              Dave, a veritable graphics guru, had to ponder and run further benchmarks. And even then he still has concerns about what and where the real trade-offs will be. Understanding the reason for a regression is absolutely a specialty. Making sure the tests allow for easy analysis is probably the primary area where we can add value.

                              All in all, what a good regression benchmark needs to have is sensitivity to different areas of the system under test. A targeted benchmark for making a purchase decision is a whole different ball game. I am sure Michael is open to targeting some runs to particular areas, if they are of general interest.

                              Unfortunately, choosing your kernel and filesystem for peak web server performance isn't really what the general populace is interested in.

                              Comment


                              • #30
                                Originally posted by mtippett View Post
                                But remember that we are not testing the apache installation.
                                [...]
                                What we are showing is that there is a synthetic load that is strongly affected by the kernel changes. If we called it "pig-test" the results would still be the same.
                                I do not agree. What is interesting to see is how the kernel behaves under *REAL* loads; everything else is useless, because developing a kernel is a continuous trade-off between different load scenarios. Who cares if the kernel performs badly in an unrealistic load scenario?

                                Comment
