Why Software Defaults Are Important & Benchmarked

  • #16
    Originally posted by pingufunkybeat View Post
    I haven't checked the blobs recently, but the free drivers sync the rendering to the refresh interval of the display by default.

    Which means that any and every benchmark will always give a maximum of 60fps on a modern LCD display, using default settings. That is a sane default for most things -- and it removes tearing when playing video, for example -- but makes for an utterly pointless benchmark -- and makes the results less reliable, since most frame rates are cut to 20, 30 or 60 fps.

    Which is why most benchmarks disable vsync in order to test the drivers' and the cards' ability to render complex scenes at their limit. And this is already not the default configuration, and introduces tearing. I don't see a fundamental difference between disabling vsync (causes tearing, improves performance) and disabling SwapBuffersWait (causes tearing, improves performance).
    Ah, I hadn't considered that one, thanks. Yes, some things like that should probably be mentioned - I had a quick read of the latest Ubuntu article (normally I don't read them - what Ubuntu does has little to no interest for me) and there was a mention of disabling SwapBuffersWait to improve performance. Perhaps a few more details like that in the articles?
    I still think testing defaults is OK though, as if nothing else it provides a baseline, but a mention of what can skew results (and why it is/isn't done) is definitely useful.
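    For context, a sketch of where that knob lives, assuming the radeon DDX driver; the Identifier string is a hypothetical placeholder:

    ```
    Section "Device"
        Identifier "card0"                  # hypothetical identifier
        Driver     "radeon"
        Option     "SwapbuffersWait" "off"  # skip waiting for vblank on buffer swaps
    EndSection
    ```

    Like disabling vsync in a game, this trades tearing for throughput, which is exactly why noting it in an article matters.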



    • #17
      Originally posted by pingufunkybeat View Post
      I haven't checked the blobs recently, but the free drivers sync the rendering to the refresh interval of the display by default.

      Which means that any and every benchmark will always give a maximum of 60fps on a modern LCD display, using default settings. That is a sane default for most things -- and it removes tearing when playing video, for example -- but makes for an utterly pointless benchmark -- and makes the results less reliable, since most frame rates are cut to 20, 30 or 60 fps.

      Which is why most benchmarks disable vsync in order to test the drivers' and the cards' ability to render complex scenes at their limit. And this is already not the default configuration, and introduces tearing. I don't see a fundamental difference between disabling vsync (causes tearing, improves performance) and disabling SwapBuffersWait (causes tearing, improves performance).
      I don't know if Michael has a 120Hz display or not, but he had a few benchmarks in this article where the free drivers got over 60fps. One even had 90fps.



      • #18
        Originally posted by pvtcupcakes View Post
        I don't know if Michael has a 120Hz display or not, but he had a few benchmarks in this article where the free drivers got over 60fps. One even had 90fps.
        If vsync is off, then this is already not a default configuration.

        As for default config as a baseline, I'm all for it. I have nothing against testing the default. But I personally think that Ubuntu has made some bad decisions -- like using Compiz, which cannot suspend compositing on the fly AFAIK -- which harms performance on free drivers disproportionately.

        KDE users do not have this issue.

        It's unfair if poor defaults (in this case, unrelated to the driver) make the driver look bad.



        • #19
          What is wrong with benchmarks

          A lot is wrong with this kind of article, if not merely with the way the testing is done. I posted in some considerable depth about this in December.

          One of my main gripes is that even though subject X might gain more from performance tuning than subject Z, subject Z might be favored by the environment and performance/benchmarks articles are all too often used as a decision maker.

          Example:
          • Candidates: XFCE and Gnome
          • Test Environment: Old Computer
          • (Add performance test results in here)
          • Conclusion: XFCE is faster, thus better than Gnome.

          The PROBLEM is that people who use performance numbers to make decisions don't always realize (or choose to ignore) the features and any other benefits/advantages of the other subjects in the test.

          I explained my gripe with this kind of article in much more detail in the post. In particular I ask numerous questions about the test environment used in the article about the performance of various file systems, and point out specific flaws in the testing and reporting.



          • #20
            you have a tough job.

            maybe you just need to be clear why you are doing the benchmarking. that would then influence the options. this site does lots of different sorts of benchmarking for different reasons.

            if you are benchmarking to see which distro/OS is fastest, then defaults might be best.
            if you want to compare 2 drivers for the same hardware, then you need to make sure the drivers are doing the same task.
            if you want to compare several gcc releases then a range of options is good (e.g. does -flto make a difference).
            comparing 2 different compilers, i'd say find the fastest option that still makes a functioning program.

            for graphics drivers, more FPS than the screen refresh rate is meaningless and useless for a user. it is fast enough. maybe you need to find more demanding settings for the benchmark.



            • #21
              Originally posted by hartz View Post
              One of my main gripes is that even though subject X might gain more from performance tuning than subject Z, subject Z might be favored by the environment and performance/benchmarks articles are all too often used as a decision maker.

              Example:
              • Candidates: XFCE and Gnome
              • Test Environment: Old Computer
              • (Add performance test results in here)
              • Conclusion: XFCE is faster, thus better than Gnome.
              I don't know if I understand this correctly, but this is wrong! For example:
              You test KDE4 and XFCE on a Pentium 2 at 400 MHz with 4MB RAM and an ATI 9200.
              Conclusion: XFCE outperforms KDE4 because it requires fewer instructions to be executed, leaving room for a lot of other things.

              Now let's re-run the test on a Core i7 with 4GB RAM and an ATI 2400.
              Conclusion: with multiple cores there is room enough left for other processes, and due to KDE4's heavy caching and whatnot it responds way faster than XFCE for a lot of things.



              • #22
                Originally posted by ssam View Post
                for graphics drivers, more FPS than the screen refresh rate is meaningless and useless for a user.
                Not this again....

                A benchmark tests the graphics card's ability to render and push frames. Whether the user sees these frames is totally irrelevant from a performance point of view.

                Benchmarks must be uncapped, because capping them to refresh rate introduces an almost random factor that fudges the numbers in ways that are not related to anything.



                • #23
                  Apologies if this has been mentioned already because I haven't read the whole thread yet, but here goes,

                  upstream maintainer -- whether it be the project itself developing the software in question or the distribution vendor
                  that's packaging and maintaining the given component -- to choose the most sane and reliable settings
                  Problem is that, for example, if a package ships with -O2, what -O2 encompasses differs greatly from compiler to compiler. For instance, with GCC and Clang -O2 sometimes beats -O3, while on Open64 I've never seen -O2 perform anywhere near -O3 (which is more as it should be, I guess). So a comparison of a package that defaults to -O2 would certainly not give Open64 a 'fair' shake as far as performance goes. -O3, on the other hand, is supposed to generate the fastest code and should thus be the closest thing to a fair comparison of what performance can be had from the compiler.

                  Also, some official packages ship in pure debug mode since they ASSUME that whoever compiles it will choose their own preferred settings. Last I checked, p7zip for instance came with -O0, which means no optimization at all. Testing such a package without changing the flags would be totally pointless.

                  So, if Phoronix is actually interested in presenting the best performance each compiler can generate, then they would need to make sure that all packages are configured to generate the fastest code; I would suggest '-O3 -march=native -ffast-math'. As it stands, the upstream maintainer may well have optimized the flags for a particular compiler, or even a particular version of a compiler, which would not reflect the best settings for the other compilers (or even later versions of the same compiler) being tested. So yes, I find the methodology of using the 'out of the box' flags for packages flawed when it comes to comparing the performance of generated code.
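                  A minimal sketch of how such a flag comparison can be run by hand, assuming gcc is installed; the toy source file and its workload are illustrative only:

                  ```shell
                  # Build the same toy workload at the two optimization levels discussed above.
                  cat > bench.c <<'EOF'
                  #include <stdio.h>
                  int main(void) {
                      double s = 0.0;
                      for (long i = 1; i < 5000000; i++)
                          s += 1.0 / ((double)i * (double)i);   /* converges toward pi^2/6 */
                      printf("%.4f\n", s);
                      return 0;
                  }
                  EOF
                  gcc -O2 bench.c -o bench_O2
                  gcc -O3 -ffast-math bench.c -o bench_O3
                  ./bench_O2    # prints 1.6449
                  ./bench_O3    # prints 1.6449; compare wall time with e.g. 'time'
                  ```

                  Which binary wins on time varies by workload and compiler version, which is exactly the point being argued; note also that -ffast-math relaxes IEEE semantics, so results can differ on code that depends on them.
                  
                  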



                  • #24
                    Something I would like to see

                    It would be nice if instead of just saying, for example, "XFCE is faster than GNOME", you said "XFCE is faster than GNOME because GNOME uses heavier GTK effects and more RAM-hungry processes" (for example). Also, when talking about distros I hear a lot of "A is faster than B" but little of "A is faster than B because A is using version X of package Y, which is more optimized or didn't have a regression." I mean, if the objective is to make Linux better, numbers alone are not enough; the explanation behind the numbers is more important, I think.
                    About defaults in tests, I would say "it depends", as some others said:
                    1. For driver testing, make the input and output the same, and if there is any important configuration (xorg tweaks), test it if you have time for it; if not, then use defaults for that part (no xorg tweaks).
                    2. For distro testing, leaving everything at defaults is perfect, but as I said it would be nicer if you said distro A is faster than B because of this configuration, so that the users/packagers of the slow distro can fix it.
                    3. For compiler testing: you should test the compilers with the recommended settings, not only the defaults. If llvm-gcc is designed to work with -O3 then use it with -O3; use the compilers as a real user or developer would use them.
                    4. For hardware reviews (this is the hardest): you should start with the default options for everything, but take care that the input and output of everything is the same (in the case of games); also, different distros and configurations should be used.
                    Imagine what happens if you compare 2 different cards and by chance you use a distro or package which has a BIG regression on 1 card. I think you should test hardware across at least 3 totally different distributions, like Fedora, Ubuntu, and Slackware.



                    • #25
                      Originally posted by XorEaxEax View Post
                      ...

                      So, if Phoronix is actually interested in presenting the best performance each compiler can generate, then they would need to make sure that all packages are configured to generate the fastest code. I would suggest '-O3 -march=native -ffast-math'. As it stands, the upstream maintainer can very well have optimized the flags for a particular compiler, or even a particular version of a compiler, which would not reflect the best settings for other compilers (or even later versions of the same compiler) tested. So yes, I find this methodology of using the 'out of the box' flags for packages flawed when it comes to comparing the performance of generated code.
                      It depends on where you are coming from. Think of it from a few angles

                      1) A developer integrating a 3rd party component
                      2) An upstream developer working on their code
                      3) A distribution vendor integrating a library

                      In 1), the developer may just want it to work; they don't want to invest the time to verify whether -O2 is faster than -O3, or -march=native, or -ffast-math. In fact, that -O2 is worse than -O3 is not always true, since these levels are just facades over sets of options that _in general_ perform better.

                      In 2), the developer is _publishing_ a product when they tag a release. It is entirely reasonable for anyone downstream to assume that, for the most common ecosystem (i.e. gcc), the published component has been tuned for the developer's expected common usage.

                      For 3), the distribution vendor will most likely look at the options to harmonize them with the general practices within the distribution.

                      My primary point is that there is no cut-and-dried right thing to do. I can easily argue from the angles above, or even more, and each time it will come out with a different resolution. Yes, tweaking and tuning is possible, but most people do not do that.



                      • #26
                        Originally posted by mtippett View Post
                        My primary point is that there is no cut-and-dried right thing to do. I can easily argue from the angles above, or even more, and each time it will come out with a different resolution. Yes, tweaking and tuning is possible, but most people do not do that.
                        True, but these compiler benchmarks are supposed to show which compilers generate the fastest code, but unless you configure the compilers to generate the fastest code then that is not what the results will show. They will show which compiler generated the fastest code at optimization level X, which may not be representative of what the compiler generates when it is told to generate the fastest code it actually can.

                        Obviously it's near impossible to test all combinations of flags and optimization levels to see which actually generates the fastest code, so some generalization needs to be done. And given that -O3 is supposed to generate the fastest code then it would be the most reasonable optimization level for benchmarks. I would also add -ffast-math since some compilers default to this and some don't and that it has a big effect on certain tests.

                        -O3 should generate the fastest code, all else be damned; -O2 is supposed to strike a balance between speed and code size, and -Os favours size over speed. Given this, if you want to test the best performance of code generated by compilers then you would want to use -O3.

                        Also, relying on upstream maintainers to set the compiler options has other problems that makes certain benchmarks next to useless, take x264 which is configured to use hand-written assembly optimizations. The assembly code will not be optimized by the compiler in any way, and since x264 uses assembly optimizations for pretty much every place where performance matters, a compiler benchmark that doesn't disable asm optimizations is simply pointless.

                        I think that from now on you should clearly state that the compiler options are the default with which your upstream maintainer ships them so as to make it abundantly clear that they may not in fact represent the best code generation the compilers are able to provide even at the standard -O3 level, since it depends on what some upstream maintainer felt was the appropriate optimization level.
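                        A minimal sketch of the guard pattern being described, using a hypothetical hot function; assumes gcc, with the asm variant only built on x86-64:

                        ```shell
                        cat > hot.c <<'EOF'
                        #include <stdio.h>
                        /* Hot path with a hand-written asm variant guarded by a macro (the
                         * pattern projects like x264 use).  When the asm path is compiled in,
                         * the optimizer never touches it, so a benchmark of this function
                         * stops measuring the compiler's code generation. */
                        #if defined(USE_ASM) && defined(__x86_64__)
                        static long add(long a, long b) {
                            long r;
                            __asm__("leaq (%1,%2), %0" : "=r"(r) : "r"(a), "r"(b));
                            return r;
                        }
                        #else
                        static long add(long a, long b) { return a + b; }  /* plain C fallback */
                        #endif
                        int main(void) { printf("%ld\n", add(40, 2)); return 0; }
                        EOF
                        gcc -O3 -DUSE_ASM hot.c -o hot_asm   # benchmark measures the hand-written asm
                        gcc -O3 hot.c -o hot_c               # benchmark measures the compiler's codegen
                        ./hot_asm    # prints 42
                        ./hot_c      # prints 42
                        ```

                        This is why a compiler comparison usually wants the fallback path (e.g. x264's --disable-asm style switch) rather than the asm path.
                        
                        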



                        • #27
                          Originally posted by damipereira View Post
                          ...

                          3. For compiler testing: you should test the compilers with the recommended settings, not only the defaults. If llvm-gcc is designed to work with -O3 then use it with -O3; use the compilers as a real user or developer would use them.
                          What are the recommended settings? -O3 and -O2 are just more aggressive collections of optimizations. As per staalmannen's testing, finding the optimum configuration requires near-exhaustive analysis of each and every set of options. IIRC, -Os was faster than -O3 for some compilers. For example, -O3 is worse than -O2 with gcc on 64-bit for the Bullet physics engine.

                          I don't see a clear way of dealing with these issues other than letting the domain experts (compiler and original developers) provide the best configuration. There are interesting systems such as Acovea (http://www.coyotegulch.com/products/acovea/) which look to take it further.


                          4. For hardware reviews (this is the hardest): you should start with the default options for everything, but take care that the input and output of everything is the same (in the case of games); also, different distros and configurations should be used.
                          Imagine what happens if you compare 2 different cards and by chance you use a distro or package which has a BIG regression on 1 card. I think you should test hardware across at least 3 totally different distributions, like Fedora, Ubuntu, and Slackware.
                          What you are really alluding to is making sure that the variant portion of the testing is captured and controlled. That's really what we did with PTS and OpenBenchmarking.org. If for a particular angle, there was a killer regression for a piece of hardware, then that's the market reality. HW piece foo shouldn't be used on that distro.

                          We're talking about an ecosystem here, so there are many groups that need to all carry their part. Some will carry it better than others. Determining which load to carry or consider is a different issue altogether - to which there is no clear answer.



                          • #28
                            Originally posted by XorEaxEax View Post
                            ...

                            Obviously it's near impossible to test all combinations of flags and optimization levels to see which actually generates the fastest code, so some generalization needs to be done. And given that -O3 is supposed to generate the fastest code then it would be the most reasonable optimization level for benchmarks. I would also add -ffast-math since some compilers default to this and some don't and that it has a big effect on certain tests.
                            See acovea (http://www.coyotegulch.com/products/acovea/). -O3 invokes the most aggressive optimizations. These optimizations may improve or degrade performance depending on workload.

                            -O3 should generate the fastest code, all else be damned; -O2 is supposed to strike a balance between speed and code size, and -Os favours size over speed. Given this, if you want to test the best performance of code generated by compilers then you would want to use -O3.
                            See http://openbenchmarking.org/result/1...IV-AAAA3619586. You are testing the *most aggressive* optimizations. They may not be faster; it's clear from the result set above that it isn't always true.

                            Also, relying on upstream maintainers to set the compiler options has other problems that makes certain benchmarks next to useless, take x264 which is configured to use hand-written assembly optimizations. The assembly code will not be optimized by the compiler in any way, and since x264 uses assembly optimizations for pretty much every place where performance matters, a compiler benchmark that doesn't disable asm optimizations is simply pointless.
                            Yes, that is true. But at the very least, the developers behind John the Ripper contacted Michael and indicated that the hand-tuned assembly is only there due to sub-optimal code generation in existing compilers. IIRC, the hope was that LLVM would assist in removing that need.

                            I think that from now on you should clearly state that the compiler options are the default with which your upstream maintainer ships them so as to make it abundantly clear that they may not in fact represent the best code generation the compilers are able to provide even at the standard -O3 level, since it depends on what some upstream maintainer felt was the appropriate optimization level.
                            I still disagree with your assertion. All the benchmarks have clear source availability, you can investigate the options at your leisure. I do agree that when you _vary_ from the shipped default, it should be documented. In the compiler benchmarks and fs benchmarks done recently it _has_ been documented.



                            • #29
                              Originally posted by pingufunkybeat View Post
                              Not this again....
                              A benchmark tests the graphics card's ability to render and push frames. Whether the user sees these frames is totally irrelevant from a performance point of view.

                              Benchmarks must be uncapped, because capping them to refresh rate introduces an almost random factor that fudges the numbers in ways that are not related to anything.
                              Incorrect. Benchmarks are a measure of a system. If the frame rates are capped, there are other measures to use: power consumption, sound, CPU or GPU utilization.

                              FPS is _not_ the only measure. For my HTPC, the CPU/GPU utilization for video decode is my priority concern, plausible GL support is needed for compositing, but decode is what I need.
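                              As a sketch of one such alternative measure, assuming a Linux /proc filesystem: sample aggregate CPU jiffies around a workload (the `sleep 1` below is a stand-in for a capped playback run):

                              ```shell
                              # Field layout per proc(5): cpu user nice system idle ...
                              read -r _ user1 nice1 sys1 idle1 _ < /proc/stat
                              sleep 1                              # stand-in for the capped workload
                              read -r _ user2 nice2 sys2 idle2 _ < /proc/stat
                              busy=$(( (user2 - user1) + (nice2 - nice1) + (sys2 - sys1) ))
                              total=$(( busy + (idle2 - idle1) ))
                              echo "busy jiffies: $busy of $total"
                              ```

                              A lower busy share at the same (capped) frame rate means the decode path is cheaper, which is the number that matters for an HTPC.
                              
                              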



                              • #30
                                Originally posted by hartz View Post
                                A lot is wrong with this kind of article, if not merely with the way the testing is done. I posted in some considerable depth about this in December.
                                Feel free to reproduce the test results (that is what PTS is for), and then tune as per the guides. I agree that you went into considerable depth about what should be done, but you didn't do it. Can you expand on the rationale for not following that guide yourself?


                                ...

                                The PROBLEM is that people who use performance numbers to make decisions don't always realize (or choose to ignore) the features and any other benefits/advantages of the other subjects in the test.
                                Invert the consideration. People who are looking at a particular featureset want to make a tradeoff decision about the cost of picking up that feature. If you need snapshotting a la BTRFS, you will still want to know how it performs otherwise to drive the decision for fast/big/stripped/SSD.

