Why Software Defaults Are Important & Benchmarked


  • phoronix
    started a topic Why Software Defaults Are Important & Benchmarked


    Phoronix: Why Software Defaults Are Important & Benchmarked

    Almost every time benchmarks are published on Phoronix, at least a handful of people - or more - will right away say the benchmarks are flawed, meaningless, or just plain wrong. Why? Because the software configuration is tested with its default (stock) settings. These users then go on to say that the defaults are not optimized for performance and that "everyone else" knows better and uses a particular set of options instead, etc. But it's my firm belief that it's up to the upstream maintainer -- whether it be the project itself developing the software in question or the distribution vendor that's packaging and maintaining the given component -- to choose the most sane and reliable settings, and that's what most people use. In addition, with open-source software, there are endless possibilities for how a given piece of software can be tuned and tweaked. Here are some numbers backing up this belief in testing software at its defaults...

    http://www.phoronix.com/vr.php?view=OTE4OQ

  • ciplogic
    replied
    Originally posted by mtippett View Post
    The assertion that removal of the BKL is going to have no meaningful impact for most people is a fair statement.
    I will say up front that I'm in the "balanced" camp. I do think that defaults are what matter on users' machines.
    On the other hand, in many benchmarks, tweaking or not tweaking leads to wrong conclusions.
    I will give some hypothetical cases, but they mirror many articles.
    Statement 1: An NVIDIA 8800 GTX gives 400 FPS in OpenArena with the proprietary driver and 60 FPS with the open-source driver. A benchmark might "conclude" that the NVIDIA driver is 6.5 times faster than the classic one, when it is clear that VSync is enabled in the second case. Disabling it in the OSS driver would probably give 240 FPS or whatever and would yield a proper raw performance measurement.
    Statement 2: A compressed filesystem is very slow compared with a normal disk on burst writes, but works decently for threaded writes and for reads, and is much faster when reading a zero-filled file. This should *always* be the case on a fast CPU, no matter the FS: compression adds extra CPU work to compress or decompress, but reduces disk usage (depending on file content). Those benchmarks also affect compilation times (because of the CPU used for compression/decompression). If a paragraph explained how things work, the benchmarks would not be meaningless; they would give a measure of the CPU-usage impact. (A short sketch of the compression-ratio effect follows after this post.)
    Statement 3: Does GCC/LLVM AVX support mean anything? It depends a lot on the application's "nature". Those instructions are for data-parallel work, somewhat like CUDA. A compiler itself may improve by around 1%, an application like GIMP may gain around 20% in (some) filters, and highly parallel scientific computations may see close to a 100% speedup. When you read GCC-related articles, they mostly make it appear as though compilers are just a string of regressions, and in at least two benchmarks they show anomalies. Also, compilers today are fairly mature, so a 100% speedup from the compiler alone (without hardware support like multi-core or AVX, which requires applications to be rewritten to use them) is fairly out of reach.
    So I think that when people are frustrated, they complain about statements that appear uninformed, and an up-front paragraph setting expectations based on how things work would (I hope and think) make people complain less.
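    A minimal sketch of the compression-ratio point in Statement 2 above: a zero-filled file compresses to almost nothing, so a compressed filesystem has far less data to push to or pull from the disk, while already-random data gains nothing and only costs CPU. This uses Python's zlib directly rather than any real filesystem, and the buffer size and compression level are arbitrary choices for illustration.

        import os
        import zlib

        SIZE = 16 * 1024 * 1024  # 16 MiB test buffer (arbitrary)

        samples = {
            "zero-filled": bytes(SIZE),       # what a zero-filled file contains
            "random":      os.urandom(SIZE),  # worst case: high-entropy data
        }

        for name, buf in samples.items():
            compressed = zlib.compress(buf, 1)  # fast level, as a compressed FS would favour
            print(f"{name:12s}: {len(buf):>10d} -> {len(compressed):>10d} bytes "
                  f"({len(compressed) / len(buf):.4%} of original)")

    On a fast CPU the zero-filled case ends up limited by how quickly the tiny compressed stream can be handled, which is why reads of such files look dramatically faster than raw-disk reads.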



  • mtippett
    replied
    Originally posted by V!NCENT View Post
    Well, Michael's approach isn't bad per se. The problem is that he tests different things than the ones he mentions.

    For example, "How does the BKL removal affect performance?" should be "Does the BKL removal influence everyday operations?", then jump to the conclusion "Not really", and then maybe a discussion for a little more depth, along the lines of "While it doesn't have any real effect on performance, it does make this do that better, or faster, or worse."
    Remember that the lay person is really getting the hype from the community and trying to parse that. The BKL has long been pushed as a problem for scalability and performance - with minimal detail below that. Michael did put it to the test, and more or less as expected showed that the BKL is a performance non-event. As David Airlie posted in the forum for that article, the BKL is a non-event due to work that has been done to get around it over the last few years.

    The assertion that removal of the BKL is going to have no meaningful impact for most people is a fair statement.

    Determining where the BKL has an impact and showing the performance delta there is a completely different article, but I'd expect there would be forum posts of "that's all well and good, but it's irrelevant to me". Again, it all depends on your perspective and the implicit questions you are looking to have answered in articles.

    PTS has lowered the bar for doing repeatable benchmarking, and OpenBenchmarking has created a collaborative forum around repeatable results. Anyone who wants to show how it should be done can go forth and benchmark! (This is a general call, not something pointed at you, V!NCENT.)



  • pingufunkybeat
    replied
    Originally posted by mtippett View Post
    Incorrect. Benchmarks are a measure of a system. If the benchmarks are capped, there are other measures to use. Power consumption, sound, CPU or GPU utilization.

    FPS is _not_ the only measure.
    From THAT point of view, I absolutely agree!

    But if you're measuring pure performance, then capping the output to 60 Hz is really stupid. (A sketch of launching a benchmark with vsync disabled follows after this post.)

    You're completely right that there are other important factors to benchmark, like power consumption, durability, noise, etc.
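    For reference, here is a rough sketch of launching an OpenGL test with vertical sync disabled so the frame rate is not capped at the display refresh rate. The environment variables are the commonly documented ones for Mesa/DRI (vblank_mode) and the NVIDIA binary driver (__GL_SYNC_TO_VBLANK); glxgears stands in as a placeholder for whatever benchmark is actually being run.

        import os
        import subprocess

        env = os.environ.copy()
        env["vblank_mode"] = "0"          # Mesa/DRI drivers: do not wait for vertical refresh
        env["__GL_SYNC_TO_VBLANK"] = "0"  # NVIDIA proprietary driver equivalent

        # Placeholder command - substitute the real game/benchmark invocation here.
        subprocess.run(["glxgears"], env=env, check=False)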



  • V!NCENT
    replied
    Well, Michael's approach isn't bad per se. The problem is that he tests different things than the ones he mentions.

    For example, "How does the BKL removal affect performance?" should be "Does the BKL removal influence everyday operations?", then jump to the conclusion "Not really", and then maybe a discussion for a little more depth, along the lines of "While it doesn't have any real effect on performance, it does make this do that better, or faster, or worse."



  • mtippett
    replied
    Originally posted by hartz View Post
    A lot is wrong with this kind of article, if not merely with the way the testing is done. I posted in some considerable depth about this in December.
    Feel free to reproduce the test results (that is what PTS is for), and then tune as per the guides. I agree that you went into considerable depth about what should be done, but you didn't do it. Can you expand on the rationale for not following that guide yourself?


    ...

    The PROBLEM is that people who use performance numbers to make decisions don't always realize (or choose to ignore) the features and any other benefits/advantages of the other subjects in the test.
    Invert the consideration. People who are looking at a particular feature set want to make a tradeoff decision about the cost of picking up that feature. If you need snapshotting a la BTRFS, you will still want to know how it performs otherwise, to drive the decision between fast/big/striped/SSD.



  • mtippett
    replied
    Originally posted by pingufunkybeat View Post
    Not this again....
    A benchmark tests the graphics card's ability to render and push frames. Whether the user sees these frames is totally irrelevant from a performance point of view.

    Benchmarks must be uncapped, because capping them to refresh rate introduces an almost random factor that fudges the numbers in ways that are not related to anything.
    Incorrect. Benchmarks are a measure of a system. If the benchmarks are capped, there are other measures to use. Power consumption, sound, CPU or GPU utilization.

    FPS is _not_ the only measure. For my HTPC, the CPU/GPU utilization for video decode is my priority concern; plausible GL support is needed for compositing, but decode is what I need. (A rough sketch of sampling that utilization follows after this post.)
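    As a rough illustration of the "FPS is not the only measure" point, a sketch like the following samples CPU utilization while a decode job runs. It assumes the third-party psutil package is installed, and the mplayer command line is only a stand-in for whatever playback or decode workload is of interest.

        import subprocess
        import psutil  # third-party; pip install psutil

        # Placeholder decode workload - replace with the real command and file.
        proc = subprocess.Popen(["mplayer", "-benchmark", "-vo", "null", "sample.mkv"])
        ps = psutil.Process(proc.pid)

        samples = []
        while proc.poll() is None:  # sample until the workload exits
            try:
                samples.append(ps.cpu_percent(interval=0.5))  # % of one core over 0.5 s
            except psutil.NoSuchProcess:
                break

        if samples:
            print(f"average CPU utilization: {sum(samples) / len(samples):.1f}%")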



  • mtippett
    replied
    Originally posted by XorEaxEax View Post
    ...

    Obviously it's near impossible to test all combinations of flags and optimization levels to see which actually generates the fastest code, so some generalization needs to be done. And given that -O3 is supposed to generate the fastest code then it would be the most reasonable optimization level for benchmarks. I would also add -ffast-math since some compilers default to this and some don't and that it has a big effect on certain tests.
    See acovea (http://www.coyotegulch.com/products/acovea/). -O3 invokes the most aggressive optimizations. These optimizations may improve or degrade performance depending on workload.

    -O3 should generate the fastest code, all else be damned; -O2 is supposed to strike a balance between speed and code size; and -Os favours size over speed. Given this, if you want to test the best performance of the code generated by compilers, then you would want to use -O3.
    See http://openbenchmarking.org/result/1...IV-AAAA3619586. You are testing the *most aggressive* optimizations. They may not be faster; it's clear from the result set above that it isn't always true. (A rough sketch of checking this for a single workload follows after this post.)

    Also, relying on upstream maintainers to set the compiler options has other problems that makes certain benchmarks next to useless, take x264 which is configured to use hand-written assembly optimizations. The assembly code will not be optimized by the compiler in any way, and since x264 uses assembly optimizations for pretty much every place where performance matters, a compiler benchmark that doesn't disable asm optimizations is simply pointless.
    Yes, that is true. But at the very least, the developers behind John the Ripper contacted Michael and indicated that the hand-tuned assembly is only there due to sub-optimal code generation in existing compilers. IIRC, the hope was that LLVM would assist in removing that need.

    I think that from now on you should clearly state that the compiler options are the default with which your upstream maintainer ships them so as to make it abundantly clear that they may not in fact represent the best code generation the compilers are able to provide even at the standard -O3 level, since it depends on what some upstream maintainer felt was the appropriate optimization level.
    I still disagree with your assertion. All the benchmarks have clear source availability; you can investigate the options at your leisure. I do agree that when you _vary_ from the shipped default, it should be documented. In the compiler benchmarks and fs benchmarks done recently it _has_ been documented.
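    To make the "most aggressive is not always fastest" point concrete, here is a rough, hand-rolled sketch (not PTS) of timing one workload built at -O2, -O3 and -Os. bench.c is a hypothetical, self-contained CPU-bound program; which flag wins will depend on the workload, the compiler version and the machine, which is exactly the argument above.

        import subprocess
        import time

        SOURCE = "bench.c"  # hypothetical self-contained, CPU-bound benchmark source

        for flag in ("-O2", "-O3", "-Os"):
            subprocess.run(["gcc", flag, SOURCE, "-o", "bench"], check=True)
            start = time.perf_counter()
            subprocess.run(["./bench"], check=True)  # time the compiled program, not the compile
            print(f"{flag}: {time.perf_counter() - start:.3f} s")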



  • mtippett
    replied
    Originally posted by damipereira View Post
    ...

    3-For compiler testing: You should test the compilers with the recommended settings, not only the defaults. If llvm-gcc is designed to work with -O3, then use it with -O3; use the compilers as a real user or developer would use them.
    What are the recommended settings? -O3 and -O2 are just more aggressive collections of optimizations. As per staalmannen's testing, finding the optimum configuration options requires near-exhaustive analysis of each and every set of options (a toy illustration of that combinatorial explosion follows after this post). IIRC, -Os was faster than -O3 for some compilers. For example, -O3 is worse than -O2 with gcc on 64-bit for the Bullet physics engine.

    I don't see a clear way of dealing with these issues other than letting the domain experts (the compiler and original application developers) provide the best configuration. There are interesting systems such as Acovea (http://www.coyotegulch.com/products/acovea/) which look to take this further.


    4-For hardware reviews (this is the harder one): You should start by using the default options for everything, but take care that the input and output of everything is the same (in the case of games); also, different distros and configurations should be used.
    Imagine what happens if you compare 2 different cards and by chance use a distro or package which has a BIG regression on 1 card. I think you should test hardware across at least 3 totally different distributions, like Fedora, Ubuntu, and Slackware.
    What you are really alluding to is making sure that the variant portion of the testing is captured and controlled. That's really what we did with PTS and OpenBenchmarking.org. If, for a particular angle, there was a killer regression for a piece of hardware, then that's the market reality: HW piece foo shouldn't be used on that distro.

    We're talking about an ecosystem here, so there are many groups that need to all carry their part. Some will carry it better than others. Determining which load to carry or consider is a different issue altogether - to which there is no clear answer.
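    As a toy illustration of the "near-exhaustive analysis" remark above: even a small set of independent on/off optimization flags explodes combinatorially, which is why tools like Acovea resort to genetic search rather than brute force. The flag list below is an arbitrary subset picked purely for illustration.

        import itertools

        flags = ["-funroll-loops", "-fomit-frame-pointer", "-ftree-vectorize",
                 "-ffast-math", "-flto", "-finline-functions"]

        print(f"{len(flags)} flags -> {2 ** len(flags)} on/off combinations")  # 6 flags -> 64

        # Brute force is only feasible for toy sets like this one; real compilers
        # expose hundreds of such flags, plus parameters that are not just on/off.
        for n in range(len(flags) + 1):
            for combo in itertools.combinations(flags, n):
                pass  # here one would compile and benchmark with " ".join(combo)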



  • XorEaxEax
    replied
    Originally posted by mtippett View Post
    My primary point is that there is no cut-and-dried solution or right thing to do. I can easily argue from the angles above, or even more, and each time it will come out with a different resolution. Yes, tweaking and tuning is possible, but most people do not do that.
    True, but these compiler benchmarks are supposed to show which compilers generate the fastest code, and unless you configure the compilers to generate the fastest code, that is not what the results will show. They will show which compiler generated the fastest code at optimization level X, which may not be representative of what the compiler produces when it is told to generate the fastest code it actually can.

    Obviously it's near impossible to test all combinations of flags and optimization levels to see which actually generates the fastest code, so some generalization needs to be done. And given that -O3 is supposed to generate the fastest code, it would be the most reasonable optimization level for benchmarks. I would also add -ffast-math, since some compilers default to this and some don't, and it has a big effect on certain tests. (A small illustration of why -ffast-math changes results, not just speed, follows after this post.)

    -O3 should generate the fastest code, all else be damned; -O2 is supposed to strike a balance between speed and code size; and -Os favours size over speed. Given this, if you want to test the best performance of the code generated by compilers, then you would want to use -O3.

    Also, relying on upstream maintainers to set the compiler options has other problems that makes certain benchmarks next to useless, take x264 which is configured to use hand-written assembly optimizations. The assembly code will not be optimized by the compiler in any way, and since x264 uses assembly optimizations for pretty much every place where performance matters, a compiler benchmark that doesn't disable asm optimizations is simply pointless.

    I think that from now on you should clearly state that the compiler options are the default with which your upstream maintainer ships them so as to make it abundantly clear that they may not in fact represent the best code generation the compilers are able to provide even at the standard -O3 level, since it depends on what some upstream maintainer felt was the appropriate optimization level.
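    Picking up the -ffast-math point above: one reason it is not a universal default is that, among other relaxations, it lets the compiler reassociate floating-point arithmetic, and IEEE 754 addition is not associative, so results can change, not just speed. A tiny Python illustration of the underlying non-associativity:

        a, b, c = 0.1, 0.2, 0.3
        left = (a + b) + c    # 0.6000000000000001
        right = a + (b + c)   # 0.6
        print(left == right)  # False - reordering the additions changes the result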

