
Thread: Benchmarks Of GCC 4.2 Through GCC 4.7 Compilers

  1. #11
    Join Date
    Oct 2009
    Posts
    845

    Default

    Quote Originally Posted by birdie View Post
    I'm totally confused - is this a test of GCC compilers or GCC + LLVM backend?
    The table on the first page lists LLVM backend for all GCC versions so I've no idea what to think.
    Should be a typo, since the test aims to cover different GCC versions. Also, llvm-gcc has been deprecated in favour of DragonEgg, which does the same thing against newer GCC versions (4.5 onwards, IIRC) through the plugin framework.
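
    For reference, a minimal sketch of how DragonEgg is loaded through GCC's plugin framework (assuming a gcc-4.5 build with plugin support and a locally built dragonegg.so; file names and paths are illustrative):

        # Use GCC's frontend but LLVM's optimizers and code generator:
        gcc-4.5 -fplugin=./dragonegg.so -O2 -march=native test.c -o test

        # The same invocation without -fplugin uses GCC's own backend,
        # which is what a pure GCC-version comparison should be doing:
        gcc-4.5 -O2 -march=native test.c -o test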

    Quote Originally Posted by sabriah View Post
    No, it is very timely. Now the developers have a chance of addressing these issues.
    Well, the compiler developers have test suites of their own which are much more extensive than Phoronix's. As for them even considering these tests at all, I'd say that ship sailed a long time ago. Both the GCC devs and Chris Lattner (LLVM project leader) have stated that Phoronix's tests are totally worthless due to the poor conditions under which they are done. Whatever compiler options Michael states are being used, take it with a large grain of salt, as it has been shown over and over again that he doesn't seem to know how to configure these packages correctly before testing (Himeno pressure tests using -O0, POV-Ray defaulting to tuning for AMD K8 no matter what processor is being used, etc.).

    And we still see it: what use is there in testing ffmpeg/x264 with assembly optimizations enabled? All the performance-critical code is out of reach for the compilers; they are pretty much left to optimize the command-line option handling, yay!
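
    Both projects already ship a switch for exactly that; a hedged sketch of what a compiler-focused build could look like (--disable-asm as found in the x264 and ffmpeg configure scripts; exact option spellings may vary between versions):

        # x264: drop the hand-written assembly so the C fallback paths
        # are what actually gets benchmarked
        ./configure --disable-asm && make

        # ffmpeg: likewise build the pure C code paths
        ./configure --disable-asm && make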

  2. #12
    Join Date
    Oct 2009
    Posts
    845

    Default

    Quote Originally Posted by Ansla View Post
    -march=native was a great choice, but maybe -O2 would have been better than -O3; the gcc documentation recommends using -O2, as -O3 does some risky optimisations.
    I would prefer both -O2 and -O3. -O3 is aimed at producing the fastest code, but due to the difficulty of correctly determining the best optimization strategy at compile time, some of the aggressive optimizations in -O3 backfire, and in those cases -O2 ends up faster. Still, if only one option is to be used, I do prefer -O3, as it is supposed to generate the fastest code.
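
    For anyone who wants to see exactly what separates the two levels on their own toolchain, GCC can report it directly (a sketch; -Q --help=optimizers is available in recent GCC releases, 4.4 onwards IIRC):

        # Dump which optimizer passes each level enables, then compare
        gcc -O2 -Q --help=optimizers > o2.txt
        gcc -O3 -Q --help=optimizers > o3.txt
        diff o2.txt o3.txt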

  3. #13
    Join Date
    Jan 2007
    Posts
    459

    Default

    Quote Originally Posted by XorEaxEax View Post
    Should be a typo, since the test aims to cover different GCC versions. Also, llvm-gcc has been deprecated in favour of DragonEgg, which does the same thing against newer GCC versions (4.5 onwards, IIRC) through the plugin framework.


    Well, the compiler developers have test suites of their own which are much more extensive than Phoronix's. As for them even considering these tests at all, I'd say that ship sailed a long time ago. Both the GCC devs and Chris Lattner (LLVM project leader) have stated that Phoronix's tests are totally worthless due to the poor conditions under which they are done. Whatever compiler options Michael states are being used, take it with a large grain of salt, as it has been shown over and over again that he doesn't seem to know how to configure these packages correctly before testing (Himeno pressure tests using -O0, POV-Ray defaulting to tuning for AMD K8 no matter what processor is being used, etc.).

    And we still see it: what use is there in testing ffmpeg/x264 with assembly optimizations enabled? All the performance-critical code is out of reach for the compilers; they are pretty much left to optimize the command-line option handling, yay!
    Indeed, but doing it properly would mean Michael finally putting in the time and effort to actually care about his test suite: update it with the current ffmpeg/avconv and x264 Git code, then set a configure switch to disable the assembly (and so fall back to the slow C routines) for cases like this test.

    Come to that, he doesn't even seem to care about running out-of-the-box ARM NEON SIMD results, now that we are into retail ARM quad cores such as the Asus Transformer Prime's Tegra 3, with several other quads (Freescale, Qualcomm, etc.) reaching retail soon enough, never mind all the old dual-core ARM NEON kit out there today that people and companies would like to see and compare results for.

    Given that current ffmpeg/avconv and x264 have limited (but worth testing) NEON SIMD today, these compiler tests would be perfectly suited to cross-compiled ARM/NEON testing: the builds would fall back to the C routines, perhaps show some speed improvements, and show where the auto-vectorising needs more work. And let's face it, auto-vectorising still needs a LOT of work, and/or better developers who can learn some real assembly and apply it liberally in their apps' code where it helps.
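
    A sketch of what such a cross-compiled NEON run could look like (assuming an arm-linux-gnueabi toolchain; dsp.c stands in for whichever fallback C file is being inspected, and the verbose flag is the GCC 4.x spelling):

        # Build the C fallback paths for NEON with auto-vectorisation,
        # and have GCC report which loops it managed to vectorise
        arm-linux-gnueabi-gcc -std=c99 -O3 -mfpu=neon -mfloat-abi=softfp \
            -ftree-vectorize -ftree-vectorizer-verbose=2 -c dsp.c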
    Last edited by popper; 12-03-2011 at 05:12 AM.

  4. #14
    Join Date
    Apr 2010
    Posts
    271

    Default

    Quote Originally Posted by XorEaxEax View Post
    I would prefer both -O2 and -O3. -O3 is aimed at producing the fastest code, but due to the difficulty of correctly determining the best optimization strategy at compile time, some of the aggressive optimizations in -O3 backfire, and in those cases -O2 ends up faster. Still, if only one option is to be used, I do prefer -O3, as it is supposed to generate the fastest code.
    A benchmark that tests code producing possibly faulty results is worthless. If you ask -O2-compiled code what 2+2 is and it says 4 in half a second, while the -O3 code says 5 in a quarter second, which is better?

  5. #15
    Join Date
    Jan 2007
    Posts
    459

    Default

    Quote Originally Posted by locovaca View Post
    A benchmark that tests code producing possibly faulty results is worthless. If you ask -O2-compiled code what 2+2 is and it says 4 in half a second, while the -O3 code says 5 in a quarter second, which is better?
    Obviously the -O3 code, once you finally realise that the auto-vectorising code in your compiler is so badly broken (perhaps from a simple typo, etc.) that it's producing faulty (or even just slow, prototype-speed) output and needs fixing ASAP. But then devs should be checking their routines' speed improvements down to the picosecond, as it all adds up to lost time and efficiency.
    Last edited by popper; 12-04-2011 at 01:17 AM.

  6. #16
    Join Date
    Oct 2009
    Posts
    845

    Default

    Quote Originally Posted by locovaca View Post
    A benchmark that tests code producing possibly faulty results is worthless. If you ask -O2-compiled code what 2+2 is and it says 4 in half a second, while the -O3 code says 5 in a quarter second, which is better?
    The fact that -O3 doesn't always generate faster code than -O2 does not mean the code is broken; it just means that the heuristics governing the more aggressive optimizations enabled at -O3 sometimes fail to correctly estimate whether an optimization will yield faster code, with the result that the "optimized" code is actually slower.

    This is not a new thing. There are many options, global loop unrolling for instance, whose benefit is terribly hard to estimate without runtime data, and which are therefore not enabled by default at any compiler optimization level; the same goes for loop vectorization, although that is turned on at -O3 in GCC. Compilers have improved a lot in this area, but the only way to really know the compiler will make the right choice is to either unroll manually (like in the good ole days) or provide the compiler with runtime data (profile/feedback optimization).
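
    For reference, a minimal sketch of that profile/feedback workflow using GCC's own flags (bench.c and the workload are placeholders):

        # 1. Build an instrumented binary
        gcc -O3 -fprofile-generate bench.c -o bench

        # 2. Run it on a representative workload to collect profile data
        ./bench typical-input

        # 3. Rebuild with the profile guiding unrolling, inlining and
        #    vectorization decisions
        gcc -O3 -fprofile-use bench.c -o bench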

    As for -O3 generating faulty code, I haven't come across that in a long time; do you have any fresh examples? And yes, any benchmark which doesn't validate that its results are correct is indeed worthless.

  7. #17
    Join Date
    Feb 2008
    Location
    Linuxland
    Posts
    5,195

    Default

    Re auto-vectorizing: I recently looked at GCC's generated asm and it did better than what I would've done by hand (I had targeted SSE, though GCC used SSE2). Very nice to have the compiler do that successfully.
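
    For illustration, a minimal sketch of the kind of loop GCC's vectorizer handles well (a hypothetical example, not curaga's actual code; restrict simply makes the compiler's aliasing analysis easier):

        /* Compile with: gcc -std=c99 -O3 -msse2 -c saxpy.c
           At -O3 GCC will typically turn this loop into packed SSE2
           float operations (mulps/addps). */
        void saxpy(float *restrict y, const float *restrict x,
                   float a, int n)
        {
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }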

  8. #18
    Join Date
    Oct 2010
    Posts
    325

    Default

    Quote Originally Posted by XorEaxEax View Post
    The fact that -O3 doesn't always generate faster code than -O2 does not mean the code is broken; it just means that the heuristics governing the more aggressive optimizations enabled at -O3 sometimes fail to correctly estimate whether an optimization will yield faster code, with the result that the "optimized" code is actually slower.

    This is not a new thing. There are many options, global loop unrolling for instance, whose benefit is terribly hard to estimate without runtime data, and which are therefore not enabled by default at any compiler optimization level; the same goes for loop vectorization, although that is turned on at -O3 in GCC. Compilers have improved a lot in this area, but the only way to really know the compiler will make the right choice is to either unroll manually (like in the good ole days) or provide the compiler with runtime data (profile/feedback optimization).

    As for -O3 generating faulty code, I haven't come across that in a long time; do you have any fresh examples? And yes, any benchmark which doesn't validate that its results are correct is indeed worthless.
    This is an (incomplete) list of Gentoo ebuilds that filter -O3 because of invalid code generation:
    app-arch/ppmd
    app-emulation/xen
    dev-ada/asis-gcc
    dev-ada/asis-gpl
    dev-lang/python
    dev-scheme/guile
    dev-util/valgrind
    app-editors/vim
    games-emulation/visualboyadvance
    games-fps/duke3d
    games-strategy/asc
    kde-base/kdm
    media-libs/ming
    media-sound/gnomad
    media-video/kaffeine
    sci-electronics/alliance
    sci-visualization/opendx
    sys-freebsd/freebsd-sbin
    sys-fs/evms
    sys-libs/libsmbios
    sys-process/procps
    www-plugins/nspluginwrapper
    x11-libs/gtk+

    This does not necessarily mean that if you try to build any of these packages with the latest gcc and -O3 the resulting binary will be completely broken; the breakage might only occur with older gcc versions, only on some arches, or even only in some corner use cases. The thing is, for a flag to be filtered, someone must have reported a bug about the resulting binary not working properly while it works when compiled with -O2.
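
    For the curious, the filtering itself happens in the ebuild via Gentoo's flag-o-matic eclass; a sketch (filter-flags and replace-flags are the eclass helpers, the ebuild fragment itself is hypothetical):

        # Pull in the CFLAGS helpers and strip the problematic level
        inherit flag-o-matic

        src_configure() {
            filter-flags -O3          # drop -O3 if the user set it...
            # replace-flags -O3 -O2   # ...or downgrade it instead
            econf
        }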

    That doesn't mean -O2 is perfectly safe; there are plenty of packages filtering -O2 or even -O1 on some arches, but usually when one optimization level generates bad code, all the higher levels will generate bad code as well.

    Long story short: in theory, optimization should not affect what the code does, but every gcc branch has known bugs, and -O3, being not officially recommended and not as widely used as -O2, will contain even more undiscovered ones.

    P.S. It might be that locovaca was referring to the effects of -ffast-math when saying -O3 code will compute 2+2=5, but that's not enabled by -O3; only -Ofast enables -ffast-math, which will break almost any program that is not a game or multimedia codec.
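
    A minimal sketch of the kind of breakage meant here (illustrative, not taken from any of the listed packages): -ffast-math implies -ffinite-math-only, so GCC is allowed to assume NaN never occurs and can fold the classic self-comparison NaN test into a constant:

        /* Compile with: gcc -O2 -ffast-math -c nan_check.c
           Under -ffinite-math-only the (x != x) test may be optimized
           to always return 0, silently breaking the check. */
        int is_nan(double x)
        {
            return x != x;   /* true only for NaN -- unless fast-math */
        }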
    Last edited by Ansla; 12-04-2011 at 11:56 AM.

  9. #19
    Join Date
    Jan 2007
    Posts
    459

    Default

    Quote Originally Posted by curaga View Post
    Re auto-vectorizing: I recently looked at GCC's generated asm and it did better than what I would've done by hand (I had targeted SSE, though GCC used SSE2). Very nice to have the compiler do that successfully.
    Indeed it is, when it works for your case. It might be a bit of fun for you and others reading to git pull the current x264, turn off the assembly flag, and then look at the code generated from the fallback C routines compared to the tried-and-trusted, fully benchmarked assembly routines.

    And I'm sure Loren Merritt, Dark Shikari, Ronald S. Bultje, and many other assembly devs over on x264-dev IRC can give you lots of real-life broken-routine cases, if anyone cares to fix them in a given compiler. Ansla's partial list is also interesting.

  10. #20
    Join Date
    Oct 2009
    Posts
    845

    Default

    Quote Originally Posted by popper View Post
    Indeed it is, when it works for your case. It might be a bit of fun for you and others reading to git pull the current x264, turn off the assembly flag, and then look at the code generated from the fallback C routines compared to the tried-and-trusted, fully benchmarked assembly routines.
    Certainly hand-optimized assembly written by an expert in both assembly and the subject at hand (as they certainly are when it comes to video encoding) will beat any compiler. However, for us mere mortals, GCC and others likely generate far better vectorization than we could do ourselves.

    Quote Originally Posted by popper View Post
    And I'm sure Loren Merritt, Dark Shikari, Ronald S. Bultje, and many other assembly devs over on x264-dev IRC can give you lots of real-life broken-routine cases, if anyone cares to fix them in a given compiler. Ansla's partial list is also interesting.
    Perhaps, but I remember the compiler smackdowns over at 'Breaking Eggs And Making Omelettes' for ffmpeg, which were all done with assembly optimization turned off, and the only compiler I recall generating broken code was LLVM, which back then was a lot less mature than it is now.
