Benchmarks Of GCC 4.2 Through GCC 4.7 Compilers

  • #11
    Originally posted by birdie View Post
    I'm totally confused - is this a test of GCC compilers or GCC + LLVM backend?
    The table on the first page lists LLVM backend for all GCC versions so I've no idea what to think.
    That should be a typo, since the test aims to compare different GCC versions. Also, llvm-gcc has been deprecated in favour of DragonEgg, which does the same thing against newer GCC versions (4.5 onwards, iirc) through the plugin framework.

    Originally posted by sabriah View Post
    No, it is very timely. Now the developers have a chance of addressing these issues.
    Well, the compiler developers have test suites of their own which are much more extensive than Phoronix's. As for them even considering these tests at all, I'd say that ship sailed a long time ago. Both the GCC devs and Chris Lattner (LLVM project leader) have stated that Phoronix's tests are totally worthless due to the poor test conditions under which they are done. Whatever compiler options Michael states are being used, take it with a large grain of salt, as it has been shown over and over again that he doesn't seem to know how to configure these packages correctly before testing (Himeno pressure tests using -O0, POV-Ray defaulting to tuning for AMD K8 no matter what processor is being used, etc.).
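
One way a benchmark program could guard against an accidental -O0 build is to report how it was actually compiled. A minimal sketch in C, relying on the `__OPTIMIZE__` and `__VERSION__` macros that GCC predefines; the function names `build_was_optimized` and `print_build_report` are made up for illustration and are not part of any test suite mentioned in this thread:

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the compiler defined __OPTIMIZE__
 * (GCC does this for optimizing builds), 0 otherwise (e.g. -O0). */
int build_was_optimized(void)
{
#ifdef __OPTIMIZE__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: prints a short build report so the settings
 * actually in effect end up in the benchmark log. */
void print_build_report(void)
{
#ifdef __VERSION__
    printf("compiler: %s\n", __VERSION__);
#endif
    printf("optimized build: %s\n", build_was_optimized() ? "yes" : "no");
}
```

This only tells you whether optimization was on at all, not which -march/-mtune was chosen, so it would catch the -O0 case but not the K8 tuning case.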

    And we still see it: what use is there in testing ffmpeg/x264 with assembly optimizations enabled? All the performance-critical code is out of reach for the compilers; they are pretty much left to optimize the command-line option handling, yay!


    • #12
      Originally posted by Ansla View Post
      -march=native was a great choice, but maybe -O2 would have been better than -O3; the GCC documentation recommends using -O2, as -O3 does some risky optimisations.
      I would prefer both -O2 and -O3. -O3 is aimed at producing the fastest code, but because it is hard to determine the best optimization strategy at compile time, some of the aggressive optimizations in -O3 backfire, and in those cases -O2 ends up faster. Still, if only one option is to be used, I do prefer -O3, as it is supposed to generate the fastest code.


      • #13
        Originally posted by XorEaxEax View Post
        That should be a typo, since the test aims to compare different GCC versions. Also, llvm-gcc has been deprecated in favour of DragonEgg, which does the same thing against newer GCC versions (4.5 onwards, iirc) through the plugin framework.


        Well, the compiler developers have test suites of their own which are much more extensive than Phoronix's. As for them even considering these tests at all, I'd say that ship sailed a long time ago. Both the GCC devs and Chris Lattner (LLVM project leader) have stated that Phoronix's tests are totally worthless due to the poor test conditions under which they are done. Whatever compiler options Michael states are being used, take it with a large grain of salt, as it has been shown over and over again that he doesn't seem to know how to configure these packages correctly before testing (Himeno pressure tests using -O0, POV-Ray defaulting to tuning for AMD K8 no matter what processor is being used, etc.).

        And we still see it: what use is there in testing ffmpeg/x264 with assembly optimizations enabled? All the performance-critical code is out of reach for the compilers; they are pretty much left to optimize the command-line option handling, yay!
        Indeed, but doing it properly would mean Michael finally putting in the time and effort to care about his test suite: updating it to the current ffmpeg/avconv and x264 Git code, then setting a configure switch to disable the assembly (and so fall back to the slow C routines) for cases like this test.

        Come to that, he doesn't even seem to care about running out-of-the-box ARM NEON SIMD results, now that we are into retail ARM quad cores such as the Asus Transformer Prime (Tegra 3), with several other quads from Freescale, Qualcomm, etc. in retail soon enough. Never mind all the older dual-core ARM NEON kit out there today that people and companies would like to see and compare results for.

        Given that current ffmpeg/avconv and x264 have only limited (but worth testing) NEON SIMD today, these compiler tests would be perfectly suited to cross-compiled ARM/NEON testing, as they would fall back to the C routines, perhaps show some speed improvements, and show where the "auto vectorising" needs more work. And let's face it, "auto vectorising" still needs a LOT of work, and/or better developers who can learn some real assembly and liberally apply it in their apps' code where it helps.
        Last edited by popper; 12-03-2011, 05:12 AM.


        • #14
          Originally posted by XorEaxEax View Post
          I would prefer both -O2 and -O3. -O3 is aimed at producing the fastest code, but because it is hard to determine the best optimization strategy at compile time, some of the aggressive optimizations in -O3 backfire, and in those cases -O2 ends up faster. Still, if only one option is to be used, I do prefer -O3, as it is supposed to generate the fastest code.
          A benchmark that tests code that produces possibly faulty results is worthless. If you ask O2 compiled code what 2+2 is and it says 4 in a half second while the O3 code says 5 in a quarter second, which is better?


          • #15
            Originally posted by locovaca View Post
            A benchmark that tests code that produces possibly faulty results is worthless. If you ask O2 compiled code what 2+2 is and it says 4 in a half second while the O3 code says 5 in a quarter second, which is better?
            Obviously the O3 code, once you finally realise that the "auto vectorising" code in your compiler is so badly broken (perhaps by a simple typo, etc.) that it's producing faulty output (or even just slow, prototype-speed code) and needs fixing ASAP. But then devs should be checking their routines' speed improvements down to the picosecond, as it all adds up to lost time and efficiency.
            Last edited by popper; 12-04-2011, 01:17 AM.


            • #16
              Originally posted by locovaca View Post
              A benchmark that tests code that produces possibly faulty results is worthless. If you ask O2 compiled code what 2+2 is and it says 4 in a half second while the O3 code says 5 in a quarter second, which is better?
              The fact that -O3 doesn't always generate faster code than -O2 does not mean the code is broken. It just means that the heuristics governing the more aggressive optimizations enabled at -O3 sometimes fail to correctly estimate whether an optimization will pay off, and the compiler ends up generating slower code instead.

              This is not a new thing. There are many optimizations, global loop unrolling for instance, whose benefit is terribly hard to estimate without runtime data and which are therefore not enabled by default at any optimization level. The same goes for loop vectorization, although that is turned on at -O3 in GCC. Compilers have improved a lot in this area, but the only way to really know the compiler will make the right choice is to either unroll manually (like in the good ole days) or provide the compiler with runtime data (profile/feedback optimization).
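
For readers who haven't seen manual unrolling, here is a minimal sketch in C of the same reduction written plainly and unrolled by four. The function names and the unroll factor are assumptions chosen for illustration; whether the unrolled version is actually faster depends on the compiler, the flags, and the CPU, which is exactly the estimation problem described above.

```c
#include <stddef.h>

/* Plain loop: the compiler's heuristics decide whether to unroll it. */
long sum_simple(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Manually unrolled by 4 with independent accumulators.
 * The second loop handles the 0..3 leftover elements. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

The alternative route is to leave the plain loop alone and let GCC decide with real data: build with -fprofile-generate, run a representative workload, then rebuild with -fprofile-use. -funroll-loops asks for unrolling without profile data.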

              As for -O3 generating faulty code, I haven't come across that in a long time, do you have any fresh examples? And yes, any benchmark which doesn't validate that the results are correct is indeed worthless.
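
As a rough illustration of what "validating the results" could look like, here is a minimal sketch in C that only reports a timing if the kernel's output matches an independent reference. The kernel (a sum of squares), the closed-form reference, and all function names are assumptions for the example, not taken from any real benchmark; it also assumes n is small enough that the 64-bit arithmetic does not overflow.

```c
#include <time.h>

/* Hypothetical kernel under test: sum of i*i for i in [0, n). */
unsigned long long kernel_sum_squares(unsigned long long n)
{
    unsigned long long s = 0;
    for (unsigned long long i = 0; i < n; i++)
        s += i * i;
    return s;
}

/* Independent reference via the closed form (n-1)*n*(2n-1)/6. */
unsigned long long reference_sum_squares(unsigned long long n)
{
    if (n == 0)
        return 0;
    return (n - 1) * n * (2 * n - 1) / 6;
}

/* Returns elapsed CPU seconds for the kernel, or -1.0 if the kernel's
 * result does not match the reference (the timing is then discarded). */
double timed_and_validated(unsigned long long n)
{
    clock_t start = clock();
    unsigned long long got = kernel_sum_squares(n);
    clock_t end = clock();

    if (got != reference_sum_squares(n))
        return -1.0;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A fast wrong answer then shows up as a failed run rather than as a winning score.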


              • #17
                Re autovectorizing, I recently looked at gcc's generated asm and it did better than what I would've done by hand (I had targeted SSE, and gcc used SSE2 though). Very nice to have the compiler do that successfully.
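
For context, this is the kind of loop that auto-vectorizers tend to handle well: a minimal sketch in C, not curaga's actual code. The function name `saxpy` and the use of `restrict` are choices made for the example; `restrict` promises the compiler that the two arrays do not overlap, which removes one common obstacle to vectorization.

```c
#include <stddef.h>

/* y[i] += a * x[i] for i in [0, n).
 * Simple counted loop, unit stride, no aliasing between x and y. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

With GCC, the vectorizer is controlled by -ftree-vectorize (on at -O3), and newer releases can report what was vectorized with -fopt-info-vec; comparing the output of `gcc -O3 -S` against hand-written SIMD is the comparison described above.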


                • #18
                  Originally posted by XorEaxEax View Post
                  The fact that -O3 doesn't always generate faster code than -O2 does not mean the code is broken. It just means that the heuristics governing the more aggressive optimizations enabled at -O3 sometimes fail to correctly estimate whether an optimization will pay off, and the compiler ends up generating slower code instead.

                  This is not a new thing. There are many optimizations, global loop unrolling for instance, whose benefit is terribly hard to estimate without runtime data and which are therefore not enabled by default at any optimization level. The same goes for loop vectorization, although that is turned on at -O3 in GCC. Compilers have improved a lot in this area, but the only way to really know the compiler will make the right choice is to either unroll manually (like in the good ole days) or provide the compiler with runtime data (profile/feedback optimization).

                  As for -O3 generating faulty code, I haven't come across that in a long time, do you have any fresh examples? And yes, any benchmark which doesn't validate that the results are correct is indeed worthless.
                  This is an (incomplete) list of Gentoo ebuilds that filter -O3 because of invalid code generation:
                  app-arch/ppmd
                  app-emulation/xen
                  dev-ada/asis-gcc
                  dev-ada/asis-gpl
                  dev-lang/python
                  dev-scheme/guile
                  dev-util/valgrind
                  app-editors/vim
                  games-emulation/visualboyadvance
                  games-fps/duke3d
                  games-strategy/asc
                  kde-base/kdm
                  media-libs/ming
                  media-sound/gnomad
                  media-video/kaffeine
                  sci-electronics/alliance
                  sci-visualization/opendx
                  sys-freebsd/freebsd-sbin
                  sys-fs/evms
                  sys-libs/libsmbios
                  sys-process/procps
                  www-plugins/nspluginwrapper
                  x11-libs/gtk+

                  This does not necessarily mean that if you build any of these packages with the latest GCC and -O3, the resulting binary will be completely broken. The breakage might only occur with older GCC versions, only on some arches, or even only in some corner cases. The thing is, for a flag to be filtered, someone must have reported a bug where the resulting binary did not work properly but did work when compiled with -O2.

                  That doesn't mean -O2 is perfectly safe: there are plenty of packages filtering -O2 or even -O1 on some arches. But usually, when an optimization level generates bad code, all the higher levels will generate bad code as well.

                  Long story short, in theory optimization should not affect what the code does, but every GCC branch has known bugs, and -O3, being not officially recommended and less widely used than -O2, will contain even more undiscovered bugs.

                  P.S. It might be that locovaca was referring to the effects of -ffast-math when saying O3 code will compute 2+2=5, but that's not enabled by -O3. Only -Ofast enables -ffast-math, which can break almost any program that is not a game or multimedia codec.
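
To make the fast-math point concrete, here is a minimal sketch in C of why reordering floating-point arithmetic can change a result. The function names and the sample values are chosen for the example. Under normal IEEE rules the compiler must keep the parentheses as written; -ffast-math lifts that restriction, so it is allowed to turn one of these functions into the other.

```c
/* (a + b) + c, evaluated in the order written. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* a + (b + c), the same values grouped differently. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* With a = 1e20, b = -1e20, c = 1.0 and strict IEEE double arithmetic:
 *   sum_left  gives 1.0  (the two large values cancel first)
 *   sum_right gives 0.0  (1.0 is lost when added to -1e20)
 */
```

Code that depends on the exact grouping (compensated summation, for instance) is what tends to break; a codec or game that tolerates a slightly different last bit usually does not notice.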
                  Last edited by Ansla; 12-04-2011, 11:56 AM.


                  • #19
                    Originally posted by curaga View Post
                    Re autovectorizing, I recently looked at gcc's generated asm and it did better than what I would've done by hand (I had targeted SSE, and gcc used SSE2 though). Very nice to have the compiler do that successfully.
                    Indeed it is, when it works for your case. It might be a bit of fun for you and others reading to git pull the current x264, turn off the assembly flag, then look at the code generated from the fallback C routines and compare it to the real, tried and trusted, fully benchmarked assembly routines.

                    And I'm sure Loren Merritt, Dark Shikari, Ronald S. Bultje, and many other assembly devs over on the x264-dev IRC channel can give you lots of real-life cases of broken routines, if anyone cares to fix them in a given compiler. Ansla's partial list is also interesting.


                    • #20
                      Originally posted by popper View Post
                      Indeed it is, when it works for your case. It might be a bit of fun for you and others reading to git pull the current x264, turn off the assembly flag, then look at the code generated from the fallback C routines and compare it to the real, tried and trusted, fully benchmarked assembly routines.
                      Certainly, hand-optimized assembly written by an expert in both assembly and the subject at hand (as they certainly are when it comes to video encoding) will beat any compiler. However, for us mere mortals, GCC and others likely generate far better vectorization than we could do ourselves.

                      Originally posted by popper View Post
                      And I'm sure Loren Merritt, Dark Shikari, Ronald S. Bultje, and many other assembly devs over on the x264-dev IRC channel can give you lots of real-life cases of broken routines, if anyone cares to fix them in a given compiler. Ansla's partial list is also interesting.
                      Perhaps, but I remember the compiler smackdowns over at 'Breaking Eggs and Making Omelettes' for ffmpeg, which were all done with assembly optimization turned off, and the only compiler I recall generating broken code was LLVM, which back then was a lot less mature than it is now.
