11-Way Intel Ivy Bridge Compiler Comparison
Originally posted by AnonymousCoward: Example: bsnes, a SNES emulator that emulates various special chips — you have to run a set of games that actually use all of them to get everything optimized. Maybe "a load covering all performance-relevant code paths" is more appropriate (and longer), and I guess running them is easier for non-interactive programs.
At least that's what I did when I benchmarked MAME: I wrote an ugly bash script which launched you into MAME, where you had to manually run each game at least a bit into the attract/gameplay sequence (good thing arcade games always have one) while profiling. Then, when you quit, it started benchmarking the ROMs from a user-defined list (which should correspond to the ROMs you ran in profiling mode, else it's kinda pointless). The PGO builds always surpassed the non-PGO builds in performance, sometimes severely (~20%+).
Generally, though, programs don't really have such varied code paths, and when they do, it can often be dealt with easily: rendering a test scene in Blender that includes fluid physics, hair, textures, different styles of lamps, etc. would reach the code paths of all those separate features, as would running a script in GIMP that cycles through different filters.
That said, PGO really only matters for CPU-intensive stuff, and the time you spend recompiling a program with PGO only matters if the extra performance will make a difference: Firefox running more smoothly with PGO, an emulator running at 100% instead of 70%, a compressor/encoder/renderer/compiler you use very regularly being made ~15-20% faster, etc. These are likely good 'end-user' reasons for recompiling something with PGO; it's more convenient (if perhaps a little less performant) when it's done by your upstream binary packager.
Then of course there are those who, like me, just find compiler optimization technology fascinating and benchmark purely out of interest. Beats collecting stamps (though probably just barely) ;D
Originally posted by XorEaxEax: However, I agree that this lies outside the scope of a test suite like Phoronix's, as it would be quite unrealistic to expect Michael to write automatic profiling scripts for these tests.
It's a myth that you have to stress every path for PGO to be effective* - you just need to stress the most statistically significant path, and you'll most likely always get a speed boost.
* - for the vast majority of software
Last edited by kiputnik; 31 May 2012, 06:53 AM.
Intel Compiler
Originally posted by curaga: You've been under a rock, perhaps? Binaries produced by ICC will suck on AMD, VIA and anything else non-Intel x86. See Agner's site for insightful explanations, including benchmarks where he changes his VIA to identify itself as Intel.
icc -msse2 [...]
It works fine. Depending on the code I compile, icc CAN produce binaries which are faster on AMD than the same code compiled with GCC. But I also have some code which is slower with icc than with gcc, even on Intel hardware.
@Michael:
Concerning the availability of the Intel compiler: Michael, I guess you may not use the non-commercial version for testing, right? Maybe you could contact Intel and ask for permission. I consider the Intel compiler one of the best compilers, so its inclusion would be sensible.
Originally posted by kiputnik: The point is - he doesn't have to. Just run the benchmark twice - once for the -fprofile-generate binary, recompile with -fprofile-use, and run it again. He already has the necessary 'representative load' required.
It's a myth that you have to stress every path for PGO to be effective* - you just need to stress the most statistically significant path, and you'll most likely always get a speed boost.
* - for the vast majority of software
I have found that sometimes running PGO on exactly the same benchmark creates a considerable speed-up, but the resulting executable is slower at other jobs. This is not surprising, as you are tuning for one specific benchmark (or a small set of benchmarks).
To test PGO fairly, you really need a range of "standard" instances which avoid over-tuning. This really has to come from the original program authors, I would say; some of them include a 'make profiled' target, and some don't.
Originally posted by curaga: Hm, having to often wait for GIMP filters, I wonder if it supports that (an automatic PGO build, testing all filters)? Any GIMP people around?
Originally posted by uid313: GIMP will improve filter performance using the GEGL library, which is OpenCL-powered and provides hardware acceleration.
But there's a CPU downside too: OpenCL code written for GPUs is constrained by the GPU programming model, so when run on a CPU it is likely to be slower than code written natively for the CPU (pthreads, et al.).
Originally posted by elanthis: Except that GCC has always been "stupid rubbish shit" -- and has _intentionally_ been that way due to RMS's paranoia -- except for the barely-relevant part where it produces faster binaries than irrelevant compilers almost nobody uses (Open64) or a compiler that's practically an infant in comparison (Clang/LLVM). Clang matches the performance it took GCC 25 years to achieve, not to mention that it has an equivalent level of language conformance and features (again, going from zero to that complete in a tiny fraction of the time it took GCC), plus the so-freaking-awesome toolset support it enables that GCC goes out of its way to make impossible to write.
In the few cases where binary performance in a few specialized micro-benchmarks actually matters, it's worth noting that GCC is still not even top dog, so it has the unpleasant distinction of being neither the faster compiler nor the more featureful, flexible, maintainable, extensible compiler. The only crown it can hold is "most popular compiler for UNIX systems." Yay.
Without Clang, the world of Open Source compilers would be stuck forever with glorified Notepad apps (Vim, Emacs) and a practically tools-free development environment. With Clang, the FOSS scene actually has a chance to start playing catch-up with Visual Studio / VAX. There's a chance to have actually useful code completion (real-time, no need to regenerate ctags and wait 5 minutes for it to complete), to have powerful code refactoring (nobody but VS/VAX has this yet, which is why it's so important for FOSS to catch up), and most importantly to have a compiler that provides a valid test ground for new language extensions and features to propose to the relevant committees (GCC is a nightmare to extend, maintain, learn, or improve; only a small handful of people can deal with its horrific internals). This is of course why just about every company on the planet with an interest in C/C++ has gotten involved with Clang: it is a massive improvement on all fronts that _actually matter_, and the non-issue of compiled-binary performance can be improved as time goes on (and again, it has improved at a much, MUCH faster rate than GCC has).
But thanks anyway for your input as a non-developer fanboy. The world would be such a worse place without your clueless rants and abuse of fonts.
1.) It's not like Core i7 and Bulldozer have existed for 25 years and GCC is only now catching up. GCC has grown and supported every hardware generation that has ever existed (or at least been conceived by a human being) relatively close to its release, and it has always offered very competitive performance with very few exceptions; even comparing the old GCC 2.95 against the ICC of the same age shows GCC second only to ICC, and not by much.
2.) Clang is indeed catching up fast, but it is not in the same league as GCC; you are comparing apples to tires here. To begin with, Clang/LLVM supports only a very minimal subset of platforms compared to GCC, and Clang is barely used today, so it has no backward compatibility to maintain with anything and can clean up its code as much as it wants without hassle.
3.) GCC is the most used compiler in the universe; not even Visual Studio or ICC comes close, and it's not tied to Unix-like systems either, LOL. But for this very reason it has to maintain a crapload of backward-compatibility code for many platform/OS combinations, which you may think is stupid but which is vital to many massive companies and institutions around the world (not all people use compilers for the desktop, you know!).
4.) GCC has always been creative and efficient at integrating new technologies into the compiler that achieve real-world performance/efficiency on all the platforms GCC supports (when possible, of course): C/C++11, atomics, SSA, PGO, branch prediction, CPU features, LTO, IPA, profiling, OpenMP, etc. Even today these serve as examples for other compilers, including Clang (so it's not like Clang reinvented LTO; they used GCC as the base example and polished their own code later).
5.) It's true that GCC is extremely complex inside, but that's not because GCC devs are Visual Studio tards and Clang people are Einstein-like geniuses; it's for the reasons I mentioned before. For example, it is not the same to do an IPA pass when you only have to support x86 as when you have to deal with the quirks and specifics of 15+ platform/OS combinations while carrying 25 years of backward compatibility on your back.
6.) GCC is really efficient as a compiler, as already stated, but I admit that from a developer's perspective it lacks a lot of the eye candy in its output that Clang offers, which is really helpful; it's not like I can't develop without it, though.
7.) Tools-free dev environment? Emacs? WTF!! LOL. To begin with, there are like a zillion IDE/RAD environments (KDevelop, Qt Creator, NetBeans, Anjuta, Monkey Studio, etc.), not to mention GDB, Valgrind, etc. BTW, genius, Visual Studio is two separate pieces of software, LOL: one is a COMPILER and the other is a RAD/IDE environment (depending on the language, of course). You can develop in VS using CMake and then compile using GCC, or via plugins you can bypass the VC compiler entirely and use ICC, for example. So yes, the Clang compiler's output is closer to the VC compiler's, and the Linux IDEs/RADs (which don't depend on GCC) miss some features compared to the VS IDE/RAD, but Clang is not an IDE/RAD.
8.) Clang's parse tree is also more suitable for supporting automated code refactoring, but it's not like GCC can't do it, and code refactoring is the IDE/RAD's job anyway and is mostly compiler/language-independent.
9.) GCC was started many years ago with the technology available at the time, and given all the massive growth it has seen since then and the fact that it has somehow become almost an industry standard in many sectors, it's reasonable to expect it to be massive enough that invasive changes are really hard. This will eventually happen to LLVM/Clang too, and to any other software. Look at Apache, for example: sure, nginx/lighttpd are awesome and Apache is quite bloated if you ask me, but Apache is an industry standard, so they can't just change stuff without massive care and years of warning so people can decide to upgrade; besides, they need to maintain previous versions for many years too (a lot of big enterprise software still requires Apache 1.3, for example).
Clang is a very nice project, and in a few years it will be important and big enough to compete with GCC, but for now GCC is the most powerful/efficient OSS compiler, and the truth is that the only compiler superior to it is ICC, and not just in some benchies but in true real-world performance (again, industry-class software, not some GNOME applet). And nobody is stopping you from using Clang for development and compiling your final builds with GCC to get the best of both worlds, so no need to go all IANAL about it.
Originally posted by kiputnik: The point is - he doesn't have to. Just run the benchmark twice - once for the -fprofile-generate binary, recompile with -fprofile-use, and run it again. He already has the necessary 'representative load' required.
I personally think of any optimizations that lie outside the standard -On levels as rather 'exotic' and not necessarily part of a superficial benchmark like that of Phoronix OpenBenchmarking. I'm just happy if we no longer have insane -mtune=k8 tunings on Intel CPU tests, nor pointless -O0/-O1 optimization settings.
Originally posted by kiputnik: It's a myth that you have to stress every path for PGO to be effective* - you just need to stress the most statistically significant path, and you'll most likely always get a speed boost.
* - for the vast majority of software