Compiler Benchmarks Of GCC, LLVM-GCC, DragonEgg, Clang

Death Knight replied

28 April 2013, 09:40 PM
I believe these tests need to be rerun with recent gcc and clang versions, like
gcc4.7 vs clang3.2
gcc4.8svn vs clang3.3svn
I wonder what the results would be.
XorEaxEax replied

26 November 2010, 03:33 AM
Weird, why would PGO help more with the asm version? It should be the exact opposite imo, since the compiler can't optimize that hand-written assembly in any way, but it should at least be able to do some optimizing with the C code. Anyway, I can understand why you wouldn't want to redo the tests, since the argument was about hand-optimized assembly vs compiler-generated code. Maybe I'll give it a shot myself, since I'm curious about what Shikari said. Again, thanks for the benchmarks!
Ranguvar replied

26 November 2010, 03:17 AM
Originally posted by XorEaxEax

Thanks for the benchmarks. The asm vs compiler-generated ratio is pretty much as expected, but if I'm reading this correctly the PGO versions are not faster (even slightly slower!?), which means you are not getting it to work properly. You need to run the PGO versions through an encoding and then recompile so that the compiler can use the generated runtime data. As I recall there is a semi-automated framework for this in x264; I'll see if I can find some proper instructions and redo the PGO tests myself (unless you would like to). Even with all assembly optimizations enabled, using PGO gave another 5% performance increase in total according to 'Dark Shikari', so PGO isn't working in your tests.

I talked with Dark Shikari on #x264 about the results. He said A.) PGO would help more with the hand asm, since it apparently does not benefit what the pure C build spends most of its time doing (DSP functions), and B.) I screwed something up with the PGO, because it should be ~1% faster. I did the build correctly from what I can tell (make fprofiled VIDS="videohere.y4m"), but I couldn't be bothered to recompile, retest, etc. to confirm a ~1% performance increase.
XorEaxEax replied

26 November 2010, 02:31 AM
Thanks for the benchmarks. The asm vs compiler-generated ratio is pretty much as expected, but if I'm reading this correctly the PGO versions are not faster (even slightly slower!?), which means you are not getting it to work properly. You need to run the PGO versions through an encoding and then recompile so that the compiler can use the generated runtime data. As I recall there is a semi-automated framework for this in x264; I'll see if I can find some proper instructions and redo the PGO tests myself (unless you would like to). Even with all assembly optimizations enabled, using PGO gave another 5% performance increase in total according to 'Dark Shikari', so PGO isn't working in your tests.
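
For reference, the generic GCC flow described here (instrumented build, training run, rebuild with the collected profile) looks roughly like the sketch below. This is not x264's own build machinery: `encoder.c`, `encoder` and `sample.y4m` are made-up placeholders, and the commands are only printed, not executed.

```python
# Sketch of a generic two-pass GCC PGO build, expressed as command lists.
# File names and the training argument are hypothetical placeholders.

def pgo_steps(source, binary, training_args):
    """Return the three commands of a profile-guided build, in order."""
    # Pass 1: build an instrumented binary that records runtime behaviour.
    instrumented = [
        "gcc", "-O2", "-fprofile-generate", source, "-o", binary,
    ]
    # Training run: executing the instrumented binary on representative
    # input writes the profile data (.gcda files).
    training_run = ["./" + binary] + list(training_args)
    # Pass 2: recompile, letting GCC read the profile data.
    optimized = [
        "gcc", "-O2", "-fprofile-use", source, "-o", binary,
    ]
    return [instrumented, training_run, optimized]


steps = pgo_steps("encoder.c", "encoder", ["sample.y4m"])
for step in steps:
    print(" ".join(step))
```

If the training run is skipped, or the second compile does not find the profile data, the "PGO" binary ends up being an ordinary build, which would fit results that show no gain.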
Ranguvar replied

25 November 2010, 08:20 PM
Testing done.

Here is the summary: http://ix.io/1h1

And here are the logfiles: http://ompldr.org/vNmJicA/parkrun_benchmark_logs.tar.gz

In conclusion, x264's hand-written assembly makes encoding 2.4x-5.8x faster, with a larger improvement when performing more complex encoding.

There's your evidence. Feel free to perform your own tests.
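
For anyone repeating the test, a speedup figure like the ones above is just the ratio of the two encoding rates. A minimal sketch, assuming you have the frames-per-second value for each build; the function name and the numbers are invented for illustration and are NOT taken from the linked logs:

```python
def speedup(fps_asm, fps_no_asm):
    """Return how many times faster the asm build encodes."""
    if fps_no_asm <= 0:
        raise ValueError("fps_no_asm must be positive")
    return fps_asm / fps_no_asm


# Made-up frame rates, purely to show the arithmetic.
example = speedup(fps_asm=30.0, fps_no_asm=10.0)
print(f"{example:.1f}x")
```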
XorEaxEax replied

21 November 2010, 02:01 AM
Please do run that test, Ranguvar, since I'm interested in seeing the difference in performance. Given that x264 has a 'fprofile' option to compile it with PGO, is there any chance you would do a test with that and '--disable-asm' to see how much it differs from a standard compile with '--disable-asm'?
Ranguvar replied

21 November 2010, 01:23 AM
Originally posted by energyman

so you have no evidence at all?

anecdotal evidence does not count.

Besides, which cpu manufactured in the last 12 years does not have mmx?

I know that does not count as 'real' evidence. But I don't see you jumping to do a test. Hell, you know what -- you've got me motivated. I'll post back tomorrow or the next day with the results. I'll use the 'parkrun' clip (http://media.xiph.org/video/derf/), constant quality mode with two or three different --preset options, with and without --disable-asm. I use a Q6600 CPU with 6GiB of RAM on Arch Linux, GCC 4.5.1. If that is not to your satisfaction, let me know.

ARM CPUs, perhaps (which they are slowly adding some assembly for)? I do not see the point of that comment.
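
The test plan above (one clip, a few presets, with and without assembly) could be laid out as a small command matrix, roughly as sketched below. The binary paths `./x264-asm` and `./x264-noasm` (two builds, the second configured with `--disable-asm`), the clip file name, the preset list and the CRF value are all assumptions for illustration; the script only prints the commands and leaves the timing to you.

```python
import itertools

# Hypothetical paths: two x264 builds, one configured normally and one
# configured with --disable-asm. Presets and CRF value are assumptions.
BUILDS = {"asm": "./x264-asm", "noasm": "./x264-noasm"}
PRESETS = ["fast", "medium", "slow"]


def benchmark_commands(clip, crf=23):
    """Yield (build, preset, argv) for every build/preset combination."""
    for (name, binary), preset in itertools.product(BUILDS.items(), PRESETS):
        argv = [binary, "--preset", preset, "--crf", str(crf),
                "-o", "/dev/null", clip]
        yield name, preset, argv


runs = list(benchmark_commands("parkrun.y4m"))
for name, preset, argv in runs:
    print(name, preset, " ".join(argv))
```

Keeping the clip, preset and quality setting identical across both builds means any difference in encoding speed comes from the assembly alone.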
smitty3268 replied

20 November 2010, 12:57 AM
There's no question hand tuned assembly speeds up codecs a lot. All you have to do to test this if you don't believe it is to run the test yourself. I remember a while ago Ubuntu shipped with a mis-configured xvid library, with all the assembly code disabled, and it ran at about 1/3rd the normal speed. Every codec will be different, of course, but given how much work has gone into x264 I would imagine the difference there would be even greater.

Compilers still aren't very good at utilizing SSE instruction sets automatically, and even when they do, they tend to target only one specific instruction set, while hand-tuned code can target SSE4 and still provide fallback code for older CPUs.
XorEaxEax replied

19 November 2010, 06:32 PM
Well, although I don't have any actual data to back it up with right now, from experience I will side with Ranguvar on this. Hand-optimized assembly done by an expert will generally beat that of a compiler, particularly when it comes to newer cpu extensions like SSE. The x264 devs didn't rewrite a ton of code in assembly just for fun; they benchmarked their assembly code against compiler output and found that their hand-optimized assembly performed a lot better.

However, compilers are getting better all the time, and while I think expert hand-tuned assembly will always equal or better that which is compiler generated, there will come a time when the difference is so small that it ends up being a waste of time doing it manually.

The general problem for a compiler when doing optimization is knowledge about the program. A skillful human programmer knows exactly what he is trying to achieve and will be able to make the best possible optimization decisions based upon that. The compiler does not possess that deep knowledge and also can't make assumptions. Some of this can be alleviated through the use of compiler extensions that allow you to give the compiler more detailed instructions than the language normally permits, and also through PGO (profile guided optimization), which gives the compiler a ton of runtime data with which it can 'understand' the program better and thus perform better optimizations.
energyman replied

19 November 2010, 03:57 PM
so you have no evidence at all?

anecdotal evidence does not count.

Besides, which cpu manufactured in the last 12 years does not have mmx?