Compiler Benchmarks Of GCC, LLVM-GCC, DragonEgg, Clang
-
Originally posted by energyman:
bold claims. Evidence?
If you want to do a benchmark with and without --disable-asm, be my guest. I already know what the answer will be, so I'm not too motivated to do so.
The testcase would not be "completely detached from reality" -- as I said, their C is tuned as well, and used when a CPU does not have the features that their assembly needs (sometimes only MMX, sometimes SSE4, sometimes a whole different architecture than x86).
-
Well, although I don't have any actual data to back it up right now, from experience I will side with Ranguvar on this. Hand-optimized assembly written by an expert will generally beat that of a compiler, particularly when it comes to newer CPU extensions like SSE. The x264 devs didn't rewrite a ton of code in assembly just for fun; they benchmarked their assembly code against compiler output and found that their hand-optimized assembly performed a lot better.
However, compilers are getting better all the time, and while I think expert hand-tuned assembly will always match or beat compiler-generated code, there will come a time when the difference is so small that doing it manually ends up being a waste of time.
The general problem for a compiler when doing optimization is knowledge about the program. A skillful human programmer knows exactly what he is trying to achieve and can make the best possible optimization decisions based on that. The compiler does not possess that deep knowledge, and it can't make assumptions the language doesn't guarantee. Some of this can be alleviated through compiler extensions that let you give the compiler more detailed instructions than the language normally permits, and through PGO (profile-guided optimization), which gives the compiler a ton of runtime data with which it can 'understand' the program better and thus perform better optimizations.
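To make the "more detailed instructions" point concrete, here is a minimal sketch of two such hints. It assumes GCC or Clang for `__builtin_expect` (a compiler extension); `restrict` is standard C99 rather than an extension, but it serves the same purpose. The function `add_clamped` and the guess that saturation is rare are made up for illustration and are not taken from x264.

```c
#include <stddef.h>

/* GCC/Clang extension: tell the compiler which way a branch usually goes. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/*
 * `restrict` promises that dst and src never overlap. Without that
 * promise the compiler has to assume a store to dst[i] might change
 * src[i + 1], which can block vectorization of the loop.
 */
void add_clamped(unsigned char *restrict dst,
                 const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        /* Assumption for this sketch: saturation is the rare case. */
        if (UNLIKELY(sum > 255))
            sum = 255;
        dst[i] = (unsigned char)sum;
    }
}
```

Neither hint changes what the function computes. They only hand the compiler knowledge that the programmer has and the language alone does not express.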
-
There's no question hand tuned assembly speeds up codecs a lot. All you have to do to test this if you don't believe it is to run the test yourself. I remember a while ago Ubuntu shipped with a mis-configured xvid library, with all the assembly code disabled, and it ran at about 1/3rd the normal speed. Every codec will be different, of course, but given how much work has gone into x264 I would imagine the difference there would be even greater.
Compilers still aren't very good at utilizing SSE instruction sets automatically, and even when they do use them, they tend to target only one specific instruction set, whereas hand-tuned code can target SSE4 while still providing fallback code for older CPUs.
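The "fast path plus fallback" idea can be sketched roughly as below. This is not x264's code: it uses SSE2 intrinsics instead of hand-written SSE4 assembly to keep the example short, it assumes GCC or Clang on x86 for `__builtin_cpu_supports` and the `target` attribute, and the names `sad_c`, `sad_sse2` and `select_sad` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Portable C fallback: sum of absolute differences over n bytes. */
static uint32_t sad_c(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

#if HAVE_X86_DISPATCH
/* SSE2 version: PSADBW processes 16 bytes per instruction. */
static __attribute__((target("sse2")))
uint32_t sad_sse2(const uint8_t *a, const uint8_t *b, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    /* acc holds two 64-bit partial sums; add them together. */
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(acc)
                 + (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
    return sum + sad_c(a + i, b + i, n - i); /* scalar tail */
}
#endif

typedef uint32_t (*sad_fn)(const uint8_t *, const uint8_t *, size_t);

/* Pick, at run time, the best implementation the CPU supports. */
sad_fn select_sad(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return sad_sse2;
#endif
    return sad_c;
}
```

A caller would run `select_sad()` once at startup and keep the returned pointer. One binary then runs on any CPU and still uses the faster path where it is available, which a single compile-time instruction-set flag cannot do.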
-
Originally posted by energyman:
so you have no evidence at all?
anecdotal evidence does not count.
Besides, which cpu manufactured in the last 12 years does not have mmx?
ARM CPUs, perhaps (which they are slowly adding some assembly for)? I do not see the point of that comment.
-
Please do run that test, Ranguvar, since I'm interested in seeing the difference in performance. Given that x264 has a 'fprofile' option to compile it with PGO, is there any chance you could also test that together with '--disable-asm', to see how much it differs from a standard compile with '--disable-asm'?
-
Testing done.
Here is the summary: http://ix.io/1h1
And here are the logfiles: http://ompldr.org/vNmJicA/parkrun_benchmark_logs.tar.gz
In conclusion, x264's hand-written assembly speeds up encoding by 2.4x-5.8x, with a larger improvement when performing more complex encoding.
There's your evidence. Feel free to perform your own tests.
-
Thanks for the benchmarks. The asm vs. compiler-generated ratio is pretty much as expected, but if I'm reading this correctly the PGO versions are not faster (even slightly slower!?), which means you are not getting it to work properly. You need to run the PGO build through an encode and then re-compile so that it can use the generated runtime data. As I recall there is a semi-automated framework for this in x264; I'll see if I can find some proper instructions and redo the PGO tests myself (unless you would like to). Even with all assembly optimizations enabled, using PGO gave another 5% total performance increase according to 'Dark Shikari', so PGO isn't working in your tests.
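For anyone who hasn't used PGO, the build / run / rebuild cycle described above looks roughly like this with plain GCC. This is a generic sketch and not x264's own build procedure. The file name `quant.c`, the macro `PGO_DEMO_MAIN` and the function `count_significant` are hypothetical; only the `-fprofile-generate` and `-fprofile-use` flags are real GCC options.

```c
/*
 * Generic GCC PGO cycle (file and binary names are made up):
 *
 *   1. Instrumented build:
 *      gcc -O2 -DPGO_DEMO_MAIN -fprofile-generate quant.c -o quant_instr
 *   2. Representative run, which writes the profile (.gcda) data:
 *      ./quant_instr
 *   3. Rebuild using that profile:
 *      gcc -O2 -DPGO_DEMO_MAIN -fprofile-use quant.c -o quant_pgo
 *
 * If step 2 is skipped, step 3 has no profile to work with and the
 * result is no faster than an ordinary build.
 */
#include <stddef.h>

/*
 * Count coefficients whose magnitude exceeds a threshold. How often the
 * branch is taken depends on the input data, which is exactly the kind
 * of fact a profile records and a static compile cannot know.
 */
size_t count_significant(const int *coef, size_t n, int threshold)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int mag = coef[i] < 0 ? -coef[i] : coef[i];
        if (mag > threshold)
            kept++;
    }
    return kept;
}

#ifdef PGO_DEMO_MAIN
#include <stdio.h>
int main(void)
{
    int coef[] = { 0, -1, 12, 3, -40, 0, 2, 7 };
    printf("%zu\n", count_significant(coef, sizeof coef / sizeof coef[0], 4));
    return 0;
}
#endif
```

For an encoder, the training run in step 2 would be an actual encode of representative video, since the profile is only as good as the workload that produced it.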