It would be more interesting to compare that hand-written asm with GCC-generated code. Unless that is done, there is no reason to turn off assembly just to create a test case that is completely detached from reality.
x264's handwritten asm absolutely destroys GCC's. It's not even close. At least half of all semi-recent x264 development has gone into the assembly, and the developers are well known for trashing GCC-generated (and most other compiler-generated) assembly. They tune their C as well, but no compiler could compare to what they've done in asm.
Years of encoding videos with AviSynth and x264, reading Doom9 forums, and sitting in #x264 and #x264-dev.
If you want to do a benchmark with and without --disable-asm, be my guest. I already know what the answer will be, so I'm not too motivated to do so.
The testcase would not be "completely detached from reality" -- as I said, their C is tuned as well, and used when a CPU does not have the features that their assembly needs (sometimes only MMX, sometimes SSE4, sometimes a whole different architecture than x86).
Well, although I don't have any actual data to back it up with right now, from experience I will side with Ranguvar on this. Hand-optimized assembly done by an expert will generally beat that of a compiler, particularly when it comes to newer CPU extensions like SSE. The x264 devs didn't rewrite a ton of code in assembly just for fun; they benchmarked their assembly code against compiler output and found that their hand-optimized assembly performed a lot better.
However, compilers are getting better all the time, and while I think expert hand-tuned assembly will always equal or better that which is compiler-generated, there will come a time when the difference is so small that it ends up being a waste of time doing it manually.
The general problem for a compiler when doing optimization is knowledge about the program. A skillful human programmer knows exactly what he is trying to achieve and will be able to make the best possible optimization decisions based upon that. The compiler does not possess that deep knowledge and also can't make assumptions. Some of this can be alleviated through the use of compiler extensions that allow you to give the compiler more detailed instructions than the language normally permits, and also PGO (profile-guided optimization), which gives the compiler a ton of runtime data with which it can 'understand' the program better and thus perform better optimizations.
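To make the point about giving the compiler extra knowledge concrete, here is a minimal sketch in C. It is not taken from x264; the function name `avg_pixels` and the `UNLIKELY` macro are made up for illustration. It uses the C99 `restrict` qualifier to promise that the buffers never overlap, and GCC's `__builtin_expect` to mark the error path as rare.

```c
#include <stddef.h>
#include <stdint.h>

/* __builtin_expect is a GCC extension; fall back to a no-op elsewhere. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Hypothetical helper (an assumption for this example, not x264 code):
 * rounded average of two pixel rows.
 * "restrict" promises the compiler that dst, a and b never overlap,
 * which it could not prove on its own from the pointer types. */
int avg_pixels(uint8_t *restrict dst, const uint8_t *restrict a,
               const uint8_t *restrict b, size_t n)
{
    /* Hint: NULL arguments are the rare case, keep them off the hot path. */
    if (UNLIKELY(!dst || !a || !b))
        return -1;

    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    return 0;
}
```

Whether a given compiler version actually produces better code from these hints is something you would have to measure; the sketch only shows how the extra information is expressed.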
There's no question hand-tuned assembly speeds up codecs a lot. If you don't believe it, all you have to do is run the test yourself. I remember a while ago Ubuntu shipped with a misconfigured xvid library, with all the assembly code disabled, and it ran at about 1/3rd the normal speed. Every codec will be different, of course, but given how much work has gone into x264 I would imagine the difference there would be even greater.
Compilers still aren't very good at utilizing SSE instruction sets automatically, and even when they do use them they tend to target only a specific instruction set, while the hand-tuned code can target SSE4 while still providing fallback code for older CPUs.
Besides, which CPU manufactured in the last 12 years does not have MMX?
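The "target SSE4 but keep a fallback" idea is usually done by picking a function pointer once at startup. Below is a rough sketch in C under stated assumptions: the flag names, `select_sad` and `sad_sse4_standin` are all hypothetical, and the "SSE4" routine is just a C stand-in so the example builds on any machine. In a real codec that slot would hold a hand-written assembly routine and the flags would come from CPUID.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical capability bits; real code would fill these in from CPUID. */
enum { CPU_MMX = 1 << 0, CPU_SSE2 = 1 << 1, CPU_SSE4 = 1 << 2 };

typedef int (*sad_fn)(const uint8_t *a, const uint8_t *b, int n);

/* Plain C fallback: sum of absolute differences, works on every CPU. */
int sad_c(const uint8_t *a, const uint8_t *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

/* Stand-in for a hand-written SSE4 routine. It simply calls the C version
 * here; only the dispatch mechanism is being illustrated. */
int sad_sse4_standin(const uint8_t *a, const uint8_t *b, int n)
{
    return sad_c(a, b, n);
}

/* Pick the best implementation the CPU supports, falling back to C. */
sad_fn select_sad(unsigned cpu_flags)
{
    if (cpu_flags & CPU_SSE4)
        return sad_sse4_standin;
    return sad_c;
}
```

The caller does something like `sad_fn sad = select_sad(flags);` once and then calls `sad(a, b, n)` everywhere, so one binary can serve both old and new CPUs.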
I know that does not count as 'real' evidence. But I don't see you jumping to do a test. Hell, you know what -- you've got me motivated. I'll post back tomorrow or the next day with the results. I'll use the 'parkrun' clip (http://media.xiph.org/video/derf/), constant quality mode with two or three different --preset options, with and without --disable-asm. I use a Q6600 CPU with 6GiB of RAM on Arch Linux, GCC 4.5.1. If that is not to your satisfaction, let me know.
ARM CPUs, perhaps (which they are slowly adding some assembly for)? I do not see the point of that comment.
Please do make that test, Ranguvar, since I'm interested in seeing the difference in performance. Given that x264 has a 'fprofile' option to compile it with PGO, is there any chance you would do a test with that and '--disable-asm' to see how much it differs from just a standard compile with '--disable-asm'?
Thanks for the benchmarks. The asm vs. compiler-generated ratio is pretty much as expected, but if I'm reading this correctly the PGO versions are not faster (even slightly slower!?), which means you are not getting it to work properly. You need to run the PGO versions through an encoding and then recompile so that the compiler can use the generated runtime data. As I recall there is a semi-automated framework for this in x264; I'll see if I can find some proper instructions and redo the PGO tests myself (unless you would like to). Even with all assembly optimizations enabled, using PGO gave another 5% performance increase in total according to 'Dark Shikari', so PGO isn't working in your tests.
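For reference, the generic GCC PGO cycle looks roughly like the sketch below. This is not x264's own build procedure, just the plain `-fprofile-generate` / `-fprofile-use` flow on a toy file. The file name `hot.c`, the function `count_above` and the `PGO_DEMO_MAIN` macro are made up for the example.

```c
/* Generic GCC PGO cycle on this file (assumed to be saved as hot.c):
 *
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-generate hot.c -o hot   (1) instrumented build
 *   ./hot                                                     (2) training run, writes .gcda data
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-use hot.c -o hot        (3) rebuild using that data
 *
 * If step 2 is skipped, step 3 has no profile to work from and you get
 * an ordinary build, which would explain seeing no speedup.
 */
#include <stdint.h>

/* Hypothetical hot loop: how often the branch is taken depends on the
 * data, which is exactly what the training run records. */
unsigned count_above(const uint8_t *px, int n, uint8_t threshold)
{
    unsigned hits = 0;
    for (int i = 0; i < n; i++)
        if (px[i] > threshold)
            hits++;
    return hits;
}

#ifdef PGO_DEMO_MAIN
#include <stdio.h>

/* Toy training workload; a real one would be a representative encode. */
int main(void)
{
    static uint8_t px[1 << 16];
    for (int i = 0; i < (1 << 16); i++)
        px[i] = (uint8_t)(i * 7);
    printf("%u\n", count_above(px, 1 << 16, 200));
    return 0;
}
#endif
```

The training run should resemble the real workload, otherwise the compiler optimizes for the wrong branches.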