Originally posted by Raka555
And yes, there is usually an even greater improvement with hand-tweaked assembly (or compiler intrinsics). In some cases the vectorized inner loop can saturate the CPU's RAM bandwidth. At that point the only further gains come from moving the code to a platform with more memory bandwidth, such as a Xeon.
Seriously, there's a reason that running AVX2 code heats up the CPU so much: it is getting a LOT OF WORK DONE. It's worth it, and the same is true of the older extensions like MMX, SSE, SSE2, etc.
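To make the intrinsics point concrete, here is a rough sketch of the same loop written twice: once as plain C and once with AVX2 intrinsics. This is my own illustration, not code from any particular project. The function names `add_i32_scalar` and `add_i32_simd` are made up, and the AVX2 path is only compiled in when the compiler defines `__AVX2__` (for example with `-mavx2` or a suitable `-march=native` on GCC/Clang); otherwise it falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>  /* AVX2 intrinsics, only pulled in when available */
#endif

/* Plain scalar loop: one addition per iteration (unless the compiler
 * auto-vectorizes it). Hypothetical helper name. */
void add_i32_scalar(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Hand-vectorized version: eight 32-bit additions per iteration when
 * built with AVX2, plus a scalar tail for the leftover elements.
 * Hypothetical helper name. */
void add_i32_simd(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
#ifdef __AVX2__
    for (; i + 8 <= n; i += 8) {
        /* Unaligned loads, so the caller does not need aligned buffers. */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
#endif
    /* Tail (or the whole array when AVX2 is not enabled). */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

A loop this simple does almost no arithmetic per byte loaded, so on large arrays it is exactly the kind of code that tends to run into the memory bandwidth limit rather than the ALU limit. Whether the intrinsics version beats what the compiler auto-vectorizes is something you would have to measure on your own machine.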
Besides vectorization, compiler optimizations add up. Gains compound: a 5% boost from one thing and a 10% boost from another already multiply out to about 15.5% (1.05 × 1.10 = 1.155), and with a few more like that, before you know it the software is running 30% faster.
Obviously, more modern CPU instructions work better than old ones. Otherwise we'd still be running 8-bit and 16-bit code and doing floating point with integer math, just like in the 1980s. Will anyone argue that recompiling 1980s 8086 code with -march=native is a waste of time and that it's not a "spectacular difference"? The changes do add up.
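If you want to check what -march=native actually turned on for a given build, one quick way is to look at the predefined macros the compiler sets. The sketch below assumes GCC or Clang on x86 (other compilers define a different set of macros), and the function name `compiled_simd_level` is just something I made up for illustration.

```c
/* Reports the highest x86 SIMD level this translation unit was compiled
 * for, based on macros that GCC and Clang predefine from the -m / -march
 * flags. It describes the build, not the CPU it later runs on.
 * Hypothetical helper name. */
const char *compiled_simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline (no x86 SIMD macro defined)";
#endif
}
```

Build it once with plain `-O2` and once with `-O2 -march=native`, print the result from each, and you can see whether the second build is really targeting newer instructions than the first.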