Nice benchmarks, thanks for your time. It definitely wasn't for nothing, despite the non-dramatic results.
In theory I believe the Gentoo camp is right - although instead of having all these options perhaps they could just use -march=native for the intended machine. I am not using Gentoo, so I don't know whether all these architecture targets are meant for filling repositories with pre-built binaries for each architecture (but from what I remember, Gentoo compiles on the fly for the architecture of the host).
In practice, there are two reasons why the performance isn't there.
1) Native functionality cannot be exploited fully because the kernel does not allow SSE/AVX in kernel code, so that it doesn't have to save and restore the SIMD register state back and forth all the time.
2) GCC probably doesn't do an excellent job with cache size differences (how can you not exploit a 64 KB L1 instruction cache versus a 32 KB one?) or with instructions that could actually be exploited, like BMI/BMI2/ADCX/ADOX/MULX, etc. Some of these newer instructions suffer in terms of optimization, and most of the time you have to hand-tune the assembly in performance-critical code.
For (1), maybe a very complex solution could be an algorithm that estimates how heavily SSE/AVX instructions will be used, saves the state of only the XMM/YMM/ZMM registers that will actually be touched, runs the SSE/AVX version of the function for the gains it gives, and then restores those registers on the way out. For crypto code they are already doing this, I guess, but without the algorithm: they just know the SSE/AVX versions are faster and save/restore the state because it's worth it.
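For reference, the save/use/restore bracketing described above is essentially what the Linux kernel's kernel_fpu_begin()/kernel_fpu_end() API does for in-kernel SIMD users like the crypto and RAID code (the decision of when it's worth it is still manual, as noted). A sketch of a hypothetical kernel-side routine - not a buildable module, just the shape of it:

```c
#include <linux/kernel.h>
#include <asm/fpu/api.h>

/* Hypothetical kernel routine: bracket the SIMD region so the kernel
 * saves and restores XMM/YMM/ZMM state only where the speedup pays
 * for the cost of the save/restore. */
static void checksum_block_simd(const u8 *buf, size_t len)
{
	kernel_fpu_begin();	/* save FPU/SIMD state; SSE/AVX now usable */
	/* ... AVX2 checksum inner loop would go here ... */
	kernel_fpu_end();	/* restore state; SSE/AVX off-limits again */
}
```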
For (2), perhaps GCC could do a better job (?).