Originally posted by Virtus
We ended up implementing the exponential function we needed by hand, fully in AVX.
We got a nice performance improvement with our hand-written 256-bit AVX implementation. Our algorithm needed roughly 50 AVX instructions per processed vector of 32-bit floats, so we handled 8 elements per instruction instead of one. Obviously it did not improve 8-fold, but it was a good bit faster than the non-vectorized implementation.
We came from a Java implementation that ran for ~45 minutes on the task at hand, and reduced that to ~12 seconds.
Other optimizations of the algorithm itself were done as well to achieve that.
It was our first endeavour into really using vector extensions at that level, and we noticed pretty quickly that even small changes made huge performance differences: non-memory-aligned data, or a single non-vectorized step in the algorithm. Worst case scenario: mixing legacy SSE2 instructions with AVX2, which tanked performance. As far as I understand it, the xmm registers (128-bit, used by SSE2) are the lower halves of the ymm registers (256-bit, used by AVX2) in the hardware, so when legacy SSE instructions execute while the upper ymm halves hold data, the CPU has to preserve those upper halves, and the resulting state transitions are expensive.