Originally posted by chris200x9
For the smart mouths: FMA and XOP really make a huge difference if used properly and your cache is sane. Remember that AVX can process 4 doubles (very rare in most workloads) or 8 floats (very common) per instruction. In my case I modified my old SSE idct[Mx'M] code to use AVX and FMA, looped it to run 1 million times, and it is around 35-45% faster on the same Bulldozer CPU, so BD still has a lot of juice left to extract.
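To make the "8 floats" and FMA point concrete, here is a rough sketch of the kind of loop that benefits. It is not the idct code mentioned above, just a plain multiply-accumulate written in portable C. The function name `madd8` and the `LANES` constant are made up for illustration. Built with GCC or Clang using something like `-O3 -mavx -mfma4` (or `-mfma` on CPUs with FMA3), the inner 8-wide block is the sort of thing the compiler can map onto one 256-bit register and a fused multiply-add, though whether it actually does so depends on the compiler and its settings.

```c
#include <assert.h>
#include <stddef.h>

/* Number of 32-bit floats that fit in one 256-bit AVX register. */
enum { LANES = 8 };

/*
 * acc[i] += a[i] * b[i] for n floats, walked in blocks of 8.
 * Hypothetical helper for illustration only; n must be a multiple of 8.
 */
void madd8(const float *a, const float *b, float *acc, size_t n)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        /* One block = one AVX register's worth of floats. */
        for (size_t j = 0; j < LANES; ++j) {
            /* Multiply then add: the pattern an FMA instruction fuses. */
            acc[i + j] += a[i + j] * b[i + j];
        }
    }
}
```

Keeping the arrays contiguous and walking them in order is also what keeps the cache "sane" here: each block touches 32 consecutive bytes per array.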
BD's problem is that most software out there is poorly threaded and/or has crappy cache management and/or barely uses SIMD at all (this applies to SB too, though).