In this case there isn't much difference between SSE and SSE2.
SSE2 is mostly integer operations (extending MMX to the 128-bit XMM registers), plus a couple of instructions for non-temporal stores and cache hints.
GCC (4.8) gives me scalar code all the time:
Code:
67e: f3 0f 10 4f 80    movss xmm1,DWORD PTR [rdi-0x80]
683: f3 0f 59 46 d0    mulss xmm0,DWORD PTR [rsi-0x30]
688: f3 0f 59 4e d4    mulss xmm1,DWORD PTR [rsi-0x2c]
68d: f3 0f 58 c1       addss xmm0,xmm1
I also tried some more specific compiler options, like:
gcc -o matrixm.o matrixm.c -shared -O3 -ftree-slp-vectorize -ffast-math -msse2
Even with hints in the C code:
Code:
vertex = __builtin_assume_aligned (vertex, 32);
matrix = __builtin_assume_aligned (matrix, 32);
result = __builtin_assume_aligned (result, 32);
AVX, FMA and XOP, on the other hand, have instructions that are great for this kind of operation, like VFMADDPS and HADDPS. The compiler does use them when possible, but in everything I tried the code still came out scalar.
Threading assembly code is as easy as threading C code. In fact I think my loop would do better threaded than the C one, since a cache line is, from what I can tell, usually 64 bytes, and the scalar code only uses 4 bytes of it per load. I could be wrong if, for example, the CPU notices that and loads the whole cache line into two registers.
I tried -flto now and it does give better performance, maybe because it aligned the loop (my loop isn't aligned either; I tried aligning it just now and it didn't help much).
This is an example of SSE's usefulness. If someone wants to use the loop in a BLAS library, I'll finish it so it works in all cases. It can also be used as a software fallback when OpenGL 3.x is not available, as on some laptops.
The fun thing is I'll be using this to shave a percent or two off a game; I'll need to change a couple of lines though. Writing a loop here and there isn't that demanding.
Debugging is done by following the flow of the program:
load -> shuffle -> load2 -> shuffle_together
What is (and should be) in the affected register is written down at each step, and you can rename the registers as you wish. I admit it's not as clear as copying parts of the C code.
Then again, SIMD and MIMD can require different algorithms, so it's at least good to know about. I personally insert stores and calls that print to stdout to see what's going on. (gdb gives you the address of the problem; you can then look at the disassembly around that address to find it.)
I feel good documentation is the way to better-quality software, not limiting all programmers to one way of doing things. Then again, I do this as a hobby, so what do I know.
For software production it probably doesn't matter at all: most people have many-core CPUs, so optimizing away that couple of percent doesn't matter. Then again, sometimes it is useful, for example in scientific computing, encryption, databases, and physics on the CPU, and also in code that changes control flow frequently based on the results of calculations, which is better done on the CPU.
Unfortunately I can't test your loop, since it returns wrong results, and it does only a third of the necessary calculations. Still, 177472 * 3 is 532416; hmm, I guess the difference comes from lower cache usage. Nice try though; a similar loop is in GCC's vectorization manual.
PS: I also tried ICC and LLVM on this site, though the LLVM there is only 3.0.