Originally posted by JS987
1. I can't make the loop the way I want to make it; although the gain would only be a few kB and some 0.1%, so this is not such a big problem.
2. What the loop does is complicated enough without also having to think about what the compiler will do.
The loop needs a minimum of 8 SSE registers, and if the compiler messes up one step, or I do one step wrong (as intrinsics add "mov" steps), then register spilling follows (check the Phoronix post on GCC's new register allocator).
3. Debugging is hard, as it's all jumbled up.
So I make one version in assembly and then redo it in intrinsics, but I still depend on the compiler getting register allocation right.
Also see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19680
That should change with GCC 4.8; I repeat, "should".
And in the end, I like assembly and find it easy to test its speed and correctness in small loops.
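To illustrate what testing correctness in a small loop can look like, here is a minimal sketch. It is not the actual loop discussed above: the float-add kernel and the names add_scalar, add_sse and count_mismatches are made up for the example. It assumes a compiler that defines __SSE__ and ships xmmintrin.h, and it falls back to plain C otherwise.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

#define MAX_N 64  /* hypothetical upper bound for the test buffers */

/* Scalar reference version: the known-good result to compare against. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Intrinsics version: 4 floats per iteration, scalar tail for the rest. */
static void add_sse(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; i++)                     /* leftover elements */
        dst[i] = a[i] + b[i];
}

/* Runs both versions on the same input and counts differing elements. */
int count_mismatches(size_t n)
{
    float a[MAX_N], b[MAX_N], ref[MAX_N], out[MAX_N];
    int bad = 0;

    if (n > MAX_N)
        n = MAX_N;
    for (size_t i = 0; i < n; i++) {
        a[i] = (float)i * 0.5f;            /* exactly representable inputs */
        b[i] = 100.0f - (float)i;
    }
    add_scalar(ref, a, b, n);
    add_sse(out, a, b, n);
    for (size_t i = 0; i < n; i++)
        if (ref[i] != out[i])
            bad++;
    return bad;
}
```

Calling count_mismatches with a few lengths, including ones that are not a multiple of 4, exercises both the vector body and the scalar tail. The same harness could in principle check an assembly version of the kernel against the reference.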
Why should I do it in intrinsics anyway when there's inline assembly? That is, "why is inline assembly there if it's useless and ugly?"
And again, I would file a bug report on it, but I feel it would surely be ignored, as there is no ISO standard for inline assembly.