Interesting (and remarkable) that we get such a regression from -ffast-math. It would be good (if it's not a hassle) to learn why.
One possibility is that vectorization is at fault here. A big push in 3.3 was to make the vectorization cost model accurate, so that vectorized code wouldn't end up slower by spending too much time just shuffling data around. But the hope of having an accurate cost model doesn't mean you ACTUALLY have one. It's possible that something is severely broken in the cost model (at least for FP vectors) and that's what is giving us these results.
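To illustrate the kind of code where -ffast-math and the vectorizer interact (this is only a sketch of the scenario, not a claim about what the benchmark actually runs): a floating-point reduction like the hypothetical dot product below can only be vectorized once -ffast-math allows the additions to be reassociated, so a miscosted vector version (extra shuffles, the horizontal sum at the end, and so on) would show up as a regression precisely when that flag is turned on.

    #include <stddef.h>

    /* Hypothetical hot loop: a plain dot product.
     * With strict FP semantics the additions must stay in source
     * order, so the loop is left scalar.  Under -ffast-math the
     * compiler may reassociate the sum, vectorize, and finish with a
     * horizontal add -- and if the cost model underestimates that
     * shuffle/reduction overhead, the "optimized" version can be
     * slower than the scalar one. */
    float dot(const float *a, const float *b, size_t n) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }

Comparing the output of clang -O3 -S with and without -ffast-math on a loop like this (and looking for shuffle/horizontal-add instructions) would be one quick way to check whether vectorization is the culprit.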
The unroll-loops result does not surprise me. The LLVM developers probably believe they have good heuristics for when (and when not) to unroll, and they are likely correct.
The architecture-specific results may be linked to the inaccurate cost model issue. It would be amusing if we learned there was an off-by-one error (or something similar) in the micro-architecture specs table that drives the compiler!
I guess 3.4 will be released in the next month or two, and it would be interesting to revisit this at that point.