Good benchmark, thank you very much.
I guess SSE2 has a lot to do with this (someone told me the compiler enables it by default on AMD64, see the quote below). Also, the programs that run slower on 64-bit distros are probably slower because a process uses more memory there than on a 32-bit system.
From 'man gcc':
Intel 386 and AMD x86-64 Options
For the i386 compiler, you must use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For
the x86-64 compiler, these extensions are enabled by default.