I was comparing Core i7 and Phenom II memory benchmarks and was quite impressed by the i7's performance. However, I wonder whether the comparison is legitimate.
While benchmarking older CPUs (and chipsets :-) with a slightly modified version of the code provided at http://www.streambench.org/, I discovered something that puzzles me:
On the one hand, on Intel machines (I have a Core 2 Duo desktop and some dual Xeon/Penryn boxes) you get almost the full bandwidth using a single thread per processor.
On the other hand, on a Barcelona you need no fewer than one thread per core to get the full bandwidth.
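To make the thread-count comparison concrete, here is a minimal sketch of the kind of measurement I mean. It is not the actual STREAM code: the function names (`now_seconds`, `triad_bytes`, `triad_mb_per_s`) are made up for illustration, and it assumes an OpenMP build (for example `gcc -O3 -fopenmp`). Without OpenMP the pragmas are ignored and everything runs on one thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Bytes counted for one triad pass: read b and c, write a. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)n * (double)sizeof(double);
}

/* Hypothetical helper: run a[i] = b[i] + s * c[i] `reps` times on
 * `nthreads` threads and return the best bandwidth seen, in MB/s
 * (10^6 bytes per second). Returns -1.0 on allocation failure or if
 * the result check fails. */
double triad_mb_per_s(size_t n, int nthreads, int reps)
{
    assert(n > 0 && nthreads > 0 && reps > 0);
    (void)nthreads; /* only used by the pragmas below */

    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }

    const double scalar = 3.0;
    double best = 0.0;

    /* Initialise with the same thread layout as the timed loop, so each
     * thread touches its own part of the arrays first. */
    #pragma omp parallel for num_threads(nthreads) schedule(static)
    for (long i = 0; i < (long)n; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    for (int r = 0; r < reps; r++) {
        double t0 = now_seconds();
        #pragma omp parallel for num_threads(nthreads) schedule(static)
        for (long i = 0; i < (long)n; i++)
            a[i] = b[i] + scalar * c[i];
        double dt = now_seconds() - t0;

        if (dt > 0.0) {
            double bw = triad_bytes(n) / dt / 1e6;
            if (bw > best)
                best = bw;
        }
    }

    /* Use the result so the timed loop cannot be optimised away:
     * every element should be 1 + 3 * 2 = 7. */
    double check = a[0] + a[n - 1];
    free(a); free(b); free(c);
    return (check == 14.0) ? best : -1.0;
}
```

Calling `triad_mb_per_s` once with a single thread and once with one thread per core, on arrays much larger than the caches, gives the two numbers I am comparing. The array size and repetition count are left to the caller, and only the triad operation is covered here.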
Moreover, to get a real "memory bandwidth" benchmark out of the STREAM code with gcc (4.3.2), you need quite different option sets depending on the CPU architecture. For instance, on Intel processors "-O3 -march=native" is sufficient. By contrast, gcc is quite reluctant to optimize the code for Barcelona, and one needs something like "-march=barcelona -static -fomit-frame-pointer -fexpensive-optimizations -funsafe-loop-optimizations -funroll-loops -ffast-math -O3" to really test bandwidth in every kind of operation.
I wonder whether the memory bandwidth benchmarks are already run in parallel within the Phoronix Test Suite. If they aren't, I think it would be quite useful to add a parallel section to the memory benchmark:
The sequential test tells you how much you can expect from your processor when it is loaded with a single demanding thread, while the parallel test tells you how much you can really get from your memory under heavy load. The following example image shows (dramatically, in my opinion) the difference between the two results: under a single-threaded load the Xeons have a large lead, while they are outpaced under multi-threaded loads.
Has anyone already investigated this matter?
PS: sorry for the large image: I haven't found any way to display it smaller...