You lose 900 MHz of max turbo boost with your stupid settings. Enable ALL power-saving features, including ACPI C6 and turbo boost, then test again.
I have been looking at the performance of two servers:
- dual Xeon X5550 2.67GHz (Nehalem, Dell R610)
- dual Xeon E5-2690 2.90 GHz (Sandy Bridge, Dell R620 & HP dl360g8p)
For my particular (proprietary) application, the Sandy Bridge systems are significantly slower. At least one facet of this problem has to do with:
- pthread condition variable signaling
- pthread mutex lock contention
I wrote a simple (<300 lines) C program that demonstrates this: http://pastebin.com/0jPt0AJS
The program has two tests:
- "lc", a lock contention test, where two threads "fight" over incrementing and decrementing an integer, arbitrated with a pthread_mutex_t
- "cv", a condition variable signaling test, where two threads "politely" take turns incrementing and decrementing an integer, signaling each other with a condition variable
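For readers who don't want to open the pastebin, here is a minimal sketch of what the two tests boil down to. It is not the original program: the names `lc_worker`, `cv_worker` and `run_test` are made up for illustration, and CPU pinning and timing are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Shared state for both sketches (hypothetical names, not from the pastebin). */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static long counter;
static long n_iter;

/* "lc": each thread takes the mutex as fast as it can and adds its delta. */
static void *lc_worker(void *arg)
{
    long delta = *(long *)arg;
    for (long i = 0; i < n_iter; i++) {
        pthread_mutex_lock(&lock);
        counter += delta;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* "cv": a thread waits until the counter holds the value it expects,
 * changes it, then wakes the other thread. */
static void *cv_worker(void *arg)
{
    long delta = *(long *)arg;            /* +1 incrementer, -1 decrementer */
    long expected = (delta > 0) ? 0 : 1;  /* value this thread waits for */
    for (long i = 0; i < n_iter; i++) {
        pthread_mutex_lock(&lock);
        while (counter != expected)
            pthread_cond_wait(&cond, &lock);
        counter += delta;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Runs one test with two threads; returns the final counter (0 if balanced). */
static long run_test(void *(*worker)(void *), long iterations)
{
    pthread_t inc, dec;
    long plus = 1, minus = -1;
    int rc;

    counter = 0;
    n_iter = iterations;
    rc = pthread_create(&inc, NULL, worker, &plus);
    assert(rc == 0);
    rc = pthread_create(&dec, NULL, worker, &minus);
    assert(rc == 0);
    (void)rc;
    pthread_join(inc, NULL);
    pthread_join(dec, NULL);
    return counter;
}
```

Calling `run_test(lc_worker, n)` or `run_test(cv_worker, n)` should leave the counter at 0 in both cases. In the "cv" variant every iteration forces a handoff between the two threads, so it spends most of its time in wake-up and sleep paths rather than in user code.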
The program uses pthread_setaffinity_np() to pin each thread to its own CPU core.
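As a rough illustration of the pinning step, the sketch below shows one way a thread can pin itself with `pthread_setaffinity_np()` on Linux/glibc. The helper names (`pin_self_to_cpu`, `first_allowed_cpu`, `pinned_worker`) are hypothetical and not taken from the demo program.

```c
#define _GNU_SOURCE          /* needed for cpu_set_t and the *_np calls */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Pins the calling thread to one CPU; returns 0 on success, else an error number. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Returns the lowest-numbered CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Thread entry point that pins itself before doing any work.
 * arg points to an int holding the target CPU number. */
static void *pinned_worker(void *arg)
{
    int rc = pin_self_to_cpu(*(int *)arg);
    assert(rc == 0);   /* fail loudly: an unpinned run would skew the timing */
    (void)rc;
    /* ... benchmark loop would go here ... */
    return NULL;
}
```

Pinning from inside the thread (rather than from the creator) keeps the example short. Whether the two chosen CPUs are sibling hyperthreads, separate cores on one socket, or on different sockets changes how far the lock's cache line has to travel, so the `-c`/`-C` choice matters when comparing machines.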
I would expect the SNB-based servers to be faster, since they have both a clock speed and an architecture advantage.
Results of X5550 @ 2.67 GHz server under CentOS 5.7:
# ./snb_slow_demo -c 3 -C 5 -t cv -n 50000000
runtime, seconds ........ 143.339958
# ./snb_slow_demo -c 3 -C 5 -t lc -n 500000000
runtime, seconds ........ 58.278671
Results of Dell E5-2690 @ 2.90 GHz under CentOS 5.7:
# ./snb_slow_demo -c 2 -C 4 -t cv -n 50000000
runtime, seconds ........ 179.272697
# ./snb_slow_demo -c 2 -C 4 -t lc -n 500000000
runtime, seconds ........ 103.437226
I upgraded the E5-2690 server to CentOS 6.2, then tried both the current release kernel.org kernel version 3.4.4, and also 3.5.0-rc5. The "lc" test results are about the same, but the "cv" tests are worse yet: the same test takes about 229 seconds to run.
Also noteworthy is that the HP has generally better performance than the Dell. But the HP E5-2690 is still worse than the X5550.
In all cases, for all servers, I disabled power-saving features (cpu frequency scaling, C-states, C1E). I verified with i7z that all CPUs spend 100% of their time in state C0.
Is this simply a corner case where Sandy Bridge is worse than its predecessor? Or is there an implementation problem?
In other words, with this benchmark, enabling any kind of power saving features (on either CPU) makes things worse.
To be fair, turbo boost is debatable. In this particular benchmark, it improves things slightly; but overall, SNB still falls well behind Nehalem.
I have an i7-880, an i7-2600 and an i7-3770S, but wheezy is not yet installed on the i7-880. The gcc version is important for speed; gcc-4.7 is better in most benchmarks, as you can see here:
You can forget the filesystem benchmarks, as that is ext3 vs ext4. You could also try different compiler flags and -march=native. I doubt that you lose that much latency to power management; maybe use a low-latency kernel.
Last edited by Kano; 07-11-2012 at 01:31 PM.
If you're interested, I encourage you to compile and run the sample program to which I linked (instructions for building are in the top comment).
FWIW, I have tested this program with several compilers. I haven't tried gcc 4.7 yet, but I have tried 4.6.3 on gentoo, having re-emerged the whole system with -march=corei7-avx, and built my demo program similarly. I also tried Intel's compiler (I didn't rebuild the whole gentoo system w/icc, but did build my program). The different compilers have so far made very little difference.
My sample program doesn't actually do much that is interesting; it's more or less a system call benchmark. So my suspicion is that these kernel functions I'm using are implemented sub-optimally for SNB, or this is simply a corner-case where SNB is slower than previous-gen CPUs.
I've been playing with this for a while, so I'm kind of hoping to get the attention of someone with deeper knowledge of kernel-CPU internals than me.
Basically, your benchmark is the worst multicore example there could be. Let's talk about cv mode: when you specify different cores, htop never shows more than 55% load on each core. With lc mode you see 100% load, but in both cases your code is written to run on 1 core! The speed difference is extreme. I did not wait for cv to finish with 2 different cores; that just takes too long. Used an i7-3770S, turbo fixed at 39.
Code:
./snb_slow_demo -c 0 -C 1 -t lc -n 500000000
RUNTIME PARAMS:
n_iter ..... 500000000
cpu1 ....... 0
cpu2 ....... 1
testname ... lc
runtime, microseconds ... 77268376
runtime, seconds ........ 77.268376

./snb_slow_demo -c 0 -C 0 -t lc -n 500000000
RUNTIME PARAMS:
n_iter ..... 500000000
cpu1 ....... 0
cpu2 ....... 0
testname ... lc
runtime, microseconds ... 18515720
runtime, seconds ........ 18.515720

./snb_slow_demo -c 0 -C 0 -t cv -n 50000000
RUNTIME PARAMS:
n_iter ..... 50000000
cpu1 ....... 0
cpu2 ....... 0
testname ... cv
runtime, microseconds ... 74678823
runtime, seconds ........ 74.678823