sandy bridge performance hit w/pthread condition variables, contended mutexes?

liam replied

11 July 2012, 04:50 PM
Originally posted by finance_coder:

No, for ultra low latency applications, it is absolutely critical to disable all powersaving features in modern CPUs. This is standard practice in industries such as high frequency trading and some other high performance computing situations. There is a measurable latency hit for a CPU to transition from a low power/slow state to a higher power/fast state. The latency to transition from one state to another is lessened in SNB compared to Westmere, but it's still there.

In other words, with this benchmark, enabling any kind of power saving features (on either CPU) makes things worse.

To be fair, turbo boost is debatable. In this particular benchmark, it improves things slightly; but overall, SNB still falls well behind Westmere.

If latency is important, and you are working for a large finance house, why aren't you using the Red Hat messaging kernel that's designed for low latency?

Kano replied

11 July 2012, 04:35 PM

Basically, your benchmark is about the worst multicore example there could be. Take cv mode: when you specify different cores, htop never shows more than 55% load on either core. With lc mode you see 100% load, but in both cases your code is written to run on one core! The speed difference is extreme. I did not wait for cv to finish on two different cores; that just takes too long. I used an i7-3770S with turbo fixed at 39.

Code:

./snb_slow_demo -c 0 -C 1 -t lc -n 500000000
RUNTIME PARAMS:
    n_iter ..... 500000000
    cpu1 ....... 0
    cpu2 ....... 1
    testname ... lc
runtime, microseconds ... 77268376
runtime, seconds ........ 77.268376

./snb_slow_demo -c 0 -C 0 -t lc -n 500000000
RUNTIME PARAMS:
    n_iter ..... 500000000
    cpu1 ....... 0
    cpu2 ....... 0
    testname ... lc
runtime, microseconds ... 18515720
runtime, seconds ........ 18.515720

./snb_slow_demo -c 0 -C 0 -t cv -n 50000000
RUNTIME PARAMS:
    n_iter ..... 50000000
    cpu1 ....... 0
    cpu2 ....... 0
    testname ... cv
runtime, microseconds ... 74678823
runtime, seconds ........ 74.678823


finance_coder replied

11 July 2012, 01:38 PM
If you're interested, I encourage you to compile and run the sample program to which I linked (instructions for building are in the top comment).

FWIW, I have tested this program with several compilers. I haven't tried gcc 4.7 yet, but I have tried 4.6.3 on Gentoo, having re-emerged the whole system with -march=corei7-avx and built my demo program the same way. I also tried Intel's compiler (I didn't rebuild the whole Gentoo system with icc, but I did build my program with it). The different compilers have so far made very little difference.

My sample program doesn't actually do much that is interesting; it's more or less a system call benchmark. So my suspicion is that these kernel functions I'm using are implemented sub-optimally for SNB, or this is simply a corner-case where SNB is slower than previous-gen CPUs.

I've been playing with this for a while, so I'm kind of hoping to get the attention of someone with deeper knowledge of kernel-CPU internals than me.
Kano replied

11 July 2012, 01:28 PM
I have an i7-880, an i7-2600, and an i7-3770S, but the i7-880 is not on wheezy yet. The gcc version matters for speed; gcc-4.7 is better in most benchmarks, as you can see here:

Debian: Squeeze vs. Wheezy On Linux And kFreeBSD - Phoronix

http://www.phoronix.com/scan.php?page=article&item=debian_squeeze_wheezy_2012&num=1

You can ignore the filesystem benchmarks, as those compare ext3 with ext4. You could also try different compiler flags and -march=native. I doubt that you lose that much latency to power management; maybe use a low-latency kernel.

Last edited by Kano; 11 July 2012, 01:31 PM.
finance_coder replied

11 July 2012, 12:44 PM
Originally posted by Kano:

http://ark.intel.com/de/products/645...GTs-Intel-QPI)

You lose 900 MHz of maximum turbo boost with your stupid settings. Enable ALL power-saving features, including ACPI C6 and turbo boost, then test again.

No, for ultra low latency applications, it is absolutely critical to disable all powersaving features in modern CPUs. This is standard practice in industries such as high frequency trading and some other high performance computing situations. There is a measurable latency hit for a CPU to transition from a low power/slow state to a higher power/fast state. The latency to transition from one state to another is lessened in SNB compared to Westmere, but it's still there.

In other words, with this benchmark, enabling any kind of power saving features (on either CPU) makes things worse.

To be fair, turbo boost is debatable. In this particular benchmark, it improves things slightly; but overall, SNB still falls well behind Westmere.
Kano replied

11 July 2012, 12:27 PM
http://ark.intel.com/de/products/64596/Intel-Xeon-Processor-E5-2690-(20M-Cache-2_90-GHz-8_00-GTs-Intel-QPI)

You lose 900 MHz of maximum turbo boost with your stupid settings. Enable ALL power-saving features, including ACPI C6 and turbo boost, then test again.
finance_coder started a topic sandy bridge performance hit w/pthread condition variables, contended mutexes?

11 July 2012, 12:20 PM
I have been looking at the performance of two servers:
dual Xeon X5550 2.67GHz (Nehalem, Dell R610)

dual Xeon E5-2690 2.90 GHz (Sandy Bridge, Dell R620 & HP dl360g8p)

For my particular (proprietary) application, the Sandy Bridge systems are significantly slower. At least one facet of this problem has to do with:
pthread condition variable signaling

pthread mutex lock contention

I wrote a simple (<300 lines) C program that demonstrates this: http://pastebin.com/0jPt0AJS

The program has two tests:
"lc", a lock contention test, where two threads "fight" over incrementing and decrementing an integer, arbitrated with a pthread_mutex_t

"cv", a condition variable signaling test, where two threads "politely" take turns incrementing and decrementing an integer, signaling each other with a condition variable

The program uses pthread_setaffinity_np() to pin each thread to its own CPU core.
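
For readers who cannot reach the pastebin link, the two tests and the pinning step described above can be sketched roughly as follows. This is a minimal reconstruction from the description only, not the linked program: every name in it (pin_to_cpu, worker, run_test, and so on) is hypothetical, and the real program may count iterations or signal differently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* All names below are hypothetical; this is not the pastebin program. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static long counter;   /* the shared integer */
static long n_iter;    /* iterations per thread */
static int  use_cv;    /* 1 = "cv" test, 0 = "lc" test */

struct worker_arg { int cpu; long wait_for; long delta; };

/* Pin the calling thread to a single CPU core; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *p)
{
    const struct worker_arg *a = p;
    if (pin_to_cpu(a->cpu) != 0)
        fprintf(stderr, "warning: could not pin to cpu %d\n", a->cpu);
    for (long i = 0; i < n_iter; i++) {
        pthread_mutex_lock(&lock);
        if (use_cv) {
            /* "cv": wait for our turn, then hand the turn to the peer. */
            while (counter != a->wait_for)
                pthread_cond_wait(&cond, &lock);
            counter += a->delta;
            pthread_cond_signal(&cond);
        } else {
            /* "lc": no turn-taking, just fight over the mutex. */
            counter += a->delta;
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Runs one test and returns the final counter, which should be 0. */
static long run_test(int cpu1, int cpu2, long iterations, int cv_mode)
{
    struct worker_arg inc = { cpu1, 0, +1 };  /* increments when counter is 0 */
    struct worker_arg dec = { cpu2, 1, -1 };  /* decrements when counter is 1 */
    pthread_t t1, t2;

    assert(cpu1 >= 0 && cpu2 >= 0 && iterations >= 0);
    counter = 0;
    n_iter = iterations;
    use_cv = cv_mode;
    if (pthread_create(&t1, NULL, worker, &inc) != 0 ||
        pthread_create(&t2, NULL, worker, &dec) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        exit(EXIT_FAILURE);
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Calling run_test(2, 4, n, 1) from main would correspond loosely to "-c 2 -C 4 -t cv", and passing 0 as the last argument to "-t lc". Build with gcc -pthread. In cv mode every iteration forces a handoff between the two threads, so the time is dominated by wakeups rather than by the increment itself.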

I would expect the SNB-based servers to be faster, since they have a clock speed and architecture advantage.

Results of X5550 @ 2.67 GHz server under CentOS 5.7:
# ./snb_slow_demo -c 3 -C 5 -t cv -n 50000000
runtime, seconds ........ 143.339958

# ./snb_slow_demo -c 3 -C 5 -t lc -n 500000000
runtime, seconds ........ 58.278671

Results of Dell E5-2690 @ 2.90 GHz under CentOS 5.7:
# ./snb_slow_demo -c 2 -C 4 -t cv -n 50000000
runtime, seconds ........ 179.272697

# ./snb_slow_demo -c 2 -C 4 -t lc -n 500000000
runtime, seconds ........ 103.437226
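
For easier comparison, the runtimes above can be normalized per iteration. The following is a small sketch in C, assuming that the cost of one "iteration" is simply the total runtime divided by the -n value; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Average cost of one iteration in nanoseconds: runtime / n_iter. */
static double ns_per_iter(double runtime_s, long n_iter)
{
    assert(n_iter > 0);
    return runtime_s * 1e9 / (double)n_iter;
}

/* How many times slower the second runtime is than the first. */
static double slowdown(double base_s, double other_s)
{
    assert(base_s > 0.0);
    return other_s / base_s;
}

/* Prints the CentOS 5.7 figures quoted above, normalized per iteration. */
static void print_summary(void)
{
    const long   cv_n = 50000000L, lc_n = 500000000L;
    const double cv_x5550 = 143.339958, cv_e5 = 179.272697;
    const double lc_x5550 = 58.278671,  lc_e5 = 103.437226;

    printf("cv: X5550 %.0f ns/iter, E5-2690 %.0f ns/iter, %.2fx slower\n",
           ns_per_iter(cv_x5550, cv_n), ns_per_iter(cv_e5, cv_n),
           slowdown(cv_x5550, cv_e5));
    printf("lc: X5550 %.0f ns/iter, E5-2690 %.0f ns/iter, %.2fx slower\n",
           ns_per_iter(lc_x5550, lc_n), ns_per_iter(lc_e5, lc_n),
           slowdown(lc_x5550, lc_e5));
}
```

With these numbers the cv test works out to roughly 2.9 microseconds per iteration on the X5550 versus 3.6 on the E5-2690 (about 1.25x slower), and the lc test to roughly 117 ns versus 207 ns (about 1.77x slower).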

I upgraded the E5-2690 server to CentOS 6.2, then tried both the current kernel.org release, 3.4.4, and 3.5.0-rc5. The "lc" test results are about the same, but the "cv" results are worse still: the same test takes about 229 seconds to run.

Also noteworthy is that the HP has generally better performance than the Dell. But the HP E5-2690 is still worse than the X5550.

In all cases, for all servers, I disabled power-saving features (CPU frequency scaling, C-states, C1E). I verified with i7z that all CPUs spend 100% of their time in state C0.
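
As a side note for anyone reproducing this without changing BIOS or boot settings: Linux also lets a process request a wakeup-latency bound at run time through the PM QoS device /dev/cpu_dma_latency. The sketch below is an assumption about how one might do that, not how the servers above were configured; the function name is hypothetical, and this only limits the idle states the kernel will choose, it does not touch frequency scaling.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Ask the kernel's PM QoS layer to keep CPU wakeup latency at or below
 * max_latency_us microseconds. The request lasts only while the returned
 * descriptor stays open. Returns the descriptor, or -1 on failure
 * (opening the device usually requires root).
 */
static int hold_cpu_latency(int32_t max_latency_us)
{
    assert(max_latency_us >= 0);

    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0) {
        fprintf(stderr, "open /dev/cpu_dma_latency: %s\n", strerror(errno));
        return -1;
    }
    /* The device expects the bound as a raw 32-bit integer. */
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) !=
        (ssize_t)sizeof(max_latency_us)) {
        fprintf(stderr, "write /dev/cpu_dma_latency: %s\n", strerror(errno));
        close(fd);
        return -1;
    }
    return fd;
}
```

A benchmark would call hold_cpu_latency(0) before starting its threads, keep the descriptor open for the whole run, and close() it afterwards; the request is dropped as soon as the descriptor is closed.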

Is this simply a corner case where Sandy Bridge is worse than its predecessor? Or is there an implementation problem?