Thread: sandy bridge performance hit w/pthread condition variables, contended mutexes?

  1. #1

    I have been looking at the performance of two servers:
    • dual Xeon X5550 2.67GHz (Nehalem, Dell R610)
    • dual Xeon E5-2690 2.90 GHz (Sandy Bridge, Dell R620 & HP dl360g8p)


    For my particular (proprietary) application, the Sandy Bridge systems are significantly slower. At least one facet of this problem has to do with:
    • pthread condition variable signaling
    • pthread mutex lock contention


    I wrote a simple (<300 lines) C program that demonstrates this: http://pastebin.com/0jPt0AJS

    The program has two tests:
    • "lc", a lock contention test, where two threads "fight" over incrementing and decrementing an integer, arbitrated with a pthread_mutex_t
    • "cv", a condition variable signaling test, where two threads "politely" take turns incrementing and decrementing an integer, signaling each other with a condition variable


    The program uses pthread_setaffinity_np() to pin each thread to its own CPU core.

    I would expect the SNB-based servers to be faster, since they have a clockspeed and architecture advantage.

    Results of X5550 @ 2.67 GHz server under CentOS 5.7:
    # ./snb_slow_demo -c 3 -C 5 -t cv -n 50000000
    runtime, seconds ........ 143.339958

    # ./snb_slow_demo -c 3 -C 5 -t lc -n 500000000
    runtime, seconds ........ 58.278671

    Results of Dell E5-2690 @ 2.90 GHz under CentOS 5.7:
    # ./snb_slow_demo -c 2 -C 4 -t cv -n 50000000
    runtime, seconds ........ 179.272697

    # ./snb_slow_demo -c 2 -C 4 -t lc -n 500000000
    runtime, seconds ........ 103.437226

    I upgraded the E5-2690 server to CentOS 6.2, then tried both the current kernel.org release (3.4.4) and 3.5.0-rc5. The "lc" results are about the same, but the "cv" results are even worse: the same test takes about 229 seconds to run.

    Also noteworthy is that the HP has generally better performance than the Dell. But the HP E5-2690 is still worse than the X5550.

    In all cases, for all servers, I disabled power-saving features (cpu frequency scaling, C-states, C1E). I verified with i7z that all CPUs spend 100% of their time in state C0.

    Is this simply a corner case where Sandy Bridge is worse than its predecessor? Or is there an implementation problem?

  2. #2
    Join Date
    Aug 2007
    Posts
    6,679

    http://ark.intel.com/de/products/645...GTs-Intel-QPI)

    You lose up to 900 MHz of maximum Turbo Boost with your stupid settings. Enable ALL power-saving features, including ACPI C6 and Turbo Boost, then test again.

  3. #3

    Quote Originally Posted by Kano View Post
    http://ark.intel.com/de/products/645...GTs-Intel-QPI)

    You lose up to 900 MHz of maximum Turbo Boost with your stupid settings. Enable ALL power-saving features, including ACPI C6 and Turbo Boost, then test again.
    No, for ultra low latency applications, it is absolutely critical to disable all powersaving features in modern CPUs. This is standard practice in industries such as high frequency trading and some other high performance computing situations. There is a measurable latency hit for a CPU to transition from a low power/slow state to a higher power/fast state. The latency to transition from one state to another is lessened in SNB compared to Westmere, but it's still there.

    In other words, with this benchmark, enabling any kind of power saving features (on either CPU) makes things worse.

    To be fair, turbo boost is debatable. In this particular benchmark, it improves things slightly; but overall, SNB still falls well behind Westmere.

  4. #4
    Join Date
    Aug 2007
    Posts
    6,679

    I have an i7-880, an i7-2600 and an i7-3770S, but the i7-880 is not running wheezy yet. The gcc version matters for speed; gcc-4.7 is better in most benchmarks, as you can see here:

    http://www.phoronix.com/scan.php?pag...ezy_2012&num=1

    You can ignore the filesystem benchmarks, as that is ext3 vs ext4. You could also try different compiler flags and -march=native. I doubt that you lose so much latency with power management; maybe use a low-latency kernel.
    Last edited by Kano; 07-11-2012 at 02:31 PM.

  5. #5

    If you're interested, I encourage you to compile and run the sample program to which I linked (instructions for building are in the top comment).

    FWIW, I have tested this program with several compilers. I haven't tried gcc 4.7 yet, but I have tried 4.6.3 on gentoo, having re-emerged the whole system with -march=corei7-avx, and built my demo program similarly. I also tried Intel's compiler (I didn't rebuild the whole gentoo system w/icc, but did build my program). The different compilers have so far made very little difference.

    My sample program doesn't actually do much that is interesting; it's more or less a system call benchmark. So my suspicion is that these kernel functions I'm using are implemented sub-optimally for SNB, or this is simply a corner-case where SNB is slower than previous-gen CPUs.

    I've been playing with this for a while, so I'm kind of hoping to get the attention of someone with deeper knowledge of kernel-CPU internals than me.

  6. #6
    Join Date
    Aug 2007
    Posts
    6,679

    Basically, your benchmark is the worst multicore example there could be. Let's talk about cv mode: when you specify different cores, htop never shows more than 55% load on each core. With lc mode you see 100% load, but in both cases your code is written to run on 1 core! The speed difference is extreme. I did not wait for cv to finish on 2 different cores; that just takes too long. Used an i7-3770S, Turbo fixed at 39.
    Code:
    ./snb_slow_demo -c 0 -C 1 -t lc -n 500000000
    RUNTIME PARAMS:
        n_iter ..... 500000000
        cpu1 ....... 0
        cpu2 ....... 1
        testname ... lc
    runtime, microseconds ... 77268376
    runtime, seconds ........ 77.268376
    
    ./snb_slow_demo -c 0 -C 0 -t lc -n 500000000
    RUNTIME PARAMS:
        n_iter ..... 500000000
        cpu1 ....... 0
        cpu2 ....... 0
        testname ... lc
    runtime, microseconds ... 18515720
    runtime, seconds ........ 18.515720
    
    ./snb_slow_demo -c 0 -C 0 -t cv -n 50000000
    RUNTIME PARAMS:
        n_iter ..... 50000000
        cpu1 ....... 0
        cpu2 ....... 0
        testname ... cv
    runtime, microseconds ... 74678823
    runtime, seconds ........ 74.678823

  7. #7
    Join Date
    Jan 2009
    Posts
    1,498

    Quote Originally Posted by finance_coder View Post
    No, for ultra low latency applications, it is absolutely critical to disable all powersaving features in modern CPUs. This is standard practice in industries such as high frequency trading and some other high performance computing situations. There is a measurable latency hit for a CPU to transition from a low power/slow state to a higher power/fast state. The latency to transition from one state to another is lessened in SNB compared to Westmere, but it's still there.

    In other words, with this benchmark, enabling any kind of power saving features (on either CPU) makes things worse.

    To be fair, turbo boost is debatable. In this particular benchmark, it improves things slightly; but overall, SNB still falls well behind Westmere.
    If latency is important, and you are working for a large finance house, why aren't you using the Red Hat messaging kernel that's designed for low latency?
