Multi-Core Scaling Performance Of AMD's Bulldozer


  • ciplogic
    replied
    Originally posted by RealNC View Post
    I don't view it that way. If you're gonna have, say, 8MB cache on 4 cores, it's better to make it shared rather than 2MB per core. That way, on loads that involve fewer cores the cache increases (on a two-thread load you have 4MB per core).

    But of course that view comes from someone who doesn't know the details behind CPU cache memory :-P
    The issue is not how you do it or whether it is possible; it is that trading smaller caches at the lower levels for one bigger combined cache can be worse. A big shared cache is better in theory, but in a multi-threaded context it can make things slower. Take a typical case: you run 'make -j9' (meaning 9 processes/jobs will try to stress all 8 of your CPU cores). As the processes start, every time a process switch happens (even with multiple small caches), the new process simply blanks the cache. This is bad, but not that bad, as the code may fit in that cache fairly well. And even when that blanking happens, losing the cache of one core gives a theoretical worst-case slowdown of around 12.5 percent (one core out of eight) in accessing memory. Now take 'make -j8' with a shared cache: processes 1 to 8 start initially and overwrite each other's cache contents, but just when some things turn out to be common, one of the processes ends, and a new process, process 9, simply clears the shared cache once again, wasting all the "locality" the cache had built up.
    Another issue is how caches are built. L1 is closest (in distance and CPU cycles) to the math units, memory unit, and so on. L2 is a bit further away, and L3 (which we think of as the shared cache) is the slowest. The electrons have to "walk more" to get the data from the cache to any specific core. So the solution most people would suggest is to have 1 MB of L1 for every core (the opposite of the 8 MB shared L3). This would make synchronization between cores impossible (I mean possible, as in the Athlon X4, but slower, since L3 is also used for syncing data between cores).
    Getting the cache hit/miss ratio and branch prediction right is very hard (on a cache miss, you want the predictor to run the right computations in advance while you wait for memory), and succeeding takes work on two fronts: how software is written (for Bulldozer, a multi-threaded process should try to fit its most-used core logic in 2 MB) and how to avoid the architecture's bottlenecks.


  • Ansla
    replied
    There is an error in the article

    Last sentence on page 3 says:
    "With eight threads (fully utilizing the FX-8150), the improvement was 6.05x over the single-core result while the Opteron 2394 was at 8.02x and the Core i7 990X at 6.11x."

    However, 6.05x is the improvement for 6 threads; the improvement for all 8 threads was 7.44x.
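
    For anyone who wants to check what the corrected figure means for scaling, here is a minimal Python sketch. It only uses the two FX-8150 speedups quoted above; the function name and the "efficiency = speedup / threads" definition are my own choice for illustration, not something taken from the article.

```python
# Speedups quoted in this post for the FX-8150 (thread count -> speedup
# over the single-thread result). Nothing else from the article is assumed.
fx8150_speedup = {6: 6.05, 8: 7.44}


def parallel_efficiency(speedup: float, threads: int) -> float:
    """Return speedup divided by thread count (1.0 = perfect linear scaling)."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return speedup / threads


for threads, speedup in sorted(fx8150_speedup.items()):
    eff = parallel_efficiency(speedup, threads)
    print(f"{threads} threads: {speedup:.2f}x speedup, {eff:.0%} efficiency")
```

    With the corrected number, 8 threads work out to 7.44 / 8 = 0.93 of linear scaling, rather than the roughly 0.76 that the misquoted 6.05x would suggest.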


  • highlandsun
    replied
    Originally posted by EyalBD View Post
    Sometimes one can squeeze out more performance by running the test with more threads than the processor has actual cores.

    Here is an example:
    http://openbenchmarking.org/result/1...IV-1090TX26430

    X264 performance peaked at 18 threads for my 1090T.

    I think thread count should be a value that can be determined by the actual tests (some other tests have no justification for such manipulation) AND the processor, since some processors improve their results when loading more threads and some others don't.
    Whenever you see results like that it means there is a problem with the benchmark. Either the code is written poorly, or the threads are blocked by I/O. In the latter case, obviously changing the disk subsystem will change the result, and then you no longer have a meaningful benchmark of CPU performance. In the former case, you have some other block on a shared resource. It might be valid still as a benchmark, if the contended resource is the same on all test platforms.


  • EyalBD
    replied
    Scaling with more threads than actual cores

    Sometimes one can squeeze out more performance by running the test with more threads than the processor has actual cores.

    Here is an example:
    http://openbenchmarking.org/result/1...IV-1090TX26430

    X264 performance peaked at 18 threads for my 1090T.

    I think thread count should be a value that can be determined by the actual tests (some other tests have no justification for such manipulation) AND the processor, since some processors improve their results when loading more threads and some others don't.
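
    To make the idea concrete, a thread-count sweep could look roughly like the Python sketch below. This is only an illustration under assumptions: `busy_work` is a hypothetical stand-in (SHA-256 hashing), not x264, and `sweep_thread_counts` is a made-up helper name. The point is just to time the same workload at several thread counts and see where it peaks.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def busy_work(chunk: bytes) -> str:
    """Hypothetical CPU-bound stand-in for a real test: hash one chunk.

    CPython's hashlib releases the GIL while hashing large buffers,
    so several threads can make progress at the same time.
    """
    return hashlib.sha256(chunk).hexdigest()


def sweep_thread_counts(thread_counts, n_chunks=64, chunk_size=1 << 20):
    """Time the same workload at each thread count; return {threads: seconds}."""
    chunk = bytes(chunk_size)      # one zero-filled buffer, reused for every job
    chunks = [chunk] * n_chunks
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(busy_work, chunks))  # wait for all jobs to finish
        timings[n] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    results = sweep_thread_counts([1, 2, 4, 6, 8, 12, 18])
    for n, secs in results.items():
        print(f"{n:2d} threads: {secs:.3f} s")
    print(f"fastest at {min(results, key=results.get)} threads")
```

    Where the minimum lands depends on the workload and the machine, which is the reason for sweeping instead of fixing the thread count at the core count.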


  • PsynoKhi0
    replied
    Originally posted by Qaridarium
    people wondering about the bad scaling on 8 threads...

    but remember the bulldozer is a 4 core cpu not an 8core...

    the speed in comparison on a 4 core with 4 threads is really high.

    and no 4core+ CMP isn't an 8core cpu.

    it's a 4 core with CMP (like 4core+Hyper-Threading on intel side.)
    Oh f*ck me... Not again!?!?

    In any case... Nice article, though the CLOMP results for BD keep bugging me. Can anyone try and explain what's going on and what are the implications?


  • gururise
    replied
    Originally posted by nepwk View Post
    Good article. You did a great job of showing the difference between 8 semi-real cores vs. hyperthreading.
    Agreed. Thank you Michael, this was a very informative article!


  • curaga
    replied
    I see my div suggestion was applied. Nice to have a scrollable specs table.


  • smitty3268
    replied
    Originally posted by rohcQaH View Post
    In any case, a direct comparison of "4 threads across 4 modules" against "4 threads crammed into 2 modules" might be interesting to see how much Bulldozer's modules actually lose over discrete cores by sharing certain parts of the CPU pipeline.
    Yes, that's all I was getting at. Maybe scaling would be worse, or maybe better, but it would have been an interesting test to see exactly what does happen. It might even tell us something about the BD architecture.


  • wizard69
    replied
    This isn't really that bad.

    Some of the initial performance reports were very negative, but this looks very good to me for a generation-one processor. There is good reason now to save a few bucks going AMD, as the performance penalty isn't overwhelming.

    It will be especially interesting to see more testing with different configurations of the processors. We simply don't have the experience to imply anything about cache trade-offs, as the architecture is so new.

    The other thing to realize is that there has likely been little in the way of Bulldozer-specific optimizations in these tests. That would be switches for the compilers. Even though some OS optimization has been done, that doesn't preclude any other improvements specific to Bulldozer. In the end AMD could be sitting pretty with a hardware revision and some optimized compiler technologies.


  • rohcQaH
    replied
    In any case, a direct comparison of "4 threads across 4 modules" against "4 threads crammed into 2 modules" might be interesting to see how much Bulldozer's modules actually lose over discrete cores by sharing certain parts of the CPU pipeline. Of course this is only meaningful if they run at fixed frequencies, i.e. with turbo core and any dynamic frequency scaling disabled.
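
    A rough sketch of how the two layouts could be set up on Linux, in Python. Big assumption, labeled in the code as well: logical CPUs 2m and 2m+1 are taken to be the two cores of module m. That mapping should be verified on the actual machine before trusting any result. `module_layouts`, `pin_current_process` and `./benchmark` are hypothetical names used only for illustration.

```python
import os


def module_layouts(n_threads=4, cores_per_module=2, n_modules=4):
    """Return (spread, packed) lists of logical CPU ids for n_threads.

    ASSUMPTION: logical CPUs m*cores_per_module .. m*cores_per_module+1
    belong to module m. Check the real topology before relying on this.
    """
    if not 0 < n_threads <= n_modules:
        raise ValueError("need 1..n_modules threads to spread one per module")
    spread = [m * cores_per_module for m in range(n_threads)]  # one core per module
    packed = list(range(n_threads))                            # fill modules in order
    return spread, packed


def pin_current_process(cpus):
    """Restrict this process (and threads it starts later) to the given CPUs."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is only available on Linux")
    os.sched_setaffinity(0, set(cpus))


if __name__ == "__main__":
    spread, packed = module_layouts()
    for label, cpus in (("4 modules", spread), ("2 modules", packed)):
        cpu_list = ",".join(str(c) for c in cpus)
        # './benchmark' is a placeholder for whatever test is being run.
        print(f"{label}: taskset -c {cpu_list} ./benchmark")
```

    Pinning only controls placement; it does nothing about Turbo Core or dynamic frequency scaling, so those would still need to be disabled separately for the comparison to mean anything.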
