
CPUs From 2004 Against AMD's New 64-Core Threadripper 3990X + Tests Against FX-9590

  • CochainComplex
    replied
    Originally posted by Zan Lynx View Post

    Here's a fun CPU, a Power9 at 3.8 GHz. The system is a Raptor Talos II.

    32-bit int:
    Code:
    $ perf stat ./silly-prime
    664580
    
     Performance counter stats for './silly-prime':
    
              2,656.22 msec task-clock:u              #    1.000 CPUs utilized
                     0      context-switches:u        #    0.000 K/sec
                     0      cpu-migrations:u          #    0.000 K/sec
                    39      page-faults:u             #    0.015 K/sec
         9,969,909,880      cycles:u                  #    3.753 GHz                      (33.48%)
            23,459,002      stalled-cycles-frontend:u #    0.24% frontend cycles idle     (50.31%)
         3,127,568,964      stalled-cycles-backend:u  #   31.37% backend cycles idle      (16.57%)
        15,721,500,489      instructions:u            #    1.58  insn per cycle
                                                      #    0.20  stalled cycles per insn  (33.13%)
         3,515,175,504      branches:u                # 1323.373 M/sec                    (49.69%)
             5,110,866      branch-misses:u           #    0.15% of all branches          (16.56%)
    
           2.656827962 seconds time elapsed
    
           2.646893000 seconds user
           0.002416000 seconds sys
    I was going to post its 64-bit result, but it is essentially identical so no point.
    I wish I had such a machine ...


  • Zan Lynx
    replied
    Originally posted by CochainComplex View Post

    Code:
    real 0m10,153s
    Code:
    $ perf stat ./prime_gcc_int64
    664580
    
     Performance counter stats for './prime_gcc_int64':
    
             10.184,94 msec task-clock:u              #    1,000 CPUs utilized
                     0      context-switches:u        #    0,000 K/sec
                     0      cpu-migrations:u          #    0,000 K/sec
                    54      page-faults:u             #    0,005 K/sec
        47.981.199.959      cycles:u                  #    4,711 GHz
        17.446.323.647      instructions:u            #    0,36  insn per cycle
         3.508.147.325      branches:u                #  344,445 M/sec
             4.406.190      branch-misses:u           #    0,13% of all branches
    
          10,185381274 seconds time elapsed
    
          10,175968000 seconds user
           0,000999000 seconds sys
    Xeon E-2286M
    Here's a fun CPU, a Power9 at 3.8 GHz. The system is a Raptor Talos II.

    32-bit int:
    Code:
    $ perf stat ./silly-prime
    664580
    
     Performance counter stats for './silly-prime':
    
              2,656.22 msec task-clock:u              #    1.000 CPUs utilized          
                     0      context-switches:u        #    0.000 K/sec                  
                     0      cpu-migrations:u          #    0.000 K/sec                  
                    39      page-faults:u             #    0.015 K/sec                  
         9,969,909,880      cycles:u                  #    3.753 GHz                      (33.48%)
            23,459,002      stalled-cycles-frontend:u #    0.24% frontend cycles idle     (50.31%)
         3,127,568,964      stalled-cycles-backend:u  #   31.37% backend cycles idle      (16.57%)
        15,721,500,489      instructions:u            #    1.58  insn per cycle         
                                                      #    0.20  stalled cycles per insn  (33.13%)
         3,515,175,504      branches:u                # 1323.373 M/sec                    (49.69%)
             5,110,866      branch-misses:u           #    0.15% of all branches          (16.56%)
    
           2.656827962 seconds time elapsed
    
           2.646893000 seconds user
           0.002416000 seconds sys
    I was going to post its 64-bit result, but it is essentially identical so no point.


  • CochainComplex
    replied
    Originally posted by Raka555 View Post

    I did try with clang as well, but the code gcc produced was faster or the same.

    This was originally not a "benchmark". I actually needed to produce prime numbers, so it stems from a real world application.
    (And I know this is not the optimized version)
    I also played with -march=native in both the 32-bit and 64-bit examples, but saw no significant change above the error margin.


  • smitty3268
    replied
    A micro-benchmark that tests integer division simply isn't that useful. No serious number-crunching app relies on it, precisely because integer division is known to be a slow operation across all kinds of CPU architectures. CPU designers just have to make it good enough that people won't notice it in the general application case, and it's largely not heavily optimized by anyone.

    Something like SpecInt is still flawed and synthetic, but a much more useful and realistic comparison than a simple micro-benchmark will be. As you can see from the link below, AMD's 3700x is roughly competitive with the 9900k in single-thread performance, at least with the newer tests. Some of the older ones tend to still favor Intel. The 1st gen Ryzen is way behind.

    https://www.anandtech.com/show/14605...sing-the-bar/6
    Last edited by smitty3268; 02-11-2020, 04:44 AM.


  • Raka555
    replied
    Originally posted by lilunxm12 View Post


    You just shouldn't do that on a 32 bit system, as single registers can't hold the operand
    I hear you, but I find it a bit hard to believe that ARM processors were used for decades without anyone being able to get the most out of 32-bit integer math.
    I would rather put my money on "modern compilers" not generating optimal code for 32-bit ARM.


  • Raka555
    replied
    Originally posted by klokik View Post

    So either clang generates significantly more optimal code than gcc, or it's time to switch to arm64. But I'd rather say this benchmark is not that representative of real-world CPU performance.
    I did try with clang as well, but the code gcc produced was faster or the same.

    This was originally not a "benchmark". I actually needed to produce prime numbers, so it stems from a real world application.
    (And I know this is not the optimized version)
    Last edited by Raka555; 02-11-2020, 01:38 AM.


  • CochainComplex
    replied
    There is a market for everything ... e.g. shi*** RGB lighting. So I think Threadripper is more reasonable.


  • JoelH
    replied
    Originally posted by TemplarGR View Post
    Bulldozer was a great architecture and was a step towards Fusion. AMD's grand plan was to eliminate FPU and SIMD from the cpu cores completely, eventually, and move those calculations on the iGPU. This makes a metric ton of sense, since cpu cores only rarely calculate floating point math. And those calculations are better suited for gpgpu, which is only hindered these days by pcie latency. AMD Fusion was the best idea for cpus in 2 decades. But AMD didn't have the software and marketing grunt to push for such change, and Intel realising they would lose if AMD went that road, doubled up on AVX and their floating point calculations, especially per thread.
    Bulldozer was not a good architecture. Bulldozer was an architecture designed around some very specific bets. AMD bet that it could slim down the core and gain size advantages, even though it was competing at n-1 versus Intel. It bet that it could offset the performance impacts of that slimming by scaling clock, and it bet that it could keep the IPC impact of certain Bulldozer design decisions to a minimum.

    The first bet worked: Bulldozer cores were smaller than K10 cores had been. The second bet succeeded marginally with Piledriver -- AMD was definitely able to get higher clocks out of the uarch compared to K10. But the third bet -- the one about IPC? That failed. Instead of being confined to a small number of scenarios, the "corner cases" of Bulldozer dominated its performance. L1 cache contention between the two CPU cores caused lower than expected CPU scaling in multi-threaded workloads, which exacerbated the problem. Kaveri and Carrizo would later improve this penalty by increasing the size of the L1 caches and making other improvements to the chip, but those changes came much later.

    The original BD design was downright bad, which is why the CPU was delayed from launch in 2011. AMD had good yield on Bulldozer (a CPU they knew wouldn't be competitive) and poor yield on Llano, the APU it could have actually sold. AMD delayed Bulldozer and forced GF to eat a mountain of losses by only paying them for good Llano die. GF paid them back the following year by forcing AMD to give up its interest in the foundry in exchange for being able to manufacture parts at TSMC.

    The idea you have -- namely that you can just replace your FPU with a GPU -- never would have worked. For starters, there's non-zero latency caused by spinning a workload off to the GPU. That workload has to be setup and initialized by the CPU, and GPUs are high latency devices. Yes, AMD made some noise about this idea at the very beginning of the Fusion PR rollout, but there's a reason they never pursued it. GPU acceleration is about using the GPU for workloads and in areas where it makes sense to do so, not in areas where it does not. In many cases, the latency hit for setting a problem up on the GPU is larger than the increased performance the video card can bring to bear on the problem.

    Originally posted by TemplarGR View Post
    These days on 7nm, cpu cores even with all those SIMD parts, are TINY. It would have made a lot more sense to have even tinier cpu cores by removing the floating point units (which cost a LOT of silicon), adding tons of cache, and a beefy igpu, and move those calculations there. It would have been far better performant. It would allow the cpu cores to stop bothering with things they are not at their best, and leave the igpu do what it is best suited for... But this failed to evolve because idiots thought Bulldozer was a failure just because video games relied still on single and dual cores and as we all know, gaming is the most important thing in computing.... Even today intel sells a ton of cpus because it has slightly better per core performance and this matters to gaming. People are cretins. Now all AMD is doing is copying Intel's design but selling it at a far lower profit margin.... Yay.
    Wouldn't have worked. Again, a GPU is not a low-latency device. It isn't (and cannot be) as tightly coupled as the FPU, because the GPU *does* far more than the FPU and cannot be integrated into the CPU in the same fashion. It has its own cache, its own internal buses, its own data paths. GPUs don't implement x87 or SIMD instruction sets, so you're basically arguing that AMD should have either 1). Found a way to build a translation layer between GPU and FPU / SIMD or 2). Invented a new standard. The first wouldn't be performant -- we're now emulating a different instruction set on a much slower card. The second is unrealistic. Fusion didn't get adopted as it was.

    I was at AMD's 2013 HSA event, where Sun executives pledged that Java 8 would be fully HSA-aware and capable. It was not. Bulldozer was a bad architecture. According to Jim Keller (who told me this in-person), when he was hired, AMD had to make a choice between fixing BD or building Zen. The effort to do one was judged to be as difficult as the other. He decided to go for Zen, and the team backed him. Bulldozer didn't get put down because it sucked in gaming. Bulldozer got put down because it was a poor design, period, and AMD felt it wasn't worth the effort of fixing. The decision to build Zen was made about a month after Keller was hired, and they never looked back. Nor should they have. Best decision AMD ever made.


  • klokik
    replied
    Originally posted by Raka555 View Post

    If you mean what I get when timing it:

    r7-3700x:
    real 0m8.138s

    i7-3770:
    real 0m4.083s

    i7-4600u:
    real 0m4.883s
    I know that's not an apples-to-apples comparison, but just out of curiosity I compiled and tested it on a phone:

    Snapdragon 820:
    $ clang --version
    clang version 8.0.1 (tags/RELEASE_801/final)
    Target: aarch64-unknown-linux-android
    $ clang prime.c -o prime -O3
    $ time ./prime
    664580

    real 0m2.402s
    user 0m2.380s
    sys 0m0.000s


    Update:

    Int64 version
    $ clang prime64.c -O3 -o prime64
    $ time ./prime64
    664580

    real 0m2.489s
    user 0m2.470s
    sys 0m0.000s

    So either clang generates significantly more optimal code than gcc, or it's time to switch to arm64. But I'd rather say this benchmark is not that representative of real-world CPU performance.
    Last edited by klokik; 02-10-2020, 06:33 PM.


  • boitano
    replied
    Originally posted by torsionbar28 View Post
    $10,000 Xeon Platinum 8280
    $10,000 for 28 cores vs. $4,000 for 64 cores. Brutal.

    Originally posted by torsionbar28 View Post
    threading has always been a chicken or egg scenario.
    One totally non-professional and mundane task I would like to see these many-core CPUs used for in the future is AI simulation in video games, where there are hundreds of NPCs on the scene, each one simulated by a dedicated thread. NPCs are usually so stupid that their stupidity ruins all the immersion.
