NVIDIA GH200 Grace CPU vs. AMD EPYC 9005 Turin CPU Performance


  • coder
    replied
    Originally posted by Michael View Post
    But beyond that, most cloud providers offer very limited access to free/gratis instances for benchmarking, especially after the launch of any new instance type. And it's not within budget to do all those extra cloud benchmark runs when they're not provided by the CSP.
    Have you ever tried running benchmarks on spot instances? Or do they go offline too quickly to be a practical option?

    Originally posted by Michael View Post
    Edit: and with the CSP comparisons, it's also worth mentioning the lack of CPU power monitoring access.
    It's obviously imperfect, but this is where I look to pricing as a proxy. Especially with spot pricing, you can be sure they're not going to price an instance below its marginal cost. I'll bet most people using spot instances are doing so for compute-heavy work, since that's about all they're good for, so the prices should account for most of the cores running flat-out most of the time.

    That said, there's obviously some supply/demand aspect to spot pricing at AWS, since the r7g.12xlarge you just tested currently has a spot price of $0.4759/hr, whereas an otherwise identical instance with half the RAM has a spot price of only $0.3176/hr. However, cutting the RAM in half again saves nothing, with the c7g.12xlarge having a spot price that's $0.0005 higher.
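
    To put numbers on that "pricing as a proxy" idea, here's a quick back-of-the-envelope sketch in Python. The RAM sizes are AWS's published specs for these instance types, and I'm assuming the "half the RAM" sibling is the m7g.12xlarge (192 GiB):

    # Spot prices as quoted above; RAM sizes from AWS's public instance specs.
    instances = {
        "r7g.12xlarge": {"ram_gib": 384, "spot_usd_hr": 0.4759},
        "m7g.12xlarge": {"ram_gib": 192, "spot_usd_hr": 0.3176},  # assumed "half the RAM" sibling
        "c7g.12xlarge": {"ram_gib": 96,  "spot_usd_hr": 0.3181},  # $0.0005 higher than m7g
    }
    r7g, m7g, c7g = (instances[k] for k in ("r7g.12xlarge", "m7g.12xlarge", "c7g.12xlarge"))

    # Implied marginal price of the 192 GiB separating r7g from m7g...
    delta_ram = r7g["ram_gib"] - m7g["ram_gib"]
    delta_price = r7g["spot_usd_hr"] - m7g["spot_usd_hr"]
    print(f"r7g -> m7g: dropping {delta_ram} GiB saves ${delta_price:.4f}/hr "
          f"(~${delta_price / delta_ram:.5f} per GiB-hour)")

    # ...versus the next halving, which saves nothing at all.
    print(f"m7g -> c7g: dropping another {m7g['ram_gib'] - c7g['ram_gib']} GiB changes the price by "
          f"${c7g['spot_usd_hr'] - m7g['spot_usd_hr']:+.4f}/hr")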

    BTW, how much consistency in benchmark scores do you see from run-to-run, on these cloud instances?



  • Michael
    replied
    Originally posted by jbhateja View Post
    While a socket-level comparison is fair enough, as it compares the performance of one off-the-shelf package (CPU cores, caches, fabric, memory controllers, I/O controllers, other SoC components) against another.

    Given that the two parts have different numbers of cores, Turin supports SMT, and Neoverse V2 cores are single-threaded, it's not an apples-to-apples comparison when seen in a cloud context where VM instances run on a fixed number of vCPUs.

    Most of your benchmarking is inclined towards bare-metal instances; it would help readers if you also measured the performance of selected workloads in a cloud context, with released parts hosted by CSPs at a nearly iso configuration in terms of vCPUs and memory.
    While nice in theory, in practice it doesn't really work out in cases like this... GH200 isn't really available in a cloud setting beyond niche providers, and Turin is still rolling out to cloud providers.

    But beyond that, most cloud providers offer very limited access to free/gratis instances for benchmarking, especially after the launch of any new instance type. And it's not within budget to do all those extra cloud benchmark runs when they're not provided by the CSP. When I do pay for instances myself to run CSP benchmarks at cost, most of those articles don't even turn a profit; they're mostly done out of my own technical interest.

    Edit: and with the CSP comparisons, it's also worth mentioning the lack of CPU power monitoring access.



  • jbhateja
    replied
    While a socket-level comparison is fair enough, as it compares the performance of one off-the-shelf package (CPU cores, caches, fabric, memory controllers, I/O controllers, other SoC components) against another.

    Given that the two parts have different numbers of cores, Turin supports SMT, and Neoverse V2 cores are single-threaded, it's not an apples-to-apples comparison when seen in a cloud context where VM instances run on a fixed number of vCPUs.

    Most of your benchmarking is inclined towards bare-metal instances; it would help readers if you also measured the performance of selected workloads in a cloud context, with released parts hosted by CSPs at a nearly iso configuration in terms of vCPUs and memory.



  • coder
    replied
    Originally posted by dkokron View Post
    I wonder what the performance/$ comparison looks like.
    Because it's such a specialized product, I think Grace won't be very competitive on that front.

    Graviton 4 is much more compelling on perf/$, if you go purely by their billing rates. It uses the same Neoverse V2 cores as Grace, but clocked a bit lower. It has 96 cores per CPU, with a 768-bit DDR5-5600 interface, which works out to 537.6 GB/s. Nvidia claims Grace's LPDDR5X is good for 500 GB/s. So, they probably have about the same bandwidth per core per GHz.
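
    As a rough sketch of that bandwidth math in Python (peak numbers only; the per-core figures just divide peak bandwidth by core count and ignore clocks):

    # Peak DRAM bandwidth = bus width (bits) / 8 * transfer rate (MT/s).
    def peak_bw_gbs(bus_width_bits, mt_per_s):
        return bus_width_bits / 8 * mt_per_s * 1e6 / 1e9

    graviton4_bw = peak_bw_gbs(768, 5600)   # 537.6 GB/s, matching the figure above
    grace_bw = 500.0                        # Nvidia's quoted LPDDR5X number

    print(f"Graviton 4: {graviton4_bw:.1f} GB/s / 96 cores = {graviton4_bw / 96:.2f} GB/s per core")
    print(f"Grace:      {grace_bw:.1f} GB/s / 72 cores = {grace_bw / 72:.2f} GB/s per core")
    # Grace gets ~24% more bandwidth per core, but it's also clocked somewhat higher,
    # which is why per core per GHz the two land in roughly the same place.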



  • coder
    replied
    Originally posted by ikoz View Post
    GH200 is targeted at HPC workloads with GPU acceleration. That's why it has 1 Grace and coherent HBM memory with the H100.
    Yeah, Nvidia was clear that their goal was to have CPU nodes distributed throughout the fabric, so there wouldn't be a host-memory bottleneck in multi-GPU systems. Grace was very much made to pair with GPUs, rather than to be used as a standalone server processor. That said, it'd be very interesting to see how well it scales up to 16 CPUs (1152 cores) per system, if you fully populated all of the SXM slots with dual-Grace boards.

    Originally posted by ikoz View Post
    The Grace superchip with 2x Grace (144 cores and double memory bandwidth, at 500W) is the one to compare to the AMD Turin.
    Only if you're going on the basis of power. Although, even then, the 500W budget for 2x Grace includes 1TB of LPDDR5 memory.

    I think the most interesting points of comparison were the 64-core 9575F and the 96-core 9655, both of which are rated at 400W. Interestingly, the 9575F has 5.0/3.3 GHz boost/base clocks. The 9655 has 4.5/2.6 GHz. I'm reading Grace has a base clock speed of 2.8 GHz, although I don't know if that's standard or system-specific. In the article, lscpu lists its boost clock as 3.5 GHz.
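
    For a crude side-by-side, here's cores x base clock for each part, in Python. It's a very rough throughput proxy that ignores IPC, SMT, and boost residency, and the 2.8 GHz Grace base clock is the figure I mentioned reading, not a confirmed spec:

    # Crude aggregate: cores x base clock. Ignores IPC, SMT, and boost behavior.
    parts = {
        "EPYC 9575F (400W)": (64, 3.3, 5.0),
        "EPYC 9655  (400W)": (96, 2.6, 4.5),
        "Grace (GH200)":     (72, 2.8, 3.5),  # 2.8 GHz base unconfirmed; 3.5 GHz boost per lscpu
    }
    for name, (cores, base, boost) in parts.items():
        print(f"{name}: {cores} x {base} GHz = {cores * base:.0f} core-GHz (boost {boost} GHz)")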

    Originally posted by ikoz View Post
    The equivalent of GH200 from AMD is MI300A (Zen 4). The newer MI325X is GPU-only just like GB200.
    Unfortunately, this expensive hardware isn't easily obtainable (and that is what makes it special anyway).
    I think it's reasonable to test Grace as a pure CPU. That's one way GPTshop.AI is selling it. I was certainly interested to see how it compares, in that sense.

    BTW, the way to benchmark exotic hardware is to find a cloud instance that's available. If you can use spot pricing, it might be fairly affordable to run an hour's worth of benchmarks. However, I'll bet there's zero spot availability for any of the latest products. So, Michael would have to reach out to the manufacturers (or a cloud operator) and see if they'd make special arrangements with him. He's already testing the Grace CPU remotely.
    Last edited by coder; 08 November 2024, 02:38 PM.



  • dkokron
    replied
    I wonder what the performance/$ comparison looks like.



  • ikoz
    replied
    Originally posted by coder View Post
    Michael, thanks for the benchmarks!

    BTW, does --enable-multiarch provide functionality similar to -march=native? I'm just curious whether the Grace CPU gets to use its SVE2 capability, which is specific to ARMv9-A.
    NVIDIA recommends GCC 12.3+ (-mcpu=neoverse-v2) and LLVM 16+ (-mcpu=neoverse-v2) for ARMv9 (see https://docs.nvidia.com/grace-perf-t...html#compilers).
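
    If you want to double-check from userspace whether SVE2 is actually exposed on the machine (and hence whether a -mcpu=neoverse-v2 build can make use of it), one simple check on Linux/AArch64 is the kernel's reported CPU feature flags; a minimal Python sketch:

    # Check the AArch64 feature flags the kernel reports in /proc/cpuinfo.
    # On Grace, 'sve' and 'sve2' should both show up if the stack exposes ARMv9 features.
    def cpu_features():
        feats = set()
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("Features"):
                    feats.update(line.split(":", 1)[1].split())
        return feats

    flags = cpu_features()
    for flag in ("asimd", "sve", "sve2"):
        print(f"{flag:>6}: {'present' if flag in flags else 'missing'}")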



  • ikoz
    replied
    GH200 is targeted at HPC workloads with GPU acceleration. That's why it has 1 Grace and coherent HBM memory with the H100. The Grace superchip with 2x Grace (144 cores and double memory bandwidth, at 500W) is the one to compare to the AMD Turin. The equivalent of GH200 from AMD is MI300A (Zen 4). The newer MI325X is GPU-only just like GB200.
    Unfortunately, this expensive hardware isn't easily obtainable (and that is what makes it special anyway).



  • jruhe
    replied
    I wonder what would happen if you lowered the TDP of the AMD CPUs until they matched the GH200's geometric mean performance. I suspect the efficiency of the AMD CPUs would also improve significantly on average.



  • nlgranger
    replied
    Just my 2 cents on the performance-per-watt discussion: it's not linear, so it's possible that AMD would fare better if it were capped to match the speed of the Arm chip.
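
    As a toy illustration of that non-linearity (purely illustrative numbers, not measurements of any real CPU): if throughput scales roughly with frequency while dynamic power scales roughly with f*V^2 and voltage itself drops with frequency, then power falls off roughly like f^3:

    # Toy model: perf ~ f, dynamic power ~ f * V^2 with V roughly proportional to f,
    # so power ~ f^3. Real chips have static power and uncore overheads, so the real
    # curve is flatter, but the direction is the same.
    for scale in (1.0, 0.9, 0.8, 0.7):
        perf = scale
        power = scale ** 3
        print(f"f = {scale:.1f}x: perf = {perf:.2f}x, power = {power:.2f}x, perf/W = {perf / power:.2f}x")
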
    Last edited by nlgranger; 10 November 2024, 01:11 PM.

