NVIDIA GH200 Grace CPU vs. AMD EPYC 9005 Turin CPU Performance

  • pegasus
    Senior Member
    • Aug 2015
    • 318

    #11
    My takeaway is that the software ecosystem still has a lot of maturing left to do ... unfortunately. But yes, a very refreshing set of benchmarks.

    Comment

    • nlgranger
      Junior Member
      • Jun 2024
      • 3

      #12
      Just my two cents on the performance-per-watt discussion: it is not linear, so it is possible that AMD would fare better if it were power-capped to match the speed of the ARM chip.
      Last edited by nlgranger; 10 November 2024, 01:11 PM.

      Comment

      • jruhe
        Junior Member
        • Nov 2024
        • 4

        #13
        I wonder what would happen if you lowered the TDP of the AMD CPUs until they matched the geometric mean of the GH200's performance. I suspect that the efficiency of the AMD CPUs would also improve significantly on average.
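        A rough way to see why this (and the power-capping point above) is plausible is a toy model: if throughput scales roughly linearly with clock while dynamic power scales roughly with its cube (power ~ C*V^2*f, with voltage rising roughly linearly with frequency), then downclocking to match a slower chip's throughput cuts power superlinearly. A minimal sketch; the 4.5 GHz / 400 W starting point is made up, not an EPYC measurement:

        ```c
        /* Toy perf-vs-power model: perf ~ f, power ~ f^3.
         * Build with: cc model.c -o model -lm */
        #include <stdio.h>
        #include <math.h>

        int main(void)
        {
            const double f0 = 4.5;   /* stock clock, GHz (illustrative) */
            const double p0 = 400.0; /* stock package power, W (illustrative) */

            for (double s = 1.0; s >= 0.55; s -= 0.1) {
                double f = s * f0;           /* perf ~ f, so f = s * f0 */
                double p = p0 * pow(s, 3.0); /* power ~ f^3             */
                printf("%3.0f%% perf -> %.2f GHz, ~%3.0f W, perf/W x%.2f\n",
                       s * 100.0, f, p, 1.0 / (s * s));
            }
            return 0;
        }
        ```

        Real silicon is messier (leakage plus uncore and memory power put a floor under the curve), so the effective exponent is below 3 in practice, but the direction holds: perf/W comparisons at stock TDP flatter whichever part ships closer to its efficiency sweet spot.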

        Comment

        • ikoz
          Junior Member
          • Oct 2024
          • 8

          #14
          GH200 is targeted at HPC workloads with GPU acceleration. That's why it has one Grace CPU and HBM that is cache-coherent with the H100. The Grace Superchip with 2x Grace (144 cores and double the memory bandwidth, at 500W) is the one to compare with AMD Turin. The equivalent of GH200 from AMD is the MI300A (Zen 4). The newer MI325X is GPU-only, just like B200.
          Unfortunately this expensive hardware isn't easily obtainable (and that is part of what makes them special anyway).

          Comment

          • ikoz
            Junior Member
            • Oct 2024
            • 8

            #15
            Originally posted by coder View Post
            Michael, thanks for the benchmarks!

            BTW, does --enable-multiarch provide functionality similar to -march=native? I'm just curious whether the Grace CPU gets to use its SVE2 capability, which is specific to ARMv9-A.
            NVIDIA recommends GCC 12.3+ and LLVM 16+ with -mcpu=neoverse-v2 for ARMv9 (see https://docs.nvidia.com/grace-perf-t...html#compilers).
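            A quick way to verify whether a given set of flags (or the --enable-multiarch / -march=native question above) actually enables SVE2 is to test the ACLE feature-test macros at compile time. A minimal sketch; the file name sve_check.c is just an example:

            ```c
            /* sve_check.c - prints which Arm SIMD extension the compiler
             * enabled. Build per NVIDIA's recommendation, e.g.:
             *   gcc -mcpu=neoverse-v2 -O2 sve_check.c -o sve_check
             * The __ARM_FEATURE_* macros are defined by the Arm C Language
             * Extensions (ACLE) when the target supports the feature. */
            #include <stdio.h>

            int main(void)
            {
            #if defined(__ARM_FEATURE_SVE2)
                puts("SVE2 enabled (ARMv9-A target, e.g. -mcpu=neoverse-v2)");
            #elif defined(__ARM_FEATURE_SVE)
                puts("SVE enabled, but not SVE2");
            #else
                puts("No SVE; NEON/ASIMD only");
            #endif
                return 0;
            }
            ```

            Building with -mcpu=native on Grace should likewise report SVE2, since the compiler probes the host CPU in that case.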

            Comment

            • dkokron
              Junior Member
              • May 2021
              • 19

              #16
              I wonder what the performance/$ comparison looks like.

              Comment

              • coder
                Senior Member
                • Nov 2014
                • 8829

                #17
                Originally posted by ikoz View Post
                GH200 is targeted at HPC workloads with GPU acceleration. That's why it has one Grace CPU and HBM that is cache-coherent with the H100.
                Yeah, Nvidia was clear that their goal was to have CPU nodes distributed throughout the fabric so there wouldn't be a host-memory bottleneck in multi-GPU systems. Grace was very much made to pair with GPUs, rather than to be used as a standalone server processor. That said, it'd be very interesting to see how well it scales up to 16 CPUs (1152 cores) per system, if you fully populated all of the SXM slots with dual-Grace boards.

                Originally posted by ikoz View Post
                The Grace Superchip with 2x Grace (144 cores and double the memory bandwidth, at 500W) is the one to compare with AMD Turin.
                Only if you're comparing on the basis of power. Although even then, the 500W budget for 2x Grace includes 1TB of LPDDR5X memory.

                I think the most interesting points of comparison were the 64-core 9575F and the 96-core 9655, both of which are rated at 400W. Interestingly, the 9575F has 5.0/3.3 GHz boost/base clocks, while the 9655 has 4.5/2.6 GHz. I've read that Grace has a base clock of 2.8 GHz, although I don't know if that's standard or system-specific. In the article, lscpu lists its boost clock as 3.5 GHz.
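                For what it's worth, on Linux the advertised clock limits can be read straight out of cpufreq sysfs. A small sketch; it assumes a cpufreq driver is loaded and exposes these files, which may not be the case on every Grace system:

                ```c
                /* Print cpu0's cpufreq limits from sysfs (values are in kHz).
                 * Availability depends on the platform's cpufreq driver. */
                #include <stdio.h>

                static void print_khz(const char *path, const char *label)
                {
                    FILE *f = fopen(path, "r");
                    long khz;

                    if (f && fscanf(f, "%ld", &khz) == 1)
                        printf("%s: %.2f GHz\n", label, khz / 1e6);
                    else
                        printf("%s: unavailable\n", label);
                    if (f)
                        fclose(f);
                }

                int main(void)
                {
                    print_khz("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq",
                              "cpu0 min");
                    print_khz("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq",
                              "cpu0 max");
                    return 0;
                }
                ```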

                Originally posted by ikoz View Post
                The equivalent of GH200 from AMD is the MI300A (Zen 4). The newer MI325X is GPU-only, just like B200.
                Unfortunately this expensive hardware isn't easily obtainable (and that is part of what makes them special anyway).
                I think it's reasonable to test Grace as a pure CPU. That's one way GPTshop.AI is selling it, and I was certainly interested to see how it compares in that sense.

                BTW, the way to benchmark exotic hardware is to find a cloud instance that's available. If you can use spot pricing, it might be fairly affordable to run an hour's worth of benchmarks. However, I'd bet there's zero spot availability for any of the latest products, so Michael would have to reach out to the manufacturers (or a cloud operator) and see if they'd make special arrangements with him. He's already testing the Grace CPU remotely.
                Last edited by coder; 08 November 2024, 02:38 PM.

                Comment

                • coder
                  Senior Member
                  • Nov 2014
                  • 8829

                  #18
                  Originally posted by dkokron View Post
                  I wonder what the performance/$ comparison looks like.
                  Because it's such a specialized product, I think Grace won't be very competitive on that front.

                  Graviton 4 is much more compelling on perf/$, if you go purely by their billing rates. It uses the same Neoverse V2 cores as Grace, but clocked a bit lower: 96 cores per CPU with a 768-bit DDR5-5600 interface, which works out to 537.6 GB/s. Nvidia claims Grace's LPDDR5X is good for 500 GB/s. So, they probably have about the same bandwidth per core per GHz.
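                  That 537.6 GB/s figure is just bus width times transfer rate. A quick check of the arithmetic; the Grace number is Nvidia's claim quoted above, and the clock speeds in the closing note are my own rough assumptions:

                  ```c
                  /* Theoretical peak DRAM bandwidth = (bus width in bytes) * MT/s. */
                  #include <stdio.h>

                  int main(void)
                  {
                      /* Graviton 4: 768-bit DDR5-5600 interface (per the post above) */
                      double grav4 = (768.0 / 8.0) * 5600.0 / 1000.0; /* -> GB/s */
                      /* Grace: Nvidia's claimed LPDDR5X figure */
                      double grace = 500.0;

                      printf("Graviton 4: %.1f GB/s / 96 cores = %.2f GB/s per core\n",
                             grav4, grav4 / 96.0);
                      printf("Grace:      %.1f GB/s / 72 cores = %.2f GB/s per core\n",
                             grace, grace / 72.0);
                      return 0;
                  }
                  ```

                  Dividing further by clock (assuming roughly 2.8 GHz for Graviton 4 against Grace's 3.5 GHz boost) lands both parts at about 2 GB/s per core per GHz, which is consistent with the "about the same" conclusion.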

                  Comment

                  • jbhateja
                    Junior Member
                    • May 2022
                    • 1

                    #19
                    A socket-level comparison is fair enough, since it compares one off-the-shelf package (CPU cores, caches, fabric, memory controllers, I/O controllers, other SoC components) against another.

                    But given that the two parts have different numbers of cores, and that Turin supports SMT while Neoverse V2 cores are single-threaded, it's not an apples-to-apples comparison in a cloud context, where VM instances run over a fixed number of vCPUs.

                    Most of your benchmarking leans towards bare-metal instances; it would help readers if you also measured the performance of selected workloads in a cloud context, on released parts hosted by CSPs at nearly iso configuration in terms of vCPUs and memory.

                    Comment

                    • Michael
                      Phoronix
                      • Jun 2006
                      • 14287

                      #20
                      Originally posted by jbhateja View Post
                      A socket-level comparison is fair enough, since it compares one off-the-shelf package (CPU cores, caches, fabric, memory controllers, I/O controllers, other SoC components) against another.

                      But given that the two parts have different numbers of cores, and that Turin supports SMT while Neoverse V2 cores are single-threaded, it's not an apples-to-apples comparison in a cloud context, where VM instances run over a fixed number of vCPUs.

                      Most of your benchmarking leans towards bare-metal instances; it would help readers if you also measured the performance of selected workloads in a cloud context, on released parts hosted by CSPs at nearly iso configuration in terms of vCPUs and memory.
                      While nice in theory, in practice it doesn't really work out in cases like this... GH200 isn't really available in a cloud setting beyond niche providers, and Turin is still rolling out to cloud providers.

                      But beyond that, most cloud providers offer very limited access to free/gratis instances for benchmarking, especially after the launch of any new instance type. And it's not within budget to do all those extra cloud benchmark runs when instances aren't provided by the CSP. When I do pay to run CSP benchmarks myself, most of those articles don't even make a profit; they're mostly out of my own technical interest.

                      Edit: and with the CSP comparisons, it's also worth mentioning the lack of CPU power monitoring access.
                      Michael Larabel
                      https://www.michaellarabel.com/

                      Comment
