
NVIDIA GH200 CPU Performance Benchmarks Against EPYC Zen 4 & Xeon Emerald Rapids


  • #41
    Originally posted by coder View Post
    GPTshop.ai, can you please comment on the storage, which is listed as "960GB SAMSUNG MZ1L2960HCJR-00A07 + 1920GB SAMSUNG MZTL21T9"?

    A web search quickly identifies the first as a Samsung PM9A3, which is an enterprise-grade PCIe 4.0 NVMe U.2 drive.

    I was not successful in finding details on the latter. The closest I came was the MZ1L21T9, which is an M.2 version of the PM9A3. Is that right? If so, the M.2 version is double-sided and I hope it is cooled adequately, with an appropriate heatsink.

    I have one of the PM9A3 M.2 drives - they're good but already a couple generations old. As they're 22110 (22x110 mm) it was extremely difficult to find an appropriate heatsink that provided adequate cooling for the entirety of both sides. I ended up sourcing one directly from AliExpress, called "JEYI 22110 SSD Heatsink". It's not great, but with a generous amount of high-quality thermal compound, including where its top & bottom halves meet, it's adequate for my light duty usage.

    The reason I ask is that I wonder if thermal-throttling by a sub-optimally cooled M.2 drive could hurt performance on any of the benchmarks involving storage, such as compilation and database tests.
    I sourced the parts from a Quanta QCT S74G-2U, which came already equipped with one (of two) M.2 and one (of four) E1.S SSDs. The 1TB Samsung M.2 is connected directly to the mainboard via a riser, while the E1.S is connected via a Broadcom controller. Honestly, I do not even know whether the M.2 has a heatsink; I assumed that whatever QCT selected should be fine, so I have not paid attention to it so far. But you are right: considering that the slot can take PCIe Gen5 SSDs, cooling might be a potential issue for high-performance drives. The system is currently running benchmarks, so I cannot look it up right now, but I will get back to you with more details ASAP.
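
    In the meantime, here is a quick way one could check whether the drive runs hot under load (just a sketch in Python; it assumes a reasonably recent kernel that exposes the NVMe temperature sensors under /sys/class/hwmon, and nothing GH200-specific):

    Code:
    #!/usr/bin/env python3
    # List NVMe drive temperatures via the kernel's hwmon interface.
    # Assumption: the kernel is new enough (roughly 5.5+) to register an "nvme" hwmon device.
    from pathlib import Path

    for hwmon in sorted(Path("/sys/class/hwmon").glob("hwmon*")):
        name_file = hwmon / "name"
        if not name_file.exists() or name_file.read_text().strip() != "nvme":
            continue
        for temp_input in sorted(hwmon.glob("temp*_input")):
            label_file = hwmon / temp_input.name.replace("_input", "_label")
            label = label_file.read_text().strip() if label_file.exists() else temp_input.name
            print(f"{hwmon.name}: {label}: {int(temp_input.read_text()) / 1000:.1f} °C")

    If the "Composite" reading gets close to the drive's throttling threshold (usually somewhere around 70-85 °C, depending on the model) while a storage-heavy benchmark runs, cooling would indeed deserve a closer look.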

    Comment


    • #42
      Originally posted by coder View Post
      Michael, after the Helsing test, you remark:



      The HBM is attached to the Hopper GPU, not the Grace CPU. It's therefore not going to be used by these tests. Grace is just using its directly-connected LPDDR5.
      The superchip is memory coherent, so both the CPU and GPU can use all 576 GB. See the Nvidia whitepaper for more details: https://resources.nvidia.com/en-us-g...a-grace-hopper

      Comment


      • #43
        Originally posted by GPTshop.ai View Post
        The superchip is memory coherent, so both the CPU and GPU can use all 576 GB. See the Nvidia whitepaper for more details: https://resources.nvidia.com/en-us-g...a-grace-hopper
        Merely being cache coherent doesn't mean the OS will include HBM in the main memory pool used by Grace, for general-purpose CPU workloads.

        I wonder if /proc/meminfo would provide good confirmation of this (i.e. if the HBM is excluded from that output, can we conclude the OS isn't using it?). Is anyone knowledgeable enough to comment on this? I'm not a kernel developer, so my knowledge of that side of things is quite limited.

        That said, I'd be astonished if the CPU was grabbing HBM for its own use. That pool should be set aside for use by CUDA.
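
        For example, something along these lines might give a first-order answer (purely a sketch; it assumes the standard /proc/meminfo and sysfs NUMA layout, and that the GPU-attached HBM, if the driver onlines it at all, would show up as its own CPU-less NUMA node):

        Code:
        #!/usr/bin/env python3
        # Compare the system-wide memory pool against the per-NUMA-node totals.
        # Assumptions: standard /proc/meminfo and /sys/devices/system/node layout;
        # GPU-attached memory, if onlined, would appear as a separate (CPU-less) node.
        import re
        from pathlib import Path

        def mem_total_kib(text):
            match = re.search(r"MemTotal:\s+(\d+) kB", text)
            return int(match.group(1)) if match else 0

        total = mem_total_kib(Path("/proc/meminfo").read_text())
        print(f"System MemTotal: {total / 2**20:.1f} GiB")

        for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
            node_total = mem_total_kib((node / "meminfo").read_text())
            cpus = (node / "cpulist").read_text().strip()
            print(f"{node.name}: {node_total / 2**20:.1f} GiB, CPUs: {cpus or '(none)'}")

        If MemTotal comes out near the ~480 GB of LPDDR5X rather than the full 576 GB, and no CPU-less node carries the HBM, that would support the idea that the general-purpose pool stays off the GPU's memory.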

        Comment


        • #44
          Originally posted by coder View Post
          Merely being cache coherent doesn't mean the OS will include HBM in the main memory pool used by Grace, for general-purpose CPU workloads.

          I wonder if /proc/meminfo would provide good confirmation of this (i.e. if the HBM is excluded from that output, can we conclude the OS isn't using it?). Is anyone knowledgeable enough to comment on this? I'm not a kernel developer, so my knowledge of that side of things is quite limited.

          That said, I'd be astonished if the CPU was grabbing HBM for its own use. That pool should be set aside for use by CUDA.
          I quote from the whitepaper: "In NVIDIA Grace Hopper Superchip-based systems, Address Translation Service (ATS) enables the CPU and GPU to share a single per-process page table, enabling all CPU and GPU threads to access all system-allocated memory (Figure 8), which can reside on physical CPU or GPU memory." For more details, read the whitepaper; it's very interesting.

          Comment


          • #45
            Originally posted by GPTshop.ai View Post
            Not knowing what to expect, the thing that surprised me the most is that the 72-core Grace is almost twice as fast as the 128-core Ampere Altra Max, even though they have the same TDP and the Ampere has almost twice as many cores. I did not expect that. Ampere should be worried.
            That's really nice. It's also nice because the inflation of more cores only increases the complexity of the software.

            If fewer cores are faster, then the complexity of the software goes down as well.
            Phantom circuit Sequence Reducer Dyslexia

            Comment


            • #46
              Originally posted by qarium View Post

              That's really nice. It's also nice because the inflation of more cores only increases the complexity of the software.

              If fewer cores are faster, then the complexity of the software goes down as well.
              So the person who cheer-leads for higher-core-count Threadrippers now wants to use fewer cores?

              Comment


              • #47
                Originally posted by qarium View Post
                Why do you want to encode videos on a Threadripper if FPGA and ASIC solutions are multiple times faster and more power efficient than a CPU?
                If you put an AMD PRO W7900 into such a system, it can encode multiple streams of video in its ASIC video core at the same time in AV1 and also H.265.
                I don't; in fact, on numerous occasions I have said that people should not be doing CPU-based encoding.

                Also, I hate to break this to you, but AMD constantly has atrocious quality when it comes to hardware video encoding.

                NVIDIA has the highest-quality consumer-grade hardware-based AV1 encoding, Intel and NVIDIA trade punches when it comes to consumer-grade hardware H.265 encoding, and AMD brings up the rear.

                Now, Xilinx does have excellent hardware encoders, so for commercial use I would pick up one of their cards.

                Originally posted by qarium View Post
                The Nvidia GH200 is not faster on those tasks if you put in some AMD Instincts, like the MI100/MI200/MI300.

                This system is not more highly specialised than a Threadripper with MI100/MI200/MI300 Instinct cards.
                So you either ignore or don't believe the claims that this system is as fast as 8 Instinct cards, which is it?

                Comment


                • #48
                  Originally posted by GPTshop.ai View Post
                  I quote from the whitepaper: "In NVIDIA Grace Hopper Superchip-based systems, Address Translation Service (ATS) enables the CPU and GPU to share a single per-process page table, enabling all CPU and GPU threads to access all system-allocated memory (Figure 8), which can reside on physical CPU or GPU memory." For more details, read the whitepaper; it's very interesting.
                  I get all that. And, for hybrid workloads that span the CPU and GPU, that stuff is great. The key question is whether the HBM address space shows up in the normal heap for CPU jobs, and I really doubt it does.

                  I'll bet there are some Nvidia utilities which show NVLink utilization. That would confirm whether or not the CPU is using HBM on the GPU's package, since they're connected via multiple NVLinks. If it's not being utilized, then a CPU-only job should show all of those links at about 0% utilization.

                  Anyway, changing topics a little bit, can you tell us what sort of clock speeds it's running at? The cpuinfo screenshot Michael posted says 3.411 GHz is the maximum speed. Is that true, and what clock speed do all-core workloads usually sustain?
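
                  If it helps, here is roughly how I'd sample that myself under an all-core load (a sketch; it assumes the cpufreq sysfs interface exposes scaling_cur_freq, which it normally does):

                  Code:
                  #!/usr/bin/env python3
                  # Sample per-core CPU frequencies for a while and report the sustained range.
                  # Assumption: cpufreq exposes scaling_cur_freq under sysfs (values in kHz).
                  # Run this while an all-core workload is active.
                  import time
                  from pathlib import Path
                  from statistics import mean

                  SAMPLES, INTERVAL_S = 30, 1.0

                  def core_mhz():
                      return [int(f.read_text()) / 1000
                              for f in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_cur_freq")]

                  history = []
                  for _ in range(SAMPLES):
                      sample = core_mhz()
                      if sample:
                          history.append(sample)
                      time.sleep(INTERVAL_S)

                  if not history:
                      raise SystemExit("no cpufreq data found under /sys/devices/system/cpu")

                  print(f"min {min(map(min, history)):.0f} MHz, "
                        f"avg {mean(mean(s) for s in history):.0f} MHz, "
                        f"max {max(map(max, history)):.0f} MHz over {len(history)} samples")

                  For the NVLink side, I'd expect something like "nvidia-smi nvlink --status" to at least show the links, though I'm going from memory on the exact flags.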
                  Last edited by coder; 10 February 2024, 12:27 AM.

                  Comment


                  • #49
                    Originally posted by sophisticles View Post
                    AMD constantly has atrocious quality when it comes to hardware video encoding.
                    Not atrocious.

                    As of about a year ago, their H.264 quality indeed had the biggest gap:



                    However, their H.265 is much better:



                    And so is their AV1:



                    The article also states:

                    "AMD has informed us that it's working with ffmpeg to get some quality improvements into the code, and we'll have to see how that goes. We don't know if it will improve quality a lot, bringing AMD up to par with Nvidia, or if it will only be one or two points. Still, every little bit of improvement is good."


                    It would be good to know how that went, if it indeed ever happened.

                    Finally...
                    Originally posted by sophisticles View Post
                    on numerous occasions I have said that people should not be doing CPU-based encoding.
                    That article's author claims otherwise:

                    "if you want best quality you'd generally need to opt for CPU-based encoding with a high CRF (Constant Rate Factor) of 17 or 18,"


                    Not only that, but it turns out the CPU encoding on a Raptor Lake i9 is nearly as fast as hardware encoding on an old GTX 1080 Ti!
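
                    For reference, the kind of commands being compared look roughly like this (a sketch via Python's subprocess; libx264/-crf and h264_nvenc/-cq are standard ffmpeg options, but the presets and quality targets here are illustrative rather than a calibrated match, and input.mp4 is a hypothetical source clip):

                    Code:
                    #!/usr/bin/env python3
                    # Rough illustration of CPU (libx264, CRF 18) vs. GPU (h264_nvenc) encoding.
                    # The encoders and flags are standard ffmpeg options; the preset/quality
                    # pairing is only illustrative, and "input.mp4" is a hypothetical clip.
                    import subprocess

                    SRC = "input.mp4"

                    cpu_cmd = ["ffmpeg", "-y", "-i", SRC,
                               "-c:v", "libx264", "-preset", "slow", "-crf", "18",
                               "-c:a", "copy", "out_cpu_x264.mp4"]

                    gpu_cmd = ["ffmpeg", "-y", "-i", SRC,
                               "-c:v", "h264_nvenc", "-preset", "p7", "-cq", "19",
                               "-c:a", "copy", "out_gpu_nvenc.mp4"]

                    for cmd in (cpu_cmd, gpu_cmd):
                        print("running:", " ".join(cmd))
                        subprocess.run(cmd, check=True)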




                    Of course, I'm sure you're an expert on video content production, as with so many other things.


                    Comment


                    • #50
                      Originally posted by coder View Post
                      Anyway, changing topics a little bit, can you tell us what sort of clock speeds it's running at? The cpuinfo screenshot Michael posted says 3.411 GHz is the maximum speed. Is that true, and what clock speed do all-core workloads usually sustain?
                      There is some data; please see here: https://openbenchmarking.org/result/...VIDIAGH254&sor

                      Comment
