NVIDIA GH200 CPU Performance Benchmarks Against EPYC Zen 4 & Xeon Emerald Rapids


  • #21
    Originally posted by varikonniemi View Post
Interesting that it can both lead and trail in the benchmarks. This would lead me to think it is extremely unbalanced and relies on some magic tricks they decided to employ to show that certain benchmarks can be dominated.
It makes me think that it's specialized hardware that's not geared towards general-purpose use. Unbalanced in terms of bottlenecking is bad, but unbalanced in terms of specialization doesn't mean it's bad. For example...

If you compare a Ferrari to a decent pickup truck, the Ferrari would likely win in track tests while the pickup truck would do better in torque and off-road tests.

Like I always say, let's wait until we have the hardware ourselves before we jump to conclusions. Still, I don't think the "magic tricks" conspiracy is applicable here.

PS: I only noticed stormcrow's post after typing this. I agree 100% with that comment.



    • #22
Michael - maybe my memory is wrong, but if I'm not mistaken, those GCC compiler options seem suboptimal for some of the CPUs?

      Originally posted by https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
      [..]

      -march=cpu-type
      Generate instructions for the machine type cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated. Specifying -march=cpu-type implies -mtune=cpu-type, except where noted otherwise.
      [..]

      -mtune=cpu-type
      Tune to cpu-type everything applicable about the generated code, except for the ABI and the set of available instructions. While picking a specific cpu-type schedules things appropriately for that particular chip, the compiler does not generate any code that cannot run on the default machine type unless you use a -march=cpu-type option. For example, if GCC is configured for i686-pc-linux-gnu then -mtune=pentium4 generates code that is tuned for Pentium 4 but still runs on i686 machines.
[..]
Based on https://www.phoronix.com/benchmark/r...ks/result.svgz I can't see any compiler optimizations for AMD and Intel, and the target is just the basic x86_64-linux-gnu - so I suspect that it could even run on an old AMD Opteron CPU (so none of the newer SIMD extensions are used unless the benchmark itself includes some CPU detection / optimization).

      The GPTshop.ai GH200 and Ampere Altra Max M128-30 are using the aarch64-linux-gnu target ...

So shouldn't the benchmarks be compiled with the best/nearest "-march" (which already implies "-mtune") for the specific CPU architecture?
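
To illustrate the difference, here is a rough sketch (the -march values are only examples and it assumes gcc is on the PATH) that dumps GCC's predefined macros to show which SIMD extensions the generic x86-64 target actually enables compared to -march=native:

```python
#!/usr/bin/env python3
# Rough sketch: compare which SIMD macros GCC predefines for the generic
# x86-64 baseline versus -march=native (the host CPU).
# Assumes "gcc" is on the PATH; "-dM -E" just dumps the predefined macros.
import subprocess

SIMD_MACROS = ["__SSE4_2__", "__AVX__", "__AVX2__", "__AVX512F__", "__ARM_NEON"]

def predefined_macros(march: str) -> set:
    """Return the set of macro names GCC predefines for the given -march."""
    out = subprocess.run(
        ["gcc", f"-march={march}", "-dM", "-E", "-x", "c", "/dev/null"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.split()[1] for line in out.splitlines() if line.startswith("#define ")}

for march in ("x86-64", "native"):  # generic distro target vs. the host CPU
    enabled = [m for m in SIMD_MACROS if m in predefined_macros(march)]
    print(f"-march={march}: {', '.join(enabled) or 'none of the extensions above'}")
```

On a Zen 4 or Emerald Rapids box the -march=native line should list AVX2/AVX-512 while the generic x86-64 baseline lists none of them - which is exactly the code the compiler is not allowed to emit unless the benchmark does its own runtime dispatch.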



      • #23
        Originally posted by SomeoneElse View Post
Michael - maybe my memory is wrong, but if I'm not mistaken, those GCC compiler options seem suboptimal for some of the CPUs?



Based on https://www.phoronix.com/benchmark/r...ks/result.svgz I can't see any compiler optimizations for AMD and Intel, and the target is just the basic x86_64-linux-gnu - so I suspect that it could even run on an old AMD Opteron CPU (so none of the newer SIMD extensions are used unless the benchmark itself includes some CPU detection / optimization).

        The GPTshop.ai GH200 and Ampere Altra Max M128-30 are using the aarch64-linux-gnu target ...

So shouldn't the benchmarks be compiled with the best/nearest "-march" (which already implies "-mtune") for the specific CPU architecture?
That table is just the distro-supplied (Ubuntu) compiler build details, not the per-test compiler flags.
        Michael Larabel
        https://www.michaellarabel.com/



        • #24
          Originally posted by Michael View Post

          That table is just of the distro-supplied (Ubuntu) compiler build details. Not per-test compiler flags.
          Thx for clarifying!



          • #25
            Originally posted by drakonas777 View Post
Some people do CPU-based en/decoding and rendering. TR is great for that. But generally I would agree with you, GPUs and fixed function offer far better value in the majority of massively parallel workloads. That's why Intel's "many E-core" desktop strategy has no crucial practical meaning for the most part, and that's why the proposition you made in the other thread, that desktop CPUs should have hundreds of weak cores, is absolute dogshit, contradicting your own post here.
            Encoding on the CPU has some significant limitations, primarily related to scaling.

            While x264 is claimed to scale to 128 threads, it is well known that encoding quality starts to drastically drop off as the number of threads increases.

With regard to video encoding, threading can be accomplished in a number of different ways:

1) Slice-based threading, where every frame is "sliced" up into different sections and each section is encoded on a different thread. This method is somewhat useful for large-resolution video with static scenes, but since each thread can't reference the data in the other threads, the practical limit is 4 slices per frame, and even that results in noticeable quality degradation.

2) Frame-based threading, where every frame is encoded on a separate thread and each frame references no other frame. For instance, if you were doing I-frame-only encoding, this would be a productive approach.

3) GOP-based threading, where one group of pictures is encoded on one thread, another is encoded on another thread, and so on. This approach is great for closed-GOP encoding, but it can't be used with open GOP, and closed GOP is generally considered lower quality than open GOP.

Regardless of which approach you use, you still have the issue that parts of the video encoding pipeline are inherently single-threaded and there is nothing you can do about it, for instance entropy coding such as CAVLC and CABAC.

            What you can do with a system that has a large number of cores, such as a TR, is to start a number of encodes at the same time in order to saturate all the cores.

But as I have mentioned in other threads, you start running into I/O limitations unless the system has multiple disks and you read/write from a different disk for each encode.
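
To make the "many simultaneous encodes" approach concrete, here is a rough sketch (assuming ffmpeg with libx264 is installed; the clip names and thread counts are just placeholders):

```python
#!/usr/bin/env python3
# Rough sketch: saturate a many-core CPU by running several independent
# x264 encodes in parallel, instead of handing one encode all the threads.
# Assumes ffmpeg with libx264 is installed; file names are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

INPUTS = [Path(f"clip{i}.mkv") for i in range(8)]  # hypothetical source clips
THREADS_PER_ENCODE = 8  # keep each encode's thread count modest to preserve quality

def encode(src: Path) -> int:
    """Run one independent encode and return the ffmpeg exit code."""
    dst = src.with_name(src.stem + ".x264.mkv")
    cmd = [
        "ffmpeg", "-y", "-i", str(src),
        "-c:v", "libx264", "-preset", "slow", "-crf", "18",
        "-threads", str(THREADS_PER_ENCODE),
        str(dst),
    ]
    return subprocess.run(cmd).returncode

# The ffmpeg processes do the actual CPU work, so a plain thread pool is
# enough to keep all of the encodes running at the same time.
with ThreadPoolExecutor(max_workers=len(INPUTS)) as pool:
    print("exit codes:", list(pool.map(encode, INPUTS)))
```

If disk I/O becomes the wall, pointing each job at a different drive, as mentioned above, is the usual workaround.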

            If you reread my post regarding the many E-cores statement, you will note that I said it was my feeling that such a processor would result in a smoother, i.e. more responsive experience for end users.

Much like a low-latency kernel does not really help throughput but does help responsiveness, a CPU with many smaller, lower-clocked cores should in theory offer a better experience when you have hundreds of browser tabs open and you are copying hundreds of files from one encrypted folder to another.

            As for this:

            Objectively not true. Such a big exaggeration is a sign that your feelings were hurt by the fact Intel's HEDT/WS segment up until recently was a complete shit show, so you have to come up with some nonsense which attacks AMD HEDT.
I would say you have it backwards: Intel has had a very credible HEDT/WS presence for decades; it is AMD that only recently started offering a credible product, at least in some people's eyes.

Here's the thing: I really don't care, because I don't own stock in either company, and I think both their product lines are going to be in serious danger when NVIDIA releases its desktop ARM-based CPU.

I saw an article saying that AMD is also planning to release an ARM-based desktop CPU, and who knows what Intel will decide to do.

It could be that in 5 years ARM has become the favored desktop CPU and everyone who spent big bucks on TR-based systems is kicking themselves.



            • #26
              Originally posted by GPTshop.ai View Post
              Comparison of GH200 to alternative systems with the same amount of memory:
• Compared to 4x AMD MI300X, the GH200 costs 4x less, consumes 4x less energy and is not far off in terms of performance.
• Compared to 5x AMD MI300A, the GH200 costs 3x less, consumes 3x less energy and has at least the same performance.
• Compared to 8x Radeon PRO W7900, which have significantly less memory (only 384GB), the GH200 costs the same, consumes 3x less energy and has higher performance.
              This is just amazing, and I love the glass cases.

If I had the money I would buy one, or maybe two, just for bragging rights.



              • #27
                Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post
                People aren't really cobbling together Threadripper and Instinct systems on Amazon though. GH200 is tailor made for the LLM craze. And it isn't just raw CPU or GPU performance. You can get AWS instances of these with 4.5TB of HBM3e. The CPU to GPU interconnect is 900GB/s. They are stupid fast at what they are made to do.
It is true that AWS does not cobble together Threadripper with Instinct, but of course they do cobble together EPYC with Instinct.
"AWS instances of these with 4.5TB of HBM3e. The CPU to GPU interconnect is 900GB/s"
Yes, that sounds nice, but of course AMD has an interconnect between Instinct and EPYC too.

I only talked about Threadripper instead of EPYC because sophisticles calls them a scam.

But I am pretty sure that if sophisticles ran his bogus benchmarks of an Intel 14900 CPU against a GH200, Intel would win...



                • #28
                  Originally posted by GPTshop.ai View Post
                  There are some system power metrics in the additional supplemented results: https://openbenchmarking.org/result/...VIDIAGH254&sor

As the builder of the system, I can also personally confirm that the CPU alone under heavy load does not draw more than 400 watts max. Total system TDP (CPU+GPU+memory) is around 1000W.
                  Thank you for facilitating this test! It's super interesting to me!

                  ...not that I can afford anything like this, but I've been following the evolution of ARM server performance and it's so much better to benchmark Neoverse V2 cores on bare hardware than in Amazon's cloud.



                  • #29
                    Originally posted by sophisticles View Post
                    Allow me to clear up the mystery for you.
                    Threadrippers are a scam because of their positioning in the market.
                    They are great if you want to build a system to run a large number of virtual machines.
                    They are great if you want to run a game or web server.
They are not great if you want to build a system for encoding video, or editing video and/or audio, because they are extremely expensive and there are more efficient and cost-effective solutions for those tasks.

Why do you want to encode videos on a Threadripper if FPGA and ASIC solutions are multiple times faster and more power-efficient than a CPU?
If you put an AMD Radeon PRO W7900 into such a system, it can encode multiple video streams at the same time in its ASIC video core, in AV1 and also H.265.

                    "cost effective solutions"

                    in case of AMD you care about cost effective solutions but as soon as Nvidia GH200 pop up you don't care about money anymore it only costs you an arm and a leg 47500€ who cares about money anyway ?

                    Originally posted by sophisticles View Post
                    This system is gold for certain tasks and is faster than a Threadripper for those tasks,
The Nvidia GH200 is not faster on those tasks if you put in some AMD Instinct cards, like the MI100/MI200/MI300.

                    Originally posted by sophisticles View Post
                    This system is a highly specialized machine, kind of like how a diesel pickup truck(Bullshit-talk about Diesel)
This system is not more highly specialized than a Threadripper with MI100/MI200/MI300 Instinct cards.

                    Originally posted by sophisticles View Post
As for CUDA, it is unbeatable for certain tasks, which is why every major university offers CUDA classes, either as mandatory courses or electives for their computer science students:


It's not a secret that these universities are a conspiracy against your and everyone else's best interests.
And you can only fix this problem by plainly and simply never going to a university, because, as you say, it's mandatory.




                    • #30
                      Originally posted by mrg666 View Post
                      I would still go with EPYC Bergamo as it seems to be the best option for both price and performance.
Grace isn't really there to do the heavy lifting. It's mainly a support chip for Hopper. It supplies 480 GB of memory to supplement the 96 GB that's directly attached to the H100.

More importantly, Grace is designed to be used on SXM boards that can scale up to much larger cache-coherent configurations, thanks to NVLink. I'm not sure, but I think you can fit 16 in a single box, which Nvidia then lets you link together into a cache-coherent cluster. So they scale way better than anything x86.

                      Without testing Grace at scale, you're only seeing one aspect of what it can do.

