NVIDIA GH200 72 Core Grace CPU Performance vs. AMD Ryzen Threadripper Workstations


  • #21
    Originally posted by Dawn View Post
    My assumption, which is possibly a bad one, would be that the GPU memory space - while coherent - would be owned by the Nvidia driver and reserved for cudaMalloc() calls and similar, as well as whatever on-device housekeeping and program memory are needed. You could pass pointers between the CPU and GPU, but "regular ol' host malloc() allocates in the GPU address range" sounds super weird to me.
    Yes, this is exactly what I'm thinking.
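
    For what it's worth, NVIDIA documents that on hardware-coherent platforms like GH200 (and on HMM-capable drivers elsewhere) a GPU kernel can dereference system-allocated, plain host malloc() pointers directly. A minimal sketch, assuming such a coherent system; the kernel and sizes here are illustrative:

    ```cpp
    // Minimal sketch, assuming a platform with coherent / system-allocated
    // memory support (e.g. GH200, or HMM-capable drivers elsewhere).
    // The kernel dereferences an ordinary host malloc() pointer; no
    // cudaMalloc() or cudaMemcpy() is involved.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void scale(double *data, int n, double factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        double *data = (double *)malloc(n * sizeof(double)); // ordinary host allocation
        for (int i = 0; i < n; ++i) data[i] = 1.0;

        // On a coherent system the GPU can touch this pointer as-is;
        // on anything else this launch will fault.
        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0);
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]); // expect 2.0
        free(data);
        return 0;
    }
    ```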



    • #22
      Originally posted by SomeoneElse View Post
      Why not choose a more mature platform for the competition?
      I think the logic behind this matchup is that the GPTshop.ai machine is sold as a workstation.



      • #23
        The assertions from the GPTshop.ai website are absurd:
        • Compared to an 8x Nvidia H100 system, the GH200 costs 5x less, consumes 10x less energy, and has roughly the same performance.
        • Compared to an 8x Nvidia A100 system, the GH200 costs 3x less, consumes 5x less energy, and has higher performance.
        • Compared to a 4x AMD MI300X system, the GH200 costs 2x less, consumes 4x less energy, and probably has roughly the same performance.
        • For GH200, Nvidia only claims around 1 petaFLOPS for 16-bit datatypes and half that for TF32 (a 19-bit format, yay NVIDIA marketing), and only when using accelerated fused multiply-add via matrix instructions (pretty unlikely in practice). For general/vector performance Nvidia claims around 70 teraFLOPS.

          A single MI300X gets 1.3 petaFLOPS for 16-bit datatypes and half that for TF32, again only when using accelerated fused multiply-add via matrix instructions. In general/vector cases the MI300X gets 80 teraFLOPS, or double that if you can dual-issue. Bandwidth-wise the MI300 also wins by a bit.

          So the GH200 system is close to but a bit slower than one MI300 in all metrics, fine. But on what planet is it even remotely close to a 4x MI300 system?

          Never mind that the GPU on the GH200 is almost the same as an H100; in no way is the GH200 anywhere close to an 8x H100 system.
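
          Taking the numbers above at face value, the gap is easy to quantify: a 4x MI300X box is roughly 4 x 1.3 ≈ 5.2 petaFLOPS of 16-bit matrix throughput against about 1 petaFLOPS for a single GH200, around a 5x difference on paper before interconnect effects even enter the picture.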



        • #24
          Originally posted by DiamondAngle View Post
          The assertions from the GPTshop.ai website are absurd:


          For GH200, Nvidia only claims around 1 petaFLOPS for 16-bit datatypes and half that for TF32 (a 19-bit format, yay NVIDIA marketing), and only when using accelerated fused multiply-add via matrix instructions (pretty unlikely in practice). For general/vector performance Nvidia claims around 70 teraFLOPS.

          A single MI300X gets 1.3 petaFLOPS for 16-bit datatypes and half that for TF32, again only when using accelerated fused multiply-add via matrix instructions. In general/vector cases the MI300X gets 80 teraFLOPS, or double that if you can dual-issue. Bandwidth-wise the MI300 also wins by a bit.

          So the GH200 system is close to but a bit slower than one MI300 in all metrics, fine. But on what planet is it even remotely close to a 4x MI300 system?

          Never mind that the GPU on the GH200 is almost the same as an H100; in no way is the GH200 anywhere close to an 8x H100 system.
          Flops are not the whole story. There are bottlenecks like the bandwidth and latency of buses/interconnects. The GH200, due to its coherent superchip design, has much higher bandwidth and lower latency, and is also more energy efficient. For lack of benchmark data, the performance comparisons are assumptions and very rough estimates based on publicly available information and in-house benchmarking. We partner with Phoronix to benchmark as much as possible and will hopefully soon have hard data, in the form of publicly available benchmarks, to see how the different solutions compare for different workloads. The comparisons are expected to vary greatly for different workloads. If you want to know how your workloads perform on the GPTshop.ai GH200, you can apply for a remote bare-metal test here: https://gptshop.ai
          Last edited by GPTshop.ai; 22 February 2024, 07:52 AM.



          • #25
            Originally posted by GPTshop.ai View Post

            Flops are not the whole story. There are bottlenecks like the bandwidth and latency of buses/interconnects. The GH200, due to its coherent superchip design, has much higher bandwidth and lower latency, and is also more energy efficient. For lack of benchmark data, the performance comparisons are assumptions and very rough estimates based on publicly available information and in-house benchmarking. We partner with Phoronix to benchmark as much as possible and will hopefully soon have hard data, in the form of publicly available benchmarks, to see how the different solutions compare for different workloads. The comparisons are expected to vary greatly for different workloads. If you want to know how your workloads perform on the GPTshop.ai GH200, you can apply for a remote bare-metal test here: https://gptshop.ai
            Yeah, nah. Flops are not the whole story, and latency and bandwidth matter, but the GH200 isn't special there either when compared to the systems you claim it outperforms. The only thing special about GH200 is the CPU<->GPU interconnect. But for workloads that fit into HBM on a GH200 or an H100, this matters not at all, as fetching from CPU RAM is still hugely slower for the GPU than fetching from HBM. You can expect GH200 to be a bit faster than an H100 on a Zen 4 Epyc host when the working set is larger than HBM, but still way slower than when the workload fits into HBM entirely. The 8x H100 system, on the other hand, has way more HBM to play with, which, if your workload can run in a distributed manner (which any workload you would want to run on such a system can anyway), can also be accessed at much higher total throughput.
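
            A back-of-envelope model of that cliff, treating the bandwidth figures being thrown around in this thread as assumptions rather than measurements:

            ```cpp
            // Rough model of effective bandwidth when a fraction f of each pass
            // over the working set spills past HBM and must come over the
            // CPU-GPU link. Bandwidths are assumed ballpark figures, not specs.
            #include <cstdio>

            int main() {
                const double hbm_bw  = 4000.0; // GB/s, H100/GH200-class HBM ballpark
                const double c2c_bw  = 900.0;  // GB/s, NVLink-C2C on GH200
                const double pcie_bw = 128.0;  // GB/s, PCIe Gen5 x16 host link

                // Effective bandwidth is the weighted harmonic mean of the two paths.
                for (double f = 0.0; f <= 0.51; f += 0.1) {
                    double eff_gh   = 1.0 / ((1.0 - f) / hbm_bw + f / c2c_bw);
                    double eff_pcie = 1.0 / ((1.0 - f) / hbm_bw + f / pcie_bw);
                    printf("spill %2.0f%%: GH200 ~%4.0f GB/s, PCIe host ~%4.0f GB/s\n",
                           f * 100.0, eff_gh, eff_pcie);
                }
                return 0;
            }
            ```

            Under these assumptions a 10% spill leaves a PCIe-attached H100 at roughly a quarter of HBM speed while a GH200 keeps about three quarters: the faster link helps once you spill, but both are far slower than staying entirely in HBM.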
            Never mind that on MI300A the CPU<->GPU interconnect is both lower latency and higher bandwidth.

            What you are doing is blatant and obvious false advertising. You claim that essentially one GPU with a mediocre (by high-end server standards) CPU attached can somehow perform the same as 8x of the very same GPU, just because the way it is connected to its CPU has higher bandwidth and slightly lower latency. For this to be true you would have to design an absurdly pathological workload.



            • #26
              Originally posted by DiamondAngle View Post

              But for workloads that fit into HBM on a GH200 or an H100, this matters not at all.
              Memory speed is still much faster than the bottleneck connection (CPU-GPU), which is 900 GB/s chip-to-chip NVLink with GH200 vs 128 GB/s with the H100 PCIe and SXM5 versions. The difference is a factor of 7. For workloads where data is transferred between CPU and GPU, this makes a huge difference.
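
              A measurement would settle this better than spec sheets; below is a minimal sketch of the usual pinned-memory host-to-device bandwidth probe (the 1 GiB size is arbitrary). Run on a PCIe host and on a GH200, it is the kind of data that would substantiate the factor-of-7 claim:

              ```cpp
              // Minimal host->device bandwidth probe using pinned memory and
              // CUDA events. Compare the result on a PCIe-attached H100 host
              // vs a GH200 (NVLink-C2C) host.
              #include <cstdio>
              #include <cuda_runtime.h>

              int main() {
                  const size_t bytes = 1ull << 30; // 1 GiB
                  void *host = nullptr, *dev = nullptr;
                  cudaMallocHost(&host, bytes); // pinned, so the copy runs at link speed
                  cudaMalloc(&dev, bytes);

                  cudaEvent_t start, stop;
                  cudaEventCreate(&start);
                  cudaEventCreate(&stop);

                  cudaEventRecord(start);
                  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
                  cudaEventRecord(stop);
                  cudaEventSynchronize(stop);

                  float ms = 0.0f;
                  cudaEventElapsedTime(&ms, start, stop);
                  printf("H2D: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

                  cudaFree(dev);
                  cudaFreeHost(host);
                  return 0;
              }
              ```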



              • #27
                Originally posted by DiamondAngle View Post

                Never mind that on MI300A the CPU<->GPU interconnect is both lower latency and higher bandwidth.
                The MI300's Infinity Fabric link is PCIe 5.0-based at 128 GB/s vs the 900 GB/s chip-to-chip NVLink with GH200. So, same story here.



                • #28
                  Originally posted by DiamondAngle View Post

                  What you are doing is blatant and obvious false advertising. You claim that essentially one GPU with a mediocre (by high-end server standards) CPU attached can somehow perform the same as 8x of the very same GPU, just because the way it is connected to its CPU has higher bandwidth and slightly lower latency. For this to be true you would have to design an absurdly pathological workload.
                  There are real-life workloads where a GH200 performs comparably to systems with 8x H100 while consuming 10x less power, costing 5x less, and fitting into a desktop case.

                  PS: The Grace CPU is not at all mediocre. In fact, the benchmarks here on Phoronix show that it can outperform x86 while having a much lower core count and consuming much less power.



                  • #29
                    Originally posted by GPTshop.ai View Post

                    The MI300's Infinity Fabric link is PCIe 5.0-based at 128 GB/s vs the 900 GB/s chip-to-chip NVLink with GH200. So, same story here.
                    This is incorrect: 128 GB/s is the throughput between different MI300A chips. The CPU and GPU on an MI300A can both access the shared HBM at full bandwidth (i.e. around 5 TB/s).



                    • #30
                      Originally posted by DiamondAngle View Post

                      This is incorrect: 128 GB/s is the throughput between different MI300A chips. The CPU and GPU on an MI300A can both access the shared HBM at full bandwidth (i.e. around 5 TB/s).
                      I beg your pardon, but you are such a troll. You claim that my statement is false and then confirm it. Infinity Fabric is 128 GB/s, and yes, it limits the throughput between MI300A chips. That is what I was saying. By the way, because the maximum scale-up is 4, you cannot have more than 512 GB of memory per node (4 x 128 GB of HBM3). And because scale-out is limited to InfiniBand or Ethernet vs 900 GB/s NVLink, at least on paper the AMD MI300 seems to be a significantly inferior solution.
                      Last edited by GPTshop.ai; 23 February 2024, 02:08 PM.

