NVIDIA GH200 72 Core Grace CPU Performance vs. AMD Ryzen Threadripper Workstations
The assertions from the GPTshop.ai website are absurd:
- Compared to an 8x Nvidia H100 system, GH200 costs 5x less, consumes 10x less energy and has roughly the same performance.
- Compared to an 8x Nvidia A100 system, GH200 costs 3x less, consumes 5x less energy and has higher performance.
- Compared to a 4x AMD MI300X system, GH200 costs 2x less, consumes 4x less energy and probably has roughly the same performance.
- For GH200, Nvidia only claims around 1 petaFLOPS for 16-bit datatypes, and half that for TF32 (which is 19 bits wide, yay NVIDIA marketing), when using accelerated FMA via matrix instructions only (pretty unlikely). For general/vector performance Nvidia claims around 70 teraFLOPS.
A single MI300X gets 1.3 petaFLOPS for 16-bit datatypes and half that for TF32-width types when using accelerated FMA via matrix instructions only. In general/vector cases the MI300X gets 80 teraFLOPS, or double that if you can dual-issue. Bandwidth-wise the MI300 also wins by a bit.
So the GH200 system is close to, but a bit slower than, one MI300 on all metrics. Fine. But on what planet is the system even remotely close to a 4x MI300 system?
Never mind that the GPU on the GH200 is almost the same as an H100; GH200 is in no way anywhere close to an 8x H100 system.
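A back-of-envelope check of the claim, using only the peak figures quoted in this thread (vendor peak numbers, not measured results, and assuming perfect multi-GPU scaling, which flatters the comparison for GH200 if anything):

```python
# Back-of-envelope check of the peak-throughput claims above.
# All figures are the vendor peak numbers quoted in this thread,
# not measured results.

specs_tflops_fp16_matrix = {
    "GH200": 1000,   # ~1 petaFLOPS FP16 with matrix (tensor) instructions
    "MI300X": 1300,  # ~1.3 petaFLOPS FP16 with matrix instructions
}

def system_peak(chip, count):
    """Naive aggregate peak: chips * per-chip peak (ignores scaling losses)."""
    return count * specs_tflops_fp16_matrix[chip]

one_gh200 = system_peak("GH200", 1)
four_mi300x = system_peak("MI300X", 4)

# Even under this generous perfect-scaling model, a 4x MI300X box has
# roughly 5x the peak FP16 matrix throughput of a single GH200.
ratio = four_mi300x / one_gh200
print(f"4x MI300X / 1x GH200 peak FP16 ratio: {ratio:.1f}x")
```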
Originally posted by DiamondAngle
[…]
Last edited by GPTshop.ai; 22 February 2024, 07:52 AM.
Originally posted by GPTshop.ai
FLOPS are not the whole story. There are bottlenecks like the bandwidth and latency of buses/interconnects. Thanks to its coherent superchip design, GH200 has much higher bandwidth and lower latency, and is also more energy efficient. Because of a lack of benchmark data, the performance comparisons are assumptions and very rough estimates based on publicly available information and in-house benchmarking. We partner with Phoronix to benchmark as much as possible and will hopefully soon have hard data in the form of publicly available benchmarks to see how the different solutions compare for different workloads. The comparisons are expected to vary greatly across workloads. If you want to know how your workload performs on GPTshop.ai GH200, you can apply for a remote bare-metal test here: https://gptshop.ai
Never mind that on MI300A the CPU<->GPU interconnect is both lower latency and higher bandwidth.
What you are doing is extremely blatant and obvious false advertising. You claim that essentially one GPU with a mediocre (by high-end server standards) CPU attached can somehow perform the same as 8x of the very same GPU, just because the way it's connected to its CPU is higher bandwidth and slightly lower latency. For this to be true you would have to design an absurdly pathological workload.
Originally posted by DiamondAngle
But for workloads that fit into the HBM on GH200 or onto an H100, this matters not at all.
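The point that the host link only matters when data has to cross it can be sketched with a simple time-weighted bandwidth model. All figures are illustrative round numbers taken from this thread (the HBM figure is an assumed ~4 TB/s for this class of GPU):

```python
# Roofline-style sketch: the CPU<->GPU link only matters when traffic
# actually crosses it. Figures are rough peaks, for illustration only.

hbm_bw_tb_s = 4.0      # assumed HBM bandwidth for this class of GPU (TB/s)
nvlink_c2c_tb_s = 0.9  # GH200 chip-to-chip NVLink (900 GB/s)
pcie5_tb_s = 0.128     # PCIe 5.0 x16 (~128 GB/s)

def effective_bandwidth(tb_from_host, tb_from_hbm, link_tb_s):
    """Time-weighted bandwidth when part of the traffic crosses the host link."""
    t_host = tb_from_host / link_tb_s
    t_hbm = tb_from_hbm / hbm_bw_tb_s
    total = tb_from_host + tb_from_hbm
    return total / (t_host + t_hbm)

# Workload resident entirely in HBM: the host link never enters the
# picture, so NVLink-C2C vs PCIe makes no difference at all.
resident = effective_bandwidth(0, 100, pcie5_tb_s)
assert resident == hbm_bw_tb_s

# Only when a share of traffic crosses the host link does the faster
# link pull ahead.
over_pcie = effective_bandwidth(10, 90, pcie5_tb_s)
over_nvlink = effective_bandwidth(10, 90, nvlink_c2c_tb_s)
print(f"10% host traffic: {over_pcie:.2f} TB/s (PCIe) vs {over_nvlink:.2f} TB/s (NVLink-C2C)")
```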
Originally posted by DiamondAngle
What you are doing is extremely blatant and obvious false advertising. […]
PS: The Grace CPU is not at all mediocre. In fact, the benchmarks here on Phoronix show that it can outperform x86 while having a much lower core count and consuming much less power.
Originally posted by GPTshop.ai
The MI300 Infinity Fabric link is PCIe 5.0 at 128 GB/s, vs the 900 GB/s chip-to-chip NVLink on GH200. So, same story here.
Originally posted by DiamondAngle
This is incorrect: 128 GB/s is the throughput between different MI300A chips; the CPU and the GPU on an MI300A can both access the shared HBM at full bandwidth (i.e. ~5 TB/s).
Last edited by GPTshop.ai; 23 February 2024, 02:08 PM.
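To put the three CPU<->GPU paths discussed in this exchange side by side, here is the time to move 1 GB over each at the peak rates quoted above (peak numbers, not measured; the shared-HBM figure is the ~5 TB/s mentioned in this reply):

```python
# Time to move 1 GB between CPU-visible and GPU-visible memory over each
# path, at the peak rates quoted in this exchange (not measured figures).

paths_gb_s = {
    "PCIe 5.0 x16": 128,          # conventional discrete-GPU attach
    "NVLink-C2C (GH200)": 900,    # Grace<->Hopper coherent link
    "Shared HBM (MI300A)": 5000,  # CPU and GPU address the same HBM (~5 TB/s)
}

for name, bw in paths_gb_s.items():
    ms = 1 / bw * 1000  # milliseconds to move 1 GB at peak rate
    print(f"{name}: {ms:.2f} ms per GB")
```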