SMT Proves Worthwhile Option For 128-Core AMD EPYC "Bergamo" CPUs


  • #21
    Originally posted by schmidtbag
    From what I recall, all tested applications except one seemed to yield a performance improvement.
    Then maybe get your memory checked, because there are lots which didn't:
    • CP2K Molecular Dynamics (1P)
    • NAMD (1P)
    • NAS Parallel Benchmarks
    • CloverLeaf
    • SPECFEM3D
    • Timed Godot Game Engine Compilation
    • Timed Linux Kernel Compilation
    • Timed LLVM Compilation
    • ASTC Encoder (Preset: Fast)
    • Graph500
    • MariaDB
    • TensorFlow (Model: GoogLeNet, ResNet-50)
    • NeuralMagic DeepSparse

    I'm so glad Michael included both 1P and 2P, because that helps us distinguish raw workload scalability from poor SMT efficacy.
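    To make that distinction concrete, here is a minimal sketch of the two ratios you'd compare; the helper names and numbers are purely illustrative, not figures from the article:

    ```python
    # Illustrative only: separates cross-socket scaling from SMT efficacy.
    # The figures passed in below are made up, not results from the article.

    def socket_scaling(perf_1p: float, perf_2p: float) -> float:
        """Throughput ratio going from one socket to two (2.0 = perfect scaling)."""
        return perf_2p / perf_1p

    def smt_gain(perf_smt_off: float, perf_smt_on: float) -> float:
        """Throughput ratio from enabling SMT at the same core count (1.0 = no benefit)."""
        return perf_smt_on / perf_smt_off

    # A workload that scales nearly linearly across sockets but regresses with SMT
    # points to poor SMT efficacy rather than a raw-scalability ceiling.
    print(socket_scaling(100.0, 190.0))  # 1.9 -> near-linear 1P to 2P scaling
    print(smt_gain(100.0, 93.0))         # 0.93 -> SMT actually hurts this workload
    ```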



    • #22
      Originally posted by kylew77
      Darn fascinating, Michael. It is interesting that in most cases it doesn't double performance like I had been led to believe with SMT; it is maybe 33% faster.
      The results are all over the place. There's no hard-and-fast rule, but heavy floating-point workloads tend to benefit little from SMT, while workloads that are mostly serial integer code with either a rather small or a very large working set should benefit a lot.

      It will also depend somewhat on how memory-bottlenecked the system is, and whether the threads are pretty much equally memory-hungry.

      I'd love to see how the results for a few key benchmarks would scale if the number of active cores were varied from 1 to 16 per compute die. That's where I think we could see the cross-over effect I was talking about with the compilation benchmarks. You'd have to soft-disable the unused cores to make sure it's truly a test of SMT, though; a sketch of how that sweep could be scripted is below.
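      A rough sketch of how that could be automated on Linux: it uses the standard CPU-hotplug sysfs interface, but the core IDs and the benchmark command are placeholders, and it needs root:

      ```python
      #!/usr/bin/env python3
      # Sketch: soft-disable (offline) cores through the Linux CPU-hotplug sysfs
      # interface so a core-count sweep only ever runs on the intended cores.
      # Core IDs and the benchmark command are illustrative placeholders; run as root.
      # SMT itself can be toggled separately via /sys/devices/system/cpu/smt/control.

      from pathlib import Path
      import subprocess

      def set_cpu_online(cpu: int, online: bool) -> None:
          """Write 1/0 to /sys/devices/system/cpu/cpuN/online (cpu0 usually can't be offlined)."""
          Path(f"/sys/devices/system/cpu/cpu{cpu}/online").write_text("1" if online else "0")

      def run_with_cores(active: list[int], pool: list[int], cmd: list[str]) -> None:
          """Offline every core in `pool` outside `active`, run the benchmark, then restore."""
          try:
              for cpu in pool:
                  if cpu != 0:  # cpu0 stays online
                      set_cpu_online(cpu, cpu in active)
              subprocess.run(cmd, check=True)
          finally:
              for cpu in pool:
                  if cpu != 0:
                      set_cpu_online(cpu, True)

      if __name__ == "__main__":
          ccd_cores = list(range(16))  # hypothetical: cores 0-15 sit on one compute die
          for n in range(1, 17):
              run_with_cores(ccd_cores[:n], ccd_cores,
                             ["echo", f"run benchmark here with {n} active cores"])
      ```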



      • #23
        Originally posted by schmidtbag
        One of the reasons multi-die GPUs still aren't really a thing is that the inter-die communication adds latency, which is the one true weakness of parallel workloads.
        GPUs are quite good at dealing with high latency. For one thing, GDDR memory has about twice the latency of regular system DRAM.

        The reasons AMD cited for RDNA3 using a multi-die architecture for the cache & memory controllers are actually power and bandwidth. In fact, if GPUs were as latency-sensitive as you suggest, even that would be a deal-breaker, since the die-to-die communication between the compute die and the L3 cache/memory-controller dies adds just as much latency as having multiple compute dies would.

        This slide is somewhat poorly worded, but they're saying that the communication lines crossing a cross-section of a big GPU die number on the order of 10k, compared with CPU die-to-die communication involving only hundreds of signals.

        Given Navi 31's die-to-die efficiency of 0.4 pJ/bit, you'd end up burning between 47.5 and 62.5 Watts just on die-to-die communication, if you take the lower bound of 10k signals running at the core clock speed of 1.9 GHz to 2.5 GHz. That's power you could otherwise spend on useful computation.
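        For reference, the back-of-the-envelope model behind this kind of estimate is just signal count times toggle rate times energy per bit, assuming each signal carries one bit per clock:

        $P_{\text{d2d}} \approx N_{\text{signals}} \times f_{\text{clock}} \times E_{\text{bit}}$

        The exact wattage then hinges on how many signals you count and the effective per-signal data rate you assume.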
        Last edited by coder; 01 August 2023, 01:22 PM.

