Radeon ROCm Updates Documentation Reinforcing Focus On Headless, Non-GUI Workloads


  • qarium
    replied
    Originally posted by coder View Post
    You can find a benchmark that will show almost anything, especially on a highly-NUMA platform like TR4.
    Do you understand that you're talking about CPU scaling and this is a GPU?
    And are you aware that between Vega 64 and Radeon VII, AMD more than doubled the amount of memory bandwidth, even though the number of CUs actually dropped by 4? That's because Radeon VII was designed to be a GPU-compute card, and many of the workloads it was designed to run are highly bandwidth-intensive.
    OK, what you don't get is this: if you already use the fastest memory possible, like HBM on a 4096-bit interface,
    you just can't make it any faster so easily.
    You have to find other ways to improve performance and lower the cost for the same performance.
    AMD is already working on a chiplet design for CDNA with a shared memory I/O crossbar,
    believe it or not.
    This is all already happening.
    The only point I'm making is that some customers do not need 2 or even 4 CDNA dies; instead they need 1 RDNA die and 1 CDNA die.

    That's very easy to validate: just ask the people who buy a 6900XT and an MI100 and put them into the same PC.

    Originally posted by coder View Post
    You can't have a game where the NPCs act differently, according to what GPU you have.
    Wrong, just wrong. You can make an AI-NPC game whose NPCs are smarter on a fast GPU and dumber on slower GPUs.

    Originally posted by coder View Post
    The AI has to behave the same, on any PC that at least meets the minimum specifications.
    Wrong, there is no such minimum specification.
    If you make a game like this, no one will care whether you meet that minimum specification or not.

    You just say: the NPCs are smarter on fast GPU systems and dumber on slower GPU systems, and they are smartest if you also have an MI100 or a similar CDNA chip.

    Originally posted by coder View Post
    The only things games can really do with more compute power is make the graphics look nicer and run at a higher framerate. And RDNA is optimized to do both.
    Wrong, wrong, wrong... games can use AI for NPCs, and through that they can utilize the CDNA chip.
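    As a rough sketch of what that could look like (everything here is hypothetical: the tier names, the TFLOPS threshold, and the accelerator flag are illustrations, not a real game or AMD API):

    # Hypothetical sketch: scale NPC "brain" quality with available GPU compute.
    # Tier names, the 20 TFLOPS threshold and the accelerator flag are made up.
    from dataclasses import dataclass

    @dataclass
    class GpuInfo:
        fp32_tflops: float             # reported shader throughput of the graphics GPU
        has_compute_accelerator: bool  # a second, CDNA-style compute card is present

    def pick_npc_brain(gpu: GpuInfo) -> str:
        # Game rules stay identical; only the per-frame AI inference budget changes.
        if gpu.has_compute_accelerator:
            return "large learned planner"   # big model runs on the compute card
        if gpu.fp32_tflops >= 20.0:
            return "small learned planner"   # lighter network shares the graphics GPU
        return "classic behavior tree"       # cheap scripted fallback on slow GPUs

    print(pick_npc_brain(GpuInfo(23.0, True)))   # -> large learned planner
    print(pick_npc_brain(GpuInfo(8.0, False)))   # -> classic behavior tree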

    Leave a comment:


  • coder
    replied
    Originally posted by Qaridarium View Post
    I just compared 16 cores to 16 cores... sure, you can also benchmark the 32-core, 4-die 2990WX.

    But let's look at this: https://cpu.userbenchmark.com/Compar.../m560423vs4086
    In general the 5950X is +42% faster, and even if you force a 64-thread workload, the 2990WX is only +35% faster.
    And that is 16 cores vs 32 cores.
    So much for your argument that 4 RAM channels are better than 2 RAM channels... in a chiplet design...
    You can find a benchmark that will show almost anything, especially on a highly-NUMA platform like TR4.

    Originally posted by Qaridarium View Post
    The argument that you need double the RAM speed just because you have double the dies is wrong.

    You need more RAM speed, that's true, but not double the RAM speed.
    Do you understand that you're talking about CPU scaling and this is a GPU?

    And are you aware that between Vega 64 and Radeon VII, AMD more than doubled the amount of memory bandwidth, even though the number of CUs actually dropped by 4? That's because Radeon VII was designed to be a GPU-compute card, and many of the workloads it was designed to run are highly bandwidth-intensive.

    Originally posted by Qaridarium View Post
    And if you only sell one game like Half-Life: Alyx or Cyberpunk 2077 that really utilizes such an RDNA/CDNA combination,
    and for example makes the NPC AI characters super smart with deep learning, AMD could make it a hit.
    You can't have a game where the NPCs act differently, according to what GPU you have. The AI has to behave the same, on any PC that at least meets the minimum specifications. The only things games can really do with more compute power is make the graphics look nicer and run at a higher framerate. And RDNA is optimized to do both.
    Last edited by coder; 04 March 2021, 11:33 AM.

    Leave a comment:


  • qarium
    replied
    Originally posted by stalkerg View Post
    Yes, the main question is whether AMD will support universal (GUI/headless) compute based on a fully open-source stack (Mesa) or not.
    As I understand it, currently that is just a possible side effect and not one of the goals.
    They are working on it, to bring ROCm to all chips, and they are also working on GUI mode.
    The problem people don't understand is that a GPU product cycle is 4 years.
    AMD had a money shortage in 2016-2018... what people are experiencing right now is a problem rooted in 2016-2018.
    They are now hiring more and more Linux developers to fix all these problems.

    But even if you hire a lot of people, it takes something like 1-2 years before that becomes reality.

    Leave a comment:


  • qarium
    replied
    Originally posted by coder View Post
    You pick the smallest threadripper of that generation?
    I just compared 16 cores to 16 cores... sure, you can also benchmark the 32-core, 4-die 2990WX.

    But let's look at this: https://cpu.userbenchmark.com/Compar.../m560423vs4086
    In general the 5950X is +42% faster, and even if you force a 64-thread workload, the 2990WX is only +35% faster.
    And that is 16 cores vs 32 cores.
    So much for your argument that 4 RAM channels are better than 2 RAM channels... in a chiplet design...


    Originally posted by coder View Post
    The platform scaled up to 2x the cores, so that undermines your own argument, right from the start. Also, that platform supported more PCIe lanes, which could be why someone would buy an entry-level Threadripper -- more for the I/O than the cores.
    But, the larger point is that you're again seizing on CPUs as an analogy for GPUs, yet we know they're not!
    Right, you are right, we do not know that yet. But your idea that an I/O crossbar sharing the RAM between 2 chips needs 1 TB/s because one chip needs 0.5 TB/s is, in my view, not true. But for sure, we do not know it yet.


    Originally posted by coder View Post
    Notwithstanding the above point, I'm sure you know that it's a deeply flawed comparison, since you're comparing Zen+ cores with a highly-NUMA memory topology to Zen2 and Zen3 cores without such bottlenecks (and probably faster RAM speed, as well).
    Faster RAM speed is not true if you use ECC RAM: you can run 3200 MHz ECC RAM on a 2950X and the same RAM on a 5950X,
    because there is no faster DDR4 ECC RAM.
    The argument that you need double the RAM speed just because you have double the dies is wrong.

    You need more RAM speed, that's true, but not double the RAM speed.


    Originally posted by coder View Post
    To even make such a comparison shows that you're more interested in winning the argument than trying to figure out whether your proposal even makes sense. And that tells me I'm just wasting my time.
    In the CPU space the proposal already makes sense, because CPUs are already built as chiplet designs.
    But for sure we do not know it yet for GPUs.
    I don't think you are wasting your time, if only because we are talking about tech here.
    I made this argument because these are the examples I know from the CPU space.

    Winning an argument or losing an argument, I just don't care about that.

    Originally posted by coder View Post
    Though I never really embraced the idea, the more we've explored its various aspects, the more convinced I've become that it not only lacks a strong value proposition, but it's also not even terribly practical.
    OK... this I really don't get. Many people buy a 6900XT AND an MI100 and put them in the same PC...
    Many people want a lot of VRAM and even run out of memory with 16 GB of VRAM; that means xGMI or a shared-RAM I/O crossbar would make them very happy.
    Right now many want 3D graphics and FP64 compute; sure, they can buy a Radeon Pro VII, but in the future there is no way back from the split of RDNA and CDNA... so in the future, if they want 3D graphics and FP64 compute, they need 2 cards or 2 chips in a chiplet design.

    Originally posted by coder View Post
    Furthermore, there's simply no way AMD could ever sell enough of these to offset the engineering costs of designing it and providing the software support. Workstation cards are expensive, in spite of simply being rebadged consumer or server GPUs, primarily because they don't sell in high volumes and have added support costs. Once you add custom ASICs and packaging into the mix, you're in a completely different price tier.
    I really don't get it; AMD already sells this, and people already buy the 2 cards: 6900XT + MI100.

    And if you only sell one game like Half-Life: Alyx or Cyberpunk 2077 that really utilizes such an RDNA/CDNA combination,
    and for example makes the NPC AI characters super smart with deep learning, AMD could make it a hit.
    And then scale the same success down into the notebook space, or even a future PlayStation 6...

    I think something like this could really hit the market.

    Leave a comment:


  • stalkerg
    replied
    Yes, the main question is whether AMD will support universal (GUI/headless) compute based on a fully open-source stack (Mesa) or not.
    As I understand it, currently that is just a possible side effect and not one of the goals.

    Leave a comment:


  • stalkerg
    replied
    bridgman I read all of this and still don't understand. Sorry, it's still too confusing.

    1. We have open-source ROCm (without the latest changes) for Vega and CDNA cards (and probably for RDNA in the future; some fixes are already in AMDGPU-PRO). But open-source ROCm alone does not cover GUI-style apps, only headless workloads.
    2. Closed-source AMDGPU-PRO (which bundles certain versions of ROCm, ROCr, Mesa, etc.) covers all cases, including GUI.

    Because I play games from time to time on Gentoo, my main question is: when, and how, will you cover GPGPU (OpenCL, etc.) alongside a GUI without AMDGPU-PRO?
    You tried to explain something based on the relationships between AMDGPU-PRO, ROCm, and Mesa, and it's too complicated for those outside the HPC computing world.
    We have an excellent Mesa with radeonsi, RADV, and ACO - what should we do to add OpenCL (CUDA, etc.)? If ROCm is not going down this path, I think we should start helping the Clover guys.

    We need a table of the current state and of what AMD as a company wants to see (this table should include Clover and the other open-source projects).

    Leave a comment:


  • coder
    replied
    Originally posted by Qaridarium View Post
    Yes, exactly, this is the case, and you can prove it in the real world.
    Just look at a Threadripper 2950X with 4 DDR4 RAM channels versus a 3950X and 5950X with only 2 RAM channels.
    If your theory were right and you needed more RAM channels and RAM speed whenever you add more chiplet cores, then the 2950X would be faster, because it has 4 RAM channels.
    You pick the smallest threadripper of that generation? The platform scaled up to 2x the cores, so that undermines your own argument, right from the start. Also, that platform supported more PCIe lanes, which could be why someone would buy an entry-level Threadripper -- more for the I/O than the cores.

    But, the larger point is that you're again seizing on CPUs as an analogy for GPUs, yet we know they're not!

    Originally posted by Qaridarium View Post
    The 3950X is 17% faster than a 2950X in 64-thread multicore performance (per userbenchmark.com, AMD Ryzen 9 3950X vs Ryzen TR 2950X),

    and the 5950X is 23% faster than a 2950X.
    Notwithstanding the above point, I'm sure you know that it's a deeply flawed comparison, since you're comparing Zen+ cores with a highly-NUMA memory topology to Zen2 and Zen3 cores without such bottlenecks (and probably faster RAM speed, as well).

    To even make such a comparison shows that you're more interested in winning the argument than trying to figure out whether your proposal even makes sense. And that tells me I'm just wasting my time.

    Though I never really embraced the idea, the more we've explored its various aspects, the more convinced I've become that it not only lacks a strong value proposition, but it's also not even terribly practical.

    Furthermore, there's simply no way AMD could ever sell enough of these to offset the engineering costs of designing it and providing the software support. Workstation cards are expensive, in spite of simply being rebadged consumer or server GPUs, primarily because they don't sell in high volumes and have added support costs. Once you add custom ASICs and packaging into the mix, you're in a completely different price tier.

    Leave a comment:


  • qarium
    replied
    Originally posted by coder View Post
    ...if it solves any problem better than the first or second option, but since you so far seem to be treating the two chiplets independently, I don't see how they're worth combining just to share some RAM.
    RAM of this kind is very, very expensive. For example, if you want to use 32 GB of VRAM and you have 2 cards, both need 32 GB of VRAM;
    if you make a chiplet design with shared RAM, you only need the 32 GB of VRAM once.

    Let's say a dual-GPU card saves you 100€ on the circuit board; then you save something like another 500€ because of the shared RAM,
    and a hypothetical 4000€ card drops to 3400€ (just an example).

    Leave a comment:


  • qarium
    replied
    Originally posted by coder View Post
    You're increasing the number of dies and you think you don't also need to increase memory bandwidth?
    Yes, exactly, this is the case, and you can prove it in the real world.
    Just look at a Threadripper 2950X with 4 DDR4 RAM channels versus a 3950X and 5950X with only 2 RAM channels.
    If your theory were right and you needed more RAM channels and RAM speed whenever you add more chiplet cores, then the 2950X would be faster, because it has 4 RAM channels.

    But this is not the case: the 3950X is 17% faster than a 2950X in 64-thread multicore performance (per userbenchmark.com, AMD Ryzen 9 3950X vs Ryzen TR 2950X),

    and the 5950X is 23% faster than a 2950X (per userbenchmark.com, AMD Ryzen 9 5950X vs Ryzen TR 2950X).

    So your theory is just wrong: if you have 2 chiplet dies instead of only 1 die, you do not need double the RAM speed or RAM channels.
    Yes, the 2950X is also a chiplet design, but it has 4 RAM channels, and the 5950X is 23% faster with only 2 RAM channels.

    This means that if you have 0.5 TB/s for a 6900XT and you add a CDNA chiplet to the design, you do not need 1 TB/s just because it is one more chip. Yes, it would be nice to have more than 0.5 TB/s, but if you just go from GDDR6 to GDDR6X, a 256-bit interface goes to about 0.6 TB/s, and that would be a good fit for 2 GPU dies. You do not need the 1.2 TB/s you get from HBM2 on a 4096-bit interface.
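    A quick back-of-the-envelope check of those numbers (the per-pin data rates here are assumptions: roughly 16 Gbps for GDDR6, 19 Gbps for GDDR6X and 2.4 Gbps for HBM2):

    # Rough peak-bandwidth arithmetic; per-pin data rates are assumptions.
    def peak_tb_per_s(bus_width_bits: int, gbps_per_pin: float) -> float:
        # bits per second across the bus, / 8 bits per byte, / 1000 GB per TB
        return bus_width_bits * gbps_per_pin / 8 / 1000

    print(peak_tb_per_s(256, 16.0))   # ~0.51 TB/s  (256-bit GDDR6)
    print(peak_tb_per_s(256, 19.0))   # ~0.61 TB/s  (256-bit GDDR6X)
    print(peak_tb_per_s(4096, 2.4))   # ~1.23 TB/s  (4096-bit HBM2)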

    Originally posted by coder View Post
    And without knowing what algorithms the compute die is running, how can you know what sort of hit rate the Infinity Cache will have on it? Graphics has good cache behavior, but a lot of compute algorithms people run on GPUs are data-hungry.
    I don't know of any compute numbers connected to Infinity Cache yet...

    Leave a comment:


  • coder
    replied
    Originally posted by Qaridarium View Post
    Two separate cards are the worst case... a dual-GPU card can have an xGMI link, and if you put it in a single package as a chiplet design you can have shared memory by building a memory crossbar.
    ...if it solves any problem better than the first or second option, but since you so far seem to be treating the two chiplets independently, I don't see how they're worth combining just to share some RAM.

    Originally posted by Qaridarium View Post
    You are right that the xGMI stuff is not about getting more RAM speed, but it helps if you just need more VRAM instead of speed.
    If you read the forum, many people already have a Radeon VII with 16 GB of VRAM and they run out of VRAM.
    With an xGMI dual-GPU card they would get 32 GB of VRAM if every single GPU gets 16 GB of VRAM.
    It was not meant to expand graphics RAM, since it's like 1/8th as fast and many times greater latency than using memory on the local card. The problem it was designed to solve is inter-GPU communication (i.e. for teaming GPUs on training neural networks and other multi-GPU problems).
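    As a rough illustration of that gap (only the roughly-1/8 ratio comes from the point above; the absolute bandwidth figure is an assumption):

    # Back-of-the-envelope: streaming a 16 GB working set from local HBM2
    # versus over an xGMI link that is ~1/8 as fast (figures are assumptions).
    working_set_gb = 16.0
    local_hbm2_gb_per_s = 1000.0                  # ~1 TB/s local VRAM bandwidth (assumed)
    xgmi_link_gb_per_s = local_hbm2_gb_per_s / 8  # "about 1/8th as fast"

    print(working_set_gb / local_hbm2_gb_per_s * 1000)  # ~16 ms per pass, locally
    print(working_set_gb / xgmi_link_gb_per_s * 1000)   # ~128 ms per pass over the link,
                                                        # before counting the extra latency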

    Leave a comment:
