Radeon ROCm Updates Documentation Reinforcing Focus On Headless, Non-GUI Workloads


  • coder
    replied
    Originally posted by Qaridarium View Post
    most of the compute data would never hit the memory crossbar because it stays in the Infinity Cache, and every GPU chiplet has its own Infinity Cache.
    this means you do not even need 1 TB/s; you would get insane performance with only 0.5 TB/s.

    the 6900XT only has 0.5 TB/s + Infinity Cache.
    You're increasing the number of dies and you think you don't also need to increase memory bandwidth?

    And without knowing what algorithms the compute die is running, how can you know what sort of hit rate the Infinity Cache will have on it? Graphics has good cache behavior, but a lot of compute algorithms people run on GPUs are data-hungry.
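
    To put numbers on it, here is a back-of-envelope sketch. Every figure in it (the 0.5 TB/s budget, the per-die demand, the hit rates) is an illustrative assumption, not a measurement:

    ```cpp
    // Back-of-envelope sketch: how much shared VRAM bandwidth two chiplets need
    // depends almost entirely on the cache hit rate of the workload.
    // All numbers below are illustrative assumptions, not measurements.
    #include <cstdio>

    int main() {
        const double vram_bw_tbs = 0.5;        // assumed shared VRAM bandwidth (TB/s)
        const double demand_per_die_tbs = 0.5; // assumed raw bandwidth demand of one die (TB/s)
        const int num_dies = 2;                // one RDNA die + one CDNA die

        // Graphics tends to cache well; streaming compute often does not.
        const double hit_rates[] = {0.8, 0.5, 0.2};

        for (double hit : hit_rates) {
            // Traffic that misses the per-die Infinity Cache must go to shared VRAM.
            double vram_traffic = num_dies * demand_per_die_tbs * (1.0 - hit);
            std::printf("hit rate %.0f%% -> %.2f TB/s of VRAM traffic (%s the %.1f TB/s budget)\n",
                        hit * 100.0, vram_traffic,
                        vram_traffic <= vram_bw_tbs ? "fits in" : "exceeds", vram_bw_tbs);
        }
        return 0;
    }
    ```

    At an 80% hit rate the 0.5 TB/s budget is plenty; at a 20% hit rate the two dies already need 0.8 TB/s, and the shared memory becomes the bottleneck.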



  • qarium
    replied
    Originally posted by coder View Post
    Do they share RAM or not? In the post before last, you claimed the ability to share RAM was a justification for integrating these as chiplets in a single package. If they share RAM, then don't expect to use existing software stacks to program them. ...unless, the RAM is statically-partitioned, in which case I ask what's the point vs. a dual-GPU card or just 2 separate cards?
    you can share it with xGMI, but yes, if you put it in a single package you could do shared RAM.

    "If they share RAM, then don't expect to use existing software stacks to program them."

    why? they can just implement a legacy mode that is statically partitioned, and as soon as the software supports the chiplet design they can switch it to a shared-RAM model.

    "I ask what's the point vs. a dual-GPU card or just 2 separate cards?"

    2 separate cards are the worst case... a dual-GPU card can have an xGMI link, and if you put it in a single package as a chiplet design you can have shared memory by building a memory crossbar.

    Originally posted by coder View Post
    You're just throwing around terms. xGMI is like a quarter of the bandwidth "regular" RDNA2 cards require. Presumably, adding more chiplets will require even more memory bandwidth.
    you are right, the xGMI stuff is not about getting more RAM speed, but it helps if you just need more VRAM instead of more speed.
    many people in the forum already have a Radeon VII with 16 GB of VRAM and they run out of VRAM.
    with an xGMI dual-GPU card they would get 32 GB of VRAM if every single GPU gets 16 GB.

    that's a really nice feature; not everything is about more speed.
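
    to be clear, this already works with plain multi-GPU software. here is a minimal HIP sketch of reaching the second GPU's VRAM through peer-to-peer access (which is what the xGMI link carries); the device indices and the 1 GiB buffer size are just example assumptions, and error handling is left out:

    ```cpp
    // Minimal HIP sketch: let GPU 0 use a buffer that physically lives on GPU 1.
    // Over an xGMI/Infinity Fabric link this is how the second card's 16 GB
    // becomes reachable. Device indices and sizes are illustrative only.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int can_access = 0;
        // Can device 0 directly access memory that lives on device 1?
        hipDeviceCanAccessPeer(&can_access, 0, 1);
        if (!can_access) {
            std::printf("no peer access between GPU 0 and GPU 1\n");
            return 1;
        }

        hipSetDevice(0);
        hipDeviceEnablePeerAccess(1, 0);   // map device 1's memory into device 0's view

        hipSetDevice(1);
        void* remote = nullptr;
        hipMalloc(&remote, 1ull << 30);    // allocate 1 GiB on GPU 1

        // Kernels launched on GPU 0 can now dereference 'remote' directly;
        // the traffic goes over the GPU-to-GPU link instead of through host RAM.
        hipSetDevice(0);
        std::printf("GPU 0 can now use the buffer allocated on GPU 1\n");

        hipFree(remote);
        return 0;
    }
    ```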



  • qarium
    replied
    Originally posted by coder View Post
    They're still subject to the laws of physics and limits of semiconductor tech, and if you're proposing to build a memory crossbar that runs at like 1 TB/s, that's going to burn a lot of power just so you can glue two chiplets together for benefits that are questionable, at best.
    if you look at my design:
    an IO chip with the VRAM interface, like HBM3
    one RDNA chip and one CDNA chip
    2 Infinity Cache chips, meaning 1 for the RDNA chip and 1 for the CDNA chip

    most of the compute data would never hit the memory crossbar because it stays in the Infinity Cache, and every GPU chiplet has its own Infinity Cache.
    this means you do not even need 1 TB/s; you would get insane performance with only 0.5 TB/s.

    the 6900XT only has 0.5 TB/s + Infinity Cache.



  • coder
    replied
    Originally posted by Qaridarium View Post
    yes, right. Xilinx can build an IO chip with insane speed and performance.
    They're still subject to the laws of physics and limits of semiconductor tech, and if you're proposing to build a memory crossbar that runs at like 1 TB/s, that's going to burn a lot of power just so you can glue two chiplets together for benefits that are questionable, at best.



  • coder
    replied
    Originally posted by Qaridarium View Post
    you think the software becomes harder to develop... wrong, it is easier to develop, because MESA/ACO is already done for RDNA and ROCm is already done for CDNA. that means it would be very easy to write software for it.
    it only becomes a little harder if you want to do graphics on both chips or compute on both chips.
    yes, you can do compute on both chips, but the software is not yet done for RDNA.
    Do they share RAM or not? In the post before last, you claimed the ability to share RAM was a justification for integrating these as chiplets in a single package. If they share RAM, then don't expect to use existing software stacks to program them. ...unless, the RAM is statically-partitioned, in which case I ask what's the point vs. a dual-GPU card or just 2 separate cards?

    Originally posted by Qaridarium View Post
    i don't think so. putting these components together with something like xGMI is very common.
    no special treatment is necessary...
    You're just throwing around terms. xGMI is like a quarter of the bandwidth "regular" RDNA2 cards require. Presumably, adding more chiplets will require even more memory bandwidth.



  • qarium
    replied
    Originally posted by MadeUpName View Post
    Remember when AMD dropped $35B to buy Xilinx recently? Guess what their specialty is?
    yes, right. Xilinx can build an IO chip with insane speed and performance.



  • MadeUpName
    replied
    Originally posted by coder View Post
    At those kinds of data rates, I think an I/O chip would become a bottleneck.
    Remember when AMD dropped $35B to buy Xilinx recently? Guess what their specialty is?



  • qarium
    replied
    Originally posted by coder View Post
    At those kinds of data rates, I think an I/O chip would become a bottleneck. And if you look at the estimated power requirements of the NVSwitch chips that Nvidia uses to scale their GPU servers, a very hot one as well.
    it's not a bottleneck, because every chip would have its own Infinity Cache to reduce the global VRAM traffic.

    this GPU would have 5 chips:

    an IO chip with the VRAM interface, like HBM3
    one RDNA chip and one CDNA chip
    2 Infinity Cache chips, meaning 1 for the RDNA chip and 1 for the CDNA chip


    Originally posted by coder View Post
    In my opinion, the only scalable way to do chiplet-based GPUs is with a mesh topology, like Nvidia's prototype, or maybe at a smaller scale (i.e. 1-4 chiplets) using an approach like the 1st-gen EPYC.
    doing a mesh with like 4 chips will be very, very hard and only possible for CDNA-only cards.

    see my solution I wrote above; it already has 5 chips... one IO, one RDNA, one CDNA and 2 Infinity Cache chips.


    Originally posted by coder View Post
    In a world where AMD is struggling even to get RDNA/RDNA2 cards fully-supported by their main compute stack, how do you figure they're going to take on the additional complexity of such a solution? And would they deliver the software for it, before the hardware became obsolete?
    you think the software becomes harder to develop... wrong, it is easier to develop, because MESA/ACO is already done for RDNA and ROCm is already done for CDNA. that means it would be very easy to write software for it.
    it only becomes a little harder if you want to do graphics on both chips or compute on both chips.
    yes, you can do compute on both chips, but the software is not yet done for RDNA.

    "before the hardware became obsolete?"

    why do you think something like this would become obsolete?

    Originally posted by coder View Post
    And what about games or other software? Are they really going to devote the resources and testing to support such a solution that only very few people would actually have?
    why do you think this? MESA is already done for RDNA and ROCm is already done for CDNA...
    and it is just like putting a 6900XT and an MI100 CDNA card in your PC...

    why do you think you need some kind of special software that does not exist yet and needs to be developed?

    the truth is: the software is already done for such a solution.
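
    for example, plain HIP already enumerates the compute card next to the graphics card; a minimal sketch (which device index is the CDNA card is of course system-specific):

    ```cpp
    // Minimal HIP sketch: a CDNA card sitting next to an RDNA card is just
    // another device to enumerate and select. The same code works whether the
    // two sit on separate boards or (hypothetically) share a package.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        hipGetDeviceCount(&count);

        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop;
            hipGetDeviceProperties(&prop, i);
            std::printf("device %d: %s, %.1f GiB VRAM\n",
                        i, prop.name,
                        prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }

        // Run the compute work on whichever device is the CDNA card;
        // index 1 is purely an example, query the properties before choosing.
        if (count > 1)
            hipSetDevice(1);
        return 0;
    }
    ```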

    Originally posted by coder View Post
    Splitting work between an APU and dGPU makes sense, because that's a common configuration (especially with Intel CPUs). But a weird, hybrid device like you're proposing would require special treatment to handle properly.
    i really don't know why you think a "special treatment" is necessary.

    i don't think so. putting these components together with something like xGMI is very common.
    no special treatment is necessary...





  • coder
    replied
    Originally posted by Qaridarium View Post
    but let's face it, you or AMD can do it smarter: say you put in an IO chip like in the 3950X CPU, and the IO chip handles the HBM3 or GDDR6X. that means you put 32 GB of VRAM on the card, and both chips, RDNA and CDNA, use this memory IO chip interface.
    now you put in 2 Infinity Cache chips, one for the RDNA chip and one for the CDNA chip.
    At those kinds of data rates, I think an I/O chip would become a bottleneck. And if you look at the estimated power requirements of the NVSwitch chips that Nvidia uses to scale their GPU servers, a very hot one as well.

    In my opinion, the only scalable way to do chiplet-based GPUs is with a mesh topology, like Nvidia's prototype, or maybe at a smaller scale (i.e. 1-4 chiplets) using an approach like the 1st-gen EPYC.

    Originally posted by Qaridarium View Post
    "Would it be a single device with a mix of execution resources? We don't even have APIs for that (e.g. describing a mix of resources and specifying which workloads should run where, etc.)!"

    yes, in the end you can even do this. just like we now have CPU and GPU and APU, you could make an API that automatically uses the best hardware for the task. if a game uses AI NPCs, that can run on the CDNA compute part; if the game does graphics, it runs on the RDNA part.
    In a world where AMD is struggling even to get RDNA/RDNA2 cards fully-supported by their main compute stack, how do you figure they're going to take on the additional complexity of such a solution? And would they deliver the software for it, before the hardware became obsolete?

    And what about games or other software? Are they really going to devote the resources and testing to support such a solution that only very few people would actually have?

    Splitting work between an APU and dGPU makes sense, because that's a common configuration (especially with Intel CPUs). But a weird, hybrid device like you're proposing would require special treatment to handle properly.



  • qarium
    replied
    Originally posted by coder View Post
    The elephant in the room is software. How would such a device be presented to the applications? Would it be a single device with a mix of execution resources? We don't even have APIs for that (e.g. describing a mix of resources and specifying which workloads should run where, etc.)!
    Or, would it appear as 2 devices that happen to share a package? In this case, what's even the point of putting them in the same package, instead of just putting the two existing packages on the same board? And if they share a package, presumably memory as well? How is that going to be partitioned (and, again, what was even the point)?
    There are just so many issues with this idea, and it's still not clear there's a big market for it or that it solves any problems any better than just installing separate RDNA and CDNA cards in your machine.
    all your points are reasonable thoughts, but they can be answered easily. the funny thing is that your end will be my start.

    "just installing separate RDNA and CDNA cards in your machine"

    yes, exactly, this is a very good idea: you can have a 6900XT for graphics with MESA/ACO, and you can buy an MI100 CDNA card and put it in as a second card for compute with ROCm.
    but let's calculate the cost of this: 1500€ for the 6900XT and 12500€ for the MI100 CDNA card means you end up at 14000€.
    let's say you put this on one card but everything else stays the same; you maybe save 100€ for the circuit board.

    not bad, I say. now you think 16 GB of VRAM for the 6900XT and 16 GB of VRAM for a hypothetical CDNA card sounds boring, because you can only use 16 GB... but now you use xGMI and glue the two together, and you can use up to 32 GB of VRAM.
    getting 32 GB of VRAM with xGMI sounds good to me.

    but let's face it, you or AMD can do it smarter: say you put in an IO chip like in the 3950X CPU, and the IO chip handles the HBM3 or GDDR6X. that means you put 32 GB of VRAM on the card, and both chips, RDNA and CDNA, use this memory IO chip interface.
    now you put in 2 Infinity Cache chips, one for the RDNA chip and one for the CDNA chip.

    see, now you have a very smart solution. it saves transistors, because the second chip is only CDNA and not full GCN/RDNA. you also save memory, because to give each chip up to 32 GB of VRAM you do not need to put in 64 GB like in a simple dual-GPU design... you can do even smarter stuff now: you can select which code runs on RDNA and which code runs on CDNA. if some code (32-bit floating point) runs better on RDNA you put it there; if it runs better on CDNA (64-bit floating point) you put it on the CDNA chip. and because of the Infinity Cache per chip, you can run graphics tasks and compute tasks at the same time at nearly the same speed.

    "Would it be a single device with a mix of execution resources? We don't even have APIs for that (e.g. describing a mix of resources and specifying which workloads should run where, etc.)!"

    yes, in the end you can even do this. just like we now have CPU and GPU and APU, you could make an API that automatically uses the best hardware for the task. if a game uses AI NPCs, that can run on the CDNA compute part; if the game does graphics, it runs on the RDNA part.
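
    a rough sketch of how such a dispatch rule could look. this is purely hypothetical code; none of these names or structures are an existing API, it only illustrates the placement idea (graphics -> RDNA die, FP64-heavy compute -> CDNA die, FP32 work wherever it fits):

    ```cpp
    // Hypothetical dispatcher sketch: route each task to the die it runs best on.
    // Nothing here is a real API; the device list and the heuristic are made up
    // purely to illustrate the idea.
    #include <cstdio>
    #include <string>
    #include <vector>

    enum class Workload { Graphics, Fp32Compute, Fp64Compute };

    struct Device {
        std::string name;
        bool fast_fp64;   // CDNA-style compute die
        bool can_render;  // RDNA-style graphics die
    };

    // Tiny placement heuristic.
    const Device& pick_device(const std::vector<Device>& devices, Workload w) {
        for (const Device& d : devices) {
            if (w == Workload::Graphics && d.can_render) return d;    // RDNA part
            if (w == Workload::Fp64Compute && d.fast_fp64) return d;  // CDNA part
        }
        // FP32 compute (and anything unmatched) just takes the first device here.
        return devices.front();
    }

    int main() {
        std::vector<Device> devices = {
            {"RDNA die", /*fast_fp64=*/false, /*can_render=*/true},
            {"CDNA die", /*fast_fp64=*/true,  /*can_render=*/false},
        };

        std::printf("frame rendering   -> %s\n",
                    pick_device(devices, Workload::Graphics).name.c_str());
        std::printf("FP64 physics/AI   -> %s\n",
                    pick_device(devices, Workload::Fp64Compute).name.c_str());
        std::printf("FP32 compute pass -> %s\n",
                    pick_device(devices, Workload::Fp32Compute).name.c_str());
        return 0;
    }
    ```
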
    Last edited by qarium; 03 March 2021, 01:04 PM.

