AMD Radeon "Aldebaran" GPU Support Published For Next-Gen CDNA


  • #41
    Originally posted by AlB80 View Post
    It's not an issue to put tens of thousands of ALUs and provide one instruction per clock for 64-thread waves.
    You're looking at this from a very particular software perspective and missing the bigger picture: GPUs are large chips, made on cutting-edge manufacturing processes, and that makes them very expensive. They don't have more compute than they need -- they have a balanced amount of compute and bandwidth for the workloads they were designed to handle.

    If computation were as cheap as you say, why is there a performance difference between the RX 6800 XT and 6900 XT, even though both use the same memory bus width and clock speed? Same for the RX 5700 vs. RX 5700 XT and RX 5600 vs. RX 5600 XT and RX 5500 vs. 5500 XT? I guess the people who buy the more expensive versions are paying for nothing and all the benchmarks are Fake News? And it's not just AMD -- Nvidia RTX GPUs with the same bandwidth span all the way from the RTX 2060 Super to the RTX 2080!

    If computation were free, then they'd just make each GPU die with enough compute power that it's entirely bandwidth-constrained, and the only spec you'd need to look at would be its memory bandwidth. Sadly, that's not the case.

    Given that you have a fixed amount of compute and a fixed amount of bandwidth, the task of a GPU programmer is to focus on optimizing whichever is the bottleneck. If your workload is computationally cheap, then you're bandwidth-limited and your goal is to manage data movement to minimize the amount of time the compute sits idle. That's different than saying computation is free -- just that, in your case, it's not the main bottleneck. Now, were you computing digits of pi or rendering fractals, you wouldn't even care about bandwidth -- it would be all about optimizing compute.

    It also happens to be the case that graphics, in particular, is very bandwidth-intensive, which results in a lot of emphasis being placed on data movement. However, just because they drill it into you to optimize your data movement to keep the compute resources busy doesn't mean that compute isn't also a bottleneck, in many cases. The fact is that there's only a fixed amount of compute power and so, as a programmer, your challenge for milking the most performance from it is going to be to ensure that the compute engines stay busy. That's different than saying that if there were more compute engines, your code wouldn't run any faster. Maybe so, if you're doing some extremely cheap computation that's very bandwidth-intensive, but it's not necessarily so. There are reasons they deploy the compute:bandwidth ratio that they do, and it's about striking a balance.
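    A quick way to see which side of that balance you're on (the device and kernel numbers below are made-up placeholders, just to sketch the roofline-style reasoning):
    Code:
# Rough roofline-style check: compare a kernel's arithmetic intensity
# (FLOPs per byte of DRAM traffic) against the device's balance point.
# All values are illustrative placeholders, not real measurements.

peak_flops = 20e12      # hypothetical device peak, FLOP/s
peak_bw    = 512e9      # hypothetical memory bandwidth, B/s

kernel_flops = 2e9      # FLOPs the kernel performs (hypothetical)
kernel_bytes = 4e9      # bytes it moves through DRAM (hypothetical)

balance   = peak_flops / peak_bw          # FLOPs/byte the hardware can sustain
intensity = kernel_flops / kernel_bytes   # FLOPs/byte the kernel actually does

if intensity < balance:
    # bandwidth-limited: runtime is set by bytes moved, ALUs sit partly idle
    print("bandwidth-limited, ~%.2f ms" % (kernel_bytes / peak_bw * 1e3))
else:
    # compute-limited: runtime is set by the ALUs, memory has headroom
    print("compute-limited, ~%.2f ms" % (kernel_flops / peak_flops * 1e3))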



    • #42
      Originally posted by coder View Post
      So, do CDNA 64-bit instructions use entirely separate registers than the 32-bit instructions, or do they operate on pairs of 32-bit registers?
      They use pairs of 32-bit registers for 64-bit instructions.

      Originally posted by AlB80 View Post
      FP64 values take two 32-bit slots. I don't see any option other than an 8-clock cycle (8 clocks x 8 threads = wave) and, as a result, 1/2-rate FP64.
      BTW, low-tier GCNs have 1/16-rate FP64 = a single FP64 ALU per SIMD, or a 64-clock cycle.
      Yep, from an ALU perspective, 1/2-rate FP64 is what makes sense. There's something else that makes 1/2 rate more expensive in silicon than you would expect... someone explained it to me, but that was many years ago. I'll see if I can dig it up.




      • #43
        Originally posted by coder View Post
        As I said on the first page of this thread: you know what's really expensive? fp64 multipliers. That's because the requisite silicon area increases as the square of the significand size (the same reason BFloat16 is preferred over IEEE 754 fp16), and with fp32 -> fp64 you're going from 24 to 53 bits.
        Expensive fp64 multipliers are useless without data.

        So, it's not as if fp64 support is only (or even primarily) constrained by register size. And if there's enough demand to build the extra multipliers, then there's certainly enough to justify widening registers, without necessarily worrying about trying to pack in more fp32, as well.
        OK, I understand you. You think GCN has been completely reworked: all data buses doubled, and so on.
        Let's compare that to another hypothetical option, where AMD simply cuts the FP32 ALUs to 8 per SIMD.

        Doubled buses (16x FP64, 4-clock cycle; 16x FP32, 4-clock cycle)
        Pros:
        1. High FP64 rate and the lowest footprint (the main target).
        Cons:
        1. Complicated (expensive) development.
        2. FP64 takes a bigger penalty from FP64 ALU latency (3 waves needed to fully occupy the SIMD).

        Cut FP32 ALUs (8x FP64, 8-clock cycle; 8x FP32, 8-clock cycle)
        Pros:
        1. High FP64 rate and a low footprint (slightly worse).
        2. Minimal rework.
        Cons:
        1. A single FP64 thread takes more time due to the long cycle (nothing special; current CDNA has the same behavior).
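        Rough issue-rate math for the two layouts (just the per-wave arithmetic above; it ignores latency hiding and real hardware details):
        Code:
# Clocks to issue one wave64 instruction for each hypothetical SIMD layout.
WAVE = 64  # threads per wave

def clocks_per_wave(lanes):
    return WAVE // lanes

for name, fp64_lanes, fp32_lanes in [
    ("doubled buses", 16, 16),   # 16x FP64 and 16x FP32 per clock
    ("cut FP32 ALUs",  8,  8),   # 8x FP64 and 8x FP32 per clock
]:
    print(name,
          "- fp64:", clocks_per_wave(fp64_lanes), "clk/wave,",
          "fp32:", clocks_per_wave(fp32_lanes), "clk/wave,",
          fp64_lanes, "fp64 results/clk")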



        • #44
          Originally posted by coder View Post
          If computation were free, then they'd just make each GPU die with enough compute power that it's entirely bandwidth-constrained, and the only spec you'd need to look at would be its memory bandwidth. Sadly, that's not the case.
          Sadly, I said "Data providing", but you heard "memory bandwidth".
          Let's look at RX6900
          Memory bw: 256bit * 16Gbps = 4096 Gbps
          ALU bw: 40WGP * 4096bit * ~2GHz = 327680 Gbps or about 80 times higher than memory bw.
          GPU design = memory hierarchy.
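          The same back-of-the-envelope arithmetic, spelled out (per-operand, and deliberately ignoring caches, LDS and register reuse):
          Code:
# RX 6900-class numbers from above: raw ALU operand bandwidth vs DRAM bandwidth.
mem_bits_per_s = 256 * 16e9        # 256-bit bus at 16 Gbps per pin
alu_bits_per_s = 40 * 4096 * 2e9   # 40 WGPs x 4096-bit operand fetch x ~2 GHz

print(mem_bits_per_s / 1e9)            # 4096.0 Gbit/s
print(alu_bits_per_s / 1e9)            # 327680.0 Gbit/s
print(alu_bits_per_s / mem_bits_per_s) # 80.0 -> the "raw balance" of 80:1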

          Given that you have a fixed amount of compute and a fixed amount of bandwidth, the task of a GPU programmer is to focus on optimizing whichever is the bottleneck. If your workload is computationally cheap, then you're bandwidth-limited and your goal is to manage data movement to minimize the amount of time the compute sits idle. That's different than saying computation is free -- just that, in your case, it's not the main bottleneck.
          Did I say free computation? C'mon.
          Originally posted by AlB80
          Data providing is the main goal for GPU developers. It's a big issue. Everything else is local and secondary.
          Now, were you computing digits of pi or rendering fractals, you wouldn't even care about bandwidth -- it would be all about optimizing compute.
          The memory hierarchy provides cheap data there. But that is a rare and unrepresentative case.
          It probably looks like magic to you, because you only think about compute power and memory bandwidth.

          It also happens to be the case that graphics, in particular, is very bandwidth-intensive, which results in a lot of emphasis being placed on data movement. However, just because they drill it into you to optimize your data movement to keep the compute resources busy doesn't mean that compute isn't also a bottleneck, in many cases. The fact is that there's only a fixed amount of compute power and so, as a programmer, your challenge for milking the most performance from it is going to be to ensure that the compute engines stay busy. That's different than saying that if there were more compute engines, your code wouldn't run any faster. Maybe so, if you're doing some extremely cheap computation that's very bandwidth-intensive, but it's not necessarily so. There are reasons they deploy the compute:bandwidth ratio that they do, and it's about striking a balance.
          Raw balance is 80:1



          • #45
            Originally posted by AlB80 View Post
            Sadly, I said "Data providing", but you heard "memory bandwidth".
            Let's look at RX6900
            Memory bw: 256bit * 16Gbps = 4096 Gbps
            ALU bw: 40WGP * 4096bit * ~2GHz = 327680 Gbps or about 80 times higher than memory bw.
            GPU design = memory hierarchy.
            I look at it in terms of floating-point ops per byte.

            If we take the MI100, it's rated at 11500 fp64 GFLOPS (not sure if that's peak or sustained) and 1229 GB/s, which works out to 9.36 fp64 ops per byte.

            Of course, scale that up by 8 bytes per fp64 value and you get 74.9, which is close to your figure. Perhaps the sustained rate is a bit lower.
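            The same ratio, written out (spec-sheet peaks, so real sustained numbers will differ):
            Code:
# MI100 headline numbers: fp64 ops per byte of DRAM traffic.
fp64_flops = 11.5e12    # 11500 fp64 GFLOPS (peak)
mem_bw     = 1229e9     # 1229 GB/s

ops_per_byte = fp64_flops / mem_bw
print(round(ops_per_byte, 2))      # ~9.36 fp64 ops per byte
print(round(ops_per_byte * 8, 1))  # ~74.9 when scaled by 8 bytes per fp64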

            Originally posted by AlB80 View Post
            The memory hierarchy provides cheap data there. But that is a rare and unrepresentative case.
            In compute workloads, there are many cases where you can work mostly out of local memory. Convolutions, FFTs, and even large matrices can be paged through local memory to reduce the strain on the external memory bus.
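            As a toy illustration of that reuse (plain NumPy, nothing GPU-specific -- the tiles just stand in for what would live in LDS/shared memory):
            Code:
# Blocked matrix multiply: each loaded tile element is reused ~tile times,
# so DRAM traffic per FLOP drops roughly by the tile size versus the naive order.
import numpy as np

def blocked_matmul(A, B, tile=64):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # These slices model the working set held in fast local memory.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

n = 256
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
assert np.allclose(blocked_matmul(A, B), A @ B, rtol=1e-3, atol=1e-3)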

            At a finer granularity, the beauty of Nvidia's Tensor cores and AMD's Matrix cores is that they reduce the strain on register bandwidth for the amount of computation performed. But I believe the limiting factor on how many of those compute resources they provide isn't primarily the number of register bits available to feed them, but rather how much silicon they want to spend on the compute engines themselves.

            Originally posted by AlB80 View Post
            It probably looks like magic to you, because you only think about compute power and memory bandwidth.
            I don't follow.



            • #46
              Originally posted by coder View Post
              They tried this. It was called the Radeon VII and listed for only $700, shipping for most of 2019. For that modest sum, you'd get 16 GB of HBM2 @ 1 TB/s and a 4:1 ratio of 32-bit to 64-bit (dialed back from the native 2:1 ratio the 7nm Vega 20 could support). It was also their fastest gaming card (overall, though sometimes beaten by the RX 5700 XT) until the RDNA2 Navi 21-based cards launched in late 2020.

              Should you be interested, you can still buy one as a Radeon Pro VII, for a street price of $1900 (if you can find it in stock). For the extra $$$, you get PCIe 4.0 (instead of 3.0) and full 2:1 32-bit to 64-bit ratio. The card is still limited to 60 CUs, however, as I guess Apple is consuming too many of the chips able to use all 64 CUs.

              Unfortunately, it lacks the newer iteration of Rapid Packed Math primitives that even RDNA cards have, and the Matrix Cores + BFloat16 support (IMO, only good for AI) that the CDNA cards now pack. But it will do dual duty as a decent 4K gaming card and a strong fp64 compute card -- something neither RDNA nor CDNA can claim.

              There is a brisk market for the Radeon VII on eBay, I think mainly driven by professional graphics artists wanting them to accelerate production rendering in some of their main apps. Last I checked, new ones were going for about $1500 -- over 2x their original selling price. Towards the end of 2019, I even saw some on Newegg for < $600!


              True. The next best fp64 card, after the Radeon Pro VII, is Nvidia's $3000 Titan V, which has just 12 GB of HBM2 running at only about 675 GB/sec.
              You basically read my mind -- indeed, I actually own a Radeon VII for all the reasons you cite above. And yes, I'd like to get some of the additional RDNA/CDNA features in my next purchase, which is why I'm hoping AMD will bring out a successor. Actually, I think the main reason Radeon VIIs are so expensive is that they can do 100 MH/s mining Ether, which (currently) nets you over 200 USD per month. Sigh....



              • #47
                Originally posted by vegabook View Post
                So you're saying that what I'm saying is possible, right?
                "Originally posted by vegabook View Post
                Understood on silicon budget. My sense though is that a nice little "prosumer" card sitting somewhere around 1000 dollars (maybe even a bit more), with decent if not groundbreaking FP64, would be a nice little earner and certainly cred-booster for AMD. This seems to be a gap in the market that Nvidia has neglected. Hey, maybe sell a low end version of Radeon Vii Pro or your next CDNA stuff, and explicitly licence it for individuals only (no data centres allowed)."

                Yes, with the second-generation CDNA and its chiplet design, all of this will happen.
                At the same performance, the price drops from 2000€ to 1700€,
                and because it is not one monolithic GPU die but 2-4 small dies, it becomes easy to build low-end cards from the same silicon GPU dies.
                So let's say 1700€ is still too expensive for the card with 2 GPU dies: you get an 850€ card with 1 GPU die and a 3400€ card with 4 GPU dies.

                A chiplet design makes it easy to build a high-end card (3400€) and a low-end card (850€) with the exact same silicon GPU die.

                That's why AMD is doing a chiplet design for CDNA.



                • #48
                  Originally posted by AlB80 View Post
                  If you think so, then your GPU is the slowest.
                  It's a fact, and it has nothing to do with GPUs: https://en.wikipedia.org/wiki/Time_c...e_complexities
                  Originally posted by AlB80 View Post
                  It's not an issue to put tens of thousands of ALUs and provide one instruction per clock for 64-thread waves
                  It requires many more transistors to do one 64-bit multiplication per clock than to do eight 8-bit multiplications per clock. Transistors cost money, and you don't want to pay for extra transistors that no game uses. That's the only reason for the different 64-bit rates of gaming video cards.
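                  Rough numbers behind that claim (a crude n^2 partial-product count, not a real area model):
                  Code:
# An n x n-bit array multiplier needs on the order of n*n partial-product cells,
# which is why multiplier cost grows roughly with the square of the width.

def partial_products(bits):
    return bits * bits

fp32_mul = partial_products(24)     # 24-bit fp32 significand -> 576
fp64_mul = partial_products(53)     # 53-bit fp64 significand -> 2809
int8_x8  = 8 * partial_products(8)  # eight 8-bit multipliers -> 512

print(round(fp64_mul / fp32_mul, 1))  # ~4.9x one fp32 multiplier
print(round(fp64_mul / int8_x8, 1))   # ~5.5x eight 8-bit multipliers combined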
                  Originally posted by AlB80 View Post
                  But it's a big problem to feed everything with data. Thus GPU architectures are data driven.
                  You don't have to explain this; my last project was a proprietary hardware accelerator (like a GPU, but different). It all depends on the task: some are memory-bound, some are not.
                  Originally posted by AlB80 View Post
                  Each CU can generate thousands of memory requests in a few clocks. The MCU processes them and packs the data back into a 2048-bit wave. Then pal666 describes the process of data gathering by typing "memory provides data". Ok.
                  yes, moron, "providing data" has nothing to do with "calculation". memory provides data, usually with help of cache. calculation is completely separate affair(and which one is slower depends on task)



                  • #49
                    Originally posted by AlB80 View Post
                    Expensive fp64 multipliers are useless without data.
                    Moron, they get their data at exactly the same rate (per byte) as fp32 multipliers do. Data is not the issue here, and no quoted FLOPS rate takes data into account at all; they are all calculated for the ideal case where everything is in cache. Real numbers in real applications will be lower, partly because data isn't ready in time, but those numbers will vary wildly depending on the task at hand. The issue we are discussing is the transistor cost of an fp64 multiplier. It is cheaper to build no fp64 multipliers at all and emulate fp64 as a series of 32-bit instructions, which needs the same amount of data but is many times slower.
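                    For flavor, here is the classic "float-float" (double-single) trick that such emulation builds on -- a sketch using Dekker-style splitting on fp32 values; it only recovers ~48 significand bits, not full IEEE fp64, and real emulation also has to handle exponents and special cases:
                    Code:
# Dekker-style error-free product of two fp32 values using only fp32 arithmetic:
# two_prod(a, b) returns (p, e) such that p + e is the exact product of a and b.
import numpy as np
f32 = np.float32

def split(x):
    # Split an fp32 value into high/low halves; constant is 2**12 + 1 for binary32.
    # (Can overflow for very large |x|; real code rescales first.)
    c = f32(4097.0) * x
    hi = c - (c - x)
    return hi, x - hi

def two_prod(a, b):
    p = a * b                     # rounded fp32 product
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl   # rounding error of p
    return p, e

a, b = f32(1.2345678), f32(7.654321)
p, e = two_prod(a, b)
print(float(p))             # plain fp32 product
print(float(p) + float(e))  # float-float result...
print(float(a) * float(b))  # ...matches the exact product computed in fp64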



                    • #50
                      Originally posted by pal666 View Post
                      Moron, they get their data at exactly the same rate (per byte) as fp32 multipliers do.
                      Full-rate FP64 has the same rate (per byte) as full-rate FP32? O rly? A dumb correspondent has been detected.

