AMD Radeon "Aldebaran" GPU Support Published For Next-Gen CDNA

  • #31
    Originally posted by AlB80 View Post
    The SIMD width is 512-bit. It can provide data for 16xFP32 or for 8xFP64.
    Data providing is not computation; it's memory footprint (i.e. you need the same amount of memory to hold 1/x as many values when they are x times wider). To do something useful with that data you have to perform computation, and computation often has non-linear complexity with respect to bit width. Often one double-width operation requires a different circuit than two single-width operations concatenated.
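    A concrete illustration of that last point: a 64-bit multiply built out of 32-bit hardware needs four 32x32 partial products plus carry handling, not just two concatenated 32-bit multiplies. A minimal sketch in Python (illustrative limb arithmetic, not any GPU's actual circuit):

    ```python
    # Schoolbook 64x64 multiply from 32-bit limbs: four partial
    # products plus shifted adds, not two independent 32-bit ops.
    MASK32 = (1 << 32) - 1

    def mul64_from_32(a: int, b: int) -> int:
        a_lo, a_hi = a & MASK32, a >> 32
        b_lo, b_hi = b & MASK32, b >> 32
        p0 = a_lo * b_lo            # low x low
        p1 = a_lo * b_hi            # low x high
        p2 = a_hi * b_lo            # high x low
        p3 = a_hi * b_hi            # high x high
        # Recombine with the right shifts; result fits in 128 bits.
        return p0 + ((p1 + p2) << 32) + (p3 << 64)

    a, b = 0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF
    assert mul64_from_32(a, b) == a * b
    ```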

    • #32
      Originally posted by pal666 View Post
      Data providing is not computation; it's memory footprint (i.e. you need the same amount of memory to hold 1/x as many values when they are x times wider). To do something useful with that data you have to perform computation, and computation often has non-linear complexity with respect to bit width. Often one double-width operation requires a different circuit than two single-width operations concatenated.
      Data providing is the main goal for GPU developers. It's a big issue. Everything else is local and secondary.

      • #33
        Originally posted by AlB80 View Post
        Data providing is the main goal for GPU developers. It's a big issue. Everything else is local and secondary.
        Memory provides data; the GPU performs calculations on it. Only very simple calculations work the way you assume.

        • #34
          Originally posted by AlB80 View Post
          The SIMD width is 512-bit. It can provide data for 16xFP32 or for 8xFP64. Thus a fully equipped SIMD always has a 1/2 rate for FP64 and 2x for FP16. That's the way it works.
          That may be true for CPUs, but it's not how our HW operates. CPU SIMDs execute out of a single instruction stream where variable-width instructions can be accommodated, but GPU SIMDs are actually executing multiple threads in lock step, so processing 16 data elements sometimes and 8 data elements another time is not an option.

          GFX9 and CDNA SIMDs are physically 512-bit wide and process 64 FP32 operations in a 4-clock cycle, or are logically 2048-bit wide at 1/4 speed.

          We do have packed instructions that allow 2xFP16 operations per lane rather than 1xFP32 operation per lane, but it's just a different instruction and we still execute 64 instances of that instruction in a 4-clock cycle. FP64 instructions are still processed 64-wide, they just take (a lot) longer than 4 cycles.

          RDNA SIMDs are 1024-bit wide and process 32 FP32 operations in a single clock.
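          For anyone who wants those rates spelled out as arithmetic, here is a quick sketch; the 1.5 GHz clock is an assumption for scale, not a figure from the post:

          ```python
          # Per-SIMD rates implied by the post: a 64-thread wave retires
          # every 4 clocks for FP32, with packed FP16 doing 2 ops per lane.
          CLOCK_HZ = 1.5e9   # assumed clock speed, purely illustrative
          WAVE = 64          # threads per wave (GFX9/CDNA)
          CADENCE = 4        # clocks per wave-instruction

          fp32_rate = WAVE / CADENCE * CLOCK_HZ        # 1 op per lane
          fp16_rate = 2 * WAVE / CADENCE * CLOCK_HZ    # packed 2 per lane

          print(f"FP32:   {fp32_rate / 1e9:.0f} Gops/s per SIMD")  # 24
          print(f"2xFP16: {fp16_rate / 1e9:.0f} Gops/s per SIMD")  # 48
          ```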
          Last edited by bridgman; 26 February 2021, 07:58 PM.

          • #35
            Originally posted by Qaridarium View Post

            The chiplet design of the second version of CDNA will lower the cost of an AMD Radeon Pro VII from 2045€ to maybe 1700€ https://geizhals.de/amd-radeon-pro-v...loc=at&hloc=de
            The AMD CPUs 3950X and 5950X did the same: if you look at what native monolithic 16-core CPUs cost, it was at least 300€ more.
            The chiplet design also helps to build low-end versions, because then they can put only 1 chiplet on the GPU instead of 2-4.
            So if you only buy the low-end version with only 1 chiplet, the card costs maybe 700-1000€.

            This means you can be sure AMD is working hard to keep the costs down.

            See my post above: the REAL market difference between the 6900XT and the 3090 is 1100€ right now.
            So you're saying that what I'm saying is possible, right?

            • #36
              Originally posted by pal666 View Post
              Memory provides data; the GPU performs calculations on it. Only very simple calculations work the way you assume.
              If you think so, then your GPU is the slowest.
              It's not an issue to put down tens of thousands of ALUs and provide one instruction per clock for 64-thread waves, but it is a big problem to feed everything with data. Thus GPU architectures are data-driven: an optimized uArch (each SIMD selects one ready wave out of 8-10), an optimized ISA (special scalar commands to mark and stall unready waves), an optimized memory hierarchy (megabytes of vector registers, LDSs and caches) and optimized code.
              Each CU can generate thousands of memory requests in a few clocks; the MCU processes them and packs the data back into a 2048-bit wave. And pal666 describes this process of data gathering as "memory provides data". OK.
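              A rough roofline-style sketch of why feeding the ALUs dominates; the 10 TFLOP/s and 1 TB/s figures are assumptions for illustration, not any particular card's spec:

              ```python
              # A stream of FP32 FMAs that reads two fresh operands and writes
              # one result per op demands far more bandwidth than HBM2 supplies,
              # so reuse via registers/LDS/caches is the whole game.
              flops = 10e12          # assumed FP32 FMA throughput, 10 TFLOP/s
              bytes_per_op = 12      # 2 x 4-byte loads + 1 x 4-byte store
              mem_bw = 1e12          # assumed memory bandwidth, 1 TB/s

              needed = flops * bytes_per_op
              print(f"bandwidth with zero reuse: {needed / 1e12:.0f} TB/s")  # 120
              print(f"required reuse factor:     {needed / mem_bw:.0f}x")    # 120
              ```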

              • #37
                Originally posted by vegabook View Post
                My sense though is that a nice little "prosumer" card sitting somewhere around 1000 dollars (maybe even a bit more), with decent if not groundbreaking FP64, would be a nice little earner and certainly cred-booster for AMD.
                They tried this. It was called the Radeon VII, listed for only $700, and shipped for most of 2019. For that modest sum, you'd get 16 GB of HBM2 @ 1 TB/s and a 4:1 ratio of 32-bit to 64-bit performance (dialed back from the native 2:1 ratio the 7nm Vega 20 silicon could support). It was also their fastest gaming card (overall, though it was sometimes beaten by the RX 5700 XT) until the RDNA2 Navi 21-based cards launched in late 2020.

                Should you be interested, you can still buy one as a Radeon Pro VII, for a street price of $1900 (if you can find it in stock). For the extra $$$, you get PCIe 4.0 (instead of 3.0) and the full 2:1 32-bit to 64-bit ratio. The card is still limited to 60 CUs, however; I guess Apple is consuming too many of the chips able to use all 64 CUs.

                Unfortunately, it lacks the newer iteration of Rapid Packed Math primitives that even RDNA cards have, and the Matrix Cores + BFloat16 support (IMO, only good for AI) that the CDNA cards now pack. But it will do dual duty as a decent 4K gaming card and a strong fp64 compute card -- something neither RDNA nor CDNA can claim.

                There is a brisk market for the Radeon VII on eBay, I think mainly driven by professional graphics artists wanting them to accelerate production rendering in some of their main apps. Last I checked, new ones were going for about $1500 - over 2x their original selling price. Towards the end of 2019, I even saw some on Newegg for < $600!

                Originally posted by vegabook View Post
                This seems to be a gap in the market that Nvidia has neglected.
                True. The next best fp64 card, after the Radeon Pro VII, is Nvidia's $3000 Titan V, which has just 12 GB of HBM2 running at only about 650 GB/sec.

                • #38
                  Originally posted by AlB80 View Post
                  Data providing is the main goal for GPU developers. It's a big issue. Everything else is local and secondary.
                  As I said on the first page of this thread: you know what's really expensive? fp64 multipliers. That's because the requisite silicon area grows roughly as the square of the significand width (the same reason BFloat16 is preferred over IEEE 754 fp16), and in going from fp32 to fp64 the significand grows from 24 to 53 bits.

                  So, it's not as if fp64 support is only (or even primarily) constrained by register size. And if there's enough demand to build the extra multipliers, then there's certainly enough to justify widening registers, without necessarily worrying about trying to pack in more fp32, as well.
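                  Putting numbers on that, treating multiplier area as exactly quadratic in significand width (a simplification, but it shows the shape of the cost):

                  ```python
                  # Approximate multiplier-array area as significand_bits ** 2
                  # (significand width includes the implicit leading bit).
                  SIG_BITS = {"bf16": 8, "fp16": 11, "fp32": 24, "fp64": 53}

                  fp32_area = SIG_BITS["fp32"] ** 2
                  for fmt, bits in SIG_BITS.items():
                      print(f"{fmt}: {bits ** 2 / fp32_area:.2f}x fp32 area")
                  # fp64 lands near 4.9x fp32, and fp16 near 1.9x bf16 -- the
                  # whole case for bf16, and for fp64 carrying a price premium.
                  ```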

                  • #39
                    Originally posted by bridgman View Post
                    FP64 instructions are still processed 64-wide, they just take (a lot) longer than 4 cycles.
                    So, do CDNA 64-bit instructions use entirely separate registers from the 32-bit instructions, or do they operate on pairs of 32-bit registers?
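                    For what it's worth, as I understand the GCN ISA docs, 64-bit operands occupy pairs of consecutive 32-bit VGPRs. A sketch of just the bit-splitting (the v[n]/v[n+1] naming below is hypothetical, not actual ISA syntax):

                    ```python
                    import struct

                    def split_double(x: float) -> tuple[int, int]:
                        """Split an FP64 value into the two 32-bit halves
                        that would land in consecutive 32-bit registers."""
                        lo, hi = struct.unpack("<II", struct.pack("<d", x))
                        return lo, hi

                    def join_double(lo: int, hi: int) -> float:
                        return struct.unpack("<d", struct.pack("<II", lo, hi))[0]

                    lo, hi = split_double(3.141592653589793)
                    assert join_double(lo, hi) == 3.141592653589793
                    print(f"v[n] = {lo:#010x}, v[n+1] = {hi:#010x}")  # hypothetical names
                    ```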

                    • #40
                      Originally posted by bridgman View Post
                      That may be true for CPUs, but it's not how our HW operates. CPU SIMDs execute out of a single instruction stream where variable-width instructions can be accommodated, but GPU SIMDs are actually executing multiple threads in lock step, so processing 16 data elements sometimes and 8 data elements another time is not an option.
                      An FP64 value takes two 32-bit slots. I don't see any option other than an 8-clock cycle (8 clocks x 8 lanes = a 64-thread wave), and as a result 1/2-rate FP64.
                      BTW, low-tier GCNs have 1/16-rate FP64, i.e. a single FP64 ALU per SIMD, or a 64-clock cycle.

                      GFX9 and CDNA SIMDs are physically 512-bit wide and process 64 FP32 operations in a 4-clock cycle, or are logically 2048-bit wide at 1/4 speed.
                      We do have packed instructions that allow 2xFP16 operations per lane rather than 1xFP32 operation per lane, but it's just a different instruction and we still execute 64 instances of that instruction in a 4-clock cycle. FP64 instructions are still processed 64-wide, they just take (a lot) longer than 4 cycles.
                      Instruction latency has no direct effect on rate, thanks to pipelined ALUs. E.g. FP32 multiplication has a 5-clock latency, which explains why a single wave cannot fully occupy a SIMD16 with a 4-clock cycle: an instruction dependency plus ALU latency can block the next cycle.
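                      A toy model of that occupancy point; the 5-clock latency is the figure above, everything else is simplified:

                      ```python
                      # One wave issuing back-to-back dependent instructions can only
                      # start a new one every max(cadence, latency) clocks; a second
                      # ready wave fills the bubble and restores the full cadence.
                      CADENCE = 4   # clocks per wave-instruction on a SIMD16
                      LATENCY = 5   # assumed FP32 multiply latency (from the post)

                      one_wave = CADENCE / max(CADENCE, LATENCY)
                      print(f"1 dependent wave: {one_wave:.0%} utilization")   # 80%
                      print("2+ ready waves:   100% (scheduler hides latency)")
                      ```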

                      RDNA SIMDs are 1024-bit wide and process 32 FP32 operations in a single clock.
                      That's another story.
