Originally posted by **skeevy420**:

https://www.amd.com/system/files/doc...whitepaper.pdf

The classic GCN compute cores contain a variety of pipelines optimized for scalar and vector instructions. In particular, each CU contains a scalar register file, a scalar execution unit, and a scalar data cache to handle instructions that are shared across the wavefront, such as common control logic or address calculations. Similarly, the CUs also contain four large vector register files, four vector execution units that are optimized for FP32, and a vector data cache. Generally, the vector pipelines are 16-wide and each 64-wide wavefront is executed over four cycles.
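The 16-wide-pipeline/64-wide-wavefront relationship above can be sketched in a few lines. This is a hypothetical software model for illustration only (the names `execute_wavefront`, `WAVEFRONT_SIZE`, and `SIMD_WIDTH` are my own, not AMD's), not how the hardware is implemented:

```python
# Hypothetical model: a 64-wide wavefront issued to a 16-lane vector
# pipeline, one 16-lane slice per cycle, so the full wavefront takes
# four cycles.
WAVEFRONT_SIZE = 64
SIMD_WIDTH = 16

def execute_wavefront(op, lanes):
    """Apply `op` across all 64 lanes, 16 lanes at a time.
    Returns (results, cycles_taken)."""
    assert len(lanes) == WAVEFRONT_SIZE
    results = []
    cycles = 0
    for start in range(0, WAVEFRONT_SIZE, SIMD_WIDTH):
        # One cycle processes one 16-lane slice of the wavefront.
        results.extend(op(x) for x in lanes[start:start + SIMD_WIDTH])
        cycles += 1
    return results, cycles

res, cycles = execute_wavefront(lambda x: x * 2.0, list(range(64)))
print(cycles)  # 4: 64-wide wavefront / 16-wide pipeline
```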



**The AMD CDNA architecture builds on GCN's foundation of scalars and vectors and adds matrices as a first class citizen**, while simultaneously adding support for new numerical formats for machine learning and preserving backwards compatibility for any software written for the GCN architecture. These Matrix Core Engines add a new family of wavefront-level instructions, the Matrix Fused Multiply-Add or MFMA. The MFMA family performs mixed-precision arithmetic and operates on KxN matrices using four different types of input data: 8-bit integers (INT8), 16-bit half-precision FP (FP16), 16-bit brain FP (bf16), and 32-bit single-precision FP (FP32). All MFMA instructions produce either 32-bit integer (INT32) or FP32 output, which reduces the likelihood of overflow during the final accumulation stages of a matrix multiplication.
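The "narrow inputs, wide accumulator" idea behind MFMA can be illustrated with a small sketch. This is a pure-Python model of the arithmetic, not the actual hardware instruction; the function name `mfma_int8` and the matrix shapes are illustrative assumptions:

```python
# Sketch of the MFMA idea: low-precision inputs (INT8 here) with
# accumulation in a wider type (INT32), so the running sum over the
# K dimension cannot overflow the way an INT8 or INT16 result would.

def mfma_int8(A, B, C):
    """Compute D = A @ B + C with INT8 inputs and an INT32 accumulator.
    A is MxK, B is KxN, C and D are MxN (plain lists of lists)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    D = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = C[i][j]                 # wide (INT32) accumulator
            for k in range(K):
                acc += A[i][k] * B[k][j]  # INT8 x INT8 product fits in INT32
            D[i][j] = acc
    return D

# 2x2 example: 127 * 127 * 2 = 32258 already exceeds the INT8 and
# INT16 ranges, but fits comfortably in the INT32 accumulator.
A = [[127, 127], [1, 2]]
B = [[127, 0], [127, 0]]
C = [[0, 0], [0, 0]]
print(mfma_int8(A, B, C))  # [[32258, 0], [381, 0]]
```

The same structure applies to the FP16 and bf16 variants: per-element products are formed from the narrow inputs, but every addition happens at FP32 width.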
