Sending Atombios command from Linux with amdgpu module

  • Sending Atombios command from Linux with amdgpu module

    Hi, this may be a FAQ but I could not find it.
    I've been investigating how to control RX480 GPUs, and it seems I should use AtomBIOS commands.
    I've heard of the "ring buffer", and I know there is a command table in the BIOS, and also parameter tables (I've identified the C header to decode those structs).
    Since I'm new to these questions, I could not even find where this is documented.
    I suspect it can be done with ioctl: allocating memory, writing commands/assembly/structures, sending the command, getting structures back...
    Are there any samples, or some getting-started documentation?
    I'm currently interested mostly in I2C, voltage initialization, and perhaps clock initialization... but just a hello world would probably help me understand the key concepts, allowing me to go and search further.
    A noob.

  • #2
    OK, let's see...

    First thing to get clear is that there is no connection between "command submission" (sending work to the GPU) and the "command table" in AtomBIOS.

    We only use AtomBIOS for a small number of operations, mostly chip initialization and getting information about the hardware (eg what the connectors are, or how to set PLL divider values for specific clock frequencies)... for everything else the driver goes straight to hardware including anything to do with ring buffers.

    AtomBIOS has four main components:

    - command tables (callable functions written as a series of bytecode sequences interpreted by driver or vbios)
    - data tables (data structures describing the details of each GPU subsystem including connectors etc..)
    - bytecode interpreter (this copy is only used by the BIOS itself - the interpreter used by the driver is in atom.c/atom.h, plus callbacks in amdgpu_device.c - search for "cail")
    - x86 assembly wrapper to conform to standard VBIOS calling conventions

    The driver implements a number of functions which read atombios data tables and/or call into atombios command tables, mostly in the atombios*.c/h files, then the rest of the driver just calls those functions.
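
    As a rough sketch (not a copy of a real function - the index macro, parameter struct and interpreter entry point are the real ones from atombios.h/atom.h, but exact signatures differ between kernel versions), one of those wrappers looks something like this, using a clock-setting table since that's one of the things you asked about:

        /*
         * Sketch only: call an AtomBIOS command table through the driver's
         * bytecode interpreter.  Treat the details as illustrative rather
         * than authoritative.
         */
        static void example_set_engine_clock(struct amdgpu_device *adev,
                                             uint32_t eng_clock /* 10 kHz units */)
        {
            SET_ENGINE_CLOCK_PS_ALLOCATION args;
            int index = GetIndexIntoMasterTable(COMMAND, SetEngineClock);

            memset(&args, 0, sizeof(args));
            args.ulTargetEngineClock = cpu_to_le32(eng_clock);

            /* adev->mode_info.atom_context holds the parsed VBIOS image;
             * atom.c interprets the bytecode for this table and performs
             * the register accesses on our behalf. */
            amdgpu_atom_execute_table(adev->mode_info.atom_context, index,
                                      (uint32_t *)&args);
        }

    Data tables work the same general way except that nothing gets executed - the driver just parses the structures defined in atombios.h.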

    OK, let's step back a bit - go to the RadeonFeature page ( https://www.x.org/wiki/RadeonFeature ), click on "Documentation", open the "R5xx Acceleration" doc and start reading from chapter 5. The 5xx family is quite a bit older than your RX480 but the same general principles apply and that's what you need to learn first.

    Back in the R5xx days life was simple - the GPU had a single ring buffer for graphics operations (aka "the graphics ring"), supported in HW by the Command Processor (CP) block which is described in chapter 5. These days UVD, VCE and SDMA all have their own ring buffers and CP supports a large number of additional rings for compute operations, via what engineering calls "MEC" (micro-engine compute) blocks within CP and marketing calls ACE (Asynchronous Compute Element).
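
    All of those rings end up in a single array hanging off the device structure, so a quick way to see what your chip actually exposes is something like this (hedged - the exact field names have moved around between kernel versions):

        /* Debug-only sketch: walk the ring array and print what is there.
         * "ready" has since moved into the scheduler struct on newer
         * kernels, so adjust to taste. */
        static void example_list_rings(struct amdgpu_device *adev)
        {
            int i;

            for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (ring && ring->ready)
                    dev_info(adev->dev, "ring %d: %s\n", i, ring->name);
            }
        }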

    PM4 generally refers to the packets which the driver places in the ring and the GPU reads/executes. Strictly speaking it stands for "Programming Model 4".

    Rings (sometimes called queues) are used for one-way communication between the driver and the GPU, allowing the driver to queue up a sequence of operations which the GPU will eventually execute. The driver does not know exactly when the operations will occur, but it does control the sequence of operations. This will be important in a minute. The GPU can provide information back to the driver in the form of interrupts and by directly writing to memory in response to commands from the driver (originally sent via the ring), so while the ring itself is one-way, the overall driver/GPU interaction is two-way.
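
    To make that concrete, here is roughly what one of those packets looks like from the driver side - a sketch written from the VI-era macros (vid.h), not a paste of real driver code, and it assumes ring space has already been reserved by the surrounding emit path:

        /* Sketch: ask the CP to write 'value' to a GPU address when it
         * reaches this point in the ring, i.e. ordered with the surrounding
         * commands.  PACKET3(opcode, count) builds a type-3 PM4 header;
         * WRITE_DATA's body here is 4 dwords (control, addr lo/hi, data),
         * hence count = 3. */
        static void example_emit_write_data(struct amdgpu_ring *ring,
                                            uint64_t gpu_addr, uint32_t value)
        {
            amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
            amdgpu_ring_write(ring, WRITE_DATA_DST_SEL(5) | WR_CONFIRM); /* 5 = memory */
            amdgpu_ring_write(ring, lower_32_bits(gpu_addr));
            amdgpu_ring_write(ring, upper_32_bits(gpu_addr));
            amdgpu_ring_write(ring, value);
        }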

    Let's step back a bit further.

    There are six main ways a driver interacts with GPU hardware, with the first two mostly being used during startup and shutdown:

    1. Register accesses via AtomBIOS calls (aka execution of AtomBIOS command tables)
    2. Register accesses where driver reads and writes directly (aka MMIO) - see the sketch just after this list
    3. Register accesses where driver puts packets in the ring which write specific values to specific registers
    4. Operations which can only be performed via packets in the ring, eg draw (graphics) and dispatch (compute)
    5. Data structures used when executing #4 eg texture buffers, data arrays or the buffer(s) where results will be written
    6. Shader programs used when executing #4, see ISA guides for details (your 480 runs Volcanic Islands ISA)
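
    For #2, the driver wraps the MMIO aperture in RREG32()/WREG32() macros, so a typical read-modify-write looks like this (the register offset below is made up purely for illustration):

        #define EXAMPLE_REG 0x1234  /* hypothetical offset, illustration only */

        static void example_mmio_set_bit(struct amdgpu_device *adev)
        {
            /* RREG32()/WREG32() expand to readl()/writel()-style accesses
             * on the register BAR; they pick up 'adev' from local scope. */
            u32 tmp = RREG32(EXAMPLE_REG);

            tmp |= 0x1;
            WREG32(EXAMPLE_REG, tmp);
        }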

    All of the work associated with preparing and managing #5 and #6 is done by userspace drivers (eg Mesa); the userspace code calls into the kernel driver via a "command submission" IOCTL and passes pointers to the buffers containing data and shader code plus the target buffer(s). The driver also prevents those buffers from being moved around in memory while the GPU is using them to process a command.
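
    From the userspace side that command submission IOCTL is usually reached through libdrm_amdgpu rather than called raw. A minimal sketch (assuming the device, context, BO list and a PM4-filled IB already exist; field names follow libdrm's amdgpu.h, so double-check against your libdrm version):

        #include <string.h>
        #include <amdgpu.h>
        #include <amdgpu_drm.h>

        int example_submit_ib(amdgpu_context_handle ctx,
                              amdgpu_bo_list_handle bo_list,
                              uint64_t ib_gpu_address, uint32_t ib_size_dw)
        {
            struct amdgpu_cs_ib_info ib_info;
            struct amdgpu_cs_request request;

            memset(&ib_info, 0, sizeof(ib_info));
            memset(&request, 0, sizeof(request));

            ib_info.ib_mc_address = ib_gpu_address;  /* GPU VA of the command buffer */
            ib_info.size = ib_size_dw;               /* size in dwords */

            request.ip_type = AMDGPU_HW_IP_GFX;      /* target the gfx ring */
            request.ring = 0;
            request.resources = bo_list;             /* buffers the kernel must keep resident */
            request.number_of_ibs = 1;
            request.ibs = &ib_info;

            /* One ioctl (DRM_IOCTL_AMDGPU_CS under the hood): the kernel
             * validates the buffer list and queues the IB on the ring. */
            return amdgpu_cs_submit(ctx, 0, &request, 1);
        }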

    The shader programs run on Compute Units in the GPU - for simplicity think of them as 64-wide SIMD engines, although they actually include both scalar and vector (SIMD) engines, and each CU includes 4 16-wide SIMDs that each process 64 data elements over 4 clocks.

    Going back to the driver... read a few chapters of 5xx acceleration, then start looking at driver code:

    - amdgpu_drv.c has most of the interface between driver and the drm (Direct Rendering Manager) code which oversees all graphics drivers
    - amdgpu_device.c is arguably "the driver" - it calls into SOC-level code (eg vi.c) which in turn calls into IP block code (eg gfx_v8_0.c) for HW specifics

    One of the challenges with learning to write drivers for modern GPUs is that the HW is fairly complex; 20 years ago there was a pretty close mapping between HW registers and OpenGL state information, but these days GPUs are mostly highly parallel floating point engines with some fixed-function hardware and complex driver code to make them do actual graphics work.

    Where I'm going with this is that looking for a "Hello, world" equivalent is going to be tough and you will be better off tweaking existing driver code (eg adding printk/printf operations at interesting points in the kernel/userspace drivers) then running it to learn how it works.
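
    Something as simple as this (the function is a made-up stand-in - put the print in whatever code path you're curious about), plus watching dmesg, will teach you a lot about the order things happen in:

        /* Illustration of the "instrument and observe" approach: add a
         * dev_info() to an IP-block callback you want to trace and rebuild
         * the module.  The amd_ip_funcs callbacks take a void *handle that
         * is really the amdgpu_device pointer. */
        static int traced_hw_init(void *handle)
        {
            struct amdgpu_device *adev = (struct amdgpu_device *)handle;

            dev_info(adev->dev, "hw_init reached, asic_type=%d\n", adev->asic_type);
            /* ...the function's existing body continues here... */
            return 0;
        }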

    Good luck and have fun.
    Last edited by bridgman; 18 November 2017, 03:49 PM.


    • #3
      Thank you very much. I now understand my misunderstanding better, and I have a plan for the documentation.
      I don't plan to use the GPU for computation (yet), just for initialization. I had seen the 5xx doc but could not imagine it was the one to read.
      I also hope I can get a better understanding of the amdgpu module, so I can use it.



      • #4
        If you start with amdgpu_drv.c : amdgpu_pci_probe and keep track of what gets loaded into all the function pointers, the driver functionality should become a lot clearer.

        Just think about three levels of code - core stuff, SOC-specific stuff (vi.c for you) and IP (HW)-block-specific stuff. IIRC the initial setup has three steps:

        1. Core driver selects chip ID (POLARIS10 for you) based on PCI ID

        2. Core driver selects SOC-specific code based on chip ID - SOC code supports generations of HW so multiple chip IDs

        3. SOC-specific code sets up the specific HW block handlers required for that chip (again using chip ID)

        The amdgpu_early_init function in amdgpu_device.c chooses the appropriate SOC-level code (for POLARIS10 it calls vi_set_ip_blocks()) then vi_set_ip_blocks in vi.c sets up the specific IP block handlers required by your RX480 (mostly loading a bunch of function pointers).
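
        Condensed to its shape (not a verbatim copy, and the helper has been renamed in newer kernels), the SOC-level setup looks roughly like this:

            /* Sketch of the SOC-level selection in vi.c: pick a list of IP
             * block handlers based on the chip ID.  Each *_ip_block bundles
             * an amd_ip_funcs table (early_init, sw_init, hw_init, hw_fini,
             * suspend, resume, ...) that amdgpu_device.c calls at the right
             * stage of the device lifecycle. */
            static int example_set_ip_blocks(struct amdgpu_device *adev)
            {
                switch (adev->asic_type) {
                case CHIP_POLARIS10:          /* RX 470/480/580... */
                case CHIP_POLARIS11:
                    amdgpu_ip_block_add(adev, &vi_common_ip_block);
                    amdgpu_ip_block_add(adev, &gfx_v8_0_ip_block);
                    /* ...GMC, IH, SDMA, display, UVD and VCE blocks elided... */
                    break;
                default:
                    return -EINVAL;
                }
                return 0;
            }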

        Each of the xxx_acceleration documents just covers the differences between new generation and previous generation, so starting with 5xx and rolling forward is the way to go.... it's also worth reading the 6xx/7xx acceleration doc but you'll need to read 5xx first to get the basics of how the command processor works etc...

        The diagram on page 8 of the 6xx doc is pretty useful IMO. Main difference between 5xx and 6xx+ is that 5xx had separate vertex & pixel/fragment shader HW blocks while 6xx and everything newer have a single "unified shader" block. The next big change was SI, where we moved from VLIW SIMD units to scalar SIMD units (which we call vector ALUs to distinguish them from the non-SIMD "scalar ALU").

        Once you understand how the shader cores relate to the other fixed-function HW and how shader programs relate to draw/dispatch commands (and how none of that has anything to do with the BIOS) most of that pounding headache you are probably feeling right now will go away.
        Last edited by bridgman; 18 November 2017, 05:02 PM.