AMD Radeon HD 7970 On Linux

bridgman replied

24 December 2011, 01:12 PM
There may be some cases (e.g. recursion) where memory allocation from within a shader program could be useful, but AFAIK this round of changes doesn't go there. For now my understanding is that the memory manager works pretty much the way it did before, except that rather than patching command buffers with physical addresses, the page tables are used to map valid accesses through to the correct locations and either block accesses to other addresses or map them onto a dummy page (not sure which).
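As a rough illustration of the two policies mentioned above (block the access, or map it onto a dummy page), here is a minimal Python sketch of a single-level page table. It is not driver code: `PageTable`, `PAGE_SIZE`, `DUMMY_PAGE` and the `redirect_invalid` flag are all hypothetical names made up for the example, and the real hardware page tables are certainly more involved.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch
DUMMY_PAGE = 0    # hypothetical physical page that absorbs stray accesses


class PageTable:
    """Toy single-level page table: virtual page number -> physical page number."""

    def __init__(self, redirect_invalid=True):
        self.entries = {}
        self.redirect_invalid = redirect_invalid

    def map(self, virt_page, phys_page):
        self.entries[virt_page] = phys_page

    def translate(self, virt_addr):
        virt_page, offset = divmod(virt_addr, PAGE_SIZE)
        phys_page = self.entries.get(virt_page)
        if phys_page is None:
            if self.redirect_invalid:
                # Policy A: the access lands harmlessly on the dummy page.
                return DUMMY_PAGE * PAGE_SIZE + offset
            # Policy B: the access is blocked outright.
            raise PermissionError(f"unmapped GPU address {virt_addr:#x}")
        return phys_page * PAGE_SIZE + offset


pt = PageTable()
pt.map(virt_page=2, phys_page=7)
print(hex(pt.translate(2 * PAGE_SIZE + 0x10)))  # valid access, goes to page 7
print(hex(pt.translate(5 * PAGE_SIZE + 0x10)))  # stray access, goes to the dummy page
```

Either way, a shader that computes an address outside its mapped pages never reaches memory belonging to something else.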

I'm not sure what address space the GPU uses under the new scheme; hopefully will get a chance to look through the code over the holidays.

Last edited by bridgman; 24 December 2011, 01:14 PM.
Drago replied

24 December 2011, 09:45 AM
Originally posted by curaga

The IOMMU link says 64-bit.

@bridgman:

Thanks, that's what I was worried about, increased attack surface. Previously you'd need to commandeer the driver, now you can just write a "normal" program.

Could you elaborate a bit more on the page tables, how are they set up to allow "legal" access? Will the gpu programs have malloc(ON_VRAM) and malloc(ON_RAM), or will all memory need to be specifically passed from cpu-space?

Frankly, I don't know why a GPU shader program would have to do memory allocation. As I understand it, it can do memory address calculations (indexing, v-tables, etc.) within memory allocated by the driver. SI will just ensure that you don't address outside this dedicated region.
curaga replied

24 December 2011, 07:05 AM
Originally posted by wizard69

Thought so! By the way does SI generate 32 or 64 bit addresses?

The IOMMU link says 64-bit.

@bridgman:

Thanks, that's what I was worried about, increased attack surface. Previously you'd need to commandeer the driver, now you can just write a "normal" program.

Could you elaborate a bit more on the page tables, how are they set up to allow "legal" access? Will the gpu programs have malloc(ON_VRAM) and malloc(ON_RAM), or will all memory need to be specifically passed from cpu-space?
wizard69 replied

23 December 2011, 08:13 PM
Thanks for the response.

Originally posted by bridgman

Don't know but I'll ask around.

We pushed code for "multiple ring support" a couple of months ago :

Prep Work For Open-Source Radeon Compute, UVD - Phoronix

http://www.phoronix.com/scan.php?page=news_item&px=MTAwNzg

A number of things will use that code in the future, but one of them is allowing compute operations to go through a separate command queue from graphics operations so that the hardware can flip between tasks at a fairly fine-grained level. The multiple ring support started with Cayman but GCN is the first generation where I expect we will really use it.

My understanding of SI is thin, but I thought (or maybe I was hoping) that compute threads could run independently of the graphics workload on one or more of the "cores", much like a process might run on a separate x86 core.

Correct, I can't answer

Thought so! By the way does SI generate 32 or 64 bit addresses?

We are trying to get all the invasive changes (multiple rings, memory management etc..) pushed out in time for the merge window. Hopefully the remaining changes for GCN will be specific to new HW, but I don't think we have discussed getting them in post-merge yet.

BTW from this point on I'm probably going to switch from talking about GCN to talking about SI (the first generation of GCN parts), partly because it's one less letter (I'm big into efficiency) and partly because that's the terminology we use internally and I'm getting tired of typing SI, backspacing over it and typing GCN instead.

I know there is a lot of whining here about slow Linux support but waiting for a more polished support package isn't a bad idea. Better to merge in after 3.3 than to have buggy support.

By the way give everybody on the GCN team a slap on the back for me. This looks like a major accomplishment and the first product looks to be very impressive as a generation one implementation.
bridgman replied

23 December 2011, 05:53 PM
Originally posted by curaga

Well, the AMD PR so far implied that the card could bypass the cpu and access system ram on its own.

True, but that is not new. The new part is...

Originally posted by curaga

It was also mentioned that new frameworks could be used instead of just opencl and dx11 (c++, pointer support etc).

Yep. AFAIK the big change with SI is that shader programs are able to contain and generate addresses, so...

Originally posted by curaga

I believe the cs checker can only validate what gets sent to the gpu. Could you not write a GPGPU program that calculates a pointer address at runtime (= on the gpu, thus after the CS checker)?

Correct, and that's why we had to change the memory management design as a pre-requisite to implementing SI support.

On previous GPUs you could more or less control system memory accesses by controlling the commands used to set up the GPU, so the cs checker could do that. Starting with SI, we run all system memory accesses through the on-chip page tables; the page tables protect system memory, and cs checker protects the page tables.

The open source drivers have used the page tables for a few years; they're just getting used more now because they're the most practical way to (a) deal with relocations (translating handles to physical addresses) in shader programs and (b) limit shader program access to specific areas of memory. The alternative was to re-patch the shader programs every time a buffer moved and to inspect each shader program to make sure it couldn't go outside its allocated buffers.
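To make the trade-off in (a) concrete, here is a small Python sketch comparing the two relocation styles: baking physical addresses into the shader (which goes stale when a buffer moves) versus giving each buffer a fixed virtual window and updating only the mapping. Everything here is hypothetical and simplified for illustration, including the names `patch_shader`, `translate`, `REGION` and the one-window-per-buffer layout; it is not how the radeon driver is actually written.

```python
REGION = 0x1000_0000  # assumed size of each buffer's virtual window


def patch_shader(refs, phys_base):
    """Relocation by patching: bake physical addresses into the shader."""
    return [phys_base[handle] + offset for handle, offset in refs]


def translate(virt_addr, window_to_phys):
    """Relocation by mapping: resolve the window at access time."""
    window, offset = divmod(virt_addr, REGION)
    return window_to_phys[window] + offset


# (handle, offset) references that a shader program wants to use.
refs = [("vertex_buf", 0x20), ("texture", 0x100)]
phys_base = {"vertex_buf": 0x10000, "texture": 0x40000}

# Patching approach: addresses are fixed at submission time.
patched = patch_shader(refs, phys_base)

# Mapping approach: each handle gets a fixed virtual window.
window_of = {"vertex_buf": 1, "texture": 2}
shader_addrs = [window_of[h] * REGION + off for h, off in refs]
window_to_phys = {window_of[h]: base for h, base in phys_base.items()}

# The texture buffer moves; only the mapping entry needs to change.
phys_base["texture"] = 0x80000
window_to_phys[window_of["texture"]] = 0x80000

print(hex(patched[1]))                                  # stale until re-patched
print(hex(translate(shader_addrs[1], window_to_phys)))  # follows the move
```

In the patching case every shader that references the moved buffer has to be found and rewritten; in the mapping case the shader's addresses never change.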

Last edited by bridgman; 23 December 2011, 06:23 PM.
curaga replied

23 December 2011, 03:02 PM
Well, the AMD PR so far implied that the card could bypass the cpu and access system ram on its own. It was also mentioned that new frameworks could be used instead of just opencl and dx11 (c++, pointer support etc).

I believe the cs checker can only validate what gets sent to the gpu. Could you not write a GPGPU program that calculates a pointer address at runtime (= on the gpu, thus after the CS checker)?
bridgman replied

23 December 2011, 12:55 PM
Originally posted by curaga

Bridgman, Anandtech states GCN cards have an IOMMU and can access the full system ram.

- how will this affect the driver?
- GPU malware. What, if anything, can the driver/kernel/compiler do against these?

The IOMMU will actually be in the CPU/NB, not the GPU :

AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/6

Note that current GPUs can access the full system RAM already, and one of the important jobs of the kernel driver is making sure that the GPU only accesses the bits it is supposed to access. If you see "CS checker" mentioned in the driver discussions or patches that's the relevant piece.
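For readers who haven't looked at that part of the driver, the idea behind a command stream checker can be sketched in a few lines of Python: before anything reaches the GPU, every address range in the submission is compared against what the submitting process is allowed to touch. This is only a toy model under my own assumptions. The `Allocation` class, the `(op, addr, length)` command tuples and the opcode names are invented for the example; the real checker parses actual hardware packets and is much more detailed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """A region of memory the submitting process is allowed to touch."""
    start: int
    size: int

    def contains(self, addr, length):
        return self.start <= addr and addr + length <= self.start + self.size


def check_command_stream(commands, allowed):
    """Return the offending commands; an empty list means the stream is acceptable."""
    return [
        (op, addr, length)
        for op, addr, length in commands
        if not any(region.contains(addr, length) for region in allowed)
    ]


allowed = [Allocation(start=0x1000, size=0x1000), Allocation(start=0x8000, size=0x400)]
good = [("COPY_SRC", 0x1000, 0x800), ("COPY_DST", 0x8000, 0x400)]
bad = good + [("COPY_DST", 0x8200, 0x400)]  # runs past the end of the second region

print(check_command_stream(good, allowed))  # nothing to reject
print(check_command_stream(bad, allowed))   # the out-of-range copy is flagged
```

Note that a check like this only sees addresses present in the submitted commands, which is why it cannot cover addresses a shader computes at runtime.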

The cool thing is that future GPUs will be able to use future IOMMUs to manage system RAM accesses rather than having to maintain a parallel implementation using different hardware. As a consequence, the GPU will need to work with virtual addresses rather than the pre-translated physical addresses it uses today. The memory management changes we are hoping to push for the upcoming merge window are a first step in that direction.

One of the design challenges is making sure that future GPUs can still work well on hardware which does not have an ATS/PRI-capable IOMMU, and the initial code we are pushing out will be aimed at the more general case, i.e. running on existing CPU/NB hardware without relying on having an IOMMU or ATS/PRI support.

Last edited by bridgman; 23 December 2011, 12:59 PM.
curaga replied

23 December 2011, 12:10 PM
Bridgman, Anandtech states GCN cards have an IOMMU and can access the full system ram.

- how will this affect the driver?
- GPU malware. What, if anything, can the driver/kernel/compiler do against these?
bridgman replied

23 December 2011, 11:43 AM
Originally posted by wizard69

Would it be possible to work something out with Phoronix to write an article focused on the new compute capabilities of Southern Islands under Linux?

Don't know but I'll ask around.

Originally posted by wizard69

One thing I'm interested in is how compute loads impact graphics usage: are we at the point where long-running compute jobs do not impact graphics significantly? I guess that is a question about support for threads. However, any info that you are free to spill that gives us a better mental image of improvements to Southern Islands compute performance would be welcome.

We pushed code for "multiple ring support" a couple of months ago :

Prep Work For Open-Source Radeon Compute, UVD - Phoronix

http://www.phoronix.com/scan.php?page=news_item&px=MTAwNzg

A number of things will use that code in the future, but one of them is allowing compute operations to go through a separate command queue from graphics operations so that the hardware can flip between tasks at a fairly fine-grained level. The multiple ring support started with Cayman but GCN is the first generation where I expect we will really use it.
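A toy model of what separate command queues buy you, in Python: with one ring per workload type, something can alternate between them job by job instead of compute waiting behind all queued graphics work (or vice versa). The round-robin policy, the `interleave` function and the job names are assumptions made purely for illustration; how the hardware actually arbitrates between rings is not described here.

```python
from collections import deque


def interleave(rings):
    """Take one job at a time from each non-empty ring, in round-robin order."""
    queues = {name: deque(jobs) for name, jobs in rings.items()}
    order = []
    while any(queues.values()):
        for name, queue in queues.items():
            if queue:
                order.append((name, queue.popleft()))
    return order


rings = {
    "gfx": ["frame0", "frame1", "frame2"],
    "compute": ["kernelA", "kernelB"],
}

for ring, job in interleave(rings):
    print(ring, job)
```

With a single shared ring the compute kernels would sit behind whatever graphics work was queued first; with two rings neither workload has to drain completely before the other gets a turn.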

Originally posted by wizard69

This question you probably can't answer but when will we see a Fusion processor using Southern Islands technology?

Correct, I can't answer

Originally posted by FireBurn

Linus has always stated that post RC merges are allowed for bug fixes and support for bringing up new hardware - which this would qualify as

We are trying to get all the invasive changes (multiple rings, memory management etc..) pushed out in time for the merge window. Hopefully the remaining changes for GCN will be specific to new HW, but I don't think we have discussed getting them in post-merge yet.

BTW from this point on I'm probably going to switch from talking about GCN to talking about SI (the first generation of GCN parts), partly because it's one less letter (I'm big into efficiency) and partly because that's the terminology we use internally and I'm getting tired of typing SI, backspacing over it and typing GCN instead.

Last edited by bridgman; 23 December 2011, 11:59 AM.
bridgman replied

23 December 2011, 11:36 AM
Originally posted by liam

Obviously you would know better than the guys at Anandtech, but I got a different impression as to the amount of difference in the compute bitstream from VLIW->non-VLIW SIMD. They seemed to say that the compiler was the ALL IMPORTANT COMPONENT in order to get decent utilization (meaning it was necessary to keep vast amounts of the program branches in memory and be REALLY good at best guesses for dependencies). Though they didn't say this part, I assume that since the compiler was so important for <=NI it becomes less so with >=SI, and should even need to be rewritten for the new architecture.

Everything you are saying is correct, but you are assuming that the open source graphics driver already has that ALL IMPORTANT COMPILER for r6xx-NI and (quite reasonably) wondering why we don't use it for compute as well? The answer is simple -- it doesn't have one.

The Catalyst driver has a fancy compiler but the r600/r600g open source drivers do not (at least they didn't the last time I looked). The current TGSI-VLIW compiler takes advantage of the fact that most of the TGSI instructions are 3- or 4-wide vector operations and translates them directly into 3- or 4-slot VLIW instructions.

The Catalyst shader compiler for pre-GCN analyzes dependencies and packs multiple operations into a single VLIW instruction... last time I looked the open source shader compiler did not. Either way, there is a lot of VLIW-specific code in the pre-GCN shader compilers which is not needed for GCN, to the point where it's easier to start over and leverage more conventional compilers which would have sucked on VLIW (absent something like LunarGLASS) but which should fit GCN peachy-fine.
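The difference between the two strategies can be sketched in Python. The first function mimics the "direct" route, where one vector operation simply fills the slots of one VLIW instruction; the second does a very simple in-order dependency check to pack independent scalar operations together. Both are illustrative toys under my own assumptions: `vector_op_to_bundle`, `pack_vliw`, the `(dest, sources)` op format and the greedy in-order policy are made up for the example and are not taken from either compiler.

```python
def vector_op_to_bundle(opcode, dest, width=4):
    """Direct translation: one vector op becomes one bundle, one slot per component."""
    return [f"{opcode} {dest}.{component}" for component in "xyzw"[:width]]


def pack_vliw(ops, width=4):
    """Greedy in-order packing of scalar ops given as (dest, [sources]).

    An op joins the current bundle unless the bundle is full or the op
    depends on (or overwrites) a value written earlier in the same bundle.
    """
    bundles = []
    current, written = [], set()
    for dest, sources in ops:
        depends = dest in written or any(src in written for src in sources)
        if depends or len(current) == width:
            bundles.append(current)
            current, written = [], set()
        current.append(dest)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


print(vector_op_to_bundle("ADD", "r0"))

scalar_ops = [
    ("a", ["x", "y"]),
    ("b", ["x", "z"]),
    ("c", ["a", "b"]),  # needs a and b, so it starts a new bundle
    ("d", ["y"]),
]
print(pack_vliw(scalar_ops))
```

The direct route only fills slots well when the input is already vector-shaped, which is typical for graphics shaders and much less so for scalar-heavy compute code. On GCN there are no slots to fill, so none of this packing logic is needed.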

Originally posted by liam

What you seem to be saying is that <=NI will make good use of the IR->VLIW, which makes sense (also, graphics stays the same).

Honestly, we weren't expecting great efficiency at first no matter which path we took -- LLVM IR to VLIW or LLVM IR to TGSI to VLIW. We went with the LLVM IR to VLIW route for a few reasons :

- it was the shortest path to getting GPU acceleration into clover
- since the TGSI to VLIW path didn't have much relevant optimization we didn't think we would lose performance by going direct from LLVM IR to VLIW
- it gave us a way to test out the LLVM to GPU instruction code on available hardware before we had GCN boards
- it produced code which was more in line with what other developers were looking for in order to build other compute stacks

Neither approach would take much advantage of VLIW hardware for compute at first. If the graphics shader compiler gets more optimized in the future (or already is and we missed it) we would probably try the LLVM IR to TGSI to VLIW path, but I think we would have started with this approach anyway because of the other benefits above.

Originally posted by liam

For >=SI EVERYTHING goes through the new code (which I didn't know, but I also didn't know there were 2 code drops). Again, what surprises me is that the same compiler code used to generate VLIW is also being used to generate the new SIMD code.

I don't remember if it was 2 code drops or 1 drop with 2 parts. I think it was 1 drop with some LLVM patches and some Mesa/Gallium3D driver patches.

Originally posted by liam

Again, if what Anandtech says is accurate it seems like the old compiler was so hideously complex that you'd want to jettison it as soon as possible instead of making it able to output to yet another kind of architecture (so presumably it now addresses VLIW4/5/SIMD).

That's essentially what we are doing (even though in our case the old compiler was not hideously complex).

For GCN both graphics and compute will go through the new LLVM paths.

Last edited by bridgman; 23 December 2011, 11:58 AM.