Offchip Tessellation Lands In Mesa For RadeonSI Gallium3D
BNieuwenhuizen I have some problems with the terminology here. What does "offchip" mean?
Do you now use the actual hardware tessellation engines? How did tess work before, using geometry shaders running on the CUs (="onchip"?)?
Originally posted by nanonyme
Sounded to me this is about taking advantage of hardware improvements introduced in RadeonSI GPUs
SOM (with an R9 280X):
Before the patch:
Min: 14.21
Average: 34.31
Max: 75.89
After:
Min: 21.89
Average: 39.70
Max: 86.46
The average doesn't improve as much as I'd hoped, but the minimum no longer drops below 20, which is great. With the patched version I didn't notice a slowdown in the benchmark at the explosion, whereas I did before.
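In relative terms that works out to roughly +54% on the minimum (14.21 -> 21.89), +16% on the average (34.31 -> 39.70) and +14% on the maximum (75.89 -> 86.46).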
Comment
-
Originally posted by juno
BNieuwenhuizen I have some problems with the terminology here. What does "offchip" mean?
Do you now use the actual hardware tessellation engines? How did tess work before, using geometry shaders running on the CUs (="onchip"?)?
However, if you have very large tessellation factors (for reference, this starts to matter at e.g. TessMark >= 32x, while I don't think Heaven uses tess factors >= 16x), the earlier onchip change actually slowed things down.
To understand why, we have to look at the TCS->TES I/O. Previously, all TCS outputs were passed to the TES through LDS memory, which is per-CU memory, i.e. "onchip" storage. The result is that if we run a TCS subgroup on a certain CU, we have to schedule all TES subgroups for the corresponding patches on that same CU. With large tess factors there are many more TES subgroups than TCS subgroups, and the CU might still have unrelated subgroups running, so it can take a long time before they all finish. This results in an imbalance between CU workloads. I'm not completely sure what all the bottlenecks are here: it might simply be that there aren't enough TCS subgroups to keep all CUs busy, or there might be limits on some queue lengths in the tessellation hardware.
Anyway, if we pass the TCS->TES values through VRAM instead (in practice the data will mostly stay in L2), we can schedule the TES subgroups on different CUs, which solves the imbalance. Most importantly, this fixes TessMark getting slower due to the earlier change, and at least on VI, combined with some of the VI-specific changes, it also improves performance a bit.
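To make the trade-off concrete, here is a minimal C sketch of the idea; the enum, the function name and the threshold are purely illustrative assumptions, not the actual RadeonSI code:

/* Illustrative sketch of the onchip vs. offchip trade-off described above.
 * The names and the threshold are made up for this example. */
enum tcs_output_storage {
   TCS_OUTPUTS_ONCHIP_LDS,   /* per-CU LDS: TES subgroups must run on the same CU as their TCS subgroup */
   TCS_OUTPUTS_OFFCHIP_VRAM, /* VRAM (in practice mostly L2): TES subgroups can run on any CU */
};

/* With small tess factors a TCS subgroup only fans out into a few TES
 * subgroups, so keeping the data in LDS avoids a memory round trip.
 * With large factors (e.g. the >= 32x case mentioned for TessMark) one TCS
 * subgroup produces many TES subgroups; pinning them all to one CU leaves
 * the other CUs underutilized, so spilling to VRAM/L2 wins. */
static enum tcs_output_storage
choose_tcs_output_storage(unsigned expected_max_tess_factor)
{
   const unsigned offchip_threshold = 32; /* illustrative value, not the driver's */

   return expected_max_tess_factor >= offchip_threshold
             ? TCS_OUTPUTS_OFFCHIP_VRAM
             : TCS_OUTPUTS_ONCHIP_LDS;
}

In practice the tess factors are only known at runtime (the TCS computes them), so a per-draw choice like this sketch is a simplification; the point is only to show why a large TCS->TES fan-out favours the offchip path.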