Offchip Tessellation Lands In Mesa For RadeonSI Gallium3D
BNieuwenhuizen I have some problems with the terminology here. What does "offchip" mean?
Do you now use the actual hardware tessellation engines? How did tess work before, using geometry shaders running on the CUs (="onchip"?)?
Originally posted by nanonyme
Sounded to me this is about taking advantage of hardware improvements introduced in RadeonSI GPUs
SOM (with an R9 280X):
Before the patch:
Min: 14.21
Average: 34.31
Max: 75.89
After:
Min: 21.89
Average: 39.70
Max: 86.46
The average doesn't improve as much as I'd hoped, but the minimum no longer drops below 20, which is great. With the patched version I didn't notice a slowdown in the benchmark at the explosion, whereas I did before.
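In relative terms that works out to roughly +54% on the minimum (14.21 -> 21.89), +16% on the average (34.31 -> 39.70) and +14% on the maximum (75.89 -> 86.46).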
Comment
-
Originally posted by juno
BNieuwenhuizen I have some problems with the terminology here. What does "offchip" mean?
Do you now use the actual hardware tessellation engines? How did tess work before, using geometry shaders running on the CUs (="onchip"?)?
However, if you have very large tessellation factors (for reference, this starts to matter at e.g. TessMark >= 32x, while I don't think Heaven uses tess factors >= 16x), the earlier onchip change actually slowed things down.
To understand why, we have to look at the TCS->TES I/O. Previously, all TCS outputs were passed to the TES through LDS memory, which is per-CU memory, i.e. "onchip" storage. The result is that if we run a TCS subgroup on a certain CU, we have to schedule all TES subgroups for the corresponding patches on that same CU. With large tess factors there are many more TES subgroups than TCS subgroups, and the CU might still have unrelated subgroups running, so it can take a long time before they all finish. This results in an imbalance between CU workloads. I'm not completely sure what all the bottlenecks are here: it might simply be that there aren't enough TCS subgroups to keep all CUs busy, or there might be limits on some queue lengths in the tessellation hardware.
Anyway, if we pass the TCS->TES values through VRAM instead (in practice the data will mostly stay in L2), we can schedule the TES subgroups on different CUs, which solves the imbalance. Most importantly, this fixes TessMark getting slower due to the earlier change, and at least on VI, combined with some of the VI-specific changes, it also improves performance a bit.
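To make the trade-off concrete, here is a minimal C sketch of the idea; the enum, the function name and the threshold are purely illustrative assumptions, not the actual RadeonSI code:

/* Illustrative sketch of the onchip vs. offchip trade-off described above.
 * The names and the threshold are made up for this example. */
enum tcs_output_storage {
   TCS_OUTPUTS_ONCHIP_LDS,   /* per-CU LDS: TES subgroups must run on the same CU as their TCS subgroup */
   TCS_OUTPUTS_OFFCHIP_VRAM, /* VRAM (in practice mostly L2): TES subgroups can run on any CU */
};

/* With small tess factors a TCS subgroup only fans out into a few TES
 * subgroups, so keeping the data in LDS avoids a memory round trip.
 * With large factors (e.g. the >= 32x case mentioned for TessMark) one TCS
 * subgroup produces many TES subgroups; pinning them all to one CU leaves
 * the other CUs underutilized, so spilling to VRAM/L2 wins. */
static enum tcs_output_storage
choose_tcs_output_storage(unsigned expected_max_tess_factor)
{
   const unsigned offchip_threshold = 32; /* illustrative value, not the driver's */

   return expected_max_tess_factor >= offchip_threshold
             ? TCS_OUTPUTS_OFFCHIP_VRAM
             : TCS_OUTPUTS_ONCHIP_LDS;
}

In practice the tess factors are only known at runtime (the TCS computes them), so a per-draw choice like this sketch is a simplification; the point is only to show why a large TCS->TES fan-out favours the offchip path.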