Broadcom Open-Sources VideoCore IV 3D Graphics Stack


  • robclark
    replied
    Originally posted by ssvb View Post
    Yes, with poor X server DDX drivers you can easily hit scenarios where performance gets killed. The solution is not to use poor drivers. Admittedly, this is rather difficult, especially on ARM hardware, where certain performance-killing anti-patterns are surprisingly popular.
    Well, technically you are right.. but really the only good example (from a performance standpoint) of a DDX driver is intel SNA, and really no one but intel can afford that massive a DDX driver investment. Go compare the # of LoC of SNA to intel's mesa driver some day!

    So yes, some of the things that I am saying are not possible; what I should actually be saying is that they are not *practical*.

    (Possibly once glamor is in better shape, that will be the way forward for mobile DDX drivers.. or rather, nearly *all* DDX drivers.. at least then the massive driver investment to handle all the different x11 render paths can be done once and shared across all drivers. Still doesn't help with the overlay situation, though.)

    Originally posted by ssvb View Post
    Zero copy is always better than one. Without compositing enabled in the X11 window manager, and when hardware overlays are available to be controlled by the DDX driver, this is already in use - http://ssvb.github.io/2013/02/01/new...dx-driver.html. The window decorations do not matter because they are rendered by the X server just like for any other window (DRI2 buffers are not involved). The rectangular area drawn by a GLES application lives in a hardware overlay, with scanout configured directly from the current DRI2 buffer, alternating buffers on vblank to avoid tearing. We need to do a copy to the framebuffer (or to the window backing pixmap) only when somebody really wants to read from there. http://en.wikipedia.org/wiki/Lazy_evaluation for the win.
    As long as the GLES applications don't rely on the EGL_NATIVE_RENDERABLE feature, everyone should be happy. There are some shortcomings in the current implementation, but nothing really unsolvable.
    Yes.. I've seen that. It is a really cute hack. But it will never be possible to make it perfect (moving windows around, stacking order, multiple gl apps, $random_users_favorite_windowmanager, etc.). The best you'll be able to do is gracefully fall back to a slow path.

    Weston otoh can easily do the same thing, with no hacks. And once atomic modeset is upstream in the kernel, it will be able to do it pixel-perfect. This is why I'm so pro-wayland. Yes, there are things that, given enough time/effort/compromise/etc, can be hacked into x11. But why, when wayland lets you do it cleanly/easily?

    Originally posted by ssvb View Post
    With an x11 compositing window manager and redirected windows, everything surely gets more complicated. But in theory the overhead of dealing with window decorations should not be as dramatic as an extra buffer copy per frame, see https://github.com/ssvb/xf86-video-fbturbo/issues/3. However, this has not really been implemented yet, so I could be overlooking something.
    IIRC, compiz has (or at least used to have) an option to choose between window decorations in the same texture vs. separate textures. The latter would avoid the copy. There are some artifacts w/ wobbly windows if you do this (but then you can also just disable wobbly windows). Last time I checked, compiz defaulted to decorations in the same texture (ie. the copy).
    Last edited by robclark; 06 March 2014, 09:07 AM.



  • pq__
    replied
    Originally posted by ssvb View Post
    Zero copy is always better than one. Without compositing enabled in the X11 window manager, and when hardware overlays are available to be controlled by the DDX driver, this is already in use - http://ssvb.github.io/2013/02/01/new...dx-driver.html. The window decorations do not matter because they are rendered by the X server just like for any other window (DRI2 buffers are not involved). The rectangular area drawn by a GLES application lives in a hardware overlay, with scanout configured directly from the current DRI2 buffer, alternating buffers on vblank to avoid tearing. We need to do a copy to the framebuffer (or to the window backing pixmap) only when somebody really wants to read from there. http://en.wikipedia.org/wiki/Lazy_evaluation for the win.
    For comparison, with Weston on the rpi you will always get as close to zero-copy EGL clients as imaginable. The only copy that might happen is done by the firmware in secret, if it deems that the element scenegraph is too complex. For a full-screen app like a game, with only a few additional elements like a mouse cursor, I believe the firmware should avoid the secret copy.

    It does not matter whether Weston has to composite (show something else at the same time), or whether the GL-rendered window is partially obscured or not, or how the applications are coded, as long as they are native Wayland apps (they do not need to be rpi-specific apps). On Wayland, there is no case where we would need to do an additional lazy copy because something wants to read something.


    Btw. how do you guarantee that the decorations and your DRI2 buffer stay in sync wrt. size while resizing the window, so that you don't accidentally show a picture where they are at disagreeing sizes?



  • ssvb
    replied
    Originally posted by robclark View Post
    Sure.. with something like wayland you are at the mercy of what the client does as to whether it is "fully" accelerated or not. Although if there is going to be software involved (fallbacks or otherwise), I much prefer it to happen one time per frame / window update, upstream of the acceleration, never having to block waiting for the gpu. Versus x11, where you can have scenarios depending on what your hw can do and whatever drawing operations your DDX driver can accel (ie. with EXA there are some operations that are always sw fallbacks). With x11, you can hit scenarios where you are alternating between cpu and gpu access to pixmaps, killing performance.
    Yes, with poor X server DDX drivers you can easily hit scenarios where performance gets killed. The solution is not to use poor drivers. Admittedly, this is rather difficult, especially on ARM hardware, where certain performance-killing anti-patterns are surprisingly popular.

    gl/gles integration performance differences should amount to client side decorations vs. not. With hardware-rendered clients (ie. opengl(es)), with either wayland or x11, there should be exactly the same number of copies (ie. one).
    Zero copy is always better than one. Without compositing enabled in the X11 window manager, and when hardware overlays are available to be controlled by the DDX driver, this is already in use - http://ssvb.github.io/2013/02/01/new...dx-driver.html. The window decorations do not matter because they are rendered by the X server just like for any other window (DRI2 buffers are not involved). The rectangular area drawn by a GLES application lives in a hardware overlay, with scanout configured directly from the current DRI2 buffer, alternating buffers on vblank to avoid tearing. We need to do a copy to the framebuffer (or to the window backing pixmap) only when somebody really wants to read from there. http://en.wikipedia.org/wiki/Lazy_evaluation for the win.
    As long as the GLES applications don't rely on the EGL_NATIVE_RENDERABLE feature, everyone should be happy. There are some shortcomings in the current implementation, but nothing really unsolvable.
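
    To make the lazy-evaluation part concrete, here is a minimal sketch of the idea in C; the names and structure are illustrative only, not the actual fbturbo code:
    Code:
    #include <string.h>  /* memcpy */

    /* Illustrative only: lazy copy from the scanout buffer to the
     * window's backing pixmap.  Page flips are free; the copy happens
     * only if some X11 client actually reads the window contents. */
    struct overlay_window {
        void  *dri2_buffer;     /* buffer currently on overlay scanout */
        void  *backing_pixmap;  /* pixmap the X server thinks it owns */
        size_t size;            /* bytes to copy when syncing */
        int    pixmap_is_stale; /* set on every page flip */
    };

    /* vblank page flip: no copying, just remember the pixmap is stale */
    static void on_page_flip(struct overlay_window *w, void *new_buffer)
    {
        w->dri2_buffer = new_buffer;
        w->pixmap_is_stale = 1;
    }

    /* called only when somebody really wants to read from the pixmap */
    static void sync_pixmap_for_read(struct overlay_window *w)
    {
        if (w->pixmap_is_stale) {
            memcpy(w->backing_pixmap, w->dri2_buffer, w->size);
            w->pixmap_is_stale = 0;
        }
    }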

    and that is regardless of compositing vs. non-compositing window manager in the x11 case. The problem is that server-side (or really window-manager-side) decorations screw things up and you end up requiring an extra copy.
    With the x11 compositing window manager and redirected windows, everything surely gets more complicated. But in theory the overhead of dealing with window decorations should be not so dramatic as an extra buffer copy per frame, see https://github.com/ssvb/xf86-video-fbturbo/issues/3. However this has not been really implemented yet, so I could be overlooking something.

    But *technically* x11 can do client side decorations too.. so in theory it is a moot point. Although in practice client side decorations seem predominant in wayland and vice versa in x11.
    Yeah, this all kind of resembles the big-endian vs. little-endian quarrel from http://en.wikipedia.org/wiki/Lilliput_and_Blefuscu. Mostly pointless and just a matter of preference.



  • robclark
    replied
    Originally posted by entropy View Post
    Too easy!

    Thanks.
    no prob.. been a while since I had a chance to use lmgtfy (and it still makes me chuckle after all these years) ;-)



  • entropy
    replied
    Originally posted by robclark View Post
    http://bit.ly/1g0xFSX
    For all those people who find it more convenient to bother you with their question rather than google it for themselves.


    ;-)
    Too easy!

    Thanks.



  • robclark
    replied
    Originally posted by entropy View Post
    Is it already possible to run Qt and GTK programs on Wayland/Weston?
    For all those people who find it more convenient to bother you with their question rather than google it for themselves.


    ;-)



  • entropy
    replied
    Is it already possible to run Qt and GTK programs on Wayland/Weston?



  • robclark
    replied
    Originally posted by Philip View Post
    I think you can only give the hardware two control lists at a time (typically one for binning, one for rendering), and you need to wait for an interrupt before submitting the next one. In theory your kernel driver can maintain a queue of jobs and the ISR can immediately feed it a new one. The released code looks pretty dumb though - the kernel driver's ISR just wakes up a userspace thread that's waiting in an ioctl, so it'll be affected by random scheduler latency. (The userspace thread uses ANDROID_PRIORITY_URGENT_DISPLAY presumably to make that less bad.)

    Usually you should only have one bin/render pair of jobs per frame though, so I guess that's probably bearable.
    Interesting. Well, it shouldn't be an issue for quake3; all your draws should end up in the same cmdlist (ie. nothing should trigger a flush mid-frame), so probably the best thing is to just get it working first. Thankfully q3 seems pretty well behaved.. not the sort of gl app that makes driver writers want to tear their hair out ;-)

    But more long term, when folks r/e'ing videocore get to the point where they can make their own firmware, it could be interesting to implement a simple task on the VPU so that you can queue up pairs of cmdlists from the arm.


    Originally posted by Philip View Post
    Texture conversion might be a performance problem: "The TMUs require most types of textures to be arranged in memory in T-format or [for smaller images] LT-format", so you need to convert most uploaded textures from raster to T-format. (The documentation indicates it supports raster order only for RGBA32 and YUYV, not e.g. RGB565 or LUMINANCE; and raster order will give poorer rendering performance because of its SDRAM access patterns, so you really want to use T-format). The VPU has vector assembly to do that conversion. The RPi's ARM11 doesn't even have NEON, so the conversion will probably be rather painful there.
    again, shouldn't be too much of an issue other than startup time for a new level.. q3 seems pretty good about loading up all its textures up front. Long term, a similar helper on the VPU that uses vector instructions for texture conversion might be interesting. Make it part of the same cmdlist-dispatcher task to keep it synchronized w/ cmdlist dispatch, perhaps, so the arm side can just fire-and-forget.



  • Philip
    replied
    I think you can only give the hardware two control lists at a time (typically one for binning, one for rendering), and you need to wait for an interrupt before submitting the next one. In theory your kernel driver can maintain a queue of jobs and the ISR can immediately feed it a new one. The released code looks pretty dumb though - the kernel driver's ISR just wakes up a userspace thread that's waiting in an ioctl, so it'll be affected by random scheduler latency. (The userspace thread uses ANDROID_PRIORITY_URGENT_DISPLAY presumably to make that less bad.)

    Usually you should only have one bin/render pair of jobs per frame though, so I guess that's probably bearable.
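
    For illustration, an ISR-fed queue might look roughly like this; all names here are hypothetical, not the released driver's code:
    Code:
    #include <linux/interrupt.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    /* Hypothetical job: one binning + one rendering control list */
    struct v3d_job {
        struct list_head node;
        u32 bin_cl, render_cl;  /* control list bus addresses */
    };

    static LIST_HEAD(pending_jobs);
    static DEFINE_SPINLOCK(job_lock);

    /* hypothetical helper that writes the control list addresses
     * to the hardware's submit registers */
    extern void v3d_hw_kick(u32 bin_cl, u32 render_cl);

    /* ISR: previous job finished, feed the hardware the next one
     * immediately, with no trip through the scheduler */
    static irqreturn_t v3d_irq(int irq, void *dev)
    {
        struct v3d_job *next;

        spin_lock(&job_lock);
        next = list_first_entry_or_null(&pending_jobs,
                                        struct v3d_job, node);
        if (next) {
            list_del(&next->node);
            v3d_hw_kick(next->bin_cl, next->render_cl);
        }
        spin_unlock(&job_lock);
        /* completion/free of the finished job omitted for brevity */
        return IRQ_HANDLED;
    }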

    Texture conversion might be a performance problem: "The TMUs require most types of textures to be arranged in memory in T-format or [for smaller images] LT-format", so you need to convert most uploaded textures from raster to T-format. (The documentation indicates it supports raster order only for RGBA32 and YUYV, not e.g. RGB565 or LUMINANCE; and raster order will give poorer rendering performance because of its SDRAM access patterns, so you really want to use T-format). The VPU has vector assembly to do that conversion. The RPi's ARM11 doesn't even have NEON, so the conversion will probably be rather painful there.
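
    To give a feel for what that conversion costs on the CPU, a generic linear-to-tiled copy looks like this; the tile size here is purely illustrative, and the real T-format microtile layout is more involved:
    Code:
    #include <stdint.h>

    #define TILE 32  /* illustrative tile dimension, not the T-format spec */

    /* Copy a raster-order RGBA32 image into a layout where each
     * TILE x TILE block of pixels is stored contiguously.  Assumes
     * width and height are multiples of TILE for brevity. */
    static void raster_to_tiled(uint32_t *dst, const uint32_t *src,
                                int width, int height)
    {
        for (int ty = 0; ty < height; ty += TILE)
            for (int tx = 0; tx < width; tx += TILE)
                for (int y = 0; y < TILE; y++)
                    for (int x = 0; x < TILE; x++)
                        *dst++ = src[(ty + y) * width + (tx + x)];
    }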

    Code that doesn't benefit from an RTOS or vector assembly (e.g. the shader compiler) will quite possibly be faster on the ARM though, since essentially it's a 1GHz RISC processor (the ARM) vs a 250MHz RISC processor (the VPU).



  • robclark
    replied
    Originally posted by daniels View Post
    gles is a pretty severe api, requiring a hell of a lot of state tracking. things like glGetError() - which you should be doing - switch from requiring roundtrips to being in a local cache. you also cut out a lot of copies of things like vertex data. so yes, you do add some cpu load, but you also remove some overheads
    just a side note: from what I've seen in the released docs, it looks like the gpu is directly consuming command lists from the gl driver (ie. it is not relying on videocore to offload the register banging). So I don't think having the driver arm-side is such a big performance loss. And once you factor in round trips and copies... well, I think it is not a foregone conclusion that having the driver on the arm will be slower compared to having it on videocore.
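
    (To be concrete about the glGetError() point: with the driver in-process, the error is just local state, read and cleared per the gl spec. A minimal sketch, with illustrative names rather than any particular driver's internals:)
    Code:
    #include <stdint.h>

    typedef uint32_t GLenum;
    #define GL_NO_ERROR 0  /* matches the GL definition */

    static GLenum context_error = GL_NO_ERROR;

    /* the driver records errors locally while validating commands;
     * GL keeps only the first error until it is queried */
    static void record_error(GLenum err)
    {
        if (context_error == GL_NO_ERROR)
            context_error = err;
    }

    /* with the driver in-process this is a plain read, not a roundtrip */
    GLenum glGetError(void)
    {
        GLenum err = context_error;
        context_error = GL_NO_ERROR;  /* reading resets the flag */
        return err;
    }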

