
Broadcom Open-Sources VideoCore IV 3D Graphics Stack


  • #51
    Originally posted by ssvb View Post
    Which performance barrier? And how is running the 3D driver on the ARM core (wasting precious CPU cycles) instead of offloading it to the VPU core going to help performance?
    gles is a pretty severe api, requiring a hell of a lot of state tracking. things like glGetError() - which you should be doing - go from requiring round trips to being answered from a local cache. you also cut out a lot of copies of things like vertex data. so yes, you do add some cpu load, but you also remove some overheads.
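    to make the round-trip point concrete, here is a minimal sketch of what the local error cache looks like with the driver arm-side. every name here is made up for illustration, not taken from the real driver:

    Code:
    /* glGetError() served from process-local state instead of a round
     * trip to the VPU. illustrative only - none of these names come
     * from the real driver. */
    #include <stdint.h>

    typedef uint32_t GLenum;
    #define GL_NO_ERROR     0
    #define GL_INVALID_ENUM 0x0500

    struct gl_context {
        GLenum last_error;              /* locally cached error state */
        /* ... plus all the other state gles makes you track ... */
    };

    /* entry points that detect a bad argument record the error locally;
     * gl keeps the first error until it is read */
    static void record_error(struct gl_context *ctx, GLenum err)
    {
        if (ctx->last_error == GL_NO_ERROR)
            ctx->last_error = err;
    }

    /* the glGetError() backend: a read of local memory, no ipc */
    GLenum ctx_get_error(struct gl_context *ctx)
    {
        GLenum err = ctx->last_error;
        ctx->last_error = GL_NO_ERROR;  /* reading clears it */
        return err;
    }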

    Originally posted by ssvb View Post
    Let's look at it this way. The Raspberry Pi foundation invested money in Wayland/Weston by contracting dedicated professional developers. This is probably the right choice in the long run. But they hacked Weston to add special Raspberry Pi-specific hardware DispmanX layer support to it. In the X11 world the equivalent would be adding a custom composite extension variant to Xorg and implementing a custom compositing window manager making use of it. For a fair apples-to-apples comparison, this custom Raspberry Pi-specific Xorg extension would need to be implemented by somebody. The next crucial thing is the EGL support for running OpenGL ES applications. And again, for a fair apples-to-apples comparison, both Wayland and X11 EGL support would need to be implemented. However, the real world is anything but fair. The Raspberry Pi foundation naturally does not have any obligation to sponsor the competing X11 improvements just for the sake of contest. They just picked the new horse.
    not strictly accurate, since this didn't require any modifications to clients or protocol. you could - i suppose - develop a protocol to export the entire window tree from the server to the compositor, then work out a way to do zero-copy buffer sharing, and then you could be implementing the same backend. even then, you get stuck in a nest of unavoidable race conditions, the window tree you're exporting is massive, etc. dri1 used to work this way with its shared area; lock contention on the sarea was then found to be an enormous performance bottleneck. adding the extra layer of indirection would necessarily reduce performance.

    none of the demos done to show off wayland showed gles support, btw, so that's one unfair comparison you can strike from the list.



    • #52
      Originally posted by brad0 View Post
      You aren't really doing much to convince me you're not completely clueless.
      how many mobile gpus have you reverse-engineered recently?



      • #53
        Originally posted by daniels View Post
        gles is a pretty severe api, requiring a hell of a lot of state tracking. things like glGetError() - which you should be doing - go from requiring round trips to being answered from a local cache. you also cut out a lot of copies of things like vertex data. so yes, you do add some cpu load, but you also remove some overheads.
        just a side note: from what I've seen in the released docs, it looks like the gpu is directly consuming command lists from the gl driver (ie. it is not relying on the videocore to offload the register banging). So I don't think having the driver arm-side is such a big performance loss. And once you factor in round trips and copies... well, I think it is not a foregone conclusion that having the driver on the arm will be slower than having it on the videocore.
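        to illustrate what "directly consuming command lists" means in practice: per my reading of the released register docs, kicking the binner is just two register writes from whoever owns the hardware, vpu or arm. a sketch - the register offsets here are from memory, so double-check them against the docs:

        Code:
        /* pointing the thread-0 (binner) control list executor at a CL.
         * writing the current-address then the end-address register
         * starts it fetching; it interrupts when it runs off the end.
         * offsets are byte offsets divided by 4 for uint32_t indexing. */
        #include <stdint.h>

        #define V3D_CT0CA (0x110 / 4)  /* executor 0 current address */
        #define V3D_CT0EA (0x108 / 4)  /* executor 0 end address */

        static void submit_bin_cl(volatile uint32_t *v3d_regs,
                                  uint32_t cl_start, uint32_t cl_end)
        {
            v3d_regs[V3D_CT0CA] = cl_start; /* bus address of first record */
            v3d_regs[V3D_CT0EA] = cl_end;   /* writing this kicks it off */
        }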



        • #54
          I think you can only give the hardware two control lists at a time (typically one for binning, one for rendering), and you need to wait for an interrupt before submitting the next one. In theory your kernel driver can maintain a queue of jobs and the ISR can immediately feed it a new one. The released code looks pretty dumb though - the kernel driver's ISR just wakes up a userspace thread that's waiting in an ioctl, so it'll be affected by random scheduler latency. (The userspace thread uses ANDROID_PRIORITY_URGENT_DISPLAY presumably to make that less bad.)

          Usually you should only have one bin/render pair of jobs per frame though, so I guess that's probably bearable.
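          A sketch of what that smarter queue could look like kernel-side, with the ISR feeding the hardware directly. All names are invented for illustration; the point is that nothing on the critical path waits for the scheduler:

          Code:
          /* keep a queue of bin/render jobs; the ISR pops and kicks the
           * next one itself, so no userspace wakeup sits on the critical
           * path. purely illustrative - names are made up. */
          #include <linux/types.h>
          #include <linux/list.h>
          #include <linux/spinlock.h>
          #include <linux/interrupt.h>

          struct v3d_job {
              struct list_head head;
              u32 bin_ca, bin_ea;        /* binner CL start/end */
              u32 render_ca, render_ea;  /* render CL start/end */
          };

          static LIST_HEAD(job_queue);
          static DEFINE_SPINLOCK(job_lock);

          extern void kick_job(struct v3d_job *job); /* CTnCA/CTnEA writes */
          extern bool hw_is_idle(void);

          static void kick_next_locked(void)
          {
              struct v3d_job *job =
                  list_first_entry_or_null(&job_queue, struct v3d_job, head);

              if (job) {
                  list_del(&job->head);
                  kick_job(job);
              }
          }

          /* ioctl path: queue the job; start it now if nothing is running */
          void v3d_submit(struct v3d_job *job)
          {
              unsigned long flags;

              spin_lock_irqsave(&job_lock, flags);
              list_add_tail(&job->head, &job_queue);
              if (hw_is_idle())
                  kick_next_locked();
              spin_unlock_irqrestore(&job_lock, flags);
          }

          /* ISR: previous job done - feed the next one right here instead
           * of waking a userspace thread and eating scheduler latency */
          static irqreturn_t v3d_irq(int irq, void *dev)
          {
              spin_lock(&job_lock);
              kick_next_locked();
              spin_unlock(&job_lock);
              return IRQ_HANDLED;
          }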

          Texture conversion might be a performance problem: "The TMUs require most types of textures to be arranged in memory in T-format or [for smaller images] LT-format", so you need to convert most uploaded textures from raster to T-format. (The documentation indicates it supports raster order only for RGBA32 and YUYV, not e.g. RGB565 or LUMINANCE; and raster order will give poorer rendering performance because of its SDRAM access patterns, so you really want to use T-format). The VPU has vector assembly to do that conversion. The RPi's ARM11 doesn't even have NEON, so the conversion will probably be rather painful there.
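          For a feel of what that conversion costs, here is the innermost level of the tiling in plain C, assuming the 64-byte 4x4-pixel microtiles at 32bpp that my reading of the docs suggests (T-format proper adds further tiling levels on top of this, so treat the parameters as assumptions):

          Code:
          /* raster-order -> microtile ("utile") swizzle for a 32bpp
           * texture: the inner level of the LT/T-format conversion.
           * assumes 4x4-pixel 64-byte utiles and dimensions that are
           * multiples of 4. the vpu's vector unit can move whole utile
           * rows at a time; scalar arm11 code goes texel by texel. */
          #include <stddef.h>
          #include <stdint.h>

          void raster_to_utiles_32bpp(uint32_t *dst, const uint32_t *src,
                                      size_t width, size_t height)
          {
              size_t utiles_per_row = width / 4;

              for (size_t y = 0; y < height; y++) {
                  for (size_t x = 0; x < width; x++) {
                      size_t ut   = (y / 4) * utiles_per_row + (x / 4);
                      size_t lane = (y % 4) * 4 + (x % 4); /* texel in utile */

                      dst[ut * 16 + lane] = src[y * width + x];
                  }
              }
          }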

          Code that doesn't benefit from an RTOS or vector assembly (e.g. the shader compiler) will quite possibly be faster on the ARM though, since essentially it's a 1GHz RISC processor (the ARM) vs a 250MHz RISC processor (the VPU).



          • #55
            Originally posted by Philip View Post
            I think you can only give the hardware two control lists at a time (typically one for binning, one for rendering), and you need to wait for an interrupt before submitting the next one. In theory your kernel driver can maintain a queue of jobs and the ISR can immediately feed it a new one. The released code looks pretty dumb though - the kernel driver's ISR just wakes up a userspace thread that's waiting in an ioctl, so it'll be affected by random scheduler latency. (The userspace thread uses ANDROID_PRIORITY_URGENT_DISPLAY presumably to make that less bad.)

            Usually you should only have one bin/render pair of jobs per frame though, so I guess that's probably bearable.
            Interesting. Well, it shouldn't be an issue for quake3; all your draws should end up in the same cmdlist (ie. nothing should trigger a flush mid-frame), so probably the best thing is just to get it working first. Thankfully q3 seems pretty well behaved.. not the sort of gl app that makes driver writers want to tear their hair out ;-)

            But more long term, when folks r/e'ing videocore get to the point where they can make their own firmware, it could be interesting to implement a simple task on the VPU so that you can queue up pairs of cmdlists from the arm - something like the sketch below.
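            completely speculative - layout and names invented - but the shape of it might be a little ring of cmdlist pairs in shared memory, arm as producer, a vpu task as consumer doing the register writes:

            Code:
            /* speculative sketch: arm queues bin/render pairs, a vpu task
             * consumes them and bangs the registers. */
            #include <stdint.h>

            struct cl_pair {
                uint32_t bin_ca, bin_ea;        /* binner CL start/end */
                uint32_t render_ca, render_ea;  /* render CL start/end */
            };

            #define RING_SLOTS 8                /* power of two */

            struct cl_ring {
                volatile uint32_t head;         /* arm writes (producer) */
                volatile uint32_t tail;         /* vpu writes (consumer) */
                struct cl_pair slots[RING_SLOTS];
            };

            /* arm side: returns 0 if the vpu hasn't caught up yet */
            int ring_submit(struct cl_ring *r, const struct cl_pair *p)
            {
                uint32_t head = r->head;

                if (head - r->tail == RING_SLOTS)
                    return 0;                   /* ring full */
                r->slots[head % RING_SLOTS] = *p;
                /* a real version needs a memory barrier here so the slot
                 * contents are visible before the head update */
                r->head = head + 1;
                return 1;
            }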


            Originally posted by Philip View Post
            Texture conversion might be a performance problem: "The TMUs require most types of textures to be arranged in memory in T-format or [for smaller images] LT-format", so you need to convert most uploaded textures from raster to T-format. (The documentation indicates it supports raster order only for RGBA32 and YUYV, not e.g. RGB565 or LUMINANCE; and raster order will give poorer rendering performance because of its SDRAM access patterns, so you really want to use T-format). The VPU has vector assembly to do that conversion. The RPi's ARM11 doesn't even have NEON, so the conversion will probably be rather painful there.
            again, shouldn't be too much of an issue other than startup time for a new level.. q3 seems pretty good about loading up all its textures up front. Long term, a similar helper on the VPU using vector instructions for texture conversion might be interesting. Make it part of the same cmdlist-dispatcher task to keep it synchronized w/ cmdlist dispatch, perhaps, so the arm side can just fire-and-forget.
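            building on the ring sketch above, the conversion work could be just another tagged entry in the same queue, so it stays ordered with the cmdlists that consume the textures (again, invented names):

            Code:
            /* extends the cl_ring sketch above: tag each slot with an op
             * so texture conversions stay in order with cmdlist
             * submissions. */
            enum vpu_op {
                VPU_OP_SUBMIT_CL_PAIR,   /* bin/render pair, as above */
                VPU_OP_CONVERT_TEXTURE,  /* raster -> T-format on the vpu */
            };

            struct vpu_request {
                enum vpu_op op;
                union {
                    struct cl_pair cls;  /* from the sketch above */
                    struct {
                        uint32_t src, dst;      /* bus addresses */
                        uint16_t width, height; /* texels */
                    } tex;
                } u;
            };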



            • #56
              Is it already possible to run Qt and GTK programs on Wayland/Weston?



              • #57
                Originally posted by entropy View Post
                Is it already possible to run Qt and GTK programs on Wayland/Weston?
                http://bit.ly/1g0xFSX

                For all those people who find it more convenient to bother you with their question rather than google it for themselves.


                ;-)



                • #58
                  Originally posted by robclark View Post
                  http://bit.ly/1g0xFSX
                  For all those people who find it more convenient to bother you with their question rather than google it for themselves.


                  ;-)
                  Too easy!

                  Thanks.



                  • #59
                    Originally posted by entropy View Post
                    Too easy!

                    Thanks.
                    no prob.. been a while since I had a chance to use lmgtfy (and it still makes me chuckle after all these years) ;-)



                    • #60
                      Originally posted by robclark View Post
                      Sure.. with something like wayland you are at the mercy of what the client does as to whether it is "fully" accelerated or not. Although if there is going to be software involved (fallbacks or otherwise), I much prefer it to happen one time per frame / window update, upstream of the acceleration, never having to block waiting for the gpu. Versus x11, where what you hit depends on what your hw can do and which drawing operations your DDX driver can accel (ie. with EXA there are some operations that are always sw fallbacks). With x11, you can hit scenarios where you are alternating between cpu and gpu access to pixmaps, killing performance.
                      Yes, with poor X server DDX drivers you can easily hit scenarios where the performance gets killed. The solution is not to use poor drivers. Admittedly this is rather difficult, especially on ARM hardware, where certain performance-killing anti-patterns are surprisingly popular.

                      gl/gles integration performance differences should amount to client side decorations vs not. With hardware rendered clients (ie. opengl(es)), with either wayland or x11, there should be exactly the same number of copies (ie. one)..
                      Zero copies are always better than one. Without compositing enabled in the X11 window manager, and when hardware overlays are available to be controlled by the DDX driver, this is already in use - http://ssvb.github.io/2013/02/01/new...dx-driver.html. The window decorations do not matter because they are rendered by the X server just like for any other window (DRI2 buffers are not involved). The rectangular area drawn by a GLES application lives in a hardware overlay, with scanout configured directly from the current DRI2 buffer, alternating buffers on vblank to avoid tearing. We need to do a copy to the framebuffer (or to the window backing pixmap) only when somebody really wants to read from there. http://en.wikipedia.org/wiki/Lazy_evaluation for the win
                      As long as the GLES applications don't rely on the EGL_NATIVE_RENDERABLE feature, everyone should be happy. There are some shortcomings in the current implementation, though nothing really unsolvable.
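                      A sketch of the flow being described: flip the overlay between DRI2 buffers on vblank, and only pay for a copy when somebody actually reads the window. The function names are hypothetical stand-ins, not the actual fbturbo code:

                      Code:
                      /* hypothetical names throughout - this mirrors the
                       * description above, not the real DDX driver. */
                      #include <stdbool.h>
                      #include <stdint.h>

                      struct dri2_window {
                          uint32_t buf[2]; /* bus addresses of DRI2 buffers */
                          int      front;  /* index currently scanned out */
                          bool     pixmap_stale;
                      };

                      extern void overlay_set_scanout(uint32_t bus_addr);
                      extern void copy_to_pixmap(struct dri2_window *w);

                      /* vblank handler: swap buffers tear-free, defer copy */
                      void on_vblank_swap(struct dri2_window *w)
                      {
                          w->front ^= 1;
                          overlay_set_scanout(w->buf[w->front]);
                          w->pixmap_stale = true;
                      }

                      /* only when something reads the window backing pixmap
                       * (XGetImage, a sw fallback) do we pay for the copy -
                       * the lazy evaluation part */
                      void before_pixmap_read(struct dri2_window *w)
                      {
                          if (w->pixmap_stale) {
                              copy_to_pixmap(w);
                              w->pixmap_stale = false;
                          }
                      }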

                      and that is regardless of compositing vs non-compositing window manager in the x11 case. The problem is server side (or really window manager side) decorations screw things up and you end up requiring an extra copy.
                      With the x11 compositing window manager and redirected windows, everything surely gets more complicated. But in theory the overhead of dealing with window decorations should not be as dramatic as an extra buffer copy per frame, see https://github.com/ssvb/xf86-video-fbturbo/issues/3. However, this has not really been implemented yet, so I could be overlooking something.

                      But *technically* x11 can do client side decorations too.. so in theory it is a moot point. Although in practice client side decorations seem predominant in wayland and vice versa in x11.
                      Yeah, this all kind of resembles the big-endian vs. little-endian quarrel from http://en.wikipedia.org/wiki/Lilliput_and_Blefuscu - mostly pointless and just a matter of preference.

