What isn't entirely clear to me is why there is such a big push to use OpenGL or XRender as a backend in web browsers, especially for simpler scenes that are just HTML/CSS with some images. Any driver worth its salt (even Intel's, now, with SNA) has hardware-backed 2D acceleration, so simply using Cairo with its default backend ends up hardware-accelerated anyway, and that path has been tested and stabilized for years, as evidenced by its rock-solid stability on most hardware the Linux kernel supports. The path from Cairo through the X server, the DDX and KMS is a venerable, fast path even for rapidly changing dynamic HTML5 scenes. Granted, it's not suitable for rendering huge numbers of polygons or full 3D scenes, but that's what WebGL is for, which is a separate topic entirely.
I've observed that the latency and overhead involved in preparing the pipeline for the more sophisticated backends, such as OpenGL and XRender, cause more CPU usage and worse performance than simply using the existing 2D paths.
That's why Microsoft came up with Direct2D for Windows: they realized that implementing the entire GUI stack on top of Direct3D would be too inefficient, because Direct3D is designed to handle a much bigger job.
One of the presentations featured on Phoronix a couple of years ago (from Intel, related to Larrabee, I believe) said it best. And I think that even with the evolution of hybrid CPU/GPU designs (Bulldozer, Tegra, etc.), there's still a need to distinguish between these two workloads.
One workload is comparable to the real-life task of transporting one person from their origin to their destination. Assuming this person needs to arrive in the minimum amount of time while consuming the least amount of resources (with time given higher priority than resource consumption), the best way to get them there is to put them in a small, fuel-efficient vehicle and zip them down the highway to their destination. This captures the use case of most traditional 2D applications, such as HTML/CSS rendering, 2D native GUIs in GTK or Qt, and so on: a lot of individual requests are being sent, and a very fast response is needed while consuming minimal resources.
The other workload is comparable to the real-life task of a company that needs to transport ten forklift pallets full of bricks from one place to another, once per week. Compared to the first example, this workload has different requirements: we know it would be prohibitively expensive to put a few bricks in the trunks of a dozen 2-door sedans and haul them over with twelve different drivers. So instead, we load them onto a very large flatbed truck that consumes a great deal of diesel fuel, but that ends up being significantly cheaper and faster than using a fleet of small cars. We can also manage any time inefficiencies introduced by this slower mode of transit, because we deliver the same workload (ten forklift pallets of bricks) every week; if we anticipate delays, we can create an artificial surplus by shipping early, and so on. This captures the use case of "heavy" 2D applications such as Flash, Clutter, 3D games, and so on. Because of the sheer size of the work to be done, we send it out in larger batches and employ a more expensive resource (in terms of power and latency) -- the full GPU -- to get the work done.
While I think it's possible to build a hardware chip that handles both workloads well, that's not my point here. My point is that we should be using an API that maps well to the workload's requirements. It's as simple as that.
Real-life performance data has continued to show little benefit from using the "heavier" APIs for the "lighter" workloads: I believe past Phoronix articles showed the software rasterizer to be as fast as or faster than the OpenGL backend for, e.g., Qt4. This is very interesting, because you're using two different APIs to render the same workload, yet the internal paths the backends take are very different: one of them powers up the entire GPU and starts mapping textures and perhaps even compiling and executing shaders, while the other uses only very basic graphics operations on the GPU and does most of the rasterization in system memory, although the final result is usually still DMA-transferred, zero-copy, into the framebuffer.
And aside from performance, it's pretty clear that even the most carefully optimized implementations of "3D stack for everything" -- such as Windows 7's WDDM 1.1-based Aero -- consume at least 10-15% more energy than a 2D path. The actual figure will depend on the power characteristics of your graphics hardware, and I think 10-15% was measured on Intel IGPs; if you were to look at, say, a power-hungry Nvidia card like the GTX 580, you might see an even more dramatic difference between a workload that constantly hits the GPU and one that putters along on the relatively lower-power CPU.
So if you care at all about latency or power consumption, it's probably a good idea to configure your web browser and your GTK and Qt stacks to keep using software rasterization. If you just want to run the latest technologies for the fun of it, enjoy using OpenGL for web browsing -- a greatly over-engineered and unnecessary step that provides little measurable benefit.
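For reference, here is roughly what that configuration looks like; the flag and preference names below are from memory and may vary between versions, so treat this as a sketch rather than gospel. (GTK 2 needs no switch at all: its Cairo drawing already goes through the default X backend.)

```shell
# Qt 4: force the raster (software) graphics system for one application...
some-qt-app -graphicssystem raster

# ...or for everything launched from this shell, via the environment:
export QT_GRAPHICSSYSTEM=raster

# Firefox: in about:config, leaving hardware layer acceleration off
# keeps content rendering on the cairo/pixman software path:
#   layers.acceleration.disabled = true
#   layers.acceleration.force-enabled = false
```

"some-qt-app" above is a placeholder for any Qt 4 program; the -graphicssystem switch and QT_GRAPHICSSYSTEM variable select between the "native" (X11/XRender), "raster" and "opengl" paint engines at startup.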