E-450 graphics performance issues

  • #76
    I believe that, but I can't imagine function call overhead contributes notably to CS checking CPU load.



    • #77
      Originally posted by brent View Post
      I did more than just enabling tracing/fallback reporting, of course.

      Can you post your profiling output? Where is most of the time spent? What were you profiling?

      Originally posted by brent View Post
      Anyway, the kind of fallback you describe should be quite efficient (data flows only in one direction, no ping-pong), no? Also, it does not explain performance that is an order of magnitude slower than pixman and scales (nearly) linearly with GPU clock.

      It depends on the surface. If you want to read/write to a tiled surface with the CPU, you'll need to blit the surface to a linear buffer and then map it for the CPU to access it. If the surface is not tiled, you could just map it, but then the driver would have to wait for the GPU to finish with it before accessing it. Reading from uncached vram with the CPU is also slow.


      Originally posted by brent View Post
      I mean that radeon_cs_emit() is done synchronously. Mesa seems to do it in a worker thread. radeon_cs_emit() can take a few milliseconds to complete, so I guess that's worthwhile.
      That submits the command buffer to the kernel for processing by the GPU. It would be optimal to separate it out into a separate thread, but the overhead is not that great. Also if this were the bottleneck, it wouldn't be tied to GPU clocks as you claim.



      • #78
        Originally posted by agd5f View Post
        Can you post your profiling output? Where is most of the time spent? What were you profiling?
        I experimented with system profilers like oprofile and sysprof, but they weren't particularly useful: 2D rendering is IO/GPU-bound, not CPU-bound (unless fallbacks are hit). So in the end I placed timers throughout the code to measure the wall-clock time of various functions, and I'm still at it. I don't know if there is a better way to profile wall-clock time as opposed to CPU time. I don't think it's interesting to post the output; it's just a messy log with the wall time per function printed.

        One particular culprit I found is RADEONSolidPixmap. This function creates a special, small 1x1 scratch pixmap. These pixmaps are created and destroyed all the time in typical EXA Composite cases. A call to this function doesn't consume much CPU time, but still takes about 70 µs on average on my system for creating the pixmap, mapping the BO, etc. I added a simple cache (quite a hack, a proof of concept) for these scratch pixmaps, and this already sped up the gnome-terminal-vim perf-trace more than 3x.


        It depends on the surface. If you want to read/write to a tiled surface with the CPU, you'll need to blit the surface to a linear buffer and then map it for the CPU to access it. If the surface is not tiled, you could just map it, but then the driver would have to wait for the GPU to finish with it before accessing it. Reading from uncached vram with the CPU is also slow.
        Yes, but I specifically meant the case where EXA completely circumvents acceleration (isn't that what EXA_MIXED_PIXMAPS is all about?).
        In that case, I'd assume that EXA software-renders into a mapped linear GTT surface and afterwards uses it like any other pixmap for Copy/Composite/etc., which should be quite efficient. Well, or something like that.

        That submits the command buffer to the kernel for processing by the GPU. It would be optimal to separate it out into a separate thread, but the overhead is not that great. Also if this were the bottleneck, it wouldn't be tied to GPU clocks as you claim.
        I never said it's the primary bottleneck, but it certainly is a bottleneck.



        • #79
          Originally posted by brent View Post
          I experimented with system profilers like oprofile and sysprof, but they weren't particularly useful: 2D rendering is IO/GPU-bound, not CPU-bound (unless fallbacks are hit). So in the end I placed timers throughout the code to measure the wall-clock time of various functions, and I'm still at it. I don't know if there is a better way to profile wall-clock time as opposed to CPU time. I don't think it's interesting to post the output; it's just a messy log with the wall time per function printed.

          One particular culprit I found is RADEONSolidPixmap. This function creates a special, small 1x1 scratch pixmap. These pixmaps are created and destroyed all the time in typical EXA Composite cases. A call to this function doesn't consume much CPU time, but still takes about 70 µs on average on my system for creating the pixmap, mapping the BO, etc. I added a simple cache (quite a hack, a proof of concept) for these scratch pixmaps, and this already sped up the gnome-terminal-vim perf-trace more than 3x.
          That's a good start. Ideally we'd just pass the solid color into the pixel shader as a constant, but that requires rewriting the composite pixel and vertex shaders, which at this point starts to get non-trivial without a shader compiler.



          • #80
            Originally posted by agd5f View Post
            That's a good start. Ideally we'd just pass the solid color into the pixel shader as a constant, but that requires rewriting the composite pixel and vertex shaders, which at this point starts to get non-trivial without a shader compiler.
            What are they written in?



            • #81
              Originally posted by curaga View Post
              What are they written in?
              GPU shader assembly.



              • #82
                http://cgit.freedesktop.org/xorg/dri.../r600_shader.c

                Yay. Is there no way to dump that from mesa?



                • #83
                  Re the static thing: there were a total of ~90 functions and structs that should have been static and weren't. Yikes! Let's see if this becomes my first kernel code.



                  • #84
                    Originally posted by curaga View Post
                    http://cgit.freedesktop.org/xorg/dri.../r600_shader.c

                    Yay. Is there no way to dump that from mesa?
                    If you mean integrate the shader compiler from mesa into the ddx, that is a lot of work. If you mean use the 3D driver to generate the GPU asm, that should be possible. You'll have to write the program in either GLSL or TGSI, then you can dump the shaders. Afterwards they may require a bit of tweaking to handle differences in pipeline state between the 3D driver and the ddx.



                    • #85
                      Originally posted by agd5f View Post
                      If you mean integrate the shader compiler from mesa into the ddx, that is a lot of work. If you mean use the 3D driver to generate the GPU asm, that should be possible. You'll have to write the program in either GLSL or TGSI, then you can dump the shaders. Afterwards they may require a bit of tweaking to handle differences in pipeline state between the 3D driver and the ddx.
                      The latter, yes.



                      • #86
                        Originally posted by agd5f View Post
                        It's not really an issue with 2D vs. 3D engines. 2D engines suck for RENDER too. The reason vesa or old drivers seem faster for certain things is because they use shadowfb or XAA (which ends up being shadowfb because offscreen acceleration has been disabled for years due to bit rot in XAA). Shadowfb is pure software rendering. Pure CPU rendering is almost always faster than mixed CPU/GPU rendering since there is no ping-ponging between GPU and CPU rendering.
                        You can enable shadowfb in the radeon driver if you want to compare by setting Option "NoAccel" "True" in the device section of your xorg config.
                        Option "NoAccel" "True" also disables Xv acceleration, which makes this driver configuration not very useful in practice. But indeed it performs roughly the same as xf86-video-fbdev.

                        Option "RenderAccel" "False" could potentially be interesting, except that it makes xf86-video-ati-6.14.6 slower than xf86-video-fbdev for me (unfortunately no E-450 here, just an Intel Core i7 860 + Radeon HD 6770):
                        Code:
                        old: fbdev
                        new: ati-norenderaccel
                        Speedups
                        ========
                         xlib          xfce4-terminal-a1  3418.55 (3418.83 0.78%) -> 2878.58 (2893.62 0.72%):  1.19x speedup
                        ▎
                        Slowdowns
                        =========
                         xlib          firefox-asteroids  4320.09 (4324.56 1.87%) -> 4587.17 (4597.68 1.00%):  1.06x slowdown
                        
                         xlib              chromium-tabs  145.03 (145.58 8.13%) -> 161.65 (162.10 3.91%):  1.11x slowdown
                        ▏
                         xlib          firefox-particles  67684.50 (67720.43 0.05%) -> 78508.14 (78510.06 0.05%):  1.16x slowdown
                        ▏
                         xlib             grads-heat-map  175.32 (175.54 6.12%) -> 211.18 (211.68 4.02%):  1.20x slowdown
                        ▎
                         xlib       firefox-planet-gnome  6715.22 (6726.93 1.44%) -> 8711.32 (8755.91 0.66%):  1.30x slowdown
                        ▎
                         xlib          firefox-talos-gfx  15171.14 (15172.71 0.35%) -> 22849.30 (22902.49 0.19%):  1.51x slowdown
                        ▌
                         xlib                  evolution  1378.87 (1382.31 6.99%) -> 2249.10 (2250.09 2.28%):  1.63x slowdown
                        ▋
                         xlib                    poppler  1524.61 (1559.51 9.64%) -> 2549.49 (2576.42 1.60%):  1.67x slowdown
                        ▋
                         xlib             poppler-reseau  460.96 (461.49 1.55%) -> 899.66 (900.38 0.89%):  1.95x slowdown
                        █
                         xlib         gnome-terminal-vim  2255.49 (2260.89 1.33%) -> 5221.35 (5224.08 0.53%):  2.31x slowdown
                        █▍
                         xlib                  ocitysmap  1198.46 (1199.31 1.59%) -> 2847.14 (2848.15 1.12%):  2.38x slowdown
                        █▍



                        • #87
                          How is this coming along? I have the 6320, two years later, and it's running incredibly slow.
