E-450 graphics performance issues

  • #71
    Originally posted by brent View Post
    I've now done a fair share of profiling and tracing. Fallbacks aren't the main issue that's holding back 2D performance. Everything important is accelerated, generally there's no migration ping-pong, etc. It's just that acceleration is very slow, and defunct power management that is forcing the GPU clock to the lowest and slowest power state doesn't help.
    Just turning on trace fallbacks in the driver is not enough. You actually need to profile to see where time is being spent. There are a number of things that EXA does not even attempt to accelerate so you won't see fallbacks in the driver; the core EXA code just does the operation with the CPU.

    Originally posted by brent View Post
    I'm not quite sure why rendering is so low as I'm not very familiar with the R600+ architecture. Part of it seems to be synchronous CS flushing, at least.
    I'm not sure what you mean by that. Can you give an example? You need to synchronize caches when the domain switches between GPU and CPU or when a read or write domain changes. E.g., if you've been using the GPU to write to a surface and then you need to access it with the CPU, you'll need to flush the GPU destination caches and then potentially migrate the surface to a buffer the CPU can access (for example, if the surface is tiled, you need to blit it to a linear surface, or if the surface is in a part of VRAM that is not accessible by the CPU, you need to migrate it to CPU-accessible VRAM or system memory). The other case is if you render to a surface with the GPU and then need to read from the surface with the GPU in a subsequent operation. You need to flush the destination caches in between. Overlapping surfaces are handled with a temp surface, so there's no extra syncing involved. Depending on the operation, some of it happens explicitly in the ddx, and some of it is handled in the kernel.
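    The decision path described here can be sketched as a toy C function. All structure and names below are illustrative, not the actual radeon ddx code:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical surface state -- fields are illustrative only. */
    struct surface {
        bool gpu_dirty;   /* GPU wrote to it; destination caches not yet flushed */
        bool tiled;       /* stored in a tiled layout */
        bool cpu_visible; /* placed in CPU-accessible VRAM or GTT */
    };

    enum cpu_access_path {
        PATH_BLIT_TO_LINEAR, /* detile via GPU blit into a linear staging buffer */
        PATH_MIGRATE,        /* move to CPU-visible VRAM or system memory */
        PATH_MAP_DIRECT      /* just wait for the GPU and map in place */
    };

    /* Sketch of preparing a GPU-written surface for CPU access. */
    static enum cpu_access_path prepare_cpu_access(struct surface *s)
    {
        if (s->gpu_dirty)
            s->gpu_dirty = false; /* flush GPU destination caches first */

        if (s->tiled)
            return PATH_BLIT_TO_LINEAR;
        if (!s->cpu_visible)
            return PATH_MIGRATE;
        return PATH_MAP_DIRECT;
    }
    ```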



    • #72
      Originally posted by agd5f View Post
      Just turning on trace fallbacks in the driver is not enough. You actually need to profile to see where time is being spent. There are a number of things that EXA does not even attempt to accelerate so you won't see fallbacks in the driver; the core EXA code just does the operation with the CPU.
      I did more than just enabling tracing/fallback reporting, of course.

      Anyway, the kind of fallback you describe should be quite efficient (data flows only in one direction, no ping-pong), no? Also, it does not explain performance that is an order of magnitude slower than pixman and scales (nearly) linearly with GPU clock.

      I'm not sure what you mean by that. Can you give an example? You need to synchronize caches when the domain switches between GPU and CPU or when a read or write domain changes.
      I mean that radeon_cs_emit() is done synchronously. Mesa seems to do it in a worker thread. radeon_cs_emit() can take a few milliseconds to complete, so I guess that's worthwhile.
      Last edited by brent; 26 July 2012, 02:14 PM.
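      The asynchronous flush brent describes could look roughly like the following producer/consumer sketch, with the rendering thread queueing command buffers and a worker submitting them. This is a toy model, not Mesa's actual implementation; all names are made up:

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define QUEUE_DEPTH 8

      struct cs_queue {
          int bufs[QUEUE_DEPTH];          /* stand-ins for command buffers */
          size_t head, tail, count;
          bool shutdown;
          pthread_mutex_t lock;
          pthread_cond_t nonempty, nonfull;
          int emitted;                    /* buffers handed to the "kernel" */
      };

      /* Worker thread: drains the queue and does the slow submission. */
      static void *cs_worker(void *arg)
      {
          struct cs_queue *q = arg;
          pthread_mutex_lock(&q->lock);
          for (;;) {
              while (q->count == 0 && !q->shutdown)
                  pthread_cond_wait(&q->nonempty, &q->lock);
              if (q->count == 0 && q->shutdown)
                  break;
              q->head = (q->head + 1) % QUEUE_DEPTH;
              q->count--;
              q->emitted++;               /* the real ioctl would go here */
              pthread_cond_signal(&q->nonfull);
          }
          pthread_mutex_unlock(&q->lock);
          return NULL;
      }

      /* Rendering thread: queue the buffer and return almost immediately. */
      static void cs_emit_async(struct cs_queue *q, int buf)
      {
          pthread_mutex_lock(&q->lock);
          while (q->count == QUEUE_DEPTH) /* back-pressure if worker lags */
              pthread_cond_wait(&q->nonfull, &q->lock);
          q->bufs[q->tail] = buf;
          q->tail = (q->tail + 1) % QUEUE_DEPTH;
          q->count++;
          pthread_cond_signal(&q->nonempty);
          pthread_mutex_unlock(&q->lock);
      }
      ```

      The rendering thread only blocks when the queue is full, so a multi-millisecond submission no longer stalls every flush.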



      • #73
        Looking forward to that patch in the ddx


        I've been trying to figure out the mystery that is glxgears for a bit, personally. It only uses 80% of one core, and the GPU is only at 70%. What the heck would cause it to be neither GPU- nor CPU-limited?

        I did find that the radeon DRM code has several dozen functions that should be static (or static inline) and aren't, as well as a lot of missing const qualifiers that could result in slightly better code. But adding a few static declarations for the most-used functions didn't improve gears fps, so I haven't bothered to submit those.



        • #74
          Well, function call overhead is probably the smallest problem in the whole stack.



          • #75
            Originally posted by brent View Post
            Well, function call overhead is probably the smallest problem in the whole stack.
            Nope, the CS checking part takes 5-8% of glxgears' CPU time. And it calls a lot of functions. But it's not the bottleneck here, even though it's the nr. 1 CPU user.



            • #76
              I believe that, but I can't imagine function call overhead contributes notably to CS checking CPU load.



              • #77
                Originally posted by brent View Post
                I did more than just enabling tracing/fallback reporting, of course.

                Can you post your profiling output? Where is most of the time spent? What were you profiling?

                Originally posted by brent View Post
                Anyway, the kind of fallback you describe should be quite efficient (data flows only in one direction, no ping-pong), no? Also, it does not explain performance that is an order of magnitude slower than pixman and scales (nearly) linearly with GPU clock.

                It depends on the surface. If you want to read/write to a tiled surface with the CPU, you'll need to blit the surface to a linear buffer and then map it for the CPU to access it. If the surface is not tiled, you could just map it, but then the driver would have to wait for the GPU to finish with it before accessing it. Reading from uncached vram with the CPU is also slow.


                Originally posted by brent View Post
                I mean that radeon_cs_emit() is done synchronously. Mesa seems to do it in a worker thread. radeon_cs_emit() can take a few milliseconds to complete, so I guess that's worthwhile.
                That submits the command buffer to the kernel for processing by the GPU. It would be optimal to move it into a separate thread, but the overhead is not that great. Also, if this were the bottleneck, it wouldn't be tied to GPU clocks as you claim.



                • #78
                  Originally posted by agd5f View Post
                  Can you post your profiling output? Where is most of the time spent? What were you profiling?
                  I experimented with system profilers like oprofile and sysprof, but they weren't particularly useful. 2D rendering is IO/GPU-bound, not CPU-bound (unless fallbacks are hit). So in the end I placed timers to measure the wall-clock time of various functions throughout the code, and I'm still at it. I don't know if there is a better way to profile wall-clock time as opposed to CPU time. I don't think it's interesting to post the output; it's just a messy log with wall time per function printed.
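                  The wall-clock instrumentation described here can be as simple as bracketing a call with CLOCK_MONOTONIC reads, which also counts time spent blocked waiting on the GPU (time a CPU profiler attributes to nobody). A minimal sketch, with made-up macro names:

                  ```c
                  #include <assert.h>
                  #include <time.h>

                  /* Current wall-clock time in microseconds. */
                  static double wall_us(void)
                  {
                      struct timespec ts;
                      clock_gettime(CLOCK_MONOTONIC, &ts);
                      return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
                  }

                  /* Add the wall-clock duration of `call` to the accumulator `accum`. */
                  #define TIMED(accum, call) do {        \
                          double t0_ = wall_us();        \
                          call;                          \
                          (accum) += wall_us() - t0_;    \
                      } while (0)
                  ```

                  Wrapping each suspect function with TIMED() and printing the accumulators at exit gives exactly the kind of per-function wall-time log described above.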

                  One particular culprit I found is RADEONSolidPixmap. This function creates a special, small 1x1 scratch pixmap. These pixmaps are created and destroyed all the time for typical EXA Composite cases. A call to this function doesn't consume much CPU time, but it still takes about 70us on average on my system for creating the pixmap, mapping the BO, etc. I added a simple cache (quite a hack, proof of concept) for these scratch pixmaps, and this alone sped up the gnome-terminal-vim perf-trace by more than 3x.
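                  The caching idea amounts to keeping the 1x1 pixmaps (and their mapped BOs) alive and reusing them, so the ~70us create-and-map path is paid only once. A toy model, with names that are illustrative rather than the real RADEONSolidPixmap code:

                  ```c
                  #include <assert.h>
                  #include <stddef.h>

                  #define CACHE_SIZE 4

                  struct scratch_pixmap {
                      unsigned color;
                      int allocated; /* BO created and mapped */
                      int in_use;    /* currently referenced by a Composite op */
                  };

                  static struct scratch_pixmap cache[CACHE_SIZE];
                  static int expensive_allocs; /* counts the slow create+map path */

                  static struct scratch_pixmap *solid_pixmap_get(unsigned color)
                  {
                      /* Prefer a cached, idle pixmap: reuse is just a pixel write. */
                      for (int i = 0; i < CACHE_SIZE; i++) {
                          if (cache[i].allocated && !cache[i].in_use) {
                              cache[i].in_use = 1;
                              cache[i].color = color; /* rewrite the single pixel */
                              return &cache[i];
                          }
                      }
                      /* Otherwise pay the expensive pixmap create + BO map once. */
                      for (int i = 0; i < CACHE_SIZE; i++) {
                          if (!cache[i].allocated) {
                              expensive_allocs++;
                              cache[i].allocated = 1;
                              cache[i].in_use = 1;
                              cache[i].color = color;
                              return &cache[i];
                          }
                      }
                      return NULL; /* cache full; real code would allocate anyway */
                  }

                  static void solid_pixmap_put(struct scratch_pixmap *p)
                  {
                      p->in_use = 0; /* keep the BO alive for the next solid fill */
                  }
                  ```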


                  It depends on the surface. If you want to read/write to a tiled surface with the CPU, you'll need to blit the surface to a linear buffer and then map it for the CPU to access it. If the surface is not tiled, you could just map it, but then the driver would have to wait for the GPU to finish with it before accessing it. Reading from uncached vram with the CPU is also slow.
                  Yes, but I specifically meant the case where EXA completely circumvents acceleration (isn't that what EXA_MIXED_PIXMAPS is all about?).
                  In that case, I'd assume that EXA software-renders into a mapped linear GTT surface and afterwards uses it like any other pixmap for Copy/Composite/etc., which should be quite efficient. Well, or something like that.

                  That submits the command buffer to the kernel for processing by the GPU. It would be optimal to move it into a separate thread, but the overhead is not that great. Also, if this were the bottleneck, it wouldn't be tied to GPU clocks as you claim.
                  I never said it's the primary bottleneck, but it certainly is a bottleneck.



                  • #79
                    Originally posted by brent View Post
                    I experimented with system profilers like oprofile and sysprof, but they weren't particularly useful. 2D rendering is IO/GPU-bound, not CPU-bound (unless fallbacks are hit). So in the end I placed timers to measure the wall-clock time of various functions throughout the code, and I'm still at it. I don't know if there is a better way to profile wall-clock time as opposed to CPU time. I don't think it's interesting to post the output; it's just a messy log with wall time per function printed.

                    One particular culprit I found is RADEONSolidPixmap. This function creates a special, small 1x1 scratch pixmap. These pixmaps are created and destroyed all the time for typical EXA Composite cases. A call to this function doesn't consume much CPU time, but it still takes about 70us on average on my system for creating the pixmap, mapping the BO, etc. I added a simple cache (quite a hack, proof of concept) for these scratch pixmaps, and this alone sped up the gnome-terminal-vim perf-trace by more than 3x.
                    That's a good start. Ideally we'd just pass the solid color to the pixel shader as a constant, but that requires rewriting the composite pixel and vertex shaders, which at this point starts to get non-trivial without a shader compiler.



                    • #80
                      Originally posted by agd5f View Post
                      That's a good start. Ideally we'd just pass the solid color to the pixel shader as a constant, but that requires rewriting the composite pixel and vertex shaders, which at this point starts to get non-trivial without a shader compiler.
                      What are they written in?

