E-450 graphics performance issues


  • #61
    @bridgeman the push for increased GL version support probably should have been expected, considering that many open source developers have an implement-first, make-it-fast-later mentality, and that implementing is easier than optimizing. I think it might also be important for getting the driver up to feature parity so that driver development can occur in sync with hardware design.


    • #62
      Yep, it's a sad fact that some 2d operations on the 3d engine suck. Add to that the likely better algorithms in the blobs.

      When R600 and GF8 were released, this was publicly known and anyone who needed good 2d just bought the previous gen (and this was on Windows too!). It's my understanding that even the latest gen loses in 2d to the cards with dedicated 2d, such as R500, GF7, or old cards such as the Matrox ones. (Yep, tab switching on a Matrox is still faster than on a recent AMD, as is software rendering with vesa.)


      Are you using a compositor anyhow? I tried to replicate your enter-pressing test, but all it did was bring X from 1% of one core to 3% of one core. But I'm not using a compositor, nor a bloated terminal such as Gnome terminal (mrxvt if you're curious, with antialiased fonts etc).


      • #63
        Originally posted by curaga View Post
        Yep, it's a sad fact that some 2d operations on the 3d engine suck. Add to that the likely better algorithms in the blobs.

        When R600 and GF8 were released, this was publicly known and anyone who needed good 2d just bought the previous gen (and this was on Windows too!). It's my understanding that even the latest gen loses in 2d to the cards with dedicated 2d, such as R500, GF7, or old cards such as the Matrox ones. (Yep, tab switching on a Matrox is still faster than on a recent AMD, as is software rendering with vesa.)


        Are you using a compositor anyhow? I tried to replicate your enter-pressing test, but all it did was bring X from 1% of one core to 3% of one core. But I'm not using a compositor, nor a bloated terminal such as Gnome terminal (mrxvt if you're curious, with antialiased fonts etc).
        It's not really an issue with 2D vs. 3D engines. 2D engines suck for RENDER too. The reason vesa or old drivers seem faster for certain things is because they use shadowfb or XAA (which ends up being shadowfb because offscreen acceleration has been disabled for years due to bit rot in XAA). Shadowfb is pure software rendering. Pure CPU rendering is almost always faster than mixed CPU/GPU rendering since there is no ping-ponging between GPU and CPU rendering. You can enable shadowfb in the radeon driver if you want to compare by setting Option "NoAccel" "True" in the device section of your xorg config.


        • #64
          In other words, too many fallbacks, and something like SNA for radeon should be done?


          • #65
            Originally posted by curaga View Post
            In other words, too many fallbacks, and something like SNA for radeon should be done?
            or glamor.


            • #66
              I've now done a fair share of profiling and tracing. Fallbacks aren't the main issue that's holding back 2D performance. Everything important is accelerated, generally there's no migration ping-pong, etc. It's just that acceleration is very slow, and defunct power management that is forcing the GPU clock to the lowest and slowest power state doesn't help.

              I'm not quite sure why rendering is so slow, as I'm not very familiar with the R600+ architecture. Part of it seems to be synchronous CS flushing, at least.


              • #67
                What kind of functions are you benchmarking? Some things, like overlapping blits, require frequent flushes to keep the texture and CB caches consistent in the overlap areas.


                • #68
                  Yes, overlapped Copy operations are slow, and sometimes unnecessarily so*. However, it's also slow without many flushes, for instance when doing a lot of small Composite operations; a good example is text rendering in gnome-terminal. I've experimented with increasing the size of the VBO, and that seems to help a little, but not much. Generally, I use cairo-perf-trace to benchmark.

                  * When doing a copy inside a single pixmap, the DDX does a two-stage copy with two flushes even if the areas don't overlap. In this case no copy to the temporary is needed and one flush is enough. I've fixed that in my tree and it's noticeably faster in some cases, e.g. scrolling in gedit.
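To make the footnote concrete, here is a minimal sketch in C of the decision it describes: only stage through a temporary (two flushes) when source and destination really overlap inside the same pixmap, otherwise copy directly with one flush. The `rect` and `copy_plan` types and the function names are hypothetical, made up for illustration; this is not the actual DDX code.

```c
#include <stdbool.h>

/* Hypothetical rectangle type, not the DDX's own structures. */
struct rect { int x, y, w, h; };

/* Hypothetical result: how the copy should be carried out. */
struct copy_plan {
    bool use_temp;  /* stage the copy through a temporary surface? */
    int  flushes;   /* number of cache flushes the plan needs */
};

/* True if the two rectangles share at least one pixel. */
static bool rects_overlap(struct rect a, struct rect b)
{
    return a.x < b.x + b.w && b.x < a.x + a.w &&
           a.y < b.y + b.h && b.y < a.y + a.h;
}

/* Plan a copy where source and destination live in the same pixmap. */
static struct copy_plan plan_copy_within_pixmap(struct rect src, struct rect dst)
{
    /* Default: direct copy, a single flush is enough. */
    struct copy_plan plan = { false, 1 };

    if (rects_overlap(src, dst)) {
        /* Overlap: copy to a temporary first, then to the destination. */
        plan.use_temp = true;
        plan.flushes  = 2;
    }
    return plan;
}
```

Scrolling a window by more than its own height would take the direct path here, while scrolling by a few lines would still need the temporary.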


                  • #69
                    Originally posted by brent View Post
                    * The DDX does a two-stage copy blit with two flushes in a single pixmap even if the areas don't overlap. In this case no copy to the temporary is needed and one flush is enough. I've fixed that in my tree and it's noticeably faster in some cases, e.g. scrolling in gedit.
                    Yeah, that's right... IIRC you said that previously.


                    • #70
                      Not quite; at the time I hadn't noticed that it uses a temporary even when one isn't needed!


                      • #71
                        Originally posted by brent View Post
                        I've now done a fair share of profiling and tracing. Fallbacks aren't the main issue that's holding back 2D performance. Everything important is accelerated, generally there's no migration ping-pong, etc. It's just that acceleration is very slow, and defunct power management that is forcing the GPU clock to the lowest and slowest power state doesn't help.
                        Just turning on trace fallbacks in the driver is not enough. You actually need to profile to see where time is being spent. There are a number of things that EXA does not even attempt to accelerate so you won't see fallbacks in the driver; the core EXA code just does the operation with the CPU.

                        Originally posted by brent View Post
                        I'm not quite sure why rendering is so slow, as I'm not very familiar with the R600+ architecture. Part of it seems to be synchronous CS flushing, at least.
                        I'm not sure what you mean by that. Can you give an example? You need to synchronize caches when the domain switches between GPU and CPU or when a read or write domain changes. E.g., if you've been using the GPU to write to a surface and then you need to access it with the CPU, you'll need to flush the GPU destination caches and then potentially migrate the surface to a buffer the CPU can access (for example, if the surface is tiled, you need to blit it to a linear surface, or if the surface is in a part of vram that is not accessible by the CPU, you need to migrate it to CPU-accessible vram or system memory). The other case is if you render to a surface with the GPU and then need to read from the surface with the GPU in a subsequent operation. You need to flush the destination caches in between. Overlapping surfaces are handled with a temp surface so there's no extra syncing involved. Depending on the operation, some of it happens explicitly in the ddx, some of it is handled in the kernel.
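The two flush cases described above can be modelled roughly as follows. This is only a sketch under stated assumptions: the `domain`, `access`, and `surface` types and `surface_begin_access()` are hypothetical names, and the model tracks nothing except whether the GPU destination caches need a flush. It ignores migration, tiling, and whatever the kernel does on its own.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping, for illustration only. */
enum domain { DOMAIN_CPU, DOMAIN_GPU };
enum access { ACCESS_READ, ACCESS_WRITE };

struct surface {
    enum domain last_domain;  /* who touched the surface last */
    enum access last_access;  /* and how */
};

/*
 * Returns true if the GPU destination caches must be flushed before the
 * requested access, then records that access as the most recent one.
 */
static bool surface_begin_access(struct surface *s, enum domain d, enum access a)
{
    bool gpu_wrote = s->last_domain == DOMAIN_GPU &&
                     s->last_access == ACCESS_WRITE;
    bool need_flush = false;

    if (gpu_wrote && d == DOMAIN_CPU)
        need_flush = true;   /* GPU wrote, CPU wants access */
    else if (gpu_wrote && d == DOMAIN_GPU && a == ACCESS_READ)
        need_flush = true;   /* GPU wrote, GPU now reads the result */

    s->last_domain = d;
    s->last_access = a;
    return need_flush;
}
```

Back-to-back GPU writes to the same surface report no flush in this model, which is why a long run of small operations is not necessarily flush-bound.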


                        • #72
                          Originally posted by agd5f View Post
                          Just turning on trace fallbacks in the driver is not enough. You actually need to profile to see where time is being spent. There are a number of things that EXA does not even attempt to accelerate so you won't see fallbacks in the driver; the core EXA code just does the operation with the CPU.
                          I did more than just enabling tracing/fallback reporting, of course.

                          Anyway, the kind of fallback you describe should be quite efficient (data flows only in one direction, no ping-pong), no? Also, it does not explain performance that is an order of magnitude slower than pixman and scales (nearly) linearly with GPU clock.

                          I'm not sure what you mean by that. Can you give an example? You need to synchronize caches when the domain switches between GPU and CPU or when a read or write domain changes.
                          I mean that radeon_cs_emit() is done synchronously. Mesa seems to do it in a worker thread. radeon_cs_emit() can take a few milliseconds to complete, so I guess that's worthwhile.
                          Last edited by brent; 07-26-2012, 02:14 PM.
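For readers wondering what moving the emit off the rendering thread could look like, here is a rough pthreads sketch. Everything in it is hypothetical: `cs_queue` and its functions are invented for illustration, they are not the libdrm radeon_cs API, and this is not how Mesa actually implements it. The slow emit call is only marked by a comment.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical submission queue; a counter stands in for real buffers. */
struct cs_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    pthread_t       worker;
    int  pending;   /* submissions waiting for the worker */
    int  emitted;   /* submissions the worker has finished */
    bool quit;
};

/* Worker: drains pending submissions so the caller never blocks on emit. */
static void *cs_worker(void *arg)
{
    struct cs_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->pending == 0 && !q->quit)
            pthread_cond_wait(&q->cond, &q->lock);
        if (q->pending == 0)
            break;                      /* quit requested, queue drained */
        q->pending--;
        pthread_mutex_unlock(&q->lock);
        /* The slow, blocking emit call would go here, outside the lock. */
        pthread_mutex_lock(&q->lock);
        q->emitted++;
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Start the worker; returns 0 on success, like pthread_create(). */
static int cs_queue_start(struct cs_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cond, NULL);
    q->pending = 0;
    q->emitted = 0;
    q->quit = false;
    return pthread_create(&q->worker, NULL, cs_worker, q);
}

/* Hand one command stream to the worker and return immediately. */
static void cs_queue_submit(struct cs_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->pending++;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Drain the queue, stop the worker, and return how many were emitted. */
static int cs_queue_stop(struct cs_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->quit = true;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
    pthread_join(q->worker, NULL);
    pthread_mutex_destroy(&q->lock);
    pthread_cond_destroy(&q->cond);
    return q->emitted;
}
```

A real driver would also have to wait for the worker before reusing a buffer that is still in flight, which this sketch leaves out.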


                          • #73
                            Looking forward to that patch in the ddx


                            I've been trying to figure out the mystery that is glxgears for a bit, personally. It only uses 80% of one core, and the gpu too is only at 70%. What the heck would cause it to be neither gpu nor cpu limited?

                            I did find that the radeon DRM code has several dozen functions that should be static (or static inline) and aren't, as well as a lot of places where a missing const could be added to produce slightly better code. But adding a few static declarations for the most-used functions didn't improve gears fps, so I haven't bothered to submit those.


                            • #74
                              Well, function call overhead is probably the smallest problem in the whole stack.


                              • #75
                                Originally posted by brent View Post
                                Well, function call overhead is probably the smallest problem in the whole stack.
                                Nope, the CS checking part takes 5-8% of glxgears' CPU time, and it calls a lot of functions. But it's not the bottleneck here, even though it's the number one CPU user.
