E-450 graphics performance issues

  • #31
    Anyway, the TL;DR version: I'm not particularly familiar with Radeon GPU programming, but I'd really like to improve the power management situation, and I'd appreciate some pointers on where to start looking. Or better yet, some bits and pieces of documentation.



    • #32
      Originally posted by brent View Post
      About clock gating: Well, I don't really see what's so very hard about that. There must be some internal documentation, or at least code in fglrx. Power management is a pretty isolated feature set, and probably doesn't amount to a lot of code, either.
      Power management reaches into every corner of the chip and touches every driver component if you want to get all possible power savings. The power management code alone in the proprietary driver is bigger than the entire open source graphics driver stack, although it has been getting smaller over the last couple of GPU generations.

      Originally posted by brent View Post
      Just have someone sit down and at least give some basic pointers about how it is supposed to work, please? I browsed through the RV630 register documentation and family 14h BKDG, but there's no real pointer on where to start with GPU clock gating. I am tapping in the dark.
      Clock gating used to be fairly simple and independent of the other power management activities, and so it was supported in the open source drivers fairly early in the project. That is no longer the case -- as far as we can see clock gating pretty much requires everything else to be in place as well.

      Originally posted by brent View Post
      IMHO it's strange how everything related to CPU and NB is documented quite thoroughly, while the GPU is hardly documented at all.
      Not really... CPUs don't generally have "drivers" in the traditional sense -- most of the standard API is the instruction set and everything else is hard-coded in OS and BIOS by third parties, so excruciatingly detailed documentation is required. For GPUs, the standard APIs are much higher level at the top of a large driver stack -- OpenGL, DirectX, OpenCL etc... -- so the focus is on writing and supplying drivers rather than writing documentation for third party developers.

      It's not quite that black-and-white these days, since (a) HW vendors are helping more with the OS code and (b) community-developed open source graphics drivers are aiming a lot higher in functionality and feature set, but there's still a huge difference in "where the API is defined".



      • #33
        JB,

        When you say you're trying to figure out how the hardware works, what exactly do you mean?

        A: Joe Engineer designed that part of the hardware, but Engineer was reallocated to another team and is not available for consultation?
        B: The design departments are not forthcoming with their documentation?
        C: The documentation never existed, or does not exist in a form that can be distributed (the documentation exists as comments in a SPICE model, for example)?
        D: All of the above, and some others, and the full story would make me cry?

        F



        • #34
          More like:

          A: Joe Engineer designed that part of the hardware but moved to another department 18 months ago and is on a month-long sabbatical... when he gets back he is available for consultation but only remembers the stuff you already know and not much else...

          B: The design departments are forthcoming with the documentation, but the documentation is oriented towards designing hardware, not designing software... remember the software was designed three years ago in conjunction with the hardware team and both teams are now working on hardware 3-4 generations past Evergreen...

          C: The internal documentation is not written to be distributed; it's intended to make sure the hardware design is complete, robust, testable and implementable, but it's nothing at all like what driver developers would want. The hardware documentation is all about how the hardware internals are going to work, and the software documentation is all about how the Catalyst internals are going to work.

          D: Between the documentation, picking Joe Engineer's brains, and picking through diagnostics / BIOS / driver code, it still doesn't work the way you expect... again, this is why getting open source driver development aligned with hardware and proprietary driver development is such a priority.

          That said, I think the "figuring out" part is mostly in the past and now we're at the "figuring out if we can release it" stage.
          Last edited by bridgman; 07-15-2012, 10:50 PM.



          • #35
            And now for something completely different: I'm still trying to figure out why 2D performance is so bad. Even with the power state switching fixed, 2D performance is noticeably worse than on an Atom netbook with GMA 950 - a GPU that is much slower than the Radeon 6320 and also much less capable. I don't have an Atom netbook available at the moment, unfortunately. However, x11perf -scroll500 just achieves ~500 iterations per second, -comppixwin500 does ~1300 per second. A PC with an older NVidia GPU is over an order of magnitude faster for both of these. This doesn't seem right, and it's not like this is only visible in benchmarks. Smooth scrolling in Firefox is sluggish/not smooth, and scrolling in gnome-terminal is very slow and laggy. Even worse, with acceleration disabled (Option "NoAccel" "true") everything is much faster, both in benchmarks and in practice.

            Is this the kind of performance to expect from the drivers or is there something funny going on with my system? What's the best way to debug this?
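
            (For anyone wanting to reproduce the comparison: the numbers above are plain x11perf runs, and acceleration can be switched off for a software-rendering baseline using the xorg.conf option already mentioned. The Identifier string below is just a placeholder, so adjust it for your setup.)

            Section "Device"
                Identifier "card0"
                Driver     "radeon"
                Option     "NoAccel" "true"    # remove or comment out to re-enable acceleration
            EndSection

            $ x11perf -scroll500
            $ x11perf -comppixwin500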



            • #36
              Quick answer without investigation is that older GPUs had dedicated 2D hardware optimized for lines, circles, text drawing and scrolling. Modern GPUs don't have that (r5xx / rs6xx were the last ATI/AMD parts with 2D hardware, don't know for NVidia but probably the same timeframe, not sure about Intel). 3D engines could do most of the operations as fast as or faster than 2D hardware, and spending die area on the 3D engine aligned much better with what customers were looking for.

              Unfortunately scrolling (where source and destination rectangles overlap) turns out to be one of the harder things to do with the 3D engine. There was a lot of discussion during the initial driver support for r6xx and higher, finding the right balance between "go fast" and "no corruption". The technical issue is that blits are done with a 3D engine by using a texture for the source rectangle and a render target for the destination rectangle. Each has its own cache, so writes to the destination rectangle don't update the cached copy of the source rectangle which you want when doing a blit. Even worse, the render operation is scattered across multiple SIMDs and there is no guarantee that the copy operations for one scanline will be completed before the operations for the next one begin. The hardware guarantees correctly ordered writes between triangles/quads (IIRC) but not within a single primitive.

              IIRC the algorithm of choice ended up breaking the rectangles into narrow horizontal slices the size of the vertical scroll distance, blitting rectangles one at a time, and flushing caches between rectangles. That was reliable and reasonably fast; not sure if any improvement since then has been figured out.

              Note that the above information could be 3 years out of date.
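
              To make the structure concrete, here's a rough standalone sketch in C of the slicing approach described above. blit_rect() and flush_2d_caches() are made-up printf stubs standing in for the real driver hooks; this is not the actual radeon EXA code, only the slicing/flushing structure is the point:

              #include <stdio.h>

              static void blit_rect(int src_y, int dst_y, int h)
              {
                  /* real driver: bind the source rect as a texture, the destination
                   * rect as a render target, and draw a quad to do the copy */
                  printf("blit rows %3d..%3d -> %3d..%3d\n",
                         src_y, src_y + h - 1, dst_y, dst_y + h - 1);
              }

              static void flush_2d_caches(void)
              {
                  /* real driver: flush the destination cache, invalidate the texture
                   * cache and wait, so the next slice's writes cannot land before
                   * this slice's reads have finished */
                  printf("flush caches / wait\n");
              }

              /* Copy 'height' rows upward from src_y to dst_y (dst_y < src_y),
               * split into slices no taller than the scroll distance so that
               * source and destination never overlap within a single draw. */
              static void overlapping_copy(int src_y, int dst_y, int height)
              {
                  int dist = src_y - dst_y;     /* vertical scroll distance */
                  int done = 0;

                  while (done < height) {
                      int slice = height - done;
                      if (slice > dist)
                          slice = dist;         /* a slice never overlaps itself */

                      blit_rect(src_y + done, dst_y + done, slice);
                      flush_2d_caches();        /* next slice writes rows this one read */

                      done += slice;
                  }
              }

              int main(void)
              {
                  overlapping_copy(16, 0, 100); /* scroll a 100-row area up 16 rows */
                  return 0;
              }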



              • #37
                How is the CPU involved in that? I can easily push the processor usage on both cores of my laptop to 60% just by opening a terminal and holding down the Enter key. According to your description this shouldn't happen, or am I wrong?



                • #38
                  I don't think my description talked about CPU load, did it ?

                  IIRC all that rectangle-blitting and cache-flushing (and waiting for cache flushing) did eat up some CPU time, but not sure if 60% is reasonable.



                  • #39
                    Originally posted by bridgman View Post
                    Yeah, but if that's the case I wish people would actually say that. I might even agree
                    It is to be commended that AMD has an OSS policy and is releasing documentation that helps in this regard. Yes, it's a pity that the OSS Radeon driver is so far behind in terms of performance and features, and both of those are mostly due to the incompleteness of the documentation. And it is amazing what the Nouveau people can achieve without any documentation at all! But I'm sticking with AMD, precisely because of their Linux policy. It might not bear fruit quickly, but it's something. As a mostly-Linux user I try to buy hardware that has good Linux support, and when I buy graphics solutions I buy AMD even if it is less powerful - the AMD attitude towards OSS makes up for it.

                    The reason I'm writing this is that said OSS policy doesn't seem to be enforced too vigorously. That is of course just an outside perspective, but it really doesn't look like it's important to AMD. Again, that's just how it looks. But it should be, because there are people out there who make purchasing decisions based on that. I know they aren't too many, but I'd like to believe they are getting more numerous. So bridgman, please tell this to your superiors. Your efforts are not in vain. If only things could happen a bit quicker...



                    • #40
                      Originally posted by bridgman View Post
                      Quick answer without investigation is that older GPUs had dedicated 2D hardware optimized for lines, circles, text drawing and scrolling. Modern GPUs don't have that (r5xx / rs6xx were the last ATI/AMD parts with 2D hardware, don't know for NVidia but probably the same timeframe, not sure about Intel). 3D engines could do most of the operations as fast as or faster than 2D hardware, and spending die area on the 3D engine aligned much better with what customers were looking for.
                      Yes, I am aware of that, but on Intel and NVidia hardware, 2D rendering through the 3D engine does not have these performance issues.

                      Originally posted by bridgman View Post
                      Unfortunately scrolling (where source and destination rectangles overlap) turns out to be one of the harder things to do with the 3D engine. There was a lot of discussion during the initial driver support for r6xx and higher, finding the right balance between "go fast" and "no corruption". The technical issue is that blits are done with a 3D engine by using a texture for the source rectangle and a render target for the destination rectangle. Each has its own cache, so writes to the destination rectangle don't update the cached copy of the source rectangle which you want when doing a blit. Even worse, the render operation is scattered across multiple SIMDs and there is no guarantee that the copy operations for one scanline will be completed before the operations for the next one begin. The hardware guarantees correctly ordered writes between triangles/quads (IIRC) but not within a single primitive.
                      Interesting, I'll take a look at the EXA code.

                      Anyway, I guess what baffles me the most is that most people have such low expectations that they are fine with drivers that perform this badly... and even go so far as to say that 2D performs great! This is my first AMD GPU in years, and I only got it because people assured me that the open source drivers perform well now.



                      • #41
                        Originally posted by bridgman View Post
                        IIRC the algorithm of choice ended up breaking the rectangles into narrow horizontal slices the size of the vertical scroll distance, blitting rectangles one at a time, and flushing caches between rectangles. That was reliable and reasonably fast; not sure if any improvement since then has been figured out.
                        FYI, the current algorithm seems to be a bit better - it does the copy in two steps with a temporary buffer. Still doesn't explain why it is so slow.



                        • #42
                          I have performance issues on r200 as well, for example the lines test from gtkperf - with DRI disabled it is good... the same goes for the x11perf tests... and all of that compared against UMS/EXA. I think it is something to do with loading new content; I see the same behaviour when loading some games - it needs many seconds, 10 or more, to load a new scene (I don't know how to explain it - it loads, but slowly, as if it were using swrast for those 10+ seconds). Again, when I compare that with UMS/EXA and Mesa 7.5.2, everything is fine and smooth.

                          Maybe something is set up wrongly in KMS, or in EXA with KMS, who knows.

                          Also, textured video has some kind of staircase effect, diagonal tearing maybe... So these are the main bugs for me.

                          And some alternative GUI toolkits are much slower: scrolling in FLTK, minimizing/maximizing windows in the FOX toolkit, menus in SoftMaker Office, etc...
                          Last edited by dungeon; 07-16-2012, 06:08 AM.



                          • #43
                            Originally posted by brent View Post
                            FYI, the current algorithm seems to be a bit better - it does the copy in two steps with a temporary buffer. Still doesn't explain why it is so slow.
                            Well, OK, it's simple: the whole command stream is flushed two times for that. No batching = slow.
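
                            Schematically, with made-up stub names in place of the real EXA hooks (copy_rect() and flush_cs() below are just printf stubs, not the actual driver API), the two-pass version looks something like this:

                            #include <stdio.h>

                            /* stand-in for the real blit hook */
                            static void copy_rect(const char *dst, int dst_y,
                                                  const char *src, int src_y, int rows)
                            {
                                printf("copy %d rows: %s@%d -> %s@%d\n",
                                       rows, src, src_y, dst, dst_y);
                            }

                            /* stand-in for submitting the command stream and waiting;
                             * doing this twice per scroll, with no batching, is the
                             * expensive part */
                            static void flush_cs(void)
                            {
                                printf("flush command stream\n");
                            }

                            int main(void)
                            {
                                int src_y = 16, dst_y = 0, rows = 100;

                                copy_rect("tmp", 0, "screen", src_y, rows);   /* pass 1 */
                                flush_cs();                                   /* flush #1 */
                                copy_rect("screen", dst_y, "tmp", 0, rows);   /* pass 2 */
                                flush_cs();                                   /* flush #2 */
                                return 0;
                            }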



                            • #44
                              Originally posted by bridgman View Post
                              I don't think my description talked about CPU load, did it ?

                              IIRC all that rectangle-blitting and cache-flushing (and waiting for cache flushing) did eat up some CPU time, but not sure if 60% is reasonable.
                              That was the impression I got from your description: that all the work is done on the video chip. I just ran the test on my 6-core Phenom with an HD6870 and I get similar results (all cores between 17% and 25%). Playing a YouTube video without hardware acceleration uses less CPU.
                              Something is weird with my hardware or your drivers.



                              • #45
                                The problem is mostly twofold, and I've already addressed both points on various threads:
                                1. Modern toolkits are using more advanced RENDER features. It's not possible to accelerate these and still be RENDER spec compliant on older GPUs because RENDER semantics don't map well to 3D hardware. It is possible to accelerate them on modern GPUs, but the complexity starts to rival a 3D driver. In that case it starts to make more sense to take advantage of the 3D driver (better state tracking, integrated shader compiler) with something like glamor.
                                2. EXA was designed years ago and does not provide the necessary infrastructure to accelerate more advanced RENDER features without an overhaul.
                                Thus, you end up with SW fallbacks for certain operations, which means data ping-ponging between GPU and CPU buffers, which almost always ends up being slower than pure CPU rendering or pure GPU rendering. You can try the glamor support in git, which should improve things going forward as glamor picks up support for accelerating more and more operations using OpenGL.
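
                                If your xf86-video-ati build includes glamor support, enabling it should (as far as I know) just be an xorg.conf option along these lines - check the driver's man page for your version, since the exact option name and accepted values may differ:

                                Section "Device"
                                    Identifier "card0"
                                    Driver     "radeon"
                                    Option     "AccelMethod" "glamor"
                                EndSection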

