2d tiling + sb -> no improvement in fill rate, curious

  • 2d tiling + sb -> no improvement in fill rate, curious

    After upgrading my ddx, I finally got 2D tiling working on my RV710. It was supposed to be the thing that increases fill rate on bandwidth-limited cards.

    The mesa-demos fill bench had the exact same numbers with and without 2d tiling. Adding SB on top of 2d tiling improved some numbers, but that too had some curious results in the last test.

    This card, according to specs, is capable of 2.3 gigapixels/sec. It has only gotten about half that on the open drivers for years. Tiling was supposed to improve it, but it didn't. Any ideas on why it made no difference are welcome.

    Everything was measured on the default power profile, which equals high profile on this card.

    3.7.10, mesa 9.1.1, ddx 7.1.0, libdrm 2.4.44

    The numbers, both with and without 2d tiling:
    Simple fill: 1.3 billion pixels/second
    Blended fill: 1.1 billion pixels/second
    Textured fill: 1.1 billion pixels/second
    Shader1 fill: 1.1 billion pixels/second
    Shader2 fill: 543.8 million pixels/second
    With SB:
    Simple fill: 1.3 billion pixels/second
    Blended fill: 1.1 billion pixels/second
    Textured fill: 1.2 billion pixels/second
    Shader1 fill: 1.2 billion pixels/second
    Shader2 fill: 588.0 million pixels/second
    SB gave some minor improvement. However, note the shader2 value: almost exactly half of shader1.

    Shader2 consists of shader1 + many no-ops that should be optimized out. By printing the results with R600_DEBUG=sb,sbstat,ps I could see both shaders were optimized to the exact same instructions.


    So, we have two curious things here:
    - why is the fillrate still only half of hw ability
    - why is the exact same shader half the speed, when only the pre-optimized shader differs
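
To put numbers on the two questions above, here is a small sketch that only redoes the arithmetic from this post. The rates are copied from the benchmark output as printed (rounded to one decimal place), so the ratios are approximate; the helper names are made up for illustration.

```python
# Figures copied from the post above, in pixels per second.
SPEC_RV710 = 2.3e9  # spec figure quoted in the post

without_sb = {
    "Simple": 1.3e9, "Blended": 1.1e9, "Textured": 1.1e9,
    "Shader1": 1.1e9, "Shader2": 543.8e6,
}
with_sb = {
    "Simple": 1.3e9, "Blended": 1.1e9, "Textured": 1.2e9,
    "Shader1": 1.2e9, "Shader2": 588.0e6,
}


def fraction_of_spec(rate, spec=SPEC_RV710):
    """Measured rate as a fraction of the quoted hardware spec."""
    return rate / spec


def shader2_over_shader1(results):
    """How fast Shader2 runs relative to Shader1 in one result set."""
    return results["Shader2"] / results["Shader1"]


for label, results in (("without SB", without_sb), ("with SB", with_sb)):
    print(f"{label}: simple fill at {fraction_of_spec(results['Simple']):.0%} of spec, "
          f"Shader2/Shader1 = {shader2_over_shader1(results):.2f}")
```

Both result sets put simple fill at a bit over half of the quoted spec, and Shader2 at roughly half of Shader1.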

  • #2
    Noting that the SB test was done on mesa git, not 9.1.1.


    • #3
      Originally posted by curaga
      - why is the exact same shader half the speed, when only the pre-optimized shader differs
      This test (Shader2 fill) is weird. I just ran those with compositing desktop and without:

      with:
      Code:
         Simple fill: 7.6 billion pixels/second
         Blended fill: 7.6 billion pixels/second
         Textured fill: 7.6 billion pixels/second
         Shader1 fill: 7.6 billion pixels/second
         Shader2 fill: 4.6 billion pixels/second
      without:
      Code:
         Simple fill: 7.6 billion pixels/second
         Blended fill: 7.6 billion pixels/second
         Textured fill: 7.6 billion pixels/second
         Shader1 fill: 7.6 billion pixels/second
         Shader2 fill: 3.8 billion pixels/second
      So it's slower when there's less other workload on the GPU!?


      • #4
        What's your card? I'm curious how far 7.6 is from the specs of the hw.


        • #5
          Originally posted by curaga
          What's your card? I'm curious how far 7.6 is from the specs of the hw.
          ATI Radeon HD5770 (Evergreen/Juniper). Spec is 12 AFAIK.

          /edit:
          I just checked the source of fill, and if you remove lines 181-184 (where it calls swap buffers every 128 iterations) I get this, and the output is still correct:

          Code:
             Simple fill: 8.2 billion pixels/second
             Blended fill: 7.9 billion pixels/second
             Textured fill: 8.1 billion pixels/second
             Shader1 fill: 8.4 billion pixels/second
             Shader2 fill: 5.2 billion pixels/second
          Shader2 is still slower but at least the other ones are closer to spec

          /edit2:
          and without the glFinish() it is still rendering correctly but the result is this:

          Code:
             Simple fill: 16.1 billion pixels/second
             Blended fill: 13.5 billion pixels/second
             Textured fill: 13.3 billion pixels/second
             Shader1 fill: 13.3 billion pixels/second
             Shader2 fill: 9.6 billion pixels/second
          which is above spec. But I'm not sure if it is allowed to do that :-D
          Last edited by droste; 05-26-2013, 05:29 PM.


          • #6
            Well without a swap, the driver is allowed to detect that you're overwriting the same buffer, and skip all rendering but the last.

            Removing the glFinish() gets you invalid results, since the timing is done on the CPU side.
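
A toy model of why that happens, for anyone following along. This is not GL code: `ToyGpu` is a hypothetical stand-in for an asynchronous command queue, and the 5 ms per draw is an arbitrary number. The point is only that a CPU-side timer stopped without waiting measures how fast commands were queued, not how fast they were executed.

```python
import queue
import threading
import time


class ToyGpu:
    """Hypothetical asynchronous command queue; not a real GL binding."""

    def __init__(self, seconds_per_draw):
        self._seconds_per_draw = seconds_per_draw
        self._commands = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Worker thread: execute queued draws one at a time.
        while True:
            self._commands.get()
            time.sleep(self._seconds_per_draw)  # pretend to rasterize
            self._commands.task_done()

    def draw(self):
        """Queue a draw and return immediately, like a buffered GL call."""
        self._commands.put("draw")

    def finish(self):
        """Block until every queued draw has completed (the glFinish() role)."""
        self._commands.join()


def measure_draws_per_second(gpu, draws, wait_for_gpu):
    """CPU-side timing of a draw loop, with or without waiting for completion."""
    start = time.perf_counter()
    for _ in range(draws):
        gpu.draw()
    if wait_for_gpu:
        gpu.finish()
    return draws / (time.perf_counter() - start)


# Separate ToyGpu instances so leftover work from one run cannot leak into the other.
honest = measure_draws_per_second(ToyGpu(0.005), 20, wait_for_gpu=True)
inflated = measure_draws_per_second(ToyGpu(0.005), 20, wait_for_gpu=False)
print(f"with finish: {honest:.0f} draws/s, without finish: {inflated:.0f} draws/s")
```

The run without the wait reports a far higher rate even though the simulated hardware is identical in both cases.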


            • #7
              Originally posted by curaga
              Well without a swap, the driver is allowed to detect that you're overwriting the same buffer, and skip all rendering but the last.
              Well yes. But if you want to benchmark this, the correct thing is either to swap after every draw or not to swap at all. What's the reasoning for swapping every 128th iteration?

              Originally posted by curaga
              Removing the glFinish() gets you invalid results, since the timing is done on the CPU side.
              Yeah makes sense.


              • #8
                Originally posted by droste
                Well yes. But if you want to benchmark this, the correct thing is either to swap after every draw or not to swap at all. What's the reasoning for swapping every 128th iteration?
                The comment says it's there to please old drivers, so I gather it assumes a driver that is both dumb (no overwrite check) and limited (no long queues).


                • #9
                  Testing another random bench: http://www.graphics.stanford.edu/cou...-fall/as1.html

                  This one renders straight to front, using fixed mode. It also gave 1.3Gpix/s, confirming this number.


                  • #10
                    Originally posted by curaga
                    The comment says it's there to please old drivers, so I gather it assumes a driver that is both dumb (no overwrite check) and limited (no long queues).
                    Yes, of course. My point is that it distorts the result. Still, none of this explains why Shader2 fill is so slow.

                    Originally posted by curaga
                    Testing another random bench: http://www.graphics.stanford.edu/cou...-fall/as1.html

                    This one renders straight to front, using fixed mode. It also gave 1.3Gpix/s, confirming this number.
                    Code:
                    --------------------------------------------------
                    Vendor:      X.Org
                    Renderer:    Gallium 0.4 on AMD JUNIPER
                    Version:     3.0 Mesa 9.2.0 (git-44a117a)
                    Visual:      RGBA=<8,8,8,0>  Z=<24>  double=1
                    Geometry:    800x800+7+28
                    Screen:      1920x1080
                    --------------------------------------------------
                    Fill Rate:      13466.93 MPix/second
                    Triangle Rate:  54.30 Mtri/second
                    For me it shows above-spec speed.


                    • #11
                      According to Wikipedia, the 5770's pixel fill rate is 13.6 Gpix/s, not 12.


                      • #12
                        OK, so it's slightly below spec (by 0.2). But I cannot confirm the results of the fill demo with the other benchmark. Now the question is which of them gives the correct result ;-)


                        • #13
                          How is the fillrate computed? If the measured time includes shader compilation, the overhead of shader optimization for the second shader may outweigh the benefit of tiling.


                          • #14
                            No, the shader is compiled before the measurement starts:

                            http://cgit.freedesktop.org/mesa/dem...rc/perf/fill.c

                            glUseProgram() is called before PerfMeasureRate() is called.

                            /edit:
                            and PerfShaderProgram() too, which calls the compile and link.
                            Last edited by droste; 05-28-2013, 06:38 PM.
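
The ordering described above (compile, link and bind first, then measure) can be sketched generically. This is not the code from fill.c or PerfMeasureRate(); `measure_rate` and its callbacks are hypothetical names, and the 50 ms sleep merely stands in for a slow shader compile.

```python
import time


def measure_rate(setup, draw, finish, iterations):
    """Return draws per second, timing only the draw loop (setup is excluded)."""
    setup()    # compile/link/bind happens here, before the clock starts
    finish()   # drain anything the setup step may have queued
    start = time.perf_counter()
    for _ in range(iterations):
        draw()
    finish()   # wait for the work to complete before stopping the clock
    return iterations / (time.perf_counter() - start)


events = []  # records call order so the sequence can be inspected


def slow_setup():
    events.append("setup")
    time.sleep(0.05)  # stand-in for an expensive shader compile


rate = measure_rate(
    setup=slow_setup,
    draw=lambda: events.append("draw"),
    finish=lambda: events.append("finish"),
    iterations=3,
)
print(events)
```

If the 50 ms setup were inside the timed region, three iterations could not exceed 60 per second; because it runs before the clock starts, the reported rate is far higher and reflects only the draw loop.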


                            • #15
                              Originally posted by curaga
                              Shader2 consists of shader1 + many no-ops that should be optimized out.
                              Even though there are no-ops, not all of them are optimized away with sb or llvm, though it seems llvm eliminates a bit more of them. I think that explains lower performance with shader2.

                              Originally posted by curaga
                              By printing the results with R600_DEBUG=sb,sbstat,ps I could see both shaders were optimized to the exact same instructions.
                              There are also some additional shaders in the dump (for blits etc.) besides the shaders explicitly requested by the app, so it's not always easy to say which bytecode in the dump belongs to which of the app's shaders. Possibly you looked at the wrong shaders. As far as I can see, the bytecode for shader2 is in fact still longer even after optimizations.

                              As for the other results, did you turn vsync off (vblank_mode=0)? By default I get 7.6 with simple fill on my HD5750; with vsync off I get 10.9, which is pretty close to the 11.2 in the card specs on amd.com.
