2d tiling + sb -> no improvement in fill rate, curious


  • #11
    According to Wikipedia, the 5770's pixel fill rate is 13.6 Gpixels/s, not 12.


    • #12
      OK, so it's slightly below spec (by 0.2). But I cannot confirm the fill demo's results with the other benchmark. Now the question is which of them is the correct result ;-)


      • #13
        How is the fill rate computed? If the measured time includes shader compilation, the overhead of optimizing the second shader may outweigh the benefit of tiling.


        • #14
          No, the shader is compiled before the measurement starts:

          glUseProgram() is called before PerfMeasureRate() is called.

          /edit:
          and so is PerfShaderProgram(), which does the compile and link.
          Last edited by droste; 28 May 2013, 06:38 PM.


          • #15
            Originally posted by curaga:
            Shader2 consists of shader1 + many no-ops that should be optimized out.
            Even though these are no-ops, not all of them are optimized away by sb or LLVM, though LLVM seems to eliminate a few more of them. I think that explains the lower performance with shader2.

            Originally posted by curaga:
            By printing the results with R600_DEBUG=sb,sbstat,ps I could see both shaders were optimized to the exact same instructions.
            Besides the shaders explicitly requested by the app, the dump also contains some additional shaders (for blits etc.), so it's not always easy to tell which bytecode in the dump belongs to which of the app's shaders. Possibly you looked at the wrong shaders. As far as I can see, the bytecode for shader2 is in fact still longer even after optimization.

            As for the other results, did you turn vsync off (vblank_mode=0)? By default I get 7.6 with simple fill on my HD 5750; without vsync I get 10.9, which is pretty close to the 11.2 in the card specs on amd.com.
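For reference, the spec-sheet figures quoted in this thread follow from a simple product: ROP count times core clock, at one pixel per ROP per clock. The sketch below assumes 16 ROPs for both cards and core clocks of 850 MHz (HD 5770) and 700 MHz (HD 5750). Those values come from the public spec sheets, not from this thread, and the function names are made up for illustration.

```python
def peak_fill_rate(rops, core_mhz):
    """Theoretical peak fill rate in Gpixels/s, assuming one pixel per ROP per clock."""
    return rops * core_mhz / 1000.0


def efficiency(measured, peak):
    """Fraction of the theoretical peak that a measurement reaches."""
    return measured / peak


# Assumed spec-sheet values (not stated in the thread): 16 ROPs on both cards.
hd5770_peak = peak_fill_rate(rops=16, core_mhz=850)
hd5750_peak = peak_fill_rate(rops=16, core_mhz=700)

if __name__ == "__main__":
    print(f"HD 5770 peak: {hd5770_peak:.1f} Gpixels/s")
    print(f"HD 5750 peak: {hd5750_peak:.1f} Gpixels/s")
    # Measurements quoted above for the HD 5750: 7.6 with vsync, 10.9 without.
    for measured in (7.6, 10.9):
        print(f"{measured} -> {efficiency(measured, hd5750_peak):.0%} of peak")
```

Under these assumptions the no-vsync result of 10.9 is about 97% of the HD 5750's theoretical peak, while the vsync-limited 7.6 is about 68%.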


            • #16
              Originally posted by droste:
              No, the shader is compiled before the measurement starts:

              glUseProgram() is called before PerfMeasureRate() is called.

              /edit:
              and PerfShaderProgram() too, which calls the compile and link.
              In fact, the final stages of shader compilation (from TGSI to hardware bytecode, including sb if it's enabled) are performed on first use, during the first draw call that uses the shader. Anyway, the fill demo performs a few iterations with an increasing number of draw calls until it gets reliable results, so in this case compilation time shouldn't affect the time of the last iteration.
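To illustrate why a one-off cost on the first draw washes out, here is a minimal sketch of a measurement loop that keeps doubling the number of draws until two successive rounds agree. It is not the demo's actual PerfMeasureRate() code; `measure_rate`, `FakeGpu`, the 5% tolerance, and the simulated costs are all assumptions chosen for the example.

```python
import time


def measure_rate(draw_batch, clock=time.perf_counter, tolerance=0.05, max_rounds=20):
    """Return draws per second once two successive rounds agree within `tolerance`.

    Each round doubles the number of draws, so a one-off cost paid on the
    first draw (such as late shader compilation) only distorts the early rounds.
    """
    count = 1
    prev_rate = None
    for _ in range(max_rounds):
        start = clock()
        draw_batch(count)
        elapsed = clock() - start
        if elapsed > 0:
            rate = count / elapsed
            if prev_rate is not None and abs(rate - prev_rate) <= tolerance * prev_rate:
                return rate
            prev_rate = rate
        count *= 2
    return prev_rate  # no convergence: fall back to the last rate seen


class FakeGpu:
    """Simulated device (hypothetical): the first draw pays a one-off compile cost."""

    def __init__(self, compile_cost=1.0, draw_cost=0.001):
        self.now = 0.0
        self.compiled = False
        self.compile_cost = compile_cost
        self.draw_cost = draw_cost

    def draw_batch(self, count):
        if not self.compiled:
            self.now += self.compile_cost  # deferred compile on first use
            self.compiled = True
        self.now += self.draw_cost * count

    def clock(self):
        return self.now


gpu = FakeGpu()
steady_rate = measure_rate(gpu.draw_batch, clock=gpu.clock)
```

In this simulation the first round is dominated by the compile cost, but the reported rate comes from later rounds, where only the per-draw cost remains.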


              • #17
                Ah yes, the vblanking... now it looks way better:

                Code:
                   Simple fill: 13.4 billion pixels/second
                   Blended fill: 13.4 billion pixels/second
                   Textured fill: 13.4 billion pixels/second
                   Shader1 fill: 13.4 billion pixels/second
                   Shader2 fill: 6.0 billion pixels/second
                which is exactly the same as the other fill test.

                But now the Shader2 test is way slower in comparison to the Shader1 test ;-)


                • #18
                  Originally posted by vadimg:
                  Besides the shaders explicitly requested by the app, the dump also contains some additional shaders (for blits etc.), so it's not always easy to tell which bytecode in the dump belongs to which of the app's shaders. Possibly you looked at the wrong shaders. As far as I can see, the bytecode for shader2 is in fact still longer even after optimization.
                  The dump only included two shaders that did texture fetches; these must therefore be the app's two shaders.

                  Originally posted by vadimg:
                  As for the other results, did you turn vsync off (vblank_mode=0)? By default I get 7.6 with simple fill on my HD 5750; without vsync I get 10.9, which is pretty close to the 11.2 in the card specs on amd.com.
                  All my measurements were done with vblank_mode=0.


                  • #19
                    Originally posted by droste:
                    But now the Shader2 test is way slower in comparison to the Shader1 test ;-)
                    Yes, a more complicated shader means more work for the GPU, and that means lower performance. That's why the shader2 test is not really a benchmark of maximum fill rate; it's more a benchmark of the quality of some specific optimizations in the shader compiler. Even with the proprietary compiler, the resulting shader has more ALU instructions than shader1: 14 ALU VLIW instructions instead of 4.
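One rough way to see why a longer shader lowers the measured fill rate: throughput is capped by whichever is slower, the ROPs or the ALUs. The sketch below is a back-of-the-envelope model, not a description of the driver or the fill demo. It assumes HD 5770 spec-sheet figures (16 ROPs, 800 stream processors grouped into 160 five-wide VLIW units, 850 MHz), one VLIW instruction per unit per clock, and it ignores texture fetch latency, export cost, and scheduling. All function names are made up.

```python
def rop_limit(rops, core_mhz):
    """ROP-bound fill rate in Gpixels/s, assuming one pixel per ROP per clock."""
    return rops * core_mhz / 1000.0


def alu_limit(vliw_units, core_mhz, vliw_instructions):
    """ALU-bound fill rate in Gpixels/s, assuming each VLIW unit issues one
    VLIW instruction per clock and every pixel runs the whole shader."""
    return vliw_units * core_mhz / 1000.0 / vliw_instructions


def estimated_fill_rate(vliw_instructions, rops=16, vliw_units=160, core_mhz=850):
    """Upper bound on fill rate: the lower of the ROP and ALU limits."""
    return min(
        rop_limit(rops, core_mhz),
        alu_limit(vliw_units, core_mhz, vliw_instructions),
    )


if __name__ == "__main__":
    for n in (4, 14):  # instruction counts quoted above for shader1 and shader2
        print(f"{n} VLIW instructions -> at most {estimated_fill_rate(n):.1f} Gpixels/s")
```

With 4 VLIW instructions this model is ROP-bound at 13.6; with 14 it becomes ALU-bound at roughly 9.7. It is only an upper bound, so real measurements can come in lower.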

                    Originally posted by curaga:
                    The dump only included two shaders that did texture fetches; these must therefore be the two shaders of the app.
                    Hmm, for me 'grep SAMPLE' gives 4 occurrences in the full dump for fill (or 8 with sb, because each shader is dumped twice):
                    Code:
                    R600_DEBUG=sb,nollvm,ps ./fill 2>&1 | grep SAMPLE
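For repeated checks, the same count can be scripted. This is a minimal sketch that mirrors the command above: it sets R600_DEBUG, merges stderr into stdout, and counts lines containing SAMPLE. The helper names are made up, the `./fill` path and flag list are taken from the command above, and nothing is assumed about the dump format beyond the SAMPLE substring.

```python
import os
import subprocess


def debug_env(flags, base=None):
    """Copy the environment and set R600_DEBUG to a comma-separated flag list."""
    env = dict(os.environ if base is None else base)
    env["R600_DEBUG"] = ",".join(flags)
    return env


def count_matching_lines(text, needle="SAMPLE"):
    """Like `grep -c needle`: the number of lines that contain the substring."""
    return sum(1 for line in text.splitlines() if needle in line)


def count_sample_lines(binary="./fill", flags=("sb", "nollvm", "ps")):
    """Run the demo with stderr merged into stdout (2>&1) and count SAMPLE lines."""
    result = subprocess.run(
        [binary],
        env=debug_env(flags),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return count_matching_lines(result.stdout)
```

Calling `count_sample_lines()` from the directory containing the fill demo should return the same number that piping through `grep SAMPLE | wc -l` would.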
                    Originally posted by curaga:
                    All my measurements were done with vblank_mode=0.
                    Please also make sure that SwapbuffersWait is turned off in xorg.conf. Turning on ColorTiling and ColorTiling2D might also help if you don't have them enabled already, as might disabling the compositor if you use one. The GPU power profile and the CPU governor may also affect the results. If you use a debug build of Mesa, that could in theory explain the slowdown as well.
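A minimal sketch of the xorg.conf settings mentioned above, assuming the xf86-video-ati (radeon) driver. The Identifier is a placeholder, and the option spellings are worth checking against the radeon(4) man page for your driver version.

```
Section "Device"
    Identifier "Radeon"          # placeholder name, use your own
    Driver     "radeon"
    Option     "SwapbuffersWait" "off"
    Option     "ColorTiling"     "on"
    Option     "ColorTiling2D"   "on"
EndSection
```

After restarting X, the effective values can be confirmed in the X server log.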

                    If you still have a low fill rate with all these options, you might want to check whether DUAL_EXPORT mode is actually enabled in the driver for your GPU during the tests (see these commits); IIRC the fill rate has been close to the specs for me since then.


                    • #20
                      I don't use a compositor, and both 1D and 2D tiling default to on (and are on according to Xorg.0.log). See the first post for the GPU power profile.

                      I don't use debug builds; the asserts and debug paths usually are not worth it (if I need to debug, I add -g to my CFLAGS).

                      Will check dual export and the CPU governor. I can't test SwapbuffersWait now (long downloads going), but that one is not really relevant, as tearing is unacceptable to me. It may turn out that the wait is what keeps the fill rate below the hardware specs, but since it has to stay on, the question would then become "why didn't 2D tiling improve the fill rate?".

                      Originally posted by vadimg:
                      Hmm, for me 'grep SAMPLE' gives 4 occurrences in the full dump for fill (or 8 with sb, because each shader is dumped twice):
                      Oh, then I suppose it's because sb only runs on first draw: I may not have waited for it to draw all passes, and I assumed sb ran at shader link time.
