Marek Working Towards Even Lower SGPR Register Usage


  • Marek Working Towards Even Lower SGPR Register Usage

    Phoronix: Marek Working Towards Even Lower SGPR Register Usage

    Yesterday well known open-source AMD developer Marek Olšák landed his RadeonSI 32-bit pointers support for freeing up some scalar general purpose registers (SGPRs) and he's continued with a new patch series to alleviate register usage even more...

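
    To make the saving concrete: each SGPR is 32 bits wide, so holding a full 64-bit address costs a pair of SGPRs, while a 32-bit pointer (usable when the driver keeps the referenced data in a 32-bit-addressable range, which is the idea behind the RadeonSI 32-bit pointer work) needs only one. A minimal C sketch of that arithmetic, purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustration only: an SGPR holds 32 bits, so the number of SGPRs
     * needed just to carry a pointer is its size divided by 4 bytes. */
    int main(void)
    {
        unsigned sgprs_per_64bit_ptr = sizeof(uint64_t) / 4; /* 2 SGPRs */
        unsigned sgprs_per_32bit_ptr = sizeof(uint32_t) / 4; /* 1 SGPR  */
        printf("64-bit pointer: %u SGPRs, 32-bit pointer: %u SGPR\n",
               sgprs_per_64bit_ptr, sgprs_per_32bit_ptr);
        return 0;
    }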

  • #2
    What's the benefit of using fewer SGPRs?

    Comment


    • #3
      Originally posted by geearf View Post
      What's the benefit of using fewer SGPRs?
      AFAIK the reason is that with fewer registers you will have to do more memory accesses, which is slower.

      Comment


      • #4
        Typo:

        Originally posted by phoronix View Post
        Marek for reducing user SPGR register usage.

        Comment


        • #5
          Originally posted by AsuMagic View Post
          AFAIK the reason is that with fewer registers you will have to do more memory accesses, which is slower.
          Funny, I always thought using fewer registers means more free registers for other things, which could reduce the number of memory accesses. If optimized correctly, of course.

          Comment


          • #6
            Originally posted by blacknova View Post

            Funny, I always thought using fewer registers means more free registers for other things, which could reduce the number of memory accesses. If optimized correctly, of course.
            Fewer of those registers available, I meant.

            Comment


            • #7
              GPUs expand draw & compute operations into a lot of independent work items, aka threads (one per pixel, vertex or compute result), which can run simultaneously on the shader core and help hide delays from accessing memory. When one wavefront (group of threads) is waiting for memory, another wavefront can run on the ALUs.

              On current GCN hardware a wavefront is 64 work items sharing a program counter, and up to 10 wavefronts can be active on each SIMD, although only one runs at any given time. If a wavefront has to wait for memory, it is switched out and the next one that is not waiting starts running. To allow instantaneous switching, the registers for all of the active wavefronts need to be stored in the register file, so we're talking about a LOT of register storage on the chip.

              Each work item can use up to 128 registers, but the total number of registers across all the active threads is limited to 256 IIRC. Before you ask "why not a higher limit?", the register file is already one of the larger parts of the shader core. What this means is that the number of simultaneous wavefronts you can launch on a SIMD, and hence the degree of latency hiding, is inversely related to the number of registers each wavefront requires.

              A shader can use up to 128 registers, but then only 2 wavefronts can be active on the SIMD at a time, and you lose a lot of potential performance to memory access delays. Limit that to 84 registers and you can run 3 wavefronts; 64 registers -> 4 wavefronts; and so on until you get to 25 registers and can run the maximum of 10 wavefronts (each SIMD has 10 program counters). Note I am typing all this from memory so details might be wrong, but you get the idea: using fewer registers means more wavefronts active at the same time and more efficient use of the shader core, or the extra registers can be used to store more variables and reduce the number that need to be spilled to memory.
              Last edited by bridgman; 18 February 2018, 01:26 PM.
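
              A minimal sketch of the arithmetic in the post above, taking its numbers at face value (a budget of 256 registers per SIMD lane and a cap of 10 wavefronts per SIMD); real hardware allocates registers in blocks of a certain granularity, which this simple division ignores:

              #include <stdio.h>

              /* Wavefronts per SIMD as limited by per-thread register usage,
               * using the figures quoted above: 256-register budget, 10-wave cap. */
              static unsigned waves_per_simd(unsigned regs_per_thread)
              {
                  if (regs_per_thread == 0)
                      return 10;
                  unsigned waves = 256 / regs_per_thread; /* register sets that fit */
                  return waves > 10 ? 10 : waves;         /* only 10 program counters */
              }

              int main(void)
              {
                  unsigned usage[] = { 128, 84, 64, 25 };
                  for (unsigned i = 0; i < sizeof(usage) / sizeof(usage[0]); i++)
                      printf("%3u registers -> %u wavefronts per SIMD\n",
                             usage[i], waves_per_simd(usage[i]));
                  return 0; /* prints 2, 3, 4 and 10, matching the post above */
              }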

              Comment


              • #8
                Originally posted by debianxfce View Post

                If you click the first link in the article, you will find the answer.
                I've just tried: I opened the previous article and did not find the answer.
                Where was it?

                Originally posted by AsuMagic View Post
                AFAIK the reason is that with fewer registers you will have to do more memory accesses, which is slower.
                I see, thank you for the explanation!

                Comment


                • #9
                  Originally posted by bridgman View Post
                  A shader can use up to 128 registers but then only 2 wavefronts can be active on the SIMD at a time, and you lose a lot of potential performance to memory access delays. Limit that to 84 registers and you can run 3 wavefronts; 64 registers -> 4 wavefronts, and so on until you get to 25 registers and can run the maximum of 10 wavefronts (each SIMD has 10 program counters).
                  That's for VGPRs though, whereas the patch reduces SGPR usage.
                  Although the story is much the same there, I think in practice the wavefront count is far more likely to be limited by VGPRs than by SGPRs (you can use at most 24 VGPRs, but up to 48 SGPRs (72 IIRC for GCN 1.2 and later), for the maximum of 10 wavefronts). Especially with graphics, SGPRs are probably mostly used for texture, buffer and sampler descriptors (texture descriptors use 8 SGPRs, and the other two still 4 SGPRs each).
                  Though given that the limit was increased with GCN 1.2, I suppose it is indeed possible in practice to be limited by SGPR usage :-).
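
                  A small extension of the earlier sketch that takes both register files into account. The budgets assumed here (256 VGPRs per lane, 512 SGPRs per SIMD, raised to 800 on GCN 1.2+) are my reading of the figures above, and allocation granularity is again ignored:

                  #include <stdio.h>

                  /* Wavefronts per SIMD limited by whichever register file runs out
                   * first: VGPRs (assumed 256 per lane) or SGPRs (assumed 512 per
                   * SIMD, 800 on GCN 1.2 and later), capped at 10 wavefronts. */
                  static unsigned min3(unsigned a, unsigned b, unsigned c)
                  {
                      unsigned m = a < b ? a : b;
                      return m < c ? m : c;
                  }

                  static unsigned waves(unsigned vgprs, unsigned sgprs, int gcn12_plus)
                  {
                      unsigned sgpr_file = gcn12_plus ? 800 : 512;
                      unsigned by_vgpr = vgprs ? 256 / vgprs : 10;
                      unsigned by_sgpr = sgprs ? sgpr_file / sgprs : 10;
                      return min3(10, by_vgpr, by_sgpr);
                  }

                  int main(void)
                  {
                      printf("24 VGPRs, 48 SGPRs, pre-GCN 1.2: %u waves\n", waves(24, 48, 0));
                      printf("24 VGPRs, 72 SGPRs, pre-GCN 1.2: %u waves\n", waves(24, 72, 0));
                      printf("24 VGPRs, 72 SGPRs, GCN 1.2+:    %u waves\n", waves(24, 72, 1));
                      return 0; /* 10, 7 and 10: heavier SGPR use only bites on older parts */
                  }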

                  Comment


                  • #10
                    http://nerdralph.blogspot.ca/2017/02...execution.html Inside AMD GCN code execution

                    AMD's Graphics Core Next architecture was introduced over five years ago. Although there have been many documents written to help developers understand the architecture, and thereby write better code, I have yet to find one that is clear and concise. AMD's best GCN documentation is often cluttered with unnecessary details on the old VLIW architecture, when the GCN architecture is already complicated enough on its own. I intend to summarize my research on GCN, and what that means for OpenCL and GCN assembler kernel developers.

                    As shown in the top diagram (GCN Compute Unit), the GPU consists of groups of four compute units. Each CU has four SIMD units, each of which can perform 16 simultaneous 32-bit operations. Each of these 16 SIMD "lanes" is also called a shading unit, so the R9 380 with 28 CUs has 28 * 4 * 16 = 1792 shading units.

                    AMD's documentation makes frequent reference to "wavefronts". A wavefront is a group of 64 operations that executes on a single SIMD. The SIMD operations take a minimum of four clock cycles to complete; however, SIMD pipelines allow a new operation to be started every clock. "The compute unit selects a single SIMD to decode and issue each cycle, using round-robin arbitration." (AMD GCN whitepaper pg 5, para 3). So four cycles after SIMD0 has been issued an instruction, the CU is ready to issue it another.

                    In OpenCL, when the local work size is 64, the 64 work-items will be executed on a single SIMD. Since a maximum of four SIMD units can access the same local memory (LDS), AMD GCN devices support a maximum local work size of 256. When the local work size is 64, the OpenCL compiler can leave out barrier instructions, so performance will often (but not always) be better than using a local work size of 128, 192, or 256.

                    The SIMD units only perform vector operations such as multiply, add, xor, etc. Branching for loops or function calls is performed by the scalar unit, which is shared by all four SIMD units. This means that when a kernel executes a branch instruction, it is executed by the scalar unit, leaving a SIMD unit available to perform a vector operation. The two operations (scalar and vector) must come from different waves, so to ensure the SIMD units are fully utilized, the kernel must allow for 2 simultaneous wavefronts to execute. For information on how resource usage such as registers and LDS impacts the number of simultaneous wavefronts that can execute, I suggest reading AMD's OpenCL Optimization Guide. Note that some sources state that full SIMD occupancy requires four waves, when it is technically possible with just one wave using only vector instructions. Most kernels will require some scalar instructions, so two waves is the practical minimum.
                    It seems rather obvious that optimizing the OpenCL/Vulkan/OpenGL instructions processed inside the shaders is a win/win for general-purpose computing like Blender, for game performance, and for on-screen window systems.
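
                    A quick sketch of the arithmetic from the quoted post, using the R9 380 figures it gives (28 CUs, 4 SIMDs per CU, 16 lanes per SIMD, 64-work-item wavefronts):

                    #include <stdio.h>

                    /* GCN layout numbers from the quoted post: 4 SIMDs of 16 lanes per CU,
                     * and a wavefront of 64 work-items executed on one SIMD over 4 cycles. */
                    int main(void)
                    {
                        const unsigned cus = 28;            /* R9 380 example from the quote */
                        const unsigned simds_per_cu = 4;
                        const unsigned lanes_per_simd = 16;

                        printf("shading units: %u\n", cus * simds_per_cu * lanes_per_simd); /* 1792 */

                        /* An OpenCL local work size maps onto whole 64-work-item wavefronts. */
                        for (unsigned lws = 64; lws <= 256; lws += 64)
                            printf("local work size %3u -> %u wavefront(s)\n", lws, lws / 64);
                        return 0;
                    }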

                    Comment
