RadeonSI Gallium3D Driver Adds Navi Wave32 Support


  • RadeonSI Gallium3D Driver Adds Navi Wave32 Support

    Phoronix: RadeonSI Gallium3D Driver Adds Navi Wave32 Support

    One of the new features to the RDNA architecture with Navi is support for single cycle issue Wave32 execution on SIMD32. Up to now the RadeonSI code was using just Wave64 but now there is support in this AMD open-source Linux OpenGL driver for Wave32...

    http://www.phoronix.com/scan.php?pag...I-Wave32-Lands

  • #2
    Is there a Wave32 game?



    • #3
      So the architecture supports both 32- and 64-wide wavefronts? Is there a performance benefit to using 32 on the ALU side? To me, 64 has many advantages over 32 with regard to memory access patterns, but I agree that for some workloads 32 could be a better fit.



      • #4
        I thought it was 32-only now, but who knows, 64 might have been kept around in some form for something.

        Starting with the architecture itself, one of the biggest changes for RDNA is the width of a wavefront, the fundamental group of work. GCN in all of its iterations was 64 threads wide, meaning 64 threads were bundled together into a single wavefront for execution. RDNA drops this to a native 32 threads wide. At the same time, AMD has expanded the width of their SIMDs from 16 slots to 32 (aka SIMD32), meaning the size of a wavefront now matches the SIMD size. This is one of AMD’s key architectural efficiency changes, as it helps them keep their SIMD slots occupied more often. It also means that a wavefront can be passed through the SIMDs in a single cycle, instead of over 4 cycles on GCN parts.
        https://www.anandtech.com/show/14528...-5700-series/2
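
The cycle counts in the quoted paragraph follow from dividing the wavefront width by the SIMD width. A minimal sketch of that arithmetic in Python: `issue_cycles` is a made-up helper for illustration (not anything from the driver or AMD's documentation), and it assumes one SIMD-width slice of the wavefront is issued per cycle. The wave64-on-SIMD32 case is the two-cycle mode described elsewhere in this thread.

```python
import math


def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Cycles to pass one wavefront through a SIMD.

    Assumption (illustrative only): one SIMD-width slice issues per cycle.
    """
    return math.ceil(wave_size / simd_width)


gcn_wave64 = issue_cycles(64, 16)   # GCN: wave64 on SIMD16 -> 4 cycles
rdna_wave32 = issue_cycles(32, 32)  # RDNA: wave32 on SIMD32 -> 1 cycle
rdna_wave64 = issue_cycles(64, 32)  # RDNA: wave64 on SIMD32 -> 2 cycles

print(gcn_wave64, rdna_wave32, rdna_wave64)  # prints "4 1 2"
```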



        • #5
          Originally posted by mannerov
          So the architecture supports both 32- and 64-wide wavefronts? Is there a performance benefit to using 32 on the ALU side? To me, 64 has many advantages over 32 with regard to memory access patterns, but I agree that for some workloads 32 could be a better fit.
          I think it is now easier to keep all the ALUs utilized, with less of a penalty when you have to perform some operations, at the cost of a worse memory access pattern. (I can't tell for sure unless AMD releases Navi architecture documents like they did for Vega.) It's the AMD fine-wine problem: theoretically GCN wave64 has higher performance, but it is hard to fully utilize the entire GCN core. It's like how VLIW was a good idea until we found that optimizing for VLIW is a nightmare, so we moved back to CISC and now even RISC.



          • #6
            Originally posted by artivision
            Is there a Wave32 game?
            It has nothing to do with the game; it is purely about how the driver handles the code. That said, most compute code does care and is optimized for wave64, so AMD retained support for running both. The way they did this is that wave64 code executes across 2 cycles instead of one. Up until now the open-source graphics driver has been issuing wave64 code, running in this two-cycle mode. With wave32 code being issued we should see better ALU utilization, since with wave64 code a lot of the compute resources could go underutilized in many cases.



            • #7
              Originally posted by dungeon
              I thought it was 32-only now, but who knows, 64 might have been kept around in some form for something.


              https://www.anandtech.com/show/14528...-5700-series/2
              It can definitely still do wave64; that is what the entire driver has been doing up until now on Navi. It just takes 2 cycles instead of one to do the work, but the Navi CUs are 2x as fast to begin with, so you only notice a small overhead from this for compute code. Graphics shaders, which sometimes have difficulty filling a wave64 and may use only a quarter of its resources, can now be broken down into wave32 for better utilization, since the resources are scheduled more finely.

              They give a fairly detailed high-level overview of how all this works in the tech slides. I'm surprised people are commenting on this without bothering to read it: https://www.techpowerup.com/256660/a...tc#g256660-121

              Doing this is the main reason they got a 1.25x IPC improvement... so the Mesa code we've seen benchmarked so far has not taken advantage of this at all.
              Last edited by cb88; 07-20-2019, 12:00 PM.
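
The "quarter of a wave64" point can be put into numbers with a toy model. This is only a sketch under stated assumptions: `lane_utilization` is a hypothetical helper, threads are assumed to be packed into whole waves, and divergence, occupancy and latency hiding are all ignored, so it says nothing about real-world performance.

```python
import math


def lane_utilization(active_threads: int, wave_size: int) -> float:
    """Fraction of allocated lanes doing useful work.

    Toy model (assumption): threads are packed into whole waves and
    every lane of every allocated wave costs the same.
    """
    if active_threads <= 0:
        raise ValueError("active_threads must be positive")
    waves = math.ceil(active_threads / wave_size)
    return active_threads / (waves * wave_size)


# A shader invocation with only 16 live threads fills a quarter of a wave64...
print(lane_utilization(16, 64))  # prints 0.25
# ...but half of a wave32.
print(lane_utilization(16, 32))  # prints 0.5
```

With the same 16 threads, the narrower wave leaves fewer lanes idle, which is the finer-grained scheduling described in the post above.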



              • #8
                Originally posted by cb88
                They give a fairly detailed high-level overview of how all this works in the tech slides. I'm surprised people are commenting on this without bothering to read it
                I had read it, of course, but from this example I understood that there is no native wave64 anymore, just 2 wave32



                https://www.anandtech.com/Gallery/Album/7172#12
                Last edited by dungeon; 07-20-2019, 12:30 PM.



                • #9
                  Originally posted by dungeon

                  I had read it, of course, but from this example I understood that there is no native wave64 anymore, just 2 wave32



                  https://www.anandtech.com/Gallery/Album/7172#12
                  The driver doesn't know... that's the whole point: it's executed natively, just in a different way, across 2 cycles. It's not doing emulation or anything like that. It's similar to how AVX instructions on Zen 1 take 1 or 2 cycles depending on length (128-bit is single cycle, 256-bit is 2 cycles): there is native support for the instructions, but the hardware is not set up to execute them in a single cycle.
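
One way to picture "native, but across 2 cycles" is a loop that applies the same operation to one SIMD-width slice of lanes per pass. This is purely an illustrative simulation: `run_wave` and the 32-lane default are assumptions made for the sketch, not a description of how the hardware or the driver is actually implemented.

```python
def run_wave(op, lanes, simd_width=32):
    """Apply `op` to every lane, one SIMD-width slice per simulated cycle.

    Returns the per-lane results and the number of simulated cycles.
    """
    results = []
    cycles = 0
    for start in range(0, len(lanes), simd_width):
        # Each pass handles at most `simd_width` lanes with the same operation.
        results.extend(op(x) for x in lanes[start:start + simd_width])
        cycles += 1
    return results, cycles


wave64 = list(range(64))
out, cycles = run_wave(lambda x: x * 2, wave64)

print(cycles)                                # prints 2
print(out == [x * 2 for x in range(64)])     # prints True
```

The result is the same as if all 64 lanes had run at once; only the number of passes changes.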



                  • #10
                    Originally posted by cb88

                    It has nothing to do with the game; it is purely about how the driver handles the code. That said, most compute code does care and is optimized for wave64, so AMD retained support for running both. The way they did this is that wave64 code executes across 2 cycles instead of one. Up until now the open-source graphics driver has been issuing wave64 code, running in this two-cycle mode. With wave32 code being issued we should see better ALU utilization, since with wave64 code a lot of the compute resources could go underutilized in many cases.
                    If that were the case, then why do they keep compatibility with Wave64 at the expense of more transistors? They could just execute everything in groups of 32 threads.
