Mesa 22.0 Gets RADV Ray-Tracing Performance Boost By Using Wave32 Mode

  • Mesa 22.0 Gets RADV Ray-Tracing Performance Boost By Using Wave32 Mode

    Phoronix: Mesa 22.0 Gets RADV Ray-Tracing Performance Boost By Using Wave32 Mode

    Landing today in Mesa 22.0 was a fix for Vulkan ray-tracing with the RADV driver in the RDNA Wave32 shader mode, along with a switch to Wave32 by default for ray-tracing on RDNA/RDNA2 GPUs...


  • #2
    For those wondering how smaller waves, and thus less parallelism, can improve performance on this task: shading with ray queries involves a lot of conditional decisions, and if even one thread in the wave has to take a branch, the whole wave has to take it along, which ties up the execution unit.

    When you cut the wave size in half, the likelihood that a given wave can jump over a block entirely goes up considerably (see the rough sketch below).

    Though GPU architectures do vary on this, and it is possible to make rare branches cheaper if you are willing to make branchless shaders slower.
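
    As a rough illustration, here is a minimal sketch that assumes every thread takes a given branch independently with probability p (real ray workloads are correlated, so the numbers are only illustrative) and computes how often an entire wave gets to skip the block:

```python
# Back-of-the-envelope model: a wave can skip a branch only if *no* thread in it
# takes that branch. Assumes independent threads, which real workloads are not.
def p_wave_skips(p_thread_takes: float, wave_size: int) -> float:
    """Chance that no thread in the wave takes the branch, so the wave skips it."""
    return (1.0 - p_thread_takes) ** wave_size

for p in (0.01, 0.05, 0.10):
    print(f"p(thread takes)={p:.2f}  "
          f"skip@Wave64={p_wave_skips(p, 64):.3f}  "
          f"skip@Wave32={p_wave_skips(p, 32):.3f}")
```

    At a 5% per-thread rate, for example, a 64-wide wave skips the block only about 4% of the time, while a 32-wide wave skips it roughly 19% of the time.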

    • #3
      It might be more complicated than that; the cost of control-flow divergence depends on the probability of the if condition being taken.

      • #4
        When I hear Wave32 I just imagine the card being a giant ocean and the GPU surfing to get through the ocean waves...

        • #5
          Originally posted by tildearrow View Post
          When I hear Wave32 I just imagine the card being a giant ocean and the GPU surfing to get through the ocean waves...
          I see you have a creative imagination. I would say the ocean is sometimes too big for the GPU itself, like in the RX 6500 XT's case.

          • #6
            Originally posted by microcode View Post
            For those wondering how smaller waves, and thus less parallelism, can improve performance on this task:
            Wave32 doesn't halve the total number of ALUs, though. It basically just re-partitions them.

            It also has lower latency than Wave64, because that breaks each wave into 4 parts of 16, and sequentially executes each one. So, Wave64 has a latency of 4 cycles (i.e. you have to wait at least 4 cycles before using the result of a prior instruction) whereas I think Wave32 has a latency of just 2 cycles.

            Also, I think I recall Wave32 having more scalar execution units per CU.

            So, what's the downside of Wave32 and why did AMD stick with Wave64 for CDNA? Well, you have 2x the instruction decoders and branching logic per CU. So, in less branch-dense code with greater parallelism, Wave64 ends up being a more efficient use of silicon area.
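
            To put rough numbers on that re-partitioning point, here is a minimal sketch using the 2x SIMD32-per-CU layout described further down the thread; the pass counts are a simplification:

```python
# An RDNA CU has 2 SIMDs x 32 lanes = 64 ALU lanes regardless of the wave size;
# the wave size only changes how threads are grouped onto those lanes.
simds_per_cu = 2
lanes_per_simd = 32
alu_lanes = simds_per_cu * lanes_per_simd
print(f"ALU lanes per CU: {alu_lanes}")  # 64 with Wave32, still 64 with Wave64

# Wave32: a wave is exactly as wide as a SIMD, so it maps onto it in one pass.
# Wave64: a wave is twice as wide as a SIMD, so it takes two passes,
#         but the total lane count (and peak throughput) is unchanged.
for wave_size in (32, 64):
    passes = wave_size // lanes_per_simd
    print(f"Wave{wave_size}: {passes} pass(es) per instruction on one SIMD")
```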

            • #7
              Originally posted by tildearrow View Post
              When I hear Wave32 I just imagine the card being a giant ocean and the GPU surfing to get through the ocean waves...
              When I hear Wave64, I think of the Nintendo 64 game, Wave Race 64.

              • #8
                Originally posted by coder View Post
                It also has lower latency than Wave64, because that breaks each wave into 4 parts of 16, and sequentially executes each one. So, Wave64 has a latency of 4 cycles (i.e. you have to wait at least 4 cycles before using the result of a prior instruction) whereas I think Wave32 has a latency of just 2 cycles.

                Also, I think I recall Wave32 having more scalar execution units per CU.
                Yep, although the Wave32 latency is actually 1 cycle rather than 2.

                In GCN each CU was organized as 4 16-way SIMDs, each executing a 64-thread wave in 4 cycles as you say, with a single scalar unit multiplexed between the SIMDs. The multiplexing worked because each SIMD took 4 clocks to execute a vector instruction while the scalar unit only took 1 clock.

                In RDNA the organization becomes 2 32-way SIMDs, each executing a 32-thread wave in 1 cycle, with two scalar units and no multiplexing.

                https://www.amd.com/system/files/doc...whitepaper.pdf
                Last edited by bridgman; 22 January 2022, 11:42 PM.
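
                A rough back-of-the-envelope model of that issue timing, using the cycle counts above (real wave scheduling and latency hiding make the picture more involved):

```python
# Simplified latency for a chain of back-to-back dependent vector instructions.
# GCN:  4x SIMD16 per CU; a 64-thread wave needs 4 cycles per instruction,
#       so a dependent result is usable after 4 cycles.
# RDNA: 2x SIMD32 per CU; a 32-thread wave issues in 1 cycle,
#       so a dependent result is usable after 1 cycle.
def dependent_chain_cycles(n_instructions: int, cycles_per_instruction: int) -> int:
    """Cycles for a chain in which each instruction waits on the previous result."""
    return n_instructions * cycles_per_instruction

chain_length = 16  # hypothetical chain of 16 dependent vector ops
print("GCN,  Wave64 on SIMD16:", dependent_chain_cycles(chain_length, 4), "cycles")
print("RDNA, Wave32 on SIMD32:", dependent_chain_cycles(chain_length, 1), "cycles")

# The 4-cycle cadence is also why GCN could share a single scalar unit per CU:
# each of the 4 SIMDs needs a scalar issue slot only once every 4 cycles,
# so a 1-cycle scalar unit can round-robin among them.
```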

                • #9
                  Originally posted by bridgman View Post
                  Yep, although the Wave32 latency is actually 1 cycle rather than 2.
                  Thanks, I had a nagging feeling it was 1 cycle, but I was too lazy to look it up and 2 somehow seemed to make sense.

                  I guess that means that RDNA GPUs could actually cut the latency of Wave64 to 2 cycles, if not for breaking ISA compatibility?

                  • #10
                    Originally posted by tildearrow View Post
                    When I hear Wave32 I just imagine the card being a giant ocean and the GPU surfing to get through the ocean waves...
                    I think of old school sound cards - Wave32 3D
