The Performance-Per-Watt Efficiency Of GPUs On Open-Source Drivers


  • #16
    Looks like r600 still has the edge in raw power

    That Radeon HD 6770 was defeated only by 3 larger GCN cards, yet was middle of the road for loaded power use and idled as low as most of the discrete cards. The highest performance in the entire test was a larger r600 card, the HD 6870. Given how good the RadeonSI driver has become when the right firmware is used, I wonder if r600 is just plain a better architecture for how Mesa as a whole works. This tells me to sit on my HD6750 unless Kdenlive ships the new GPU-accelerated Movit with support for hardware encoding for raw speed.



    • #17
      Originally posted by Luke View Post
      That Radeon HD 6770 was defeated only by 3 larger GCN cards, yet was middle of the road for loaded power use and idled as low as most of the discrete cards. The highest performance in the entire test was a larger r600 card, the HD 6870. Given how good the RadeonSI driver has become when the right firmware is used, I wonder if r600 is just plain a better architecture for how Mesa as a whole works. This tells me to sit on my HD6750 unless Kdenlive ships the new GPU-accelerated Movit with support for hardware encoding for raw speed.
      I think there are a couple of messages here, but none of them are "r600 is a better architecture for Mesa" AFAIK:

      1. Shader compilers take a long time to optimize, and when you change shader architecture as drastically as we did between VLIW and GCN you're pretty much starting from scratch - this gives the later VLIW parts a chance to shine for a while

      2. The nature of the other remaining optimization opportunities (primarily making best use of available memory & bandwidth) is going to favor mid-range parts, since both low-end and high-end parts tend to be relatively more memory-limited

      3. Things work better when DPM is enabled



      • #18
        Originally posted by bridgman View Post
        I think there are a couple of messages here, but none of them are "r600 is a better architecture for Mesa" AFAIK:

        1. Shader compilers take a long time to optimize, and when you change shader architecture as drastically as we did between VLIW and GCN you're pretty much starting from scratch - this gives the later VLIW parts a chance to shine for a while

        2. The nature of the other remaining optimization opportunities (primarily making best use of available memory & bandwidth) is going to favor mid-range parts, since both low-end and high-end parts tend to be relatively more memory-limited

        3. Things work better when DPM is enabled
        Also LLVM sucks.



        • #19
          Nice roundup, Michael, though including the new firmware for the 260 would've been nice.
          Can we still expect to see an article about the power draw of your Tegra K1 board?



          • #20
            Originally posted by liam View Post
            Nice roundup, Michael, though including the new firmware for the 260 would've been nice.
            Can we still expect to see an article about the power draw of your Tegra K1 board?
            Yes, that's still coming, but I only have one WattsUp USB meter and it's been busy for the past ~2 weeks with this graphics card testing (still doing the proprietary driver tests, etc.).
            Michael Larabel
            http://www.michaellarabel.com/



            • #21
              Originally posted by Michael View Post
              The HD 5450 was passively cooled while the ASUS HD 4890 had a very large cooler.
              Oh! Well, that explains that.

              Hmm, the choice of passive vs. active cooling is kind of annoying. Passive cooling is quieter, but if it raises temperatures enough that the CPU cooler has to work harder, the whole system might end up louder... Argh.



              • #22
                Why not use the best combination?

                Passively cooled components plus a huge, silent 20cm+ case fan: lower temps than normal small-fan-equipped hardware, and silence.



                • #23
                  Originally posted by bridgman View Post
                  I think there are a couple of messages here, but none of them are "r600 is a better architecture for Mesa"
                  Are you sure about that? I remember hearing VLIW was tuned for graphics while SIMD (GCN) is there to make GPGPU tasks easier/faster. Now I don't want to spread misinformation, so please correct me if I'm wrong as I think you know it best.



                  • #24
                    Originally posted by TAXI View Post
                    Are you sure about that? I remember hearing VLIW was tuned for graphics while SIMD (GCN) is there to make GPGPU tasks easier/faster. Now I don't want to spread misinformation, so please correct me if I'm wrong as I think you know it best.
                    It's less that VLIW is more tuned for graphics, and more that it is unable to be tuned for anything else.

                    SIMD is better for GPGPU, but it can be equally good for graphics. The hardware is probably just more expensive to create than what you could get away with if you were only targeting graphics through a VLIW architecture (and therefore it may end up being slower if AMD just targets a specific price point).

                    Also, VLIW requires more complicated compiler optimizations to work correctly, so if anything SIMD should be the "better" option for Mesa (a rough sketch of the packing problem is below).
                    Last edited by smitty3268; 06-07-2014, 08:45 PM.
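
                    To put a bit of flesh on "more complicated compiler optimizations": a VLIW compiler has to find independent operations it can pack into each bundle, otherwise ALU slots sit empty. Here is a rough, made-up sketch of that packing problem in C (toy instruction list, greedy in-order packing into 5-slot VLIW5 bundles, nothing like a real compiler):

                    #include <stdio.h>

                    /* Toy instruction: writes dst, reads src1/src2 (register numbers). */
                    struct insn { int dst, src1, src2; };

                    /* A later instruction conflicts with an earlier one in the same bundle
                       if it reads or overwrites the earlier result. */
                    static int conflicts(struct insn earlier, struct insn later) {
                        return later.src1 == earlier.dst || later.src2 == earlier.dst ||
                               later.dst == earlier.dst;
                    }

                    int main(void) {
                        /* Hypothetical shader snippet: r2=r0+r1; r3=r2*r0 (dependent); r4=r5+r6; r7=r8+r9 */
                        struct insn prog[] = { {2, 0, 1}, {3, 2, 0}, {4, 5, 6}, {7, 8, 9} };
                        int n = sizeof prog / sizeof prog[0];
                        int bundles = 0;

                        for (int i = 0; i < n; ) {
                            int start = i, slots = 0;
                            /* Greedily pack up to 5 mutually independent instructions (VLIW5). */
                            while (i < n && slots < 5) {
                                int ok = 1;
                                for (int j = start; j < i; j++)
                                    if (conflicts(prog[j], prog[i])) { ok = 0; break; }
                                if (!ok) break;   /* dependency found: start a new bundle */
                                slots++; i++;
                            }
                            bundles++;
                            printf("bundle %d: %d of 5 slots used\n", bundles, slots);
                        }
                        printf("%d instructions -> %d bundles; empty slots are idle ALUs\n", n, bundles);
                        return 0;
                    }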



                    • #25
                      vec4 a = b + c;

                      One instruction, one "core" taken on a vector arch. Four units taken on a scalar (GCN) arch. As long as the GCN card has fewer than 4x the cores, well-vectorized code will be slower.
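
                      A back-of-the-envelope version of that counting, as a C sketch (the unit counts are made up; it just assumes one whole vec4 op per VLIW "core" per cycle and one scalar op per GCN lane per cycle):

                      #include <stdio.h>

                      int main(void) {
                          long vec4_adds  = 1000000; /* hypothetical number of vec4 a = b + c operations */
                          long vliw_cores = 320;     /* made up: each issues one whole vec4 add per cycle */
                          long gcn_lanes  = 1024;    /* made up: each lane does one scalar add per cycle */

                          long vliw_cycles = (vec4_adds + vliw_cores - 1) / vliw_cores;
                          long gcn_cycles  = (vec4_adds * 4 + gcn_lanes - 1) / gcn_lanes; /* 4 scalar ops per vec4 */

                          printf("VLIW: %ld cycles, GCN: %ld cycles\n", vliw_cycles, gcn_cycles);
                          /* In this toy model GCN only catches up once gcn_lanes >= 4 * vliw_cores,
                             which is the "less than 4x the cores" break-even point above. */
                          return 0;
                      }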



                      • #26
                        Originally posted by smitty3268 View Post
                        SIMD is better for GPGPU, but it can be equally good for graphics. The hardware is probably just more expensive to create than what you could get away with if you were only targeting graphics through a VLIW architecture (and therefore it may end up being slower if AMD just targets a specific price point).
                        Right. A VLIW SIMD implementation requires less silicon area than a scalar SIMD implementation for the same graphics performance (since graphics workloads are mostly 3- and 4-vector operations anyways), but it's harder to make optimal use of VLIW hardware on arbitrary compute workloads.

                        On the other hand, for compute workloads which *do* fit well with VLIW hardware (basically ones which can be readily modified to make use of 4-vectors) the compute performance per unit area (and hence per-dollar) can be very high.
                        Last edited by bridgman; 06-08-2014, 06:30 AM.
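
                        For what it's worth, here is a trivial C sketch of what "readily modified to make use of 4-vectors" can mean in practice: the second loop exposes four independent operations per iteration, which map naturally onto the x/y/z/w slots of a VLIW unit (plain C, not real shader code; compiler and hardware details are glossed over):

                        #include <stdio.h>

                        #define N 1024

                        int main(void) {
                            float a[N], b[N], out[N];
                            for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

                            /* Scalar form: one add per iteration, little for a VLIW bundle to pack. */
                            for (int i = 0; i < N; i++)
                                out[i] = a[i] + b[i];

                            /* "4-vector" form: four independent adds per iteration, a natural
                               fit for the four slots of a VLIW unit. */
                            for (int i = 0; i < N; i += 4) {
                                out[i + 0] = a[i + 0] + b[i + 0];
                                out[i + 1] = a[i + 1] + b[i + 1];
                                out[i + 2] = a[i + 2] + b[i + 2];
                                out[i + 3] = a[i + 3] + b[i + 3];
                            }

                            printf("out[7] = %f\n", out[7]);
                            return 0;
                        }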



                        • #27
                          Originally posted by bridgman View Post
                          Right. A VLIW SIMD implementation requires less silicon area than a scalar SIMD implementation for the same graphics performance (since graphics workloads are mostly 3- and 4-vector operations anyways), but it's harder to make optimal use of VLIW hardware on arbitrary compute workloads.

                          On the other hand, for compute workloads which *do* fit well with VLIW hardware (basically ones which can be readily modified to make use of 4-vectors) the compute performance per unit area (and hence per-dollar) can be very high.
                          Can you explain the difference between scalar and vector SIMD? I know that's a big difference between pre-GCN and GCN, but I haven't found a resource that actually explains exactly what it means, other than that scalar SIMD is more flexible and can make good use of a hardware scheduler. That's also my very issue with scalar SIMD: the idea seems bizarre, almost an oxymoron given the "multiple data" part.



                          • #28
                            Originally posted by liam View Post
                            Can you explain the difference between scalar and vector SIMD? I know that's a big difference between pre-GCN and GCN, but I haven't found a resource that actually explains exactly what it means, other than that scalar SIMD is more flexible and can make good use of a hardware scheduler. That's also my very issue with scalar SIMD: the idea seems bizarre, almost an oxymoron given the "multiple data" part.
                            I don't know about all the exact terminology, but the major difference between the old VLIW and the new radeon SI architecture is explained somewhat by Anandtech.
                            Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).
                            Because the smallest unit of work is the SIMD and a CU has 4 SIMDs, a CU works on 4 different wavefronts at once. As wavefronts are still 64 operations wide, each cycle a SIMD will complete 1/4 of the operations on its respective wavefront, and after 4 cycles the current instruction for the active wavefront is completed.

                            Cayman by comparison would attempt to execute multiple instructions from the same wavefront in parallel, rather than executing a single instruction from multiple wavefronts. This is where Cayman got bursty: if the instructions were in any way dependent, Cayman would have to let some of its ALUs go idle. GCN, on the other hand, does not face this issue; because each SIMD handles single instructions from different wavefronts, they are in no way attempting to take advantage of ILP, and their performance will be very consistent.
                            http://www.anandtech.com/show/4455/a...-for-compute/3
                            http://www.anandtech.com/show/5261/a...-7970-review/3
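
                            Putting numbers on that description, a small C sketch of just the arithmetic, using the figures from the Anandtech articles (64-wide wavefronts, 16-wide SIMDs, 4 SIMDs per CU):

                            #include <stdio.h>

                            int main(void) {
                                int wavefront_width = 64; /* work-items per wavefront */
                                int simd_width      = 16; /* lanes per SIMD */
                                int simds_per_cu    = 4;  /* each SIMD works on its own wavefront */

                                /* One instruction for one wavefront occupies its SIMD for 64/16 = 4 cycles. */
                                int cycles_per_instruction = wavefront_width / simd_width;

                                /* With 4 wavefronts resident (one per SIMD), the CU still retires one
                                   wavefront instruction per cycle on average - TLP across wavefronts
                                   instead of the ILP Cayman's VLIW units had to find within one. */
                                double cu_insns_per_cycle = (double)simds_per_cu / cycles_per_instruction;

                                printf("cycles per wavefront instruction on one SIMD: %d\n", cycles_per_instruction);
                                printf("wavefront instructions retired per cycle per CU: %.1f\n", cu_insns_per_cycle);
                                return 0;
                            }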



                            • #29
                              Originally posted by smitty3268 View Post
                              I don't know about all the exact terminology, but the major difference between the old VLIW and the new radeon SI architecture is explained somewhat by Anandtech.
                              http://www.anandtech.com/show/4455/a...-for-compute/3
                              http://www.anandtech.com/show/5261/a...-7970-review/3
                              Nice find. Thanks. This seems like it forces the hardware to be far more aware of program state than previous iterations. This would take some of the burden off the compiler writers, but it also appears to be more costly when it comes to silicon efficiency.

