POWER9 + ARM64 Performance For Dav1d 0.8 AV1 Decoding

  • POWER9 + ARM64 Performance For Dav1d 0.8 AV1 Decoding

    Phoronix: POWER9 + ARM64 Performance For Dav1d 0.8 AV1 Decoding

    With last week's release of dav1d 0.8 for CPU-based AV1 video decoding we provided a number of x86_64 benchmarks while questions were raised around the ARM64 and POWER9 performance. Here are such benchmarks for those wondering about the AV1 video decoding speed on those architectures...

  • #2
    I wonder if the 10-bit decode path involves mostly scalar, integer ops. That would explain why the x86 architectures do so much worse on it, as well as why the Altra is really able to pull ahead.


    • #3
      Originally posted by coder
      I wonder if the 10-bit decode path involves mostly scalar, integer ops. That would explain why the x86 architectures do so much worse on it, as well as why the Altra is really able to pull ahead.
      My guess is there is some bug in the x86 code path.
      For PPC and ARM the FPS were halved (compared with Chimera 1080p), but x86 performance goes down to 1/4.
      This is weird.


      • #4
        Originally posted by coder
        I wonder if the 10-bit decode path involves mostly scalar, integer ops. That would explain why the x86 architectures do so much worse on it, as well as why the Altra is really able to pull ahead.
        At least according to a reader on Twitter, Netflix paid for hand-written ARM assembly for the 10-bit code path, but no one has paid for hand-tuned x86_64 10-bit code yet...
        Michael Larabel
        https://www.michaellarabel.com/


        • #5
          Originally posted by zxy_thf
          My guess is there is some bug in the x86 code path.
          For PPC and ARM the FPS were halved (compared with Chimera 1080p), but x86 performance goes down to 1/4.
          This is weird.
          It's not weird, if you consider that x86 has 256-bit AVX2 (and 512-bit AVX-512) that is certainly used in the 8-bit path. 10-bit probably has more trouble using it, so it's probably implemented using scalar code. That explains why it takes a bigger hit on x86 than Power.

          And ARMv8-A only has 128-bit vector extensions, which explains its performance disadvantage at 8-bit.
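The register-width argument in this post can be put into numbers with a rough sketch. It assumes 8-bit samples sit in 8-bit lanes and 10-bit samples sit in 16-bit lanes (a common storage convention, not something stated in the thread), and the helper names `lanes` and `lane_table` are made up for illustration. The figures are upper bounds on samples per register, not measured speedups.

```python
# Samples that fit in one SIMD register, by register width and lane size.
# Assumption: 8-bit pixels use 8-bit lanes, 10-bit pixels use 16-bit lanes.
VECTOR_BITS = {
    "NEON (ARMv8-A)": 128,
    "VSX (POWER9)": 128,
    "AVX2": 256,
    "AVX-512": 512,
}


def lanes(vector_bits: int, lane_bits: int) -> int:
    """Number of samples of lane_bits each that fit in one register."""
    return vector_bits // lane_bits


def lane_table() -> dict:
    """Map each extension to (lanes for 8-bit samples, lanes for 16-bit samples)."""
    return {
        name: (lanes(bits, 8), lanes(bits, 16))
        for name, bits in VECTOR_BITS.items()
    }


if __name__ == "__main__":
    for name, (l8, l16) in lane_table().items():
        # A scalar fallback handles one sample per operation.
        print(f"{name:16s} 8-bit: {l8:2d} lanes   16-bit: {l16:2d} lanes   scalar: 1")
```

If the 10-bit path really does fall back to scalar code on x86, it gives up a wider register than a 128-bit architecture would, which is consistent with the larger relative drop described above.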


          • #6
            Dav1d 0.8.1

            - ARM32 optimizations for 10bit bitdepth for SGR
            - ARM32 optimizations for 16bit bitdepth for blend/w_mask/emu_edge
            - ARM64 optimizations for 10bit bitdepth for SGR
            - x86 optimizations for wiener in SSE2/SSSE3/AVX2

            I also took a look at the use of VSX vector instructions in the wiener filter for PPC. I have seen a lot of SSE/AVX related code but not any VSX before.

            I was also curious to see how they handled the discrete CPU architecture references.

            p7zip needs an update to support multi-arch, and I was looking for some ideas.
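For the multi-arch idea, one common pattern is a table of per-architecture implementations with a portable fallback, chosen once at startup. The sketch below is not taken from dav1d or p7zip; every function name in it is hypothetical, and the "optimized" routines are plain stand-ins for what would be assembly or intrinsics in a real project.

```python
import platform


def sum_generic(data: bytes) -> int:
    """Portable fallback that works on any architecture."""
    total = 0
    for b in data:
        total += b
    return total


def sum_x86_64(data: bytes) -> int:
    """Stand-in for a hypothetical SSE/AVX2 routine."""
    return sum(data)


def sum_arm64(data: bytes) -> int:
    """Stand-in for a hypothetical NEON routine."""
    return sum(data)


def sum_ppc64le(data: bytes) -> int:
    """Stand-in for a hypothetical VSX routine."""
    return sum(data)


# Keys are lowercase platform.machine() values seen on common systems.
IMPLEMENTATIONS = {
    "x86_64": sum_x86_64,
    "amd64": sum_x86_64,
    "aarch64": sum_arm64,
    "arm64": sum_arm64,
    "ppc64le": sum_ppc64le,
}


def select_impl(machine=None):
    """Pick the routine for this machine, falling back to the generic one."""
    machine = (machine or platform.machine()).lower()
    return IMPLEMENTATIONS.get(machine, sum_generic)


if __name__ == "__main__":
    impl = select_impl()
    print(impl.__name__, impl(b"\x01\x02\x03"))
```

A real project would usually also check CPU feature flags at runtime (for example, whether AVX2 is actually present) rather than relying on the architecture name alone.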


            • #7
              To compare these wildly different machines I think you need to use common metrics like fps per watt or fps per dollar. In fact maybe create a scenario like 'which hardware is best for an online streaming setup' and take the hardware cost, rack space cost and electricity cost into account and produce a dollar cost for each type of hardware for running 1000 streams concurrently for a year.
              Last edited by arideden; 10 January 2021, 03:17 PM.
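The comparison suggested in this post can be sketched as a small cost model. Everything here is an assumption for illustration: the `Server` fields, the default electricity and rack prices, the three-year amortization, and the idea that one stream needs a fixed frame rate. No figures from the article are used; real inputs would have to come from measurements and quotes.

```python
import math
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    fps: float        # decode throughput of the whole machine (measured)
    watts: float      # average wall power under load
    price_usd: float  # purchase price
    rack_units: int   # rack space occupied


def fps_per_watt(server: Server) -> float:
    return server.fps / server.watts


def streams_per_server(server: Server, stream_fps: float = 30.0) -> int:
    """Whole streams one machine can sustain at stream_fps each."""
    return int(server.fps // stream_fps)


def yearly_cost(server: Server, n_streams: int = 1000, stream_fps: float = 30.0,
                usd_per_kwh: float = 0.10, usd_per_ru_year: float = 300.0,
                amortize_years: float = 3.0) -> float:
    """Dollar cost of running n_streams concurrently for one year."""
    per_server = streams_per_server(server, stream_fps)
    if per_server == 0:
        raise ValueError(f"{server.name} cannot sustain a single stream")
    n_servers = math.ceil(n_streams / per_server)
    hardware = server.price_usd / amortize_years          # amortized purchase
    energy = server.watts / 1000.0 * 24 * 365 * usd_per_kwh
    rack = server.rack_units * usd_per_ru_year
    return n_servers * (hardware + energy + rack)


if __name__ == "__main__":
    # Placeholder numbers only, not taken from any benchmark.
    example = Server("example-2P", fps=300.0, watts=500.0,
                     price_usd=9000.0, rack_units=2)
    print(f"{example.name}: {fps_per_watt(example):.2f} fps/W, "
          f"${yearly_cost(example):,.0f} per year for 1000 streams")
```

Running the same function over each platform's measured fps, power draw, and price would give the single dollar figure per platform that the post asks for.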


              • #8
                Originally posted by Michael
                Talos II 2P server with 44 cores / 176 threads and then for ARM64 was the Altra in its 160 core 2P configuration. For these quick reference tests are also performance numbers for the Xeon Platinum 8280 2P and EPYC 7742 2P.
                How hard would it be to add information about how many cores and threads the Xeon and EPYC platforms have?
                Last edited by pkese; 10 January 2021, 02:08 PM.


                • #9
                  Guys, guys.

                  The actual reason x86_64 CPUs perform so much worse in 10-bit is that, until recently, no x86_64 SIMD assembly code had been written for HBD (10-bit and above) decoding.
                  Even now, these patches haven't been merged into the main branch.


                  • #10
                    M1 (4+4 cores) looks at those numbers and laughs
