VP9 & AV1 Have More Room To Improve For POWER & ARM Architectures


  • VP9 & AV1 Have More Room To Improve For POWER & ARM Architectures

    Phoronix: VP9 & AV1 Have More Room To Improve For POWER & ARM Architectures

    Luc Trudeau, a video compression wizard and co-author of the AV1 royalty-free video format, has written a piece about the optimization state for video formats like VP9 and AV1 on POWER and ARM CPU architectures...


  • #2
    Quote from the article:

    Sadly, but not surprisingly, when the number of parallel encodes exceeds the number of cores, the POWER9 does not scale linearly anymore. It would appear to be closer to a log function.
    This shouldn't be a shock, as SMT threads still share the execution resources of the physical core that hosts them.
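
    A quick way to reproduce that curve is to time batches of parallel encodes and watch throughput flatten once the batch size passes the core count. Below is a rough, hypothetical harness; the vpxenc command line and input file are placeholders, not the setup used in the article.

    Code:
    # Rough sketch: time batches of N simultaneous encodes to see how
    # throughput scales once N passes the physical core count.
    # The encode command and input file are placeholders.
    import subprocess, time
    from concurrent.futures import ThreadPoolExecutor

    CMD = ["vpxenc", "--codec=vp9", "-o", "/dev/null", "input.y4m"]  # placeholder

    def one_encode(_):
        subprocess.run(CMD, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    for n in (1, 2, 4, 8, 16, 32, 64):
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(one_encode, range(n)))
        elapsed = time.time() - start
        print(f"{n:3d} parallel encodes: {n / elapsed:.2f} encodes/s")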

    The article's comparison between VP9 and AV1 is interesting in light of his note that AV1 has no optimizations outside of x86-64. This makes sense as IBM moves ecosystem development from strictly its own space to OpenPOWER. I assume that "someone" in the OpenPOWER consortium will have to take on the AV1 optimizations that better leverage the platform.

    On another angle, IBM's alignment with NVidia on Volta co-processing for POWER9 platforms could make for a very interesting AV1 encode pairing. I haven't heard of anyone working on AV1 encode/decode using CUDA or OpenCL assist.

    Andrey @ Elecard says AV1 will stay in the tech field and won't make headway over HEVC in broadcast. He noted that unless the bitrate drops by 50%, broadcasters won't want to make a change. This means we will probably see a lot of AV1 usage first in video conferencing, screen sharing and other products that have to deliver higher quality over unmanaged, best-effort connections.

    Comment


    • #3
      For anyone interested, here is a recent test chart showing the progress x86 CPUs have made in software decoding VP9.

      Comment


      • #4
        Originally posted by edwaleni View Post
        On another angle, IBM's alignment with NVidia on Volta co-processing for POWER9 platforms could make for a very interesting AV1 encode pairing. I haven't heard of anyone working on AV1 encode/decode using CUDA or OpenCL assist.
        I thought GPUs aren't much help for encoding, because the bottleneck is mostly sequential: even parallelization on CPUs doesn't scale that well and requires threads to work on different chunks (blocks, frames, etc.?) of the video. I'd love to see GPU-accelerated software encoding; that would speed things up so much, especially since AV1 will probably be one of the slowest formats even after libaom is mostly optimized (I think?).

        Comment


        • #5
          Originally posted by edwaleni View Post
          On another angle, IBM's alignment with NVidia on Volta co-processing for POWER9 platforms could make for a very interesting AV1 encode pairing. I haven't heard of anyone working on AV1 encode/decode using CUDA or OpenCL assist.
          I thought GPUs don't really help software encoding, since the algorithm is hard to parallelize? I would love to see GPU-accelerated software encoding, since hardware encoding tends to be low quality (or so I've heard?)

          Comment


          • #6
            The hefty part is parallelizable to a degree that depends on the resolution (macroblocks of a frame; if it's not, the format is completely bonkers), though some serial decision making is needed, of course. You could always have a thread run ahead and detect scene changes (video codecs have to be able to jump to positions, so encoding has to restart every couple of seconds at least) to create completely separate parts of the video that can be encoded in parallel and copied together later. That needs more RAM/disk than just encoding a stream on the fly, of course.
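
            In code terms, a minimal sketch of that split-and-stitch idea might look like the following. It assumes ffmpeg built with libaom-av1 and a pre-computed list of scene-change timestamps (the "thread running ahead" above); file names and cut points are made up.

            Code:
            # Encode independent segments in parallel, then concatenate them.
            # Assumes scene-change timestamps have already been detected.
            import subprocess
            from concurrent.futures import ProcessPoolExecutor

            SOURCE = "input.y4m"                      # hypothetical source file
            CUTS = [0.0, 12.4, 30.0, 55.2, 80.0]      # made-up scene-change times (seconds)

            def encode_segment(args):
                idx, (start, end) = args
                out = f"seg_{idx:04d}.mkv"
                subprocess.run([
                    "ffmpeg", "-y", "-i", SOURCE,
                    "-ss", str(start), "-to", str(end),
                    "-c:v", "libaom-av1", "-crf", "30", "-b:v", "0",
                    "-an", out,
                ], check=True)
                return out

            if __name__ == "__main__":
                spans = list(enumerate(zip(CUTS[:-1], CUTS[1:])))
                with ProcessPoolExecutor() as pool:
                    outs = list(pool.map(encode_segment, spans))
                # Stitch the pieces back together with the concat demuxer.
                with open("segments.txt", "w") as f:
                    f.writelines(f"file '{o}'\n" for o in outs)
                subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                                "-i", "segments.txt", "-c", "copy", "output.mkv"],
                               check=True)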

            Comment


            • #7
              Let's not forget the reason why encoding is slower than decoding: The encoder makes all the decisions! If the format has a lot of coding tools, like AV1, that multiplies up to a ginormous search space for the encoder. It is easy to make an efficient but slow encoder by searching the space with brute force. That's how you get an encoder that is 1000 times slower than real time.

              Such a brute force encoder is embarrassingly parallel: Take block sizes as an example: You can encode each frame simultaneously in blocks of 64×64, 32×32, 16×16, 8×8, 4×4 and all possible combinations of them (oh and AV1 has rectangular blocks too) and select the best encoding for each frame. You get the idea.
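
              To make the block-size part concrete, here is a toy sketch of such a brute-force square-partition search. rd_cost is a hypothetical stand-in for a real rate-distortion measurement, and real encoders prune this search heavily (AV1's rectangular splits are omitted here).

              Code:
              # Toy brute-force partition search: code each square block either
              # whole or split into four quadrants, recursively, keeping the
              # cheaper choice. Every candidate can be evaluated independently,
              # which is what makes the search embarrassingly parallel.
              import numpy as np

              def rd_cost(block):
                  # Hypothetical stand-in: variance as a crude proxy for coding
                  # cost, plus a fixed signalling cost per block.
                  return float(np.var(block)) * block.size + 16.0

              def best_partition(block, min_size=4):
                  whole = rd_cost(block)
                  n = block.shape[0]
                  if n <= min_size:
                      return whole, "leaf"
                  h = n // 2
                  quads = [block[:h, :h], block[:h, h:], block[h:, :h], block[h:, h:]]
                  split = sum(best_partition(q, min_size)[0] for q in quads)
                  return (whole, "whole") if whole <= split else (split, "split")

              block = np.random.randint(0, 255, (64, 64))   # fake 64x64 luma block
              print(best_partition(block))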

              It is even easier to make a fast but inefficient encoder by simply dropping features. That's why consumer hardware encoders are bad quality.
              Last edited by andreano; 28 July 2018, 05:39 PM.

              Comment


              • #8
                Originally posted by busukxuan View Post

                I thought GPUs don't really help software encoding, since the algorithm is hard to parallelize? I would love to see GPU-accelerated software encoding, since hardware encoding tends to be low quality (or so I've heard?)
                Wow, I just did a repost a few hours later... I don't know exactly what was buggy, but I didn't see my first post after posting, and waited a few hours before checking again and it still wasn't there. Now there are two of them.

                Comment


                • #9
                  Originally posted by andreano View Post
                  Let's not forget the reason why encoding is slower than decoding: The encoder makes all the decisions! If the format has a lot of coding tools, like AV1, that translates to a ginormous search space for the encoder. It is easy to make an efficient but slow encoder by searching the space with brute force. That's how you get an encoder that is 1000 times slower than real time.

                  Such a brute force encoder is embarrassingly parallel: Take block sizes as an example: You can encode each frame simultaneously in blocks of 64×64, 32×32, 16×16, 8×8, 4×4 and all possible combinations of them (oh and AV1 has rectangular blocks too) and select the best encoding for each frame. You get the idea.

                  It is even easier to make a fast but inefficient encoder by simply dropping features. That's why consumer hardware encoders are bad quality.
                  I see. But a GPU still can't accelerate software encoding, is that right? I don't know GPUs well, but I can't imagine each shader unit trying a different encoding.

                  Comment


                  • #10
                    Originally posted by busukxuan View Post
                    But a GPU still can't accelerate software encoding, is that right?
                    Not without rewriting the software, but there should be plenty of opportunities for GPU offloading of AV1 encoding given its many coding tools.

                    I didn't even mention GPU offloading, as I was thinking strictly about the problem of saturating 1000 CPU cores (for diminishing returns). That's really a different question: for CPU parallelization, you're looking for coarse-grained parallelism, being able to split off sections that are as large as possible (because of thread synchronization overhead), whereas GPUs can only run very simple programs, so they need fine-grained parallelism. We're looking for different things, and they don't exclude each other.

                    Take chroma from luma for example: that involves things like calculating the average luma value of a block and then subtracting that from all its pixels. For a 32×32 block, that's 1024 pixels to add up, followed by 1024 subtractions. I'm no GPU programmer, but I see a lot of fine-grained parallelism here.
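
                    In numpy terms, that averaging-and-subtraction step alone (not the full CfL prediction) is just a reduction followed by a per-pixel subtraction, both of which map naturally onto many tiny GPU threads:

                    Code:
                    # Sketch of the step described above: subtract the block's
                    # mean luma from every pixel of a 32x32 block.
                    import numpy as np

                    luma = np.random.randint(0, 1024, (32, 32)).astype(np.float32)  # fake 10-bit samples
                    avg = luma.mean()          # the 1024-sample reduction
                    ac = luma - avg            # the 1024 independent subtractions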

                    Originally posted by busukxuan View Post
                    I can't imagine each shader unit trying a different encoding.
                    I can. Call it exploratory parallelism. It's like speculative execution, except instead of trying to predict the right branch, you take all branches at once in parallel. Whether some of those are GPU offloaded is a different question.
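
                    A loose illustration of "take all branches at once": evaluate every candidate coding choice concurrently and keep the cheapest. The candidate list and cost function below are hypothetical placeholders.

                    Code:
                    # Exploratory parallelism: try every candidate in parallel and
                    # keep the best, instead of predicting a single branch.
                    from concurrent.futures import ThreadPoolExecutor

                    candidates = ["64x64", "32x32 split", "16x16 split", "64x32 rect"]  # hypothetical

                    def try_encode(choice):
                        # Hypothetical stand-in returning (rd_cost, choice) for one branch.
                        return (hash(choice) % 1000, choice)

                    with ThreadPoolExecutor() as pool:
                        best_cost, best_choice = min(pool.map(try_encode, candidates))
                    print(best_choice, best_cost)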

                    Comment
