Intel Adds GPU-Accelerated Memory Copy Support To FFmpeg

  • Intel Adds GPU-Accelerated Memory Copy Support To FFmpeg

    Phoronix: Intel Adds GPU-Accelerated Memory Copy Support To FFmpeg

    Intel engineers have contributed GPU-accelerated memory copy support to FFmpeg when making use of their preferred video decode implementation...

  • #2
    Is this a different driver from their two VAAPI drivers, or is it the newer VAAPI driver?

    • #3
      Do I misunderstand or is the common belief that best performance comes when you don't make any unnecessary copies in the first place?
      "Yes, we're inefficient, but we optimized the inefficient parts so they're not as wasteful."

      • #4
        Originally posted by willmore View Post
        Do I misunderstand or is the common belief that best performance comes when you don't make any unnecessary copies in the first place?
        "Yes, we're inefficient, but we optimized the inefficient parts so they're not as wasteful."
        Oh yes, they surely missed that. Now that you said it, they will discover that those copies could be avoided just as easily.

        Edit: sorry, didn't mean to be so sarcastic. Just wanted to say: I'm sure they thought about that, but here they are optimizing copies that are hard to avoid.
        Last edited by treba; 09 October 2019, 08:54 AM.

        • #5
          AMD's APUs should be capable of something similar; does anyone know more about their support for this feature in FFmpeg?

          • #6
            Originally posted by treba View Post

            Oh yes, they surely missed that. Now that you said it, they will discover that those copies could be avoided just as easily.

            Edit: sorry, didn't mean to be so sarcastic. Just wanted to say: I'm sure they thought about that, but here they are optimizing copies that are hard to avoid.
            Thanks for the edit. I understand what you're saying, and I agree that optimizing things that can't be avoided is a useful thing to do. But my argument comes more from what I've learned from watching the Amlogic and Allwinner video decoding drivers being developed. In those chips (except where they messed up and broke things by accident), a large amount of effort goes into having zero copies in the video decode pipeline. For SoCs it's because they have so little memory bandwidth to start with that copies quickly eat into what little they have, and because they're often designed for mobile applications where minimizing power usage is critical.

            Intel has far more compute resources to draw on, but the argument is the same--any time you go to memory, you use power. The same decoder is in mobile parts as well as desktop parts, so it seems important to be power efficient.

            The conclusion is that *if the decoder->display pipeline requires a lot of copies*, then it was designed wrong to start with. I get that we're past that stage and into "making the best of what we have", and that's commendable. I also appreciate that the people who design the chips aren't the ones who end up coding the drivers. So I give Intel credit for this optimization, but at the same time Intel deserves scorn for designing it to *need* copies in the first place.

            Some of it may be the fault of the API: if it doesn't support the buffer types the hardware can handle, that forces copy or transform/copy steps that wouldn't have been necessary if the software weren't imposing its own way of doing things on the hardware. If that's the case, the same argument stands. Thanks for making things better, but the real solution is to fix the API. And that's a real option--look at all the work being done by the people doing the Allwinner decoder support. They had to come up with a whole new API because their hardware is fundamentally different from the normal PC/GPU style of video decoding.

            Summary: Thank you to the people at Intel who coded up this optimization. I hope it benefits lots of people. But also, could you smack the hardware people around a bit and get them to design things to work better or find out how it's supposed to be used and code for that instead?
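
            To illustrate the zero-copy idea in FFmpeg terms (a hedged sketch of my own, not anything from the Intel patch; the function name `get_cpu_view` and the NV12 choice are assumptions): the public hwcontext API already lets an application try to map a decoded hardware surface into CPU-visible memory with `av_hwframe_map()` and only fall back to a real download when the driver can't map that surface type.

            ```c
            /* Hedged sketch: the "zero copy" alternative -- map the decoded hardware
             * surface into the CPU's address space instead of copying it, and only
             * pay for a real download when the driver can't map that surface type. */
            #include <libavutil/frame.h>
            #include <libavutil/hwcontext.h>

            static int get_cpu_view(AVFrame *dst, const AVFrame *hw_frame)
            {
                int ret;

                /* NV12 is the usual layout for these decode surfaces; adjust as needed. */
                dst->format = AV_PIX_FMT_NV12;

                /* Try a direct read-only mapping first: no memcpy if it's supported. */
                ret = av_hwframe_map(dst, hw_frame, AV_HWFRAME_MAP_READ);
                if (ret >= 0)
                    return 0;

                /* Mapping isn't supported for this surface, so fall back to a copy
                 * (the transfer picks a default output format when none is set). */
                av_frame_unref(dst);
                return av_hwframe_transfer_data(dst, hw_frame, 0);
            }
            ```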

            • #7
              My understanding of this change is a lot more benign: FFmpeg needs to copy the compressed H.264 stream into the iGPU's onboard memory pool, and then read the iGPU's uncompressed output from that memory pool back into RAM so that FFmpeg (which is primarily CPU-based) can operate on it. This commit merely adds support for a QSV path that lets the iGPU manage that copy (`ff_qsv_get_continuous_buffer`) rather than the generic `ff_get_buffer` function written by FFmpeg. The QSV call is likely to be faster.
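
              For anyone curious what that copy looks like from the application side, here is a hedged sketch against FFmpeg's public API rather than the internal `ff_qsv_get_continuous_buffer`/`ff_get_buffer` paths the commit actually touches. The function names `get_qsv_format` and `decode_and_download` are my own, error handling is trimmed, and it assumes a build with QSV enabled: the decoder hands back GPU-side surfaces, and `av_hwframe_transfer_data()` performs the GPU-to-RAM download being discussed.

              ```c
              /* Hedged sketch: decode H.264 on the iGPU via h264_qsv, then download each
               * decoded surface into system memory so CPU code can touch it.  This is the
               * copy the thread is talking about, expressed with public API calls. */
              #include <libavcodec/avcodec.h>
              #include <libavutil/hwcontext.h>

              /* Ask the decoder for hardware (QSV) surfaces instead of NV12 frames in RAM. */
              static enum AVPixelFormat get_qsv_format(AVCodecContext *ctx,
                                                       const enum AVPixelFormat *fmts)
              {
                  (void)ctx;
                  for (; *fmts != AV_PIX_FMT_NONE; fmts++)
                      if (*fmts == AV_PIX_FMT_QSV)
                          return *fmts;
                  return AV_PIX_FMT_NONE;                 /* QSV surfaces not offered */
              }

              static int decode_and_download(AVCodecContext *dec, const AVPacket *pkt)
              {
                  AVFrame *hw_frame = av_frame_alloc();   /* surface owned by the iGPU side */
                  AVFrame *sw_frame = av_frame_alloc();   /* system-memory copy for the CPU */
                  int ret = avcodec_send_packet(dec, pkt);

                  while (ret >= 0) {
                      ret = avcodec_receive_frame(dec, hw_frame);
                      if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) {
                          ret = 0;                        /* needs more input / fully drained */
                          break;
                      }
                      if (ret < 0)
                          break;                          /* real decode error */

                      /* The copy under discussion: GPU-side surface -> system RAM. */
                      ret = av_hwframe_transfer_data(sw_frame, hw_frame, 0);
                      if (ret < 0)
                          break;

                      /* ... hand sw_frame to CPU-side filters/encoders here ... */
                      av_frame_unref(hw_frame);
                      av_frame_unref(sw_frame);
                  }

                  av_frame_free(&hw_frame);
                  av_frame_free(&sw_frame);
                  return ret;
              }

              int main(void)
              {
                  AVBufferRef *hw_dev = NULL;
                  const AVCodec *codec = avcodec_find_decoder_by_name("h264_qsv");
                  AVCodecContext *dec;

                  if (!codec)
                      return 1;                           /* FFmpeg built without QSV */
                  dec = avcodec_alloc_context3(codec);
                  dec->get_format = get_qsv_format;

                  /* Create a QSV device so the decode work actually runs on the iGPU. */
                  if (av_hwdevice_ctx_create(&hw_dev, AV_HWDEVICE_TYPE_QSV, NULL, NULL, 0) < 0)
                      return 1;
                  dec->hw_device_ctx = av_buffer_ref(hw_dev);

                  if (avcodec_open2(dec, codec, NULL) < 0)
                      return 1;

                  /* ... demux packets and feed them to decode_and_download(dec, pkt) ... */

                  avcodec_free_context(&dec);
                  av_buffer_unref(&hw_dev);
                  return 0;
              }
              ```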

              • #8
                Originally posted by ms178 View Post
                AMD's APUs should be capable of something similar; does anyone know more about their support for this feature in FFmpeg?
                I want to know too. Any AMD people able to reply?

                • #9
                  Originally posted by sturmen View Post
                  My understanding of this change is a lot more benign: FFmpeg needs to copy the compressed H.264 stream into the iGPU's onboard memory pool, and then read the iGPU's uncompressed output from that memory pool back into RAM so that FFmpeg (which is primarily CPU-based) can operate on it.
                  Given that system & video memory are the same physical RAM (in the iGPU case, which is currently the only one), this only makes sense to me if FFmpeg doesn't know how to manage or use Intel's buffers.

                  The only argument I can see for why the copy might be strictly necessary is that pre-Broadwell iGPUs didn't support shared memory between CPU & GPU. Of course, that assumes your app needs access to the output frame before it's displayed on screen. And what blows a hole in that explanation is that a non-GPU version of the copy already exists as the starting point.

                  Anyway, if you're just going to display it after decoding, then just teach FFmpeg how to manage Intel's buffers and leave the data in "video" memory.
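
                  As a hedged sketch of that "leave it in video memory" path (my own illustration; `open_gpu_encoder` and the parameter choices are assumptions): once the decoder is handing back AV_PIX_FMT_QSV surfaces, they can be fed straight into a QSV encoder that shares the same hardware frames context, so no frame data ever round-trips through system RAM.

                  ```c
                  /* Hedged sketch: full-GPU transcode, no download.  The h264_qsv encoder is
                   * opened against the decoder's hardware frame pool, so decoded surfaces can
                   * be sent to it directly and the pixel data never leaves "video" memory. */
                  #include <libavcodec/avcodec.h>
                  #include <libavutil/hwcontext.h>

                  static AVCodecContext *open_gpu_encoder(const AVFrame *decoded_hw_frame)
                  {
                      const AVCodec *codec = avcodec_find_encoder_by_name("h264_qsv");
                      AVCodecContext *enc;

                      if (!codec)
                          return NULL;                          /* no QSV encoder in this build */
                      enc = avcodec_alloc_context3(codec);

                      enc->width         = decoded_hw_frame->width;
                      enc->height        = decoded_hw_frame->height;
                      enc->time_base     = (AVRational){1, 30};  /* example value */
                      enc->pix_fmt       = AV_PIX_FMT_QSV;       /* stay in GPU memory */
                      enc->hw_frames_ctx = av_buffer_ref(decoded_hw_frame->hw_frames_ctx);

                      if (avcodec_open2(enc, codec, NULL) < 0) {
                          avcodec_free_context(&enc);
                          return NULL;
                      }
                      return enc;
                  }

                  /* Usage: avcodec_send_frame(enc, decoded_hw_frame) -- no copy back to RAM. */
                  ```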

                  • #10
                    Originally posted by coder View Post
                    Given that system & video memory are the same physical RAM (in the iGPU case, which is currently the only one), this only makes sense to me if FFmpeg doesn't know how to manage or use Intel's buffers.
                    Some of Intel's iGPUs have their own memory; for instance, the Intel Iris Plus 640 (Kaby Lake, found in the NUC7i5 among others) has 64MB of eDRAM.
