Intel Adds GPU-Accelerated Memory Copy Support To FFmpeg
-
Originally posted by AluminumGriffin:
    Some of Intel's iGPUs have their own memory; for instance, the Intel Iris Plus 640 (Kaby Lake, e.g. in the NUC7i5) has 64 MB of eDRAM.
Last edited by coder; 09 October 2019, 02:01 PM.
-
Originally posted by coder:
    Given that system & video memory are the same physical RAM (in the iGPU case - the only one, currently), this only makes sense to me if ffmpeg doesn't know how to manage or use Intel's buffers.
The only argument I can see for why the copy might be strictly necessary is that pre-Broadwell iGPUs didn't support shared memory between the CPU & GPU. Of course, that assumes your app needs access to the output frame before it's displayed on screen. And what blows a hole in that explanation is that a non-GPU version of the copy already exists as a starting point.
Anyway, if you're just going to display the frame after decoding, then teach ffmpeg how to manage Intel's buffers and leave the data in "video" memory.
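To make the trade-off concrete, here is a back-of-the-envelope sketch in Python of how much data each path moves per frame. The frame size, pixel format, and the exact copy steps are illustrative assumptions, not FFmpeg's actual code paths:

```python
# Toy model: bytes moved per frame for two decode pipelines.
# Assumes 1080p NV12 output (1.5 bytes per pixel); purely illustrative.
WIDTH, HEIGHT = 1920, 1080
FRAME_BYTES = int(WIDTH * HEIGHT * 1.5)  # NV12: full-res Y plane + half-res interleaved UV

def copy_back_pipeline():
    """Decode into video memory, copy to system RAM, then upload again for display."""
    moved = 0
    moved += FRAME_BYTES  # GPU decoder writes the frame into video memory
    moved += FRAME_BYTES  # copy video memory -> system RAM so the CPU can touch it
    moved += FRAME_BYTES  # upload system RAM -> video memory again for display
    return moved

def zero_copy_pipeline():
    """Decode into video memory and hand the same surface straight to the display."""
    return FRAME_BYTES  # the decoder's one write; no further copies

print(copy_back_pipeline() / zero_copy_pipeline())  # 3.0: copy-back moves 3x the data
```

Under these assumptions the copy-back path triples the per-frame memory traffic, which is the whole point of leaving frames in "video" memory when nothing on the CPU needs them.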
-
Originally posted by willmore:
    Do I misunderstand, or is the common belief that the best performance comes from not making any unnecessary copies in the first place?
"Yes, we're inefficient, but we optimized the inefficient parts so they're not as wasteful."
-
Originally posted by sturmen:
    My understanding of this change is a lot more benign: ffmpeg needs to copy the compressed H.264 stream into the iGPU's onboard memory pool, and then read the iGPU's uncompressed output from that embedded memory pool back into RAM so that ffmpeg (which is primarily CPU-based) can operate on it. This commit merely adds support for a QSV API call that lets the iGPU manage that copy (`ff_qsv_get_continuous_buffer`) rather than the generic `ff_get_buffer` function written by FFmpeg. The QSV call is likely to be faster.
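As a rough picture of what a "continuous buffer" buys, here is a small Python model of the idea only; the real `ff_qsv_get_continuous_buffer` is C code inside FFmpeg's QSV layer, and the helper names below are made up for illustration. Instead of allocating each plane of a frame separately, one contiguous block is allocated and the planes are views into it:

```python
# Sketch of per-plane vs. contiguous frame allocation for an NV12 frame.
# Illustrative only; the real QSV code manages hardware surfaces in C.
WIDTH, HEIGHT = 1920, 1080

def alloc_per_plane():
    """Generic path: each plane gets its own, possibly scattered, buffer."""
    y_plane = bytearray(WIDTH * HEIGHT)        # luma plane
    uv_plane = bytearray(WIDTH * HEIGHT // 2)  # interleaved chroma at half height
    return [y_plane, uv_plane]                 # two independent allocations

def alloc_continuous():
    """Continuous path: one allocation; planes are zero-copy views at fixed offsets."""
    frame = bytearray(WIDTH * HEIGHT * 3 // 2)  # whole NV12 frame in one block
    y_view = memoryview(frame)[: WIDTH * HEIGHT]
    uv_view = memoryview(frame)[WIDTH * HEIGHT :]
    return frame, y_view, uv_view

frame, y, uv = alloc_continuous()
assert len(y) + len(uv) == len(frame)  # the views tile the single allocation exactly
```

With one contiguous block, the whole frame can be handed around (or DMA'd) as a single region instead of plane by plane, which is presumably why letting the QSV side manage it can beat the generic allocator.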
-
Originally posted by treba:
    Oh yes, they surely missed that. Now that you've said it, they will discover that those copies could be avoided just as easily.
    Edit: sorry, didn't mean to be so sarcastic. Just wanted to say: I'm sure they thought about that, but here they are optimizing copies that are hard to avoid.
For Intel, they have much more compute resources to draw on, but the argument is the same: every trip to memory costs power. The same decoder is in mobile parts as well as desktop parts, so it seems important to be power efficient.
The conclusion is that *if the decoder->display pipeline requires a lot of copies*, then it was designed wrong to start with. I get that we're past that stage and into "making the best of what we have", and that's commendable. I also appreciate that the people who design the chips aren't the ones who end up coding the drivers. So I give Intel credit for this optimization, but at the same time Intel deserves scorn for designing hardware that *needs* copies in the first place.
Possibly some of this is the fault of the API: it may not support the buffer types the hardware can handle, forcing copy or transform/copy steps that wouldn't have been necessary if the software didn't impose its own way of doing things on the hardware. If that's the case, the same argument stands: thanks for making things better, but the real solution is to fix the API. And that's a real option - look at all the work being done by the people doing the Allwinner decoder support. They had to come up with a whole new API because their hardware is fundamentally different from the normal PC/GPU type of video decoding.
Summary: thank you to the people at Intel who coded up this optimization; I hope it benefits lots of people. But also, could you smack the hardware people around a bit and get them either to design things to work better, or to find out how the hardware is supposed to be used and code for that instead?
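To put a number on the "every trip to memory costs power" point, here is a quick estimate of the avoidable traffic at 4K60. The frame format, copy count, and read-plus-write accounting are all illustrative assumptions, not measurements of Intel's hardware:

```python
# Rough memory-traffic estimate for the extra copies at 4K60, NV12.
# All figures are illustrative assumptions, not measurements.
WIDTH, HEIGHT, FPS = 3840, 2160, 60
FRAME_BYTES = WIDTH * HEIGHT * 3 // 2  # NV12 is 1.5 bytes per pixel
COPIES_PER_FRAME = 2                   # e.g. video mem -> RAM, then RAM -> video mem

# Each copy both reads and writes the frame, so it crosses the memory bus twice.
traffic_per_sec = FRAME_BYTES * COPIES_PER_FRAME * 2 * FPS
print(f"{traffic_per_sec / 1e9:.1f} GB/s of avoidable memory traffic")  # 3.0 GB/s
```

Roughly 3 GB/s of extra DRAM traffic just to shuffle frames around: small next to total memory bandwidth, but a steady power cost on a battery-powered part, which is why copies you can't eliminate are still worth accelerating.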