Announcement

**microcode** · 09 October 2019, 11:52 AM

> Originally posted by willmore:
>
> Do I misunderstand, or is the common belief that the best performance comes from not making any unnecessary copies in the first place?
> "Yes, we're inefficient, but we optimized the inefficient parts so they're not as wasteful."

In this case, the copy is needed because ffmpeg wants to do something with it on the CPU, in the ffmpeg process's address space, which typically can't operate directly on the GPU memory.

**sturmen** · 09 October 2019, 11:55 AM

> Originally posted by coder:
>
> Given that system & video memory are the same physical RAM (in the iGPU case, which is currently the only one), this only makes sense to me if ffmpeg doesn't know how to manage or use Intel's buffers.
>
> The only argument I can see for why the copy might be strictly necessary is that pre-Broadwell iGPUs didn't support shared memory between CPU & GPU. Of course, that assumes your app needs access to the output frame before it's displayed on screen. And what blows a hole in that explanation is that a non-GPU version of the copy exists as a starting point.
>
> Anyway, if you're just going to display it after decoding, then just teach ffmpeg how to manage Intel's buffers and leave the data in "video" memory.

I won't pretend to know the inner workings of the Intel silicon package; I'm basically extrapolating from the commit message ("GPU copy enables or disables GPU accelerated copying between video and system memory.") and my knowledge of how ffmpeg exposes hwaccel decoding + filtering in its filtergraph.

**syrjala** · 09 October 2019, 01:49 PM

> Originally posted by AluminumGriffin:
>
> Some of Intel's iGPUs have their own memory; for example, the Intel Iris Plus 640 (Kaby Lake, as in the NUC7i5) has 64MB of eDRAM.

That's actually just a cache.

**coder** · 09 October 2019, 01:58 PM

> Originally posted by AluminumGriffin:
>
> Some of Intel's iGPUs have their own memory; for example, the Intel Iris Plus 640 (Kaby Lake, as in the NUC7i5) has 64MB of eDRAM.

I'm guessing that, to the extent it's used for video decompression, it's private to the driver.

**coder** · 09 October 2019, 01:59 PM

> Originally posted by syrjala:
>
> That's actually just a cache.

Was, but not in Skylake.

Intel Adds Crystal Well-based Skylake-R Processors: 65W with 128MB eDRAM

https://www.anandtech.com/show/10281/intel-adds-crystal-well-skylake-processors-65w-edram

**coder** · 09 October 2019, 02:02 PM

> Originally posted by microcode:
>
> In this case, the copy is needed because ffmpeg wants to do something with it on the CPU, in the ffmpeg process's address space, which typically can't operate directly on the GPU memory.

Context is everything. The new memcpy replaces an existing userspace, CPU-based one. That should tell you that this has nothing to do with the buffer memory being outside the process's address space.

**syrjala** · 09 October 2019, 02:40 PM

> Originally posted by coder:
>
> Was, but not in Skylake.
> https://www.anandtech.com/show/10281...sors-65w-edram

It's still a cache, even though it sits in a slightly different position in the topology. One interesting upside of the new arrangement is that the display engine can now "see" the eDRAM, so your scanout buffers can remain eLLC-cacheable. Currently i915 doesn't allow that, though. I have a pending patch to enable it, but I'm not quite 100% sure it's a good idea.

**fuzz** · 09 October 2019, 03:25 PM

Shame HSA never really caught on :/

**Guest** · 09 October 2019, 04:05 PM

> Originally posted by coder:
>
> Given that system & video memory are the same physical RAM (in the iGPU case, which is currently the only one), this only makes sense to me if ffmpeg doesn't know how to manage or use Intel's buffers.
>
> The only argument I can see for why the copy might be strictly necessary is that pre-Broadwell iGPUs didn't support shared memory between CPU & GPU. Of course, that assumes your app needs access to the output frame before it's displayed on screen. And what blows a hole in that explanation is that a non-GPU version of the copy exists as a starting point.
>
> Anyway, if you're just going to display it after decoding, then just teach ffmpeg how to manage Intel's buffers and leave the data in "video" memory.

True, I believe VAAPI already supports something like this. You can get a handle to the surface or buffer (and I think it's somehow tied to the kernel DRM and maybe DRM PRIME) and use it in other places like Wayland, OpenGL, Vulkan etc.

So in the case of an iGPU, or if the decoded data is just going to be displayed, there's no reason to perform copies and move it around.

**ms178** · 10 October 2019, 07:23 AM

> Originally posted by fuzz:
>
> Shame HSA never really caught on :/

Maybe we'll see a comeback of similar functionality on the AMD side soon; just think of their chiplet approach with HBM on the same package. That implies HSA-like functionality not only on APUs but on their next-gen high-performance cores as well. I guess Intel's oneAPI approach, with SYCL built on LLVM, would help the software ecosystem as a whole move dGPU + APU use for GPGPU tasks forward, since it could also be targeted by AMD and other vendors.

Also, the industry is now zeroing in on CXL as a cache-coherent protocol standard for connecting several devices together. That is another important ingredient in the overall picture.

Intel Adds GPU-Accelerated Memory Copy Support To FFmpeg