Intel Is Still Working On G45 VA-API Video Acceleration


  • curaga
    replied
    Out of curiosity, what kind of speeds are you getting now?


  • Veerappan
    replied
    Originally posted by bridgman View Post
    If it wasn't for latency-hiding you would be absolutely correct, and the number of cores does determine the number of threads that can execute in any single clock cycle (2 SIMDs, each executing 8 threads / clock, with up to 5 simultaneous instructions per thread per clock for a total of 80).

    Running more threads than the minimum allows you to make much better use of available memory bandwidth, however, since any thread blocked on a memory access can simply idle while another unblocked thread runs. IIRC this was introduced in the r5xx series, described by marketing as the "Ultra-Threaded Dispatch Processor".
    What bridgman said.

    The decoder I've got will launch anywhere from 16 to 336 threads at a given time, which would be fine, except for two things:
    1. Memory Access Stalls
    2. Kernel Launch Latency


    When you perform a memory access, you might be waiting around for several hundred GPU clock cycles (when reading from the graphics card's memory). During this time, the GPU usually tries to swap to another set of threads, similar to hyper-threading on an Intel CPU. The difference is that instead of one thread stalling, an entire group of threads stalls. On Nvidia, threads move together in groups of 32 (a warp), so a single read will stall 32 threads.

    I'm not sure about AMD as none of my ATI graphics cards (3200/4200/4770) support the byte_addressable_store extension the decoder needs, but given bridgman's number of 8 threads per SIMD, I'm guessing that 8 threads get stalled by a single read, which means up to 40 vector calculations would get stalled.
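
    As a rough illustration of the stall argument above, here is a minimal Python sketch of how many thread groups it takes to keep the ALUs busy while other groups wait on memory. The model is a simplification: each group is assumed to alternate between a fixed stretch of compute and one memory read. The 400-cycle latency and 20 compute cycles per read are made-up example values, not measurements from the decoder.

```python
import math

def groups_needed(mem_latency_cycles, compute_cycles_per_access):
    """Thread groups required so that some group always has work
    while the others wait on memory (simplified model)."""
    return math.ceil(mem_latency_cycles / compute_cycles_per_access) + 1

def threads_needed(group_size, mem_latency_cycles, compute_cycles_per_access):
    """Same as groups_needed, expressed in threads."""
    return group_size * groups_needed(mem_latency_cycles, compute_cycles_per_access)

def alu_utilization(groups, mem_latency_cycles, compute_cycles_per_access):
    """Fraction of time the ALUs are busy with `groups` groups resident."""
    period = compute_cycles_per_access + mem_latency_cycles
    return min(1.0, groups * compute_cycles_per_access / period)

# Assumed example values: 400-cycle read latency, 20 cycles of math per read.
nvidia_threads = threads_needed(32, 400, 20)  # groups of 32, as described above
amd_threads = threads_needed(8, 400, 20)      # groups of 8, per the guess above
```

    Under these assumptions a single SIMD already wants hundreds of threads resident, which is why a launch of a few hundred threads spread over the whole GPU leaves most of it idle.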

    The other factor is how long it takes to start running an OpenCL kernel. Normally on a CPU, function calls take a few cycles to fire up. You set the arguments, back up any registers you're about to trash, and jump to a new instruction pointer (IP) value.

    In OpenCL you get the additional fun of having to send the function call request into the CL library, which sends it to the graphics driver, which queues it for execution, then sends it over the PCI-express bus to the GPU, and then the GPU finally starts executing it. In most cases, this latency can be fairly low (several tens of clock cycles), but in some cases, such as my laptop (GF9400, Ubuntu 10.10, Nv blob) I've seen latencies which will occasionally spike up to several thousand clock cycles.

    In order to minimize this start-up cost, it's usually good to either do a lot of work in a given thread, or launch a TON of threads which would take the CPU a long time to loop over. The danger of the first option is that it's hard to write long kernels that don't branch a lot (another performance killer). The second is generally preferred, but as I said, the most I've managed is 336 threads in a single launch, and that's not exactly the common case.
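
    To put rough numbers on the start-up cost, here is a small Python sketch of how much of a launch is useful work versus launch overhead. All inputs are assumptions for illustration: 80 lanes (bridgman's figure), 100 cycles of work per thread, and a 5000-cycle launch latency standing in for the "several thousand" spike case.

```python
import math

def kernel_cycles(n_threads, lanes, cycles_per_thread):
    """Cycles to run the kernel if threads execute in full-width waves."""
    waves = math.ceil(n_threads / lanes)
    return waves * cycles_per_thread

def launch_efficiency(n_threads, lanes, cycles_per_thread, launch_cycles):
    """Fraction of total time spent on useful work rather than launch overhead."""
    work = kernel_cycles(n_threads, lanes, cycles_per_thread)
    return work / (work + launch_cycles)

# Assumed example values, not measurements.
small_launch = launch_efficiency(336, 80, 100, 5000)
large_launch = launch_efficiency(32768, 80, 100, 5000)
```

    With these inputs the 336-thread launch spends most of its time waiting to start, while the 32k-thread launch is dominated by actual work.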

    Usually with Nvidia hardware, you want thousands of threads in flight. Most Nvidia GPUs can handle something like 16k-32k simultaneous threads.


  • bridgman
    replied
    If it wasn't for latency-hiding you would be absolutely correct, and the number of cores does determine the number of threads that can execute in any single clock cycle (2 SIMDs, each executing 8 threads / clock, with up to 5 simultaneous instructions per thread per clock for a total of 80).

    Running more threads than the minimum allows you to make much better use of available memory bandwidth, however, since any thread blocked on a memory access can simply idle while another unblocked thread runs. IIRC this was introduced in the r5xx series, described by marketing as the "Ultra-Threaded Dispatch Processor".
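
    The arithmetic behind the figure of 80 can be written out as a short Python sketch. The variable names are only labels for the numbers quoted above.

```python
# Figures quoted in the post above.
simds = 2                # SIMD units
threads_per_simd = 8     # threads each SIMD executes per clock
ops_per_thread = 5       # simultaneous instructions per thread per clock

threads_per_clock = simds * threads_per_simd          # threads issued each clock
ops_per_clock = threads_per_clock * ops_per_thread    # peak operations each clock
```

    So 80 is the peak number of operations per clock, while the number of threads that actually execute in a given clock is 16; any further threads exist to cover latency.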


  • curaga
    replied
    Originally posted by bridgman View Post
    You normally want a lot more threads than shader cores to cover latency (memory accesses for texture fetches etc.).
    Oh, I thought the number of shader cores was the maximum number of GPU threads one could run.


  • bridgman
    replied
    You normally want a lot more threads than shader cores to cover latency (memory accesses for texture fetches etc.).


  • curaga
    replied
    Originally posted by Veerappan View Post
    (max of ~320 parallel threads, which isn't nearly enough).
    320 threads isn't enough? To be useful it should run on HW having only 80 shaders (such as the current Bobcat), no?


  • Veerappan
    replied
    Originally posted by popper View Post
    Why didn't you say something earlier, rather than waiting 3-4 months? Just read the x264 IRC log
    http://akuvian.org/src/x264/freenode-x264dev.log.bz2
    and you'll realise that all the Gfx developers who have tried to date, including the Nvidia Pro dev, ran away for exactly that reason.

    search for
    < Dark_Shikari> because all people who try are eaten by the cuda monster
    cuda
    opengl
    parallelize
    2010-03-24 16:48:05 < Dark_Shikari> basically it needs to demonstrate the implementation of a highly-parallelizable ME algorithm, like hierarchical

    to get the idea, then realise they still think it's usable in even a limited form for encoding (better to do something useful than let that Gfx sit there unused). Come up with a viable Gfx algorithm or two, then provide even a simple prototype patch for x264 to start with and get feedback in their IRC dev channel without running away, then have that patch ported to the ffmpeg decoder as a beginning.

    You may think it's an odd way to do it (read the log, learn from it, write a Gfx patch to improve the x264 encoder, then have that ported to ffmpeg), but it has proven time and again to be the best option (ffmpeg will soon get the latest AVX assembly ported from x264 almost unchanged). They know the video specs inside out and contribute directly to ffmpeg too, as that is their preferred decode code base, and you can learn a lot to take away and use elsewhere.

    just a thought for you to consider anyway.

    One key difference: most of what was in that IRC log (and the linked git repositories) is talking about ENcoding. My project has been for DEcoding. I've managed to convert Subpixel Prediction (sixtap + bilinear), IDCT, Dequantization, and the VP8 Loop filter, but I haven't had a chance to really rip apart and rebuild the algorithms from scratch in a parallel manner, so most of the CL code executes fairly serially (max of ~320 parallel threads, which isn't nearly enough). As for entropy decoding and detokenizing, those aren't really bottlenecks for VP8 in most videos (the highest I've seen detokenizing reach was 20% of decoding time, and entropy decoding was normally <5%).
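
    For a sense of what those percentages mean as an upper bound, here is a small Python sketch using Amdahl's law. It assumes, purely for illustration, that detokenizing (20%) and entropy decoding (5%) stay serial on the CPU in the worst case and that everything else speeds up by some factor on the GPU. Those are not measured results from the decoder.

```python
def amdahl_speedup(serial_fraction, parallel_speedup):
    """Overall speedup when only the non-serial part gets faster (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)

# Assumed worst case: 20% detokenize + 5% entropy decoding remain serial.
serial = 0.20 + 0.05

ten_times = amdahl_speedup(serial, 10.0)       # rest of the decoder 10x faster
ceiling = amdahl_speedup(serial, float("inf")) # rest of the decoder takes no time
```

    Even in that worst case the ceiling is a 4x overall speedup, so the serial stages only start to matter once the converted stages run much faster than they do now.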

    As for why I didn't say anything earlier: I've been mentioning the project occasionally since last summer, and went through the thesis proposal process back in June-August after getting the ok from Jim Bankoski (former chief of On2) on the WebM Project mailing list. I've been working on it for a while, but didn't really start coding until a few months ago (Nov/Dec). And well, I could've mentioned it on the ffmpeg mailing list, but I figured it'd be best to go straight upstream to the source instead.

    Edit: And honestly, I can kinda see why Cuda gets preferred by some developers over OpenCL..


  • popper
    replied
    Originally posted by Veerappan View Post
    Tell me about it... I've spent the last 3-4 months discovering how difficult it is to parallelize video decoding (with VP8). I've got a functional OpenCL VP8 decoder, but Functional != Fast.
    Why didn't you say something earlier, rather than waiting 3-4 months? Just read the x264 IRC log
    http://akuvian.org/src/x264/freenode-x264dev.log.bz2
    and you'll realise that all the Gfx developers who have tried to date, including the Nvidia Pro dev, ran away for exactly that reason.

    search for
    < Dark_Shikari> because all people who try are eaten by the cuda monster
    cuda
    opengl
    parallelize
    2010-03-24 16:48:05 < Dark_Shikari> basically it needs to demonstrate the implementation of a highly-parallelizable ME algorithm, like hierarchical

    to get the idea, then realise they still think it's usable in even a limited form for encoding (better to do something useful than let that Gfx sit there unused). Come up with a viable Gfx algorithm or two, then provide even a simple prototype patch for x264 to start with and get feedback in their IRC dev channel without running away, then have that patch ported to the ffmpeg decoder as a beginning.

    You may think it's an odd way to do it (read the log, learn from it, write a Gfx patch to improve the x264 encoder, then have that ported to ffmpeg), but it has proven time and again to be the best option (ffmpeg will soon get the latest AVX assembly ported from x264 almost unchanged). They know the video specs inside out and contribute directly to ffmpeg too, as that is their preferred decode code base, and you can learn a lot to take away and use elsewhere.

    just a thought for you to consider anyway.


  • popper
    replied
    Originally posted by deanjo View Post
    So how again can Intel provide accelerated H.264 support with open drivers, but when it comes to AMD it becomes a "legal issue"? w00t 5k posts!
    Did you forget, deanjo? It's all about that pesky, long-promised and long-overdue 'UVD review' result we are still waiting on.

    It's time again to look up the date it was first mentioned here; I think it's coming up on 12 months now...


  • gbeauche
    replied
    Originally posted by const View Post
    I don't think the information in this news is entirely correct. I clearly recall that a few months ago there was a driver and test report on http://intellinuxgraphics.org stating that VA-API is supported and working on "GMA 4500MHD", and I even remember checking Wikipedia for the device hardware IDs, because 0x2A42 and 0x2A43 were mentioned as the working ones. So, even though I didn't test it personally and haven't looked again on http://intellinuxgraphics.org for the information I found there before, I have doubts about the correctness of the news, specifically regarding GM45 with device IDs 0x2A42 and 0x2A43.
    MPEG-2 VLD is already implemented on GMA 4500MHD. H.264 support is being worked on. I think it was also mentioned that VC-1 won't be supported on those older chips.
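
    For anyone who wants to check whether their machine has one of the device IDs mentioned in the quote, here is a minimal Python sketch that scans the Linux sysfs PCI tree. The function name `find_gm45` is made up for this example, and it assumes the usual `/sys/bus/pci/devices/<address>/vendor` and `device` files. It only reports that the hardware is present; it says nothing about whether the installed driver exposes VA-API on it.

```python
from pathlib import Path

INTEL_VENDOR_ID = 0x8086             # Intel's PCI vendor ID
GM45_DEVICE_IDS = {0x2A42, 0x2A43}   # the IDs mentioned in the quoted post

def find_gm45(sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses of devices matching the IDs above (hypothetical helper)."""
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for dev in sorted(root.iterdir()):
        try:
            # sysfs stores these as hex strings such as "0x8086".
            vendor = int((dev / "vendor").read_text().strip(), 16)
            device = int((dev / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries that are unreadable or malformed
        if vendor == INTEL_VENDOR_ID and device in GM45_DEVICE_IDS:
            matches.append(dev.name)
    return matches

if __name__ == "__main__":
    print(find_gm45())
```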
