There May Still Be Hope For R600g Supporting XvMC, VDPAU

  • #41
    No, you miss the point, tball. x264 may only be targeting H.264 right now, but they know MPEG-2 and VP8 encode AND decode assembly and C inside out, as do the FFmpeg devs.

    They can help you and your team get this prototype up and running, and you can port and refactor whatever code sections suit you, replacing the CPU code with GPU code where needed/wanted and using the other CPU parts unchanged to start with.

    It's perfectly usual and expected to start by ripping out things you don't need for a prototype and building a basic, simple test case for each new function you want to write and test, just as the OpenCL university guys did ("...Considering the fact that only a fraction of the motion estimation capabilities have been ported to OpenCL...") - section by section, one routine at a time.

    The only other codebase of significance that really matters is LinuxTV/Media, and they too use the x264/FFmpeg code frameworks inside their hardware devices etc.; essentially everything else is just secondary apps and code that wraps/ports code from these three.



    • #42
      Originally posted by popper View Post
      No, you miss the point, tball. x264 may only be targeting H.264 right now, but they know MPEG-2 and VP8 encode AND decode assembly and C inside out, as do the FFmpeg devs.

      They can help you and your team get this prototype up and running, and you can port and refactor whatever code sections suit you, replacing the CPU code with GPU code where needed/wanted and using the other CPU parts unchanged to start with.

      It's perfectly usual and expected to start by ripping out things you don't need for a prototype and building a basic, simple test case for each new function you want to write and test, just as the OpenCL university guys did ("...Considering the fact that only a fraction of the motion estimation capabilities have been ported to OpenCL...") - section by section, one routine at a time.

      The only other codebase of significance that really matters is LinuxTV/Media, and they too use the x264/FFmpeg code frameworks inside their hardware devices etc.; essentially everything else is just secondary apps and code that wraps/ports code from these three.
      That would be the easiest way, yes. But that approach has already been considered.

      Implementing a VA-API backend seems somewhat cleaner, don't you agree?
      If you like, please join the #gallium-vdpau IRC channel and discuss it with us.



      • #43
        I guess it depends on what you mean by "vaapi backend". If the plan is to implement a VLD-level VA-API entry point, then that will be "clean" from a user perspective but "dirty" from a developer perspective, since a lot of decode-on-the-CPU code will need to be (re)implemented.

        If, on the other hand, you are talking about implementing a lower-level VA-API entry point (IDCT, MoComp or Deblocking) that corresponds to the functionality you are likely to be able to implement, and modifying an existing decoder stack to use that lower-level entry point, then that seems a lot more doable.

        * DISCLAIMER - when I looked at the VA-API spec I saw the entry points, but I didn't think I saw enough of the right kind of data structures to implement an IDCT- or MC-level interface... I guess I expected it to look a bit more like XvMC.
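
        For reference, a minimal sketch in C of probing which of those entry points a driver actually exposes for MPEG-2, against the stock libva headers (the display setup is elided, and the function name is just illustrative):

        #include <stdio.h>
        #include <stdlib.h>
        #include <va/va.h>

        /* List the VA-API entry points a driver advertises for MPEG-2 Main
         * profile. Assumes "dpy" is an already-initialized VADisplay. */
        static void list_mpeg2_entrypoints(VADisplay dpy)
        {
            int max = vaMaxNumEntrypoints(dpy), num = 0;
            VAEntrypoint *eps = malloc(max * sizeof(*eps));

            if (vaQueryConfigEntrypoints(dpy, VAProfileMPEG2Main,
                                         eps, &num) == VA_STATUS_SUCCESS) {
                for (int i = 0; i < num; i++) {
                    switch (eps[i]) {
                    case VAEntrypointVLD:        puts("VLD (full decode)"); break;
                    case VAEntrypointIDCT:       puts("IDCT level");        break;
                    case VAEntrypointMoComp:     puts("MoComp level");      break;
                    case VAEntrypointDeblocking: puts("Deblocking level");  break;
                    default:                     printf("other (%d)\n", eps[i]);
                    }
                }
            }
            free(eps);
        }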



        • #44
          Well, I'm in favor of the engineering approach: take the best in class of whatever's available at the time of design and connect the dots, as it were.

          Make them all fit together in a consistent way, but allow for pulling out a given element and replacing it with something totally different; rinse/repeat.

          And of course any such design must allow for and include some form of baseline code path you can turn on to actually time each element/block, so you can find the bottlenecks and write test cases to see whether your new idea is actually better, faster, and actually works.
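
          A baseline timing hook of that kind can be as small as this sketch (plain C with POSIX clock_gettime; TIME_STAGE and the stage names are purely illustrative):

          #include <stdio.h>
          #include <time.h>

          /* Wall-clock one pipeline stage so each element/block can be
           * compared against a replacement implementation. */
          static double now_ms(void)
          {
              struct timespec ts;
              clock_gettime(CLOCK_MONOTONIC, &ts);
              return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
          }

          #define TIME_STAGE(name, call)                           \
              do {                                                 \
                  double t0 = now_ms();                            \
                  call;                                            \
                  printf("%-12s %8.3f ms\n", name, now_ms() - t0); \
              } while (0)

          /* usage: TIME_STAGE("idct", idct_pass(frame)); where idct_pass
           * stands in for whatever stage is being measured */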

          But most apps/HW seem to go for 'it's good enough, because that is how I THINK it should work' without getting meaningful feedback from a third party that has already gone through that type of design, which doesn't allow for simple future speed improvements by others, and that's a shame.

          That's how the UVD came to be virtually unused by open-source third parties, it seems: someone inside AMD/ATI had a 'that is how I THINK it should work' moment and decided 'I think I'll put the protected DRM/BR code in this UVD along with the decode logic and save some money/time/kudos this quarter'.



          • #45
            Originally posted by popper View Post
            Well, I'm in favor of the engineering approach: take the best in class of whatever's available at the time of design and connect the dots, as it were.

            Make them all fit together in a consistent way, but allow for pulling out a given element and replacing it with something totally different; rinse/repeat.

            And of course any such design must allow for and include some form of baseline code path you can turn on to actually time each element/block, so you can find the bottlenecks and write test cases to see whether your new idea is actually better, faster, and actually works.

            But most apps/HW seem to go for 'it's good enough, because that is how I THINK it should work' without getting meaningful feedback from a third party that has already gone through that type of design, which doesn't allow for simple future speed improvements by others, and that's a shame.

            That's how the UVD came to be virtually unused by open-source third parties, it seems: someone inside AMD/ATI had a 'that is how I THINK it should work' moment and decided 'I think I'll put the protected DRM/BR code in this UVD along with the decode logic and save some money/time/kudos this quarter'.
            Well, you are right up to a point. We all have many ideas about what is good and wrong with ffmpeg/x264/vaapi/players/TV apps/etc. when you look at them from the point of view of an integrated video framework for Linux and other Unix-like OSes, so it's not like we are expecting anything to be widely accepted or even used in the short term.

            See it more as a study phase to understand Gallium and how video decoding and encoding work from a GPU point of view, which in the near future could produce code good enough to reproduce MPEG-2 video using the GPU, compliant with VA-API.

            The idea, at least in my head, is that after we master TGSI and video algorithm theory decently enough, we sit down and brainstorm a more focused project aiming to be codec-agnostic, and maybe even process some sections of audio compression on the GPU, among other things.



            • #46
              Originally posted by bridgman View Post
              The reality is that IDCT is going to have to be implemented anyway, if only to determine whether running it on the GPU was a good idea or not.
              Well, I'm not an expert in IDCT or anything like it, of course, but so far I have found three ways of implementing a fast IDCT:

              1. shifts and masks (butterfly IDCT)
              2. mostly integers (FFT-based IDCT)
              3. matrix multiplication (float M'TM DCT)

              CUDA developers proved that FFT-based IDCT is slow on GPUs, so I discarded that one. The butterfly approach is horrible to vectorize (look at the PDF that explains the butterfly algorithm and you will understand me), so it's not discarded yet, but I want to put my hopes in getting a pure float M'TM version first and then, once I get more confidence with TGSI, benchmark it against butterfly. M'TM is especially beautiful to vectorize and prefetch, and by transposing the incoming macroblock and the coefficient matrix it can be reduced to nothing but vector dot products, i.e. sums and mults.
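
              To make the M'TM idea concrete, here is a naive scalar sketch in C of the 8x8 inverse transform written as two dense matrix multiplies, x = C'·Y·C (no SSE, no prefetch; the function names are illustrative, not from any real codebase):

              #include <math.h>

              #ifndef M_PI
              #define M_PI 3.14159265358979323846
              #endif

              #define N 8
              static float C[N][N]; /* orthonormal DCT-II basis matrix */

              static void init_dct_matrix(void)
              {
                  for (int k = 0; k < N; k++) {
                      float s = (k == 0) ? sqrtf(1.0f / N) : sqrtf(2.0f / N);
                      for (int n = 0; n < N; n++)
                          C[k][n] = s * cosf((float)M_PI * (2 * n + 1) * k / (2 * N));
                  }
              }

              /* x = C' * Y * C: every output element is a pair of 8-wide dot
               * products, which is what makes this easy to vectorize. */
              static void idct8x8_mtm(const float Y[N][N], float x[N][N])
              {
                  float t[N][N]; /* t = C' * Y */
                  for (int i = 0; i < N; i++)
                      for (int j = 0; j < N; j++) {
                          float acc = 0.0f;
                          for (int k = 0; k < N; k++)
                              acc += C[k][i] * Y[k][j];
                          t[i][j] = acc;
                      }
                  for (int i = 0; i < N; i++) /* x = t * C */
                      for (int j = 0; j < N; j++) {
                          float acc = 0.0f;
                          for (int k = 0; k < N; k++)
                              acc += t[i][k] * C[k][j];
                          x[i][j] = acc;
                      }
              }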

              So far, using the third method simply converted to SSE/SSE2, i.e. with no prefetch or memory optimization at all (since the GPU is different in this respect), it can resolve 64,000 DCT'ed 8x8 blocks in around 0.35 ms on one core of my Phenom II X4 (it can get faster once I find a way to calculate a dot4 without leaving the cache, or a smarter way to multiply three matrices).

              From a performance point of view, I believe this code should behave almost linearly on a GPU (as far as I understand, adds and muls can be done in one cycle) and provide nice performance, or at least beat the CPU by a nice factor.

              Another interesting aspect of moving the IDCT to the GPU is bandwidth saving, since it is obviously cheaper to pass the DCT'ed data, which is very small, to GPU memory than to pass the IDCT output, which is way bigger.

              Another benefit could be the color conversion pass (maybe forcing it naturally to RGBA), since by the time the IDCT has finished and MC kicks in you almost have a full frame, so in theory it could be a time saver to add the color conversion step inside the IDCT and hack MC to process everything in RGBA (for now it's a crazy idea; remember, I'm learning this stuff as I speak). Of course, I'd need to think of a fast way to do color conversion that maintains the linear behavior of the IDCT algorithm (remember, still learning and looking for different ways to get the job done, so maybe I'll find out later that this is impossible, or much faster, or the same - but I'll try it, and a +1 for me).
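
              For what it's worth, the per-pixel math of that conversion pass is just a small affine transform. A scalar sketch in C, assuming full-range BT.601 coefficients (the function names are illustrative); a fragment shader would run the same three dot products:

              /* Full-range BT.601 YCbCr -> RGB for one pixel. */
              static unsigned char clamp255(float v)
              {
                  return v < 0.0f ? 0 : v > 255.0f ? 255 : (unsigned char)(v + 0.5f);
              }

              static void ycbcr_to_rgb(float y, float cb, float cr,
                                       unsigned char *r, unsigned char *g,
                                       unsigned char *b)
              {
                  cb -= 128.0f; /* chroma is stored biased by 128 */
                  cr -= 128.0f;
                  *r = clamp255(y + 1.402f * cr);
                  *g = clamp255(y - 0.344136f * cb - 0.714136f * cr);
                  *b = clamp255(y + 1.772f * cb);
              }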

              But I strongly believe that IDCT on the GPU would be way nicer XD (for now: tu tu tu tun, tu tu tun).



              • #47
                "
                #46 jrch2k8
                see it more like a study phase to understand gallium and how video decode and encode works from a gpu point of view that in a near future could produce some code good enough to reproduce mpeg2 video using the GPU and compliant with va-api.

                the idea at least in my head is after we master TGSI and video algorithm theory decently enough is sit an have a brainstorm of a more focused project aiming to be codec agnostic and maybe even process some sections of audio compression in the gpu among other stuffs"

                Oh sure, I don't expect you or the team to produce anything in the short term. In fact, many GPU university guys with great potential pass through the x264dev channel, say they will produce X, and then never come back after talking with the devs, LOL.

                Even those OpenCL guys never did the simple thing: pop onto IRC, ask if DS is around (he is there at all hours of the 24-hour day - an odd sleeper, apparently), tell him this OpenCL test/prototype code exists at such-and-such a location, and ask whether the devs can review it and say what small changes to make so it can be pushed to master on the next commit. Done.

                That IS the way to get any patch committed to x264, and even FFmpeg is getting better as more devs get commit privileges.

                BUT you forgot the most important thing in all this: you need to have FUN learning this stuff.

                And I'm sure you would love talking to Holger etc. on #x264dev; he will have ideas, and perhaps better code than what's used now for "a smarter way to multiply 3 matrices" and the three IDCT versions you outlined - and probably lots more FUN for you.



                • #48
                  jrch2k8, have a read of the content and comments at http://x264dev.multimedia.cx/archives/71,
                  http://x264dev.multimedia.cx/archives/486, and
                  http://x264dev.multimedia.cx/archives/377, if you haven't already, to get an idea of the people most skilled in assembly etc.



                  • #49
                    Originally posted by popper View Post
                    jrch2k8, have a read of the content and comments at http://x264dev.multimedia.cx/archives/71,
                    http://x264dev.multimedia.cx/archives/486, and
                    http://x264dev.multimedia.cx/archives/377, if you haven't already, to get an idea of the people most skilled in assembly etc.
                    Well, I'm mostly a C++ troll (Qt/KDE) who does commercial development on Linux (releasing the source if the client wants it, of course), and from time to time I try to do stuff for KDE3 (KDE4 is in my sights once I have some free time). After many years I've become not much of a fan of x86 assembly in my work, because I've noticed it is very hard for new devs to understand compared to a cleaner C++ interface. Of course I understand that assembly can give a nice speed boost, but if I have to use it, it's only inlined and very well documented, which takes a lot of time - or I sacrifice a bit of performance and write it in C/C++, e.g. with SSE. And I have to admit that with a little patience, properly written SSE code in C on a recent GCC gives results very close to hand-written asm (gcc -S rulez XD, to check that). Yeah, all that to say I'm coding in C, not asm, LOL.
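
                    A minimal sketch of that "SSE in C" style, for the kind of 8-wide dot product the M'TM IDCT reduces to (compiler intrinsics from xmmintrin.h; the function name is illustrative; gcc -O2 -msse -S shows the generated asm):

                    #include <xmmintrin.h> /* SSE intrinsics */

                    /* 8-element dot product: two packed multiplies, one packed
                     * add, then a horizontal sum of the four lanes. */
                    static float dot8_sse(const float *a, const float *b)
                    {
                        __m128 lo = _mm_mul_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b));
                        __m128 hi = _mm_mul_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4));
                        __m128 s  = _mm_add_ps(lo, hi);
                        s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
                        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(0, 0, 0, 1)));
                        return _mm_cvtss_f32(s);
                    }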

                    But you are right, the x264 IRC channel is my next stop, and FFmpeg's too, since that's where you find the gurus. And many thanks for the links; they seem really useful, and I can probably use some asm trick to do the dot-sums part, accessing xmm0..8 directly instead of going through pointers.

                    +1 for popper XD



                    • #50
                      Funny - there were some dct32_sse.c and related commits in FFmpeg today; check 'em out etc. Later.

