VP8 Over VDPAU In Gallium3D Is Emeric's Target


  • #16
    Originally posted by RealNC View Post
    This whole thing is kind of useless. VP8 on the internet is low-bitrate enough as to not need acceleration. H.264 would be much more important to have on top of Gallium.
    I think the point Emeric made was that it just wouldn't be possible for him to make a H.264 decoder state tracker with the time he has.
    So he would prefer ending up with a less usable but functional (VP8) decoder, rather than a more useful (H.264) but non-functional/incomplete decoder.
    Implementing a full H.264 decoder in software took him six months last time he tried (1), and this time he has to learn about shader-based optimizations as well, so it would seem to be a bit too much work for one GSoC.

    Also, he first started his thread on the mailing list by proposing a generic implementation of the various processes involved in video decoding, so that an arbitrary codec could hook into these and accelerate decoding that way. But I'm not sure whether he has dropped that idea as well:
    The project would be to write a state tracker which exposes some of the most shader-friendly decoding operations (like motion compensation, idct, intra-predictions, deblocking filter and maybe vlc decoding) through a common API like VDPAU or VA-API. These APIs can be used to decode mpeg2, mpeg 4 asp/avc, vc1 and others, but at first I intend to focus on the h264 decoding to save time, because I know it better and it is currently widely in use, but again the goal of the project is to be generic.
    (2)

    In any case, if he successfully makes this VDPAU VP8 decoder state tracker, adding support for H.264, VC-1 etc. later will be much easier than it is now.
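
    To give the "generic hooks" idea a slightly more concrete shape, here is a purely hypothetical C sketch of my own (none of this is Emeric's code, and every name is made up) of what such a table of decoding operations could look like. Each hook could be backed either by a plain C fallback or by a shader-based implementation, and an H.264, VP8 or MPEG-2 front end would then only differ in its bitstream parsing:

    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical, opaque parameter blocks for each stage. */
    struct vl_mc_params;
    struct vl_intra_params;
    struct vl_frame;

    /* A codec-neutral table of decode operations; a codec's state tracker
     * would call through these, and each hook could be backed either by
     * plain C or by a shader-based implementation. */
    struct vl_decode_ops {
        /* entropy/VLC decoding would most likely stay on the CPU */
        void (*vlc_decode)(void *priv, const uint8_t *bits, size_t size);

        /* the "shader-friendly" stages named in the proposal */
        void (*idct_block)(void *priv, int16_t coeffs[64],
                           uint8_t *dst, int stride);
        void (*motion_comp)(void *priv, const struct vl_mc_params *mc);
        void (*intra_predict)(void *priv, const struct vl_intra_params *ip);
        void (*deblock)(void *priv, struct vl_frame *frame);
    };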

    Comment


    • #17
      Can anyone answer exactly how these optimizations are coded in a state tracker?
      I mean, I think I get how the state tracker itself functions, on a conceptual level, at least.
      But in the state tracker code, in the part that deals with the actual decoding of a video stream, how, concretely, is a certain piece of that decoding code - let's say the iDCT - written so that it can be executed on a GPU (in parallel)?
      Would the process of starting to write useful code for this kind of thing be something like reading a couple of papers on parallelizing the iDCT algorithm and then writing the actual parallelization code in TGSI? Or does one use a higher-level language like GLSL? Are there any currently working code examples in Mesa that I can look at to get a better understanding of it?

      More generally perhaps, is the point of access to the shaders of a graphics card in a Gallium state tracker always TGSI? If so, wouldn't something as relatively easy as iDCT be fairly complicated to implement in TGSI? I mean, I've seen the C code, and the assembly code, that implements iDCT. Wouldn't the TGSI code look a lot like the assembly (CPU) code, except with some form of parallelization-enabled instructions?

      On a third note, does anyone know where the "TGSI specification" PDF on this site has gone? I'd really like to take a look at it (even though I probably wouldn't understand much of it). But it seems to be the only documentation I can find.

      Comment


      • #18
        Originally posted by runeks View Post
        Can anyone answer exactly how these optimizations are coded in a state tracker?
        I mean, I think I get how the state tracker itself functions, on a conceptual level, at least.
        But in the state tracker code, in the part that deals with the actual decoding of a video stream, how, concretely, is a certain piece of that decoding code - let's say the iDCT - written so that it can be executed on a GPU (in parallel)?
        Would the process of starting to write useful code for this kind of thing be something like reading a couple of papers on parallelizing the iDCT algorithm and then writing the actual parallelization code in TGSI? Or does one use a higher-level language like GLSL? Are there any currently working code examples in Mesa that I can look at to get a better understanding of it?

        More generally perhaps, is the point of access to the shaders of a graphics card in a Gallium state tracker always TGSI? If so, wouldn't something as relatively easy as iDCT be fairly complicated to implement in TGSI? I mean, I've seen the C code, and the assembly code, that implements iDCT. Wouldn't the TGSI code look a lot like the assembly (CPU) code, except with some form of parallelization-enabled instructions?

        On a third note, does anyone know where the "TGSI specification" PDF on this site has gone? I'd really like to take a look at it (even though I probably wouldn't understand much of it). But it seems to be the only documentation I can find.
        In broad strokes, the job of the state tracker is to convert random API input (VDPAU, OpenGL, etc.) into TGSI output.

        Then the hardware drivers take the TGSI as input and output commands that the actual hardware works with.

        I suspect that the video decoding code will be written directly in TGSI within the state tracker, but I suppose it's possible to do it in something like GLSL and then compile it down to TGSI. I'm not sure how difficult that would be to implement, but it's probably more efficient to just code in TGSI directly.
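
        For what it's worth, "directly in TGSI" would probably look something like the sketch below: hand-written TGSI text parsed into tokens and handed to the pipe driver. This is only my own untested sketch against the Gallium interfaces as I understand them (tgsi_text_translate() and pipe_context::create_fs_state()), and the trivial pass-through shader just stands in for a real iDCT shader, which would do texture fetches and multiply-accumulate chains instead:

        Code:
        #include <string.h>
        #include "pipe/p_context.h"
        #include "pipe/p_state.h"
        #include "tgsi/tgsi_text.h"

        static void *build_fs(struct pipe_context *pipe)
        {
            /* Trivial pass-through fragment shader in TGSI text form. */
            static const char text[] =
                "FRAG\n"
                "DCL IN[0], COLOR, LINEAR\n"
                "DCL OUT[0], COLOR\n"
                "  0: MOV OUT[0], IN[0]\n"
                "  1: END\n";

            static struct tgsi_token tokens[64];
            struct pipe_shader_state state;

            /* Parse the text into binary TGSI tokens... */
            if (!tgsi_text_translate(text, tokens, 64))
                return NULL;

            /* ...and let the hardware driver compile it. */
            memset(&state, 0, sizeof(state));
            state.tokens = tokens;
            return pipe->create_fs_state(pipe, &state);
        }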

        Comment


        • #19
          Originally posted by smitty3268 View Post
          I suspect that the video decoding code will be written directly in TGSI within the state tracker, but I suppose it's possible to do it in something like GLSL and then compile it down to TGSI. I'm not sure how difficult that would be to implement, but it's probably more efficient to just code in TGSI directly.
          I'm guessing what will be done is:

          1. Add the state tracker but do everything in C code, test until it works
          2. Pick a function like iDCT and write a separate test app with a shader implementing it
          3. Test that shader until it works well, then use Mesa to record the TGSI code it generates
          4. Move that generated code into the state tracker, test
          5. Either move on to the next function to optimize, or work on optimizing the generated TGSI code directly in the state tracker. Repeat as needed. (A rough sketch of what step 2's shader could look like follows below.)
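
          Just to make step 2 a bit more concrete, below is an untested sketch of mine of a GLSL fragment shader (written as the C string a toy app would hand to glShaderSource) doing a 1-D 8-point iDCT row pass, assuming the coefficients live in an 8x8 luminance texture, one per texel. A real implementation would add the transposed column pass and proper rounding:

          Code:
          /* 1-D 8-point iDCT row pass; the column pass would be a second,
           * transposed pass over the intermediate result. */
          static const char *idct_row_fs =
              "uniform sampler2D coeffs;                                 \n"
              "void main()                                               \n"
              "{                                                         \n"
              "    float x = floor(gl_FragCoord.x); /* output col 0..7 */\n"
              "    float y = floor(gl_FragCoord.y); /* row being built */\n"
              "    float sum = 0.0;                                      \n"
              "    for (int u = 0; u < 8; ++u) {                         \n"
              "        float cu = (u == 0) ? 0.70710678 : 1.0;           \n"
              "        float F  = texture2D(coeffs,                      \n"
              "            vec2((float(u) + 0.5) / 8.0,                  \n"
              "                 (y + 0.5) / 8.0)).r;                     \n"
              "        sum += cu * F * cos((2.0 * x + 1.0) * float(u)    \n"
              "                            * 3.14159265 / 16.0);         \n"
              "    }                                                     \n"
              "    gl_FragColor = vec4(0.5 * sum);                       \n"
              "}                                                         \n";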

          Comment


          • #20
            Originally posted by smitty3268 View Post
            2. Pick a function like iDCT and write a separate test app with a shader implementing it
            3. Test that shader until it works well, then use Mesa to record the TGSI code it generates
            I guess what I'm interested in is these two steps. Perhaps only step 2.
            I'm not sure what you mean by "write a separate test app with a shader implementing it" though. Why would we write a separate (test) application to implement a sub-feature of a state tracker? Or do you mean just writing an application that can be used to test whichever decoding routine we choose to optimize using shaders?

            Also, in step 3: are we not writing this shader in TGSI ourselves? If so, why would we use mesa to record "the TGSI code it generates"?

            Comment


            • #21
              Originally posted by runeks View Post
              I guess what I'm interested in is these two steps. Perhaps only step 2.
              I'm not sure what you mean by "write a separate test app with a shader implementing it" though. Why would we write a separate (test) application to implement a sub-feature of a state tracker? Or do you mean just writing an application that can be used to test whichever decoding routine we choose to optimize using shaders?

              Also, in step 3: are we not writing this shader in TGSI ourselves? If so, why would we use mesa to record "the TGSI code it generates"?
              I think smitty meant "write a test app with a GLSL shader".

              Comment


              • #22
                Originally posted by runeks View Post
                I guess what I'm interested in is these two steps. Perhaps only step 2.
                I'm not sure what you mean by "write a separate test app with a shader implementing it" though. Why would we write a separate (test) application to implement a sub-feature of a state tracker? Or do you mean just writing an application that can be used to test whichever decoding routine we choose to optimize using shaders?

                Also, in step 3: are we not writing this shader in TGSI ourselves? If so, why would we use mesa to record "the TGSI code it generates"?
                What I mean is a toy app which only has a single shader in it that does the iDCT, used to test that shader until it's working. It's easier doing it there, because then you can just use standard OpenGL instead of trying to hook a shader compiler up inside the state tracker.
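
                Something along these lines is what I'm picturing for that toy app - just the standard GL 2.0 shader calls plus a readback to compare against a C reference. Untested sketch, and the ST_DEBUG note at the end is from memory, so take it with a grain of salt:

                Code:
                #define GL_GLEXT_PROTOTYPES 1
                #include <stdio.h>
                #include <GL/gl.h>
                #include <GL/glext.h>

                /* Assumes a GL context already exists (GLUT/SDL/whatever)
                 * and that `src` is the GLSL source of the shader under
                 * test, e.g. an iDCT fragment shader. */
                static GLuint build_test_shader(const char *src)
                {
                    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
                    GLint ok = 0;

                    glShaderSource(fs, 1, &src, NULL);
                    glCompileShader(fs);
                    glGetShaderiv(fs, GL_COMPILE_STATUS, &ok);
                    if (!ok) {
                        char log[1024];
                        glGetShaderInfoLog(fs, sizeof(log), NULL, log);
                        fprintf(stderr, "compile failed: %s\n", log);
                    }
                    /* Next: link it into a program, draw a quad, read the
                     * result back with glReadPixels() and diff it against
                     * a plain C reference. Running under Mesa with
                     * something like ST_DEBUG=tgsi in the environment
                     * should (if I remember right) dump the TGSI the GL
                     * state tracker generates for the shader. */
                    return fs;
                }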

                Think about optimizing a function in x264. First you'd write it in C code and make sure that's working. Then you can compile that to assembly with GCC and copy the output into an assembly section in x264. Then you can work on actually trying to optimize the assembly, instead of writing it from scratch.
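
                On the CPU side that workflow is basically this (file and function names made up, purely to illustrate):

                Code:
                /* gcc -O2 -S sad8.c   writes sad8.s, which you would paste
                 * into the project's asm source and then hand-tune. */
                #include <stdlib.h>

                unsigned sad8(const unsigned char *a, const unsigned char *b)
                {
                    unsigned sum = 0;
                    for (int i = 0; i < 8; i++)
                        sum += (unsigned)abs(a[i] - b[i]);
                    return sum;
                }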

                I know that wouldn't always lead to optimal results, but it does seem like the quickest path to me.

                PS - I am not a developer involved with Mesa, video decoding, or anything else discussed here. So I'm just giving my opinion of what will probably be done, I have no inside information.

                Comment


                • #23
                  Originally posted by bridgman View Post
                  I think smitty meant "write a test app with a GLSL shader".
                  Ah, of course, that makes sense. To me it seems like writing directly in TGSI is simply too low-level. I mean, who would want to write this sort of algorithm in an assembly-like language, on an instruction-by-instruction basis? It simply seems like too much work. So it definitely makes sense to first write it in GLSL and look at the (TGSI) output of the GLSL compiler. Or, rather, just copy-and-paste the TGSI code into the state tracker. I know I wouldn't want to fiddle too much with TGSI. I imagine just getting a part of the video decoding process written and working in GLSL is quite a job in itself.

                  Originally posted by smitty3268 View Post
                  What I mean is a toy app which only has a single shader in it that does the iDCT, used to test that shader until it's working. It's easier doing it there, because then you can just use standard OpenGL instead of trying to hook a shader compiler up inside the state tracker.
                  Hmm, bear with me here. When you write "shader", are you referring to the shader program? I keep thinking about the actual hardware units on the graphics card when I read "shader". So we'd write an application that implements an iDCT function in GLSL, and then input various values into this function and see that we get the correct results?

                  Originally posted by smitty3268 View Post
                  Think about optimizing a function in x264. First you'd write it in C code and make sure that's working. Then you can compile that to assembly with GCC and copy the output into an assembly section in x264. Then you can work on actually trying to optimize the assembly, instead of writing it from scratch.
                  I think I get the point. On a CPU the path would be C->asm->optimized asm while on a GPU the path would be GLSL->TGSI->optimized TGSI.
                  But again, would we even gain that much trying to optimize the TGSI? Wouldn't the fact that GLSL is able to utilize hundreds of shaders make it optimized enough, or is it just far more complicated than I'm making it out to be? (It often is )

                  Comment


                  • #24
                    Originally posted by runeks View Post
                    I think I get the point. On a CPU the path would be C->asm->optimized asm while on a GPU the path would be GLSL->TGSI->optimized TGSI.
                    But again, would we even gain that much trying to optimize the TGSI? Wouldn't the fact that GLSL is able to utilize hundreds of shaders make it optimized enough, or is it just far more complicated than I'm making it out to be? (It often is )
                    Using GLSL would probably be good enough. I just don't know how easy it would be to hook the GLSL compiler into the VDPAU state tracker - maybe it's already extremely simple and would only take a couple of lines, or maybe it would require tons of glue code and the current compiler only really works with lots of assumptions that it's being called from the OpenGL tracker. Further, I don't know how much of a slowdown compiling those shaders would be and if it makes sense to "pre-compile" them from a performance standpoint or not.

                    So my guess was that to keep things simple they would only mess with the TGSI in the state tracker, but I don't know if that's really the plan or not.

                    I do think the developers are quite familiar with TGSI, and I don't think they would view working directly with it as too burdensome. They are the same people who are writing the driver compilers, after all, which work directly on the TGSI and the previous Mesa IR code.

                    Comment


                    • #25
                      Originally posted by runeks View Post
                      Hmm, bear with me here. When you write "shader", are you referring to the shader program? I keep thinking about the actual hardware units on the graphics card when I read "shader". So we'd write an application that implements an iDCT function in GLSL, and then input various values into this function and see that we get the correct results?
                      Strictly speaking, the term "shader" originally described the program, not the hardware. I believe the term originated with RenderMan, but I'm not sure... anyway, dedicated hardware for running shader programs came later.

                      Originally posted by runeks View Post
                      I think I get the point. On a CPU the path would be C->asm->optimized asm while on a GPU the path would be GLSL->TGSI->optimized TGSI. But again, would we even gain that much trying to optimize the TGSI? Wouldn't the fact that GLSL is able to utilize hundreds of shaders make it optimized enough, or is it just far more complicated than I'm making it out to be? (It often is )
                      In general you are mostly optimizing with respect to memory accesses (memory bandwidth is always a challenge) more than shader hardware. Algorithms like IDCT and filtering tend to have to perform a lot of reads for every write (even more so than with normal textured rendering), and a significant part of optimizing is about reducing the number of reads or making sure the read pattern is cache-friendly.
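
                      To make the "fewer reads" point concrete, here is a toy illustration of my own (not from any real decoder): fetching a row of eight DCT coefficients one luminance texel at a time versus packing four coefficients into each RGBA texel, which turns eight texture reads into two. In both snippets "v" stands for the row's vertical texture coordinate, and they are kept as the C string fragments a test app would embed:

                      Code:
                      /* Naive layout: one coefficient per texel -> 8 reads. */
                      static const char *fetch_naive =
                          "float F[8];                                      \n"
                          "for (int u = 0; u < 8; ++u)                      \n"
                          "    F[u] = texture2D(coeffs,                     \n"
                          "             vec2((float(u) + 0.5) / 8.0, v)).r; \n";

                      /* Packed layout: four coefficients per RGBA texel ->
                       * only 2 reads for the same row. */
                      static const char *fetch_packed =
                          "vec4 lo = texture2D(coeffs_rgba, vec2(0.25, v)); \n"
                          "vec4 hi = texture2D(coeffs_rgba, vec2(0.75, v)); \n"
                          "/* F[0..3] = lo.rgba, F[4..7] = hi.rgba */       \n";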

                      Comment


                      • #26
                        Try this link - I'm on dial-up so it'll be an hour or so before I can confirm it's the right slide deck, but I *think* this deck talks about optimizing for compute-type applications (and video decode is more like compute work than 3D work):

                        http://developer.amd.com/gpu_asset...plications.pdf

                        Comment


                        • #27
                          @smitty3268 I think we're on the same page here. What I meant wasn't really to enable state trackers to use GLSL directly, but rather just to use the TGSI that the GLSL compiler outputs as-is, i.e. to just copy-and-paste that TGSI code into a state tracker without optimizations. So we'd just be using the already-functioning GLSL compiler to generate the TGSI code that we'd be sticking in the state tracker.

                          Although I was actually about to ask how much it would take to make GLSL directly supported in state trackers instead of TGSI. But I guess that leads me to John's response (wrt. optimizing)...:
                          Originally posted by bridgman View Post
                          In general you are mostly optimizing with respect to memory accesses (memory bandwidth is always a challenge) more than shader hardware. Algorithms like IDCT and filtering tend to have to perform a lot of reads for every write (even more so than with normal textured rendering), and a significant part of optimizing is about reducing the number of reads or making sure the read pattern is cache-friendly.
                          I see. And so GLSL doesn't really cut it because it abstracts away all the memory management, right?
                          But I guess it's just a matter of learning TGSI like any other language. I've only briefly touched on RISC assembly, and that seemed like so much effort for so little. It would probably help if we had some TGSI code, generated from GLSL, to start with though.

                          Originally posted by bridgman View Post
                          Try this link ... I *think* this deck talks about optimizing for compute-type applications (and video decode is more like compute work than 3D work)
                          Sweet! Looks great! Second page says "GPGPU from real world applications - Decoding H.264 Video" so without knowing much else I'd say it's right on the money.
                          I will definitely be digging into that at some point! Would there happen to be a recorded talk/presentation over these slides somewhere?

                          Comment


                          • #28
                            Originally posted by runeks View Post
                            Although I was actually about to ask how much it would take to make GLSL directly supported in state trackers instead of TGSI. But I guess that leads me to John's response (wrt. optimizing)... <snip> And so, GLSL doesn't really cut it because it abstracts away all the memory management right?
                            Actually I was mostly responding to your question about the need to optimize.

                            GLSL could probably get you pretty close, if not the same performance (although I haven't done enough shader work to be sure). The real issue is that a Gallium3D state tracker uses Gallium3D calls and TGSI shaders by definition, so you probably want to end up with TGSI rather than copying a big heap of code from the OpenGL state tracker (aka Mesa) to convert the shaders from GLSL to TGSI every time you want to decode a video.

                            Originally posted by runeks View Post
                            But I guess it's just a matter of learning TGSI like any other language. I've just only briefly touched on RISC assembly, and that seemed like so much effort for so little. It would probably help if we had some TGSI code already, generated from GLSL to start with though.
                            I imagine there is a debug mechanism in Mesa to do that already, not sure though.

                            Originally posted by runeks View Post
                            Sweet! Looks great! Second page says "GPGPU from real world applications - Decoding H.264 Video" so without knowing much else I'd say it's right on the money. I will definitely be digging into that at some point! Would there happen to be a recorded talk/presentation over these slides somewhere?
                            There might be (or, more likely, a newer talk), but I would have to be at work (with something faster than 24 Kb/s download) to find it before the technology becomes obsolete.

                            Comment


                            • #29
                              Originally posted by bridgman View Post
                              Actually I was mostly responding to your question about the need to optimize.

                              GLSL could probably get you pretty close, if not the same performance (although I haven't done enough shader work to be sure). The real issue is that a Gallium3D state tracker uses Gallium3D calls and TGSI shaders by definition, so you probably want to end up with TGSI rather than copying a big heap of code from the OpenGL state tracker (aka Mesa) to convert the shaders from GLSL to TGSI every time you want to decode a video.
                              Yes, as things stand today it would have to be implemented in TGSI as you say.

                              But I guess my point is that if the TGSI code produced from the GLSL code, as you say, doesn't even need any optimizations to perform really well, then maybe the ability to hook the GLSL compiler into other Gallium state trackers would ease the development of future state trackers? Or is Gallium just designed in such a way that this cannot be done without creating a mess?
                              The GLSL compiler lives in the Mesa state tracker, right? So if we were to use it in another state tracker, we would be creating a new state tracker that depends on another state tracker (Mesa). But I guess dependencies aren't exactly a foreign concept on Linux.
                              Of course, it would probably need to be some kind of just-in-time compiler with code caching in order to be effective, which quickly makes it quite a project in itself.

                              Originally posted by bridgman View Post
                              There might be (or, more likely, a newer talk), but I would have to be at work (with something faster than 24 Kb/s download) to find it before the technology becomes obsolete.
                              Hehe. If you do find the talk, or any other talk(s) on this topic, please do post a link in this thread. The talks usually go into more detail, plus the questions often reflect my own questions on the topic.

                              Comment


                              • #30
                                The issue is that OpenGL is a Big Honkin' API and therefore needs a Big Honkin' State Tracker. Mesa is a lot bigger and more complex than the video decoder state tracker would be, and you would probably end up having a lot more code supporting GLSL than supporting video decode.

                                It's sort of like bringing your house into your car so you can make coffee while you drive -- OK in principle, but not so good in practice.

                                Comment
