- bitstream decode
- reverse entropy
- inverse transform
- motion comp
- deblocking
...and if you rate how well each stage maps to shaders, you get something like:
- bitstream decode: not practical for shaders, inherently single-threaded (see the bit-reader sketch after this list)
- reverse entropy: generally considered impractical for shaders, though I'm not sure anyone has really tried
- inverse transform: doable on shaders, but not a great fit and probably not worth it
- motion comp: good fit for shaders (see the kernel sketch after this list)
- deblocking: good fit for shaders
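To make the single-threaded point concrete, here's a minimal sketch (toy code, not from any real decoder) of reading Exp-Golomb-coded values the way H.264 headers do. The `BitReader` type and the sample bytes are made up for illustration; the point is the loop-carried dependency on `pos`: you can't even find where symbol n+1 starts in the bitstream until symbol n has been fully decoded, so there's nothing to hand out one-per-thread.

```c++
#include <cstdint>
#include <cstdio>

// Minimal MSB-first bit reader. The single cursor `pos` is the whole
// problem: every symbol advances it by a data-dependent amount.
struct BitReader {
    const uint8_t* data;
    unsigned pos;  // current bit position
    int read_bit() {
        int b = (data[pos >> 3] >> (7 - (pos & 7))) & 1;
        ++pos;
        return b;
    }
    // ue(v) Exp-Golomb: count leading zeros, then read that many more bits.
    uint32_t read_ue() {
        int zeros = 0;
        while (read_bit() == 0) ++zeros;
        uint32_t suffix = 0;
        for (int i = 0; i < zeros; ++i) suffix = (suffix << 1) | read_bit();
        return (1u << zeros) - 1 + suffix;
    }
};

int main() {
    // Codewords 1, 010, 011, 00100 packed together: values 0, 1, 2, 3.
    // Where codeword n+1 begins depends on decoding codeword n first.
    const uint8_t bits[] = {0b10100110, 0b01000000};
    BitReader br{bits, 0};
    for (int i = 0; i < 4; ++i)
        printf("symbol %d = %u\n", i, br.read_ue());
    return 0;
}
```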
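For contrast, here's a minimal sketch of why motion comp maps so well: one thread per output pixel, each independently fetching its prediction from the reference frame, with no thread depending on any other. The frame size, the single full-pel vector per 16x16 macroblock, and all names here are hypothetical; real motion comp adds sub-pel interpolation, per-partition vectors, and weighted prediction, but that's all still independent per-pixel math.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int W = 64, H = 64, MB = 16;  // toy frame, 16x16 macroblocks

// One thread per pixel: look up this pixel's macroblock motion vector
// and copy the prediction from the reference frame. Fully data-parallel.
__global__ void motion_comp(const unsigned char* ref, unsigned char* pred,
                            const short2* mv) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    short2 v = mv[(y / MB) * (W / MB) + (x / MB)];
    int rx = min(max(x + v.x, 0), W - 1);  // clamp at frame edges
    int ry = min(max(y + v.y, 0), H - 1);
    pred[y * W + x] = ref[ry * W + rx];
}

int main() {
    unsigned char h_ref[W * H];
    short2 h_mv[(W / MB) * (H / MB)];
    for (int i = 0; i < W * H; ++i) h_ref[i] = i & 0xff;
    for (auto& v : h_mv) v = make_short2(3, -2);  // every block moves (3,-2)

    unsigned char *d_ref, *d_pred; short2* d_mv;
    cudaMalloc(&d_ref, W * H); cudaMalloc(&d_pred, W * H);
    cudaMalloc(&d_mv, sizeof(h_mv));
    cudaMemcpy(d_ref, h_ref, W * H, cudaMemcpyHostToDevice);
    cudaMemcpy(d_mv, h_mv, sizeof(h_mv), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid(W / 16, H / 16);
    motion_comp<<<grid, block>>>(d_ref, d_pred, d_mv);

    unsigned char h_pred[W * H];
    cudaMemcpy(h_pred, d_pred, W * H, cudaMemcpyDeviceToHost);
    printf("pred[0] = %d\n", h_pred[0]);  // = ref at clamped (3, 0), i.e. 3
    return 0;
}
```

Deblocking is the same story: per-pixel filtering driven by neighbouring samples and a per-edge strength, so it parallelizes the same way.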
The good news is that the last two steps are usually the most computationally expensive as well, so accelerating those stages on GPU should make a big difference in CPU utilization.
If you look at page 5 of this (2005) paper, you can see a rough breakdown of where the CPU cycles were going at the time.
I believe that paper lumped bitstream decode in with reverse entropy.
You generally want to pick a point in the pipeline and accelerate everything after it, to avoid pushing data back and forth between the CPU and GPU. Since all of the subsequent steps (scaling, colour space conversion, post-filtering, de-interlacing) are usually done on the GPU anyway, this all works nicely.
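As an example of one of those tail-end stages, here's a minimal colour space conversion kernel: pure per-pixel math, which is why it has always lived happily on the GPU. It assumes approximate full-range BT.601 coefficients and 4:4:4 planes to keep it short (all names and sizes are made up); a real decoder would upsample 4:2:0 chroma and handle limited range, but that's still independent per-pixel work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int W = 64, H = 64;  // toy frame

// One thread per pixel: full-range BT.601 YUV -> interleaved RGB.
// (Assumes 4:4:4 planes for simplicity.)
__global__ void yuv_to_rgb(const unsigned char* Y, const unsigned char* U,
                           const unsigned char* V, unsigned char* rgb) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int i = y * W + x;
    float fy = Y[i], fu = U[i] - 128.0f, fv = V[i] - 128.0f;
    float r = fy + 1.402f * fv;
    float g = fy - 0.344f * fu - 0.714f * fv;
    float b = fy + 1.772f * fu;
    rgb[3 * i + 0] = (unsigned char)fminf(fmaxf(r, 0.0f), 255.0f);
    rgb[3 * i + 1] = (unsigned char)fminf(fmaxf(g, 0.0f), 255.0f);
    rgb[3 * i + 2] = (unsigned char)fminf(fmaxf(b, 0.0f), 255.0f);
}

int main() {
    const int N = W * H;
    unsigned char hY[N], hU[N], hV[N], hRGB[3 * N];
    for (int i = 0; i < N; ++i) { hY[i] = 128; hU[i] = 128; hV[i] = 200; }

    unsigned char *dY, *dU, *dV, *dRGB;
    cudaMalloc(&dY, N); cudaMalloc(&dU, N);
    cudaMalloc(&dV, N); cudaMalloc(&dRGB, 3 * N);
    cudaMemcpy(dY, hY, N, cudaMemcpyHostToDevice);
    cudaMemcpy(dU, hU, N, cudaMemcpyHostToDevice);
    cudaMemcpy(dV, hV, N, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid(W / 16, H / 16);
    yuv_to_rgb<<<grid, block>>>(dY, dU, dV, dRGB);

    cudaMemcpy(hRGB, dRGB, 3 * N, cudaMemcpyDeviceToHost);
    printf("pixel 0: R=%d G=%d B=%d\n", hRGB[0], hRGB[1], hRGB[2]);  // reddish
    return 0;
}
```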