I understand that a lot of computations are done on the CPU, then the results are sent to the GPU. I was talking more about stuff like this though:
Watching a 1080p video normally, and watching a 1080p video with "hardware acceleration".
I assume the first means that all decoding and graphics processing is done on the CPU (assuming a non-OGL rendering method), while the second means using the GPU for both operations. If this is true, why wouldn't the GPU be used in the first place? It's obviously made for tasks such as these, whereas the CPU (for the most part) is not.
not every GPU can do it, and not everything related to video decoding is feasible to do on a GPU. it greatly depends on the capabilities of the given chip. also, GPUs only support a fixed set of codecs, while a CPU can handle anything the decoder software can do.
Originally Posted by Daktyl198
there are dedicated SoCs on the market that can decode video on their own, mostly Realtek RTD chips, or whatever is in android devices and modern gaming consoles. those are very specific too, and making them handle any new codec requires a firmware update or a total replacement of the hardware. basically this solution is not very flexible. also, certain codecs require lots of licensing (video patents).
Last edited by yoshi314; 03-27-2014 at 08:12 AM.
Yep, GPU hardware doesn't handle all video formats, and people preparing an OS can't assume that every piece of GPU hardware will have full driver support from day one either, so typically OSes ship with a full SW decode/render stack then drivers plug in acceleration options for the more commonly used formats.
I think what he means is: "Why are GPUs so specialized?" And as far as the CPU having to set things up, isn't that what the control units on the GPU are supposed to be doing?
I imagine a crapload of little CPUs when I think of a GPU and I understand that it's easier to make a single powerful processing core pretend it's multiple in terms of programming complexity. But I still can't figure out why we can't have a GPU do most things that the CPU can do. Rather, it does a few things a LOT better than the CPU.
What I mean: Isn't being able to upgrade the hardware via software worth the performance hit? [performance which I think could become on par with, or very close to, the hardware-based decoders]
Last edited by profoundWHALE; 03-27-2014 at 08:51 PM.
There aren't a lot of specialized areas on a GPU these days -- texture filtering is the main one for regular graphics work, and dedicated video encode/decode processing is generally aimed at letting you perform specific operations without having to rely so much on the main GPU core *or* CPU cores, both of which draw more power. Most of a GPU these days is SIMD floating point processors, memory controllers or register files.
In general CPUs are optimized for single thread processing while GPUs are optimized for massively parallel processing (each element might be only 1/10th as fast as a typical CPU core but manages that with 1/200th the space and power) and stream processing (eg big delayed-write caches without logic to detect read-after-write hazards). CPUs devote a lot of logic to maintaining a simple, coherent programming model, while GPUs toss most of that out the window in exchange for much higher performance. Sports car vs. muscle car.
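To put rough numbers on that trade-off, here is a minimal C sketch of the arithmetic. The function name `relative_density` is made up for illustration, and the 1/10 and 1/200 figures are only the ballpark values quoted above, not measurements.

```c
#include <assert.h>

/* Throughput per unit of die area (or power) for one small GPU element,
   relative to one CPU core. If the element is 1/speed_divisor as fast and
   takes 1/area_divisor of the space, the ratio is
   (1/speed_divisor) / (1/area_divisor) = area_divisor / speed_divisor.
   Hypothetical helper; the inputs are rough figures, not measurements. */
static double relative_density(double speed_divisor, double area_divisor)
{
    assert(speed_divisor > 0.0 && area_divisor > 0.0);
    return area_divisor / speed_divisor;
}
```

With those figures, `relative_density(10.0, 200.0)` comes out at 20, i.e. roughly 20 times the throughput for the same area and power, but only if the work can actually be spread across that many slow elements.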
GPUs don't take over all the work normally done by a CPU (the inherently single-threaded part) because they would essentially have to become CPUs themselves in order to do that.
Last edited by bridgman; 03-27-2014 at 09:55 PM.
The reason why I was asking is because I heard something about a type of GPU that could be in phones if it weren't for them having about 35% more power draw. This particular type can get hardware decoding for new codecs or even new versions just with a firmware or software update of some kind. I'm really interested as to the possibility of having things like this in the desktop, or, do we just have the decoder/encoder target OpenCL?
sounds like an FPGA
Originally Posted by profoundWHALE
That's a tricky one. Most modern decoders are "slightly programmable" so new/different codecs can be handled *if* the "building block" functions in fixed function hardware blocks are either the same as supported codecs or are in line with the "let's guess about the future" options built into the hardware.
Each new generation of decode acceleration hardware tends to be "wider and slower", ie more processing elements but running at slower clocks for an overall reduction in power consumption at the expense of die area, so each new generation can afford a bit more flexibility without giving up too much in the way of power savings.
Thank you! I just spent the last hour or so reading up on them and found something interesting which should provide some clarity for people like me.
Originally Posted by gens
For the FFT benchmark:
8-core CPU: 32.63 ms (note: speed increased with core count up to 12 cores, after which it fell back to 8-core speeds)
GPU: 8.13 µs exec time + 69 ms retrieval time ≈ 69 ms
FPGA: 2.59 ms
This fits in with what people were saying earlier about CPUs bottlenecking GPUs. If the transfer speed issue were fixed, we would see a drastic improvement for them, which would explain AMD's HSA speed improvements in things like LibreOffice.
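A quick C sketch of why the GPU number looks so bad, assuming total time is simply execution time plus transfer time (the helper names `total_ms` and `transfer_share` are made up for illustration, and the inputs are the benchmark figures quoted above):

```c
#include <assert.h>

/* Total wall time for an offloaded job, in milliseconds:
   time spent computing plus time spent moving data back. */
static double total_ms(double exec_ms, double transfer_ms)
{
    assert(exec_ms >= 0.0 && transfer_ms >= 0.0);
    return exec_ms + transfer_ms;
}

/* Fraction of the total that is spent on the transfer alone. */
static double transfer_share(double exec_ms, double transfer_ms)
{
    double total = total_ms(exec_ms, transfer_ms);
    assert(total > 0.0);
    return transfer_ms / total;
}
```

For the GPU row, `total_ms(0.00813, 69.0)` is about 69.008 ms and `transfer_share(0.00813, 69.0)` is above 0.999, so practically all of the time is spent getting the result back, not computing it.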
The downside with an FPGA, however, is that you have to write a program specifically for it. Most uses of FPGAs are for accelerating parts of programs, rather than doing it all themselves.
So basically, CPUs excel in transfer, and GPUs in execution.
Originally Posted by profoundWHALE
i'll get to why, but first to point out that that benchmark is not a real cpu benchmark
fact is FFT is all math
the time taken by a cpu to do something like FFT should drop more or less linearly as you add cores (HT doesn't really help anything in this case; to be honest i don't know when it does)
in the results that is not the case
for one, there are concurrency problems, since the test cpus are NUMA (and multi-core to begin with)
those are not that big a problem if the code being run takes them into account (to the point of modifying the algorithm if necessary)
for another, there is the compiler, which can't efficiently use the extensions on those cpus (vectorize for AVX)
to get back to comparing
a cpu program timeline goes something like this
code > compiler > machine code
machine code > cpu MMU > cpu instruction decoding... thing > MMU to get the data into registers > cpu units that do the actual work (ALU or FPU) > back to memory thru MMU
a gpu program
code > compiler > machine code
where the compiler spends almost all its time massively vectorizing the math code you sent it and taking care of concurrency issues that might arise
machine code > a simpler instruction decoding (since there are fewer, more specialized instructions) > send orders to compute units > wait till its done
a compute unit does
gets instruction > does it
(gpus have a per compute unit cache and some other things idk, simpler than a cpu but still overhead (for the sake of speed ofc))
an fpga program
you are the compiler (you can use something like verilog to help you) > fpga... wiring i guess
then you have to load the "code" to the fpga (once)
given your fpga has enough transistors to hold the data at once (computer memory is a bunch of flip-flops) you would then just
one instruction (process data)
but that is ideal, while in reality it would be something like
copy external (like from RAM) data to block of internal memory > that one instruction > copy results back
and as bridgman pointed out, decoding video is not just "do one thing to this data"
so you would need another step and more instructions
back to cpu and the benchmark
cpu is a generic compute unit
people that design them have taken into account that math is used a lot in computers, and that there are these kinds of math formulae that can be done by doing calculations in parallel
so they made these extensions just for that (SSE, AVX, XOP, etc)
compilers are not that good yet at making these kinds of math run in parallel (vectorizing)
like lets say you have 3 sets of 4 numbers and you want to multiply the first set by the 2nd and by the 3rd, then add the two products together
you make the code and the compiler spits out something scalar like
(load one number from each set into xmm registers 0, 1 and 2)
movss xmm3, xmm0 (copy first float from xmm0 to xmm3 for later)
mulss xmm0, xmm1 (multiply first float in xmm0 with first float in xmm1)
mulss xmm3, xmm2 (multiply first float in xmm3 with first float in xmm2)
addss xmm0, xmm3 (add them together)
(loop for the next number)
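For reference, the kind of C loop that ends up as scalar code like the above might look like this. It's only a sketch: the function name `mul_add_scalar` is made up, and whether a given compiler really emits movss/mulss/addss for it depends on the compiler and flags (with optimization turned up, modern compilers often vectorize a loop this simple on their own).

```c
#include <stddef.h>

/* out[i] = a[i]*b[i] + a[i]*c[i], one element per loop iteration.
   Hypothetical example function, not taken from any benchmark. */
static void mul_add_scalar(const float *a, const float *b, const float *c,
                           float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float ab = a[i] * b[i];  /* corresponds to the first mulss */
        float ac = a[i] * c[i];  /* corresponds to the second mulss */
        out[i] = ab + ac;        /* corresponds to addss */
    }
}
```

Called with `n = 4`, it goes around the loop four times, once per number, which is the "loop for the next number" step.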
biggest problem here (apart from the instruction decoding overhead) is that xmm registers have to "settle" before branching (before looping to the next iteration)
the cpu can not predict what comes after the branch so it can't "vectorize" the instructions (well it can, to a degree)
and that is something a modern cpu does: it predicts what comes after what is currently being done and sees if it can be done in parallel with it
a vectorized code would be like
(load 4 numbers into each of xmm registers 0, 1 and 2)
movaps xmm3, xmm0
mulps xmm0, xmm1
mulps xmm3, xmm2
addps xmm0, xmm3
and done, no more loops, all 4 results are in xmm0
(i got over 2x the speed in a test loop long ago, meaning maybe even 4x for ymm registers that can hold 8 floats)
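If you'd rather not write the assembly by hand, the same 4-wide calculation can be sketched in C with SSE intrinsics. Assumptions: the function name `mul_add_4` is made up, it uses unaligned loads (movups) instead of movaps so the arrays don't need 16-byte alignment, and the plain loop is only there as a fallback for builds without SSE.

```c
#include <stddef.h>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* out[0..3] = a[0..3]*b[0..3] + a[0..3]*c[0..3], all 4 floats at once.
   Hypothetical example function. */
static void mul_add_4(const float *a, const float *b, const float *c,
                      float *out)
{
#if HAVE_SSE
    __m128 va = _mm_loadu_ps(a);             /* 4 floats per register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 ab = _mm_mul_ps(va, vb);          /* mulps */
    __m128 ac = _mm_mul_ps(va, vc);          /* mulps */
    _mm_storeu_ps(out, _mm_add_ps(ab, ac));  /* addps, then store */
#else
    /* Scalar fallback for targets without SSE. */
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] * b[i] + a[i] * c[i];
#endif
}
```

One call handles all four numbers with no loop on the SSE path, same as the hand-written version.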
a gpu is insanely parallel with math
in that it can do calculations like the one above, but like 500x at once
thing is, for the way it does things;
a gpu sucks at logic (logic as in branching code) so everything has to be parallelizable or it would be slooooooow
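A rough illustration of what "parallelizable" means for branchy code, written in plain C rather than actual GPU code. When neighbouring elements disagree on a branch, a GPU typically ends up running both sides and masking out the results it doesn't need, so the work looks more like the second function than the first. Both function names are made up for the example.

```c
#include <stddef.h>

/* Branchy version: each element takes one of two paths. */
static void scale_branchy(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0.0f)
            out[i] = in[i] * 0.5f;
        else
            out[i] = in[i] * 2.0f;
    }
}

/* Select version: both results are computed for every element and one
   of them is picked afterwards, so every element does the same work. */
static void scale_select(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float if_neg = in[i] * 0.5f;
        float if_pos = in[i] * 2.0f;
        out[i] = (in[i] < 0.0f) ? if_neg : if_pos;
    }
}
```

Both give the same answers; the second just pays for both paths every time, which is why heavily branching code gets slow on that kind of hardware.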
also note that a cpu runs at like 2-3 gigahertz
a gpu runs at like 1
and a fpga has a big problem running at high speeds 'cuz for every transistor you program in it there are like 2-3 to make it possible to program that 1
so fpga-s run at much much lower speeds
fpga-s have their use in really really complicated sciency stuff where algorithms change but have to run the same
in professional DSP where latency matters (so you can't have buffers; and no, this has no use for user audio)
when designing cpus, where you can test a design before printing 1000s of them (btw cpus are measured in processed instructions per cycle)
in industry, when you need something done that can't be done on something like a PIC (or AVR, whatever company) (even though those little things are getting more powerful)
i still think it would be cool to have a fpga in the computer, but a cpu would beat it 99% of cases
back to video decoding
read what bridgman wrote, it's not that simple
Last edited by gens; 04-01-2014 at 09:12 PM.