
Thread: Why are graphics as complicated as they are?

  1. #11
    Join Date
    Jul 2013
    Posts
    348

    Default

    I understand that a lot of computations are done on the CPU, then the results are sent to the GPU. I was talking more about stuff like this though:

    Watching a 1080p video normally, versus watching a 1080p video with "hardware acceleration".
    I assume the first means that all decoding and graphics processing is done on the CPU (assuming a non-OGL rendering method), while the second means using the GPU for both operations. If this is true, why isn't the GPU used in the first place, since it's obviously made for tasks like these, whereas the CPU (for the most part) is not?

  2. #12
    Join Date
    Sep 2006
    Location
    PL
    Posts
    909

    Default

    Quote Originally Posted by Daktyl198 View Post
    I understand that a lot of computations are done on the CPU, then the results are sent to the GPU. I was talking more about stuff like this though:

    Watching a 1080p video normally, versus watching a 1080p video with "hardware acceleration".
    I assume the first means that all decoding and graphics processing is done on the CPU (assuming a non-OGL rendering method), while the second means using the GPU for both operations. If this is true, why isn't the GPU used in the first place, since it's obviously made for tasks like these, whereas the CPU (for the most part) is not?
    Not every GPU can do it, and not everything related to video decoding is feasible to do on the GPU; it depends heavily on the capabilities of the given chip. Also, GPUs have very strict support for specific codecs, while a CPU can handle anything the decoder software can do.

    There are dedicated SoCs on the market that can decode video on their own, mostly Realtek RTD chips, or whatever is in Android devices and modern gaming consoles. Those are very specific too, and making them handle any new codec requires a firmware update or a complete replacement of the hardware. Basically this solution is not very flexible, and certain codecs also require a lot of licensing (video patents).
    Last edited by yoshi314; 03-27-2014 at 07:12 AM.

  3. #13
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,386

    Default

    Yep, GPU hardware doesn't handle all video formats, and the people preparing an OS can't assume that every piece of GPU hardware will have full driver support from day one either, so OSes typically ship with a full software decode/render stack, and drivers then plug in acceleration options for the more commonly used formats.
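    A rough sketch of that pattern in C (every name here is made up for illustration; this is not any real driver or player API):

        /* Hypothetical player logic: prefer the driver's hardware decoder,
           fall back to the software decoder the OS always ships. */
        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        typedef struct { const char *name; } decoder_t;

        /* Stub: pretend the GPU driver only accelerates H.264 here. */
        static decoder_t *hw_decoder_open(const char *codec) {
            static decoder_t hw = { "hardware" };
            return (codec && strcmp(codec, "h264") == 0) ? &hw : NULL;
        }

        /* Stub: the software path handles anything the decode stack understands. */
        static decoder_t *sw_decoder_open(const char *codec) {
            static decoder_t sw = { "software" };
            (void)codec;
            return &sw;
        }

        static decoder_t *open_best_decoder(const char *codec, bool allow_hw) {
            if (allow_hw) {
                decoder_t *d = hw_decoder_open(codec);
                if (d) return d;                /* "hardware acceleration" case */
            }
            return sw_decoder_open(codec);      /* always-available CPU fallback */
        }

        int main(void) {
            printf("%s\n", open_best_decoder("h264", true)->name); /* hardware */
            printf("%s\n", open_best_decoder("vp9",  true)->name); /* software */
            return 0;
        }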

  4. #14
    Join Date
    Oct 2013
    Location
    Canada
    Posts
    319

    Default

    I think what he means is: "Why are GPUs so specialized?" And as far as the CPU having to set things up, isn't that what the control units on the GPU are supposed to be doing?
    I imagine a crapload of little CPUs when I think of a GPU, and I understand that it's easier, in terms of programming complexity, to make a single powerful processing core pretend it's multiple cores. But I still can't figure out why we can't have a GPU do most of the things the CPU can do; instead, it does a few things a LOT better than the CPU.

    What I mean is: isn't being able to upgrade the hardware via software worth the performance hit? [which I think could become on par with, or very close to, hardware-based decoders]
    Last edited by profoundWHALE; 03-27-2014 at 07:51 PM.

  5. #15
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,386

    Default

    There aren't a lot of specialized areas on a GPU these days -- texture filtering is the main one for regular graphics work, and dedicated video encode/decode processing is generally aimed at letting you perform specific operations without having to rely so much on the main GPU core *or* the CPU cores, both of which draw more power. Most of a GPU these days is SIMD floating-point processors, memory controllers, and register files.

    In general CPUs are optimized for single thread processing while GPUs are optimized for massively parallel processing (each element might be only 1/10th as fast as a typical CPU core but manages that with 1/200th the space and power) and stream processing (eg big delayed-write caches without logic to detect read-after-write hazards). CPUs devote a lot of logic to maintaining a simple, coherent programming model, while GPUs toss most of that out the window in exchange for much higher performance. Sports car vs. muscle car.

    GPUs don't take over all the work normally done by a CPU (the inherently single-threaded part) because they would essentially have to become CPUs themselves in order to do that.
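    A toy illustration of that "inherently single-threaded" point, in plain C (the functions are just made-up examples): in the first loop every step needs the previous step's result, so extra parallel hardware doesn't help; in the second loop the iterations are independent, which is exactly the shape GPUs are built for.

        /* Inherently sequential: each iteration depends on the previous one. */
        float iterate(float x, int steps) {
            for (int i = 0; i < steps; i++)
                x = x * x * 0.5f + 0.1f;   /* must finish step i before step i+1 */
            return x;
        }

        /* Data-parallel: every element is independent, so a GPU could hand
           each iteration to a different work-item and run them all at once. */
        void scale_all(float *x, int n, float k) {
            for (int i = 0; i < n; i++)
                x[i] *= k;
        }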
    Last edited by bridgman; 03-27-2014 at 08:55 PM.

  6. #16
    Join Date
    Oct 2013
    Location
    Canada
    Posts
    319

    Default

    The reason I was asking is that I heard something about a type of GPU that could be in phones if it weren't for its roughly 35% higher power draw. This particular type can gain hardware decoding for new codecs, or even new versions of existing ones, with just a firmware or software update of some kind. I'm really interested in the possibility of having something like this on the desktop, or do we just have the decoder/encoder target OpenCL?

  7. #17
    Join Date
    May 2012
    Posts
    431

    Default

    Quote Originally Posted by profoundWHALE View Post
    The reason I was asking is that I heard something about a type of GPU that could be in phones if it weren't for its roughly 35% higher power draw. This particular type can gain hardware decoding for new codecs, or even new versions of existing ones, with just a firmware or software update of some kind.
    Sounds like an FPGA.

  8. #18
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,386

    Default

    That's a tricky one. Most modern decoders are "slightly programmable", so new/different codecs can be handled *if* the building-block functions in the fixed-function hardware are either the same as those of already-supported codecs or in line with the "let's guess about the future" options built into the hardware.

    Each new generation of decode acceleration hardware tends to be "wider and slower", i.e. more processing elements running at lower clocks for an overall reduction in power consumption at the expense of die area, so each new generation can afford a bit more flexibility without giving up too much in the way of power savings.

  9. #19
    Join Date
    Oct 2013
    Location
    Canada
    Posts
    319

    Default

    Quote Originally Posted by gens View Post
    Sounds like an FPGA.
    Thank you! I just spent the last hour or so reading up on them and found something interesting which should provide some clarity for people like me.
    http://www.wpi.edu/Pubs/E-project/Av...king_Final.pdf

    For the FFT benchmark:
    8-core CPU: 32.63 ms (note: speed improved with core count up to 12 cores, beyond which it fell back to 8-core speeds)
    GPU: 8.13 µs exec time + 69 ms retrieval time ≈ 69 ms
    FPGA: 2.59 ms

    This fits with what people were saying earlier about CPUs bottlenecking GPUs. If the transfer-speed issue were fixed, we would see a drastic improvement for GPUs, which explains AMD's HSA speed improvements in things like LibreOffice.

    The downside with FPGAs, however, is that you have to write a program specifically for them. Most uses of FPGAs are about accelerating parts of programs rather than doing everything themselves.

    So basically, CPUs excel in transfer, and GPUs in execution.
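    A quick back-of-envelope on the GPU numbers above (plain C; the values are copied straight from the benchmark as quoted, and the point is only the ratio):

        #include <stdio.h>

        int main(void) {
            double gpu_exec_ms     = 0.00813; /* 8.13 us of kernel execution   */
            double gpu_transfer_ms = 69.0;    /* result retrieval over the bus */
            double gpu_total_ms    = gpu_exec_ms + gpu_transfer_ms;

            printf("GPU total: %.2f ms, %.4f%% of it spent moving data\n",
                   gpu_total_ms, 100.0 * gpu_transfer_ms / gpu_total_ms);
            /* Prints roughly: GPU total: 69.01 ms, 99.9882% of it spent moving data */
            return 0;
        }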

  10. #20
    Join Date
    May 2012
    Posts
    431

    Default

    Quote Originally Posted by profoundWHALE View Post
    Thank you! I just spent the last hour or so reading up on them and found something interesting which should provide some clarity for people like me.
    http://www.wpi.edu/Pubs/E-project/Av...king_Final.pdf

    For the FFT benchmark:
    8-core CPU: 32.63 ms (note: speed improved with core count up to 12 cores, beyond which it fell back to 8-core speeds)
    GPU: 8.13 µs exec time + 69 ms retrieval time ≈ 69 ms
    FPGA: 2.59 ms

    This fits with what people were saying earlier about CPUs bottlenecking GPUs. If the transfer-speed issue were fixed, we would see a drastic improvement for GPUs, which explains AMD's HSA speed improvements in things like LibreOffice.

    The downside with FPGAs, however, is that you have to write a program specifically for them. Most uses of FPGAs are about accelerating parts of programs rather than doing everything themselves.

    So basically, CPUs excel in transfer, and GPUs in execution.
    Well...
    Not really.

    I'll get to why, but first let me point out that that benchmark is not a real CPU benchmark:
    FFT is all math.

    The time a CPU takes to do something like an FFT should scale roughly linearly with the number of cores (HT doesn't really help in this case; to be honest I don't know when it does).
    In those results that is not the case.
    For one, there are concurrency problems, since the test CPUs are NUMA (and, for that matter, multi-core).
    Those are not that big a problem if the code takes them into account (to the point of modifying the algorithm if necessary).
    For another, there is the compiler, which can't efficiently use the extensions on those CPUs (vectorize for AVX).


    To get back to comparing:

    A CPU program's timeline goes something like this:

    code > compiler > machine code
    machine code > CPU MMU > CPU instruction decoding > MMU again to get the data into registers > the CPU units that do the actual work (ALU or FPU) > back to memory through the MMU


    A GPU program:

    code > compiler > GPU bytecode
    where the compiler spends almost all of its time massively vectorizing the math code you sent it and taking care of any concurrency issues that might arise

    On the GPU:
    bytecode > a simpler instruction decode (since there are fewer, more specialized instructions) > send work to the compute units > wait till it's done
    A compute unit just:
    gets an instruction > executes it
    (GPUs have per-compute-unit caches and a few other things, simpler than a CPU's but still overhead, for the sake of speed of course)


    An FPGA:

    you are the compiler (you can use something like Verilog to help you) > FPGA... wiring, I guess
    Then you have to load the "code" onto the FPGA (once).
    Given your FPGA has enough transistors to hold all the data at once (computer memory is a bunch of flip-flops), you would then just need
    one instruction (process data)

    But that is the ideal; in reality it would be something like
    copy external data (e.g. from RAM) into a block of internal memory > that one instruction > copy the results back

    And as bridgman pointed out, decoding video is not just "do one thing to this data",
    so you would need more steps and more instructions.



    Back to the CPU and the benchmark:
    A CPU is a generic compute unit.
    The people who design them have taken into account that computers do a lot of math, and that many math formulas can be done by running calculations in parallel,
    so they added extensions just for that (SSE, AVX, XOP, etc.).
    Compilers are not yet very good at making this kind of math run in parallel (vectorizing).

    Say you have 3 sets of 4 numbers and you want to multiply the first set by the 2nd and the 3rd, then add the results together.
    You write the code and the compiler spits out something scalar like:

    (load a number into xmm registers 0, 1 and 2)
    movss xmm3, xmm0 (copy the first float from xmm0 to xmm3 for later)
    mulss xmm0, xmm1 (multiply the first float in xmm0 with the first float in xmm1)
    mulss xmm3, xmm2 (multiply the first float in xmm3 with the first float in xmm2)
    addss xmm0, xmm3 (add them together)
    (loop for the next number)

    The biggest problem here (besides the instruction-decoding overhead) is that the xmm registers have to "settle" before branching (before looping to the next iteration).
    The CPU cannot predict what comes after the branch, so it can't "vectorize" the instructions (well, it can, to a degree).
    And that is something a modern CPU does: it predicts what comes after the work currently being done and sees whether it can be done in parallel with it.

    Vectorized code would look like:
    (load 4 numbers into xmm registers 0, 1 and 2)
    movaps xmm3, xmm0
    mulps xmm0, xmm1
    mulps xmm3, xmm2
    addps xmm0, xmm3
    and done, no more loop; all 4 results are in xmm0.
    (I got over 2x the speed in a test loop long ago, which suggests maybe even 4x for ymm registers, which can hold 8 floats.) The same idea written with C intrinsics is sketched right below.
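    For the curious, here is a minimal C version of that vectorized sequence using SSE intrinsics (the function name madd4 and the little main are just illustrative; compile with SSE enabled):

        #include <xmmintrin.h>  /* SSE intrinsics */
        #include <stdio.h>

        /* out = a*b + a*c, 4 floats at a time -- the same thing the
           movaps/mulps/mulps/addps sequence above computes. */
        static void madd4(const float *a, const float *b, const float *c, float *out)
        {
            __m128 va = _mm_loadu_ps(a);            /* load 4 floats from a */
            __m128 vb = _mm_loadu_ps(b);            /* load 4 floats from b */
            __m128 vc = _mm_loadu_ps(c);            /* load 4 floats from c */
            __m128 ab = _mm_mul_ps(va, vb);         /* a*b in all 4 lanes   */
            __m128 ac = _mm_mul_ps(va, vc);         /* a*c in all 4 lanes   */
            _mm_storeu_ps(out, _mm_add_ps(ab, ac)); /* store a*b + a*c      */
        }

        int main(void)
        {
            float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4] = {9, 10, 11, 12}, out[4];
            madd4(a, b, c, out);
            for (int i = 0; i < 4; i++)
                printf("%g ", out[i]);  /* prints: 14 32 54 80 */
            printf("\n");
            return 0;
        }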


    A GPU is insanely parallel with math,
    in that it can do calculations like the one above, but something like 500 of them at once.
    The thing is, because of the way it does things,
    a GPU sucks at logic (logic as in branching code), so everything has to be parallelizable or it will be slooooooow.
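    A sketch of why branchy code hurts, in plain C standing in for a GPU kernel body (the names and the threshold are made up): on a GPU each call below would be one work-item, and when neighbouring work-items in the same wavefront take different sides of the if, the hardware ends up walking both paths and masking out the unwanted results.

        /* Per-element work: a GPU would run thousands of these at once,
           one value of i per work-item. */
        void kernel_body(const float *in, float *out, int i, float thresh) {
            if (in[i] > thresh)
                out[i] = in[i] * 2.0f;   /* path A */
            else
                out[i] = in[i] * 0.5f;   /* path B: divergent lanes execute both paths */
        }

        /* On the CPU we just loop; a GPU runtime would launch the body
           across the whole range in parallel. */
        void run_all(const float *in, float *out, int n, float thresh) {
            for (int i = 0; i < n; i++)
                kernel_body(in, out, i, thresh);
        }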

    Also note that a CPU runs at something like 2-3 GHz,
    a GPU at something like 1 GHz,
    and an FPGA has a big problem running at high speeds, because for every transistor you program into it there are something like 2-3 more just to make that one programmable,
    so FPGAs run at much, much lower speeds.

    FPGAs have their uses: in really complicated scientific work where the algorithms change but still have to run the same way;
    in professional DSP where latency matters (so you can't have buffers; and no, this is of no use for consumer audio);
    when designing CPUs, so you can test a design before printing thousands of chips (btw, CPUs are measured in instructions processed per cycle);
    and in industry, when you need something done that can't be done on something like a PIC (or AVR, or whatever brand) microcontroller (even though those little things are getting more powerful).

    I still think it would be cool to have an FPGA in the computer, but a CPU would beat it in 99% of cases.


    Back to video decoding:
    read what bridgman wrote; it's not that simple.
    Last edited by gens; 04-01-2014 at 08:12 PM.
