Why are graphics as complicated as they are?


  • #16
    The reason I was asking is that I heard about a type of GPU that could be used in phones if it didn't draw roughly 35% more power. This particular type can gain hardware decoding for new codecs, or even new versions of existing ones, through nothing more than a firmware or software update of some kind. I'm really interested in the possibility of having something like this on the desktop. Or do we just have the decoder/encoder target OpenCL?



    • #17
      Originally posted by profoundWHALE View Post
      The reason why I was asking is because I heard something about a type of GPU that could be in phones if it weren't for them having about 35% more power draw. This particular type can get hardware decoding for new codecs or even new versions just with a firmware or software update of some kind.
      Sounds like an FPGA.



      • #18
        That's a tricky one. Most modern decoders are "slightly programmable", so new or different codecs can be handled *if* the "building block" functions in the fixed-function hardware are either the same ones used by already-supported codecs or among the "let's guess about the future" options built into the hardware.

        Each new generation of decode-acceleration hardware tends to be "wider and slower", i.e. more processing elements running at lower clocks, for an overall reduction in power consumption at the expense of die area. So each new generation can afford a bit more flexibility without giving up too much in the way of power savings.
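The "wider and slower" trade-off follows from the usual dynamic-power model P ≈ units × C × V² × f: halving the clock usually lets the supply voltage drop too, and power falls with the square of voltage. A minimal C sketch of the arithmetic (the specific numbers, like the 0.8 V figure, are my own illustrative assumptions, not from this post):

```c
/* Toy dynamic-power model for the "wider and slower" trade-off:
   P = units * C * V^2 * f, with switched capacitance C normalized to 1. */
typedef struct { double units, volts, freq; } Design;

static double power(Design d)      { return d.units * d.volts * d.volts * d.freq; }
static double throughput(Design d) { return d.units * d.freq; }
```

A 1-unit design at 1.0 V and full clock versus a 2-unit design at 0.8 V and half clock: same throughput, but the wide one draws only 2 × 0.8² × 0.5 = 0.64 of the power, at the cost of twice the die area.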



        • #19
          Originally posted by gens View Post
          sounds like a FPGA
          Thank you! I just spent the last hour or so reading up on them, and found something that should provide some clarity for people like me:
          http://www.wpi.edu/Pubs/E-project/Av...king_Final.pdf

          For the FFT benchmark:
          8-core CPU: 32.63 ms (note: speed increased with core count up to 12 cores, after which it fell back to 8-core speeds)
          GPU: 8.13 us execution time + 69 ms retrieval time = 69 ms total
          FPGA: 2.59 ms

          This fits with what people were saying earlier about CPUs bottlenecking GPUs. If the transfer-speed issue were fixed, we would see a drastic improvement for GPUs, which explains AMD's HSA speed improvements in things like LibreOffice.

          The downside of an FPGA, however, is that you have to write a program specifically for it. Most uses of FPGAs are for accelerating parts of a program, rather than doing everything themselves.

          So basically, CPUs excel in transfer, and GPUs in execution.
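Plugging the quoted numbers in makes the point concrete. A small C sketch of mine (the times are the ones from the benchmark above):

```c
/* The FFT benchmark numbers quoted above, in milliseconds. */
static const double cpu_ms      = 32.63;
static const double gpu_exec_ms = 0.00813;   /* 8.13 us of actual compute */
static const double gpu_xfer_ms = 69.0;      /* retrieval (transfer) time */
static const double fpga_ms     = 2.59;

/* Total GPU time is utterly dominated by the transfer, not the math. */
static double gpu_total_ms(void)  { return gpu_exec_ms + gpu_xfer_ms; }
static double xfer_fraction(void) { return gpu_xfer_ms / gpu_total_ms(); }
```

The transfer accounts for more than 99.9% of the GPU's end-to-end time, which is exactly the "CPUs bottleneck GPUs" observation: the GPU executes the FFT faster than either the CPU or the FPGA, yet still loses overall.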



          • #20
            Originally posted by profoundWHALE View Post
            Thank you! I just spent the last hour or so reading up on them and found something interesting which should provide some clarity for people like me.
            http://www.wpi.edu/Pubs/E-project/Av...king_Final.pdf

            For the FFT benchmark:
            8 core cpu: 32.63 ms (note, speed increased with cores up to 12 cores, after which saw 8 core speeds)
            GPU: 8.13us exec time + 69ms retrieval time = 69 ms
            FPGA: 2.59 ms

            This fits in with what people were saying earlier about CPUs bottlenecking GPUs. If the transfer speed issue is fixed, we would see a drastic improvement for them, which explains AMD's HSA speed improvements in things like Libreoffice.

            The downside with FPGA however is the part that involves writing a program specifically for it. Most uses for the FPGA are for enhancing programs and such, rather than doing it all themselves.

            So basically, CPUs excel in transfer, and GPUs in execution.
            Well... not really.

            I'll get to why, but first: that benchmark is not a real CPU benchmark. FFT is all math.

            The time a CPU takes to do something like an FFT should scale linearly with the number of cores (hyper-threading doesn't really help in this case; to be honest I don't know when it does). In those results, that is not the case. For one, there are concurrency problems: the test CPUs are NUMA (and multi-core on top of that). Those problems aren't that big if the code takes them into account (to the point of modifying the algorithm if necessary). For another, there is the compiler, which can't efficiently use the extensions on those CPUs (i.e., vectorize for AVX).


            To get back to comparing:

            A CPU program's timeline goes something like this:

            code > compiler > machine code
            machine code > fetched through the MMU > instruction decoding > loads through the MMU to get the data into registers > the units that do the actual work (ALU or FPU) > results back to memory through the MMU

            A GPU program:

            code > compiler > bytecode
            where the compiler spends almost all of its time massively vectorizing the math code you sent it and taking care of any concurrency issues that might arise.

            On the GPU:
            bytecode > simpler instruction decoding (there are fewer, more specialized instructions) > orders sent to the compute units > wait until they're done.
            A compute unit just gets an instruction and executes it.
            (GPUs have per-compute-unit caches and some other things; simpler than a CPU, but still overhead, for the sake of speed of course.)


            An FPGA:

            You are the compiler (you can use something like Verilog to help you) > FPGA wiring, more or less.
            Then you load the "code" onto the FPGA (once).
            Given that your FPGA has enough logic to hold the data at once (computer memory is a bunch of flip-flops), you would then need just one instruction: process the data.

            But that is the ideal case; in reality it would be something like:
            copy external data (e.g. from RAM) into a block of internal memory > that one instruction > copy the results back

            And as bridgman pointed out, decoding video is not just "do one thing to this data", so you would need more steps and more instructions.



            Back to the CPU and the benchmark. A CPU is a generic compute unit. The people who design them have taken into account that computers do a lot of math, and that many kinds of math formulae can be computed by doing calculations in parallel, so they made extensions just for that (SSE, AVX, XOP, etc.). Compilers are not yet very good at making this kind of math run in parallel (vectorizing).

            Say you have 3 sets of 4 numbers, and you want to multiply the first set by the 2nd and by the 3rd, then add the two products together. You write the code, and the compiler spits out something scalar like:

            (load a number into xmm registers 0, 1 and 2)
            movss xmm3, xmm0 (copy the first float from xmm0 to xmm3 for later)
            mulss xmm0, xmm1 (multiply the first float in xmm0 by the first float in xmm1)
            mulss xmm3, xmm2 (multiply the first float in xmm3 by the first float in xmm2)
            addss xmm0, xmm3 (add them together)
            (loop for the next number)

            The biggest problem here (apart from the instruction-decoding overhead) is that the xmm registers have to "settle" before branching (before looping to the next iteration). The CPU cannot fully predict what comes after the branch, so it can't "vectorize" these instructions on its own (well, it can, to a degree). That is something a modern CPU does: it predicts what comes after the instructions currently executing and checks whether it can be run in parallel with them.

            Vectorized code would look like:

            (load 4 numbers into xmm registers 0, 1 and 2)
            movaps xmm3, xmm0
            mulps xmm0, xmm1
            mulps xmm3, xmm2
            addps xmm0, xmm3

            And done: no more loop, all 4 results are in xmm0. (I got over 2x the speed in a test loop long ago, so maybe even 4x is possible with ymm registers, which hold 8 floats.)
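For readers who would rather not write the assembly by hand, the same a*b + a*c computation can be expressed with SSE intrinsics in C. This is a sketch of mine, not from the post; the intrinsics map directly onto the movaps/mulps/addps sequence above:

```c
#include <immintrin.h>   /* SSE intrinsics; x86/x86-64 only */

/* r[i] = a[i]*b[i] + a[i]*c[i] for 4 floats at once, mirroring the
   movaps/mulps/mulps/addps sequence above.                          */
static void mul_mul_add4(const float *a, const float *b,
                         const float *c, float *r)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats  */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 ab = _mm_mul_ps(va, vb);         /* mulps: a*b     */
    __m128 ac = _mm_mul_ps(va, vc);         /* mulps: a*c     */
    _mm_storeu_ps(r, _mm_add_ps(ab, ac));   /* addps: a*b+a*c */
}
```

With AVX the same idea widens to 8 floats per instruction (the `_mm256_*` variants on ymm registers).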


            A GPU is insanely parallel at math, in that it can do calculations like the one above, but something like 500 of them at once. The catch is that, because of the way it achieves this, a GPU sucks at logic (logic as in branching code), so everything has to be parallelizable or it will be slow.

            Also note that a CPU runs at around 2-3 GHz, while a GPU runs at around 1 GHz. An FPGA has a big problem running at high speeds, because for every transistor you program into it there are roughly 2-3 more whose only job is to make that one programmable, so FPGAs run at much, much lower clock speeds.

            FPGAs have their uses: in really complicated scientific work, where algorithms change but still have to run identically; in professional DSP, where latency matters (so you can't have buffers; and no, this has no benefit for ordinary user audio); in CPU design, where you can test a design before fabricating thousands of chips (by the way, CPUs are measured in instructions processed per cycle); and in industry, when you need something that can't be done on something like a PIC (or AVR, or whatever vendor's microcontroller), even though those little things are getting more powerful.

            I still think it would be cool to have an FPGA in the computer, but a CPU would beat it in 99% of cases.


            Back to video decoding: read what bridgman wrote, it's not that simple.
            Last edited by gens; 04-01-2014, 08:12 PM.



            • #21
              Originally posted by gens View Post
              (Disclaimer: I'm not an expert; I'm guessing here.)
              You are thinking about absolute efficiency.

              The thing is, GPUs nowadays are just a bunch of compute units orchestrated by a control unit (there's more to it, of course), and compute units are simple things. With that kind of design, GPUs are not limited to just one specific kind of "rendering". (Rendering 3D is just a bunch of mathematical transforms with some logic in the mix.)

              So a GPU driver is basically a state machine that tells the hardware (the firmware, in this case) what should be done.


              Also, about the CPU's part in this: even in a case as simple as a desktop, or a window with some buttons, you still need the logic behind it.

              For example, when you move a window, you have to calculate where it has moved to, and then check, based on rules, things like: if you move it to an edge, do you flip to the next virtual desktop? (Etc.)

              But that is the simple case. In complex graphics, for example, you don't want the GPU to draw the whole huge world, so you cull everything that isn't seen. You do this on the CPU because you have to know in advance what you will be rendering (so you don't send textures, vertices, etc. when they aren't needed). (This is also required for a desktop if you want lower GPU memory usage.)
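As a toy illustration of that CPU-side culling (my own sketch, not from the post): test each object's bounding sphere against the view planes before anything is uploaded to the GPU.

```c
#include <stddef.h>

/* A plane nx*x + ny*y + nz*z + d >= 0; points satisfying it are on
   the visible side. A view frustum is six such planes.             */
typedef struct { float nx, ny, nz, d; } Plane;
typedef struct { float x, y, z, radius; } Sphere;

/* 1 if the bounding sphere pokes into the visible side of every plane;
   only those objects get their vertices/textures sent to the GPU.      */
static int sphere_visible(const Sphere *s, const Plane *p, size_t nplanes)
{
    for (size_t i = 0; i < nplanes; i++) {
        float dist = p[i].nx * s->x + p[i].ny * s->y
                   + p[i].nz * s->z + p[i].d;
        if (dist < -s->radius)     /* entirely behind this plane: cull */
            return 0;
    }
    return 1;
}
```

The win is that a few multiplies per object on the CPU can avoid uploading whole meshes and textures for things the player will never see.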


              Still, I like the idea of directly controlling the GPU. I read somewhere that in the future (or maybe even now, in new OpenGL) you will be able to get a pointer into GPU memory.

              I also think GPUs are heading toward having a dedicated CPU (an ARM core or something similar) on board to control them. Imagine being able to write full-fledged programs that run on a massively parallel GPU (like semi-Turing-complete shaders).

              In my eyes the future looks bright in the GPU department.

              Edit, in short: they are complicated so that they don't become even more complicated, but their general design is slowly getting simpler.
              My cheap laptop has Intel HD Graphics 2500; good enough for playing many FPS and RTS titles released before 2008 at medium detail.



              • #22
                Take a look at the slides for AMD's TrueAudio, which bypasses the GPU/CPU and instead uses dedicated built-in hardware to raise overall performance. Mantle lets you bypass the CPU for setting things up. Anyway, those slides will show you how much each thing costs to process.



                • #23
                  Originally posted by profoundWHALE View Post
                  ...which bypasses the gpu/cpu and instead uses a specific hardware that's built in to raise overall performance.
                  Please don't, just don't talk about audio processing. It's an area with many, many people who don't know about it (and some of them are "smart" nonetheless; that's why I don't call myself an audiophile).
                  Here is a low-pass FIR filter. I think it's 32-tap, I can't remember.

                  Code:
                  format ELF64 executable
                  entry start
                  
                  align 16
                  segment readable writeable
                  
                  samples equ 1024
                  
                  align 16
                  buff_in: rw samples*2
                  align 16
                  buff_out: rw samples*2
                  
                  ; 2^16 = 1.0
                  align 16
                  coefficients:
                  
                  dw 5
                  dw 14
                  dw 35
                  dw 57
                  dw 39
                  dw -76
                  dw -328
                  dw -677
                  dw -945
                  dw -828
                  dw 16
                  dw 1799
                  dw 4428
                  dw 7441
                  dw 10109
                  dw 11680
                  dw 11680
                  dw 10109
                  dw 7441
                  dw 4428
                  dw 1799
                  dw 16
                  dw -828
                  dw -945
                  dw -677
                  dw -328
                  dw -76
                  dw 39
                  dw 57
                  dw 35
                  dw 14
                  dw 5
                  
                  dw 705
                  dw 1240
                  dw 1434
                  dw 1169
                  dw 456
                  dw -550
                  dw -1576
                  dw -2287
                  dw -2367
                  dw -1609
                  dw 25
                  dw 2377
                  dw 5106
                  dw 7760
                  dw 9862
                  dw 11024
                  dw 11024
                  dw 9862
                  dw 7760
                  dw 5106
                  dw 2377
                  dw 25
                  dw -1609
                  dw -2367
                  dw -2287
                  dw -1576
                  dw -550
                  dw 456
                  dw 1169
                  dw 1434
                  dw 1240
                  dw 705
                  
                  rq 10 ;just in case
                  tmp rq 100
                  
                  sys_read equ 0
                  sys_write equ 1
                  sys_exit equ 60
                  
                  
                  align 16
                  segment readable executable
                  
                  start:
                  
                  xorps xmm7, xmm7
                  xorps xmm6, xmm6
                  xorps xmm5, xmm5
                  xorps xmm4, xmm4
                  xorps xmm3, xmm3
                  xorps xmm2, xmm2
                  xorps xmm1, xmm1
                  xorps xmm0, xmm0
                  
                  mov [tmp], rdx
                  mov [tmp+8], rdx
                  
                  filter:
                  
                  mov rdx, samples*2*2
                  mov rsi, buff_in
                  mov rdi, 0
                  mov rax, sys_read
                  syscall
                  cmp rax, 0
                  jz end_fir
                  
                  xor rax, rax
                  
                  mov rcx, samples
                  loopy:
                  movaps xmm12, [coefficients]
                  movaps xmm13, [coefficients+16*1]
                  movaps xmm14, [coefficients+16*2]
                  movaps xmm15, [coefficients+16*3]
                  mov edx, [rax+buff_in] ;dx = left channel sample
                  mov esi, edx
                  shr esi, 16 ;si = right channel sample
                  
                  ; left channel delay line
                  pslldq xmm3, 2
                  movaps xmm10, xmm2
                  psrldq xmm10, 2*7
                  orpd xmm3, xmm10
                  
                  pslldq xmm2, 2
                  movaps xmm10, xmm1
                  psrldq xmm10, 2*7
                  orpd xmm2, xmm10
                  
                  pslldq xmm1, 2
                  movaps xmm10, xmm0
                  psrldq xmm10, 2*7
                  orpd xmm1, xmm10
                  
                  pslldq xmm0, 2 ;bytes
                  movzx rdx, dx
                  movq xmm10, rdx
                  orpd xmm0, xmm10
                  
                  ; right channel delay line
                  pslldq xmm7, 2
                  movaps xmm10, xmm6
                  psrldq xmm10, 2*7
                  orpd xmm7, xmm10
                  
                  pslldq xmm6, 2
                  movaps xmm10, xmm5
                  psrldq xmm10, 2*7
                  orpd xmm6, xmm10
                  
                  pslldq xmm5, 2
                  movaps xmm10, xmm4
                  psrldq xmm10, 2*7
                  orpd xmm5, xmm10
                  
                  pslldq xmm4, 2
                  movq xmm10, rsi
                  orpd xmm4, xmm10
                  
                  
                  
                  
                  movaps xmm8, xmm12
                  movaps xmm9, xmm13
                  movaps xmm10, xmm14
                  movaps xmm11, xmm15
                  
                  
                  pmaddwd xmm12, xmm0
                  pmaddwd xmm13, xmm1
                  pmaddwd xmm14, xmm2
                  pmaddwd xmm15, xmm3
                  
                  paddd xmm12, xmm14
                  paddd xmm13, xmm15
                  
                  paddd xmm12, xmm13
                  
                  movhlps xmm13, xmm12
                  paddd xmm12, xmm13
                  
                  movss xmm13, xmm12
                   psrldq xmm12, 4 ;shift right 4 bytes (psrldq counts bytes; a count over 15 clears the register)
                  paddd xmm12, xmm13
                  
                  
                  
                  pmaddwd xmm8, xmm4
                  pmaddwd xmm9, xmm5
                  pmaddwd xmm10, xmm6
                  pmaddwd xmm11, xmm7
                  
                  paddd xmm8, xmm10
                  paddd xmm9, xmm11
                  
                  paddd xmm8, xmm9
                  
                  movhlps xmm9, xmm8
                  paddd xmm8, xmm9
                  
                  movss xmm9, xmm8
                   psrldq xmm8, 4 ;shift right 4 bytes, as above
                  paddd xmm8, xmm9
                  
                  
                  
                  movd r10d, xmm12
                  ;sal r10d, 3 ;overflow possibility unless compensated
                  shr r10d, 16
                  mov [rax+buff_out], r10w
                  
                  movq r10, xmm8
                  ;sal r10d, 2
                  shr r10d, 16
                  mov [rax+buff_out+2], r10w
                  
                  add rax, 4
                  dec rcx
                  jnz loopy
                  
                  mov rdx, samples*2*2
                  mov rsi, buff_out
                  mov rdi, 1
                  mov rax, sys_write
                  syscall
                  
                  jmp filter
                  
                  end_fir:
                  mov rax, sys_exit
                  syscall
                   bash-4.2# time ./example < hell.raw > hell2.raw

                  real 0m2.206s
                  user 0m0.129s
                  sys 0m0.251s

                  where hell.raw is 64 MB of raw audio (about 6.25 minutes at 44100 Hz, 16-bit stereo), and most of that time is spent in read() and write()
                  Last edited by gens; 04-28-2014, 07:59 AM.
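For reference, here is the same idea in plain C: a 32-tap fixed-point FIR with Q16 coefficients (2^16 = 1.0, as the assembly's comment says). This is my own sketch of the algorithm the assembly implements, not a line-for-line translation; in particular it handles one channel and keeps an explicit delay line instead of shifting samples through xmm registers.

```c
#include <stdint.h>
#include <string.h>

#define TAPS 32

/* The 32 low-pass coefficients from the post, Q16 fixed point. */
static const int16_t coeff[TAPS] = {
        5,    14,    35,    57,    39,   -76,  -328,  -677,
     -945,  -828,    16,  1799,  4428,  7441, 10109, 11680,
    11680, 10109,  7441,  4428,  1799,    16,  -828,  -945,
     -677,  -328,   -76,    39,    57,    35,    14,     5
};

/* One channel's delay line; the asm keeps this in xmm0-xmm3 (left)
   and xmm4-xmm7 (right) and shifts it with pslldq/psrldq.          */
typedef struct { int16_t delay[TAPS]; } Fir;

/* Push one input sample, return one filtered output sample. */
static int16_t fir_step(Fir *f, int16_t sample)
{
    memmove(&f->delay[1], &f->delay[0], (TAPS - 1) * sizeof(int16_t));
    f->delay[0] = sample;

    int32_t acc = 0;               /* the pmaddwd/paddd accumulation  */
    for (int i = 0; i < TAPS; i++)
        acc += (int32_t)coeff[i] * f->delay[i];
    return (int16_t)(acc >> 16);   /* back out of Q16, like the shr 16 */
}
```

The coefficients sum to 65538, so the DC gain is 65538/65536, almost exactly 1.0: feed the filter a constant signal and, once the delay line fills, it comes back out unchanged.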



                  • #24
                    I think I copy/pasted a testing version. Here is the one I know is correct. (Phoronix... really, it took me 30 seconds to realize.)

                    Code:
                    format ELF64 executable
                    entry start
                    
                    align 16
                    segment readable writeable
                    
                    samples equ 1024
                    
                    align 16
                    buff_in: rw samples*2
                    align 16
                    buff_out: rw samples*2
                    
                    ; 2^16 = 1.0
                    align 16
                    coefficients:
                    
                    dw 5
                    dw 14
                    dw 35
                    dw 57
                    dw 39
                    dw -76
                    dw -328
                    dw -677
                    dw -945
                    dw -828
                    dw 16
                    dw 1799
                    dw 4428
                    dw 7441
                    dw 10109
                    dw 11680
                    dw 11680
                    dw 10109
                    dw 7441
                    dw 4428
                    dw 1799
                    dw 16
                    dw -828
                    dw -945
                    dw -677
                    dw -328
                    dw -76
                    dw 39
                    dw 57
                    dw 35
                    dw 14
                    dw 5
                    
                    rq 10 ;just in case
                    tmp rq 100
                    
                    sys_read equ 0
                    sys_write equ 1
                    sys_exit equ 60
                    
                    
                    align 16
                    segment readable executable
                    
                    start:
                    
                    xorps xmm7, xmm7
                    xorps xmm6, xmm6
                    xorps xmm5, xmm5
                    xorps xmm4, xmm4
                    xorps xmm3, xmm3
                    xorps xmm2, xmm2
                    xorps xmm1, xmm1
                    xorps xmm0, xmm0
                    
                    mov [tmp], rdx
                    mov [tmp+8], rdx
                    
                    filter:
                    
                    mov rdx, samples*2*2
                    mov rsi, buff_in
                    mov rdi, 0
                    mov rax, sys_read
                    syscall
                    cmp rax, 0
                    jz end_fir
                    
                    xor rax, rax
                    
                    mov rcx, samples
                    loopy:
                    movaps xmm12, [coefficients]
                    movaps xmm13, [coefficients+16*1]
                    movaps xmm14, [coefficients+16*2]
                    movaps xmm15, [coefficients+16*3]
                    mov edx, [rax+buff_in] ;dx = left channel sample
                    mov esi, edx
                    shr esi, 16 ;si = right channel sample
                    
                    ; left channel delay line
                    pslldq xmm3, 2
                    movaps xmm10, xmm2
                    psrldq xmm10, 2*7
                    orpd xmm3, xmm10
                    
                    pslldq xmm2, 2
                    movaps xmm10, xmm1
                    psrldq xmm10, 2*7
                    orpd xmm2, xmm10
                    
                    pslldq xmm1, 2
                    movaps xmm10, xmm0
                    psrldq xmm10, 2*7
                    orpd xmm1, xmm10
                    
                    pslldq xmm0, 2 ;bytes
                    movzx rdx, dx
                    movq xmm10, rdx
                    orpd xmm0, xmm10
                    
                    ; right channel delay line
                    pslldq xmm7, 2
                    movaps xmm10, xmm6
                    psrldq xmm10, 2*7
                    orpd xmm7, xmm10
                    
                    pslldq xmm6, 2
                    movaps xmm10, xmm5
                    psrldq xmm10, 2*7
                    orpd xmm6, xmm10
                    
                    pslldq xmm5, 2
                    movaps xmm10, xmm4
                    psrldq xmm10, 2*7
                    orpd xmm5, xmm10
                    
                    pslldq xmm4, 2
                    movq xmm10, rsi
                    orpd xmm4, xmm10
                    
                    
                    
                    
                    movaps xmm8, xmm12
                    movaps xmm9, xmm13
                    movaps xmm10, xmm14
                    movaps xmm11, xmm15
                    
                    
                    pmaddwd xmm12, xmm0
                    pmaddwd xmm13, xmm1
                    pmaddwd xmm14, xmm2
                    pmaddwd xmm15, xmm3
                    
                    paddd xmm12, xmm14
                    paddd xmm13, xmm15
                    
                    paddd xmm12, xmm13
                    
                    movhlps xmm13, xmm12
                    paddd xmm12, xmm13
                    
                    movss xmm13, xmm12
                     psrldq xmm12, 4 ;shift right 4 bytes (psrldq counts bytes; a count over 15 clears the register)
                    paddd xmm12, xmm13
                    
                    
                    
                    pmaddwd xmm8, xmm4
                    pmaddwd xmm9, xmm5
                    pmaddwd xmm10, xmm6
                    pmaddwd xmm11, xmm7
                    
                    paddd xmm8, xmm10
                    paddd xmm9, xmm11
                    
                    paddd xmm8, xmm9
                    
                    movhlps xmm9, xmm8
                    paddd xmm8, xmm9
                    
                    movss xmm9, xmm8
                     psrldq xmm8, 4 ;shift right 4 bytes, as above
                    paddd xmm8, xmm9
                    
                    
                    
                    movd r10d, xmm12
                    ;sal r10d, 3 ;overflow possibility unless compensated
                    shr r10d, 16
                    mov [rax+buff_out], r10w
                    
                    movq r10, xmm8
                    ;sal r10d, 2
                    shr r10d, 16
                    mov [rax+buff_out+2], r10w
                    
                    add rax, 4
                    dec rcx
                    jnz loopy
                    
                    mov rdx, samples*2*2
                    mov rsi, buff_out
                    mov rdi, 1
                    mov rax, sys_write
                    syscall
                    
                    jmp filter
                    
                    end_fir:
                    mov rax, sys_exit
                    syscall



                    • #25
                      Originally posted by Daktyl198 View Post
                      I don't get this either... why can't we map the API calls to the specific hardware at boot or through a function (for hot-swapping/other GPU changes) and store the result, that way we don't spend precious CPU cycles translating all the time?
                      Because GPU design changes over time, and if you hardcoded everything, the API would break every single time the hardware it ran on changed. That also ignores the quite possible case of multiple GPUs, or even non-GPU cards that can run the OpenGL API (ASICs and the like).

                      You also have the somewhat substantial driver layer that manages the hardware resources; it saps performance too, and is probably THE performance killer on weaker CPUs. That's what Mantle/DX12 is attempting to solve going forward.

                      The final problem, at least for OpenGL, is that the API has aged: it no longer reflects how the hardware actually works, and it simply doesn't play well with newer programming methodologies (object-oriented ones; OpenGL is not).



                      • #26
                        Originally posted by Daktyl198 View Post
                        I understand that a lot of computations are done on the CPU, then the results are sent to the GPU. I was talking more about stuff like this though:

                        Watching a 1080p video normally, and watching a 1080p video with "hardware acceleration".
                        I assume the first means that all decoding and graphics processing is done on the CPU (assuming a non-OGL rendering method) while the second means using the GPU for both operations. If this is true, why wouldn't the GPU be used in the first place? Since it's obviously made for tasks such as these, vs the CPU which (for the most part) is not.
                        Not every GPU can do this. What about new encoding formats on old GPUs? What if the user is running some other GPU-heavy task at the same time; would it be better to let the CPU handle the video decoding instead?

                        On a general-purpose PC, you can NOT assume you have access to all of the machine's resources, and you can NOT assume full hardware support.

                        Take a look at the slides from AMD's true audio, which bypasses the gpu/cpu and instead uses a specific hardware that's built in to raise overall performance. Mantle let's you bypass the cpu for setting things up. Anyways those slides will show you how much each things costs to process.
                        But then you tie yourself to a SPECIFIC hardware specification. Take Mantle: it's designed only for GCN GPUs. What happens when GCN reaches EOL and AMD has to replace it? Whoops, every game that used Mantle now needs to be patched, because the assumptions it made about the hardware design no longer hold true.

                        Hence the advantage and disadvantage of higher-level APIs: you can abstract everything and, in theory, keep support forever. The downside is that this abstraction can easily double the time any given task takes. If you want speed, you go to a lower-level API; if you want support, you go to a higher-level API.

                        Hence why we have consoles.
                        Last edited by gamerk2; 04-28-2014, 10:58 AM.



                        • #27
                          Originally posted by gamerk2 View Post
                          Take Mantle; its designed for only GCN GPU's. What happens when GCN reaches EOL and AMD has to replace it? Woops, every game that used Mantle now needs to patch itself, because the assumptions made about the HW design no longer hold true.
                          It's not designed for *only* GCN GPUs -- it's more correct to say that the HW capabilities of GCN GPUs represent the minimum requirement.



                          • #28
                            Originally posted by bridgman View Post
                            It's not designed for *only* GCN GPUs -- it's more correct to say that the HW capabilities of GCN GPUs represent the minimum requirement.
                            Nope. When you make an API at that low a level, you *are required* to expose the low-level details of how the GPU operates in order to create a working driver. Every time the hardware capabilities change, you are going to have to tear the API apart to support the newer architecture.

                            I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.



                            • #29
                              Originally posted by gamerk2 View Post
                              Nope. When you make that low a level API, you *are required* to expose the lower level details of how the GPU operates in order to create a working driver. Every time the HW capabilities change, you are going to have to tear apart the API to support the newer architecture.

                              I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.
                              Agreed, but Mantle is not that low level.



                              • #30
                                Originally posted by gamerk2 View Post
                                Nope. When you make that low a level API, you *are required* to expose the lower level details of how the GPU operates in order to create a working driver. Every time the HW capabilities change, you are going to have to tear apart the API to support the newer architecture.

                                I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.
                                No. Mantle only requires GCN because it depends on certain hardware features that were not available on older ASICs.

