Thread: Why are graphics as complicated as they are?

  1. #21
    Join Date
    Apr 2014
    Posts
    1

    Default

    Quote Originally Posted by gens View Post
    (disclaimer: i'm not an expert)
    i'm guessing here
    you are thinking about absolute efficiency

    thing is gpu's are nowadays just a bunch of compute units orchestrated by a control unit (there's more ofc)
    compute units are simple things
    with that kind of design gpu's are not limited to doing just one specific kind of "rendering"
    (rendering 3D is just a bunch of mathematical transforms with some logic in the mix)

    so a gpu driver is basically a state machine that says to the hardware (firmware in this case) what should be done


    also about the cpu part in it
    even in a case of something simple like a desktop or a window with some buttons or something you still need the logic behind it

    like when you move a window
    you have to calculate where it is moved
    check, based on rules, things like if you move it to an edge do you flip to the next virtual desktop (etc)

    but that is simple
    in complex graphics for example you don't want the gpu to draw the whole huge world
    so you cull everything not seen
    you do it on the cpu because you have to know in advance what you will be rendering (to not send textures when not needed, vertices, etc.)
    (this is also required for a desktop if you want lower gpu memory usage)
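A minimal sketch of the CPU-side culling described above, assuming each object is approximated by a bounding sphere and the visible region is given as a set of planes with unit-length normals. The type and function names (`Plane`, `Sphere`, `sphere_visible`, `cull`) are hypothetical and not taken from any real engine or driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef struct { float x, y, z; } Vec3;
/* A point p is on the inside of the plane when dot(n, p) + d >= 0.
   The normal n is assumed to be unit length. */
typedef struct { Vec3 n; float d; } Plane;
typedef struct { Vec3 center; float radius; } Sphere;

static float dot3(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Returns 1 if the sphere may be visible, 0 if it lies entirely
   outside at least one plane and can be skipped. */
int sphere_visible(const Plane *planes, size_t plane_count, Sphere s)
{
    assert(planes != NULL || plane_count == 0);
    for (size_t i = 0; i < plane_count; i++) {
        float dist = dot3(planes[i].n, s.center) + planes[i].d;
        if (dist < -s.radius)
            return 0;
    }
    return 1;
}

/* Writes the indices of the objects that survive culling to visible_out
   (which must have room for object_count entries) and returns how many
   there are. Only these would have their data sent to the GPU. */
size_t cull(const Plane *planes, size_t plane_count,
            const Sphere *objects, size_t object_count,
            size_t *visible_out)
{
    size_t n = 0;
    for (size_t i = 0; i < object_count; i++) {
        if (sphere_visible(planes, plane_count, objects[i]))
            visible_out[n++] = i;
    }
    return n;
}
```

The test is conservative: a sphere that straddles a plane is kept, so nothing visible is dropped, at the cost of occasionally sending an object that ends up off screen.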


    still i like the idea of directly controlling the gpu
    i read something that in the future (or maybe even now in new opengl) you will be able to get a pointer in gpu memory

    also i think gpus are going in the direction of having a dedicated cpu (like ARM or something) on them that would control it
    imagine you could write full fledged programs to run on a massively parallel gpu (like semi Turing-complete shaders)

    in my eyes the future looks bright in the gpu department

    edit: in short; they are complicated so they don't become even more complicated, but they are simplifying slowly (in general design)
    My cheap laptop has Intel HD Graphics 2500, which is good enough for playing many FPS and RTS titles released before 2008 at medium details.

  2. #22
    Join Date
    Oct 2013
    Location
    Canada
    Posts
    419

    Default

    Take a look at the slides from AMD's TrueAudio, which bypasses the GPU/CPU and instead uses specific built-in hardware to raise overall performance. Mantle lets you bypass the CPU for setting things up. Anyway, those slides will show you how much each thing costs to process.

  3. #23
    Join Date
    May 2012
    Posts
    555

    Default

    Quote Originally Posted by profoundWHALE View Post
    ...which bypasses the gpu/cpu and instead uses a specific hardware that's built in to raise overall performance.
    please don't, just don't talk about audio processing
    it's an area with many many people who don't know about it (and some are "smart" nonetheless, that's why i don't call myself an audiophile)
    here, a low pass FIR filter
    i think it's 32 tap, can't remember

    Code:
    format ELF64 executable
    entry start
    
    align 16
    segment readable writeable
    
    samples equ 1024
    
    align 16
    buff_in: rw samples*2
    align 16
    buff_out: rw samples*2
    
    ; 2^16 = 1.0
    align 16
    coefficients:
    
    dw 5
    dw 14
    dw 35
    dw 57
    dw 39
    dw -76
    dw -328
    dw -677
    dw -945
    dw -828
    dw 16
    dw 1799
    dw 4428
    dw 7441
    dw 10109
    dw 11680
    dw 11680
    dw 10109
    dw 7441
    dw 4428
    dw 1799
    dw 16
    dw -828
    dw -945
    dw -677
    dw -328
    dw -76
    dw 39
    dw 57
    dw 35
    dw 14
    dw 5
    
    dw 705
    dw 1240
    dw 1434
    dw 1169
    dw 456
    dw -550
    dw -1576
    dw -2287
    dw -2367
    dw -1609
    dw 25
    dw 2377
    dw 5106
    dw 7760
    dw 9862
    dw 11024
    dw 11024
    dw 9862
    dw 7760
    dw 5106
    dw 2377
    dw 25
    dw -1609
    dw -2367
    dw -2287
    dw -1576
    dw -550
    dw 456
    dw 1169
    dw 1434
    dw 1240
    dw 705
    
    rq 10 ;just in case
    tmp rq 100
    
    sys_read equ 0
    sys_write equ 1
    sys_exit equ 60
    
    
    align 16
    segment readable executable
    
    start:
    
    xorps xmm7, xmm7
    xorps xmm6, xmm6
    xorps xmm5, xmm5
    xorps xmm4, xmm4
    xorps xmm3, xmm3
    xorps xmm2, xmm2
    xorps xmm1, xmm1
    xorps xmm0, xmm0
    
    mov [tmp], rdx
    mov [tmp+8], rdx
    
    filter:
    
    mov rdx, samples*2*2
    mov rsi, buff_in
    mov rdi, 0
    mov rax, sys_read
    syscall
    cmp rax, 0
    jz end_fir
    
    xor rax, rax
    
    mov rcx, samples
    loopy:
    movaps xmm12, [coefficients]
    movaps xmm13, [coefficients+16*1]
    movaps xmm14, [coefficients+16*2]
    movaps xmm15, [coefficients+16*3]
    mov edx, [rax+buff_in] ;dx = left channel sample
    mov esi, edx
    shr esi, 16 ;si = right channel sample
    
    ; left channel delay line
    pslldq xmm3, 2
    movaps xmm10, xmm2
    psrldq xmm10, 2*7
    orpd xmm3, xmm10
    
    pslldq xmm2, 2
    movaps xmm10, xmm1
    psrldq xmm10, 2*7
    orpd xmm2, xmm10
    
    pslldq xmm1, 2
    movaps xmm10, xmm0
    psrldq xmm10, 2*7
    orpd xmm1, xmm10
    
    pslldq xmm0, 2 ;bytes
    movzx rdx, dx
    movq xmm10, rdx
    orpd xmm0, xmm10
    
    ; right channel delay line
    pslldq xmm7, 2
    movaps xmm10, xmm6
    psrldq xmm10, 2*7
    orpd xmm7, xmm10
    
    pslldq xmm6, 2
    movaps xmm10, xmm5
    psrldq xmm10, 2*7
    orpd xmm6, xmm10
    
    pslldq xmm5, 2
    movaps xmm10, xmm4
    psrldq xmm10, 2*7
    orpd xmm5, xmm10
    
    pslldq xmm4, 2
    movq xmm10, rsi
    orpd xmm4, xmm10
    
    
    
    
    movaps xmm8, xmm12
    movaps xmm9, xmm13
    movaps xmm10, xmm14
    movaps xmm11, xmm15
    
    
    pmaddwd xmm12, xmm0
    pmaddwd xmm13, xmm1
    pmaddwd xmm14, xmm2
    pmaddwd xmm15, xmm3
    
    paddd xmm12, xmm14
    paddd xmm13, xmm15
    
    paddd xmm12, xmm13
    
    movhlps xmm13, xmm12
    paddd xmm12, xmm13
    
    movss xmm13, xmm12
    psrldq xmm12, 4 ;count is in bytes: shift down one dword (a count above 15 would zero the register)
    paddd xmm12, xmm13
    
    
    
    pmaddwd xmm8, xmm4
    pmaddwd xmm9, xmm5
    pmaddwd xmm10, xmm6
    pmaddwd xmm11, xmm7
    
    paddd xmm8, xmm10
    paddd xmm9, xmm11
    
    paddd xmm8, xmm9
    
    movhlps xmm9, xmm8
    paddd xmm8, xmm9
    
    movss xmm9, xmm8
    psrldq xmm8, 4 ;count is in bytes: shift down one dword (a count above 15 would zero the register)
    paddd xmm8, xmm9
    
    
    
    movd r10d, xmm12
    ;sal r10d, 3 ;overflow possibility unless compensated
    shr r10d, 16
    mov [rax+buff_out], r10w
    
    movq r10, xmm8
    ;sal r10d, 2
    shr r10d, 16
    mov [rax+buff_out+2], r10w
    
    add rax, 4
    dec rcx
    jnz loopy
    
    mov rdx, samples*2*2
    mov rsi, buff_out
    mov rdi, 1
    mov rax, sys_write
    syscall
    
    jmp filter
    
    end_fir:
    mov rax, sys_exit
    syscall

    bash-4.2# time ./example < hell.raw >hell2.raw

    real 0m2.206s
    user 0m0.129s
    sys 0m0.251s

    where hell.raw is a 64 MB, ~6.25 minute long, 44100 Hz, 16-bit stereo raw file
    and most of that time is spent in read() and write()
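A quick way to sanity-check those figures, assuming raw interleaved PCM with no header. The helper names below are hypothetical. At 44100 Hz, 2 bytes per sample and 2 channels, the stream is 176400 bytes per second, so 6.25 minutes (375 s) comes to about 66 million bytes, which is in line with the quoted file size. Dividing the audio length by the 0.129 s of user time suggests the filter itself runs on the order of a few thousand times faster than real time on that machine.

```c
#include <assert.h>

/* Bytes consumed per second of raw interleaved PCM audio. */
long pcm_bytes_per_second(long sample_rate_hz, int bytes_per_sample, int channels)
{
    assert(sample_rate_hz > 0 && bytes_per_sample > 0 && channels > 0);
    return sample_rate_hz * bytes_per_sample * channels;
}

/* Playback length in seconds of a raw PCM file of the given size. */
double pcm_duration_seconds(long file_bytes, long bytes_per_second)
{
    assert(bytes_per_second > 0);
    return (double)file_bytes / (double)bytes_per_second;
}

/* How many seconds of audio are processed per second of CPU time. */
double realtime_factor(double audio_seconds, double cpu_seconds)
{
    assert(cpu_seconds > 0.0);
    return audio_seconds / cpu_seconds;
}
```

For example, `realtime_factor(375.0, 0.129)` uses the user time from the post, while `realtime_factor(375.0, 2.206)` uses the wall-clock time, which includes the read() and write() overhead.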
    Last edited by gens; 04-28-2014 at 07:59 AM.

  4. #24
    Join Date
    May 2012
    Posts
    555

    Default

    i think i c/p'd a testing version
    here is one that i know is correct
    (phoronix... rly took me 30sec to realize)

    Code:
    format ELF64 executable
    entry start
    
    align 16
    segment readable writeable
    
    samples equ 1024
    
    align 16
    buff_in: rw samples*2
    align 16
    buff_out: rw samples*2
    
    ; 2^16 = 1.0
    align 16
    coefficients:
    
    dw 5
    dw 14
    dw 35
    dw 57
    dw 39
    dw -76
    dw -328
    dw -677
    dw -945
    dw -828
    dw 16
    dw 1799
    dw 4428
    dw 7441
    dw 10109
    dw 11680
    dw 11680
    dw 10109
    dw 7441
    dw 4428
    dw 1799
    dw 16
    dw -828
    dw -945
    dw -677
    dw -328
    dw -76
    dw 39
    dw 57
    dw 35
    dw 14
    dw 5
    
    rq 10 ;just in case
    tmp rq 100
    
    sys_read equ 0
    sys_write equ 1
    sys_exit equ 60
    
    
    align 16
    segment readable executable
    
    start:
    
    xorps xmm7, xmm7
    xorps xmm6, xmm6
    xorps xmm5, xmm5
    xorps xmm4, xmm4
    xorps xmm3, xmm3
    xorps xmm2, xmm2
    xorps xmm1, xmm1
    xorps xmm0, xmm0
    
    mov [tmp], rdx
    mov [tmp+8], rdx
    
    filter:
    
    mov rdx, samples*2*2
    mov rsi, buff_in
    mov rdi, 0
    mov rax, sys_read
    syscall
    cmp rax, 0
    jz end_fir
    
    xor rax, rax
    
    mov rcx, samples
    loopy:
    movaps xmm12, [coefficients]
    movaps xmm13, [coefficients+16*1]
    movaps xmm14, [coefficients+16*2]
    movaps xmm15, [coefficients+16*3]
    mov edx, [rax+buff_in] ;dx = left channel sample
    mov esi, edx
    shr esi, 16 ;si = right channel sample
    
    ; left channel delay line
    pslldq xmm3, 2
    movaps xmm10, xmm2
    psrldq xmm10, 2*7
    orpd xmm3, xmm10
    
    pslldq xmm2, 2
    movaps xmm10, xmm1
    psrldq xmm10, 2*7
    orpd xmm2, xmm10
    
    pslldq xmm1, 2
    movaps xmm10, xmm0
    psrldq xmm10, 2*7
    orpd xmm1, xmm10
    
    pslldq xmm0, 2 ;bytes
    movzx rdx, dx
    movq xmm10, rdx
    orpd xmm0, xmm10
    
    ; right channel delay line
    pslldq xmm7, 2
    movaps xmm10, xmm6
    psrldq xmm10, 2*7
    orpd xmm7, xmm10
    
    pslldq xmm6, 2
    movaps xmm10, xmm5
    psrldq xmm10, 2*7
    orpd xmm6, xmm10
    
    pslldq xmm5, 2
    movaps xmm10, xmm4
    psrldq xmm10, 2*7
    orpd xmm5, xmm10
    
    pslldq xmm4, 2
    movq xmm10, rsi
    orpd xmm4, xmm10
    
    
    
    
    movaps xmm8, xmm12
    movaps xmm9, xmm13
    movaps xmm10, xmm14
    movaps xmm11, xmm15
    
    
    pmaddwd xmm12, xmm0
    pmaddwd xmm13, xmm1
    pmaddwd xmm14, xmm2
    pmaddwd xmm15, xmm3
    
    paddd xmm12, xmm14
    paddd xmm13, xmm15
    
    paddd xmm12, xmm13
    
    movhlps xmm13, xmm12
    paddd xmm12, xmm13
    
    movss xmm13, xmm12
    psrldq xmm12, 4 ;count is in bytes: shift down one dword (a count above 15 would zero the register)
    paddd xmm12, xmm13
    
    
    
    pmaddwd xmm8, xmm4
    pmaddwd xmm9, xmm5
    pmaddwd xmm10, xmm6
    pmaddwd xmm11, xmm7
    
    paddd xmm8, xmm10
    paddd xmm9, xmm11
    
    paddd xmm8, xmm9
    
    movhlps xmm9, xmm8
    paddd xmm8, xmm9
    
    movss xmm9, xmm8
    psrldq xmm8, 4 ;count is in bytes: shift down one dword (a count above 15 would zero the register)
    paddd xmm8, xmm9
    
    
    
    movd r10d, xmm12
    ;sal r10d, 3 ;overflow possibility unless compensated
    shr r10d, 16
    mov [rax+buff_out], r10w
    
    movq r10, xmm8
    ;sal r10d, 2
    shr r10d, 16
    mov [rax+buff_out+2], r10w
    
    add rax, 4
    dec rcx
    jnz loopy
    
    mov rdx, samples*2*2
    mov rsi, buff_out
    mov rdi, 1
    mov rax, sys_write
    syscall
    
    jmp filter
    
    end_fir:
    mov rax, sys_exit
    syscall
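For readers who don't read SSE assembly, here is a scalar C sketch of what the filter above appears to compute for one channel: a 32-tap FIR with 16-bit coefficients scaled so that 2^16 = 1.0, with the newest sample at the front of the delay line. It is an illustration rather than a translation. The names (`FirState`, `fir_step`) are hypothetical, and unlike the assembly it accumulates in 64 bits and clamps the result instead of letting it wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIR_TAPS 32

/* Same coefficient table as the assembly; 2^16 represents 1.0. */
static const int16_t fir_coeffs[FIR_TAPS] = {
        5,    14,    35,    57,    39,   -76,  -328,  -677,
     -945,  -828,    16,  1799,  4428,  7441, 10109, 11680,
    11680, 10109,  7441,  4428,  1799,    16,  -828,  -945,
     -677,  -328,   -76,    39,    57,    35,    14,     5
};

/* Delay line for one channel; history[0] is the newest sample. */
typedef struct { int16_t history[FIR_TAPS]; } FirState;

void fir_reset(FirState *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);
}

/* Pushes one input sample and returns one filtered output sample. */
int16_t fir_step(FirState *s, int16_t sample)
{
    assert(s != NULL);

    /* Shift the delay line by one sample (the pslldq/psrldq/orpd dance). */
    memmove(&s->history[1], &s->history[0],
            (FIR_TAPS - 1) * sizeof s->history[0]);
    s->history[0] = sample;

    /* Multiply-accumulate (what pmaddwd plus the paddd reduction does). */
    int64_t acc = 0;
    for (size_t i = 0; i < FIR_TAPS; i++)
        acc += (int32_t)fir_coeffs[i] * s->history[i];

    /* Drop the 16 fractional bits. Right-shifting a negative value is
       implementation-defined in C; mainstream compilers shift arithmetically. */
    acc >>= 16;

    /* Assumption: clamp to the 16-bit range; the assembly would wrap. */
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

For stereo input, keep one `FirState` per channel and feed left and right samples to their own state, which mirrors the two separate delay lines in the assembly.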

  5. #25
    Join Date
    Jun 2012
    Posts
    361

    Default

    Quote Originally Posted by Daktyl198 View Post
    I don't get this either... why can't we map the API calls to the specific hardware at boot or through a function (for hot-swapping/other GPU changes) and store the result, that way we don't spend precious CPU cycles translating all the time?
    Because GPU design changes over time, and if you hardcoded everything, the API would break every single time the HW it ran on changed. You also ignore the quite possible case of multiple GPUs, or even non-GPU cards that can run the OGL API (ASICs and the like).

    You also have the somewhat substantial driver layer to manage the HW resources, which also saps performance and is probably THE performance killer on weaker CPUs. That's what Mantle and DX12 are attempting to solve going forward.

    The final problem, at least for OGL, is that the API is aged, doesn't reflect how the HW actually works anymore, and simply doesn't play well with newer programming methodologies (OGL is not object-oriented).
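To make the trade-off concrete, here is a rough sketch of "map the API calls to the hardware once and store the result", done as a table of function pointers chosen at startup. Everything in it is hypothetical: the backend names `gen1` and `gen2`, the `GpuBackend` struct, and the pretend command counts are placeholders, not how any real driver is organised.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-hardware implementation of one API entry point.
   The return value stands in for "hardware commands emitted". */
typedef struct {
    const char *name;
    int (*draw_triangles)(int count);
} GpuBackend;

/* Pretend older hardware needs one command per triangle... */
static int draw_gen1(int count) { return count; }
/* ...and pretend newer hardware accepts batches of 64. */
static int draw_gen2(int count) { return (count + 63) / 64; }

static const GpuBackend backends[] = {
    { "gen1", draw_gen1 },
    { "gen2", draw_gen2 },
};

/* Done once at startup (or on hot-swap): pick the table entry that
   matches the detected hardware. Returns NULL if it is unknown. */
const GpuBackend *select_backend(const char *detected_hw)
{
    assert(detected_hw != NULL);
    for (size_t i = 0; i < sizeof backends / sizeof backends[0]; i++) {
        if (strcmp(backends[i].name, detected_hw) == 0)
            return &backends[i];
    }
    return NULL;
}

/* The application-facing call never changes; only the table entry does. */
int api_draw_triangles(const GpuBackend *gpu, int count)
{
    assert(gpu != NULL);
    return gpu->draw_triangles(count);
}
```

The lookup is paid once, but every call still goes through an indirection, and a new hardware generation needs a new table entry. If the calls were hardcoded for `gen1`, the application itself would have to change when `gen2` arrived, which is the breakage described above.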

  6. #26
    Join Date
    Jun 2012
    Posts
    361

    Default

    Quote Originally Posted by Daktyl198 View Post
    I understand that a lot of computations are done on the CPU, then the results are sent to the GPU. I was talking more about stuff like this though:

    Watching a 1080p video normally, and watching a 1080p video with "hardware acceleration".
    I assume the first means that all decoding and graphics processing is done on the CPU (assuming a non-OGL rendering method) while the second means using the GPU for both operations. If this is true, why wouldn't the GPU be used in the first place? Since it's obviously made for tasks such as these, vs the CPU which (for the most part) is not.
    Not every GPU can do this. What about new encoding formats on old GPUs? What if the user is running some other GPU heavy task at the same time; would it be better to let the CPU handle the video decoding instead?

    On a general purpose PC, you can NOT assume you have access to all computer resources. You can NOT assume full HW support.
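A small sketch of what that means for a video player, assuming it can query which codecs the GPU's decoder supports and whether the GPU is currently busy. The `SystemCaps` struct, the `gpu_busy` flag and the codec strings are hypothetical placeholders, and "prefer the CPU when the GPU is busy" is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { DECODE_ON_CPU, DECODE_ON_GPU } DecodePath;

/* Hypothetical description of what this particular machine offers. */
typedef struct {
    const char *const *hw_codecs; /* codecs the GPU decoder supports */
    size_t hw_codec_count;
    int gpu_busy;                 /* nonzero if other work is loading the GPU */
} SystemCaps;

/* Chooses where to decode. The CPU path is the fallback that always
   works; the GPU path is used only when it is known to be available. */
DecodePath choose_decode_path(const SystemCaps *caps, const char *codec)
{
    assert(caps != NULL && codec != NULL);

    if (caps->gpu_busy)
        return DECODE_ON_CPU;

    for (size_t i = 0; i < caps->hw_codec_count; i++) {
        if (strcmp(caps->hw_codecs[i], codec) == 0)
            return DECODE_ON_GPU;
    }
    return DECODE_ON_CPU; /* e.g. a newer format on an older GPU */
}
```

The software path has to exist anyway, so treating it as the default and the hardware path as an optimisation is the safe ordering on a general-purpose PC.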

    Quote Originally Posted by profoundWHALE View Post
    Take a look at the slides from AMD's TrueAudio, which bypasses the GPU/CPU and instead uses specific built-in hardware to raise overall performance. Mantle lets you bypass the CPU for setting things up. Anyway, those slides will show you how much each thing costs to process.
    But you tie yourself to a SPECIFIC HW specification. Take Mantle; it's designed for only GCN GPUs. What happens when GCN reaches EOL and AMD has to replace it? Whoops, every game that used Mantle now needs to patch itself, because the assumptions made about the HW design no longer hold true.

    Hence the advantage and disadvantage of higher level APIs: you can abstract everything and keep support, in theory, forever. The downside is that the abstraction adds overhead and can easily double the time to complete any specific task. If you want speed, you go to lower level APIs. If you want support, you go to higher level APIs.

    Hence why we have consoles.
    Last edited by gamerk2; 04-28-2014 at 10:58 AM.

  7. #27
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,516

    Default

    Quote Originally Posted by gamerk2 View Post
    Take Mantle; it's designed for only GCN GPUs. What happens when GCN reaches EOL and AMD has to replace it? Whoops, every game that used Mantle now needs to patch itself, because the assumptions made about the HW design no longer hold true.
    It's not designed for *only* GCN GPUs -- it's more correct to say that the HW capabilities of GCN GPUs represent the minimum requirement.

  8. #28
    Join Date
    Jun 2012
    Posts
    361

    Default

    Quote Originally Posted by bridgman View Post
    It's not designed for *only* GCN GPUs -- it's more correct to say that the HW capabilities of GCN GPUs represent the minimum requirement.
    Nope. When you make that low a level API, you *are required* to expose the lower level details of how the GPU operates in order to create a working driver. Every time the HW capabilities change, you are going to have to tear apart the API to support the newer architecture.

    I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.

  9. #29
    Join Date
    Oct 2007
    Location
    Toronto-ish
    Posts
    7,516

    Default

    Quote Originally Posted by gamerk2 View Post
    Nope. When you make that low a level API, you *are required* to expose the lower level details of how the GPU operates in order to create a working driver. Every time the HW capabilities change, you are going to have to tear apart the API to support the newer architecture.

    I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.
    Agreed, but Mantle is not that low level.

  10. #30
    Join Date
    Dec 2007
    Posts
    2,395

    Default

    Quote Originally Posted by gamerk2 View Post
    Nope. When you make that low a level API, you *are required* to expose the lower level details of how the GPU operates in order to create a working driver. Every time the HW capabilities change, you are going to have to tear apart the API to support the newer architecture.

    I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.
    No. Mantle only requires GCN because it requires certain HW features that were not available on older ASICs.
