Why are graphics as complicated as they are?
-
Originally posted by profoundWHALE
...which bypasses the gpu/cpu and instead uses a specific hardware that's built in to raise overall performance.
it's an area with many, many people who don't know much about it (and some are "smart" nonetheless, that's why i don't call myself an audiophile)
here, a low pass FIR filter
i think it's 32 tap, can't remember
Code:
format ELF64 executable
entry start

align 16
segment readable writeable

samples equ 1024

align 16
buff_in: rw samples*2
align 16
buff_out: rw samples*2

; 2^16 = 1.0
align 16
coefficients:
    dw 5, 14, 35, 57, 39, -76, -328, -677
    dw -945, -828, 16, 1799, 4428, 7441, 10109, 11680
    dw 11680, 10109, 7441, 4428, 1799, 16, -828, -945
    dw -677, -328, -76, 39, 57, 35, 14, 5

    dw 705, 1240, 1434, 1169, 456, -550, -1576, -2287
    dw -2367, -1609, 25, 2377, 5106, 7760, 9862, 11024
    dw 11024, 9862, 7760, 5106, 2377, 25, -1609, -2367
    dw -2287, -1576, -550, 456, 1169, 1434, 1240, 705

rq 10 ;just in case
tmp rq 100

sys_read equ 0
sys_write equ 1
sys_exit equ 60

align 16
segment readable executable
start:
    xorps xmm7, xmm7
    xorps xmm6, xmm6
    xorps xmm5, xmm5
    xorps xmm4, xmm4
    xorps xmm3, xmm3
    xorps xmm2, xmm2
    xorps xmm1, xmm1
    xorps xmm0, xmm0
    mov [tmp], rdx
    mov [tmp+8], rdx

filter:
    mov rdx, samples*2*2
    mov rsi, buff_in
    mov rdi, 0
    mov rax, sys_read
    syscall
    cmp rax, 0
    jz end_fir

    xor rax, rax
    mov rcx, samples
loopy:
    movaps xmm12, [coefficients]
    movaps xmm13, [coefficients+16*1]
    movaps xmm14, [coefficients+16*2]
    movaps xmm15, [coefficients+16*3]

    mov edx, [rax+buff_in] ;dx = left channel sample
    mov esi, edx
    shr esi, 16 ;si = right channel sample

    ; left channel delay line
    pslldq xmm3, 2
    movaps xmm10, xmm2
    psrldq xmm10, 2*7
    orpd xmm3, xmm10
    pslldq xmm2, 2
    movaps xmm10, xmm1
    psrldq xmm10, 2*7
    orpd xmm2, xmm10
    pslldq xmm1, 2
    movaps xmm10, xmm0
    psrldq xmm10, 2*7
    orpd xmm1, xmm10
    pslldq xmm0, 2 ;bytes
    movzx rdx, dx
    movq xmm10, rdx
    orpd xmm0, xmm10

    ; right channel delay line
    pslldq xmm7, 2
    movaps xmm10, xmm6
    psrldq xmm10, 2*7
    orpd xmm7, xmm10
    pslldq xmm6, 2
    movaps xmm10, xmm5
    psrldq xmm10, 2*7
    orpd xmm6, xmm10
    pslldq xmm5, 2
    movaps xmm10, xmm4
    psrldq xmm10, 2*7
    orpd xmm5, xmm10
    pslldq xmm4, 2
    movq xmm10, rsi
    orpd xmm4, xmm10

    movaps xmm8, xmm12
    movaps xmm9, xmm13
    movaps xmm10, xmm14
    movaps xmm11, xmm15

    pmaddwd xmm12, xmm0
    pmaddwd xmm13, xmm1
    pmaddwd xmm14, xmm2
    pmaddwd xmm15, xmm3
    paddd xmm12, xmm14
    paddd xmm13, xmm15
    paddd xmm12, xmm13
    movhlps xmm13, xmm12
    paddd xmm12, xmm13
    movss xmm13, xmm12
    psrldq xmm12, 32
    paddd xmm12, xmm13

    pmaddwd xmm8, xmm4
    pmaddwd xmm9, xmm5
    pmaddwd xmm10, xmm6
    pmaddwd xmm11, xmm7
    paddd xmm8, xmm10
    paddd xmm9, xmm11
    paddd xmm8, xmm9
    movhlps xmm9, xmm8
    paddd xmm8, xmm9
    movss xmm9, xmm8
    psrldq xmm8, 32
    paddd xmm8, xmm9

    movd r10d, xmm12
    ;sal r10d, 3 ;overflow possibility unless compensated
    shr r10d, 16
    mov [rax+buff_out], r10w

    movq r10, xmm8
    ;sal r10d, 2
    shr r10d, 16
    mov [rax+buff_out+2], r10w

    add rax, 4
    dec rcx
    jnz loopy

    mov rdx, samples*2*2
    mov rsi, buff_out
    mov rdi, 1
    mov rax, sys_write
    syscall
    jmp filter

end_fir:
    mov rax, sys_exit
    syscall
real 0m2.206s
user 0m0.129s
sys 0m0.251s
where hell.raw is a 64MB, ~6.25 minutes long, 44100Hz, 16bit stereo
and most of that time is spent in read() and write()
Last edited by gens; 28 April 2014, 07:59 AM.
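For comparison, a scalar version of the same kind of filter can be sketched in C. This is not the code from the post: `fir_state` and `fir_step` are hypothetical names, samples are assumed to be signed 16-bit, and unlike the assembly it accumulates in 64 bits and clamps the result instead of leaving overflow to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPS 32

/* Coefficients copied from the post above; 65536 (2^16) represents 1.0. */
static const int16_t coeff[TAPS] = {
        5,    14,    35,    57,    39,   -76,  -328,  -677,
     -945,  -828,    16,  1799,  4428,  7441, 10109, 11680,
    11680, 10109,  7441,  4428,  1799,    16,  -828,  -945,
     -677,  -328,   -76,    39,    57,    35,    14,     5
};

/* Hypothetical per-channel state: delay[0] holds the newest sample. */
typedef struct {
    int16_t delay[TAPS];
} fir_state;

/* Push one input sample into the delay line and return one filtered sample. */
static int16_t fir_step(fir_state *s, int16_t in)
{
    assert(s != NULL);

    /* Shift the history by one sample, oldest sample falls off the end. */
    for (size_t i = TAPS - 1; i > 0; i--)
        s->delay[i] = s->delay[i - 1];
    s->delay[0] = in;

    /* 64-bit accumulator: the worst-case sum does not fit in 32 bits. */
    int64_t acc = 0;
    for (size_t i = 0; i < TAPS; i++)
        acc += (int64_t)coeff[i] * s->delay[i];

    /* Drop the 16 fractional bits (assumes an arithmetic right shift). */
    acc >>= 16;

    /* Clamp to the 16-bit sample range. */
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

For interleaved stereo you would keep one `fir_state` per channel and call `fir_step` once per channel per frame, which is roughly what the two delay lines in the assembly do with xmm0-xmm3 and xmm4-xmm7.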
-
i think i c/p'd a testing version
here is one that i know is correct
(phoronix... rly took me 30sec to realize)
Code:
format ELF64 executable
entry start

align 16
segment readable writeable

samples equ 1024

align 16
buff_in: rw samples*2
align 16
buff_out: rw samples*2

; 2^16 = 1.0
align 16
coefficients:
    dw 5, 14, 35, 57, 39, -76, -328, -677
    dw -945, -828, 16, 1799, 4428, 7441, 10109, 11680
    dw 11680, 10109, 7441, 4428, 1799, 16, -828, -945
    dw -677, -328, -76, 39, 57, 35, 14, 5

rq 10 ;just in case
tmp rq 100

sys_read equ 0
sys_write equ 1
sys_exit equ 60

align 16
segment readable executable
start:
    ; clear both delay lines
    xorps xmm7, xmm7
    xorps xmm6, xmm6
    xorps xmm5, xmm5
    xorps xmm4, xmm4
    xorps xmm3, xmm3
    xorps xmm2, xmm2
    xorps xmm1, xmm1
    xorps xmm0, xmm0
    mov [tmp], rdx
    mov [tmp+8], rdx

filter:
    ; read one buffer of stereo frames from stdin
    mov rdx, samples*2*2
    mov rsi, buff_in
    mov rdi, 0
    mov rax, sys_read
    syscall
    cmp rax, 0
    jz end_fir

    xor rax, rax
    mov rcx, samples
loopy:
    movaps xmm12, [coefficients]
    movaps xmm13, [coefficients+16*1]
    movaps xmm14, [coefficients+16*2]
    movaps xmm15, [coefficients+16*3]

    mov edx, [rax+buff_in] ;dx = left channel sample
    mov esi, edx
    shr esi, 16 ;si = right channel sample

    ; left channel delay line
    pslldq xmm3, 2
    movaps xmm10, xmm2
    psrldq xmm10, 2*7
    orpd xmm3, xmm10
    pslldq xmm2, 2
    movaps xmm10, xmm1
    psrldq xmm10, 2*7
    orpd xmm2, xmm10
    pslldq xmm1, 2
    movaps xmm10, xmm0
    psrldq xmm10, 2*7
    orpd xmm1, xmm10
    pslldq xmm0, 2 ;bytes
    movzx rdx, dx
    movq xmm10, rdx
    orpd xmm0, xmm10

    ; right channel delay line
    pslldq xmm7, 2
    movaps xmm10, xmm6
    psrldq xmm10, 2*7
    orpd xmm7, xmm10
    pslldq xmm6, 2
    movaps xmm10, xmm5
    psrldq xmm10, 2*7
    orpd xmm6, xmm10
    pslldq xmm5, 2
    movaps xmm10, xmm4
    psrldq xmm10, 2*7
    orpd xmm5, xmm10
    pslldq xmm4, 2
    movq xmm10, rsi
    orpd xmm4, xmm10

    ; keep a copy of the coefficients for the right channel
    movaps xmm8, xmm12
    movaps xmm9, xmm13
    movaps xmm10, xmm14
    movaps xmm11, xmm15

    ; left channel: multiply-accumulate, then horizontal sum
    pmaddwd xmm12, xmm0
    pmaddwd xmm13, xmm1
    pmaddwd xmm14, xmm2
    pmaddwd xmm15, xmm3
    paddd xmm12, xmm14
    paddd xmm13, xmm15
    paddd xmm12, xmm13
    movhlps xmm13, xmm12
    paddd xmm12, xmm13
    movss xmm13, xmm12
    psrldq xmm12, 4 ;bytes; was 32, which clears the whole register and drops half the sum
    paddd xmm12, xmm13

    ; right channel: same thing
    pmaddwd xmm8, xmm4
    pmaddwd xmm9, xmm5
    pmaddwd xmm10, xmm6
    pmaddwd xmm11, xmm7
    paddd xmm8, xmm10
    paddd xmm9, xmm11
    paddd xmm8, xmm9
    movhlps xmm9, xmm8
    paddd xmm8, xmm9
    movss xmm9, xmm8
    psrldq xmm8, 4 ;bytes; was 32
    paddd xmm8, xmm9

    movd r10d, xmm12
    ;sal r10d, 3 ;overflow possibility unless compensated
    shr r10d, 16
    mov [rax+buff_out], r10w

    movq r10, xmm8
    ;sal r10d, 2
    shr r10d, 16
    mov [rax+buff_out+2], r10w

    add rax, 4
    dec rcx
    jnz loopy

    ; write the filtered buffer to stdout
    mov rdx, samples*2*2
    mov rsi, buff_out
    mov rdi, 1
    mov rax, sys_write
    syscall
    jmp filter

end_fir:
    mov rax, sys_exit
    syscall
-
Originally posted by Daktyl198
I don't get this either... why can't we map the API calls to the specific hardware at boot or through a function (for hot-swapping/other GPU changes) and store the result, that way we don't spend precious CPU cycles translating all the time?
You also have the somewhat substantial driver layer to manage the HW resources, which also saps performance and is probably THE performance killer on weaker CPUs. That's what Mantle/DX12 are attempting to solve going forward.
The final problem, at least for OGL, is that the API is aged, doesn't reflect how the HW actually works anymore, and simply doesn't play well with newer programming methodologies (object-oriented, which OGL is not).
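The "resolve once and store the result" idea from the quoted question can be sketched in C as a table of function pointers picked at init time. Everything here is made up for illustration: `gfx_backend`, `gfx_init`, `gfx_draw` and the two vendor functions are hypothetical names, not any real driver API, and the "command word" counts are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical backend: one function pointer per API entry point. */
typedef struct {
    const char *name;
    /* Stand-in for translating a draw call into hardware commands;
       it only returns how many made-up command words it would emit. */
    unsigned (*draw)(unsigned vertex_count);
} gfx_backend;

static unsigned draw_vendor_a(unsigned n) { return 2 + n; }
static unsigned draw_vendor_b(unsigned n) { return 4 + n; }

static const gfx_backend backends[] = {
    { "vendor_a", draw_vendor_a },
    { "vendor_b", draw_vendor_b },
};

/* Resolved once at init (or again on a GPU change), reused on every call. */
static const gfx_backend *active = NULL;

/* Returns 0 on success, -1 if no backend matches; on failure the
   previously selected backend stays active. */
static int gfx_init(const char *detected)
{
    for (size_t i = 0; i < sizeof backends / sizeof backends[0]; i++) {
        if (strcmp(backends[i].name, detected) == 0) {
            active = &backends[i];
            return 0;
        }
    }
    return -1;
}

/* The per-call cost of the lookup is a single indirect call. */
static unsigned gfx_draw(unsigned vertex_count)
{
    assert(active != NULL);
    return active->draw(vertex_count);
}
```

Note that this only removes the cost of choosing a code path. Whatever work the selected function does on each call, such as the driver-side resource management described above, is still paid every time.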
-
Originally posted by Daktyl198
I understand that a lot of computations are done on the CPU, then the results are sent to the GPU. I was talking more about stuff like this though:
Watching a 1080p video normally, and watching a 1080p video with "hardware acceleration".
I assume the first means that all decoding and graphics processing is done on the CPU (assuming a non-OGL rendering method) while the second means using the GPU for both operations. If this is true, why wouldn't the GPU be used in the first place? Since it's obviously made for tasks such as these, vs the CPU which (for the most part) is not.
On a general purpose PC, you can NOT assume you have access to all computer resources. You can NOT assume full HW support.
Take a look at the slides from AMD's TrueAudio, which bypasses the gpu/cpu and instead uses specific built-in hardware to raise overall performance. Mantle lets you bypass the cpu for setting things up. Anyway, those slides will show you how much each thing costs to process.
Hence the advantage and disadvantage of higher-level APIs: you can abstract everything and keep support, in theory, forever. The downside is that this can easily double the time to complete any specific task, due to said abstraction. If you want speed, you go to lower-level APIs. If you want support, you go to higher-level APIs.
Hence why we have consoles.
Last edited by gamerk2; 28 April 2014, 10:58 AM.
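The point about not being able to assume HW support is why the software path tends to be the default and the accelerated one is opt-in. A minimal sketch of that decision in C, with a made-up capability struct (`hw_caps`, its fields and `choose_decode_path` are all hypothetical, not taken from any real video API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { DECODE_SOFTWARE, DECODE_HARDWARE } decode_path;

/* Hypothetical capability report for the machine we happen to run on. */
typedef struct {
    bool     has_decoder_block;  /* fixed-function decoder present */
    bool     driver_exposes_it;  /* driver actually makes it usable */
    unsigned max_height;         /* tallest frame the block can handle */
} hw_caps;

/* Use the hardware path only when every requirement is met; in all
   other cases, including missing capability info, decode on the CPU. */
static decode_path choose_decode_path(const hw_caps *caps, unsigned frame_height)
{
    assert(frame_height > 0);

    if (caps != NULL &&
        caps->has_decoder_block &&
        caps->driver_exposes_it &&
        frame_height <= caps->max_height)
        return DECODE_HARDWARE;

    return DECODE_SOFTWARE;
}
```

On a console every field of that struct would be known ahead of time, so the check disappears. On a general purpose PC it has to be made at run time, and the CPU fallback has to exist either way.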
-
Originally posted by gamerk2
Take Mantle; it's designed for only GCN GPUs. What happens when GCN reaches EOL and AMD has to replace it? Whoops, every game that used Mantle now needs to patch itself, because the assumptions made about the HW design no longer hold true.
-
Originally posted by bridgman
It's not designed for *only* GCN GPUs -- it's more correct to say that the HW capabilities of GCN GPUs represent the minimum requirement.
I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.
-
Originally posted by gamerk2
Nope. When you make that low a level API, you *are required* to expose the lower level details of how the GPU operates in order to create a working driver. Every time the HW capabilities change, you are going to have to tear apart the API to support the newer architecture.
I still remember the days when games shipped with four different graphics drivers and two or three audio drivers. I do NOT want to go back to those days.