
**darkbasic** · 19 March 2010, 07:00 AM

Originally posted by RealNC

This is taken care of by tmpfs swapping out.

Try that on my desktop with 2GB of RAM: it would need at least 6GB of swap and would slow everything down. It would be a bad idea even on my laptop, where root and swap are encrypted with Serpent-256.

Originally posted by RealNC

Code:

post_src_unpack() {
        epatch "/etc/portage/env/x11-libs/libXft-2.1.14-lcd-cleartype.diff" || die "failed to apply cleartype patch"
}

And place the patch in the same directory.

Thank you for the tip, I didn't know that.

**oibaf** · 19 March 2010, 07:23 AM

What about writing a script that generates this ugly/faster code from the nice/slower code? There are already some machine-generated files in Mesa, e.g.:
src/mesa/main/remap_helper.h

**rohcQaH** · 19 March 2010, 08:32 AM

Originally posted by bridgman

I should probably mention that I may have oversimplified airlied's response

so where would one go to find his non-simplified response? some irc-logs, a mailing list, somewhere different?

I didn't find anything on this forum, and I have this suspicion that googling for "airlied's response" won't get me too far

**bridgman** · 19 March 2010, 09:40 AM

IRC Logs of #radeon on irc.freenode.net for 2010-03-17

http://people.freedesktop.org/~cbrill/dri-log/index.php?date=2010-03-17&channel=radeon&show_html=true&highlight_names=&date=2010-03-17

... discussion starts around 17:27

**rohcQaH** · 19 March 2010, 10:19 AM

thank you! He didn't mention any details, though one might interpret his words in a way that fits my macro-hack.

< airlied> did you try just makeing radeon_cd_write_x_dwords and a variable macro?
[..]
< airlied> I could get behind a variable arg function

I'll do some performance-tests of my macro when I find the time.

**rohcQaH** · 19 March 2010, 12:03 PM

alright, did the tests.

I don't use radeon/mesa yet (HD5770 -> fglrx), so this benchmark is pretty synthetic. Still, it's easy to see that the variadic version is slower: gcc fails to optimize it.

sourcecode

Code:

~> gcc -v
gcc version 4.3.4 (Gentoo 4.3.4 p1.0, pie-10.1.5)

~> gcc -O2 benchmark.c && ./a.out
benchmark with 50000000 runs, section_ndw = 0
direct calling:
 single:     1540ms
 double:     620ms
 triple:     610ms
 six:        620ms
variadic macro:
 single (v): 2380ms
 double (v): 1580ms
 triple (v): 1400ms
 six (v):    1160ms
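
The benchmark source is not reproduced in the post, so the following is only a minimal sketch of the two call styles being compared, not the code that produced the numbers above. `struct bench_cs`, `bench_write_dword`, `bench_write_dwords_v`, and `bench_demo_matches` are made-up names for illustration. One likely reason the variadic form is hard to optimize is that gcc generally does not inline functions that use `va_start`, so the arguments have to be fetched one by one at run time.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a command stream; not the real libdrm struct. */
struct bench_cs {
    uint32_t *packets;     /* dword buffer */
    unsigned cdw;          /* dwords written so far */
    unsigned section_ndw;  /* non-zero while inside a section */
    unsigned section_cdw;  /* dwords written inside the current section */
};

/* Direct call: one dword, one index bump, one section check per call. */
static inline void bench_write_dword(struct bench_cs *cs, uint32_t dword)
{
    cs->packets[cs->cdw++] = dword;
    if (cs->section_ndw)
        cs->section_cdw++;
}

/* Variadic function: the arguments are fetched through va_arg in a loop,
 * while the index bump and section check happen once per call. */
static void bench_write_dwords_v(struct bench_cs *cs, unsigned n, ...)
{
    va_list ap;

    va_start(ap, n);
    for (unsigned i = 0; i < n; i++)
        cs->packets[cs->cdw + i] = va_arg(ap, unsigned int);
    va_end(ap);

    cs->cdw += n;
    if (cs->section_ndw)
        cs->section_cdw += n;
}

/* Writes the same six dwords both ways; returns 1 if the buffers match. */
static int bench_demo_matches(void)
{
    uint32_t a[6] = {0}, b[6] = {0};
    struct bench_cs direct = { a, 0, 0, 0 };
    struct bench_cs variadic = { b, 0, 0, 0 };

    for (uint32_t i = 1; i <= 6; i++)
        bench_write_dword(&direct, i);
    bench_write_dwords_v(&variadic, 6, 1u, 2u, 3u, 4u, 5u, 6u);

    assert(direct.cdw == 6 && variadic.cdw == 6);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both paths fill the buffer identically; any timing difference comes only from how the compiler handles the call. Wrapping each path in a loop with a timer would give a comparison in the spirit of the numbers above, though the figures will depend on compiler version and flags.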

**Obscene_CNN** · 19 March 2010, 01:12 PM

I believe this needs to be said.

In colleges these days, when they talk about writing fast code, they beat it into students' heads: "There are three things to keep in mind when writing fast code: the algorithm, the algorithm, and the algorithm!" For a simple computer system where a CPU with a single execution pipe does it all, this is mostly true. When you add a CPU with a dual-pipeline architecture, branch prediction, and caches, implementation starts to play a more important role. Add more complexity to the system, like sharing the memory bus and data with more processor cores or bus-mastering peripherals, and implementation can be almost as important as the algorithm, or just as important.

That said, some consideration needs to be paid to what the compiler spits out.

Original C code
BEGIN_BATCH_NO_AUTOSTATE(6);
R600_OUT_BATCH(CP_PACKET3(R600_IT_SET_CTL_CONST, 1));
R600_OUT_BATCH( mmSQ_VTX_BASE_VTX_LOC - ASIC_CTL_CONST_BASE_INDEX);
R600_OUT_BATCH( 0);

R600_OUT_BATCH( CP_PACKET3(R600_IT_SET_CTL_CONST, 1));
R600_OUT_BATCH( mmSQ_VTX_START_INST_LOC - ASIC_CTL_CONST_BASE_INDEX);
R600_OUT_BATCH( 0);
END_BATCH();

Assembly output
testl %ebx, %ebx
je .L96
incl 20(%rax)
.L96:
movq 16(%rsp), %rsi
movq 1328(%rsi), %rax
movl 8(%rax), %edx
movq (%rax), %rcx
mov %edx, %ebx
incl %edx
movl $0, (%rcx,%rbx,4)
movl 16(%rax), %ecx
movl %edx, 8(%rax)
testl %ecx, %ecx
je .L97
incl 20(%rax)
.L97:
movq 16(%rsp), %rdx
movq 1328(%rdx), %rax
movl 8(%rax), %edx
movq (%rax), %rcx
mov %edx, %ebx
incl %edx
movl $0, (%rcx,%rbx,4)
movl %edx, 8(%rax)
movl 16(%rax), %edx
testl %edx, %edx
je .L98
incl 20(%rax)
.L98:
movq 16(%rsp), %rcx
movq 1328(%rcx), %rax
movl 8(%rax), %edx
movq (%rax), %rcx
mov %edx, %ebx
incl %edx
movl $-1073647872, (%rcx,%rbx,4)
movl 16(%rax), %r15d
movl %edx, 8(%rax)
testl %r15d, %r15d
je .L99
incl 20(%rax)
.L99:
movq 16(%rsp), %rsi
movq 1328(%rsi), %rax
movl 8(%rax), %edx
movq (%rax), %rcx
mov %edx, %ebx
incl %edx
movl $1, (%rcx,%rbx,4)
movl 16(%rax), %r14d
movl %edx, 8(%rax)
testl %r14d, %r14d
je .L100
incl 20(%rax)
.L100:
movq 16(%rsp), %rdx
movq 1328(%rdx), %rax
movl 8(%rax), %edx
movq (%rax), %rcx
mov %edx, %ebx
incl %edx
movl $0, (%rcx,%rbx,4)
movl 16(%rax), %r13d
movl %edx, 8(%rax)
testl %r13d, %r13d
jne .L117

<one more block similar to the ones above goes here>

Now for what I replaced it with

Changed C code

BEGIN_BATCH_NO_AUTOSTATE(6);
R600_OUT_BATCHX6(CP_PACKET3(R600_IT_SET_CTL_CONST, 1),
/* R600_OUT_BATCH( */ mmSQ_VTX_BASE_VTX_LOC - ASIC_CTL_CONST_BASE_INDEX,
/* R600_OUT_BATCH( */ 0,

/* R600_OUT_BATCH( */ CP_PACKET3(R600_IT_SET_CTL_CONST, 1),
/* R600_OUT_BATCH( */ mmSQ_VTX_START_INST_LOC - ASIC_CTL_CONST_BASE_INDEX,
/* R600_OUT_BATCH( */ 0);

Changed Code Assembly output

movq 16(%rsp), %rcx
movq 1328(%rcx), %rax
mov 8(%rax), %ecx
movq (%rax), %rdx
movl $-1073647872, (%rdx,%rcx,4)
movl 8(%rax), %ecx
movq (%rax), %rdx
incl %ecx
movl $0, (%rdx,%rcx,4)
movl 8(%rax), %ecx
movq (%rax), %rdx
addl $2, %ecx
movl $0, (%rdx,%rcx,4)
movl 8(%rax), %ecx
movq (%rax), %rdx
addl $3, %ecx
movl $-1073647872, (%rdx,%rcx,4)
movl 8(%rax), %ecx
movq (%rax), %rdx
addl $4, %ecx
movl $1, (%rdx,%rcx,4)
movl 8(%rax), %ecx
movq (%rax), %rdx
addl $5, %ecx
movl $0, (%rdx,%rcx,4)
movl 16(%rax), %r9d
addl $6, 8(%rax)
testl %r9d, %r9d
je .L76
addl $6, 20(%rax)
.L76:

The question now is whether it is worth hiding the ugly macros in a couple of header files, and inconveniencing programmers a little when writing the main code, to cut execution time by more than half.
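
For readers who want to see the idea without digging through the patches, here is a rough sketch of the two macro styles. It is not the code from the patches: `struct toy_cs`, `TOY_OUT_BATCH`, `TOY_OUT_BATCHX6`, and `toy_demo_matches` are hypothetical names, and the field layout is only inferred from the offsets visible in the assembly above. The per-dword macro bumps the index and checks the section flag on every write; the batched macro does both once for all six dwords.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the command-stream struct. The field order is a
 * guess based on the offsets in the assembly (0, 8, 16, 20 on x86-64). */
struct toy_cs {
    uint32_t *packets;     /* offset 0: dword buffer */
    unsigned cdw;          /* offset 8: current write index */
    unsigned ndw;          /* offset 12: buffer size in dwords (assumed) */
    unsigned section_ndw;  /* offset 16: non-zero inside a section */
    unsigned section_cdw;  /* offset 20: dwords written in the section */
};

/* One dword per use: index bump and section check every time. */
#define TOY_OUT_BATCH(cs, v) do {                          \
        (cs)->packets[(cs)->cdw++] = (v);                  \
        if ((cs)->section_ndw)                             \
            (cs)->section_cdw++;                           \
    } while (0)

/* Six dwords per use: one index update and one section check in total. */
#define TOY_OUT_BATCHX6(cs, a, b, c, d, e, f) do {         \
        uint32_t *p_ = &(cs)->packets[(cs)->cdw];          \
        p_[0] = (a); p_[1] = (b); p_[2] = (c);             \
        p_[3] = (d); p_[4] = (e); p_[5] = (f);             \
        (cs)->cdw += 6;                                    \
        if ((cs)->section_ndw)                             \
            (cs)->section_cdw += 6;                        \
    } while (0)

/* Emits the same six dwords with both macros; returns 1 if the resulting
 * buffers and counters are identical. */
static int toy_demo_matches(void)
{
    uint32_t one[6] = {0}, six[6] = {0};
    struct toy_cs a = { one, 0, 6, 6, 0 };
    struct toy_cs b = { six, 0, 6, 6, 0 };

    for (uint32_t i = 0; i < 6; i++)
        TOY_OUT_BATCH(&a, i * 10u);
    TOY_OUT_BATCHX6(&b, 0u, 10u, 20u, 30u, 40u, 50u);

    assert(a.cdw == 6 && b.cdw == 6);
    assert(a.section_cdw == 6 && b.section_cdw == 6);
    return memcmp(one, six, sizeof one) == 0;
}
```

Caching the base pointer in the local `p_` is a choice made for this sketch; the changed assembly above reloads the buffer pointer and index before each store, so the actual macro presumably does not do this. The end state is the same either way.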

**Obscene_CNN** · 19 March 2010, 01:39 PM

To be fair, some of the core developers are warming to the general idea. They just object to the 12 hacky macros that implement it.

**darkbasic** · 20 March 2010, 01:18 PM

I have some problems with your patches:

http://darkbasic.homelinux.com/benchmarks/20100319/perfpatchissue.ogg

(Ogg Theora)

**darkbasic** · 20 March 2010, 01:28 PM

Here is dmesg drm output:

http://darkbasic.homelinux.com/benchmarks/20100319/perfpatch.log

The patches are applied against git master; a patch which disables vsync is applied too.


r600/r700 libdrm, mesa, and radeon performance patches
