r600/r700 libdrm, mesa, and radeon performance patches

  • #21
    Originally posted by RealNC View Post
    This is taken care of by tmpfs swapping out.
    Try that on my desktop with 2GB of RAM: it would need at least 6GB of swap and would slow everything down. It would be a bad idea even on my laptop, where root and swap are encrypted with Serpent-256.

    Originally posted by RealNC View Post
    Code:
    post_src_unpack() {                                                                                                              
            epatch "/etc/portage/env/x11-libs/libXft-2.1.14-lcd-cleartype.diff" || die "failed to apply cleartype patch"             
    }
    And place the patch in the same directory.
    Thank you for the tip, I didn't know that.
    ## VGA ##
    AMD: X1950XTX, HD3870, HD5870
    Intel: GMA45, HD3000 (Core i5 2500K)


    • #22
      What about writing a script that generates this ugly/faster code from the nice/slower code? There are already some machine-generated files in Mesa, e.g.:
      src/mesa/main/remap_helper.h


      • #23
        Originally posted by bridgman View Post
        I should probably mention that I may have oversimplified airlied's response
        So where would one go to find his non-simplified response? IRC logs, a mailing list, somewhere else?

        I didn't find anything on this forum, and I suspect that googling for "airlied's response" won't get me very far.


        • #25
          Thank you! He didn't mention any details, though one might interpret his words in a way that fits my macro hack.

          < airlied> did you try just makeing radeon_cd_write_x_dwords and a variable macro?
          [..]
          < airlied> I could get behind a variable arg function
          I'll do some performance tests of my macro when I find the time.
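For readers wondering what a "variable macro" for writing several dwords could look like, here is a minimal sketch. It is not the macro from the patches: `struct cmd_stream`, `cs_write_dwords` and `CS_WRITE` are hypothetical names made up for illustration, and the struct is a stand-in rather than the real libdrm command stream. The idea is that the macro arguments initialise a local array, `sizeof` yields the count at compile time, and the write index is bumped once per call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the command stream; not the real libdrm struct. */
struct cmd_stream {
    uint32_t *buf;  /* dword buffer */
    unsigned  cdw;  /* dwords written so far */
    unsigned  ndw;  /* capacity in dwords */
};

/* Copy n dwords into the stream and bump the write index once. */
static inline void cs_write_dwords(struct cmd_stream *cs,
                                   const uint32_t *src, unsigned n)
{
    assert(cs->cdw + n <= cs->ndw);
    memcpy(cs->buf + cs->cdw, src, n * sizeof *src);
    cs->cdw += n;
}

/* C99 variadic macro (hypothetical): the arguments initialise a local
 * array, and sizeof gives the element count at compile time.
 * At least one value must be passed. */
#define CS_WRITE(cs, ...)                                          \
    do {                                                           \
        const uint32_t dw_[] = { __VA_ARGS__ };                    \
        cs_write_dwords((cs), dw_, sizeof dw_ / sizeof dw_[0]);    \
    } while (0)
```

A call such as `CS_WRITE(&cs, 1u, 2u, 3u);` appends three dwords. Whether a compiler reduces this to plain stores depends on the compiler and flags, so it would need measuring.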


          • #26
            Alright, I did the tests.

            I don't use radeon/mesa yet (HD5770 -> fglrx), so this benchmark is pretty synthetic. Still, it's easy to see that the variadic version is slow: gcc fails to optimize it.

            sourcecode

            Code:
            ~> gcc -v
            gcc version 4.3.4 (Gentoo 4.3.4 p1.0, pie-10.1.5)
            
            ~> gcc -O2 benchmark.c && ./a.out
            benchmark with 50000000 runs, section_ndw = 0
            direct calling:
             single:     1540ms
             double:     620ms
             triple:     610ms
             six:        620ms
            variadic macro:
             single (v): 2380ms
             double (v): 1580ms
             triple (v): 1400ms
             six (v):    1160ms
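The two call styles can be sketched as follows. This is not the benchmark behind the numbers above; `struct cmd_stream`, `cs_write_2` and `cs_write_v` are hypothetical names chosen for illustration. The variadic function fetches each value through `va_arg` at run time, and GCC generally will not inline a function that uses `va_start`, which is one plausible reason for a gap like the one measured.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical stand-in for the command stream; not the real libdrm struct. */
struct cmd_stream {
    uint32_t *buf;  /* dword buffer */
    unsigned  cdw;  /* dwords written so far */
    unsigned  ndw;  /* capacity in dwords */
};

/* Direct variant: fixed arity, easy for the compiler to inline. */
static inline void cs_write_2(struct cmd_stream *cs, uint32_t a, uint32_t b)
{
    uint32_t *p = cs->buf + cs->cdw;

    assert(cs->cdw + 2 <= cs->ndw);
    p[0] = a;
    p[1] = b;
    cs->cdw += 2;
}

/* Variadic variant: the count is a run-time argument and every value is
 * pulled out of the va_list. Callers must pass unsigned int values. */
static void cs_write_v(struct cmd_stream *cs, unsigned n, ...)
{
    uint32_t *p = cs->buf + cs->cdw;
    va_list ap;
    unsigned i;

    assert(cs->cdw + n <= cs->ndw);
    va_start(ap, n);
    for (i = 0; i < n; i++)
        p[i] = (uint32_t)va_arg(ap, unsigned int);
    va_end(ap);
    cs->cdw += n;
}
```

Both variants leave the stream in the same state; only the generated code differs.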


            • #27
              I believe this needs to be said.

              In colleges these days, when they talk about writing fast code, they beat it into students' heads: "There are three things to keep in mind when writing fast code: the algorithm, the algorithm, and the algorithm!" For a simple computer system where a CPU with a single execution pipe does it all, this is mostly true. When you add a CPU with a dual-pipeline architecture, branch prediction, and caches, implementation starts to play a more important role. Add more complexity to the system, such as sharing the memory bus and data with more processor cores or bus-mastering peripherals, and implementation can be almost or just as important as the algorithm.


              That being said, some consideration needs to be paid to what the compiler spits out.


              Original C code
              BEGIN_BATCH_NO_AUTOSTATE(6);
              R600_OUT_BATCH(CP_PACKET3(R600_IT_SET_CTL_CONST, 1));
              R600_OUT_BATCH( mmSQ_VTX_BASE_VTX_LOC - ASIC_CTL_CONST_BASE_INDEX);
              R600_OUT_BATCH( 0);

              R600_OUT_BATCH( CP_PACKET3(R600_IT_SET_CTL_CONST, 1));
              R600_OUT_BATCH( mmSQ_VTX_START_INST_LOC - ASIC_CTL_CONST_BASE_INDEX);
              R600_OUT_BATCH( 0);
              END_BATCH();

              Assembly output
              testl %ebx, %ebx
              je .L96
              incl 20(%rax)
              .L96:
              movq 16(%rsp), %rsi
              movq 1328(%rsi), %rax
              movl 8(%rax), %edx
              movq (%rax), %rcx
              mov %edx, %ebx
              incl %edx
              movl $0, (%rcx,%rbx,4)
              movl 16(%rax), %ecx
              movl %edx, 8(%rax)
              testl %ecx, %ecx
              je .L97
              incl 20(%rax)
              .L97:
              movq 16(%rsp), %rdx
              movq 1328(%rdx), %rax
              movl 8(%rax), %edx
              movq (%rax), %rcx
              mov %edx, %ebx
              incl %edx
              movl $0, (%rcx,%rbx,4)
              movl %edx, 8(%rax)
              movl 16(%rax), %edx
              testl %edx, %edx
              je .L98
              incl 20(%rax)
              .L98:
              movq 16(%rsp), %rcx
              movq 1328(%rcx), %rax
              movl 8(%rax), %edx
              movq (%rax), %rcx
              mov %edx, %ebx
              incl %edx
              movl $-1073647872, (%rcx,%rbx,4)
              movl 16(%rax), %r15d
              movl %edx, 8(%rax)
              testl %r15d, %r15d
              je .L99
              incl 20(%rax)
              .L99:
              movq 16(%rsp), %rsi
              movq 1328(%rsi), %rax
              movl 8(%rax), %edx
              movq (%rax), %rcx
              mov %edx, %ebx
              incl %edx
              movl $1, (%rcx,%rbx,4)
              movl 16(%rax), %r14d
              movl %edx, 8(%rax)
              testl %r14d, %r14d
              je .L100
              incl 20(%rax)
              .L100:
              movq 16(%rsp), %rdx
              movq 1328(%rdx), %rax
              movl 8(%rax), %edx
              movq (%rax), %rcx
              mov %edx, %ebx
              incl %edx
              movl $0, (%rcx,%rbx,4)
              movl 16(%rax), %r13d
              movl %edx, 8(%rax)
              testl %r13d, %r13d
              jne .L117

              <one more block similar to the ones above goes here>


              Now for what I replaced it with

              Changed C code

              BEGIN_BATCH_NO_AUTOSTATE(6);
              R600_OUT_BATCHX6(CP_PACKET3(R600_IT_SET_CTL_CONST, 1),
              /* R600_OUT_BATCH( */ mmSQ_VTX_BASE_VTX_LOC - ASIC_CTL_CONST_BASE_INDEX,
              /* R600_OUT_BATCH( */ 0,

              /* R600_OUT_BATCH( */ CP_PACKET3(R600_IT_SET_CTL_CONST, 1),
              /* R600_OUT_BATCH( */ mmSQ_VTX_START_INST_LOC - ASIC_CTL_CONST_BASE_INDEX,
              /* R600_OUT_BATCH( */ 0);

              Changed Code Assembly output

              movq 16(%rsp), %rcx
              movq 1328(%rcx), %rax
              mov 8(%rax), %ecx
              movq (%rax), %rdx
              movl $-1073647872, (%rdx,%rcx,4)
              movl 8(%rax), %ecx
              movq (%rax), %rdx
              incl %ecx
              movl $0, (%rdx,%rcx,4)
              movl 8(%rax), %ecx
              movq (%rax), %rdx
              addl $2, %ecx
              movl $0, (%rdx,%rcx,4)
              movl 8(%rax), %ecx
              movq (%rax), %rdx
              addl $3, %ecx
              movl $-1073647872, (%rdx,%rcx,4)
              movl 8(%rax), %ecx
              movq (%rax), %rdx
              addl $4, %ecx
              movl $1, (%rdx,%rcx,4)
              movl 8(%rax), %ecx
              movq (%rax), %rdx
              addl $5, %ecx
              movl $0, (%rdx,%rcx,4)
              movl 16(%rax), %r9d
              addl $6, 8(%rax)
              testl %r9d, %r9d
              je .L76
              addl $6, 20(%rax)
              .L76:


              The question now is whether it is worth hiding the ugly macros in a couple of header files, and inconveniencing programmers a little while they write the main code, in order to cut execution time by more than half.
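The difference between the two listings can be sketched in C. The struct below is a simplified, hypothetical stand-in whose field names are guesses based on the offsets visible in the assembly (0, 8, 16, 20); `write_dword` and `write_6_dwords` are illustrative names, not the driver's actual macros. In the per-dword form every write bumps the index and tests the section field, and because the store goes through a `uint32_t` pointer that could alias the struct, the compiler reloads the fields each time. The batched form computes the destination once, then updates the index and tests the section field once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in; field names are assumptions, not the real struct. */
struct cmd_stream {
    uint32_t *packets;      /* dword buffer */
    unsigned  cdw;          /* dwords written so far */
    unsigned  ndw;          /* capacity in dwords */
    unsigned  section_ndw;  /* nonzero while inside a section */
    unsigned  section_cdw;  /* dwords written in the current section */
};

/* Per-dword form: one index bump and one section test per value. */
static inline void write_dword(struct cmd_stream *cs, uint32_t v)
{
    cs->packets[cs->cdw++] = v;
    if (cs->section_ndw)
        cs->section_cdw++;
}

/* Batched form: six stores, then a single index bump and section test. */
static inline void write_6_dwords(struct cmd_stream *cs,
                                  uint32_t a, uint32_t b, uint32_t c,
                                  uint32_t d, uint32_t e, uint32_t f)
{
    uint32_t *p = cs->packets + cs->cdw;

    assert(cs->cdw + 6 <= cs->ndw);
    p[0] = a; p[1] = b; p[2] = c;
    p[3] = d; p[4] = e; p[5] = f;
    cs->cdw += 6;
    if (cs->section_ndw)
        cs->section_cdw += 6;
}
```

Both forms leave the stream in the same final state, so the batched version trades call-site readability for fewer loads and branches.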


              • #28
                To be fair, some of the core developers are warming to the general idea. They just object to the 12 hacky macros that implement it.


                • #29
                  I have some problems with your patches:


                  (Ogg Theora)


                  • #30
                    Here is dmesg drm output:


                    The patches are applied against git master; a patch that disables vsync is applied too.
