Is Assembly Still Relevant To Most Linux Software?


  • in this case there isn't much difference between SSE and SSE2
    SSE2 is mostly integer operations (the MMX ops widened to 128-bit registers), plus a couple of ops for non-temporal stores and cache hints

    gcc (4.8) gives me scalar code every time
    Code:
     67e:   f3 0f 10 4f 80          movss  xmm1,DWORD PTR [rdi-0x80]
     683:   f3 0f 59 46 d0          mulss  xmm0,DWORD PTR [rsi-0x30]
     688:   f3 0f 59 4e d4          mulss  xmm1,DWORD PTR [rsi-0x2c]
     68d:   f3 0f 58 c1             addss  xmm0,xmm1
    i also tried some more specific compiler options, like
    gcc -o matrixm.o matrixm.c -shared -O3 -ftree-slp-vectorize -ffast-math -msse2

    even with hints in the C code
    Code:
    	vertex = __builtin_assume_aligned (vertex, 32);
    	matrix = __builtin_assume_aligned (matrix, 32);
    	result = __builtin_assume_aligned (result, 32);
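
    btw the vectorizer can report why it gives up; this was the gcc 4.8-era flag (newer gcc spells it -fopt-info-vec-missed):
    Code:
     gcc -O3 -ffast-math -msse2 -ftree-vectorizer-verbose=2 -c matrixm.c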
    avx, fma and xop, on the other hand, have instructions that are great for this kind of operation, like VFMADDPS and HADDPS
    the compiler does use them when possible, but in everything i tried the code stayed scalar
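    for illustration, both are reachable from C through intrinsics, without raw assembly (a sketch; needs -msse3 -mfma, and note vfmaddps proper is the AMD FMA4 spelling, while FMA3 uses vfmadd132/213/231ps):
    Code:
     #include <immintrin.h>
     
     /* the vfmadd family: a*b + c in one rounded step */
     __m128 fma_ps(__m128 a, __m128 b, __m128 c) { return _mm_fmadd_ps(a, b, c); }
     
     /* two haddps reduce 4 floats to a single horizontal sum */
     float hsum(__m128 x)
     {
     	x = _mm_hadd_ps(x, x);
     	x = _mm_hadd_ps(x, x);
     	return _mm_cvtss_f32(x);
     }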

    threading assembly code is as easy as threading C code
    in fact i think my loop would do better threaded than the C one, since a cache line is (from what i can tell) usually 64 bytes and each scalar load touches only the first 4 bytes of it
    i could be wrong if, for example, the cpu sees that and loads the whole cache line into 2 registers
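    and threading the C version really is one pragma; a sketch with OpenMP (compile with -fopenmp; the iterations are independent, so the pragma is enough):
    Code:
     /* the same 3x3 * 3x1 loop, split across cores */
     void compute_mt(const float *matrices, const float *vertices,
                     float *result, int count)
     {
     	int i;
     	#pragma omp parallel for
     	for (i = 0; i < count; i++) {
     		const float *m = matrices + 9*i;
     		const float *v = vertices + 3*i;
     		float *r = result + 3*i;
     		int j, k;
     		for (j = 0; j < 3; j++) {
     			float partial = 0.0f;
     			for (k = 0; k < 3; k++)
     				partial += v[k] * m[3*j + k];
     			r[j] = partial;
     		}
     	}
     }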

    i tried -flto now and it does give better performance
    maybe because it aligned the loop (my loop isn't aligned either; i tried aligning it now and it didn't help much)

    this is an example of sse's usefulness
    if someone wants to use the loop in a BLAS library i'll finish it to work in all cases
    it can also be used as a software fallback when opengl 3.x is not available, like on laptops
    the fun thing is i will be using this to shave a percent or two off a game; it will need a couple of lines changed though
    writing a loop here and there isn't that demanding

    debugging is done by following the flow of the program
    load -> shuffle -> load2 -> shuffle_together
    what is (and what should be) in the affected register is written down
    you can rename the labels as you wish
    i admit it's not as clear as copying parts of C code
    then again SIMD and MIMD can require different algorithms, so it's at least good to know about it
    i personally put in stores and calls to print to stdout to see what's going on
    (gdb returns the address of the problem, then you can look at the disassembly to see where the problem is)
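    for example (an illustrative session; compute is the function name from the C code, and the register and address are whatever the disassembly points at):
    Code:
     $ gdb ./a.out
     (gdb) break compute
     (gdb) run
     (gdb) disassemble
     (gdb) p $xmm0.v4_float
     (gdb) x/4f $rsi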

    i feel good documentation is the way to better quality software, not limiting all programmers to one way of doing things
    then again i do this as a hobby, so what do i know

    for software production it probably doesn't matter at all
    most people have many-core cpus, so optimizing away that couple of % doesn't matter
    then again sometimes it's useful
    like in scientific things, encryption, databases, physics on the cpu
    also in code that changes flow frequently based on the results of calculations; that is better done on the cpu

    unfortunately i can't test your loop since it returns wrong calculations, and it does only a third of the calculations necessary
    still, 177472 * 3 is 532416
    hmm, guess it's because of less cache usage
    nice try though; a similar loop is in gcc's vectorization manual

    PS i also tried icc and llvm on this site, llvm is 3.0 though
    Last edited by gens; 05-03-2013, 11:59 AM.

    Comment


    • Originally posted by gens View Post
      unfortunately i can't test your loop since it returns wrong calculations, and it does only a third of the calculations necessary
      still, 177472 * 3 is 532416
      hmm, guess it's because of less cache usage
      nice try though; a similar loop is in gcc's vectorization manual

      PS i also tried icc and llvm on this site, llvm is 3.0 though
      You're right, I mistyped: it's off by a factor of 3. So with -flto, once you multiply the computation by 3, it gets within a few percent of the assembly.
      As you pointed out in a previous post of yours, this optimization was significant; let's say 30% of your application was executing this loop. Then with your 15% speedup you make the application run about 4% (30% * 15%) faster overall. Even if -flto optimizes other functions, not only the tiny loop you've mentioned, you cannot get more than 5% overall, right?

      And one solution means hard sweating and is hard to debug, while my "solution" was not only found broken, it also did only one third of the calculations necessary in the first place.

      This said, I am still curious why you can't use glLoadMatrix (which requires 0% CPU), or snapshotting (if you have an animation or something like that), where you can cut your CPU usage many times over. Or why not use 4x4 matrices, as the compiler may then use SSE(2) to speed up the computation? Yeah, I know, too much memory, but still, memory spent for a real speedup, when your application seems so locked into huge speed requirements.

      OpenGL is supported on any OS, and glLoadMatrix is available on any graphics card on the market, and where it isn't, it is well optimized (with SSE and such) by the OS vendors for the target machine, so there is no reason for you to optimize it. It has been supported in hardware since the first S3 Savage 2000, NVidia GeForce 256 or ATI Radeon. Really history stuff, like 10 years ago.

      Isn't your design faulty, then, and the assembly "backend solution" only hides your problem instead of solving it?
      Last edited by ciplogic; 05-05-2013, 09:55 AM.

      Comment


      • Originally posted by gens View Post
        (...)
        i tried -flto now and it does give better performance
        maybe because it aligned the loop (my loop isn't aligned either; i tried aligning it now and it didn't help much)

        this is an example of sse's usefulness
        if someone wants to use the loop in a BLAS library i'll finish it to work in all cases
        it can also be used as a software fallback when opengl 3.x is not available, like on laptops
        the fun thing is i will be using this to shave a percent or two off a game; it will need a couple of lines changed though
        writing a loop here and there isn't that demanding


        PS i also tried icc and llvm on this site, llvm is 3.0 though
        Try this code:

        Code:
        #include <stdio.h>
        #include <sys/time.h>
        
        unsigned long long int rdtsc(void)
        {
           unsigned a, d;
        
           __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
        
           return ((unsigned long long)a) | (((unsigned long long)d) << 32);
        }
        
        float matrices[10000][9];
        float vertices[10000][3];
        float result[10000][3];
        
        void compute(int count ) {
        	int i,j,k;
        	float partial;
        	for( i=0;i<count;i++) {		
        		for(j = 0; j<3; j++)
        		{
        			partial = 0.0f;			
        			for(k = 0; k<3; k++)
        				partial += vertices[i][k] * matrices[i][j*3+k];
        			result[i][j] = partial;		
        		}
        	}
        }
        
        
        int main() {
        	int i;
        	int count = 10000;
        	
        	float tmp=0;
        	for( i=0; i<count*3; i++) {
        		vertices[i/3][i%3]=tmp;
        		tmp=tmp+1;
        	}
        	tmp = 0.0f;	
        	for( i=0; i<count*9; i++) {
        		matrices[i/9][i%9]=tmp;
        		tmp=tmp+1;
        	}
        	unsigned long long ts = rdtsc();
        	
        	compute( count );
        	
        	printf("elapsed ticks: %llu\n", rdtsc() - ts);
        	
        	for( i=0; i<24; i++) {
        		printf("%f ", result[i]);
        	}
        	printf("\n");
        	return 0;
        }
        I am not fully sure whether it gets auto-vectorized, but even if it isn't, the performance went up (I changed the logic a bit, so adjust accordingly if there are bugs or it doesn't generate the same numbers at the end), and I think it would be good enough even without auto-vectorization:
        Code:
         ./a.out 
        elapsed ticks: 471560
        The original timings were:
        Originally posted by ciplogic View Post
        So, I reran the tests as you've suggested:
        Code:
        $ g++ -O3 matrix_test.c matrixm.c 
        $ ./a.out 
        elapsed ticks: 745912
        And here is the kicker:
        Code:
        $ g++ -O3 -flto matrix_test.c matrixm.c 
        $ ./a.out 
        elapsed ticks: 647984
        (...)
        Can you confirm the numbers on your machine?
        So this rewrite speeds things up on my machine by 58% (745912 / 471560 ticks). I think that if you rewrite the loop similarly on your machine, the code will run faster than your assembly. I know machines differ: maybe it will run slower, or maybe you will take the assembly of the loop that is 58% faster and find an optimization that happens with the Intel compiler and doesn't happen with GCC. But the case remains: GCC, once a bug is reported, will be able to fix your loop and all other code written similarly, while your assembly will remain unoptimized, slower than the C version.

        Comment


        • a program that gives wrong results is infinitely slower than any program that gives correct ones

          in opengl you need at least opengl 3.something for shaders
          this kind of thing in games is for fallback, or for laptops that have weak gpu's
          and i chose it as a generic example; there's plenty of other math that can benefit
          encoding/decoding, encryption, everywhere you've got a loop that uses lots of cpu or memory bandwidth (compressing textures for the gpu?)
          idk, all i know is there are cases where hand-writing a loop can speed it up 28% (or more)

          scalar code cannot easily be faster than vectorized code
          SIMD does less cache messing, fewer instructions to decode overall, etc.

          unrolled loops also help.. to a point

          i've got a new AMD now, but i think intel still suffers on misaligned reads/writes (16-byte aligned is usually best)
          that would give a huge advantage to the custom loop
          maybe i'll test on the laptop some day

          PS i didn't even profile the code i wrote, so it's not reordered; that could speed it up a shade
          also i did simple packing, meaning a couple of shuffles could be removed too

          PPS data flow is FAST, till it hits the cache limit
          in this case (i didn't calculate it) i think it hits it near the end
          that explains the speedup from processing just one third
          Last edited by gens; 05-05-2013, 08:53 PM.

          Comment


          • Originally posted by gens View Post
            a program that gives wrong results is infinitely slower than any program that gives correct ones

            in opengl you need at least opengl 3.something for shaders
            this kind of thing in games is for fallback, or for laptops that have weak gpu's
            This is where you don't understand how OpenGL works. Let me rephrase: OpenGL historically is used to draw primitives on screen. Any OpenGL 1.1-compliant graphics card will accelerate, on the GPU, up to 8 lights and the geometric transformations of what you draw on screen. This is commercially known as Transform & Lighting (introduced with the GeForce 256/Savage 2000) and is also part of DirectX 7 video cards. So if, in your target problem, you have to apply many transformations to millions of vertices, you shouldn't multiply them on the CPU, but simply call glLoadMatrix (or glMultMatrix) and the driver/video card combination will do it for you. For millions of points per second, with zero CPU usage.

            Vertex/pixel shaders are small programs that can modify the standard flow of vertices/pixels with your custom processing: for vertices, you can compute the particles of your particle generator, again with little or no CPU usage, or a sea wave using a trigonometric function; or, for pixels, a pass that blurs the picture.

            But as for your problem: if you can feed the graphics card when you need the final points displayed, you don't need to multiply them on the CPU. Even more, if your video card is OpenGL 2.0 compliant (a huge number of cards support this, roughly any DirectX 8 card like a GeForce 5200+ or a Radeon 9000+, or a newer integrated Intel video card), you can load the vertices into video memory as vertex buffers, and you don't need to copy from CPU to GPU when you draw them. This is one reason why games with complex graphics run on phones, which are memory-bandwidth limited. This is, again, for your specific problem, with the twist that your final points have to be drawn on screen eventually (using OpenGL).


            Originally posted by gens View Post
            and i chose it as a generic example; there's plenty of other math that can benefit
            encoding/decoding, encryption, everywhere you've got a loop that uses lots of cpu or memory bandwidth (compressing textures for the gpu?)
            idk, all i know is there are cases where hand-writing a loop can speed it up 28% (or more)

            scalar code cannot easily be faster than vectorized code
            SIMD does less cache messing, fewer instructions to decode overall, etc.

            unrolled loops also help.. to a point

            i've got a new AMD now, but i think intel still suffers on misaligned reads/writes (16-byte aligned is usually best)
            that would give a huge advantage to the custom loop
            maybe i'll test on the laptop some day
            If you got a new AMD, most likely you have an X2, X3 or even a 6-core Phenom (not to mention the 8-core AMDs), so isn't your 30% single-threaded speedup not such a big deal? Or maybe an APU? So again, why not target those? People will buy more CPUs that support GPGPU computations or multi-core code. FWIW even phones today are dual core, and tablets will soon go to 4 cores.

            Originally posted by gens View Post
            PS i didn't even profile the code i wrote, so it's not reordered; that could speed it up a shade
            also i did simple packing, meaning a couple of shuffles could be removed too

            PPS data flow is FAST, till it hits the cache limit
            in this case (i didn't calculate it) i think it hits it near the end
            that explains the speedup from processing just one third
            Did you try the loop where the matrices are defined as two-dimensional? Was it auto-vectorized on your machine? Was it faster than your machine's assembly implementation (it was doing at least the same computation)? Can you give some numbers?

            As I'm working in C#, my C/assembly skills are not that great, and, for what it's worth, I did not know how to add your assembly file to my Linux C++ IDE, but this is for a reason. I don't write micro-benchmarks on a daily basis, and when I do run them, I run them to target a use-case. And as usual, use-cases involve many components. Many times I noticed that components run slow, and in many cases caching is much more practical than assembly: a connection to the internet to get a list of users is always way slower than any other component of your system, even one written in Python or Ruby. If that is optimized, say by caching the users for an hour, it simply means the user gets a working UI instantly after one "Loading" screen.

            This, I think, is where we differ: you think of performance for performance's sake, while I think of performance where it would otherwise hurt the user. Of course, with your approach applied extensively the user will maybe have instantly running applications, and in my case users won't have annoying pauses, and even when pauses appear I try to minimize them with some sane defaults. I think your example shows this: you look to make your one operation faster, while I scan the whole problem space for optimization opportunities. At last: you don't seem concerned with assembly as such, but with assembly as "it cannot get faster than this", and you're in denial when people get code fast enough that it no longer matters whether your assembly is faster or not. The "aligned" (or mainly non-aliased) pointers and cache-friendly layout can be arranged by the compiler with little work on your part, and if users can get within 30% of assembly speed (or even beat your original assembly, with the latest implementation), I think that proves the point that assembly is irrelevant, at least in your toy example.

            "data flow is FAST, till it hits the cache limit"
            What are you talking about? Data-flow analysis in compilers, which enables optimizations? Or that processing data is faster if it happens, say, in L1 cache? If the second, then you are certainly wrong to write your program in assembly, as a 30% speedup can all be lost once you touch L2 or L3, while if you shape your C++ code to fit well in L1, you can get bigger gains than from your ugly assembly.

            Comment


            • Originally posted by gens View Post
              a program that gives wrong results is infinitely slower than any program that gives correct ones
              (...)
              unrolled loops also help.. to a point

              i've got a new AMD now, but i think intel still suffers on misaligned reads/writes (16-byte aligned is usually best)

              So this is my "final" C code, which gets the same compiler improvements with no alignment work. Basically, "-O3 -ffast-math -flto" makes it run faster than the most aligned code I could write (see the second block of code).
              Code:
              #include <stdio.h>
              #include <stdlib.h>
              #include <memory.h>
              #include <sys/time.h>
              
               unsigned long long int rdtsc(void)
               {
                  unsigned a, d;
               
                  __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
               
                  return ((unsigned long long)a) | (((unsigned long long)d) << 32);
               }
              
              float matrices[10000][9];
              float vertices[10000][3];
              float result[10000][3];
              
              void compute(int count ) {
              	int i,j,k;
              	float partial;
              	float res[3];
              	for( i=0;i<count;i++) {
              			
              		for(j = 0; j<3; j++)
              		{
              			partial = 0.0f;			
              			for(k = 0; k<3; k++)
              				partial += vertices[i][k] * matrices[i][j*3+k];
              			res[j] = partial;
              		}
              
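               		/* note: this copies into &result (always result[0]), not &result[i];
               		   hence the three values followed by zeros in the output posted later */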
              		memcpy(&result, &res, sizeof(float)*3);
              	}
              }
              
              
              int main() {
              	int i;
              	int count = 10000;
              	
              	float tmp=0.0f;
              	for( i=0; i<count*3; i++) {
              		vertices[i/3][i%3]=tmp;
              		tmp=tmp+1;
              	}
              	tmp = 0.0f;	
              	for( i=0; i<count*9; i++) {
              		matrices[i/9][i%9]=tmp;
              		tmp=tmp+1;
              	}
              	unsigned long long ts = rdtsc();
              	
              	compute( count );
              	
              	printf("elapsed ticks: %llu\n", rdtsc() - ts);
              	
              	for( i=0; i<24; i++) {
              		printf("%f ", result[i/3][i%3]);
              	}
              	printf("\n");
              	return 0;
              }
              2nd version:
              Code:
              void compute(
              	float * __restrict matrix, 
              	float * __restrict vertex, 
              	float * __restrict result, 
              	int count ) {
              	int i,j,k;
              
              	float *m = __builtin_assume_aligned(matrix, 16);
              	float *v = __builtin_assume_aligned(vertex, 16);
              	float *r = __builtin_assume_aligned(result, 16);
              
              	for( i=0;i<count;i++) {
              		
              		for(j=0;j<3;j++)
              		{
              			float accumulator = 0.0f;
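               			/* note: the next line uses m[j] where the row-major 3x3 layout needs
               			   m[j*3+k], and the store below goes through result[j] instead of r[j];
               			   this is the bug behind the wrong numbers posted later */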
              			for (k=0;k<3;k++)
              				accumulator+= m[j]*v[k];
              			result[j] = accumulator;
              		}
              		
              		m += 9;
              		r += 3;
              		v += 3;
              	}
              }
              
              #include <stdio.h>
              #include <sys/time.h>
              
               unsigned long long int rdtsc(void)
               {
                  unsigned a, d;
               
                  __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
               
                  return ((unsigned long long)a) | (((unsigned long long)d) << 32);
               }
              
              int main() {
              	float matrices[100000];
              	float vertices[100000];
              	float result[100000];
              	int i;
              	int count = 10000;
              	float *ptrmat, *ptrvert, *ptrres;
              	
              	float tmp=0;
              	for( i=0; i<count*3; i++) {
              		matrices[i]=tmp;
              		vertices[i]=tmp;
              		tmp=tmp+1;
              	}
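               	/* note: this fills only count*3 of the count*9 matrix entries;
               	   the rest of matrices[] stays uninitialized */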
              	
              	ptrmat = &matrices[0];
              	ptrvert = &vertices[0];
              	ptrres = &result[0];
              	
              	unsigned long long ts = rdtsc();
              	
              	compute( ptrmat, ptrvert, ptrres, count );
              	
              	printf("elapsed ticks: %llu\n", rdtsc() - ts);
              	
              	for( i=0; i<24; i++) {
              		printf("%f ", result[i]);
              	}
              	printf("\n");
              	return 0;
              }
              With the data structure changed to two-dimensional arrays:
              Code:
              $ ./a.out 
              elapsed ticks: 263760
              Same data structure (2nd implementation) with alignment hints and the init loops merged:
              Code:
              $ ./a.out 
              elapsed ticks: 331592
              I think that neither implementation was in fact vectorized, but they run smoking fast (I think; I don't know how to build the assembly part, using as as the tool!?), faster than the +30% speedup. Both run more than 100% faster than the original code you gave me (which was running in the 700000-ticks zone). So maybe it is time for you to improve your C skills, to get up to speed with fast C code!?

              Comment


              • Originally posted by Steph1ani1e
                Virtual machines have come a long way in speed, and with multicore now the norm, managed code can be as fast as unmanaged code, or sometimes even faster
                Fully agree with you, but you know, imagine you feel somewhat out of control if the VM doesn't optimize the code you run. I think that is where the frustration of the "assembly folks" here lies. Or maybe that it occupies more memory, or starts a bit slower. These issues can be addressed, and they have been for some time, but still, the impression that staying far away from the hardware means losing the micro-management of every instruction makes you feel bad.

                Imagine that you are the boss of the department and you cannot tell anyone how to use their precious time, but can only give directives for the whole company. This would be the C/C++ languages. Imagine then that you are the boss of a division or a subdivision; this is like writing in a VM like Java or C#. Sure, your VM may be inefficient sometimes, but many times it will do a really good job. The C++ compilers do an excellent job too.

                And some people may ask: "Why would you want to use C# if it's slower than C++?", or "Why use C++ when you can use assembly directly?"

                And I think the main reason is that the higher your level in the department, the bigger the probability that you can make changes in the world. If you are just sweeping and making every small table clean, you cannot change the entire company, maybe just the department you're working in; but if you manage a division, your division can give slightly worse service at the micro-management level while offering really valuable things: it can notify you as a customer with an SMS, it can arrange everything to meet you when you need it, etc. These changes can be made if you are in control. Optimizing so that an email arrives 10 seconds faster (in 30 seconds instead of 40) means nothing to the user, but the power to decide to send the email in the first place is really crucial.

                Comment


                • i compiled your code, it's still scalar
                  it only seems a few segments are interleaved a bit
                  i don't wanna examine the resulting code further

                  Code:
                  		result[0] = matrix[0]*vertex[0] + matrix[1]*vertex[1] +matrix[2]*vertex[2];
                  		result[1] = matrix[3]*vertex[0] + matrix[4]*vertex[1] +matrix[5]*vertex[2];
                  		result[2] = matrix[6]*vertex[0] + matrix[7]*vertex[1] +matrix[8]*vertex[2];
                  is the simplest way i know of to multiply a 3x3 matrix by a 3x1 vector
                  note the vertex elements used; it can't be made into a shorter loop (at least i can't think of a way right now)
                  you can segment it into common operations, but i don't think it would help


                  glMultMatrix is 4x4 by 4x1
                  sse code for that would be just a few lines long, as there is no need to shuffle that much
                  as far as i know, for 3x3 matrices you need shaders
                  but idk that much about opengl programming


                  hmmm
                  by data flow i meant reading and writing to memory
                  when the cpu reads from memory, the data goes through the L2 cache (probably L3 too if you've got it) and through the L1 cache
                  when writing, the same thing happens
                  data goes from registers to L-whatever and, when the cpu finds time, it writes it out to RAM

                  the cpu's cache logic (the prefetcher and eviction policy; not really the MMU, which does address translation) tries to plan ahead, deciding what to keep in the cache and what to evict
                  it's no easy task when the cpu just rolls through data like an idiot, but it tries its best
                  one way we can help it is by writing directly to RAM, bypassing the cache
                  the problem is that this is a lot slower than writing to cache, but it's still faster than writing to a full cache
                  (almost the same goes for reading)
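                  that's what the sse non-temporal stores (movntps) are for; a minimal sketch, assuming 16-byte-aligned pointers and a length that's a multiple of 4 floats:
                  Code:
                  #include <xmmintrin.h>
                  
                  /* stream src to dst, bypassing the cache on the store side */
                  void copy_nt(float *dst, const float *src, int nfloats)
                  {
                  	int i;
                  	for (i = 0; i < nfloats; i += 4)
                  		_mm_stream_ps(dst + i, _mm_load_ps(src + i));	/* movntps */
                  	_mm_sfence();	/* make the streaming stores visible before later reads */
                  }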

                  i'm used to linux having lots of programs to measure these kinds of things
                  like perf, which can count cache misses
                  or cachebench, to see the limits of... well, the cache
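                  e.g. counting misses for the test program:
                  Code:
                  $ perf stat -e cache-references,cache-misses ./a.out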

                  btw why didn't you say so
                  here's a windows (64-bit) version of the loop
                  hope it works, i didn't test it at all


                  on a more personal note:
                  i don't like OO languages, but i can see why they are useful
                  i really don't like the "the cpu is fast enough for slow code" mentality, but i like firefox (btw, you need yasm to compile firefox)
                  and i won't tell other people "you have to program in C"
                  and it bothers me how people talk about C++/C#/whatever being better
                  it's all good, but it's all different in basic mentality
                  btw i like C more than other languages cuz it's a "portable assembler" and thus the closest to writing the machine code yourself
                  (most other languages are far more abstracted from machine code)

                  for the end, a quote:

                  Software efficiency halves every 18 months, compensating Moore’s Law.
                  — May’s Law

                  taking into account that Moore's law is no longer as valid as it used to be, the future is slow
                  Last edited by gens; 05-07-2013, 09:51 AM.

                  Comment


                  • Originally posted by gens View Post
                    i compiled your code, it's still scalar
                    it only seems a few segments are interleaved a bit
                    i don't wanna examine the resulting code further
                    So for you SSE (or any parallel code) is for the sake of parallelism. The scalar code is faster than the SSE code you wrote, so why use the SSE code in the first place!? Did I miss something?

                    Originally posted by gens View Post
                    Code:
                    		result[0] = matrix[0]*vertex[0] + matrix[1]*vertex[1] +matrix[2]*vertex[2];
                    		result[1] = matrix[3]*vertex[0] + matrix[4]*vertex[1] +matrix[5]*vertex[2];
                    		result[2] = matrix[6]*vertex[0] + matrix[7]*vertex[1] +matrix[8]*vertex[2];
                    is the simplest way i know of to multiply a 3x3 matrix by a 3x1 vector
                    note the vertex elements used; it can't be made into a shorter loop (at least i can't think of a way right now)
                    you can segment it into common operations, but i don't think it would help


                    glMultMatrix is 4x4 by 4x1
                    sse code for that would be just a few lines long, as there is no need to shuffle that much
                    as far as i know, for 3x3 matrices you need shaders
                    but idk that much about opengl programming
                    So let me clarify: you seem to talk only about glLoadMatrix/glMultMatrix not matching your matrix size... but it seems you never used them. glLoadMatrix is used BEFORE you display something. So, let's say you have 10k points and ONE matrix to multiply them by. To display them you would write something like the following; only if you don't display them do you have to write silly C loops or assembly:
                    Code:
                    glLoadMatrix(a4x4MatrixOfYour3x3Matrix); 
                    drawGl(yourPoints);
                    If drawGl happens to reference vertices kept in the video card as a VBO (vertex buffer object), these two operations take basically 0% CPU (you have to build the 4x4 version of your 3x3 matrix once).

                    You can't use glLoadMatrix to do your computation though, and you can't use shaders for this either; you would need CUDA or OpenCL, which is another talk altogether. Anyway, as far as I understand, a real program working with this many points will most likely display them eventually, so there is no point in making weird computations just to show that assembly is faster than C, instead of using what people do have on their video card, including the glLoadMatrix call.
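                    For reference, the one-time 3x3-to-4x4 conversion is small (a sketch; glLoadMatrixf takes 16 floats in column-major order):
                    Code:
                    #include <GL/gl.h>
                    
                    /* embed a row-major 3x3 matrix into the column-major 4x4 that GL expects */
                    void load_3x3(const float m3[9])
                    {
                    	GLfloat m4[16] = {
                    		m3[0], m3[3], m3[6], 0.0f,	/* column 0 */
                    		m3[1], m3[4], m3[7], 0.0f,	/* column 1 */
                    		m3[2], m3[5], m3[8], 0.0f,	/* column 2 */
                    		0.0f,  0.0f,  0.0f,  1.0f	/* column 3 */
                    	};
                    	glMatrixMode(GL_MODELVIEW);
                    	glLoadMatrixf(m4);
                    }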

                    on a more personal note:
                    i don't like OO languages, but i can see why they are useful
                    i really don't like the "the cpu is fast enough for slow code" mentality, but i like firefox (btw, you need yasm to compile firefox)
                    and i won't tell other people "you have to program in C"
                    and it bothers me how people talk about C++/C#/whatever being better
                    it's all good, but it's all different in basic mentality
                    btw i like C more than other languages cuz it's a "portable assembler" and thus the closest to writing the machine code yourself
                    (most other languages are far more abstracted from machine code)

                    for the end, a quote:

                    Software efficiency halves every 18 months, compensating Moore's Law.
                    — May's Law

                    taking into account that Moore's law is no longer as valid as it used to be, the future is slow
                    You seem to not like "C++/C#/whatever" because C is a portable assembler and is fast, but what you say makes literally no sense. Yes, there is a mentality of inefficiency inside VMs and higher-level languages, but that is not a reason to target performance alone. What about buffer overflows? Or a NullPointerException (or the "Invalid Read At Address 0x0000000c" kind of stuff): don't you want the runtime to be able to recover from these errors?

                    At last: what stops you writing fast code in C++? All else being equal, C++ is faster than C (as was discussed earlier by Google experts, and in this topic too): you have access to assembly and all the stuff C has, plus templates that can precompute many things at compile time and allow aggressive inlining.
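                    A sketch of that compile-time precomputation (classic template unrolling; dotN<3> expands into three multiply-adds with no loop at runtime):
                    Code:
                    #include <cstdio>
                    
                    // recursive template: dotN<N> unrolls to N multiply-adds at compile time
                    template <int N>
                    inline float dotN(const float *a, const float *b)
                    {
                    	return a[N - 1] * b[N - 1] + dotN<N - 1>(a, b);
                    }
                    template <>
                    inline float dotN<0>(const float *, const float *) { return 0.0f; }
                    
                    int main()
                    {
                    	const float m[3] = {1, 2, 3}, v[3] = {4, 5, 6};
                    	std::printf("%f\n", dotN<3>(m, v));	// 1*4 + 2*5 + 3*6 = 32
                    	return 0;
                    }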

                    What stops you writing fast code in C#? There are game engines written in C# that run well on my years-old phone.

                    I know both C++ and C# and I cannot say that language speed has halved every 18 months (or every 2 years, for that matter). Half of the reason is that, even though software bloat has increased many times, the slow items are still the rotating disk drives, the internet, the CD, and so on, which have huge latencies (if you have an SSD instead of a rotating disk, performance should be many times better). The Pentium 3/XP experience of 2001 was in many ways much worse than what you get today with Windows 7 on an i7-3K CPU. Maybe they boot in the same time, but Visual Studio is much more responsive (or Eclipse, or pick your tool), the IDE and the tools give you much more relevant information, anti-aliased fonts, symbol search across the whole solution, many times more screen information/resolution.

                    If you analyze specific language features, let's say C# LINQ, yes, I can agree that it is maybe 15% slower than iterating with the most optimized form of your loop, but the construct makes it hard (or impossible) to make common mistakes. If you point out that the "dynamic" keyword in C# is 10x slower than a virtual call, I also agree, but (and there is always a but) the code written before "dynamic" to get the same functionality was much more error-prone (like assembly compared to C#'s default style), and reflection code was really ugly: to maintain, to understand, and many times slower than dynamic.

                    At last, why do you care so much about ticks? When you run an ls command, do you care how fast ls parses the directory, or how fast it gives you the answer? By which I mean: ls can be optimized to be light on CPU, or optimized to be light on disk accesses, and the second can be faster than the first even if it occupies much more memory/CPU, as it can keep a local cache of each inode and its associated data.

                    You say you like Firefox, but also look at the following facts about FF:
                    - it is written in C++/COM-like coding (they are trying to remove XPCOM completely, but not C++), so it is very OOP
                    - the code management solution is Hg, which is written in Python
                    - it uses PGO (profile-guided optimization) and they tune their C++ code in C++ optimization styles, so there is virtually no assembly in the whole Firefox codebase
                    - it pushes JavaScript hard, and asm.js is a well-behaved subset of fast JS; they bet on the power of compilers like LLVM for big speedups, not on tuning assembly
                    - it uses SQLite everywhere (for history, bookmarks, etc.), and the way the slowdowns get fixed is by making the code multithreaded
                    - most of the speedups in the upcoming Aurora theme are based on GPU profiling
                    - IonMonkey compiles the bytecode to native code off-thread

                    Do you see the common themes between what I said before and real software? GPU and multi-threading are the main ways Firefox has optimized for speed in the last few years. Why not post on the FF forums that they should use assembly everywhere? Though I don't think you will have any fans.

                    Comment


                    • Originally posted by gens View Post
                      and it bothers me how people talk about C++/C#/whatever being better
                      it's all good, but it's all different in basic mentality
                      btw i like C more than other languages cuz it's a "portable assembler" and thus the closest to writing the machine code yourself
                      (most other languages are far more abstracted from machine code)
                      C++ is as fast as C on the C subset, and faster than C on the C++ subset.
                      The only things one can dislike about C++ compared to C are:
                      - the language can be too complicated for limited hardware (embedded, GPU) and for very low-level work (OS kernels)
                      - the language does not enforce any coding style, which can be a pain for a project without strong leadership

                      Originally posted by gens View Post
                      for the end, a quote:

                      Software efficiency halves every 18 months, compensating Moore's Law.
                      — May's Law

                      taking into account that Moore's law is no longer as valid as it used to be, the future is slow
                      There's a reason software efficiency compensates for computation speed: there is one thing that doesn't change, the users.
                      There is no need to double the fps of games every year, or halve the time needed for a pop-up to appear.
                      On the other hand, you can have twice as many games, or games twice as good (because you can produce game assets more easily if you don't have to optimize them as much, and that is what takes the time in a game), or twice as cheap. Your pick! Or applications that are more beautiful, or that do more, etc. etc. And all that still at the same acceptable speed: the user's speed.

                      Comment


                      • I've been told by some embedded developers that they use C rather than C++ because with it they can predict and account for every byte of memory that gets used and allocated: allocating an array plus a size variable is absolutely predictable in terms of memory consumption; std::vector (et al.) isn't.

                        Comment


                        • Originally posted by archibald View Post
                          I've been told by some embedded developers that they use C rather than C++ because with it they can predict and account for every byte of memory that gets used and allocated: allocating an array plus a size variable is absolutely predictable in terms of memory consumption; std::vector (et al.) isn't.
                          And I can add another two reasons: as there are no abstractions (like std::vector) by default, people rely on "circular buffers" and "memory pools", fast "abstractions" built in code. C++ code is also a little larger, so CPUs with small caches can pay a small performance penalty. C++ exceptions (though I never heard of them being used in embedded code) are also not performance-predictable.

                          But people like gens don't argue about memory usage; they argue about performance, and in a weird way, as if assembly always translates into performance. Even if theoretically that is the case, practically it is not true (anymore). Compilers are mature, they can already optimize code better than at least gens' assembly output, and they will only get better as time goes on.

                          Similarly, all the reasons people give for embedded coding rules (like accounting for every byte of memory) can be satisfied simply by not using the STL (so C++ itself is not the issue) and by using templates to avoid wrong macro expansions. There is no reason memory pools or arenas cannot be used in C++; Firefox uses JavaScript heap compartments, which are very similar to memory pools. Memory pools are also accessible from C# (of course, the heap works a bit differently there), and nothing stops anyone from writing a circular buffer in a high-level language like Java.

                          Memory is an important resource, it is important to optimize for it, and I fully agree that C can achieve this better than Java. But performance in those terms is not assembly's main strength today.

                          I will call on gens once again to test the assembly code against the C code with the optimization flags that were described, and to post the numbers publicly. Or to name a real use-case where his multiplication makes sense and cannot be optimized with caching or by putting the computation on the video card. He is playing a game where he makes the rules, he plays, and he wins.

                          This is plain silly, and in the process he commits the equivocation that if his assembly code is faster than his C code, then assembly is faster than C.

                          Comment


                          • why are you so passionate about this?
                            btw sorry, i forgot to tell you it's in fasm, but i gave you 64-bit linux .o files to link against

                            here are the numbers

                            asm
                            Code:
                            elapsed ticks: 921086
                            5.000000 14.000000 23.000000 122.000000 158.000000 194.000000 401.000000 464.000000 527.000000 842.000000 932.000000 1022.000000 1445.000000 1562.000000 1679.000000 2210.000000 2354.000000 2498.000000 3137.000000 3308.000000 3479.000000 4226.000000 4424.000000 4622.000000
                            gcc, (almost) no sse
                            Code:
                            elapsed ticks: 1178980
                            5.000000 14.000000 23.000000 122.000000 158.000000 194.000000 401.000000 464.000000 527.000000 842.000000 932.000000 1022.000000 1445.000000 1562.000000 1679.000000 2210.000000 2354.000000 2498.000000 3137.000000 3308.000000 3479.000000 4226.000000 4424.000000 4622.000000
                            yours
                            Code:
                            elapsed ticks: 357943
                            8098740224.000000 8099010560.000000 8099280384.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
                            see the problem?

                            and for testing i chose to have separate .o files so rdtsc doesn't get optimized away and the results stay somewhat accurate
                            in assembly i'd use cpuid to get a bit more accurate results
                            i've read that in C it is better to use gettimeofday() on modern, Hz-scaling cpus
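                            a sketch of the gettimeofday() variant (wall-clock microseconds, so it is immune to frequency scaling, though it measures time rather than cycles):
                            Code:
                            #include <sys/time.h>
                            
                            static unsigned long long usec_now(void)
                            {
                            	struct timeval tv;
                            	gettimeofday(&tv, NULL);
                            	return (unsigned long long)tv.tv_sec * 1000000ULL + tv.tv_usec;
                            }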

                            and no
                            calling opengl is not a 0% cpu thing
                            data has to be copied to the gpu through PCI-e (or whatever)
                            and that is a big part of this operation: reading and writing RAM
                            during that time the cpu can't do much of anything else anyway
                            maybe on the PS4, with its one gddr pool for everything, it would be, but not on normal household computers

                            and to remind you, i said something like "when you can't get more performance out of your code, then you can write the loops that would benefit from assembly in assembly"

                            http://git.xonotic.org/?p=xonotic/da...322fa2;hb=HEAD
                            what sse is good for

                            http://sourceware.org/git/?p=glibc.g...4caa19a7f88eba
                            when a compiler writes something like that, then it's good


                            @ erendorn

                            true, software is tailored for the user, of course
                            but now, years after windows 98, with everything that has happened to computers, do i still need office to take a few seconds for a query?

                            the funny thing is, GAS assembly syntax was made that way 'cuz at the time it was too cpu-consuming to make the assembler check every label against a register



                            another thing i noticed is that common operations get... backported, i guess... into the instruction set of a cpu
                            like matrix things in 3DNow! and later in SSE and others
                            and MMX for whatever math they do for pictures

                            that reminds me: adjusting mathematical algorithms to a cpu is, from what i see, easier to do directly in cpu instructions

                            just to underline:
                            THIS IS ABOUT EXTERMINATING ALL ASSEMBLY THAT CAN BE EXTERMINATED
                            and probably all assembly in OSS is there for performance reasons, meaning a few loops here and there, not whole unmaintainable programs

                            so i'm not talking about C++, java, Python or any of the many languages out there
                            i'm talking about the most-used simple loops overall

                            before i forget,
                            object-oriented programming adds memory and cpu overhead
                            it's usually just a bit of performance, not much, but it's there
                            on the other hand, virtual machines are in a world of their own, but they can (in theory) run on anything that has the required binary

                            tradeoffs all around
                            it just bothers me why people hate assembly
                            it's a different way of programming; maybe that seems hard to understand
                            the first time i programmed in C it was weird compared to the QB i knew

                            PS when compilers get as good as humans, i'll change my mind to "assembly is a good way to learn about cpu's and their bottlenecks"
                            Last edited by gens; 05-07-2013, 08:08 PM.

                            Comment


                            • Originally posted by gens View Post
                              why are you so passionate about this?
                              Because one out of every 3-4 things you say is false, and adding factoids or false claims doesn't prove a point. For example this quote:
                              Originally posted by gens View Post
                              and no
                              calling opengl is not a 0% cpu thing
                              data has to be copied to the gpu through PCI-e (or whatever)
                              and that is a big part of this operation: reading and writing RAM
                              during that time the cpu can't do much of anything else anyway
                              maybe on the PS4, with its one gddr pool for everything, it would be, but not on normal household computers
                              In fact, it is a 0% CPU thing if it is used as I've said before: it is executed in the driver, and the call itself requires almost no CPU. Look at the glDrawArrays function. But not only this:
                              Vertex Buffer Objects are by design free of memory copying:
                              A Vertex Buffer Object (VBO) is an OpenGL feature that provides methods for uploading data (vertex, normal vector, color, etc.) to the video device for non-immediate-mode rendering. VBOs offer substantial performance gains over immediate-mode rendering primarily because the data resides in the video device's memory rather than in system memory, so it can be rendered directly by the video device.
                              Before OpenGL 1.5 there were compiled arrays, which again imply almost no CPU.
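                              A sketch of that pattern (OpenGL 1.5-style calls; older headers may need glext.h or an extension loader for glGenBuffers):
                              Code:
                               #include <GL/gl.h>
                               
                               GLuint upload_points(const float *xyz, int count)	/* once */
                               {
                               	GLuint vbo;
                               	glGenBuffers(1, &vbo);
                               	glBindBuffer(GL_ARRAY_BUFFER, vbo);
                               	glBufferData(GL_ARRAY_BUFFER, count * 3 * sizeof(float), xyz, GL_STATIC_DRAW);
                               	return vbo;
                               }
                               
                               void draw_points(GLuint vbo, int count)	/* per frame, no CPU-side copies */
                               {
                               	glBindBuffer(GL_ARRAY_BUFFER, vbo);
                               	glEnableClientState(GL_VERTEX_ARRAY);
                               	glVertexPointer(3, GL_FLOAT, 0, (const void *)0);
                               	glDrawArrays(GL_POINTS, 0, count);
                               }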

                              Originally posted by gens View Post
                              and to remind you, i said something like "when you can't get more performance out of your code, then you can write the loops that would benefit from assembly in assembly"

                              http://git.xonotic.org/?p=xonotic/da...322fa2;hb=HEAD
                              what sse is good for

                              http://sourceware.org/git/?p=glibc.g...4caa19a7f88eba
                              when a compiler writes something like that, then it's good
                              ... and when a compiler doesn't write something like that, it writes good enough code. This is why the Xonotic codebase has only macros but no assembly; did I read it right?

                              Originally posted by gens View Post
                              before i forget,
                              object-oriented programming adds memory and cpu overhead
                              it's usually just a bit of performance, not much, but it's there
                              on the other hand, virtual machines are in a world of their own, but they can (in theory) run on anything that has the required binary
                              I wrote a C++ version of the code and it behaves really well (faster than your original code, a bit slower than the __restrict kind of coding):

                              Code:
                              class Vertex2d
                              {
                              public:
                              	float vertices[3];
                              };
                              class Matrix2d
                              {
                              public:
                              	float matrix[9];
                              	void multiplyVertices(const Vertex2d & src, Vertex2d & dest) const;
                              };
                              
                              void Matrix2d::multiplyVertices(const Vertex2d & src, Vertex2d & dest) const
                              {
                              	for(int j=0;j<3;j++) {
                              		float accumulator = 0.0f;
                              		for (int k=0;k<3;k++)
                              			accumulator+= matrix[j*3+k]*src.vertices[k];
                              		dest.vertices[j] = accumulator;
                              	}
                              }
                              
                              void compute(
                              	Matrix2d * matrix, 
                              	Vertex2d * vertex, 
                              	Vertex2d * result, 
                              	int count ) {
                              	for(int i=0;i<count;i++) {		
                              		matrix->multiplyVertices(*vertex, *result);
                              		matrix ++;
                              		result ++;
                              		vertex ++;
                              	}
                              }
                              
                              #include <stdio.h>
                              #include <sys/time.h>
                              
                               unsigned long long int rdtsc(void)
                               {
                                  unsigned a, d;
                               
                                  __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
                               
                                  return ((unsigned long long)a) | (((unsigned long long)d) << 32);
                               }
                              
                              int main() {
                              	Matrix2d matrices[10000];
                              	Vertex2d vertices[100000];
                              	Vertex2d result[100000];
                              	int i;
                              	int count = 10000;
                              	Matrix2d *ptrmat;
                              	Vertex2d *ptrvert, *ptrres;
                              	
                              	float tmp=0.0f;
                              	for( i=0; i<count*3; i++) {
                              		vertices[i/3].vertices[i%3]=tmp;
                              		tmp=tmp+1;
                              	}
                              	tmp = 0.0f;	
                              	for( i=0; i<count*9; i++) {
                              		matrices[i/9].matrix[i%9]=tmp;
                              		tmp=tmp+1;
                              	}
                              	ptrmat = &matrices[0];
                              	ptrvert = &vertices[0];
                              	ptrres = &result[0];
                              	
                              	unsigned long long ts = rdtsc();
                              	
                              	compute( ptrmat, ptrvert, ptrres, count );
                              	
                              	printf("elapsed ticks: %llu\n", rdtsc() - ts);
                              	
                              	for( i=0; i<24; i++) {
                              		printf("%f ", result[i/3].vertices[i%3]);
                              	}
                              	printf("\n");
                              	return 0;
                              }
                              So C++ is not that slow either. I could bother to precompute all (or most) of the computation at compile time, as described here, but I will not. I would rather prefer the glLoadMatrix solution, as it is already written for me (most of the code, anyway).

                              Which extra memory/performance does C++ imply? The virtual call? It is an opt-in feature, so nothing stops you from not using it. I can name ways in which, even if you want to keep a low-level C-"blend" codebase, C++ gives you advantages:
                              - const methods and references, from which the compiler can know to optimize away many computations. If you work with reference counting (smart pointers), passing a constant reference to them does not add one reference and then destroy it
                              - templates are ugly to write, as assembly is, but once you fix the template errors, the result is a better, more restricted version of your C++ code, not the reverse (a more error-prone version, as when it is written in assembly). So many macros are safer written as templates, and the compiler can mostly inline them for small operations
                              - can you make a benchmark where a virtual call is slower than a call through a function pointer? In my understanding both run as fast, but maybe I am wrong
                              - move semantics in C++11 can remove some copies during object creation (as specified), so the argument that RAII gives inefficient code because "a lot of copies" are made doesn't hold much water today. In fact GCC was optimizing away some copies much earlier, and you have to ask it *not* to optimize those copies away

                              tradeoffs all around
                              it just bothers me why people hate assembly
                              it's a different way of programming; maybe that seems hard to understand
                              the first time i programmed in C it was weird compared to the QB i knew

                              PS when compilers get as good as humans, i'll change my mind to "assembly is a good way to learn about cpu's and their bottlenecks"
                              Yes, but the CPU is not the only bottleneck, and I think this is at least my problem with assembly. Assembly is not up to date except for a small elite of programmers working on compilers, or maybe in the scientific world, and this is why compilers do not "auto-vectorize" your code for you. Assembly doesn't speed things up the way OpenMP does, or OpenGL/OpenCL does. Assembly doesn't speed up your disk/network/database access, and when you improve all those other components, assembly is the odd man out.

                              As I've told you, I did assembly, but in another era, mostly when SSE (1) was just introduced; I wasn't writing SSE code (as you don't write AVX either), so I know what it is all about. I knew 32-bit assembly fairly well (and 16-bit, as I studied it in university), and if I waited 3-4 years, in many cases I would get my tiny optimizations for free, or, if not, there was an assembly-written library on the internet that I could use in my project.

                              In fact I don't know QB (QuickBasic!?), but even the last version of VB (6, not the .Net ones) had a compiler which, in an alternate history where it had been nourished further, could have kept VB around to this day for its good performance.

                              Should assembly be scrapped? Of course not! I think assembly should exist today (for performance reasons; atomics are another matter) just as intrinsic primitives, hopefully as generic as Mono.SIMD. There is no point in writing raw assembly today: most of the time you can write C++ that gets close enough to assembly performance, and you can write Java or C# code (which, if you ignore the startup time, gets really close to this assembly too) without worrying about why a loop is not SIMDed, and worrying instead about how to optimize your application.
                              Last edited by ciplogic; 05-08-2013, 02:03 AM.

                              Comment


                              • intrinsics are as portable as any assembly
                                the thing that bothered me when writing them was that i didn't know exactly how many registers i had left
                                the good thing about them is that the compiler reorders the instructions
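                                for comparison, here's the same 3x3 * 3x1 step as sse3 intrinsics (a sketch; the 4-wide loads read one float past each row, so the arrays need a float of padding at the end, and it needs -msse3):
                                Code:
                                #include <pmmintrin.h>	/* SSE3 */
                                
                                /* one 3x3 * 3x1 multiply; lane 3 of each product is masked to zero */
                                static void mul3x3(const float *m, const float *v, float *r)
                                {
                                	const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
                                	__m128 vert = _mm_loadu_ps(v);	/* v0 v1 v2 ? */
                                	int j;
                                	for (j = 0; j < 3; j++) {
                                		__m128 row  = _mm_loadu_ps(m + 3*j);	/* mj0 mj1 mj2 ? */
                                		__m128 prod = _mm_and_ps(_mm_mul_ps(row, vert), mask);
                                		prod = _mm_hadd_ps(prod, prod);	/* haddps */
                                		prod = _mm_hadd_ps(prod, prod);	/* full sum in lane 0 */
                                		r[j] = _mm_cvtss_f32(prod);
                                	}
                                }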

                                and again, you can thread assembly

                                i'll make a proper loop with avx when there's time; this sse one was not as optimized as it could be

                                i'm as entitled to talk about C++ as you are about assembly
                                so here is a paper on OO vs procedural programming
                                http://scholar.lib.vt.edu/theses/ava...ted/thesis.pdf

                                Comment
