Is Assembly Still Relevant To Most Linux Software?


  • in this case there isn't much difference between SSE and SSE2
    SSE2 mostly adds integer operations (the MMX ops widened to 128-bit registers) and a couple of instructions for non-temporal stores and cache hints

    gcc(4.8) gives me scalar code all the time
    Code:
     67e:   f3 0f 10 4f 80          movss  xmm1,DWORD PTR [rdi-0x80]
     683:   f3 0f 59 46 d0          mulss  xmm0,DWORD PTR [rsi-0x30]
     688:   f3 0f 59 4e d4          mulss  xmm1,DWORD PTR [rsi-0x2c]
     68d:   f3 0f 58 c1             addss  xmm0,xmm1
    I also tried some more specific compiler options, like
    gcc -o matrixm.o matrixm.c -shared -O3 -ftree-slp-vectorize -ffast-math -msse2

    even with hints in C code
    Code:
    	vertex = __builtin_assume_aligned (vertex, 32);
    	matrix = __builtin_assume_aligned (matrix, 32);
    	result = __builtin_assume_aligned (result, 32);
    AVX, FMA and XOP, on the other hand, have instructions that are great for this kind of operation, like VFMADDPS and HADDPS
    the compiler does use them when possible, but in everything I tried the generated code is still scalar

    threading assembly code is as easy as threading C code
    in fact I think my loop would do better threaded than the C one, as a cache line is (from what I can tell) usually 64 bytes and the scalar code only uses the first 4 bytes of each load
    I could be wrong if, for example, the CPU notices that and serves the neighbouring loads from the already-fetched line

    I tried -flto now and it does give better performance
    maybe because it aligned the loop (my loop isn't aligned either; I tried aligning it now and it didn't help much)

    this is an example of SSE's usefulness
    if someone wants to use the loop in a BLAS library, I'll finish it to work in all cases
    it can also be used as a software fallback when OpenGL 3.x is not available, like on laptops
    the fun thing is I will be using this to shave a percent or two off a game, though I'll need to change a couple of lines
    writing a loop here and there isn't that demanding

    debugging is done by following the flow of the program
    load -> shuffle -> load2 -> shuffle_together
    what is (and should be) in each affected register is written down
    you can rename the labels as you wish
    I admit it's not as clear as copying parts of C code
    then again SIMD and MIMD can require different algorithms, so it's at least good to know about them
    I personally insert stores and calls that print to stdout to see what's going on
    (gdb gives you the address of the problem; you can then look at the disassembly to see where it is)

    I feel good documentation is the way to better-quality software, not limiting all programmers to one way of doing things
    then again I do this as a hobby, so what do I know

    for software production it probably doesn't matter at all
    most people have many-core CPUs, so optimizing away those couple of percent usually doesn't matter
    then again, sometimes it's useful
    like in scientific computing, encryption, databases, physics on the CPU
    also in code that frequently changes control flow based on the results of calculations, which is better done on the CPU

    unfortunately I can't test your loop since it returns wrong results, and it does only a third of the necessary calculations
    still, 177472 * 3 is 532416
    hmm, I guess the speedup is because of lower cache usage
    nice try though; a similar loop is in GCC's vectorization manual

    PS I also tried icc and llvm on this site, though the LLVM there is only 3.0
    Last edited by gens; 03 May 2013, 11:59 AM.



    • Originally posted by gens View Post
      unfortunately I can't test your loop since it returns wrong results, and it does only a third of the necessary calculations
      still, 177472 * 3 is 532416
      hmm, I guess the speedup is because of lower cache usage
      nice try though; a similar loop is in GCC's vectorization manual

      PS I also tried icc and llvm on this site, though the LLVM there is only 3.0
      You're right, I mistyped, and you're completely right: it's off by 3 times. So with -flto, if you multiply the computation by 3, it gets within a very few percent of the assembly.
      As you pointed out in a previous post of yours, this optimization was significant; let's say 30% of your application's time was spent in this loop. With your 15% speedup, you make your application run about 4% (30% * 15%) faster overall. And even if -flto optimizes other functions, not only the tiny loop you've mentioned, you cannot get more than roughly 5% overall, right?

      And one solution is hard-sweated and hard to debug, while my "solution" was not only found to be broken, it also did only one third of the calculations that would be necessary in the first place.

      That said, I am still curious why you can't use glLoadMatrix (which requires 0% CPU), or snapshotting (if you have an animation or something like that), where you can cut your CPU usage many times over. Or why not use 4x4 matrices, since the compiler may then use SSE(2) to speed up the computation? Yes, I know, more memory, but it is memory spent on a real speedup, when your application seems so locked into heavy speed requirements.

      OpenGL is supported on every OS, and glLoadMatrix is accelerated by virtually any graphics card on the market; where it isn't, it is well optimized (with SSE and such) by the OS vendors for the target machine, so there is no reason for you to optimize it yourself. It has been supported in hardware since the first S3 Savage 2000, NVidia GeForce 256 and ATI Radeon. Really ancient history, like 10 years ago.

      Isn't your design at fault, with the assembly "backend solution" only trying to hide the problem instead of solving it?
      Last edited by ciplogic; 05 May 2013, 09:55 AM.



      • Originally posted by gens View Post
        (...)
        I tried -flto now and it does give better performance
        maybe because it aligned the loop (my loop isn't aligned either; I tried aligning it now and it didn't help much)

        this is an example of SSE's usefulness
        if someone wants to use the loop in a BLAS library, I'll finish it to work in all cases
        it can also be used as a software fallback when OpenGL 3.x is not available, like on laptops
        the fun thing is I will be using this to shave a percent or two off a game, though I'll need to change a couple of lines
        writing a loop here and there isn't that demanding


        PS I also tried icc and llvm on this site, though the LLVM there is only 3.0
        Try this code:

        Code:
        #include <stdio.h>
        #include <sys/time.h>
        
        unsigned long long int rdtsc(void)
        {
           unsigned a, d;
        
           __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
        
           return ((unsigned long long)a) | (((unsigned long long)d) << 32);
        }
        
        float matrices[10000][9];
        float vertices[10000][3];
        float result[10000][3];
        
        void compute(int count ) {
        	int i,j,k;
        	float partial;
        	for( i=0;i<count;i++) {		
        		for(j = 0; j<3; j++)
        		{
        			partial = 0.0f;			
        			for(k = 0; k<3; k++)
        				partial += vertices[i][k] * matrices[i][j*3+k];
        			result[i][j] = partial;		
        		}
        	}
        }
        
        
        int main() {
        	int i;
        	int count = 10000;
        	
        	float tmp=0;
        	for( i=0; i<count*3; i++) {
        		vertices[i/3][i%3]=tmp;
        		tmp=tmp+1;
        	}
        	tmp = 0.0f;	
        	for( i=0; i<count*9; i++) {
        		matrices[i/9][i%9]=tmp;
        		tmp=tmp+1;
        	}
        	unsigned long long ts = rdtsc();
        	
        	compute( count );
        	
        	printf("elapsed ticks: %llu\n", rdtsc() - ts);
        	
        	for( i=0; i<24; i++) {
        		printf("%f ", result[i/3][i%3]);
        	}
        	printf("\n");
        	return 0;
        }
        I am not fully sure, but it looks like it is auto-vectorized; even if it isn't, the performance went up (I changed the logic a bit, so adjust accordingly if there are bugs or it doesn't print the same numbers at the end) and I think it would be good enough even without auto-vectorization:
        Code:
         ./a.out 
        elapsed ticks: 471560
        The original timings were:
        Originally posted by ciplogic View Post
        So, I rerun the tests as you've suggested:
        Code:
        $ g++ -O3 matrix_test.c matrixm.c 
        $ ./a.out 
        elapsed ticks: [B]745912[/B]
        And here is the kicker:
        Code:
        $ g++ -O3 [B]-flto[/B] matrix_test.c matrixm.c 
        $ ./a.out 
        elapsed ticks: [B]647984[/B]
        (...)
        May you confirm the numbers on your machine?
        So this rewrite is a 58% speedup on my machine (745912 / 471560 ticks). I think that if you rewrite the loop similarly, the code will run faster on your machine than your assembly. I know machines differ: maybe it will run slower, or maybe you will take the assembly of the 58%-faster loop and find an optimization that the Intel compiler performs and GCC misses. But the point remains: once a bug is reported, GCC can fix your loop and all other code written similarly, while your assembly stays unoptimized, slower than the C version.



        • a program that gives wrong results is infinitely slower than any program that gives correct ones

          in OpenGL you need at least version 3.something for shaders
          this kind of thing in games is for fallback, or for laptops that have weak GPUs
          and I chose it as a generic example; there's plenty of other math that can benefit
          encoding/decoding, encryption, anywhere you have a loop that uses lots of CPU or memory bandwidth (compressing textures for the GPU?)
          I don't know; all I know is there are cases where hand-writing a loop can speed it up 28% (or more)

          scalar code cannot easily be faster than vectorized code
          SIMD causes less cache churn, fewer instructions to decode overall, etc.

          unrolled loops also help.. to a point

          I've got a new AMD now, but I think Intel still suffers on reads/writes that straddle an alignment boundary (16-byte aligned is usually best)
          that would give a huge advantage to the custom loop
          maybe I'll test on the laptop some day

          PS I didn't even profile the code I wrote, so it's not reordered; that could speed it up a shade
          also I did simple packing, meaning a couple of shuffles could be removed too

          PPS streaming data through is FAST, until it hits the cache limit
          in this case (I didn't calculate it) I think it hits the limit near the end
          that explains the speedup from processing just one third of the data
          Last edited by gens; 05 May 2013, 08:53 PM.



          • Originally posted by gens View Post
            a program that gives wrong results is infinitely slower than any program that gives correct ones

            in OpenGL you need at least version 3.something for shaders
            this kind of thing in games is for fallback, or for laptops that have weak GPUs
            This is where you don't understand how OpenGL works. Let me rephrase: historically, OpenGL is used to draw primitives on screen. Any OpenGL 1.1-compliant graphics card will accelerate up to 8 hardware lights and the geometric transformations of whatever you draw, on the GPU. This is commercially known as Transform & Lighting (introduced with the GeForce 256 / Savage 2000) and is also part of DirectX 7 video cards. So if, in your target problem, you have to apply transformations to millions of vertices, you shouldn't multiply them on the CPU; simply call glLoadMatrix (or glMultMatrix) and the driver/video card combination will do it for you. Millions of points per second, with zero CPU usage.

            Vertex/pixel shaders are small programs that modify the standard flow of vertices/pixels with your custom processing: for vertices you can compute the particles in your particle generator, again with little or no CPU usage, or a sea wave using a trigonometric function, or a post-process to blur the picture.

            As for your problem: if the final points are to be displayed, you can feed them to the graphics card and you don't need to multiply them on the CPU at all. Even more, if your video card is OpenGL 2.0 compliant (a huge number of cards support this; roughly a DirectX 8 card like a GeForce 5200+ or a Radeon 9000+, or a newer integrated Intel video card), you can load the vertices into video memory as vertex buffers, so no copies from CPU to GPU are needed when you draw them. This is one reason why games with complex graphics run on phones, which are memory-bandwidth limited. Again, this applies to your specific problem with the twist that your final points have to be drawn on screen eventually (using OpenGL).


            Originally posted by gens View Post
            and I chose it as a generic example; there's plenty of other math that can benefit
            encoding/decoding, encryption, anywhere you have a loop that uses lots of CPU or memory bandwidth (compressing textures for the GPU?)
            I don't know; all I know is there are cases where hand-writing a loop can speed it up 28% (or more)

            scalar code cannot easily be faster than vectorized code
            SIMD causes less cache churn, fewer instructions to decode overall, etc.

            unrolled loops also help.. to a point

            I've got a new AMD now, but I think Intel still suffers on reads/writes that straddle an alignment boundary (16-byte aligned is usually best)
            that would give a huge advantage to the custom loop
            maybe I'll test on the laptop some day
            If you've got a new AMD, most likely you have an X2, an X3 or even a 6-core Phenom (not to mention the 8-core AMDs), so isn't multi-threading a better deal than your 30% speedup? Or maybe an APU? So again, why not target those? People increasingly buy CPUs that support GPGPU computation or multi-core code. FWIW, even phones today have 2 cores, and tablets will soon move to 4.

            Originally posted by gens View Post
            PS I didn't even profile the code I wrote, so it's not reordered; that could speed it up a shade
            also I did simple packing, meaning a couple of shuffles could be removed too

            PPS streaming data through is FAST, until it hits the cache limit
            in this case (I didn't calculate it) I think it hits the limit near the end
            that explains the speedup from processing just one third of the data
            Did you try the loop where the matrices are defined as two-dimensional? Was it auto-vectorized on your machine? Was it faster than your machine's assembly implementation (it was doing at least the same computation)? Can you give some numbers?

            As I work in C#, my C/assembly skills are not as sharp, and for what it's worth, I did not even know how to add your assembly file to my Linux C++ IDE, but there is a reason for that. I don't write micro-benchmarks on a daily basis, and when I do run them, I run them against a use case. As usual, use cases involve many components. Many times I've noticed components running slow, and in many cases caching is much more practical than assembly: a connection to the internet to fetch a list of users is always far slower than any other component of your system, even one written in Python or Ruby. If this is optimized, say by caching the user list for an hour, the user simply gets a working UI right after a "Loading" screen.

            This, I think, is where we differ: you pursue performance for performance's sake, while I pursue performance where its absence hurts the user. Of course, with your approach applied extensively the user might have instantly running applications; with mine, users don't get annoying pauses, and when pauses do appear I try to minimize them with some sane defaults. I think your example shows this: you want your one operation to be faster, while I scan the whole problem space for optimization opportunities. Finally: you don't seem attached to assembly itself, but to the idea that "it cannot get faster than this", and you're in denial when people get code fast enough that it doesn't matter whether your assembly is faster. The alignment (or mainly non-aliased pointers) and cache behaviour can be handled by the compiler with little work on your part, and if users can get within 30% of assembly speed (or even beat your original assembly, with the latest implementation), I think that proves the point that assembly is irrelevant, at least in your toy example.

            "data flow is FAST, till it hits the cache limit"
            What are you talking about? The data-flow analysis of compilers, which enables optimizations? Or that processing data is faster if it happens, say, in the L1 cache? If it's the second, then you are certainly wrong that your program should be written in assembly, because the 30% speedup can all be lost once you touch L2 or L3, and if you write your C++ code to fit well in L1, you can get a bigger win than your ugly assembly provides.



            • Originally posted by gens View Post
              a program that gives wrong results is infinitely slower than any program that gives correct ones
              (...)
              unrolled loops also help.. to a point

              I've got a new AMD now, but I think Intel still suffers on reads/writes that straddle an alignment boundary (16-byte aligned is usually best)

              So this is my "final" C code, which gets the same compiler improvements with no explicit alignment. Built with just "-O3 -ffast-math -flto", it runs faster than the most carefully aligned code I could write (see the second block of code below)
              Code:
              #include <stdio.h>
              #include <stdlib.h>
              #include <memory.h>
              #include <sys/time.h>
              
              unsigned long long int rdtsc(void)
              {
                 unsigned a, d;
              
                 __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
              
                 return ((unsigned long long)a) | (((unsigned long long)d) << 32);
              }
              
              float matrices[10000][9];
              float vertices[10000][3];
              float result[10000][3];
              
              void compute(int count ) {
              	int i,j,k;
              	float partial;
              	float res[3];
              	for( i=0;i<count;i++) {
              			
              		for(j = 0; j<3; j++)
              		{
              			partial = 0.0f;			
              			for(k = 0; k<3; k++)
              				partial += vertices[i][k] * matrices[i][j*3+k];
              			res[j] = partial;
              		}
              
              		memcpy(&result[i], &res, sizeof(float)*3);
              	}
              }
              
              
              int main() {
              	int i;
              	int count = 10000;
              	
              	float tmp=0.0f;
              	for( i=0; i<count*3; i++) {
              		vertices[i/3][i%3]=tmp;
              		tmp=tmp+1;
              	}
              	tmp = 0.0f;	
              	for( i=0; i<count*9; i++) {
              		matrices[i/9][i%9]=tmp;
              		tmp=tmp+1;
              	}
              	unsigned long long ts = rdtsc();
              	
              	compute( count );
              	
              	printf("elapsed ticks: %llu\n", rdtsc() - ts);
              	
              	for( i=0; i<24; i++) {
              		printf("%f ", result[i/3][i%3]);
              	}
              	printf("\n");
              	return 0;
              }
              2nd version:
              Code:
              void compute(
              	float * __restrict matrix, 
              	float * __restrict vertex, 
              	float * __restrict result, 
              	int count ) {
              	int i,j,k;
              
              	float *m = __builtin_assume_aligned(matrix, 16);
              	float *v = __builtin_assume_aligned(vertex, 16);
              	float *r = __builtin_assume_aligned(result, 16);
              
              	for( i=0;i<count;i++) {
              		
              		for(j=0;j<3;j++)
              		{
              			float accumulator = 0.0f;
              			for (k=0;k<3;k++)
              				accumulator += m[j*3+k]*v[k];
              			r[j] = accumulator;
              		}
              		
              		m += 9;
              		r += 3;
              		v += 3;
              	}
              }
              
              #include <stdio.h>
              #include <sys/time.h>
              
              unsigned long long int rdtsc(void)
              {
                 unsigned a, d;
              
                 __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
              
                 return ((unsigned long long)a) | (((unsigned long long)d) << 32);
              }
              
              int main() {
              	float matrices[100000];
              	float vertices[100000];
              	float result[100000];
              	int i;
              	int count = 10000;
              	float *ptrmat, *ptrvert, *ptrres;
              	
              	float tmp=0;
              	for( i=0; i<count*3; i++) {
              		vertices[i]=tmp;
              		tmp=tmp+1;
              	}
              	tmp = 0;
              	for( i=0; i<count*9; i++) {
              		matrices[i]=tmp;
              		tmp=tmp+1;
              	}
              	
              	ptrmat = &matrices[0];
              	ptrvert = &vertices[0];
              	ptrres = &result[0];
              	
              	unsigned long long ts = rdtsc();
              	
              	compute( ptrmat, ptrvert, ptrres, count );
              	
              	printf("elapsed ticks: %llu\n", rdtsc() - ts);
              	
              	for( i=0; i<24; i++) {
              		printf("%f ", result[i]);
              	}
              	printf("\n");
              	return 0;
              }
              With the data structures changed to two-dimensional arrays:
              Code:
              $ ./a.out 
              elapsed ticks: 263760
              Same data structures (2nd implementation), with alignment hints and the loops merged:
              Code:
              $ ./a.out 
              elapsed ticks: 331592
              I think that neither implementation was in fact vectorized, but they run smoking fast, and faster (I think; I don't know how to assemble the assembly part using as as the tool!?) than the +30% speedup. Both run more than 100% faster than the original code you gave me (which was running in the 700000-tick zone). So maybe it is time for you to improve your C skills and get up to speed with fast C code!?



              • Originally posted by Steph1ani1e
                Virtual machines have come a long way in speed, and with multicores now the norm, managed code can be as fast as unmanaged code, sometimes even faster
                I fully agree with you, but you know, you feel somewhat out of control if the VM doesn't optimize the code you run. I think that is where the frustration of the "assembly folks" here lies. Or maybe that it occupies more memory, or starts a bit slower. These issues can be addressed, and have been for some time, but still the impression that staying far away from the hardware means missing out on micro-managing every instruction makes people feel bad.

                Imagine you are the boss of a whole company: you cannot tell any individual how to use their precious time, only give directives for the entire company. That would be the C/C++ languages. Imagine instead that you are the boss of a division or a subdivision; that is like writing for a VM like Java or C#. Sure, your VM may be inefficient sometimes, but most of the time it will do a really good job. The C++ compilers do an excellent job too.

                And some people may ask: why would you want to use C# if it's slower than C++? Or why use C++, when you can use assembly directly?

                And I think the main reason is that the higher your level in the organization, the bigger the probability that you can change things in the world. If you are just sweeping and keeping every small table clean, you cannot change the entire company, maybe only the department you're working in; but if you manage a division, your division can give slightly worse service at the micro level while offering things that really matter: notifying you as a customer with an SMS, arranging everything to meet you when you need them, and so on. Such changes can be made if you are in control; whereas if you are just optimizing so that an email arrives 10 seconds faster, in 30 seconds instead of 40, that means nothing to the user, while the decision to send the email in the first place is really crucial.



                • I compiled your code; it's still scalar
                  it only seems that a few segments are interleaved a bit
                  I don't want to examine the resulting code further

                  Code:
                  		result[0] = matrix[0]*vertex[0] + matrix[1]*vertex[1] +matrix[2]*vertex[2];
                  		result[1] = matrix[3]*vertex[0] + matrix[4]*vertex[1] +matrix[5]*vertex[2];
                  		result[2] = matrix[6]*vertex[0] + matrix[7]*vertex[1] +matrix[8]*vertex[2];
                  is the simplest way I know of to multiply a 3x3 matrix by a 3x1 vector
                  note the vertex elements used; it can't be made into a shorter loop (at least I can't think of a way right now)
                  you can split it into common sub-operations, but I don't think that would help


                  glMultMatrix is 4x4 by 4x1
                  SSE code for that would be just a few lines long, as there is not as much need to shuffle
                  as far as I know, for 3x3 matrices you need shaders
                  but I don't know that much about OpenGL programming


                  hmmm
                  by data flow I meant reading from and writing to memory
                  when the CPU reads from memory, the data goes through the L2 cache (and probably L3, if you have it) and then through the L1 cache
                  when writing, the same thing happens in reverse
                  data goes from the registers to L-whatever, and when the CPU finds time it writes it back to RAM

                  the cache hardware (the prefetchers and replacement logic) tries to plan ahead, deciding what to keep in the cache and what to evict
                  that's no easy task when the CPU just churns through data like an idiot, but it tries its best
                  one way we can help it is by writing directly to RAM, bypassing the cache
                  the problem is that this is a lot slower than writing to cache, but it's still faster than writing to a full cache
                  (almost the same goes for reading)

                  I'm used to Linux having lots of programs to measure this kind of thing
                  like perf, which can count cache misses
                  or cachebench, to see the limits of... well, the cache

                  btw, why didn't you say so
                  here's a Windows (64-bit) version of the loop
                  hope it works; I didn't test it at all


                  on a more personal note:
                  I don't like OO languages, but I can see why they are useful
                  I really don't like the "the CPU is fast enough for slow code" mentality, but I like Firefox (btw, you need yasm to compile Firefox)
                  and I won't tell other people "you have to program in C"
                  yet it bothers me how people talk as if C++/C#/whatever is better
                  it's all good, but it's all different in basic mentality
                  btw I like C more than other languages because it's a "portable assembler" and thus the closest to writing the machine code yourself
                  (most other languages are far more abstracted from machine code)

                  to end, a quote:

                  Software efficiency halves every 18 months, compensating Moore's Law.
                  — May's Law

                  taking into account that Moore's law is no longer as valid as it was, the future is slow
                  Last edited by gens; 07 May 2013, 09:51 AM.



                  • Originally posted by gens View Post
                    I compiled your code; it's still scalar
                    it only seems that a few segments are interleaved a bit
                    I don't want to examine the resulting code further
                    So for you, SSE (or any parallel code) is parallelism for the sake of parallelism. The scalar code is faster than the SSE code you wrote, so why use SSE in the first place!? Did I miss something?

                    Originally posted by gens View Post
                    Code:
                    		result[0] = matrix[0]*vertex[0] + matrix[1]*vertex[1] +matrix[2]*vertex[2];
                    		result[1] = matrix[3]*vertex[0] + matrix[4]*vertex[1] +matrix[5]*vertex[2];
                    		result[2] = matrix[6]*vertex[0] + matrix[7]*vertex[1] +matrix[8]*vertex[2];
                    is the simplest way I know of to multiply a 3x3 matrix by a 3x1 vector
                    note the vertex elements used; it can't be made into a shorter loop (at least I can't think of a way right now)
                    you can split it into common sub-operations, but I don't think that would help


                    glMultMatrix is 4x4 by 4x1
                    SSE code for that would be just a few lines long, as there is not as much need to shuffle
                    as far as I know, for 3x3 matrices you need shaders
                    but I don't know that much about OpenGL programming
                    So let me clarify: you keep talking about glLoadMatrix/glMultMatrix not matching your matrix size... but it seems you have never used them. glLoadMatrix is called BEFORE you display something. Say you have 10k points and ONE matrix to multiply them by. To display them you write something like the following; only if you don't display them do you have to write silly C loops or assembly:
                    Code:
                    glLoadMatrix(a4x4MatrixOfYour3x3Matrix); 
                    drawGl(yourPoints);
                    If drawGl happens to source the vertices from a VBO (vertex buffer object) in the video card, these two operations cost essentially 0% CPU (you only have to build a 4x4 version of your 3x3 matrix once).

                    You can't use glLoadMatrix to do your computation though, and you can't use shaders for this either; you would need CUDA or OpenCL, which is another discussion altogether. Anyway, as far as I understand, a real program that works with this many points will most likely display them eventually, so there is no point in contrived computations to show that assembly is faster than C; better to use what people already have on their video cards, including the glLoadMatrix call.

                    on a more personal note:
                    I don't like OO languages, but I can see why they are useful
                    I really don't like the "the CPU is fast enough for slow code" mentality, but I like Firefox (btw, you need yasm to compile Firefox)
                    and I won't tell other people "you have to program in C"
                    yet it bothers me how people talk as if C++/C#/whatever is better
                    it's all good, but it's all different in basic mentality
                    btw I like C more than other languages because it's a "portable assembler" and thus the closest to writing the machine code yourself
                    (most other languages are far more abstracted from machine code)

                    for the end, a quote:

                    Software efficiency halves every 18 months, compensating Moore's Law.
                    – May's Law

                    taking into account that Moore's law is no longer as valid as it was, the future is slow
                    You seem not to like "C++/C#/whatever" because C is a portable assembler and is fast, but what you say makes literally no sense. Yes, there is a mentality of inefficiency inside VMs and higher-level languages, but that is no excuse to say we have to target performance alone. What about buffer overflows? Or a NullPointerException (or an Invalid Read At Address 0x0000000c kind of crash)? Don't you want the runtime to be able to recover from these errors?

                    At last: what stops you from writing fast code in C++? All else being equal, C++ is faster than C (this was discussed earlier by Google engineers, and on this topic too): you have access to assembly and everything C has, plus templates that can precompute many things at compile time and do aggressive inlining.

                    What stops you from writing fast code in C#? There are game engines written in C# that run well on my several-years-old phone.

                    I know both C++ and C# and I cannot say that language speed has halved every 18 months (or every 2 years, for that matter). Half of the reason is that even though software bloat has increased many times over, the main slow items are still the rotating disk drives, the internet, the CD and so on, which have huge latencies (if you have an SSD, performance should be many times better than with a rotating disk). The Pentium 3/XP experience of 2001 was in many ways much worse than what you get today with Windows 7 and an i7-3K CPU. Maybe they boot in the same time, but Visual Studio is much more responsive (or Eclipse, or pick your tool), and the IDE and tools give you far more relevant information: font anti-aliasing, solution-wide symbol search, many times more on-screen information/resolution.

                    If you analyze specific language features, say C# LINQ: yes, I can agree it is maybe 15% slower than the most optimized form of your loop, but the construct makes it hard (or impossible) to make common mistakes. If you point out that the "dynamic" keyword in C# is 10x slower than a virtual call, I also agree, but (there is always a but) the code people wrote before to get the "dynamic" functionality was much more error-prone (like assembly compared to default-style C#), and reflection code was really ugly: to maintain, to understand, and many times slower than dynamic.

                    At last, why do you care so much about the ticks? When you run an ls command, do you care how fast ls can parse the directory, or how fast it gives you the answer? What I mean is: ls can be optimized to be light on CPU, or it can be optimized to be light on disk accesses, and the second can be faster than the first even if it uses much more memory/CPU, because it can keep a local cache of each inode and its associated data.

                    You say you like Firefox, but also look at the following facts about FF:
                    - it is written in C++/COM-style code (they are trying to remove XPCOM completely, but not C++), so it is very OOP
                    - the code management solution is Hg (Mercurial), which is written in Python
                    - it uses PGO (Profile Guided Optimization), and they tune their C++ code based on C++ optimization styles, so there is virtually no assembly in the whole Firefox codebase
                    - it pushes JavaScript hard, and asm.js is a well-behaving subset of fast JS; they bet on the power of compilers such as LLVM to give big performance speedups, not on hand-tuned assembly
                    - it uses SQLite everywhere (for history, bookmarks, etc.), and the way its slowdowns are optimized is by making the code multithreaded
                    - most of the speedups in the upcoming Aurora builds are based on GPU profiling
                    - IonMonkey compiles bytecode to native code off-thread

                    Do you see the common themes between what I said before and real software? GPU and multi-threading are the main ways Firefox has optimized for speed in the last years. Why not post on the FF forums telling them to use assembly everywhere? Though I don't think you will have many fans.

                    • Originally posted by gens View Post
                      and it bothers me how people talk like C++/C#/whatever is better
                      its all good, but its all different in basic mentality
                      btw i like C more than other languages cuz its a "portable assembler" and thus the closest to writing it yourself
                      (most other languages are far more abstracted from machine code)
                      C++ is as fast as C on the C subset, and faster than C on the C++ subset.
                      The only things one can dislike about C++ compared to C are:
                      - the language can be too complicated for limited hardware (embedded, GPU) and very low-level code (OS kernels)
                      - the language does not enforce any coding style, which can be a pain for a project without strong leadership

                      Originally posted by gens View Post
                      for the end, a quote:

                      Software efficiency halves every 18 months, compensating Moore's Law.
                      – May's Law

                      taking into account that Moore's law is no longer as valid as it was, the future is slow
                      There's a reason software efficiency compensates for computation speed: there is one thing that doesn't change, the users.
                      There is no need to double the fps of games every year, or to halve the time a pop-up needs to appear.
                      On the other hand, you can have twice as many games, or games twice as good (because game assets are easier to produce when you don't have to optimize them so hard, and that is what takes the time in a game), or games twice as cheap. Your pick! Or applications that are more beautiful, or that can do more, etc. And all that still at the same acceptable speed: the user's speed.

