In this case there isn't much difference between SSE and SSE2.
SSE2 is mostly integer operations (extending MMX to the 128-bit XMM registers), plus a couple of instructions for non-temporal stores and cache hints.
GCC (4.8) gives me scalar code all the time:
Code:
67e: f3 0f 10 4f 80    movss xmm1,DWORD PTR [rdi-0x80]
683: f3 0f 59 46 d0    mulss xmm0,DWORD PTR [rsi-0x30]
688: f3 0f 59 4e d4    mulss xmm1,DWORD PTR [rsi-0x2c]
68d: f3 0f 58 c1       addss xmm0,xmm1
I also tried some more specific compiler options, like:
gcc -o matrixm.o matrixm.c -shared -O3 -ftree-slp-vectorize -ffast-math -msse2
Even with hints in the C code:
Code:
vertex = __builtin_assume_aligned (vertex, 32);
matrix = __builtin_assume_aligned (matrix, 32);
result = __builtin_assume_aligned (result, 32);
AVX, FMA and XOP, on the other hand, have instructions that are great for this kind of operation, like VFMADDPS and HADDPS. The compiler does use them when possible, but in everything I tried the code still came out scalar.
Threading assembly code is as easy as threading C code. In fact I think my loop would do better threaded than the C one, since a cache line is, from what I can tell, usually 64 bytes, and the scalar code only uses 4 bytes of it per load. I could be wrong if, for example, the CPU notices that and loads the whole cache line into two registers.
I tried -flto now and it does give better performance, maybe because it aligned the loop (my loop isn't aligned either; I tried aligning it just now and it didn't help much).
This is an example of SSE's usefulness. If someone wants to use the loop in a BLAS library, I'll finish it so it works in all cases. It can also be used as a software fallback when OpenGL 3.x is not available, as on some laptops.
The fun thing is I'll be using this to shave a percent or two off a game; I'll need to change a couple of lines though. Writing a loop here and there isn't that demanding.
Debugging is done by following the flow of the program:
load -> shuffle -> load2 -> shuffle_together
What is (and should be) in the affected register is written down at each step, and you can rename the registers as you wish. I admit it's not as clear as copying parts of the C code.
Then again, SIMD and MIMD can require different algorithms, so it's at least good to know about. I personally insert stores and calls that print to stdout to see what's going on. (gdb gives you the address of the problem; you can then look at the disassembly around that address to find it.)
I feel good documentation is the way to better-quality software, not limiting all programmers to one way of doing things. Then again, I do this as a hobby, so what do I know.
For software production it probably doesn't matter at all: most people have many-core CPUs, so optimizing away that couple of percent doesn't matter. Then again, sometimes it is useful, for example in scientific computing, encryption, databases, and physics on the CPU, and also in code that changes control flow frequently based on the results of calculations, which is better done on the CPU.
Unfortunately I can't test your loop, since it returns wrong results, and it does only a third of the necessary calculations. Still, 177472 * 3 is 532416; hmm, I guess the difference comes from lower cache usage. Nice try though; a similar loop is in GCC's vectorization manual.
PS: I also tried ICC and LLVM on this site, though the LLVM there is only 3.0.