Is Assembly Still Relevant To Most Linux Software?


  • I've been told by some embedded developers that they use C rather than C++ because with it they can predict and account for every byte of memory that gets used and allocated: allocating an array and a size variable is absolutely predictable in terms of memory consumption; std::vector (et al.) isn't.

    Comment


    • Originally posted by archibald View Post
      I've been told by some embedded developers that they use C rather than C++ because with it they can predict and account for every byte of memory that gets used and allocated: allocating an array and a size variable is absolutely predictable in terms of memory consumption; std::vector (et al.) isn't.
      And I can add another two reasons: as there are no abstractions (like std::vector) by default, people rely on "circular buffers" and "memory pools", which are fast, hand-rolled "abstractions". C++ code is also slightly larger, so CPUs with a small cache can take a small performance penalty. C++ exceptions (though I have never heard of them being used in embedded code) are also not predictable in performance.
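      For illustration, a minimal fixed-capacity circular buffer might look like this sketch (my own, not from any real embedded project); every byte of storage is static, so the total memory use is known exactly at compile time:

      Code:
      /* fixed-capacity ring buffer: no heap, no hidden allocations */
      #define RING_CAPACITY 64

      struct ring {
          unsigned char data[RING_CAPACITY];
          unsigned head;   /* next slot to write */
          unsigned tail;   /* next slot to read  */
          unsigned count;  /* elements currently stored */
      };

      static int ring_push(struct ring *r, unsigned char byte)
      {
          if (r->count == RING_CAPACITY)
              return 0;                 /* full: the caller decides what to drop */
          r->data[r->head] = byte;
          r->head = (r->head + 1) % RING_CAPACITY;
          r->count++;
          return 1;
      }

      static int ring_pop(struct ring *r, unsigned char *out)
      {
          if (r->count == 0)
              return 0;                 /* empty */
          *out = r->data[r->tail];
          r->tail = (r->tail + 1) % RING_CAPACITY;
          r->count--;
          return 1;
      }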

      But people like gens don't argue about memory usage; they argue about performance, and in a weird way, as if assembly always translates into performance. Even if that were theoretically the case, in practice it is not true (anymore). Compilers are mature and can optimize code better than at least gens' assembly output, and they will only get better as time goes on.

      Similarly, all the reasons I listed for embedded code (like accounting for every byte of memory) can be addressed as simply as not using the STL (so C++ itself is not the issue) and using templates to avoid wrong macro expansions. There is no reason why memory pools or arenas cannot be used in C++; Firefox uses JavaScript heap compartments, which are very similar to memory pools. Memory pools are also accessible from C# (of course, the heap works a bit differently there), and nothing stops anyone from writing a circular buffer in a high-level language like Java.
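      On the macro point, a tiny made-up example (not from any real codebase): the classic MAX macro evaluates an argument twice, while the equivalent template cannot:

      Code:
      #include <cstdio>

      /* Classic C macro: evaluates its arguments twice, so MAX(i++, 0)
         increments i twice -- a wrong expansion the compiler cannot catch. */
      #define MAX(a, b) ((a) > (b) ? (a) : (b))

      /* Template version: arguments are evaluated exactly once, the call is
         type-checked, and it is small enough to be inlined anyway. */
      template <typename T>
      inline T max_of(T a, T b) { return a > b ? a : b; }

      int main()
      {
          int i = 1;
          int macro_result = MAX(i++, 0);     /* i is now 3, result is 2 */
          int j = 1;
          int tmpl_result = max_of(j++, 0);   /* j is now 2, result is 1 */
          printf("%d %d %d %d\n", macro_result, i, tmpl_result, j);
          return 0;
      }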

      Memory is an important resource and it is important to optimize for it, and I fully agree that C can achieve this better than Java. But performance, in those terms, is not assembly's main strength today.

      I will call on gens once again to test the assembly code against the C code with the optimization flags that were described, and to publish the numbers. Or to name a real use case where his multiplication makes sense and cannot be optimized with caching or by putting the computation on the video card. He tries to play a game where he makes the rules, he plays, and he wins.

      This is plain silly, and in the process he equivocates: if his assembly code is faster than his C code, that supposedly means assembly is faster than C.

      Comment


      • Why are you so passionate about this?
        Btw, sorry, I forgot to tell you it's in FASM, but I gave you 64-bit Linux .o files to link against.

        Here are the numbers.

        asm
        Code:
        elapsed ticks: 921086
        5.000000 14.000000 23.000000 122.000000 158.000000 194.000000 401.000000 464.000000 527.000000 842.000000 932.000000 1022.000000 1445.000000 1562.000000 1679.000000 2210.000000 2354.000000 2498.000000 3137.000000 3308.000000 3479.000000 4226.000000 4424.000000 4622.000000
        gcc, (almost) no SSE
        Code:
        elapsed ticks: 1178980
        5.000000 14.000000 23.000000 122.000000 158.000000 194.000000 401.000000 464.000000 527.000000 842.000000 932.000000 1022.000000 1445.000000 1562.000000 1679.000000 2210.000000 2354.000000 2498.000000 3137.000000 3308.000000 3479.000000 4226.000000 4424.000000 4622.000000
        yours
        Code:
        elapsed ticks: 357943
        8098740224.000000 8099010560.000000 8099280384.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
        See the problem?

        And for testing I chose to have separate .o files so rdtsc doesn't get optimized away and the results are somewhat accurate.
        In assembly I'd use cpuid to get a bit more accurate results.
        I read that in C it is better to use gettimeofday() on modern, frequency-scaling CPUs.
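        For reference, a minimal sketch of that gettimeofday() timing (the workload is just a placeholder of mine); wall-clock time isn't skewed by frequency scaling the way a raw rdtsc delta is:

        Code:
        #include <stdio.h>
        #include <sys/time.h>

        /* wall-clock timing; unlike raw rdtsc it is not skewed by frequency
           scaling or by which core the thread happens to run on */
        static double seconds_now(void)
        {
            struct timeval tv;
            gettimeofday(&tv, 0);
            return tv.tv_sec + tv.tv_usec / 1e6;
        }

        int main(void)
        {
            double start = seconds_now();

            volatile double acc = 0.0;            /* placeholder workload */
            for (long i = 0; i < 10000000L; i++)
                acc = acc + i * 0.5;

            printf("elapsed: %f seconds (acc = %f)\n", seconds_now() - start, acc);
            return 0;
        }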

        And no, calling OpenGL is not a 0% CPU thing.
        Data has to be copied to the GPU through PCI-e (or whatever),
        and that is a big part of this operation: reading and writing RAM.
        During that time the CPU can't do much of anything else anyway.
        Maybe on a PS4, with its one pool of GDDR for everything, it would be, but not on normal household computers.

        And as a reminder, I said something like "when you can't get more performance out of your code, then you can write the loops that would benefit from assembly, in assembly".


        What SSE is good for:


        When a compiler writes something like that, then it's good.


        @ erendorn

        True, software is tailored for the user, of course.
        But now, years after Windows 98 and everything that has happened to computers, should Office still take a few seconds for a query?

        The funny thing is that GAS assembly syntax was made that way because, at the time, it was too CPU-consuming to make the assembler check every label against a register name.



        Another thing I noticed is that common operations get... backported, I'd guess... into a CPU's instruction set,
        like the matrix operations in 3D now, and earlier in SSE and others,
        and MMX for whatever math they do for pictures.

        That reminds me: adjusting mathematical algorithms to a CPU is, from what I see, easier to do directly in CPU instructions.

        Just to underline:
        THIS IS ABOUT EXTERMINATING ALL ASSEMBLY THAT CAN BE EXTERMINATED.
        And probably all assembly in OSS is there for performance reasons, meaning a few loops here and there, not whole unmaintainable programs.

        So I'm not talking about C++, Java, Python or any of the many languages out there;
        I'm talking about the most-used simple loops overall.

        Before I forget:
        object-oriented programming adds memory and CPU overhead.
        It's usually just a small performance hit, not much, but it's there.
        On the other hand, virtual machines are in a world of their own, but they can (in theory) run on anything that has the required binary.

        Tradeoffs all around.
        It just bothers me why people hate assembly.
        It's a different way of programming; maybe that seems hard to understand.
        The first time I programmed in C it was weird compared to the QB I knew.

        PS: when compilers get as good as humans, I'll change my mind to "assembly is a good way to learn about CPUs and their bottlenecks".
        Last edited by gens; 07 May 2013, 08:08 PM.

        Comment


        • Originally posted by gens View Post
          Why are you so passionate about this?
          Because one out of every 3-4 things you say is false, and piling up factoids or false claims doesn't prove a point. For example, this quote:
          Originally posted by gens View Post
          And no, calling OpenGL is not a 0% CPU thing.
          Data has to be copied to the GPU through PCI-e (or whatever),
          and that is a big part of this operation: reading and writing RAM.
          During that time the CPU can't do much of anything else anyway.
          Maybe on a PS4, with its one pool of GDDR for everything, it would be, but not on normal household computers.
          In fact, it is close to a 0% CPU thing if it is used as I've said before: the call is executed in the driver, and its content requires almost no CPU. Look at the glDrawArrays function. But not only this:
          Vertex Buffer Objects are designed to do no memory copying:
          A Vertex Buffer Object (VBO) is an OpenGL feature that provides methods for uploading data (vertex, normal vector, color, etc.) to the video device for non-immediate-mode rendering. VBOs offer substantial performance gains over immediate mode rendering primarily because the data resides in the video device memory rather than the system memory and so it can be rendered directly by the video device.
          Before OpenGL 1.5 there were compiled vertex arrays, which again imply almost no CPU work.
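          A minimal VBO-upload sketch (my own illustration; it assumes a current GL >= 1.5 context and a loader such as GLEW for the entry points). The data goes to video memory once, and later draw calls read it from there:

          Code:
          #include <GL/glew.h>   /* assumes GLEW provides the GL 1.5 entry points */

          /* Upload vertex data once into a VBO; later draw calls read it from
             video memory, so the CPU is not copying vertices every frame. */
          GLuint upload_vertices(const float *verts, long n_floats)
          {
              GLuint vbo = 0;
              glGenBuffers(1, &vbo);
              glBindBuffer(GL_ARRAY_BUFFER, vbo);
              glBufferData(GL_ARRAY_BUFFER, n_floats * sizeof(float),
                           verts, GL_STATIC_DRAW);
              glBindBuffer(GL_ARRAY_BUFFER, 0);
              return vbo;
          }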

          ... and when a compiler doesn't write something like that, it writes good enough code. This is why the Xonotic codebase has only macros but no assembly; did I read that right?

          Originally posted by gens View Post
          before i forget,
          objective programing adds memory and cpu overhead
          its usually just a bit on performance, not much, but its there
          on the other hand, virtual machines are in a world of their own, but they can (in theory) run on anything that has the required binary
          I wrote a C++ version of the code and it behaved really well (faster than your original code, a bit slower than the __restrict kind of coding):

          Code:
          class Vertex2d
          {
          public:
          	float vertices[3];
          };
          class Matrix2d
          {
          public:
          	float matrix[9];
          	void multiplyVertices(const Vertex2d & src, Vertex2d & dest) const;
          };
          
          void Matrix2d::multiplyVertices(const Vertex2d & src, Vertex2d & dest) const
          {
          	for(int j=0;j<3;j++) {
          		float accumulator = 0.0f;
          		for (int k=0;k<3;k++)
          			accumulator+= matrix[j*3+k]*src.vertices[k];
          		dest.vertices[j] = accumulator;
          	}
          }
          
          void compute(
          	Matrix2d * matrix, 
          	Vertex2d * vertex, 
          	Vertex2d * result, 
          	int count ) {
          	for(int i=0;i<count;i++) {		
          		matrix->multiplyVertices(*vertex, *result);
          		matrix ++;
          		result ++;
          		vertex ++;
          	}
          }
          
          #include <stdio.h>
          #include <sys/time.h>
          
           unsigned long long int rdtsc(void)
           {
              unsigned a, d;
           
              // rdtsc returns the time-stamp counter: low 32 bits in eax, high 32 in edx
              __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
           
              return ((unsigned long long)a) | (((unsigned long long)d) << 32);
           }
          
          int main() {
          	Matrix2d matrices[10000];
          	Vertex2d vertices[100000];
          	Vertex2d result[100000];
          	int i;
          	int count = 10000;
          	Matrix2d *ptrmat;
          	Vertex2d *ptrvert, *ptrres;
          	
          	float tmp=0.0f;
          	for( i=0; i<count*3; i++) {
          		vertices[i/3].vertices[i%3]=tmp;
          		tmp=tmp+1;
          	}
          	tmp = 0.0f;	
          	for( i=0; i<count*9; i++) {
          		matrices[i/9].matrix[i%9]=tmp;
          		tmp=tmp+1;
          	}
          	ptrmat = &matrices[0];
          	ptrvert = &vertices[0];
          	ptrres = &result[0];
          	
          	unsigned long long ts = rdtsc();
          	
          	compute( ptrmat, ptrvert, ptrres, count );
          	
          	printf("elapsed ticks: %llu\n", rdtsc() - ts);
          	
          	for( i=0; i<24; i++) {
          		printf("%f ", result[i/3].vertices[i%3]);
          	}
          	printf("\n");
          	return 0;
          }
          So C++ is not that slow either. I could bother to precompute all the computation (or most of it) at compile time, as described here, but I will not. I would more likely prefer the glLoadMatrix solution, where most of the code is already written for me.

          Which extra memory/performance cost does C++ imply? The virtual call? It is an opt-in feature, so nothing stops you from not using it. I can say that even if you want to keep a low-level codebase with a C "blend", C++ gives you advantages:
          - const methods and references, where the compiler can know to optimize away many computations. If you work with reference counting (smart pointers), using a constant reference to them will not add one reference and then destroy it
          - templates are as ugly to write as assembly is, but when you succeed in fixing the template errors, the result is a better, more restricted version of your C++ code, not the reverse (unlike the more error-prone version you get when it is written in assembly). Because of this, many macros are safer written as templates, and the compiler can mostly inline them for small operations
          - can you make a benchmark where a virtual call is slower than a call through a function pointer? In my understanding both run equally fast, but maybe I am wrong; see the sketch below
          - move semantics in C++11 can remove some copies during object creation (as specified), so the argument that RAII gives inefficient code because "a lot of copies" are made doesn't hold much water today. In fact GCC was optimizing away some copies much earlier, and you have to explicitly ask it *not* to optimize those copies away
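          Here is a rough sketch of such a micro-benchmark (my own toy code; the volatile-qualified pointers keep the optimizer from devirtualizing or inlining either call):

          Code:
          #include <cstdio>
          #include <ctime>

          struct Base {
              virtual long f(long x) const { return x + 1; }
              virtual ~Base() {}
          };

          static long plain(long x) { return x + 1; }

          int main()
          {
              const long N = 200000000L;
              Base b;
              // volatile-qualified pointers force a real indirect call each
              // iteration; otherwise the optimizer may devirtualize/inline both
              const Base *volatile obj = &b;
              long (*volatile fptr)(long) = plain;

              long acc = 0;
              clock_t t0 = clock();
              for (long i = 0; i < N; i++)
                  acc += obj->f(i);          // virtual dispatch through the vtable
              clock_t t1 = clock();
              for (long i = 0; i < N; i++)
                  acc += fptr(i);            // plain indirect call
              clock_t t2 = clock();

              printf("virtual: %f s, fnptr: %f s (acc=%ld)\n",
                     (double)(t1 - t0) / CLOCKS_PER_SEC,
                     (double)(t2 - t1) / CLOCKS_PER_SEC, acc);
              return 0;
          }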

          Originally posted by gens View Post
          Tradeoffs all around. It just bothers me why people hate assembly. It's a different way of programming; maybe that seems hard to understand. The first time I programmed in C it was weird compared to the QB I knew.

          PS: when compilers get as good as humans, I'll change my mind to "assembly is a good way to learn about CPUs and their bottlenecks".
          Yes, but the CPU is not the only bottleneck, and here, I think, is at least my problem with assembly. Assembly is no longer current except for a small elite of programmers working on compilers, or maybe in the scientific world, and this even where compilers do not "auto-vectorize" your code. Assembly doesn't speed things up the way OpenMP does, or the way OpenGL/OpenCL do. Assembly doesn't speed up your disk/network/database access, and once you improve all these other components, assembly is the odd man out.
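          To show what I mean by the OpenMP kind of speedup, here is a toy example of mine (compile with -fopenmp): one pragma spreads the loop across all cores, which no hand-written single-threaded assembly will match:

          Code:
          #include <omp.h>
          #include <stdio.h>

          #define N 10000000

          int main(void)
          {
              static float a[N], b[N], c[N];   /* static: too big for the stack */
              for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

              double t0 = omp_get_wtime();
              /* one directive spreads the loop over all cores; each thread's
                 chunk can still be auto-vectorized by the compiler */
              #pragma omp parallel for
              for (int i = 0; i < N; i++)
                  c[i] = a[i] * b[i] + a[i];
              double t1 = omp_get_wtime();

              printf("elapsed: %f s, c[N-1] = %f\n", t1 - t0, c[N - 1]);
              return 0;
          }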

          As I've told you, I did assembly, but that was in another era, mostly when SSE (1) was just introduced; I wasn't writing SSE code (just as you don't write AVX either), so I know what it is all about. I knew 32-bit assembly fairly well (and 16-bit, which I studied in university), and as for my tiny optimizations: if I waited 3-4 years I would get them in many cases for free, or, if not, there was an assembly-written library on the internet that I could use in my project.

          In fact I don't know QB (QuickBASIC!?), but even the last version of VB (6, not the .NET ones) had a compiler which, in an alternate history where it had been nourished more, could have kept VB around today for its good performance.

          Should assembly be scrapped? Of course not! I think that assembly today (for performance reasons; atomics are another matter) should exist just as intrinsic primitives. Hopefully they will be as generic as Mono.SIMD is. There is no point in writing raw assembly today: most of the time you can write C++ that comes very close to assembly performance, and you can write Java or C# code (which, if you ignore the startup time, is only a bit slower) that is really close to that assembly too, without worrying about why a loop is not SIMDed, but about how to optimize your application.
          Last edited by ciplogic; 08 May 2013, 02:03 AM.

          Comment


           • Intrinsics are as portable as any assembly.
             The thing that bothered me when writing them was that I didn't know exactly how many registers I had left.
             The good thing about them is that the compiler reorders the instructions.
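             For example, something like this (a small made-up loop body; the compiler picks the xmm registers and may reorder the loads and the arithmetic):

             Code:
             #include <xmmintrin.h>  /* SSE1 intrinsics */
             #include <stdio.h>

             int main(void)
             {
                 float a[4] = { 1, 2, 3, 4 };
                 float b[4] = { 10, 20, 30, 40 };
                 float r[4];

                 /* the compiler assigns the registers and schedules the instructions */
                 __m128 va = _mm_loadu_ps(a);
                 __m128 vb = _mm_loadu_ps(b);
                 __m128 vr = _mm_add_ps(_mm_mul_ps(va, vb), va);  /* r = a*b + a */
                 _mm_storeu_ps(r, vr);

                 printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
                 return 0;
             }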

             And again, you can thread assembly.

             I'll make a proper loop with AVX when there's time; this SSE one was not as optimized as it could be.

             I'm as entitled to talk about C++ as you are about assembly,
             so here is a paper on OO vs procedural programming:

            Comment


            • This was done to see how much real Assembly is being used, to see what the code was used for, and whether it was worth porting to 64-bit ARM / AArch64 / ARMv8...


              Comment


              • Originally posted by gens View Post
                Intrinsics are as portable as any assembly. The thing that bothered me when writing them was that I didn't know exactly how many registers I had left. The good thing about them is that the compiler reorders the instructions.

                And again, you can thread assembly.

                I'll make a proper loop with AVX when there's time; this SSE one was not as optimized as it could be.

                I'm as entitled to talk about C++ as you are about assembly, so here is a paper on OO vs procedural programming:
                http://scholar.lib.vt.edu/theses/ava...ted/thesis.pdf
                From the paper:
                6.4 Summary
                From the data gathered on these three applications and from the above discussion, we may conclude that careful design in the OO paradigm can yield appreciable performance. We summarize below the most important points about OO design and performance issues:
                And the performance issues were also shown in the introduction of the paper: many function calls, virtual calls, and the creation of small objects.
                For these there are simple solutions, which the paper itself states:
                - put inline functions in headers
                - for a critical loop, make a static version of your code that calls methods directly
                - allocate objects on the stack and use references and constant references (move semantics will also help; see the sketch below)
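                A tiny illustration of those three points together (my own made-up example, not code from the paper):

                Code:
                #include <cstdio>

                struct Vec3 {
                    float x, y, z;
                    /* defined in the header, so the compiler can inline it */
                    inline float dot(const Vec3 &o) const { return x*o.x + y*o.y + z*o.z; }
                };

                /* const reference: no copy of the small object, and the compiler
                   knows the argument is not modified */
                static float length_squared(const Vec3 &v) { return v.dot(v); }

                int main()
                {
                    Vec3 v = { 1.0f, 2.0f, 3.0f };   /* stack allocation: no heap traffic */
                    printf("%f\n", length_squared(v));
                    return 0;
                }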

                If you read the paper, the runtime penalty was about 4.09% with the default design, but after taking the C++ benefits into account, it was sometimes even faster (with a hand-tuned version). A 4% slowdown because you make multiple copies is ugly, but so is a leak because memory isn't freed automatically (a "feature" that is more likely to happen in C than in C++).

                Lastly, -flto (or -O4) mostly exists to address this, inlining across object-file boundaries and letting the compiler inline many small methods.
                Originally posted by gens View Post
                I'll make a proper loop with AVX when there's time; this SSE one was not as optimized as it could be.
                As for me, I don't think it is worth the effort. Try it on your own program, and see if you can learn how to improve your programming skills.

                It looks to me that you have some game programming in mind (I may be wrong though), and you can see that the GPU is the limiting resource for most games: http://www.anandtech.com/show/6934/c...tigpu-at-1440p (in a single-GPU configuration). Try to leverage this. When I played Crysis 2 (a great game, btw), it was frustrating to play it at just 1280x1024, but the CPU was not the issue. If you do some CAD-like programming (though I doubt it), I can say that in big systems your updating logic matters much more. I say this because I worked on one, and when you have hundreds or thousands of parts, some of which impact the others, it is more important to have a big framework that computes the impact. The language I worked in there was C#. C# was at first around 30% of the runtime, but after some optimizations it was more like 10% (look here for details, and here for after the optimizations: http://narocad.blogspot.com/2009/06/again-fixes-and-benchmark-part-ii.html), where I noticed that the slowest component was updating the tree view, the second was the C++ component, and the C++ visualization engine was not written to work with that many shapes.

                Comment


                 • Assembly is still important to Linux in the kernel. Even today, the second most common language in the Linux kernel is assembly, after C. Assembly gives Linus, Hartman and the others the ability to properly design, in detail, the parts of the kernel that would cause performance bottlenecks if written in C. This allows the designers and maintainers of Linux to make it extremely fast and efficient.

                   This is in contrast to all the BSDs, where they wrote their entire OS in C, even in places where using assembly is critical. The result? The BSDs are among the slowest OSes ever. Even slower than Windows.

                   Don't believe me? See it for yourselves: http://svn.freebsd.org/base/head/

                   If you do a `find ./ -name "*.asm" -print`, you find nothing. There are some *.S files, but it turns out those are just extra baggage left over from when they copy-pasted AT&T code, which resulted in the USL vs BSDI lawsuit. These *.S files are never referenced in any of the Makefiles.

                   Add to that the fact that the source tree is one big heavy pile of garbage full of spaghetti code, which just shows what a crappy mess BSD is.

                   No wonder BSD kernels are so unportable and slow. Worse, they are even trying to rewrite everything (including the kernel) in C++, all because of Clang. What retards.

                  Comment


                  • Originally posted by i386reaper View Post
                    Assembly is still important to Linux in the kernel. Even today, the second most common language in the Linux kernel is assembly, after C. Assembly gives Linus, Hartman and the others the ability to properly design, in detail, the parts of the kernel that would cause performance bottlenecks if written in C. This allows the designers and maintainers of Linux to make it extremely fast and efficient.

                    This is in contrast to all the BSDs, where they wrote their entire OS in C, even in places where using assembly is critical. The result? The BSDs are among the slowest OSes ever. Even slower than Windows.

                    Don't believe me? See it for yourselves: http://svn.freebsd.org/base/head/

                    If you do a `find ./ -name "*.asm" -print`, you find nothing. There are some *.S files, but it turns out those are just extra baggage left over from when they copy-pasted AT&T code, which resulted in the USL vs BSDI lawsuit. These *.S files are never referenced in any of the Makefiles.

                    Add to that the fact that the source tree is one big heavy pile of garbage full of spaghetti code, which just shows what a crappy mess BSD is.

                    No wonder BSD kernels are so unportable and slow. Worse, they are even trying to rewrite everything (including the kernel) in C++, all because of Clang. What retards.
                    BSD bashing, I think, is not warranted. Even if it is written in C, the slowness of BSD (as in OS X) often lies not in assembly but in other factors, such as some implementations having had a big lock in the kernel. I am talking about things like this. In fact, a big part of Windows is written in C++ today, many parts have been written in C for a long time, and no one is complaining about how slow it is (some people still do, but again not because of how much assembly is or isn't used).

                    FreeBSD (and Mac OS X) has to be slower than Linux for many reasons, including:
                    - many critical modules of Linux are compiled not as loadable modules but as part of the kernel
                    - Linux has more manpower, and more interest in being fast on supercomputers, so SGI and IBM contributed heavily to it
                    - a lot of hardware companies still profile and tune Linux (for example Intel)
                    - the file system (ext4) is in general faster than FreeBSD's
                    - FreeBSD is compiled with an older GCC (4.2) because GCC 4.3 / GPLv3 is not compatible with the BSD license

                    If you add all these together, it is to be expected, with or without assembly, that FreeBSD is slow(er than Linux).

                    Going back to the topic's question, "is assembly still relevant to most Linux software?": the Linux kernel is not most Linux software. A scan of the source clearly shows that Linux is 2.9% assembly (compared with 94.5% C). If you subtract the atomics, the system calls, workarounds such as flushing the cache when a context switch happens, and the requests for the CPU to go into a lower power state (all of which require assembly; C will not make the cut, see the sketch below), and consider that this assembly is spread across all the platforms Linux supports, the assembly usage is really minimal.

                    Source: https://www.ohloh.net/p/linux/analys...guages_summary
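                    To make "C will not make the cut" concrete, here is a sketch of one such construct: a raw x86-64 Linux system call written as GCC inline assembly (the wrapper is my own illustration; the constraints follow the x86-64 syscall ABI):

                    Code:
                    /* A raw write(2) on x86-64 Linux: the `syscall` instruction and its
                       register convention (rax = number, rdi/rsi/rdx = args, rcx/r11
                       clobbered) cannot be expressed in plain C. */
                    static long raw_write(int fd, const void *buf, unsigned long len)
                    {
                        long ret;
                        __asm__ volatile("syscall"
                                         : "=a"(ret)
                                         : "a"(1L),          /* __NR_write on x86-64 */
                                           "D"((long)fd), "S"(buf), "d"(len)
                                         : "rcx", "r11", "memory");
                        return ret;
                    }

                    int main(void)
                    {
                        raw_write(1, "hello from raw syscall\n", 23);
                        return 0;
                    }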

                    If we go to FreeBSD, the story is really similar, and there is assembly there too: 91.4% C / 2.4% assembly (rates similar to Linux), but the BSD kernel is many times smaller.

                    Comment


                    • Beating the dead horse:
                      As for my own code, my typical workflow is something like:
                      • Write function in Objective-C. If fast enough, stop.
                      • Rewrite function to use a better algorithm or data structure. If fast enough, stop.
                      • Rewrite function in C. If fast enough, stop.
                      • Rewrite function in multi-threaded C with Grand Central Dispatch. If fast enough, stop.
                      • Rewrite function in OpenCL.



                      With JavaScript, optimization stops at step #2. Even with the promising new asm.js, that would get me to step #2.5 — still slower than C, and a far cry from multi-threaded C or OpenCL. Developing for Mac, I have more tools at my disposal for clearing out performance bottlenecks and delivering a superb user experience. (Xcode's profiler, by the way, is generally excellent.)
                      This article discusses why a developer uses OS X and why he doesn't use a web platform to deliver the application. So it is not about assembly, but I can report the same experience in C#, excluding the "rewrite the function in C" step, which most of the time makes no sense (Objective-C, like C#, has an overhead; for C++ that would not be the case). At least for my applications, I found that step 3 would be phrased as "use NGen on the client machine".

                      Comment
