Is Assembly Still Relevant To Most Linux Software?
-
Originally posted by archibald View Post
I've been told by some embedded developers that they use C rather than C++ because with it they can predict and account for every byte of memory that gets used and allocated: allocating an array and a size variable is absolutely predictable in terms of memory consumption; std::vector (et al.) isn't.
But people like gens don't argue about memory usage; they argue about performance, and in a weird way, as if assembly always translates into performance. Even if that is true in theory, in practice it is not (anymore): compilers are mature, they can optimize code better than at least gens' assembly output, and they will keep getting better as time goes.
Similarly, all the reasons people give for embedded code (like accounting for every byte of memory) can be addressed as simply as not using the STL (so C++ itself is not the issue) and using templates to avoid wrong macro expansions. There is no reason why memory pools or arenas cannot be used in C++; Firefox uses JavaScript heap compartments, which are very similar to memory pools. Memory pools are also accessible from C# (where, of course, the heap works a bit differently), and nothing stops anyone from writing a circular buffer in a high-level language like Java.
Memory is an important resource and it is important to optimize for it, and I fully agree that C can achieve this better than Java. But performance in those terms is not today's main strength of assembly.
I will ask gens once again to test the assembly code against the C code with the optimization flags that were described, and to publish the numbers. Or to give a real use case where his multiplication makes sense and cannot be optimized with caching or by moving the computation to the video card. He is playing a game where he makes the rules, he plays, and he wins.
This is plain silly, and in the process he commits the equivocation that if his assembly code is faster than his C code, then assembly is faster than C.
Comment
-
why are you so passionate about this ?
btw sorry, i forgot to tell you it's in fasm, but i gave you 64bit linux .o files to link against
here
numbers
asm
Code:
elapsed ticks: 921086
5.000000 14.000000 23.000000
122.000000 158.000000 194.000000
401.000000 464.000000 527.000000
842.000000 932.000000 1022.000000
1445.000000 1562.000000 1679.000000
2210.000000 2354.000000 2498.000000
3137.000000 3308.000000 3479.000000
4226.000000 4424.000000 4622.000000
Code:
elapsed ticks: 1178980
5.000000 14.000000 23.000000
122.000000 158.000000 194.000000
401.000000 464.000000 527.000000
842.000000 932.000000 1022.000000
1445.000000 1562.000000 1679.000000
2210.000000 2354.000000 2498.000000
3137.000000 3308.000000 3479.000000
4226.000000 4424.000000 4622.000000
Code:
elapsed ticks: 357943
8098740224.000000 8099010560.000000 8099280384.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
and for testing i chose to have separate .o files so rdtsc doesn't get optimised away and the results are somewhat accurate
in assembly i'd use cpuid to serialize and get a bit more accurate results
i read that in C it is better to use gettimeofday() on modern, frequency-scaling cpus
and no
calling opengl is not a 0% cpu thing
data has to be copied to the gpu through PCI-e (or whatever)
and that is a big part of this operation, reading and writing RAM
at that time the cpu can't do much of anything else anyway
maybe on the PS4 with its one gddr pool for everything it would be, but not on normal household computers
and to remind you, i said something like "when you can't get more performance out of your code, then you can write the loops that can benefit from assembly in assembly"
that's what sse is good for
when a compiler writes something like that, then it's good
@ erendorn
true, software is tailored for the user ofc
but now, years after windows 98, with everything that has happened to computers, should office still take a few seconds for a query ?
funny thing is, GAS assembly syntax was made that way 'cuz at the time it was too cpu-consuming for the assembler to check every label against a register
another thing i noticed is that common operations get.. backported i'd guess.. into the instruction set of a cpu
like matrix things for 3D now, later in SSE and others
and MMX for whatever math they do for pictures
that reminds me; adjusting mathematical algorithms to a cpu is, from what i see, easier to do directly in cpu instructions
just to underline
THIS IS ABOUT EXTERMINATING ALL ASSEMBLY THAT CAN BE EXTERMINATED
and probably all assembly in OSS is there for performance reasons, meaning a few loops here and there, not whole unmaintainable programs
so i'm not talking about C++, java, Python or any of the many languages out there
i'm talking about the most used simple loops overall
before i forget,
object-oriented programming adds memory and cpu overhead
it's usually just a bit of performance, not much, but it's there
on the other hand, virtual machines are in a world of their own, but they can (in theory) run on anything that has the required binary
tradeoffs all around
it just bothers me why people hate assembly
it's a different way of programming, maybe that seems hard to understand
the first time i programmed in C it was weird compared to the QB that i knew
PS when compilers get as good as humans, i'll change my mind to "assembly is a good way to learn about cpus and their bottlenecks"
Last edited by gens; 07 May 2013, 08:08 PM.
Comment
-
Originally posted by gens View Post
why are you so passionate about this ?
Originally posted by gens View Post
and no
calling opengl is not a 0% cpu thing
data has to be copied to the gpu through PCI-e (or whatever)
and that is a big part of this operation, reading and writing RAM
at that time the cpu can't do much of anything else anyway
maybe on the PS4 with its one gddr pool for everything it would be, but not on normal household computers
Vertex Buffer Objects are designed to avoid memory copying:
A Vertex Buffer Object (VBO) is an OpenGL feature that provides methods for uploading data (vertex, normal vector, color, etc.) to the video device for non-immediate-mode rendering. VBOs offer substantial performance gains over immediate mode rendering primarily because the data resides in the video device memory rather than the system memory and so it can be rendered directly by the video device.
Originally posted by gens View Post
Originally posted by gens View Post
before i forget,
object-oriented programming adds memory and cpu overhead
it's usually just a bit of performance, not much, but it's there
on the other hand, virtual machines are in a world of their own, but they can (in theory) run on anything that has the required binary
Code:
#include <stdio.h>
#include <sys/time.h>

class Vertex2d {
public:
    float vertices[3];
};

class Matrix2d {
public:
    float matrix[9];
    void multiplyVertices(const Vertex2d & src, Vertex2d & dest) const;
};

void Matrix2d::multiplyVertices(const Vertex2d & src, Vertex2d & dest) const
{
    for (int j = 0; j < 3; j++) {
        float accumulator = 0.0f;
        for (int k = 0; k < 3; k++)
            accumulator += matrix[j*3+k] * src.vertices[k];
        dest.vertices[j] = accumulator;
    }
}

void compute(Matrix2d * matrix, Vertex2d * vertex, Vertex2d * result, int count)
{
    for (int i = 0; i < count; i++) {
        matrix->multiplyVertices(*vertex, *result);
        matrix++;
        result++;
        vertex++;
    }
}

unsigned long long int rdtsc(void)
{
    unsigned a, d;
    __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}

int main()
{
    Matrix2d matrices[10000];
    Vertex2d vertices[100000];
    Vertex2d result[100000];
    int i;
    int count = 10000;
    Matrix2d *ptrmat;
    Vertex2d *ptrvert, *ptrres;
    float tmp = 0.0f;

    for (i = 0; i < count*3; i++) {
        vertices[i/3].vertices[i%3] = tmp;
        tmp = tmp + 1;
    }
    tmp = 0.0f;
    for (i = 0; i < count*9; i++) {
        matrices[i/9].matrix[i%9] = tmp;
        tmp = tmp + 1;
    }

    ptrmat = &matrices[0];
    ptrvert = &vertices[0];
    ptrres = &result[0];

    unsigned long long ts = rdtsc();
    compute(ptrmat, ptrvert, ptrres, count);
    printf("elapsed ticks: %llu\n", rdtsc() - ts);

    for (i = 0; i < 24; i++)
        printf("%f ", result[i/3].vertices[i%3]);
    printf("\n");
    return 0;
}
Which extra memory/performance does C++ imply? The virtual call? It is an opt-in feature, so nothing stops you from not using it. Even when you want to keep a low-level codebase with a C "blend", C++ can give you advantages:
- const methods and references, where the compiler knows it can optimize away many computations. If you work with reference counting (smart pointers), using a constant reference to them will not add one reference and then destroy it
- templates are ugly to write, as assembly is, but once you fix the template errors, the result is a better, more restricted version of your C++ code, not the reverse (a more error-prone version, as when it is written in assembly). Because of this, many macros are safer written as templates, and the compiler can mostly inline them for small operations
- can you make a benchmark where a virtual call is slower than a call through a function pointer? In my understanding both run equally fast, but maybe I am wrong
- move semantics in C++11 can remove some copies during object creation (as specified), so the argument that RAII gives inefficient code because "a lot of copies" are made doesn't hold much water today. In fact GCC was optimizing away some copies much earlier, and you have to explicitly ask it *not* to optimize those copies away
Originally posted by gens View Post
tradeoffs all around
it just bothers me why people hate assembly
it's a different way of programming, maybe that seems hard to understand
the first time i programmed in C it was weird compared to the QB that i knew
PS when compilers get as good as humans, i'll change my mind to "assembly is a good way to learn about cpus and their bottlenecks"
As I've told you, I did assembly, but in another era, mostly when SSE (1) was just introduced, though I wasn't writing SSE code (as you don't write AVX either), so I know what it is all about. I knew 32-bit assembly fairly well (and 16-bit, as I studied it in university), and if I waited 3-4 years, in many cases I would get my tiny optimizations for free, or, if not, there was an assembly-written library on the internet that I could use in my project.
In fact I don't know QB (QuickBasic!?), but even the latest version of VB (6, not the .Net ones) had a compiler; in an alternate history where that compiler had been more nourished, VB could still be around today for its good performance.
Should assembly be scrapped? Of course not! I think that assembly should exist today (for performance reasons, not counting atomics) just as intrinsic primitives. Hopefully they would be as generic as Mono.SIMD is. There is no point in writing assembly today: most of the time you can write C++ code that is close to assembly performance, and you can write Java or C# code (which, if you ignore the slower startup time, is really close to that assembly too) without worrying about why the loop is not SIMDed, but about how to optimize your application.
Last edited by ciplogic; 08 May 2013, 02:03 AM.
Comment
-
intrinsics are as portable as any assembly
the thing that bothered me when writing them was that i didn't know exactly how many registers i had left
the good thing about them is that the compiler reorders the instructions
and again, you can thread assembly
i'll make a proper loop with avx when there's time, this sse one was not as optimized as it could be
i'm as entitled to talk about C++ as you are about assembly
so here is a paper on OO vs procedural programming
Comment
-
Originally posted by gens View Post
intrinsics are as portable as any assembly
the thing that bothered me when writing them was that i didn't know exactly how many registers i had left
the good thing about them is that the compiler reorders the instructions
and again, you can thread assembly
i'll make a proper loop with avx when there's time, this sse one was not as optimized as it could be
i'm as entitled to talk about C++ as you are about assembly
so here is a paper on OO vs procedural programming
http://scholar.lib.vt.edu/theses/ava...ted/thesis.pdf
6.4 Summary
From the data gathered on these three applications and from the above discussion, we may conclude that careful design in the OO paradigm can yield appreciable performance. We summarize below the most important points about OO design and performance issues:
And for this there are simple solutions, which the paper states:
- put inline functions in headers
- for a critical loop, make a static version of your code that calls the methods directly
- allocate objects on the stack and use references and constant references (move semantics will also help here)
If you read the paper, the runtime penalty was around 4.09% with the default design, but after taking the C++ benefits into account it was sometimes even faster (with a hand-tuned version). A 4% slowdown because you have multiple copies is ugly, but so is a leak because memory isn't freed automatically (a "feature" that is more likely to happen in C than in C++).
At last, -flto (or -O4) is mostly meant to address this: inlining across object boundaries and letting the compiler inline many small objects.
Originally posted by gens View Post
i'll make a proper loop with avx when there's time, this sse one was not as optimized as it could be
It looks to me that you have game programming in mind (I may be wrong though), and you can see that the GPU is the limiting resource for most games: http://www.anandtech.com/show/6934/c...tigpu-at-1440p (in a one-GPU configuration). Try to leverage this. When I played Crysis 2 (a great game btw), it was frustrating to play at just 1280x1024, but the CPU was not the issue. If you do CAD-like programming (though I doubt it), I can say that in big systems your updating logic matters much more. I say this because I worked on one, and when you have hundreds or thousands of pieces, some of which impact the others, it is more important to have a big framework that computes that impact. The language I was working in was C#. Before optimization, C# was around 30% of the runtime, but after some optimizations it was around 10% (look here for details, here [url=http://narocad.blogspot.com/2009/06/again-fixes-and-benchmark-part-ii.html]after the optimizations[/url], where I noticed that the slowest component was updating the tree view, the second was the C++ component, and the visualization engine in C++ was not written to handle that many shapes).
Comment
-
Assembly is still important to Linux in the kernel. Even today, the second most common language in the Linux kernel, after C, is assembly. Assembly gives Linus, Hartman and the others the ability to properly design, in detail, the parts of the kernel that would cause performance bottlenecks if written in C. This allows the designers and maintainers of Linux to make it extremely fast and efficient.
This is in contrast to all the BSDs, which wrote their entire OS in C even in places where using assembly is critical. The result? The BSDs are among the slowest OSes ever. Even slower than Windows.
Don't believe me? See it for yourselves: http://svn.freebsd.org/base/head/
If you do a `find ./ -name "*.asm" -print`, you find nothing. There are some *.S files, but it turns out that those are just extra baggage left over from when they copy-pasted AT&T code, which resulted in the USL vs BSDI lawsuit. These *.S files are never referenced in any of the Makefiles.
That, as well as the fact that the source tree is one big heavy pile of garbage full of spaghetti code, just shows what a crappy mess BSD is.
No wonder BSD kernels are so un-portable and slow. Worse, they are even trying to rewrite everything (including the kernel) in C++, all because of clang. What retards.
Comment
-
Originally posted by i386reaper View Post
Assembly is still important to Linux in the kernel. Even today, the second most common language in the Linux kernel, after C, is assembly. Assembly gives Linus, Hartman and the others the ability to properly design, in detail, the parts of the kernel that would cause performance bottlenecks if written in C. This allows the designers and maintainers of Linux to make it extremely fast and efficient.
This is in contrast to all the BSDs, which wrote their entire OS in C even in places where using assembly is critical. The result? The BSDs are among the slowest OSes ever. Even slower than Windows.
Don't believe me? See it for yourselves: http://svn.freebsd.org/base/head/
If you do a `find ./ -name "*.asm" -print`, you find nothing. There are some *.S files, but it turns out that those are just extra baggage left over from when they copy-pasted AT&T code, which resulted in the USL vs BSDI lawsuit. These *.S files are never referenced in any of the Makefiles.
That, as well as the fact that the source tree is one big heavy pile of garbage full of spaghetti code, just shows what a crappy mess BSD is.
No wonder BSD kernels are so un-portable and slow. Worse, they are even trying to rewrite everything (including the kernel) in C++, all because of clang. What retards.
FreeBSD (and Mac OS X) has to be slower than Linux for many reasons, including:
- many critical parts of Linux are compiled not as modules but into the kernel itself
- Linux has more manpower and more interest in being fast on supercomputers, so SGI and IBM contributed heavily to it
- a lot of hardware companies still profile and tune Linux (for example Intel)
- the file system (Ext4) is in general faster than FreeBSD's
- FreeBSD is compiled with an older GCC (4.2) because GCC 4.3's GPL3 license is not compatible with the BSD license
If you add all these together, it is to be expected, with or without assembly, that FreeBSD is slow(er than Linux).
Going back to the topic, "is assembly still relevant to most Linux software"? The Linux kernel is not most Linux software. A scan of the source clearly shows that Linux is 2.9% assembly (compared with 94.5% C). If you subtract the atomics, the system calls, the workarounds (like flushing the cache when a context switch happens) and the requests for the CPU to go into a lower power state (all of which require assembly; C won't make the cut), and multiply by the number of platforms Linux supports, the assembly usage is really minimal.
Source: https://www.ohloh.net/p/linux/analys...guages_summary
If we go to FreeBSD, the story is really similar, and there is assembly: 91.4% C / 2.4% assembly (similar rates to Linux), but the BSD kernel is many times smaller.
Comment
-
Beating the dead horse:
As for my own code, my typical workflow is something like:
- Write function in Objective-C. If fast enough, stop.
- Rewrite function to use a better algorithm or data structure. If fast enough, stop.
- Rewrite function in C. If fast enough, stop.
- Rewrite function in multi-threaded C with Grand Central Dispatch. If fast enough, stop.
- Rewrite function in OpenCL.
With JavaScript, optimization stops at step #2. Even with the promising new asm.js, that would get me to step #2.5 — still slower than C, and a far cry from multi-threaded C or OpenCL. Developing for Mac, I have more tools at my disposal for clearing out performance bottlenecks and delivering a superb user experience. (Xcode's profiler, by the way, is generally excellent.)
Comment