Announcement

**Obscene_CNN** · 02 April 2013, 02:29 PM

You can do several things in assembler faster than you can in C for one main reason, you are not constrained by C's rules. For example in x86 assembler you can determine which of 4 values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory by using condition flags. Try doing any of these with C.

You don't see assembly much in programs today. In CS (today) they place an emphasis on one single operation per function. With everything split up into tiny functions you can't truly leverage assembly's true power. Also with everything split up into tiny functions what gains you do get are overshadowed by function calling overhead. One reason why people don't bother to optimize their code today is the bottle neck in performance is this coding philosophy of doing a single operation per function. In other words if if a program spends less than 1% of its time in any function the most gain you can get by optimizing a function is less than 1%.

This problem is so bad that people don't even try to write fast tight code even in video drivers.

example of video driver code (changed to protect the identity of the guilty)

static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
{
int i, r;
struct some_gpu_bytecode_alu alu;

for (i = 0; i < 4; i++) {
memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

alu.dst.sel = ctx->shader->input[input].gpr;
alu.dst.write = 1;

alu.dst.chan = i;

alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
alu.src[0].chan = i;

if (i == 3)
alu.last = 1;
r = some_alu_bytecode_add_alu(ctx->bc, &alu);
if (r)
return r;
}
return 0;
}

How it should be written for performance

static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
{
int i, r;
struct some_gpu_bytecode_alu alu;

memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

alu.dst.sel = ctx->shader->input[input].gpr;
alu.dst.write = 1;

alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;

for (i = 0; i < 4; i++) {

alu.dst.chan = i;

alu.src[0].chan = i;

if (i == 3)
alu.last = 1;
r = some_alu_bytecode_add_alu(ctx->bc, &alu);
if (unlikely(r))
break;
}
return r;
}

One might argue that its the shader code generated that is the important part. However slow CPU code and unneeded memory writes do delay the issue of the shader code to the gpu.

**Wildfire** · 02 April 2013, 02:45 PM

Originally posted by Obscene_CNN View Post

You can do several things in assembler faster than you can in C for one main reason, you are not constrained by C's rules. For example in x86 assembler you can determine which of 4 values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory by using condition flags. Try doing any of these with C.

You're comparing apples and oranges. Sure _you_ can't do that with C, because you don't have to. The question is, what can an optimizing compiler do with C? Do you know for a fact that a modern compiler doesn't use any of those tricks when compiling C to ASM/machine code?

**Obscene_CNN** · 02 April 2013, 03:11 PM

Originally posted by Wildfire View Post

You're comparing apples and oranges. Sure _you_ can't do that with C, because you don't have to. The question is, what can an optimizing compiler do with C? Do you know for a fact that a modern compiler doesn't use any of those tricks when compiling C to ASM/machine code?

I have never encountered one. Can you show me one that can? (GCC can't)

I know that they can't do the tricks on returning more than one variable because it violates the C spec and will clobber debuggers.

**GreatEmerald** · 02 April 2013, 03:59 PM

The philosophy of writing small functions is for readability and reusability sake. Sure, it's not easy to optimise that, but the vast majority of programs don't need to be optimised much anyway. While the gains from writing code faster, having it easily readable and portable far outweigh any benefits the additional optimisations would give.

Of course, in case of performance-critical tasks (kernel, graphics drivers, etc.) it could make sense to sacrifice that for additional speed, yes. But those are not common cases in the least.

**Qaz`** · 02 April 2013, 04:03 PM

Originally posted by ldesnogu View Post

Sorry but given that intrinsics are CPU/ISA-specific, you'd still have to port your code, which is the point of the article: what x86 code exists that needs ARM64 porting?

That was more a reply to the discussion on the general the general usefulness of asm today, but even so I sure would prefer porting intrinsics-code over pure asm. If I'm the one to choose approach I prefer writing an abstraction layer for intrinsic code with force-inlines and macros if I have a lot of simd-code and a new platform.

**gens** · 02 April 2013, 04:17 PM

Originally posted by GreatEmerald View Post

Of course, in case of performance-critical tasks (kernel, graphics drivers, etc.) it could make sense to sacrifice that for additional speed, yes. But those are not common cases in the least.

databases, parts of 3D engines (probably gonna be moved to opengl, but still), simulators and everything needing specialized loops that make a big % in performance
firefox and webkit/chrome have plenty loops in assembly
oh, and HPC benefits from knowing the execution time of a loop as it can remove the need for semaphores (or whatever they call em)

Kazushige Goto - Wikipedia

http://en.wikipedia.org/wiki/Kazushige_Goto

bdw, different algorithms are best for different cpu's/coprocessors so writing a generic code can make some platforms not achieve best performance at all

**Obscene_CNN** · 02 April 2013, 04:20 PM

Originally posted by GreatEmerald View Post

The philosophy of writing small functions is for readability and reusability sake. Sure, it's not easy to optimise that, but the vast majority of programs don't need to be optimised much anyway. While the gains from writing code faster, having it easily readable and portable far outweigh any benefits the additional optimisations would give.

Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.

Also a common thing over looked which is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy and the more you have to do to accomplish a task the shorter your battery lasts. Power management is a race to get to sleep.

**hwertz** · 02 April 2013, 06:44 PM

I know that they can't do the tricks on returning more than one variable because it violates the C spec and will clobber debuggers.

You don't know that for sure -- unless you use a low -O option, and possibly -g, gcc doesn't worry about debuggers. And LTO (link time optimization) is not real commonly used yet, but with this used the compiler is not restricted to following C or C++ standards for passing (or returning) information from one function to the next (within your program -- of course, when calling a shared lib you do have to follow standards.) I've messed with LTO on gentoo and it builds most software correctly* (and nice speedups at times). Within a year or two I bet most distros use LTO -- essentially, packages won't follow C or C++ spec at all internally, just when making syscalls and library calls.

Originally posted by Obscene_CNN View Post

Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.

Also a common thing over looked which is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy and the more you have to do to accomplish a task the shorter your battery lasts. Power management is a race to get to sleep.

Hear hear! The "it's so fast, optimization doesn't matter anyway" attitude is a real shame. I'm glad that in fact most software on Linux, they DO worry about the performance of it and don't actually take this attitude. Of course, the biggest gain is from avoiding the use of inefficient algorithms... I.e. O(1) is better than O(n), which is better than O(n^2). This is where programs really fall apart in terms of bloat is when some piece of code uses exponentially more time than it should, as opposed to some few percent different they haven't shaved off.

----------------------
Back on topic...
I'm unsurprised by this result. I've been using Linux since about 1994, and in general it's been a little assembly in jpeg/png/etc. decoding (but not going to kill an ARM to not have this.. plus I think there *is* ARM assembly for this), video encoding and decoding, and so on, not like a text editor or whatever. I'm guessing, those programs (chess, bzip2, gzip, etc.) that vary a lot between compiler versions could have an ideal assembly-language version that's nice and fast... where a lot of programs where the compiler makes little different are within a % or 2 of the ideal assembly version (there's nothing tricky to speed them up.) ARM is I think even an easier case, the instruction set isn't as complex as x86's... if DSP or NEON (SSE-like instructions) won't help, then that's that.

*I don't think LTO build anything *incorrectly* any more, just fails to build a few packages.

**Sergio** · 02 April 2013, 07:10 PM

I've got a small project going on (http://realboyemulator.wordpress.com); it is an emulator for the Nintendo Game Boy. I had the opportunity to implement "the core" of the emulator in x86-64. Using Assembly was more a matter of challenge; it would've been way easier doing it in any high-level language. Anyway here are some advantages I found for using Assembly:

- I was able to keep frequently-accessed values in registers. For example, for simulating the instruction pointer, register %r13 was permanently used. This value is accessed ALL the time.
- I had control over the exact layout of the data structures. For example, an array of structures each describing a machine instruction of the Game Boy's CPU. Each entry is 32 bytes, so indexing can be done through a fast shifting instead of multiplying. I think that this simple trick can be considerable in overall, because it's part of the fetch-and-execute cycle, and this loop is executed for each instruction to emulate.
- Many more little tricks, some of them that take advantage of knowing particularities of the program.

I know that compilers can do all this simple tricks, and quite possibly a compiler would generate better code than what I have done with RealBoy, because there are lots of architectural things I just ignore. However, I get the impression that, in sum, one is capable of doing a better work at optimizing for a particular program. One can take into account various particularities and combine lots of tricks that I think the compiler would have to be human to be capable of doing.

**archibald** · 02 April 2013, 07:13 PM

Originally posted by Obscene_CNN View Post

Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.

You seem to be making the assumption that optimised code equates to smaller code.

End users can and will always complain that their use-case wasn't treated preferentially. An end-user will almost always say "this should be faster", but if presented with the choice "accept it as it is" vs. "go without new features X, Y and Z for 2 months whilst we figure out what's going on, figure out a way of improving it and then subject it to QA", how many of those end-users will take the second option?

I've taken calls from clients who demanded that I make X faster, yet when told "okay, I'll spend my time making X faster, but I'll have to stop working on Y to do so", decide that it isn't so important.

Bottom line: end users want the impossible: perfect software. I want to provide that, but I'm only human.

Announcement

Is Assembly Still Relevant To Most Linux Software?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment