Announcement

**frign** · 02 April 2013, 07:18 PM

More improvements

Originally posted by Obscene_CNN

You can do several things in assembler faster than you can in C for one main reason: you are not constrained by C's rules. For example, in x86 assembler you can determine which of 4 values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory by using condition flags. Try doing any of these with C.
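For comparison, here is a minimal sketch of the closest standard C gets to the multi-value return mentioned above: returning a small struct by value. The type and function names are made up for illustration. Under the x86-64 System V calling convention a struct of two integers like this is normally handed back in a register pair rather than through memory, but that is up to the ABI and the compiler, not the language. The condition-flag return has no counterpart in standard C as far as I know.

```c
#include <assert.h>

/* Hypothetical example type: quotient and remainder returned together. */
struct divmod {
    long quot;
    long rem;
};

/* Returns two results by value, with no out-pointer parameter.
 * On x86-64 System V, a 16-byte struct of two integers like this one is
 * normally returned in the RAX:RDX register pair. */
static struct divmod divmod_long(long num, long den)
{
    assert(den != 0);   /* division by zero is undefined behaviour */
    struct divmod out = { num / den, num % den };
    return out;
}
```

A caller would write `struct divmod d = divmod_long(17, 5);` and read `d.quot` and `d.rem`.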

You don't see assembly much in programs today. In CS (today) they place an emphasis on one single operation per function. With everything split up into tiny functions you can't truly leverage assembly's power. Also, with everything split up into tiny functions, what gains you do get are overshadowed by function-call overhead. One reason why people don't bother to optimize their code today is that the bottleneck in performance is this coding philosophy of doing a single operation per function. In other words, if a program spends less than 1% of its time in any one function, the most you can gain by optimizing a single function is less than 1%.
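The 1% figure above is Amdahl's law in action. A rough sketch of the arithmetic, with a helper name of my own choosing:

```c
#include <assert.h>

/* Overall speedup of a program when a part that takes `fraction` of the
 * runtime (0..1) is made `local_speedup` times faster (Amdahl's law). */
static double overall_speedup(double fraction, double local_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

With `fraction = 0.01`, even an absurdly large `local_speedup` leaves the result just above 1.01: the program still needs about 99% of its original runtime.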

This problem is so bad that people don't even try to write fast tight code even in video drivers.

example of video driver code (changed to protect the identity of the guilty)

Code:

static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
{
        int i, r;
        struct some_gpu_bytecode_alu alu;

        for (i = 0; i < 4; i++) {
                memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

                alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

                alu.dst.sel = ctx->shader->input[input].gpr;
                alu.dst.write = 1;

                alu.dst.chan = i;

                alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
                alu.src[0].chan = i;

                if (i == 3)
                        alu.last = 1;
                r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                if (r)
                        return r;
        }
        return 0;
}

How it should be written for performance

Code:

static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
{
        int i, r;
        struct some_gpu_bytecode_alu alu;

        memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

        alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

        alu.dst.sel = ctx->shader->input[input].gpr;
        alu.dst.write = 1;

        alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;


        for (i = 0; i < 4; i++) {

                alu.dst.chan = i;

                alu.src[0].chan = i;

                if (i == 3)
                        alu.last = 1;
                r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                if (unlikely(r))
                        break;
        }
        return r; 
}

One might argue that it's the generated shader code that is the important part. However, slow CPU code and unneeded memory writes do delay the issue of the shader code to the GPU.
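The transformation above is ordinary loop-invariant hoisting. Below is a self-contained sketch of the same pattern so it can be compiled and compared outside a driver. Every name in it (`demo_alu`, `demo_buf`, `demo_emit`) is a made-up stand-in, not the real driver API. It also rests on an assumption: the emit call copies the record and never modifies it. If the callee wrote into the record, or if a field set in one pass had to be cleared for the next, hoisting the memset would change behaviour.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the instruction record and its buffer. */
struct demo_alu {
    int inst;
    int dst_sel;
    int dst_chan;
    int src_chan;
    int last;
};

struct demo_buf {
    struct demo_alu slots[4];
    int count;
};

/* Copies the record into the buffer; returns nonzero when the buffer is full. */
static int demo_emit(struct demo_buf *buf, const struct demo_alu *alu)
{
    assert(buf != NULL && alu != NULL);
    if (buf->count >= 4)
        return -1;
    buf->slots[buf->count++] = *alu;
    return 0;
}

/* Original shape: clear and refill the whole record on every pass. */
static int emit_naive(struct demo_buf *buf, int inst, int sel)
{
    for (int i = 0; i < 4; i++) {
        struct demo_alu alu;
        memset(&alu, 0, sizeof alu);
        alu.inst = inst;
        alu.dst_sel = sel;
        alu.dst_chan = i;
        alu.src_chan = i;
        if (i == 3)
            alu.last = 1;
        int r = demo_emit(buf, &alu);
        if (r)
            return r;
    }
    return 0;
}

/* Hoisted shape: invariant fields are written once, before the loop. */
static int emit_hoisted(struct demo_buf *buf, int inst, int sel)
{
    struct demo_alu alu;
    int r = 0;

    memset(&alu, 0, sizeof alu);
    alu.inst = inst;
    alu.dst_sel = sel;

    for (int i = 0; i < 4; i++) {
        alu.dst_chan = i;
        alu.src_chan = i;
        if (i == 3)
            alu.last = 1;   /* only ever set on the final pass */
        r = demo_emit(buf, &alu);
        if (r)
            break;
    }
    return r;
}
```

Both versions fill the buffer with identical records; the hoisted one does one memset and two invariant stores instead of four of each.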

I am itching to give my 2 cents here, as there is even more room for improvement, although these changes are quite minor compared to your initial proposal (Note: the CODE tag can be handy):

Code:

static (u)int_fastQ_t somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
{
     (u)int_fastQ_t r; // I don't know the range of r, 
                     // but it could be determined in the case 
                     // and limited to 16 bits or even 8

     struct some_gpu_bytecode_alu alu;

     memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

     alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

     alu.dst.sel = ctx->shader->input[input].gpr;
     alu.dst.write = 1;

     alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;

     for (uint_fast8_t i = 0; i < 4; i++) {
          alu.dst.chan = i;
          alu.src[0].chan = i;

          if (i == 3)
               alu.last = 1;
          r = some_alu_bytecode_add_alu(ctx->bc, &alu);
          if (unlikely(r))
               break;
     }
     return r; 
}

Sadly, the iteration order of the for-loop is fixed. I don't know the data range of "int input", but once it is known, one might be able to reduce its size accordingly as well.
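A small sketch of what the `uint_fastN_t` types from `<stdint.h>` do and do not promise, with helper names invented for the example. `uint_fast8_t` is only guaranteed to hold at least 8 bits; the implementation is free to pick a wider type if it considers that faster, so switching to it is not guaranteed to shrink anything. Narrowing a parameter is only safe once its range is actually known, which the assert below makes explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Sums the four channel indices 0..3 using a uint_fast8_t counter. */
static unsigned sum_channels(void)
{
    unsigned total = 0;
    for (uint_fast8_t i = 0; i < 4; i++)
        total += i;
    return total;
}

/* Narrows a value that is assumed to fit in 8 bits.
 * The assert documents (and checks in debug builds) that assumption. */
static uint_fast8_t to_fast8(int value)
{
    assert(value >= 0 && value <= 255);
    return (uint_fast8_t)value;
}
```

Whether this buys anything over a plain `int` depends on the target and the compiler; measuring is the only way to tell.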

More improvements would require knowing more about the specific project (I would be happy if you could PM me the name of the project; I would really like to know, and hope it's not Intel).

Did you commit your changes?

Best regards

FRIGN

**Obscene_CNN** · 02 April 2013, 08:11 PM

Did you commit your changes?

Best regards

FRIGN

I'm working on a patch but it needs more testing and verification before I release it.

Thanks for the tip.

Obscene_CNN

**kkaos** · 02 April 2013, 11:30 PM

Originally posted by lesterchester

(and all crackers use Linux because they would have learned that Linux is the safest)...

BSDs are so insecure that you probably only need to write your code in C, C++ or even shell to be successful.

BSDs are insecure? Really? Surely this is flamebait. If not, please explain to me how BSD is insecure, and if I'm correct in implying that you meant Linux is more secure than BSD, please tell me why that's the case.

EDITED TO ADD RESPONSE RELEVANT TO ARTICLE:
I don't think it's a secret that little Linux software and few drivers are developed in straight assembly. Isn't this why C was created years ago? To replace assembly for systems programming? And, yes, properly coded assembly should run faster than code written in any high-level language, but portability tends to be a higher-priority goal than speed; readability and ease of maintenance are also usually higher priorities.

**erendorn** · 03 April 2013, 03:04 AM

Originally posted by Obscene_CNN

Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.

Also, a commonly overlooked thing that is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy, and the more you have to do to accomplish a task, the shorter your battery lasts. Power management is a race to get to sleep.

Are you suggesting that they should have written Windows 8 in assembly?
Really?
That's ridiculous. With a project of that size (actually, any size above a couple of functions), there are no guarantees that the code would be smaller or faster, while it would for sure be a million times buggier and, more importantly, released in 2100, which end users might not like (and by that time, the 16GB of memory won't matter much).

ASM is ok for fine tuning a couple of bottleneck (emphasis) computation steps. It's not ok in any other situation.

**curaga** · 03 April 2013, 06:07 AM

@frign

It's eerily similar to Radeon code. Bears a striking resemblance, I could even say.

**frign** · 03 April 2013, 07:25 AM

Nice README

Originally posted by curaga

@frign

It's eerily similar to Radeon code. Bears a striking resemblance, I could even say.

The README of the Radeon-driver states "Abandon hope all ye who enter here".

**Obscene_CNN** · 03 April 2013, 01:53 PM

Originally posted by erendorn

Are you suggesting that they should have written Windows 8 in assembly?
Really?
That's ridiculous. With a project of that size (actually, any size above a couple of functions), there are no guarantees that the code would be smaller or faster, while it would for sure be a million times buggier and, more importantly, released in 2100, which end users might not like (and by that time, the 16GB of memory won't matter much).

ASM is ok for fine tuning a couple of bottleneck (emphasis) computation steps. It's not ok in any other situation.

I didn't suggest that Windows 8 be written in assembly, just that they abandon the crappy philosophy that ease of development far outweighs all other considerations. Hell, if they would just abandon the use of C++ templates it would cut the size to about a quarter of what it is now.

And to the contrary large projects are quite possible with assembly in a timely manor with very few bugs.

http://en.wikipedia.org/wiki/Roller_Coaster_Tycoon

**TemplarGR** · 03 April 2013, 02:05 PM

Originally posted by Obscene_CNN

I didn't suggest that Windows 8 be written in assembly, just that they abandon the crappy philosophy that ease of development far outweighs all other considerations. Hell, if they would just abandon the use of C++ templates it would cut the size to about a quarter of what it is now.

And to the contrary large projects are quite possible with assembly in a timely manor with very few bugs.

http://en.wikipedia.org/wiki/Roller_Coaster_Tycoon

This made my day...

BTW, since when were Roller Coaster Tycoon written in assembly? Any proof?

**GreatEmerald** · 03 April 2013, 02:22 PM

Originally posted by TemplarGR

This made my day...

A Timely Manor indeed.

**gens** · 03 April 2013, 02:30 PM

Originally posted by TemplarGR

This made my day...

BTW, since when were Roller Coaster Tycoon written in assembly? Any proof?

"they" were not
the first one was

also fasm is written entirely in fasm
it is cross platform and very fast and small

big projects can be made in assembly if planned well

with C++ you can make a working program way faster than in FASM
then you can spend years debugging it

and no, debugging assembly is not as hard as everybody who doesn't write assembly says
in some cases it is, but in most it is easier than debugging C (idk C++ well enough to talk about debugging it, all i know is there's at least 5 ways to write one thing)

Is Assembly Still Relevant To Most Linux Software?
