Is Assembly Still Relevant To Most Linux Software?


  • #31
    You can do several things in assembler faster than you can in C for one main reason: you are not constrained by C's rules. For example, in x86 assembler you can determine which of four values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory location, by using the condition flags. Try doing any of these in C.
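    As a minimal GNU C sketch of the one-compare trick (my own illustration, x86-64 assumed; the value set {-1, 0, 1, 2} is picked just to make the flags distinguishable):

```c
#include <assert.h>

/* "Four values, one compare": a single CMP against 1 leaves flags
 * from which the signed (l), unsigned (b) and equal (e) conditions
 * together distinguish -1, 0, 1 and 2.  x86-64 GNU C only; the
 * function and value set are illustrative, not from any real driver. */
static int classify(int x)
{
    unsigned char eq, below, less;
    __asm__ ("cmpl $1, %3\n\t"
             "sete %0\n\t"   /* ZF set:      x == 1                     */
             "setb %1\n\t"   /* CF set:      x unsigned-below 1 => x==0 */
             "setl %2"       /* SF != OF:    x signed-less than 1       */
             : "=q" (eq), "=q" (below), "=q" (less)
             : "r" (x)
             : "cc");
    if (eq)    return 1;
    if (below) return 0;    /* below unsigned AND signed: x == 0       */
    if (less)  return -1;   /* signed-less but not unsigned-below: -1  */
    return 2;               /* above both ways: x == 2                 */
}
```

    One compare, three flag reads. A C compiler is free to generate something like this internally, but the C source language gives you no way to ask for it directly.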

    You don't see much assembly in programs today. CS curricula now place an emphasis on a single operation per function. With everything split up into tiny functions you can't truly leverage assembly's power, and what gains you do get are overshadowed by function-call overhead. One reason people don't bother to optimize their code today is that this one-operation-per-function philosophy is itself the performance bottleneck: if a program spends less than 1% of its time in any one function, the most you can gain by optimizing that function is less than 1%.
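    That last sentence is just Amdahl's law. A quick sketch of the arithmetic (my illustration, not from the post): if a fraction f of the runtime is sped up by factor s, the overall speedup is 1/((1-f)+f/s), so at f = 0.01 even an infinite s caps the gain at about 1%.

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: overall speedup when a fraction f of total runtime
 * is accelerated by a factor s. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

    With f = 0.01, the bound is 1/0.99, roughly 1.0101: about a 1% gain no matter how good the hand-written assembly is.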

    This problem is so bad that people don't even try to write fast tight code even in video drivers.

    example of video driver code (changed to protect the identity of the guilty)

    Code:
    static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
    {
            int i, r;
            struct some_gpu_bytecode_alu alu;

            for (i = 0; i < 4; i++) {
                    memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

                    alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

                    alu.dst.sel = ctx->shader->input[input].gpr;
                    alu.dst.write = 1;
                    alu.dst.chan = i;

                    alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
                    alu.src[0].chan = i;

                    if (i == 3)
                            alu.last = 1;
                    r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                    if (r)
                            return r;
            }
            return 0;
    }

    How it should be written for performance

    Code:
    static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
    {
            int i, r;
            struct some_gpu_bytecode_alu alu;

            memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));

            alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;

            alu.dst.sel = ctx->shader->input[input].gpr;
            alu.dst.write = 1;

            alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;

            for (i = 0; i < 4; i++) {
                    alu.dst.chan = i;
                    alu.src[0].chan = i;

                    if (i == 3)
                            alu.last = 1;
                    r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                    if (unlikely(r))
                            break;
            }
            return r;
    }

    One might argue that it's the generated shader code that is the important part. However, slow CPU code and unneeded memory writes do delay issuing that shader code to the GPU.

    Comment


    • #32
      Originally posted by Obscene_CNN View Post
      You can do several things in assembler faster than you can in C for one main reason: you are not constrained by C's rules. For example, in x86 assembler you can determine which of four values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory location, by using the condition flags. Try doing any of these in C.
      You're comparing apples and oranges. Sure _you_ can't do that with C, because you don't have to. The question is, what can an optimizing compiler do with C? Do you know for a fact that a modern compiler doesn't use any of those tricks when compiling C to ASM/machine code?
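      For what it's worth, one of those tricks is something the compiler already does behind C's back: on the x86-64 SysV ABI, a small struct is returned in the RAX:RDX register pair, so two values come back without any pointer. A minimal sketch (the divmod function is my own illustration, not from the thread):

```c
#include <assert.h>

struct pair { long quot; long rem; };

/* On the x86-64 SysV ABI this 16-byte struct is returned in the
 * RAX:RDX register pair -- two values, no pointer -- even though the
 * C source never mentions registers. */
static struct pair divmod(long a, long b)
{
    struct pair p = { a / b, a % b };
    return p;
}
```

      So "return more than one value without a pointer" is exactly what an optimizing compiler emits here; the C abstraction just hides the registers from you.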

      Comment


      • #33
        Originally posted by Wildfire View Post
        You're comparing apples and oranges. Sure _you_ can't do that with C, because you don't have to. The question is, what can an optimizing compiler do with C? Do you know for a fact that a modern compiler doesn't use any of those tricks when compiling C to ASM/machine code?
        I have never encountered one. Can you show me one that can? (GCC can't)

        I know that they can't do the trick of returning more than one value, because it violates the C calling convention and will break debuggers.

        Comment


        • #34
          The philosophy of writing small functions is for the sake of readability and reusability. Sure, it's not easy to optimise that, but the vast majority of programs don't need much optimisation anyway, and the gains from writing code faster and having it easily readable and portable far outweigh any benefits the additional optimisations would give.

          Of course, in case of performance-critical tasks (kernel, graphics drivers, etc.) it could make sense to sacrifice that for additional speed, yes. But those are not common cases in the least.

          Comment


          • #35
            Originally posted by ldesnogu View Post
            Sorry but given that intrinsics are CPU/ISA-specific, you'd still have to port your code, which is the point of the article: what x86 code exists that needs ARM64 porting?
            That was more a reply to the discussion on the general usefulness of asm today, but even so I sure would prefer porting intrinsics code over pure asm. If I'm the one choosing the approach, I prefer writing an abstraction layer for the intrinsics code, with force-inlines and macros, if I have a lot of SIMD code and a new platform.

            Comment


            • #36
              Originally posted by GreatEmerald View Post
              Of course, in case of performance-critical tasks (kernel, graphics drivers, etc.) it could make sense to sacrifice that for additional speed, yes. But those are not common cases in the least.
              Databases, parts of 3D engines (probably going to be moved to OpenGL, but still), simulators, and everything with specialized loops that account for a big percentage of the run time.
              Firefox and WebKit/Chrome have plenty of loops in assembly.
              Oh, and HPC benefits from knowing the execution time of a loop, as that can remove the need for semaphores (or whatever they call them).

              http://en.wikipedia.org/wiki/Kazushige_Goto

              Btw, different algorithms are best for different CPUs/coprocessors, so writing generic code can leave some platforms short of their best possible performance.
              Last edited by gens; 04-02-2013, 04:20 PM.

              Comment


              • #37
                Originally posted by GreatEmerald View Post
                The philosophy of writing small functions is for readability and reusability sake. Sure, it's not easy to optimise that, but the vast majority of programs don't need to be optimised much anyway. While the gains from writing code faster, having it easily readable and portable far outweigh any benefits the additional optimisations would give.
                Your end user may differ in opinion. Especially when they are the ones who have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface, where half of the flash was eaten up by the base software.

                Also, a commonly overlooked point that matters more and more today is power consumption. Memory reads/writes and instruction cycles take energy, and the more you have to do to accomplish a task, the shorter your battery lasts. Power management is a race to get to sleep.

                Comment


                • #38
                  I know that they can't do the trick of returning more than one value, because it violates the C calling convention and will break debuggers.
                  You don't know that for sure -- unless you use a low -O option, and possibly -g, gcc doesn't worry about debuggers. And LTO (link-time optimization) is not commonly used yet, but with it the compiler is not restricted to the standard C or C++ calling conventions for passing (or returning) information from one function to the next within your program -- of course, when calling a shared lib you do have to follow the conventions. I've messed with LTO on Gentoo and it builds most software correctly* (with nice speedups at times). Within a year or two I bet most distros will use LTO -- essentially, packages won't follow the C or C++ calling conventions at all internally, just when making syscalls and library calls.

                  Originally posted by Obscene_CNN View Post
                  Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.

                  Also a common thing over looked which is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy and the more you have to do to accomplish a task the shorter your battery lasts. Power management is a race to get to sleep.
                  Hear hear! The "it's so fast, optimization doesn't matter anyway" attitude is a real shame. I'm glad that for most software on Linux people DO worry about its performance and don't actually take this attitude. Of course, the biggest gain is from avoiding inefficient algorithms -- i.e. O(1) is better than O(n), which is better than O(n^2). Where programs really fall apart into bloat is when some piece of code uses asymptotically more time than it should, as opposed to the few percent the author hasn't shaved off.
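                  A toy illustration of that point (my own example, with an arithmetic series standing in for real work): no amount of micro-tuning the loop body competes with replacing the loop outright.

```c
#include <assert.h>

/* O(n): sums 1..n one addition at a time. */
static unsigned long sum_loop(unsigned long n)
{
    unsigned long s = 0;
    for (unsigned long i = 1; i <= n; i++)
        s += i;
    return s;
}

/* O(1): the closed form n*(n+1)/2.  Picking this over sum_loop is the
 * algorithmic win that dwarfs any instruction-level tuning. */
static unsigned long sum_formula(unsigned long n)
{
    return n * (n + 1) / 2;
}
```

                  Hand-optimizing sum_loop in assembly might buy a constant factor; switching to sum_formula removes the loop entirely.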

                  ----------------------
                  Back on topic...
                  I'm unsurprised by this result. I've been using Linux since about 1994, and in general the assembly has been a little in jpeg/png/etc. decoding (not that its absence would kill an ARM port -- plus I think there *is* ARM assembly for this), video encoding and decoding, and so on -- not in a text editor or whatever. I'm guessing those programs (chess, bzip2, gzip, etc.) that vary a lot between compiler versions could have an ideal assembly-language version that's nice and fast, whereas the many programs where the compiler makes little difference are within a percent or two of the ideal assembly version (there's nothing tricky left to speed them up). ARM is, I think, an even easier case; the instruction set isn't as complex as x86's... if DSP or NEON (SSE-like instructions) won't help, then that's that.

                  *I don't think LTO builds anything *incorrectly* any more; it just fails to build a few packages.
                  Last edited by hwertz; 04-02-2013, 06:47 PM.

                  Comment


                  • #39
                    I've got a small project going on (http://realboyemulator.wordpress.com); it is an emulator for the Nintendo Game Boy. I had the opportunity to implement the core of the emulator in x86-64 assembly. Using assembly was more a matter of challenge; it would've been way easier in any high-level language. Anyway, here are some advantages I found in using assembly:

                    - I was able to keep frequently-accessed values in registers. For example, for simulating the instruction pointer, register %r13 was permanently used. This value is accessed ALL the time.
                    - I had control over the exact layout of the data structures. For example, an array of structures, each describing a machine instruction of the Game Boy's CPU. Each entry is 32 bytes, so indexing can be done with a fast shift instead of a multiply. I think this simple trick is considerable overall, because it's part of the fetch-and-execute cycle, and that loop is executed for every instruction emulated.
                    - Many more little tricks, some of which take advantage of knowing particularities of the program.
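                    The 32-byte-entry trick above can be sketched like this (field names are my own guesses for illustration, not RealBoy's actual layout): padding each descriptor to a power-of-two size means indexing compiles to base + (op << 5), a shift rather than a multiply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction descriptor padded to exactly 32 bytes so
 * that table indexing is base + (opcode << 5).  The fields are
 * illustrative only; the padding formula keeps sizeof == 32 on both
 * 32- and 64-bit targets. */
struct op_desc {
    void (*handler)(void);                         /* emulation routine */
    unsigned char length;                          /* instruction bytes */
    unsigned char cycles;                          /* machine cycles    */
    unsigned char pad[32 - sizeof(void (*)(void)) - 2];
};

static struct op_desc table[256];

/* Address of entry `op` via an explicit shift -- the same arithmetic
 * the compiler emits for &table[op] when the element size is a power
 * of two. */
static struct op_desc *lookup(unsigned op)
{
    return (struct op_desc *)((char *)table + ((size_t)op << 5));
}
```

                    In the fetch-and-execute loop this saves a multiply per emulated instruction, which is where a constant-factor trick like this can actually add up.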

                    I know that compilers can do all these simple tricks, and quite possibly a compiler would generate better code than what I have done with RealBoy, because there are lots of architectural things I simply ignore. However, I get the impression that, in sum, a human is capable of doing a better job of optimizing a particular program. One can take into account various particularities and combine lots of tricks in ways that I think a compiler would have to be human to match.

                    Comment


                    • #40
                      Originally posted by Obscene_CNN View Post
                      Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.
                      You seem to be making the assumption that optimised code equates to smaller code.

                      End users can and will always complain that their use-case wasn't treated preferentially. An end-user will almost always say "this should be faster", but if presented with the choice "accept it as it is" vs. "go without new features X, Y and Z for 2 months whilst we figure out what's going on, figure out a way of improving it and then subject it to QA", how many of those end-users will take the second option?

                      I've taken calls from clients who demanded that I make X faster, yet when told "okay, I'll spend my time making X faster, but I'll have to stop working on Y to do so", decide that it isn't so important.

                      Bottom line: end users want the impossible: perfect software. I want to provide that, but I'm only human.

                      Comment


                      • #41
                        More improvements

                        Originally posted by Obscene_CNN View Post
                        You can do several things in assembler faster than you can in C for one main reason: you are not constrained by C's rules. For example, in x86 assembler you can determine which of four values you have with only one compare. On just about all processor architectures you can return more than one value without using a pointer. In assembly you can also return a result without using a single register or memory location, by using the condition flags. Try doing any of these in C.

                        You don't see much assembly in programs today. CS curricula now place an emphasis on a single operation per function. With everything split up into tiny functions you can't truly leverage assembly's power, and what gains you do get are overshadowed by function-call overhead. One reason people don't bother to optimize their code today is that this one-operation-per-function philosophy is itself the performance bottleneck: if a program spends less than 1% of its time in any one function, the most you can gain by optimizing that function is less than 1%.

                        This problem is so bad that people don't even try to write fast tight code even in video drivers.

                        example of video driver code (changed to protect the identity of the guilty)

                        Code:
                        static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
                        {
                                int i, r;
                                struct some_gpu_bytecode_alu alu;
                        
                                for (i = 0; i < 4; i++) {
                                        memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));
                        
                                        alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;
                        
                                        alu.dst.sel = ctx->shader->input[input].gpr;
                                        alu.dst.write = 1;
                        
                                        alu.dst.chan = i;
                        
                                        alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
                                        alu.src[0].chan = i;
                        
                                        if (i == 3)
                                                alu.last = 1;
                                        r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                                        if (r)
                                                return r;
                                }
                                return 0;
                        }
                        How it should be written for performance

                        Code:
                        static int somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
                        {
                                int i, r;
                                struct some_gpu_bytecode_alu alu;
                        
                                memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));
                        
                                alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;
                        
                                alu.dst.sel = ctx->shader->input[input].gpr;
                                alu.dst.write = 1;
                        
                                alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
                        
                        
                                for (i = 0; i < 4; i++) {
                        
                                        alu.dst.chan = i;
                        
                                        alu.src[0].chan = i;
                        
                                        if (i == 3)
                                                alu.last = 1;
                                        r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                                        if (unlikely(r))
                                                break;
                                }
                                return r; 
                        }
                        One might argue that it's the generated shader code that is the important part. However, slow CPU code and unneeded memory writes do delay issuing that shader code to the GPU.
                        I am itching to give my 2 cents here, as there is even more room for improvement, though it's quite minor compared to your initial proposal (note: the CODE tag can be handy):

                        Code:
                        static (u)int_fastQ_t somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
                        {
                             (u)int_fastQ_t r; // I don't know the range of r, 
                                             // but it could be determined in the case 
                                             // and limited to 16 bits or even 8
                        
                             struct some_gpu_bytecode_alu alu;
                        
                             memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));
                        
                             alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;
                        
                             alu.dst.sel = ctx->shader->input[input].gpr;
                             alu.dst.write = 1;
                        
                             alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
                        
                             for (uint_fast8_t i = 0; i < 4; i++) {
                                  alu.dst.chan = i;
                                  alu.src[0].chan = i;
                        
                                  if (i == 3)
                                       alu.last = 1;
                                  r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                                  if (unlikely(r))
                                       break;
                             }
                             return r; 
                        }
                        Sadly, the order of the for-loop is fixed. Looking at "int input", but not knowing its data range, one might also be able to reduce its size accordingly.

                        More improvements would actually require knowing more about the specific project (I would be happy if you could PM me the name of the project; I really would like to know (and hope it's not Intel)).

                        Did you commit your changes?

                        Best regards

                        FRIGN
                        Last edited by frign; 04-02-2013, 07:33 PM.

                        Comment


                        • #42
                          Did you commit your changes?

                          Best regards

                          FRIGN
                          FRIGN,

                          I'm working on a patch but it needs more testing and verification before I release it.

                          Thanks for the tip.

                          Obscene_CNN

                          Comment


                          • #43
                            Originally posted by lesterchester View Post
                            (and all crackers use Linux because they would have learned that Linux is the safest)...

                            BSDs are so insecure that you probably only need to write your code in C, C++ or even shell to be successful.
                            BSDs are insecure? Really? Surely this is flamebait. If not, please explain to me how BSD is insecure, and if I'm correct in implying that you meant Linux is more secure than BSD, please tell me why that's the case.

                            EDITED TO ADD RESPONSE RELEVANT TO ARTICLE:
                            I don't think it's a secret that little Linux software and few drivers are developed in straight assembly. Isn't this why C was created years ago -- to replace assembly for systems programming? And yes, properly coded assembly should run faster than code written in any high-level language, but portability tends to be a higher priority than speed; readability and ease of maintenance are also usually higher priorities.
                            Last edited by kkaos; 04-02-2013, 11:37 PM.

                            Comment


                            • #44
                              Originally posted by Obscene_CNN View Post
                              Your end user may differ in opinion. Especially when they are the ones that have to wait for it and pay for the hardware to store and run it. Take a look at Microsoft's Surface where half of the flash was eaten up by the base software.

                              Also a common thing over looked which is more and more important today is power consumption. Memory reads/writes and instruction cycles take energy and the more you have to do to accomplish a task the shorter your battery lasts. Power management is a race to get to sleep.
                              Are you suggesting that they should have written Windows 8 in assembly?
                              Really?
                              That's ridiculous. With a project of that size (actually, any size above a couple of functions), there are no guarantees that the code would be smaller or faster, while it would for sure be a million times buggier and, more importantly, released in 2100, which end users might not like (and by that time, the 16GB of memory won't matter much).

                              ASM is OK for fine-tuning a couple of *bottleneck* computation steps. It's not OK in any other situation.

                              Comment


                              • #45
                                @frign

                                It's eerily similar to Radeon code. Bears a striking resemblance, I could even say.

                                Comment
