Is Assembly Still Relevant To Most Linux Software?


  • This was done to see how much real Assembly is being used, to see what the code was used for, and whether it was worth porting to 64-bit ARM / AArch64 / ARMv8...




    • Originally posted by gens View Post
      intrinsics are as portable as any assembly
      the thing that bothered me when writing them was that I didn't know exactly how many registers I had left
      the good thing about them is that the compiler reorders the instructions

      and again, you can thread assembly

      I'll make a proper loop with AVX when there's time; this SSE was not as optimized as it could be

      I'm as entitled to talk about C++ as you are about assembly
      so here is a paper on OO vs procedural programming
      http://scholar.lib.vt.edu/theses/ava...ted/thesis.pdf
      From the paper:
      6.4 Summary
      From the data gathered on these three applications and from the above discussion, we may
      conclude that careful design in OO paradigm can yield appreciable performance. We summarize
      below, the most important points about OO design and performance issues:
      The performance issues were also shown in the introduction of the paper: multiple function calls, virtual calls and the creation of small objects.
      And the paper states simple solutions for these:
      - put inline functions in headers
      - for a critical loop, make a static version of your code that calls methods directly
      - allocate objects on the stack and use references and constant references (move semantics will also help here)
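      The three fixes above can be sketched in C++. This is just an illustrative sketch with invented names (`Point`, `sum_dots`), not code from the paper:

```cpp
#include <vector>

// 1. Small, hot methods defined in the header (here: in-class), so the
//    compiler can inline the call away.
struct Point {
    double x, y;
    double dot(const Point& o) const { return x * o.x + y * o.y; }
};

// 2. Big objects passed by const reference (no copy), and
// 3. temporaries allocated on the stack (no heap allocation, no leak).
double sum_dots(const std::vector<Point>& pts) {
    Point origin{1.0, 1.0};        // stack object, freed automatically
    double total = 0.0;
    for (const Point& p : pts)     // const reference per element, no copies
        total += origin.dot(p);
    return total;
}
```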

      If you read the paper, the runtime penalty was around 4.09% with the default design, and after taking the C++ benefits into account, the hand-tuned version was sometimes even faster. A 4% slowdown because you make multiple copies is ugly, but so is a leak because memory isn't freed automatically (a "feature" that is more likely to happen in C than in C++).

      Lastly, -flto (or -O4) exists mostly to address this: it inlines across object-file boundaries and gives the compiler the chance to inline many small objects away.
      Originally posted by gens View Post
      I'll make a proper loop with AVX when there's time; this SSE was not as optimized as it could be
      As for me, I don't think it's worth the effort. Try it on your own program and see if you can improve your programming skills.

      It looks to me like you have game programming in mind (I may be wrong, though), and you can see that the GPU is the limiting resource for most games: http://www.anandtech.com/show/6934/c...tigpu-at-1440p (in a single-GPU configuration). Try to leverage that. When I played Crysis 2 (a great game, btw), it was frustrating to play it at just 1280x1024, but the CPU was not the issue. If you do CAD-like programming (though I doubt it), I can say that in big systems your updating logic matters much more. I say this because I worked on one, and when you have hundreds or thousands of parts, some of which impact the others, it is more important to have a good framework that computes that impact. The language I was working in was C#. C# was initially around 30% of the runtime, but after some optimizations it was down to around 10% (for details, look [url=http://narocad.blogspot.com/2009/06/again-fixes-and-benchmark-part-ii.html]here, after the optimizations[/url]), where I noticed that the slowest component was updating the tree view, the 2nd was the C++ component, and the visualization engine in C++ was not written to handle that many shapes.



      • Assembly is still important to Linux in the kernel. Even today, the second most common language in the Linux kernel is assembly, after C. Assembly gives Linus, Hartman and the others the ability to design in detail the parts of the kernel that would cause performance bottlenecks if written in C. This allows the designers and maintainers of Linux to make it extremely fast and efficient.

        This is in contrast to all the BSDs, which wrote their entire OS in C even in places where using assembly is critical. The result? The BSDs are among the slowest OSes ever. Even slower than Windows.

        Don't believe me? See it for yourselves: http://svn.freebsd.org/base/head/

        If you do a `find ./ -name "*.asm" -print`, you find nothing. There are some *.S files, but it turns out those are just extra baggage left over from when they copy-pasted the AT&T code that resulted in the USL vs BSDI lawsuit. These *.S files are never referenced in any of the Makefiles.

        That, plus the fact that the source tree is one big heavy pile of spaghetti code, just shows what a crappy mess BSD is.

        No wonder BSD kernels are so unportable and slow. Worse, they are even trying to rewrite everything (including the kernel) in C++, all because of clang. What retards.



        • Originally posted by i386reaper View Post
          Assembly is still important to Linux in the kernel. Even today, the second most common language in the Linux kernel is assembly, after C. Assembly gives Linus, Hartman and the others the ability to design in detail the parts of the kernel that would cause performance bottlenecks if written in C. This allows the designers and maintainers of Linux to make it extremely fast and efficient.

          This is in contrast to all the BSDs, which wrote their entire OS in C even in places where using assembly is critical. The result? The BSDs are among the slowest OSes ever. Even slower than Windows.

          Don't believe me? See it for yourselves: http://svn.freebsd.org/base/head/

          If you do a `find ./ -name "*.asm" -print`, you find nothing. There are some *.S files, but it turns out those are just extra baggage left over from when they copy-pasted the AT&T code that resulted in the USL vs BSDI lawsuit. These *.S files are never referenced in any of the Makefiles.

          That, plus the fact that the source tree is one big heavy pile of spaghetti code, just shows what a crappy mess BSD is.

          No wonder BSD kernels are so unportable and slow. Worse, they are even trying to rewrite everything (including the kernel) in C++, all because of clang. What retards.
          I think the BSD bashing is not warranted. Even though it may be written in C, the slowness of BSD (as in OS X) often lies not in the lack of assembly but in other factors; for example, some implementations had a big lock in the kernel. I am talking about things like that. In fact, a big part of Windows is written in C++ today, many parts have been written in C for a long time, and no one is complaining about how slow it is (some people still do, but again not because of how much assembly is used).

          FreeBSD (and Mac OS X) has to be slower than Linux for many reasons, including:
          - many critical parts of Linux are compiled not as loadable modules but into the kernel itself
          - Linux has more manpower and more interest in being fast on supercomputers, so SGI and IBM contributed heavily to it
          - a lot of hardware companies still profile and tune Linux (for example Intel)
          - its file system (Ext4) is in general faster than FreeBSD's
          - FreeBSD is compiled with an older GCC (4.2) because GCC 4.3's GPLv3 license is not compatible with the BSD license

          If you add all these together, it is to be expected, with or without assembly, that FreeBSD is slow(er than Linux).

          Going back to the topic, "is assembly still relevant to most Linux software?": the Linux kernel is not most Linux software. A scan of the source shows that Linux is 2.9% assembly (compared with 94.5% C). If you subtract the atomics, the system calls, workarounds like flushing the cache when a context switch happens, and requests for the CPU to go into a lower power state (all of which require assembly; C won't cut it), and spread what remains across all the platforms Linux supports, the assembly usage is really minimal.

          Source: https://www.ohloh.net/p/linux/analys...guages_summary

          If we go to FreeBSD, the story is really similar, and there is assembly: 91.4% C / 2.4% assembly (similar ratios to Linux), but the BSD kernel is many times smaller.

          https://www.ohloh.net/p/freebsd/anal...guages_summary



          • Beating the dead horse:
            As for my own code, my typical workflow is something like:
            • Write function in Objective-C. If fast enough, stop.
            • Rewrite function to use a better algorithm or data structure. If fast enough, stop.
            • Rewrite function in C. If fast enough, stop.
            • Rewrite function in multi-threaded C with Grand Central Dispatch. If fast enough, stop.
            • Rewrite function in OpenCL.


            With JavaScript, optimization stops at step #2. Even with the promising new asm.js, that would get me to step #2.5 — still slower than C, and a far cry from multi-threaded C or OpenCL. Developing for Mac, I have more tools at my disposal for clearing out performance bottlenecks and delivering a superb user experience. (Xcode's profiler, by the way, is generally excellent.)
            This article discusses why a developer uses OS X and why he doesn't use a web platform to deliver his application. So it is not about assembly, but I can report the same experience in C#, excluding the "rewrite the function in C" step, which most of the time makes no sense (Objective-C, like C#, has an overhead; for C++ that would not be the case). At least for my applications, I found that step 3 would be phrased as: "use NGen on the client machine".



            • here's a funny comparison
              http://www.hxa.name/minilight/#comparison

              of course it's a bit subjective, but still interesting
              OCaml has the best lines/performance ratio with 0.89, compared to 1 for C, with "special" Python third with 1.55
              C++ got 1.18 (79% of the lines of C, and 50% slower)


              and I agree we need a good profiler
              perf is awesome but needs a good GUI (or integration into some IDE)
              people like pretty bars and stuff
              and I don't know if it can do thread timings

              perf (y)

              btw I've seen smallpt being properly SSE-optimized by the compiler
              (of course there are a couple of unnecessary loads/stores, but nothing too influential; and I didn't check everything)
              Last edited by gens; 05-28-2013, 09:50 AM.



              • Originally posted by gens View Post
                here's a funny comparison
                http://www.hxa.name/minilight/#comparison

                of course it's a bit subjective, but still interesting
                OCaml has the best lines/performance ratio with 0.89, compared to 1 for C, with "special" Python third with 1.55
                C++ got 1.18 (79% of the lines of C, and 50% slower)
                Do we read the same stuff?

                From the article:
                C was faster than C++ probably mostly because LLVM's link-time optimisation worked for C, but not for C++.
                "Special Python" most likely means ShedSkin, and if you read the description of ShedSkin you will notice that it outputs C++ which it then compiles. Since C++ did not work with LTO, and there is a translation penalty with the outdated version used, I'm surprised this data is considered relevant. Why not try a more up-to-date version like 0.9.3? Source: http://code.google.com/p/shedskin/downloads/list

                Why not patch ShedSkin to use LTO too (-O4 for Clang++)?

                Thanks for the benchmarks, btw; do you plan to write an assembly version of it and contribute it?



                • Originally posted by ciplogic View Post
                  Do we read the same stuff?

                  Thanks for the benchmarks, btw; do you plan to write an assembly version of it and contribute it?
                  note the "subjective"
                  also note that the C version was written by the author; most (if not all) of the others are contributed
                  the pseudocode is great, so they are probably similar

                  and no, mr. cynical, I didn't (tip: pure sarcasm would fit there more nicely)
                  however, when I finish this school stuff I might
                  why not,
                  raytracers are relatively simple, and being computationally intensive and easy to vectorize, they should benefit a lot from SIMD/MIMD
                  (I see some dubious loads/stores in the executable; maybe it's grand design, but... why)

                  btw: so what? a compiler fault is a compiler fault
                  it's not like you're going to say "if the engine worked on that car, it would go faster"

                  PS: every computer has a theoretical computational power
                  and to quote from here
                  "He later purchased a second Alpha-based computer and by rewriting the crucial subroutines was able to improve its performance to 78 percent of its theoretical peak calculating speed, up from 44 percent."

                  guess what BLAS libraries are usually written in (it's not C/C++)
                  Last edited by gens; 05-29-2013, 03:17 PM.



                  • Originally posted by gens View Post
                    note the "subjective"
                    also note that the C version was written by the author; most (if not all) of the others are contributed
                    the pseudocode is great, so they are probably similar

                    (...)why not,
                    raytracers are relatively simple, and being computationally intensive and easy to vectorize, they should benefit a lot from SIMD/MIMD
                    (I see some dubious loads/stores in the executable; maybe it's grand design, but... why)

                    btw: so what? a compiler fault is a compiler fault
                    it's not like you're going to say "if the engine worked on that car, it would go faster"
                    But today (not when the benchmarks were made), the engine works. I do remember LTO having bugs in GCC (though in 4.8 they happen more rarely), but I very (read: very) seldom see them in today's Clang++ (and I compile Qt applications with it). So even if the compiler carries a penalty, compared on the lines-of-code ratio, C++ would most likely still be the safer bet.

                    The other argument is also simple: as there are 8(!) releases between the ShedSkin version tested (0.1.1) and 0.9.3, it is likely that ShedSkin's performance improved. Even if it didn't, the compiler it uses (GCC or Clang++) would produce faster code (with LTO). Even by your counting, just upgrading the compiler to work with LTO would bring the scaling (lines of code / performance) close to the 1.0 mark.

                    (...)guess what BLAS libraries are usually written in (it's not C/C++)
                    What does BLAS have to do with anything?

                    So let's see in which languages the BLAS routines are actually written.

                    I will use Wikipedia for convenience:
                    http://en.wikipedia.org/wiki/Basic_L...ra_Subprograms

                    The reference implementation is in C or Fortran!

                    What about the other implementations? Some are assembly, some Fortran 77, some C++ ( http://en.wikipedia.org/wiki/Basic_L...mplementations )
                    Of these implementations, the ones marked with a star use assembly:
                    Accelerate*, ACML*, C++ AMP BLAS, ATLAS, ESSL*, Eigen BLAS*, Goto BLAS*, HP MLIB*, Intel MKL*, MathKeisan*, Netlib BLAS, Netlib CBLAS, PDLIB/SX*, SCSL, Sun Performance Library*, SurviveGotoBLAS2, OpenBLAS*, cuBLAS
                    If my counting works, 11 implementations use assembly and 7 do not (with cuBLAS using CUDA).

                    If you also count the libraries with "BLAS-like functionality" ( http://en.wikipedia.org/wiki/Basic_L..._functionality ), which add 11 implementations that are not assembly-oriented but C++, OpenCL or derivatives, you will see that assembly may not even be in the majority for the hottest of the hot code.

                    But even if BLAS needs to be written in assembly, that still doesn't mean your raytracer has to be written in assembly, does it? Better to call BLAS routines that you know are optimized by vendors/compiler makers than to write your own version.



                    • OK... the serious ones at least are in Fortran or asm (the Intel and AMD ones are in Fortran; the supercomputing ones are in Fortran or asm)
                      the fastest one being GotoBLAS

                      yes
                      you don't have to write anything in assembly, as we already agreed a couple of times
                      if you've got a mature project, refactored code in libraries and a good asm programmer...
                      today's CPUs are fast enough even for Java to work well interactively

                      btw
                      in asm I can write a loop that adapts its code path to the CPU instruction set and cache size at runtime
                      I can even go crazy and make a simple compiler for a loop, or just make it self-modifying (or self-analyzing)
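                      The same runtime-dispatch idea can also be sketched in C++ instead of asm. This is only an illustrative sketch (the function names are invented) built on `__builtin_cpu_supports`, a real GCC/Clang builtin for x86:

```cpp
#include <cstddef>

// Plain scalar fallback that works on every CPU.
static long sum_scalar(const int* data, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Stand-in for a hand-vectorized AVX2 routine; it just forwards so the
// sketch stays self-contained and portable.
static long sum_avx2(const int* data, std::size_t n) {
    return sum_scalar(data, n);
}

// Pick a code path at runtime based on what the CPU supports. Production
// code would cache the choice in a function pointer (GCC's ifunc mechanism
// automates exactly this).
long sum_dispatch(const int* data, std::size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2(data, n);
#endif
    return sum_scalar(data, n);
}
```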

                      why I like asm more than C++ is simple (and of course subjective)
                      in asm there is a limited number of instructions, written in stone (silicon, but not far from it), with which I can do anything
                      in (pure) C++ I'd need to remember lots of things (and to optimize I'd need to know far more)
                      but that's just my very limited opinion
                      in a production environment C++/Java/etc. are of course better, as you get a working program faster

                      do what you want
                      everything has its good and bad sides

                      btw I really can't understand why there's such a big difference with LTO, since it doesn't change the loop itself (and the loop is almost all of the execution time)
                      mysteries for another time
                      Last edited by gens; 05-30-2013, 01:29 PM.



                      • Originally posted by gens View Post
                        (...)
                        btw I really can't understand why there's such a big difference with LTO, since it doesn't change the loop itself (and the loop is almost all of the execution time)
                        mysteries for another time
                        The pixel loop matters, but not that much. Where LTO matters is:
                        - inlining, at least when the code is split into multiple files (where LTO really comes into play); the source code you've shown uses multiple files (.c or .cpp), so LTO can take advantage of this
                        - devirtualization (and again, more opportunities for inlining)
                        - function cloning (basically, the compiler can decide that function X is always called with its 2nd argument set to the value 3, so a clone of the original function is made with the constant 3 inlined everywhere argument 2 is used)
                        - more aggressive dead code elimination, which leads to better cache locality (at least)
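                        For illustration, function cloning looks like this (names invented; the second function is written out by hand to show what the optimizer effectively emits when it proves the argument is always 4):

```cpp
// General-purpose function; imagine every call site in the program
// passes stride == 4.
int row_offset(int row, int stride) {
    return row * stride;
}

// The clone the optimizer effectively generates for that case, with the
// constant folded in (and often strength-reduced to a shift, row << 2).
int row_offset_stride4(int row) {
    return row * 4;
}
```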

                        in asm I can write a loop that adapts its code path to the CPU instruction set and cache size at runtime
                        I can even go crazy and make a simple compiler for a loop, or just make it self-modifying (or self-analyzing)

                        why I like asm more than C++ is simple (and of course subjective)
                        in asm there is a limited number of instructions, written in stone (silicon, but not far from it), with which I can do anything
                        Having a limited, well-defined set of operations is true for any language (that is what makes writing an efficient compiler possible at all). In fact, it is much easier to write optimizations for a higher-level language. Let me explain: in my free time I wrote a "mini .NET VM" which reads the MSIL/CIL language and, at least for the instructions it supports, translates it into a straightforward linear-form C++ (similar to the Linear IL on the Mono page).

                        For this code in C#:
                        Code:
                        static void Main()
                        {
                                    var IsPrime = 1;
                                    int a = 30;
                                    int b = 9 - a / 5;
                                    int c;
                                    if(IsPrime==0)
                                        return;
                        
                                    c = b * 4;
                                    if (c > 10)
                                    {
                                        c = c - 10;
                                    }
                                    var result = c * (60 / a);
                                    Console.Write(result);
                        }
                        This is translated into the following C++ (no optimizations, operations kept up to spec):

                        Code:
                        void (...)::Main() {
                        
                        System::Int32 local_0;
                        System::Int32 local_1;
                        System::Int32 local_2;
                        System::Int32 local_3;
                        System::Int32 local_4;
                        System::Boolean local_5;
                        System::Int32 vreg_1;
                        System::Int32 vreg_2;
                        System::Int32 vreg_3;
                        System::Int32 vreg_4;
                        System::Int32 vreg_5;
                        System::Int32 vreg_6;
                        System::Int32 vreg_7;
                        System::Int32 vreg_8;
                        System::Int32 vreg_9;
                        System::Int32 vreg_10;
                        System::Int32 vreg_11;
                        System::Int32 vreg_12;
                        System::Int32 vreg_13;
                        System::Int32 vreg_14;
                        System::Int32 vreg_15;
                        System::Int32 vreg_16;
                        System::Int32 vreg_17;
                        System::Int32 vreg_18;
                        System::Int32 vreg_19;
                        System::Int32 vreg_20;
                        System::Int32 vreg_21;
                        System::Int32 vreg_22;
                        System::Int32 vreg_23;
                        System::Int32 vreg_24;
                        System::Int32 vreg_25;
                        System::Int32 vreg_26;
                        System::Int32 vreg_27;
                        System::Int32 vreg_28;
                        System::Int32 vreg_29;
                        System::Int32 vreg_30;
                        System::Int32 vreg_31;
                        
                        vreg_1 = 1;
                        local_0 = vreg_1;
                        vreg_2 = 30;
                        local_1 = vreg_2;
                        vreg_3 = 9;
                        vreg_4 = local_1;
                        vreg_5 = 5;
                        vreg_6 = vreg_4/vreg_5;
                        vreg_7 = vreg_3-vreg_6;
                        local_2 = vreg_7;
                        vreg_8 = local_0;
                        vreg_9 = 0;
                        vreg_10 = (vreg_8 == vreg_9)?1:0;
                        vreg_11 = 0;
                        vreg_12 = (vreg_10 == vreg_11)?1:0;
                        local_5 = vreg_12;
                        vreg_13 = local_5;
                        if(vreg_13) goto label_28;
                        goto label_69;
                        label_28:
                        vreg_14 = local_2;
                        vreg_15 = 4;
                        vreg_16 = vreg_14*vreg_15;
                        local_3 = vreg_16;
                        vreg_17 = local_3;
                        vreg_18 = 10;
                        vreg_19 = (vreg_17 > vreg_18)?1:0;
                        vreg_20 = 0;
                        vreg_21 = (vreg_19 == vreg_20)?1:0;
                        local_5 = vreg_21;
                        vreg_22 = local_5;
                        if(vreg_22) goto label_53;
                        vreg_23 = local_3;
                        vreg_24 = 10;
                        vreg_25 = vreg_23-vreg_24;
                        local_3 = vreg_25;
                        label_53:
                        vreg_26 = local_3;
                        vreg_27 = 60;
                        vreg_28 = local_1;
                        vreg_29 = vreg_27/vreg_28;
                        vreg_30 = vreg_26*vreg_29;
                        local_4 = vreg_30;
                        vreg_31 = local_4;
                        System::Console::Write(vreg_31);
                        
                        label_69:
                        return;
                        }
                        But writing the optimizations at the high level can reduce the code to this equivalent C++:
                        Code:
                        void (...)::Main() {
                        
                        System::Int32 local_0;
                        System::Int32 local_1;
                        System::Int32 local_2;
                        System::Int32 local_3;
                        System::Int32 local_4;
                        System::Boolean local_5;
                        
                        System::Console::Write(4);
                        
                        return;
                        }
                        The optimization steps are:
                        * DceLocalAssigned,
                        * DceVRegAssigned
                        * ConstantVariablePropagation()
                        * ConstantVariableOperatorPropagation()
                        * ConstantVariableBranchOperatorPropagation()
                        * ConstantVariablePropagationInCall()
                        * RemoveUnreferencedLabels()
                        * DceLocalAssigned()
                        * ConsecutiveLabels()
                        * ConstantVariableBranchOperatorPropagation()
                        * OperatorConstantFolding()
                        * ConstantDfaAnalysis()
                        * ReachabilityLines()
                        * DeleteJumpNextLine()

                        The entire project has taken about 1 month so far (optimizations included), in my free time (about 2-4 h/day).

                        Some of the optimizations are (practically) impossible to do at the assembly level, like dataflow constant propagation (ConstantDfaAnalysis), yet the C# code for it is just under 200 lines (about 180 lines in this mini "VM" that outputs C++).
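                        For illustration, here is a minimal sketch of what one such constant-propagation pass can look like, over an invented three-address form similar to the vreg code above (the representation and all names are mine, not from the actual project):

```cpp
#include <map>
#include <string>
#include <vector>

// Toy three-address instruction, loosely modeled on the vreg form above.
struct Instr {
    std::string dst;   // destination vreg, e.g. "vreg_7"
    std::string op;    // "const", "copy", "add", "sub", "mul" or "div"
    std::string a, b;  // source vregs ("" when unused)
    long imm;          // constant payload, used when op == "const"
};

// One pass of constant variable propagation: walk the code once, remember
// which vregs hold known constants, and fold operations whose inputs are
// all known into "const" instructions.
void propagate_constants(std::vector<Instr>& code) {
    std::map<std::string, long> known;
    for (Instr& ins : code) {
        if (ins.op == "const") {
            known[ins.dst] = ins.imm;
        } else if (ins.op == "copy" && known.count(ins.a)) {
            ins = Instr{ins.dst, "const", "", "", known[ins.a]};
            known[ins.dst] = ins.imm;
        } else if (known.count(ins.a) && known.count(ins.b)) {
            long x = known[ins.a], y = known[ins.b], v;
            if      (ins.op == "add") v = x + y;
            else if (ins.op == "sub") v = x - y;
            else if (ins.op == "mul") v = x * y;
            else if (ins.op == "div") v = x / y;
            else { known.erase(ins.dst); continue; }  // unknown op: forget dst
            ins = Instr{ins.dst, "const", "", "", v};
            known[ins.dst] = v;
        } else {
            known.erase(ins.dst);  // dst no longer holds a known constant
        }
    }
}
```

                        A real pass set would iterate passes like these (plus dead-code elimination and branch folding) until nothing changes, which is how the Write(4) result above falls out.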



                        • Originally posted by gens View Post
                          (...)
                          in (pure) C++ I'd need to remember lots of things (and to optimize I'd need to know far more)
                          but that's just my very limited opinion
                          in a production environment C++/Java/etc. are of course better, as you get a working program faster
                          (...)
                          So which things do you need to remember to optimize C++? This is my list:
                          - use const and const references wherever possible: this is good both for performance and for code safety!
                          - use references (not pointers) for big objects. This is also true for C, but references are known to be non-NULL, which is a great boon for most users. If you use modern C++ (C++11), you may not even need to do this, as move semantics avoids copying big objects by default.
                          - for functions you call often, put them in headers to give the compiler the opportunity to inline them. If you use the STL, this is done for you (as templates have to live in headers, they are very often inlined by most compilers)
                          - for "experts": write templates that evaluate constant expressions at compile time
                          - for very tight code you may need to use intrinsics, or to write the code in a compiler-friendly way, so the compiler "catches" the optimizations you want

                          Is there anything I missed?
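                          A small sketch of the "experts" point about computing constants at compile time (illustrative names): the classic template trick, next to the `constexpr` spelling that C++11 added for the same job:

```cpp
// Classic template metaprogramming: the compiler computes the value while
// instantiating the template, so only the final constant reaches runtime.
template <unsigned N>
struct Factorial {
    static const unsigned long value = N * Factorial<N - 1>::value;
};
template <>
struct Factorial<0> {
    static const unsigned long value = 1;
};

// The modern (C++11) spelling of the same idea: an ordinary-looking
// function that is evaluated at compile time in constant expressions.
constexpr unsigned long factorial(unsigned n) {
    return n == 0 ? 1 : n * factorial(n - 1);
}

// Both are checked at compile time; no runtime multiplications happen here.
static_assert(Factorial<10>::value == 3628800, "folded at compile time");
static_assert(factorial(10) == 3628800, "folded at compile time");
```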

                          Let's compare that with what a compiler does for you, all of which you would have to do yourself in hand-written assembly:
                          - for jumps you have to take into account the instruction distance; it is better to use short jumps over long jumps
                          - you have to know the instruction lengths (timing is less of an issue on out-of-order CPUs), e.g. using XOR instead of MOV to set a register to zero. This can change the jump kind
                          - you have to do the register coloring (an NP-hard problem) yourself to assign registers optimally. There is a huge body of literature on register allocation strategies. The "Linear Scan Register Allocator" (LSRA) is good enough for most purposes, but full coloring is also a good strategy: it increases register allocation time by around 10x (in the paper comparing the Client and Server compilers in Java's HotSpot) but gives around 10-15% faster code. The greedy strategy of LLVM 3.0+ is a way to allocate registers better than LSRA without doing full coloring (it is very similar to the full-RA algorithm from Stanford's compiler course, but it doesn't operate on the entire function). Good luck matching that register allocation by hand in a non-trivial function.
                          - for a medium/long function, you have to find by yourself all the redundancies and simplifications that a compiler does for you (shown in the previous post)
                          - you have to arrange instructions so they are not inter-dependent, so that as much computation as possible executes in the out-of-order CPU pipeline
                          - you have to inline functions yourself, which also implies re-assigning registers
                          - you don't have such great profilers; as far as I know only Valgrind (Cachegrind) is basically free (it works for C/C++ too), and VTune's tools are out of reach for hobby developers
                          - you have to take into account different CPU instruction sets, caches, CPU features and so on
                          - you have to do all of this, and you have to do it safely

                          The list could be longer (for example, calling conventions can be tuned; it is not such a big deal to write __fastcall before a function), which can be done on the C++ side (and has to be handled by hand in assembly too), but in the main I'd say these lists fairly accurately reflect my understanding of what assembly and C++ optimization each imply.

