
Thread: Is Assembly Still Relevant To Most Linux Software?

  1. #171
    Join Date
    May 2012
    Posts
    435

    Default

here's a funny comparison
http://www.hxa.name/minilight/#comparison

of course it's a bit subjective, but still interesting
OCaml has the best lines/performance ratio at 0.89 (vs. 1.00 for C), with "special Python" third at 1.55
C++ got 1.18 (79% of C's line count, but 50% slower)


and I agree we need a good profiler
perf is awesome but needs a good GUI (or integration into some IDE)
people like pretty bars and stuff
and I don't know if it can do thread timings

perf (y)

btw, I've seen smallpt get properly SSE-optimized by the compiler
(of course there are a couple of unnecessary loads/stores, but nothing too influential; and I didn't check all of it)
    Last edited by gens; 05-28-2013 at 09:50 AM.

  2. #172
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
here's a funny comparison
http://www.hxa.name/minilight/#comparison

of course it's a bit subjective, but still interesting
OCaml has the best lines/performance ratio at 0.89 (vs. 1.00 for C), with "special Python" third at 1.55
C++ got 1.18 (79% of C's line count, but 50% slower)
Do we read the same stuff?

From the article:
C was faster than C++ probably mostly because LLVM's link-time optimisation worked for C, but not for C++.
By "special Python" you most likely mean ShedSkin, and if you read its description you will notice that it outputs C++, which is then compiled. Since C++ did not work with LTO, and there is a translation penalty in the outdated version used, I'm surprised this data is taken as relevant. Why not try a more up-to-date version like 0.9.3!? Source: http://code.google.com/p/shedskin/downloads/list

Why not patch ShedSkin to use LTO too (-O4 for Clang++)!?

Thanks for the benchmarks, btw. Do you plan to write an assembly version and contribute it?

  3. #173
    Join Date
    May 2012
    Posts
    435

    Default

    Quote Originally Posted by ciplogic View Post
Do we read the same stuff?

Thanks for the benchmarks, btw. Do you plan to write an assembly version and contribute it?
note the "subjective"
also note that the C version was written by the author; most (if not all) of the others are contributed
the pseudocode is great, so they are probably similar

and no, mr. cynical, I didn't (tip: pure sarcasm would fit there better)
however, when I finish this school stuff I might
why not;
raytracers are relatively simple, and being computationally intensive and easy to vectorize they should benefit a lot from SIMD/MIMD
(I see some dubious loads/stores in the executable; maybe it's grand design, but... why)
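To illustrate the "easy to vectorize" point: a raytracer's inner math is mostly loops over contiguous data with no branches and independent iterations, the shape auto-vectorizers (and hand-written SIMD) handle well. A minimal sketch (not from the MiniLight source; the function name is invented for illustration):

```cpp
#include <cassert>
#include <cstddef>

// Dot products like this dominate ray-sphere and shading math.
// Contiguous arrays, no branches, independent iterations: a compiler
// at -O2/-O3 can typically turn the loop into packed SSE/AVX multiplies.
static float dot(const float* a, const float* b, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```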

btw: so what? a compiler fault is a compiler fault
it's not like you'd say "if the engine worked on that car, it would go faster"

PS: every computer has a theoretical computational power
and to quote from here:
"He later purchased a second Alpha-based computer and by rewriting the crucial subroutines was able to improve its performance to 78 percent of its theoretical peak calculating speed, up from 44 percent."

guess what BLAS libraries are usually written in (it's not C/C++)
    Last edited by gens; 05-29-2013 at 03:17 PM.

  4. #174
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
note the "subjective"
also note that the C version was written by the author; most (if not all) of the others are contributed
the pseudocode is great, so they are probably similar

(...) why not;
raytracers are relatively simple, and being computationally intensive and easy to vectorize they should benefit a lot from SIMD/MIMD
(I see some dubious loads/stores in the executable; maybe it's grand design, but... why)

btw: so what? a compiler fault is a compiler fault
it's not like you'd say "if the engine worked on that car, it would go faster"
But today (not when the benchmarks were made), the engine works. I do remember LTO having bugs in GCC (though in 4.8 they happen more rarely), but I very (read: very) seldom hit them with today's Clang++ (and I compile Qt applications with it). So even if the compiler carries a penalty, compared on a lines-of-code ratio, C++ would most likely still be the safer bet.

The other argument is also simple: as there are 8(!) releases between the ShedSkin version tested (0.1.1) and 0.9.3, it is likely that ShedSkin's performance improved. Even if it didn't, the compiler it uses (GCC or Clang++) would produce faster code (with LTO). Even by your counting, just upgrading the compiler to work with LTO would bring the scaling (lines of code / performance) close to the 1.0 mark.

(...) guess what BLAS libraries are usually written in (it's not C/C++)
What does BLAS have to do with anything?

So let's see anyway in which languages the BLAS routines are written.

I will use Wikipedia for convenience:
http://en.wikipedia.org/wiki/Basic_L...ra_Subprograms

The reference implementation is in C or in Fortran!

What about the other implementations? Some are assembly, some are Fortran 77, some are C++ ( http://en.wikipedia.org/wiki/Basic_L...mplementations ).
Of these implementations, the ones marked with a star use assembly:
Accelerate*, ACML*, C++ AMP BLAS, ATLAS, ESSL*, Eigen BLAS*, Goto BLAS*, HP MLIB*, Intel MKL*, MathKeisan*, Netlib BLAS, Netlib CBLAS, PDLIB/SX*, SCSL, Sun Performance Library*, SurviveGotoBLAS2, OpenBLAS*, cuBLAS
If my counting works, 11 implementations use assembly and 7 do not (with cuBLAS using CUDA).

If you also count the libraries with "BLAS-like functionality" ( http://en.wikipedia.org/wiki/Basic_L..._functionality ), which add 11 more implementations that are not assembly-oriented but C++ or OpenCL or derivatives, you will see that assembly may not be in the majority even for the hottest of the hot code.

But even if BLAS needs to be written in assembly, that still doesn't mean your raytracer has to be written in assembly, does it? Better to call BLAS routines that you know are optimized by vendors/compiler makers, instead of writing your own version.

  5. #175
    Join Date
    May 2012
    Posts
    435

    Default

k... the serious ones at least are in Fortran or asm (the Intel and AMD ones are in Fortran, the supercomputing ones in Fortran or asm)
the fastest being Goto BLAS

yes
you don't have to write anything in assembly, as we've already agreed a couple of times
if you've got a mature project, code refactored into libraries, and a good asm programmer...
today's CPUs are fast enough even for Java to work well interactively

btw
in asm I can write a loop that adapts its codepath to the CPU's instruction set and cache size at runtime
I can even go crazy and write a simple compiler for a loop, or just make it self-modifying (or self-analyzing)
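A minimal sketch of that kind of runtime codepath selection, in C++ rather than asm (the function names `sum_scalar`, `sum_wide`, and `resolve_sum` are invented for illustration; `__builtin_cpu_supports` is the GCC/Clang builtin for x86 feature checks):

```cpp
#include <cassert>
#include <cstddef>

using SumFn = float (*)(const float*, std::size_t);

// Baseline path: a plain scalar loop, correct on any CPU.
static float sum_scalar(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// "Wide" path: 4 accumulators, standing in for a real AVX2
// intrinsics version; written so the result matches the scalar path.
static float sum_wide(const float* v, std::size_t n) {
    float acc[4] = {0, 0, 0, 0};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int l = 0; l < 4; ++l) acc[l] += v[i + l];
    float s = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; ++i) s += v[i];   // leftover elements
    return s;
}

// Pick a code path once, at run time, based on what the CPU reports.
static SumFn resolve_sum() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2")) return sum_wide;
#endif
    return sum_scalar;
}
```

This is the same dispatch-at-startup idea; the asm version would just select between hand-written loops instead.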

why I like asm more than C++ is simple (and of course subjective)
in asm there's a limited number of instructions, written in stone (well, silicon, but not far off), with which I can do anything
in (pure) C++ I'd need to remember lots of things (and to optimize I'd need to know far more)
but that's just my very limited opinion
in a production environment C++/Java/etc. are of course better, as you get a working program faster

do what you want
everything has its good and bad sides

btw, I really can't understand why there's such a big difference with LTO, since it doesn't change the loop itself (and the loop is almost all of the execution time)
mysteries for another time
    Last edited by gens; 05-30-2013 at 01:29 PM.

  6. #176
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
    (...)
btw, I really can't understand why there's such a big difference with LTO, since it doesn't change the loop itself (and the loop is almost all of the execution time)
mysteries for another time
The loop over pixels matters, but not that much. Where LTO matters:
- inlining, at least when the code is split into multiple files (which is where LTO really comes into play) - the source code you've shown uses multiple files (.c or .cpp), so LTO can take advantage of this
- devirtualization (and with it, more opportunities for inlining)
- function cloning (basically, the compiler can decide that function X is always called with its 2nd argument set to the value 3, so it makes a clone of the original function with the value 3 baked in everywhere argument 2 is used)
- dead-code elimination is more aggressive, which leads to better cache locality (at least)
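What "function cloning" means can be shown in two lines (hypothetical functions, not from the benchmark): when every call site passes the same constant, the compiler can specialize the function for it, and the specialized body then folds further.

```cpp
#include <cassert>

// Original: a general function that, across the whole program,
// happens to be called only with k == 3.
int scale(int x, int k) { return x * k; }

// What the optimizer effectively produces: a clone with the constant
// folded in, now a candidate for inlining and strength reduction
// (x * 3 can become lea/shift-add on x86).
int scale_k3(int x) { return x * 3; }
```

LTO widens the scope of this: without it, the compiler can only prove "always called with 3" within one translation unit.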

in asm I can write a loop that adapts its codepath to the CPU's instruction set and cache size at runtime
I can even go crazy and write a simple compiler for a loop, or just make it self-modifying (or self-analyzing)

why I like asm more than C++ is simple (and of course subjective)
in asm there's a limited number of instructions, written in stone (well, silicon, but not far off), with which I can do anything
A fixed, limited instruction set is true for any language as a compilation target. In fact it is much easier to write optimizations for higher-level languages. Let me explain: in my free time I wrote a "mini .NET VM" which reads the MSIL/CIL language and, at least for the instructions it supports, translates them straightforwardly into a linear form of C++ (similar to the Linear IL on the Mono page).

    For this code in C#:
    Code:
    static void Main()
    {
                var IsPrime = 1;
                int a = 30;
                int b = 9 - a / 5;
                int c;
                if(IsPrime==0)
                    return;
    
                c = b * 4;
                if (c > 10)
                {
                    c = c - 10;
                }
                var result = c * (60 / a);
                Console.Write(result);
    }
it is translated into this C++ (no optimizations, operations kept as per the spec):

    Code:
    void (...)::Main() {
    
    System::Int32 local_0;
    System::Int32 local_1;
    System::Int32 local_2;
    System::Int32 local_3;
    System::Int32 local_4;
    System::Boolean local_5;
    System::Int32 vreg_1;
    System::Int32 vreg_2;
    System::Int32 vreg_3;
    System::Int32 vreg_4;
    System::Int32 vreg_5;
    System::Int32 vreg_6;
    System::Int32 vreg_7;
    System::Int32 vreg_8;
    System::Int32 vreg_9;
    System::Int32 vreg_10;
    System::Int32 vreg_11;
    System::Int32 vreg_12;
    System::Int32 vreg_13;
    System::Int32 vreg_14;
    System::Int32 vreg_15;
    System::Int32 vreg_16;
    System::Int32 vreg_17;
    System::Int32 vreg_18;
    System::Int32 vreg_19;
    System::Int32 vreg_20;
    System::Int32 vreg_21;
    System::Int32 vreg_22;
    System::Int32 vreg_23;
    System::Int32 vreg_24;
    System::Int32 vreg_25;
    System::Int32 vreg_26;
    System::Int32 vreg_27;
    System::Int32 vreg_28;
    System::Int32 vreg_29;
    System::Int32 vreg_30;
    System::Int32 vreg_31;
    
    vreg_1 = 1;
    local_0 = vreg_1;
    vreg_2 = 30;
    local_1 = vreg_2;
    vreg_3 = 9;
    vreg_4 = local_1;
    vreg_5 = 5;
    vreg_6 = vreg_4/vreg_5;
    vreg_7 = vreg_3-vreg_6;
    local_2 = vreg_7;
    vreg_8 = local_0;
    vreg_9 = 0;
    vreg_10 = (vreg_8 == vreg_9)?1:0;
    vreg_11 = 0;
    vreg_12 = (vreg_10 == vreg_11)?1:0;
    local_5 = vreg_12;
    vreg_13 = local_5;
    if(vreg_13) goto label_28;
    goto label_69;
    label_28:
    vreg_14 = local_2;
    vreg_15 = 4;
    vreg_16 = vreg_14*vreg_15;
    local_3 = vreg_16;
    vreg_17 = local_3;
    vreg_18 = 10;
    vreg_19 = (vreg_17 > vreg_18)?1:0;
    vreg_20 = 0;
    vreg_21 = (vreg_19 == vreg_20)?1:0;
    local_5 = vreg_21;
    vreg_22 = local_5;
    if(vreg_22) goto label_53;
    vreg_23 = local_3;
    vreg_24 = 10;
    vreg_25 = vreg_23-vreg_24;
    local_3 = vreg_25;
    label_53:
    vreg_26 = local_3;
    vreg_27 = 60;
    vreg_28 = local_1;
    vreg_29 = vreg_27/vreg_28;
    vreg_30 = vreg_26*vreg_29;
    local_4 = vreg_30;
    vreg_31 = local_4;
    System::Console::Write(vreg_31);
    
    label_69:
    return;
    }
But writing optimizations at the level of the high-level language can reduce it to this equivalent C++:
    Code:
    void (...)::Main() {
    
    System::Int32 local_0;
    System::Int32 local_1;
    System::Int32 local_2;
    System::Int32 local_3;
    System::Int32 local_4;
    System::Boolean local_5;
    
    System::Console::Write(4);
    
    return;
    }
    The optimization steps are:
    * DceLocalAssigned,
    * DceVRegAssigned
    * ConstantVariablePropagation()
    * ConstantVariableOperatorPropagation()
    * ConstantVariableBranchOperatorPropagation()
    * ConstantVariablePropagationInCall()
    * RemoveUnreferencedLabels()
    * DceLocalAssigned()
    * ConsecutiveLabels()
    * ConstantVariableBranchOperatorPropagation()
    * OperatorConstantFolding()
    * ConstantDfaAnalysis()
    * ReachabilityLines()
    * DeleteJumpNextLine()

The entire project has taken about 1 month so far (optimizations included), in my free time (around 2-4 h/day).

Some of these optimizations are (practically) impossible to do at the assembly level, like data-flow constant propagation (ConstantDfaAnalysis), yet the C# code for it is just under 200 lines (around 180 lines in this mini "VM" that outputs C++).
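A toy version of the constant-propagation-plus-folding passes listed above might look like this (a sketch in C++ rather than the VM's C#; `Instr`, `foldRun`, and `demoProgram` are invented names, and a real pass works on a control-flow graph, not straight-line code):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A toy linear IL instruction: either "dst = k" (a constant load)
// or "dst = a op b" over previously defined names.
struct Instr {
    std::string dst, a, b;
    char op;       // '+', '-', '*', '/' (ignored for constant loads)
    long k;        // constant value when isConst
    bool isConst;
};

// One combined ConstantVariablePropagation + OperatorConstantFolding
// sweep: since every operand is a known constant, every instruction
// folds, and the whole program collapses to its final value.
long foldRun(const std::vector<Instr>& code) {
    std::map<std::string, long> known;   // name -> known constant
    long last = 0;
    for (const Instr& in : code) {
        long v = in.isConst
            ? in.k
            : (in.op == '+') ? known.at(in.a) + known.at(in.b)
            : (in.op == '-') ? known.at(in.a) - known.at(in.b)
            : (in.op == '*') ? known.at(in.a) * known.at(in.b)
            :                  known.at(in.a) / known.at(in.b);
        known[in.dst] = v;
        last = v;
    }
    return last;   // value of the final assignment
}

// The C# snippet above, straight-lined with its branches resolved:
// a = 30; b = 9 - a/5; c = b*4; c = c - 10; result = c * (60/a)
std::vector<Instr> demoProgram() {
    return {
        {"a",   "",    "",    ' ', 30, true},
        {"c9",  "",    "",    ' ', 9,  true},
        {"c5",  "",    "",    ' ', 5,  true},
        {"t1",  "a",   "c5",  '/', 0,  false},  // 30/5
        {"b",   "c9",  "t1",  '-', 0,  false},  // 9-6
        {"c4",  "",    "",    ' ', 4,  true},
        {"c",   "b",   "c4",  '*', 0,  false},  // 3*4
        {"c10", "",    "",    ' ', 10, true},
        {"c",   "c",   "c10", '-', 0,  false},  // 12-10
        {"c60", "",    "",    ' ', 60, true},
        {"t2",  "c60", "a",   '/', 0,  false},  // 60/30
        {"r",   "c",   "t2",  '*', 0,  false},  // 2*2
    };
}
```

Run over the demo program this folds to 4, matching the `System::Console::Write(4)` in the optimized output above.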

  7. #177
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

Quote Originally Posted by gens View Post
(...)
in (pure) C++ I'd need to remember lots of things (and to optimize I'd need to know far more)
but that's just my very limited opinion
in a production environment C++/Java/etc. are of course better, as you get a working program faster
(...)
So which things do you need to remember to optimize C++? This is my list:
- use const and const references wherever possible: this is good both for performance and for code safety!
- use references (not pointers) for big objects. Big-object argument passing matters in C too, but references are known to be non-NULL, which is a great boon for most users. With modern C++ (C++11) you may not even need this, as move semantics avoids copying big objects by default.
- for functions you call often, put them in headers to give the compiler the opportunity to inline them. If you use the STL, this is done for you (since templates have to live in headers, they are very often inlined by most compilers)
- for "experts": write templates (or constexpr) that evaluate constant expressions at compile time
- for very tight code you may need to use intrinsics, or to write the code compiler-friendly, so the compiler "catches" the optimizations you want.
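The first and fourth items in a few lines (hypothetical names, just to make the list concrete; `constexpr` is the C++11 spelling of the compile-time-evaluation trick, which before C++11 required template metaprogramming):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Item 1/2: pass big objects by const reference - no copy is made,
// the callee cannot modify the argument, and unlike a pointer it
// cannot be null.
static std::size_t totalLength(const std::vector<std::string>& words) {
    std::size_t n = 0;
    for (const std::string& w : words) n += w.size();
    return n;
}

// Item 4: force evaluation at compile time. The static_assert proves
// the value is computed by the compiler, not at run time.
constexpr long pow3(int n) { return n == 0 ? 1 : 3 * pow3(n - 1); }
static_assert(pow3(4) == 81, "evaluated at compile time");
```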

Is there anything I missed?

Let's compare with what a compiler does, and what you have to write on your own in assembly:
- for jumps you have to take the instruction distance into account; it is better to use short jumps over long jumps
- you have to know the assembly instruction lengths (timing is less of an issue on OOO CPUs), e.g. to use xor over mov when setting a register to zero. This can change the jump kind
- you have to do the register coloring (an NP-hard problem) yourself to assign registers optimally. There is a huge body of literature on register-allocation strategies. "Linear Scan Register Allocation" (LSRA) is good enough for most purposes, but full coloring is also a good strategy: it increases register-allocation time by about 10x (per the comparison paper between the Client and Server compilers in Java's HotSpot) but gives something like 10-15% faster code. The greedy allocator of LLVM 3.0+ is a way to allocate registers better than LSRA without doing full coloring (it is very similar to the Stanford compiler course's full-RA algorithm, but it doesn't operate over the entire function). Good luck matching the compiler's register allocation by hand in a non-trivial function.
- for a medium/long function, you have to find by yourself all the redundancies and simplifications that a compiler would do for you (shown in the previous post)
- you have to arrange instructions so they are not inter-dependent, so that as many computations as possible execute in parallel in the out-of-order CPU pipeline
- you have to inline functions yourself, which also means re-assigning registers
- you don't have great profilers; as far as I know only Valgrind (Cachegrind) is basically free (and it works for C/C++ too). VTune-class tools are out of reach for hobby developers
- you have to take into account different CPU instruction sets, caches, CPU features, and so on
- and you have to do all of this safely
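For a flavor of what the register-allocation bullet involves, here is a stripped-down linear scan in the spirit of Poletto & Sarkar's LSRA (`Interval` and `linearScan` are illustrative names; a real allocator adds the spill-furthest-end heuristic, interval splitting, and register constraints):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A live interval: the range of instruction indices where a value is live.
struct Interval {
    int start, end;
    int reg;   // assigned physical register, or -1 = spilled to memory
};

// Minimal linear scan with K physical registers: walk intervals in
// order of start point, free registers whose interval has expired,
// and spill when no register is available.
void linearScan(std::vector<Interval>& ivs, int K) {
    std::sort(ivs.begin(), ivs.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });
    std::vector<Interval*> active;   // intervals currently holding a register
    std::vector<int> freeRegs;
    for (int r = K - 1; r >= 0; --r) freeRegs.push_back(r);
    for (Interval& iv : ivs) {
        // Expire intervals that ended before this one starts.
        for (auto it = active.begin(); it != active.end();) {
            if ((*it)->end < iv.start) {
                freeRegs.push_back((*it)->reg);
                it = active.erase(it);
            } else {
                ++it;
            }
        }
        if (freeRegs.empty()) { iv.reg = -1; continue; }  // spill
        iv.reg = freeRegs.back();
        freeRegs.pop_back();
        active.push_back(&iv);
    }
}
```

Even this toy shows the bookkeeping a human assembly programmer does implicitly for every value in a function.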

The list could be longer (e.g. calling conventions can be tuned; on the C++ side it is not such a big deal to write __fastcall before a function, while in assembly you have to honor them by hand), but in the main I'd say both lists fairly accurately reflect my understanding of what assembly and C++ optimization each imply.
