Thread: Is Assembly Still Relevant To Most Linux Software?

  1. #61
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
    i started this hobby in the time of... i guess gcc 4.6.something
    gcc has changed since then and i didnt look at disassembly's in a while

    well anyway
    memcpy in gcc is builtin, meaning it will just copy a template function
    (...)
    So, to be sure I understand: you accept that hand-writing a MemCopy-like function in the past was a bad idea, because today the builtin functions do better, right? And even if you had written that code as a near-expert back in the glory days of the 32-bit Intel Pentium 1, the version that GCC provides today might still be better.
    Quote Originally Posted by gens View Post
    one example is here, where the author asked for help and ended with code twice as fast as fortran
    Yes, but it also shows how good the compiler was:
    Thank for any help in optimizing. Seems that I am
    bad at assembler Wink

    edit: I have followed fortran program and rearranged
    loops so that magnitude calculation can be vectorized
    and now program executes at 16 seconds which is
    still two seconds behind fortran Wink

    edit2: did some minor arrangement's of code so now it executes
    at Intel Fortran's speed.
    Also, if you look at when the posts were made, it took 2 months (!) to get a solution for this simple NBody program. Of course, maybe tomorrow someone will port it to OpenCL and it will run 10 times faster than the assembly code while using less CPU in the process: http://developer.apple.com/library/m...ion/Intro.html

    Or this video: https://www.youtube.com/watch?v=r1sN1ELJfNo

    Quote Originally Posted by gens View Post
    about matrix multiply
    i did write a 3x3 matrix with 1x3 matrix multiply, albeit in intrinsics
    problem was sse processes 4 floats at a time and theres lots of 3x3 matrices
    my solution was to load an extra number from the next matrix (with shuffles) and do 4 matrices in a loop (4x3=12, that is dividable by 4 giving 3 steppes)
    idk how a compiler can come up with this solution, especially since it dosent know that that loop will process thousands of matrices
    funny thing is i had a lot more problems with pointers in C++ (im bad at C++) and then the hard drive failed
    So you had a 3x3 matrix, and you understood that an SSE register packs four 32-bit floats.

    What would stop you from adding this extra float to make the multiplications easy on SSE? (At least if it shows up in profiling.)

    I mean: why not help the compiler a bit where you know it is weaker? If you know that C++ cannot optimize across object files, you should not define too many getters anywhere other than in headers, right? And if you are thinking "why not use assembly to improve things?", I think the reason not to is that tomorrow ARM may be popular, or that even on an Intel-compatible processor the instruction sequence has to be different (AMD and Intel often have different caches and latencies, so the compiler can help a lot with tuning for one CPU or another).
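
    The padding idea above can be sketched in C. This is only an illustration under assumptions: the names vec4f, mat3_padded, and transform_all are made up here and are not anyone's code from this thread, and whether a given compiler actually vectorizes the loop depends on its version and flags. The trade-off is memory: with 4-byte floats each vector grows from 12 to 16 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 3-component vector padded to 4 floats, so that one vector
   fills exactly the four float lanes of a 128-bit SSE register. */
typedef struct {
    float x, y, z, pad;   /* pad is unused filler */
} vec4f;

/* Hypothetical 3x3 matrix stored as three padded rows. */
typedef struct {
    vec4f row[3];
} mat3_padded;

/* Computes out[i] = m * in[i] for n vectors.
   restrict promises the compiler that in and out do not overlap. */
void transform_all(const mat3_padded *m, const vec4f *restrict in,
                   vec4f *restrict out, size_t n)
{
    assert(m != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        const vec4f v = in[i];
        out[i].x = m->row[0].x * v.x + m->row[0].y * v.y + m->row[0].z * v.z;
        out[i].y = m->row[1].x * v.x + m->row[1].y * v.y + m->row[1].z * v.z;
        out[i].z = m->row[2].x * v.x + m->row[2].y * v.y + m->row[2].z * v.z;
        out[i].pad = 0.0f;   /* keep the filler lane well defined */
    }
}
```

    The data layout is the whole point of the sketch: the loop body stays plain C, and the compiler is free to pick SSE, AVX, or NEON for it.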

    also about cache
    true that a compiler respects cache lines and L2 cache size, but it also gives out a long unwound bunch of machine code
    also there is no specification about cache sizes (-mcpu dosent help since its a flag for an architecture, and one can have different cache sizes for one architecture)

    i think i can and in at least one case i did
    took me longer that it would in higher level languages, but that loop was running 5% of total cpu time and i had nothing better to do
    I want to make it clear: there are cases, as you said, where a loop takes even more than 5% of CPU time and assembly seems to be the answer, but far more often it is an issue of application design. I remember from my own past that when I did OpenGL, I was pushing triangle by triangle (as the mainstream tutorials of the year 2000 showed), never using VBOs, glLoadMatrix, etc., and a lot of processing was done on the CPU. Today, if you know that you have a lot of processing, you may want to write it in Java and run it distributed in the cloud, which will give you the answer correctly and fast, or use OpenCL, or use all cores, and not care which does what.

    I know that NBody (in the case you showed) is not multi-core aware, but just last month at work I had some native code that had to do a massive processing task (basically cracking passwords). Moving to multi-core was much easier with Java code, and in the end, basically just by using "Java -server" and all cores, it ran 6-8 times faster. Of course, that is 1-core C++ vs. Java on 4 HT cores (4 x 2). I also think that Java optimized very well the code path that happens to run in that specific project, so I wouldn't draw a conclusion like "Java is faster than C++" or anything of the sort. Also, the C++ and Java versions used different frameworks to do the password checking. I am sure you could point out many inefficiencies that could be rewritten in assembly to get 3x faster than Java, and 20 times faster than using a C++ framework for that specific task. But of course no sane developer will rewrite a big C++ library in assembly, optimize it, and finally make it multi-core just for my sake. I also consider that even if there were a multi-core C++ version (there is none for now, for my specific problem), I would still prefer Java, even if C++ were, let's say, 10% faster and the task is time-consuming, for the reason I gave you in a previous post: Java 8 will get some performance updates, and if not, Java 9 with Project Jigsaw will. With a compiled C++ version, I'm stuck (I would have to recompile every time). Finally, Java would allow me to move my code into the cloud, so I can have "infinite scalability". Do you know of any cloud that lets you run assembly code?

    Finally, optimization is often misdirected, as this guy put it fairly nicely three years ago:
    http://pl.atyp.us/wordpress/index.ph...because-its-c/
    Last edited by ciplogic; 04-05-2013 at 05:48 PM.

  2. #62
    Join Date
    May 2012
    Posts
    560

    Default

    Quote Originally Posted by ciplogic View Post
    So, to be sure I understand: you accept that hand-writing a MemCopy-like function in the past was a bad idea, because today the builtin functions do better, right? And even if you had written that code as a near-expert back in the glory days of the 32-bit Intel Pentium 1, the version that GCC provides today might still be better.
    memcopy on P1 was a lot simpler as you didnt have sse that adds to complications with alignment
    so yes, a compiler could do a good job

    thing is you have to tell the compiler what it can use
    but most of the time a generic function is used that is "rep movsd/q"

    Quote Originally Posted by ciplogic View Post
    Yes, but it also shows how good the compiler was:

    Also, if you look at when the posts were made, it took 2 months (!) to get a solution for this simple NBody program. Of course, maybe tomorrow someone will port it to OpenCL and it will run 10 times faster than the assembly code while using less CPU in the process: http://developer.apple.com/library/m...ion/Intro.html
    also later he posted
    I have finally made reciprocal sqrt implementation and yes,
    it is fast Wink
    Done sse2 version which executes in 8.5 seconds! on q6600
    vs 14 seconds for fastest fortran on site Wink
    Also done avx version which is much faster than sse version Wink
    ofc, this is a specialized program

    it took me months to make a competitive memcpy
    its a hobby, theres no time table and i was still learning so leaving it alone for weeks happens
    and yes, openCL is great for lots of math
    but still a gpu is bad at logic so theres for example fast cpu only raytracing and fast gpu only with different algorithms
    i dont count opencl gpu/cpu as it needs to be fitted to an architecture to get the maximum


    Quote Originally Posted by ciplogic View Post
    What would stop you from adding this extra float to make the multiplications easy on SSE? (At least if it shows up in profiling.)
    its a bunch of blends for vertices
    preparing them for sse... i guess it would be possible but it would need more logic in preparation and ugly changing of structs

    Quote Originally Posted by ciplogic View Post
    Finally, optimization is often misdirected, as this guy put it fairly nicely three years ago:
    http://pl.atyp.us/wordpress/index.ph...because-its-c/
    true
    theres always been an argument about it
    still sometimes when your done with a program, you profile it, you find bottlenecks/bugs, you work on that
    and so forth
    then if you have a loop that you think you can make faster in assembly (for whatever reason, be it sse, size or just plain cheating) you can try at least

    i mean
    its no wonder glibc has so much assembly
    its gonna do thousands of loops on thousands of computers
    you could say they are saving the environment by saving energy

    also x264 is a good example
    it can encode HD in real time for streaming, on a modern cpu ofc

    ofc writing a program that will run as fast as it can on many cpus is a lot of work
    and not all software needs that couple % at the cost of mixing languages
    but it is simple to write a function as a static library, the compiler should just paste it there as it would with any other code
    only drawback i can think of is that the compiler might (or might not) have to prepare the parameters, thats only one extra MOV per parameter once
    Last edited by gens; 04-06-2013 at 12:33 AM.

  3. #63
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
    memcopy on P1 was a lot simpler as you didnt have sse that adds to complications with alignment
    so yes, a compiler could do a good job
    The point I wanted to make is that 5 years from now your compiler/runtime may produce better code than your assembly. For cryptographic routines, the CPU may support extra extensions, and today you may need to write assembly to access them, but it is better to move this into an upstream library, so that every case where the CPU supports them gets optimized. So keeping the assembly OUT of your project code means your code always stays optimized.
    Quote Originally Posted by gens View Post
    (...)
    i dont count opencl gpu/cpu as it needs to be fitted to an architecture to get the maximum
    I didn't count it either; I simply noted that NBody, for example, can be computed on the GPU, and you can get a better result in the end if you have a similarly complex computation. This is why supercomputers work with CUDA (and rarely with OpenCL) today. If you need the best NBody performance today because you depend on it, you can ask your customer to buy a $100-200 video card to do this processing. Read carefully: I didn't say that GPGPU computation is a golden hammer; if it were, I wouldn't be using Java myself 2 paragraphs later. I said that assembly code would be a much less wise choice than C, as C would be a better starting point for an OpenCL application.



    its a bunch of blends for vertices
    preparing them for sse... i guess it would be possible but it would need more logic in preparation and ugly changing of structs
    No, what I basically said is this: if you acknowledge that a 4x4 matrix would work fast where your 3x3 case would not (and the compiler will SIMD this multiplication), why not copy your data back and forth from one matrix into the other? I've also told you that many matrix multiplications can be done on the GPU (if you're using OpenGL, look into glLoadMatrix), so if you just need to display them and you have 1,000,000 vertices to move, use a vertex shader, or at least this glLoadMatrix call, which would apply the matrix multiplications with 0% CPU usage.

    (...)
    i mean
    its no wonder glibc has so much assembly
    its gonna do thousands of loops on thousands of computers
    you could say they are saving the environment by saving energy
    GLibC in fact doesn't use that much assembly, and when it does, it is, as you said, at the runtime level where there is no other way around it: for example, to offer atomic operations for smart pointers (I noticed this in the Boost library).

    I've heard a lot about saving energy with C++ (like Herb Sutter's "Why C++" talk last year, where he cited better performance per watt), and I can say that yes, C++ would give you this wattage saving (compared to PHP, for example, and in big .Net applications), but I don't see how assembly would stack up for the job. C++ is tedious by today's programming standards, so how could assembly gain more acceptance?
    also x264 is a good example
    it can encode HD in real time for streaming, on a modern cpu ofc
    Sandy Bridge and NVidia (I think this is also true of AMD) offer most H264 encoding operations in hardware, so calling their SDKs directly would likely improve the encoding. Mainstream (paid) video editors like PowerDirector use all the hardware you can throw at them: http://www.cyberlink.com/products/po...on9_en_US.html (this is not a product advertisement!). The same goes for Sony Vegas processing: http://www.sonycreativesoftware.com/...puacceleration

    This video shows that GPU acceleration gives a 4x speedup for a commercial video product: http://www.youtube.com/watch?v=6x5TAoo6JWI

    ofc writing a program that will run as fast as it can on many cpus is a lot of work
    and not all software needs that couple % at the cost of mixing languages
    but it is simple to write a function as a static library, the compiler should just paste it there as it would with any other code
    only drawback i can think of is that the compiler might (or might not) have to prepare the parameters, thats only one extra MOV per parameter once
    Writing a program to work on all CPUs can be tedious, in C++ even more tedious, and in assembly harder still. In C# you mostly write some lambdas and Parallel.ForEach(...), or use the async/await keywords. In 5 lines of code (and with Java's poor man's lambdas: anonymous objects) the regular developer can use a thread queue (like here: http://stackoverflow.com/questions/9...d-simple-queue ). Yes, C++ has more low-level control, and we can argue that Java is written in C++, but the HotSpot optimizer will also remove some locking primitives when it is safe (see this Wikipedia article, which sums it up nicely: http://en.wikipedia.org/wiki/Java_pe...ock_coarsening ).

    Yes, by going low-level maybe I could get better performance: if I recompiled all the libraries I need, wrote my own thread pooling, and made a memory pool to reduce cache misses and increase allocation speed, I would *maybe* get better performance than Java, and writing the critical parts in assembly would gain some more speed. But that would take something like 1 year to implement, and by then Java 8 would have launched, and the 30% speedup of C++ would be just a 15% speedup.

    Finally, you missed one point too: C++ code is not always multi-threaded by default. Many people did not write it that way. For example, excluding BeOS (or HaikuOS), basically every widget toolkit requires the UI to run in the main thread. So in the end it is not about how you could potentially improve others' code by making nice use of multiple cores, but about how to patch upstream libraries so they are able to use these cores.

    Like in my case: there are Java libraries that are thread-safe and C++ ones that aren't. If you need such an external library to check something, maybe your C++ code cannot run multi-threaded because the object you need to share with this specific library cannot be shared nicely between threads (in my case it was a big document object). I could spend a week making this library multi-threaded, and maybe another week writing my thread queue, only to notice in the end that there is another bug. I ported the algorithm to Java in 2 days (and I'm not a Java developer, albeit I'm a decent C++ and C# developer) and I get a 6 to 8x speedup.

    I basically don't see how assembly would make any difference here; can you explain to me how? How would you make a C++ library multi-threaded when it isn't?
    Last edited by ciplogic; 04-06-2013 at 05:03 AM.

  4. #64
    Join Date
    Feb 2008
    Location
    Linuxland
    Posts
    5,196

    Default

    Like in my case: there are Java libraries that are thread-safe and C++ ones that aren't.
    That sounds like the Java wrapper for the library does the locking itself, meaning one of two things:
    - making the library thread-safe is easy, or
    - it is not worth it due to locking overhead, so you'll get little gains using the Java wrapper on multicore, and decreased performance using the Java wrapper on singlecore

  5. #65
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by curaga View Post
    That sounds like the Java wrapper for the library does the locking itself, meaning one of two things:
    - making the library thread-safe is easy, or
    - it is not worth it due to locking overhead, so you'll get little gains using the Java wrapper on multicore, and decreased performance using the Java wrapper on singlecore
    In this case there are no wrappers: there is a real-world C++ implementation and a separate library that provides mostly the same functionality and is written entirely in Java.

    So it can be a third thing: the C++ library is old and shares some resources at startup (as I told you, it handles a document format, so it loads things like shared resources: font faces and some formatting data), which I think was done initially for performance/memory reasons. This library is commercial (and I don't want to say that it is a bad library, etc.), but the open-source Java library (maybe a bit less featured than a full commercial solution, but it still solves my problem) does not load these shared resources (at least up to the password-checking point), so it doesn't give any weird errors when moved to multi-core.

    As I said earlier, it is possible to work around this in the C++ solution, for example by extracting just the password-checking code and making sure that this code does not initialize any font structures (the C++ library initializes resources eagerly, while the Java one does it lazily, and both approaches have advantages). It was also possible (I think) to profile the C++ library to see where the slow parts are and make it 30% faster than the Java counterpart; for now the single-core implementation in fact performs really badly (about 1.8x slower than Java -server, and 1.2x slower than Java client).

    This is not about bashing C++; I wanted to point out real-life cases where using a clean language can be an advantage and give good enough performance. Even if the C++ library is 2x slower at just opening the document and checking the password, it may run much faster in real operation on the document right after it is loaded (importing another document into the first one, displaying it, switching pages, etc.), as the C++ implementation may load fonts and such at startup precisely to make later operations faster. This is why I explicitly said:
    I also think that Java optimized very well the code path that happens to run in that specific project, so I wouldn't draw a conclusion like "Java is faster than C++" or anything of the sort.

  6. #66
    Join Date
    May 2012
    Posts
    560

    Default

    Quote Originally Posted by ciplogic View Post
    The point I wanted to make is that 5 years from now your compiler/runtime may produce better code than your assembly. For cryptographic routines, the CPU may support extra extensions, and today you may need to write assembly to access them, but it is better to move this into an upstream library, so that every case where the CPU supports them gets optimized. So keeping the assembly OUT of your project code means your code always stays optimized.

    ...
    sse has been around for... a looooong time
    if you count non x86 architectures then it has been here before i was born
    do i have to wait for my grandchildren (if i have any at all) to have a good sse compiler ?
    optimizing compilers are bloody complicated things
    and if you ask anyone that worked on one im sure they'l say it can never reach perfection
    then theres sloppy programing since most programers dont even know how big a cpu cache is, im sure many dont even know how to align memory

    i dont think people understand how much work a cpu does
    its a massive amount of calculations and logic, massive
    on a 1GHz cpu theres around 1000000000 instructions being executed per second

    in higher level languages its easy to implement something extra, like threading
    thus its also easy to make a working program and never find out its doing 3x more work than the minimum required

    glibc has many parts in assembly, like a generic amd64 memcpy
    theres also sse, ssse3 and so on versions of many functions
    best one gets chosen by the linker/loader when called for
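
    The selection described above happens in the loader. The sketch below only imitates the idea in plain C, with a function pointer chosen on first use; it is not how glibc itself is written. All names (copy_fn, copy_bytes, copy_tuned, cpu_has_sse2, dispatch_copy) are hypothetical, and copy_tuned just calls memcpy as a stand-in for a hand-tuned routine. __builtin_cpu_supports is a GCC/Clang builtin on x86; on other targets the sketch falls back to the plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Portable byte-by-byte fallback. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Stand-in for a tuned variant; here it simply defers to memcpy. */
static void *copy_tuned(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Feature probe: uses the compiler builtin where available, else says no. */
static int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;
#endif
}

/* Picks an implementation on the first call and reuses it afterwards.
   Note: this lazy initialization is not thread-safe as written. */
void *dispatch_copy(void *dst, const void *src, size_t n)
{
    static copy_fn chosen = NULL;
    assert(dst != NULL && src != NULL);
    if (chosen == NULL)
        chosen = cpu_has_sse2() ? copy_tuned : copy_bytes;
    return chosen(dst, src, n);
}
```

    Callers only ever see dispatch_copy, so a faster variant can be added later without touching them.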

    a 4x4 matrix is 77% bigger than a 3x3 matrix, there would be no gain
    if you thought about packing 3x3 to seem as 4x4, its complicated to do and could mess up other things
    and yes, a modern gpu has shaders for just that
    but not all computers have a modern gpu


    "I basically don't see how assembly would make any difference here; can you explain to me how? How would you make a C++ library multi-threaded when it isn't?"

    you cant make something threaded when it cant be threaded
    and if it can be made threaded, from what i understand kernel threads are fairly simple
    true you can make things threaded easier in higher level languages
    but, at least with pthreads, theres a price to pay
    fun fact, GNU ls uses threads

    its funny how many people think threading is the way to go when optimizing programs
    threaded, if you ask me, is useful in some cases but not nearly as much as people say

    what bugs me for a while now is how you got a 6-8 times faster execution when hyperthreading on a single core is still just one core

    anyway, profile profile profile
    there is no magic language or a perfect compiler
    people and compilers make mistakes
    also spending years trying to tell the compiler exactly what to do is worse than doing it yourself, if you ask me
    problems are people complicate things for no reason
    i like assembly cuz its simple rly
    i dont have to read a 400 page C standard and then some other books about tricks to do in C, all i need is a table with instruction latencies (its just a couple pages per cpu, and most cpus are same in that)

    just to add:
    things propagate from higher level things to ultimately the cpu itself
    one example is what i read that "rep movsq" will be treated specially by the cpu (in future generations), thus making it the fastest possible way to do memcpy
    other obvious examples are sse, as you said h264, AES and probably more
    Last edited by gens; 04-06-2013 at 09:58 AM.

  7. #67
    Join Date
    Nov 2009
    Location
    Madrid, Spain
    Posts
    398

    Default

    Quote Originally Posted by gens View Post
    sse has been around for... a looooong time
    if you count non x86 architectures then it has been here before i was born
    You are talking about SIMD. SSE has been around since the Pentium 3 and the Athlon XP Barton, if I recall correctly. SSE2 came with the Pentium 4 and, on AMD's side, the Athlon 64.

    If you are talking about Intel's side of SIMD, it started with Pentium MMX (which accelerated integer computations) and AMD 3DNow! (which packed two 32-bit floats) in the floating-point coprocessors of that time. If you take all the SIMD instruction sets across platforms (like ARM Neon, or AltiVec on PPC), you will see that making your matrix multiplication fast by hand would require a lot of effort and hardware.


    Quote Originally Posted by gens View Post
    do i have to wait for my grandchildren (if i have any at all) to have a good sse compiler ?
    optimizing compilers are bloody complicated things
    and if you ask anyone that worked on one im sure they'l say it can never reach perfection
    So, a good SSE compiler; again, I think you are talking about SIMD. Auto-vectorization, which automatically emits AVX, AVX2, or SSE2/3 code, has been in GCC for at least 7 years now: http://gcc.gnu.org/projects/tree-ssa/vectorization.html And it is not just there; it keeps improving.

    Quote Originally Posted by gens View Post
    then theres sloppy programing since most programers dont even know how big a cpu cache is, im sure many dont even know how to align memory
    If you have many concerns to address, including security, changing client specifications, and a project with more than 1000 classes (files), then how data is aligned in the caches (even if you know what a cache miss or a cache line is) may not be the first concern on your mind. If you worked in aviation, you would see that safety is a higher concern, and close-to-realtime specifications (which in some cases can be met even with Java(!)) are concerns just as valid as 8- or 16-byte alignment.


    Quote Originally Posted by gens View Post
    i dont think people understand how much work a cpu does
    its a massive amount of calculations and logic, massive
    on a 1GHz cpu theres around 1000000000 instructions being executed per second

    in higher level languages its easy to implement something extra, like threading
    thus its also easy to make a working program and never find out its doing 3x more work than the minimum required
    You have made two factual mistakes. First, 1GHz is not 1 billion instructions per second, because the CPU is often bottlenecked by its memory bus, even in CPUs that handle this better (like the out-of-order CPUs that have been on the market for close to 20 years already). I recommend this "crash course" about CPUs and instruction scheduling, and about the "impression" we have of 1 instruction per cycle: http://hardware.slashdot.org/story/1...odern-hardware

    The 2nd mistake is that you write that high-level languages cannot target low-level performance. C++ is certainly a high-level language, and with templates it can precompute a lot of things at compile time, at zero CPU cost at runtime. There are also medium-level languages whose code can be hard for the compiler to optimize:
    Source: http://stackoverflow.com/questions/1...plication-in-c

    Here is what was suggested as the way to improve the code (some languages, such as Rust, can give code guarantees that avoid this issue):

    Don't inline the function. Your matrix multiplication generates quite a bit of code as it's unrolled, and the ARM only has a very tiny instruction cache. Excessive inlining can make your code slower because the CPU will be busy loading code into the cache instead of executing it.

    Use the restrict keyword to tell the compiler that the source- and destination pointers don't overlap in memory. Currently the compiler is forced to reload every source value from memory whenever a result is written because it has to assume that source and destination may overlap or even point to the same memory.
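
    The restrict advice from the quoted answer can be illustrated with a small C sketch. The function name mat4_mul and the row-major layout are assumptions made for this example; the code from the linked question is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: c = a * b for 4x4 row-major float matrices.
   restrict tells the compiler that a, b, and c never alias, so it may keep
   loaded source values in registers across the stores to c.
   The caller must honor that promise: passing overlapping arrays
   is undefined behavior. */
void mat4_mul(const float *restrict a, const float *restrict b,
              float *restrict c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = sum;   /* one store per output element */
        }
    }
}
```

    Without restrict, the compiler has to assume that a store to c might change a or b, which is exactly the reload problem the quoted answer describes.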
    Quote Originally Posted by gens View Post
    glibc has many parts in assembly, like a generic amd64 memcpy
    theres also sse, ssse3 and so on versions of many functions
    best one gets chosen by the linker/loader when called for

    a 4x4 matrix is 77% bigger than a 3x3 matrix, there would be no gain
    if you thought about packing 3x3 to seem as 4x4, its complicated to do and could mess up other things
    and yes, a modern gpu has shaders for just that
    but not all computers have a modern gpu
    As I just said in the previous post: there is no problem with having some assembly in some libraries if in real life it works better today. If you need atomicity, most likely you have *no other way* than working with assembly. But this still doesn't make it a sane option for LibreOffice, or for any desktop application I'm aware of. Of course, if there is no C API to access a feature and assembly is the only way to do it (like programming a serial port), then yes, it needs assembly. I cannot argue against using assembly there. But for performance reasons?

    In fact all computers have a modern GPU: glLoadMatrix (http://www.khronos.org/opengles/sdk/...LoadMatrix.xml) is implemented by every Transform&Lighting video card since the GeForce256 (a DirectX7 card). Even the worst integrated Intel "video card" has had it since around 2005. If your user's computer doesn't implement it in hardware, it probably doesn't even have the SSE2 instructions that you were so happy to have optimized for.

    Quote Originally Posted by gens View Post
    "I basically don't see how assembly would make any difference here; can you explain to me how? How would you make a C++ library multi-threaded when it isn't?"

    you cant make something threaded when it cant be threaded
    and if it can be made threaded, from what i understand kernel threads are fairly simple
    true you can make things threaded easier in higher level languages
    but, at least with pthreads, theres a price to pay
    fun fact, GNU ls uses threads

    its funny how many people think threading is the way to go when optimizing programs
    threaded, if you ask me, is useful in some cases but not nearly as much as people say
    Threading is not a way to optimize programs, but it is a good way to offer a not-so-laggy experience to most users. Also, today's hardware has at least 2 cores, even on the weakest phones or on Atom-class processors, and when you write your code to use these threads you get visible speedups. Why are PThreads costly to use? Can you give your real-life case? It is harder to program with them in C or in C++ (it is a bit easier with C++11, as you have lambdas to pass along the info you need), but I don't see the big deal.

    Quote Originally Posted by gens View Post
    what bugs me for a while now is how you got a 6-8 times faster execution when hyperthreading on a single core is still just one core
    As I've said: it was one core with a C++ library vs. 4 cores (of the same CPU) with a Java library and a little tuning of the Java flags. In fact, C++ would benefit from "Turbo Boost" as it uses just one core, while the Java implementation would not. But even when I limited it to 1 core, Java executed close to 1.8x faster than the C++ implementation (the Java 1.6 flags were: -server -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseBiasedLocking, against -O3 for the Visual Studio compiler).

    Quote Originally Posted by gens View Post
    anyway, profile profile profile
    there is no magic language or a perfect compiler
    (...)
    one example is what i read that "rep movsq" will be treated specially by the cpu (in future generations), thus making it the fastest possible way to do memcpy
    other obvious examples are sse, as you said h264, AES and probably more
    So this is why is much safer to use memcpy function, as the compiler will use in future "rep movsq" or your next .Net/Java/GLibC update. Also is better to use the AES algorithms from a library that is outside of your codebase with no ASM on your own.
    What if you need a particular loop to be fast and the CPU doesn't it auto-vectorize? Make a bug report, write it compiler friendly, and post your questions publicly:
    http://stackoverflow.com/questions/5...rize-this-loop

    This way you know that tomorrow your code will use not only SSE (1-4.x) but also AVX, or whatever future instructions appear, once the compiler supports them.
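
To make "write it compiler friendly" concrete, here is a small sketch in C. `copy_floats` and `scale_add` are hypothetical names; the assumption is that the arrays never overlap, which is what `restrict` promises to the compiler. Whether a particular compiler really vectorizes the loop has to be checked in its output or its vectorization report.

```c
#include <stddef.h>
#include <string.h>

/* Copy through the library routine instead of a hand-written loop or asm,
   so a newer libc or compiler can pick a better strategy later. */
void copy_floats(float *dst, const float *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* A loop shape that is easy for a compiler to vectorize:
   unit stride, no aliasing (restrict), no early exit, no function calls. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

The same source can then be rebuilt for SSE, AVX or a later instruction set just by changing compiler flags, without touching the loop.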

    As for your toy example of multiplying matrices: even if, let's say, you cannot put them on the GPU, why not use an upstream library to do the multiplication? A quick Google search even turned up a SIMD-optimized Mono library (where highly tuned code matters more, as Mono is not such a great compiler): https://github.com/mhutch/Mono.GameMath/ If in 2 years from now a new CPU appears, you can ask the upstream maintainer to fix it for you, given your relevant use case. He/she can give you real-life feedback on how your loop is badly written, or can fix it upstream. It is a win-win situation. Writing your assembly routine and leaving the company will leave the company powerless to fix your code after that.
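
For scale, the operation in question fits in a few lines of portable C. This is only a plain reference sketch (the name `mat3_mul_vec3` and the row-major layout are assumptions), useful as something to compare a library routine or a hand-written version against, not a tuned implementation.

```c
/* Multiplies a row-major 3x3 matrix by a 3-element column vector.
   `out` must not overlap `v`, because v is read after out is written. */
void mat3_mul_vec3(const float m[9], const float v[3], float out[3])
{
    for (int row = 0; row < 3; ++row) {
        out[row] = m[3 * row + 0] * v[0]
                 + m[3 * row + 1] * v[1]
                 + m[3 * row + 2] * v[2];
    }
}
```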

  8. #68
    Join Date
    May 2012
    Posts
    560

    Default

    i took 1 instruction per tick because of out of order scheduling
    although most instructions take 1 tick or more, theres some that take half a tick (on newer cpus 1/3, maybe even 1/5)
    still even if you count some average of 2 ticks per instruction its still a lot of instructions

    also people say "for every line in language you have to write 10-20 lines in asm"

    so heres a for loop

    in C

    for( i=0; i<66; i++) { stuff }

    in asm

    xor rcx, rcx
    label:
    stuff
    inc rcx
    cmp rcx, 66
    jnz label

    or shorter

    mov rcx, 66
    label:
    stuff
    dec rcx
    jnz label

    so one line with 3 parameters turns into 3 instructions and a label (label gets translated into an address)

    ofc in asm you are not limited by any language specific quirks

    so ye
    its not that hard but its also not that productive


    "Writing your assembly routine and leaving the company, will let the company powerless to fix your code after that."
    what about writing undocumented C++ code ? can you read poorly constructed C++ ?
    if you program assembly and document and structure it, anyone that knows asm can replace you

    also about pthreads
    to be honest i found once that glibc posix threads were using over 5% cpu
    that should never happen and was probably a bug somewhere (didnt wanna say anything since i had messed up my system a bit)
    also in theory kernel threads are lighter, but as i figure you then need to make your own synchronization mechanism
    so ye, dont take me too seriously on this, but do try test before believing blindly

  9. #69
    Join Date
    Oct 2012
    Location
    Cologne, Germany
    Posts
    308

    Cool ASM has its uses

    Quote Originally Posted by gens View Post
    i took 1 instruction per tick because of out of order scheduling
    although most instructions take 1 tick or more, theres some that take half a tick (on newer cpus 1/3, maybe even 1/5)
    still even if you count some average of 2 ticks per instruction its still a lot of instructions

    also people say "for every line in language you have to write 10-20 lines in asm"

    so heres a for loop

    in C

    for( i=0; i<66; i++) { stuff }

    in asm

    xor rcx, rcx
    label:
    stuff
    inc rcx
    cmp rcx, 66
    jnz label

    or shorter

    mov rcx, 66
    label:
    stuff
    dec rcx
    jnz label

    so one line with 3 parameters turns into 3 instructions and a label (label gets translated into an address)

    ofc in asm you are not limited by any language specific quirks

    so ye
    its not that hard but its also not that productive


    "Writing your assembly routine and leaving the company, will let the company powerless to fix your code after that."
    what about writing undocumented C++ code ? can you read poorly constructed C++ ?
    if you program assembly and document and structure it, anyone that knows asm can replace you

    also about pthreads
    to be honest i found once that glibc posix threads were using over 5% cpu
    that should never happen and was probably a bug somewhere (didnt wanna say anything since i had messed up my system a bit)
    also in theory kernel threads are lighter, but as i figure you then need to make your own synchronization mechanism
    so ye, dont take me too seriously on this, but do try test before believing blindly
    Thanks for this insufficient example!

    First off, your C code stinks. Not only can you not judge efficiency by line count, you also wouldn't ever construct a for-loop this way.
    Here's the correction for you and for everyone else who has not yet understood how to construct loops efficiently, by actually _allowing_ the compiler to optimise them properly:

    Code:
    for(i=66; i; --i){
            stuff;
    };
    1. Code Formatting: Never brag with one-liners when you can't read them once they get more complex.
    2. Count down!: (If possible), it will be much easier for the compiler, because it doesn't need a separate comparison against the limit and can simply jump on the zero flag (JZ/JNZ on x86) where needed. You would never know that if you hadn't learned ASM some day.
    3. Pre-Decrementing: I hope you know what that is, because there is good reason to do so: The compiler has no way to efficiently place the Post-Decrement in this loop, whereas it is really simple for it to do this with a Pre-Decrement.
    4. Semi-Colons: Quite a small point, but it helps readability to put a semi-colon at the end of a for-loop.
    5. I'm open to additions...


    Now, looking at your ASM (I took the shorter example for your convenience):
    Code:
    mov rcx, 66
    label:
    stuff
    dec rcx
    jnz label
    Kudos to you for at least doing it right in ASM! The DEC-JNZ method is more efficient than doing an expensive comparison first (if possible, of course), so I will definitely give you credit for that!

    Nevertheless, you defeated your own argument by making this mortifying mistake, because it ultimately supports my point on this issue:

    It doesn't make sense to write whole projects in ASM in general, but it definitely helps to know how the compiler wrangles your code, so that you can at least give it some hints for optimisation.
    For example, there are some horribly built loops in libudev (which I fixed in eudev recently) that don't have any expressions in the loop header whatsoever. All condition checks were relocated into the loop body.
    It works, granted, but the compiler was not able to optimise it and the code was not safe, as there was a real danger of getting an infinite loop (I have to note here that I definitely am of the opinion that the udev guys know what they are doing (in most cases), but blindly hoping for the compiler to fix messy code is just dumb and naive).

    So, I hope I've made my point clear. I'm definitely open to your kind remarks!

    @PThreads, ASM-productivity, Glibc
    Even though I am a big enemy of Glibc and a big supporter of its alternatives, your points are valid. For most cases, it is sufficient.
    Something came to my mind when you brought up the keyword "poorly constructed C++". I might be entering a divided world, but I think this language is poorly designed in general and encourages inefficient coding techniques.
    Especially today, when RAM has become the new bottleneck in personal computing, why in bloody hell are so many people encouraged to write code with giant classes and threading mechanisms that provoke cache misses like nothing else? It might change in the coming years and C++ has its uses, but how much sense does it make to question the very existence of ASM when you can't even write proper C?

    Best regards

    FRIGN

  10. #70
    Join Date
    May 2012
    Posts
    560

    Default

    Quote Originally Posted by frign View Post
    Especially today, when RAM has become the new bottleneck in personal computing, why in bloody hell are so many people encouraged to write code with giant classes and threading mechanisms that provoke cache misses like nothing else? It might change in the coming years and C++ has its uses, but how much sense does it make to question the very existence of ASM when you can't even write proper C?

    Best regards

    FRIGN
    im open to criticism
    but that example wasnt an example of how to do things in C, it was just the simplest one liner that came to mind

    ofc normally i write it structured like

    Code:
    for( i=0; i<66; i++) {
          things
          }
    this looks readable in longer code to me, and i didnt have a teacher to teach me a specific coding style
    as i dont do it for money i dont care rly

    and it gets compiled like that shorter example in asm
    thing is C was made for humans too, even though it was made so programmers dont have to write assembly
    so C maps well to assembly but also to human logic, and the compiler knows you just want to make that loop run 66 times
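
One way to test that claim instead of believing it blindly: put both loop shapes in a file and compare what the compiler emits (for example with `gcc -O2 -S`). The function names here are hypothetical, and the counter only stands in for "stuff"; with a body this trivial an optimizer may remove the loop altogether.

```c
/* Count up, as in the original C one-liner. */
unsigned count_up(unsigned n)
{
    unsigned runs = 0;
    for (unsigned i = 0; i < n; i++)
        runs++;              /* stand-in for "stuff" */
    return runs;
}

/* Count down to zero, as in the shorter dec/jnz version. */
unsigned count_down(unsigned n)
{
    unsigned runs = 0;
    for (unsigned i = n; i; --i)
        runs++;              /* stand-in for "stuff" */
    return runs;
}
```

Both functions run the body exactly `n` times, so as long as "stuff" does not use the value of `i`, the compiler is free to pick whichever direction is cheaper.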

    but what C doesnt tell you is how many registers the cpu has
    it comes natural to use as many variables in a loop as needed to reduce calls and duplicate calculations
    but if you use more variables than you have registers then the compiler has to store them to ram (the stack) and read them back from the cpu cache when needed

    thats one simple case where you dont know what youre doing cuz you never learned and the compiler didnt tell you
    knowing it probably wont help performance much though, as they get loaded quite fast from the cache
    Last edited by gens; 04-07-2013 at 01:53 PM.
