Is Assembly Still Relevant To Most Linux Software?


  • #81
    c'mon, i didn't write C in more than a year
    took me 15 min to remember it

    "What about the case of: stuff that changes RCX, would create an infinite loop?"

    rcx is a general-purpose register
    but it is also treated as a special counter register, by programmers and by the cpu in some cases

    when it comes down to it, if you have stuff that changes the "i" variable it's the same thing

    fun fact: theoretical throughput of x87 floating point and sse floating point is the same

    Comment


    • #82
      Originally posted by gens View Post
      c'mon, i didn't write C in more than a year
      took me 15 min to remember it

      "What about the case of: stuff that changes RCX, would create an infinite loop?"

      rcx is a general-purpose register
      but it is also treated as a special counter register, by programmers and by the cpu in some cases

      when it comes down to it, if you have stuff that changes the "i" variable it's the same thing

      fun fact: theoretical throughput of x87 floating point and sse floating point is the same
      So let me answer your previous questions as well:
      - what about uncommented C++ compared with uncommented assembly: in fact C++ is much easier, as there is always less C++ code doing the same thing
      Code:
      #include <algorithm>
      #include <iostream>
      #include <vector>
      
      bool myfunction (int i,int j) { return (i<j); }
      
      struct myclass {
        bool operator() (int i,int j) { return (i<j);}
      } myobject;
      
      int main () {
        int myints[] = {32,71,12,45,26,80,53,33};
        std::vector<int> myvector (myints, myints+8);
      
        // default comparison, operator<
        std::sort (myvector.begin(), myvector.begin()+4);
      
        // comparison by free function
        std::sort (myvector.begin()+4, myvector.end(), myfunction);
      
        // comparison by function object
        std::sort (myvector.begin(), myvector.end(), myobject);     //(12 26 32 33 45 53 71 80)
      
        std::cout << "myvector contains:";
        for (std::vector<int>::iterator it=myvector.begin(); it!=myvector.end(); ++it)
          std::cout << ' ' << *it;
        std::cout << '\n';
      
        return 0;
      }
      This code probably does sorting. If there is a bug in it, you can split the display code away, and you can replace the sorting with your own custom code if you like. Assembly doesn't have these neat modules and high-level definitions to work with. I have worked with badly written C++ code, but hey, it is nothing compared with assembly. C++ has other issues, like long compile times, but that is not about performance either; I do understand that rewriting a badly written module to fix it can take time. But as always, if that big module is split into parts, some parts don't have to be fully rewritten, because you can tell that they already work.

      Also, if you use a decent IDE (Xcode, Qt Creator, or Visual Studio with the Whole Tomato plugin), I see no issue in navigating a huge amount of code, even ugly code. Can you say what tooling your assembly coding has as an equivalent? I know Fasm; if you think that's where we are today, you should look around some more.

      "Is anything faster than x264 encoder or FFMPEG that are heavily written in assembly?"
      First of all I think you agree that: FFMpeg as written in assembly can have bugs, and if you have a bug you have to fix it in all implementations. Or sometimes just one of the implementations has a bug, like this FFMpeg bug: https://ffmpeg.org/trac/ffmpeg/ticket/1566

      As you don't have any numbers (baseline C/C++ implementation vs. the assembly-based one), I can only say that some mini-algorithms are written in assembly where the benefit is visible. And if you think this is your opportunity to say "see, I told you they *had to* use assembly", I would say no: today they would have to use OpenCL, and as I've shown you in a video, encoding with OpenCL is faster than ignoring it, so I think that is the best advice today (look at the Sony Vegas or PowerDirector OpenCL speedups and you'll see what I'm talking about).

      "RCX is used for counting"
      When you said RCX is used for counting (most of the times), I cannot agree more with you, but this just means that if "stuff" is anything that does include loop, it will use RCX too, so when you will see a save and restore asm codes in GCC, saying: GCC doesn't write my loop in an optimum way, is a misunderstanding of the fact that GCC optimizes your program faster to make it finish first. (by saving and restoring later after stuff call)
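      A minimal sketch of the situation, with a hypothetical opaque stuff() function: the compiler must keep "i" alive across the call, so it either holds it in a callee-saved register or saves and restores it around the call, which is exactly the save/restore code you see in GCC's output.
      Code:
      extern void stuff();  // opaque, defined elsewhere: may clobber any
                            // caller-saved register, RCX included
      
      void hot_loop() {
          for (long i = 0; i < 1000000; ++i)
              stuff();  // "i" survives each call via a callee-saved register
                        // or a stack slot that GCC saves and restores
      }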

      You also made another factual mistake: "compilers cannot know/decide when your loop is 'hot'", and hence cannot decide when to do loop unrolling, or invert an if condition to match the CPU cache. Here you're wrong for two reasons, because GCC does it via two mechanisms: it has built-in heuristics (so if you loop up to a value like 1,000,000 it automatically considers the loop hot and will at least try to inline your "stuff"), and it has profile-guided optimization: http://dom.as/2009/07/27/profile-gui...tion-with-gcc/ How do I know a compiler knows that? From reading a GCC paper about LTO: http://arxiv.org/pdf/1010.2196.pdf where LTO optimizes based on exactly these heuristics (if no PGO info is around) that you said the compiler doesn't know about.

      Lastly: how fast would you write your code?

      Hypothetical case
      Let's say we are 2 engineers given the same task: optimize a per-pixel processing routine for real-world use. My approach will be to use up-to-date technologies; yours would be to use assembly.

      So, first of all, assuming it is C++ code, I will write 2 #pragma directives to make my program multi-core aware (OpenMP), and with the time saved, I will extract the hot loop and port it to OpenCL with a cut-and-paste approach.
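      A minimal sketch of that multi-core step, with a hypothetical brighten() standing in for the real per-pixel filter; the pragma is the only OpenMP-specific line (build with -fopenmp):
      Code:
      #include <cstddef>
      #include <cstdint>
      
      static inline uint8_t brighten(uint8_t p) {   // hypothetical per-pixel routine
          return p > 235 ? 255 : uint8_t(p + 20);
      }
      
      void process(uint8_t *pixels, std::size_t count) {
          // each pixel is independent, so OpenMP can split the iterations across cores
          #pragma omp parallel for
          for (long i = 0; i < (long)count; ++i)
              pixels[i] = brighten(pixels[i]);
      }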

      Let's say that in the meantime you finish the assembly version: it works 3x faster, keeps the semantics, and has no bugs. Who would benefit?

      What about someone with an old computer?
      If you're talking about users who have a Pentium 3 with a TNT2 video card, then most certainly, you're the man. But since most people don't have a Pentium 3, but at least some form of Athlon X2 or X4 or a dual-core Pentium 4, the assembly code doesn't look that shiny: it would be marginally faster, but not by much.

      What about the power users?
      They may have a 4-, 6- or 8-core machine, or maybe a 4U server with 64 cores (if they want to process a huge library of pictures). They would benefit from the multi-core-aware implementation. And what about users who mostly play games, so they have an Athlon X2 but a powerful AMD Radeon 76xx card? They would still be faster with the OpenCL implementation than with the 3x-speedup assembly.

      What about the future?
      Compilers will keep adding optimizations, so even if one brings just a 10% speedup (like this one in Clang: http://blog.llvm.org/2011/09/greedy-...n-llvm-30.html ), I just have to recompile the code. As video drivers update and more people get OpenCL-aware hardware, the code will serve those users better, again with no work on my side. Also, users may have a very weak tablet CPU but a powerful GPU (like Tegra 4 or the Adreno video cards) capable of a form of OpenCL.

      Comment


      • #83
        Namespaces

        Originally posted by ciplogic View Post
        So let me answer your previous questions as well: [...]
        Looking at your example, I honestly have to say that ASM is more readable.
        I know there is a "using namespace ..." directive, but writing std:: in front of every element of the std namespace doesn't make it more readable!

        I agree with the points you made about ASM, though! It can be a huge bottleneck in some cases, when there are bugs to fix in a critical implementation. But to be fair, you don't have to search endlessly for them in ASM, because it follows a simple logic.
        If you write buggy code in C/C++, the cause could be anything, from a bug in a library to a misconfigured compiler.

        Comment


        • #84
          Originally posted by frign View Post
          Looking at your example, I honestly have to say that ASM is more readable.
          I know there is a "using namespace ..." directive, but writing std:: in front of every element of the std namespace doesn't make it more readable!

          I agree with the points you made about ASM, though! It can be a huge bottleneck in some cases, when there are bugs to fix in a critical implementation. But to be fair, you don't have to search endlessly for them in ASM, because it follows a simple logic.
          If you write buggy code in C/C++, the cause could be anything, from a bug in a library to a misconfigured compiler.
          This is not my code; it is code found via Google, using templates (and I removed most of the comments). Still, I can see fairly clearly that all the sorting logic is separate. Even if I didn't know the algorithm, or whether there is a bug in the sorting function, I could clearly spot and debug where the issues are, and I could rewrite just the sorting logic (even by using the C sort function, as sketched below). I also picked this code specifically because std::sort will likely inline the comparison function, so it works faster than the C implementation, and it is really hard to write a (fast) sorting function in assembly.
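          A minimal sketch of that swap, reusing the ints from the earlier example: qsort reaches its comparator through a function pointer on every comparison, while std::sort can inline it.
          Code:
          #include <algorithm>
          #include <cstdlib>
          
          // comparator for C qsort: called through a function pointer,
          // so the compiler usually cannot inline it
          static int compare_ints(const void *a, const void *b) {
              int x = *(const int *)a, y = *(const int *)b;
              return (x > y) - (x < y);  // safer than x - y, which can overflow
          }
          
          bool myfunction(int i, int j) { return i < j; }  // the thread's comparator
          
          int main() {
              int myints[] = {32, 71, 12, 45, 26, 80, 53, 33};
          
              std::qsort(myints, 8, sizeof(int), compare_ints);  // the C sort function
              std::sort(myints, myints + 8, myfunction);         // comparator inlined
              return 0;
          }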

          I grant you that debugging C++ is ugly, but there are still clean libraries that get your work done: Qt is the first that comes to mind. Also, most high-level languages like Java, Mono, Python, etc. can have bugs at many levels, but most of them were sorted out long ago by the upstream authors. In assembly, as you write only at a very low level, you don't have anything underneath you (excluding the CPU, or maybe some basic OS functionality), but this also means you cannot write the complex functionality most software uses today: a web server, an application that processes a picture with all its interactions (Adobe Photoshop Elements is written with Qt), a compiler.

          Writing small (in scope) functions in assembly is possible; the single concern, for "Most Linux Software" as the title suggests, is where that assembly should live: in a few core libraries, not downstream in every application. As for me, in many cases it literally makes no sense to use assembly at all (because compilers cannot optimize it further), since today's trends are multi-core and GPGPU, neither of which depends on assembly to be fast. Code guarantees are more important: immutability, mutexes/semaphores (which, yes, are written in assembly, but in your library, not in your code), smart pointers to avoid wrong new/delete usage as much as possible, const refs (a C++ developer who uses them can gain a lot of performance), templates. A sketch of those last two follows.
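          A minimal sketch of two of those guarantees, with a hypothetical Image type (not code from any project mentioned here):
          Code:
          #include <cstddef>
          #include <memory>
          #include <vector>
          
          struct Image { std::vector<unsigned char> pixels; };
          
          // const ref: no copy of the (potentially huge) buffer is made,
          // and the compiler enforces that analyze() cannot modify it
          std::size_t analyze(const Image &img) { return img.pixels.size(); }
          
          void run() {
              // smart pointer: the Image is freed automatically when ptr
              // goes out of scope, so a forgotten or doubled delete cannot happen
              std::unique_ptr<Image> ptr(new Image());
              analyze(*ptr);
          }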

          Comment


          • #85
            where to start...

            messing with rcx (or whatever register you use for a counter) in a loop is like writing
            Code:
                    for( i=0; i<66; i++) {
                            i=0;
                    }
            it's as dumb in asm as it is in C

            LTO is good for 32-bit code, as it removes the "stack for parameter passing" part
            but on amd64 (and with 32-bit fastcall) parameters are passed through registers, so calling any function is as fast as a jmp (1 tick if it's a short jump; a long one doesn't matter, as it happens with LTO too)

            PGO is good, yes, as good as your benchmark/test case
            the only downside is that it adds overhead while profiling, so it can't be as accurate as sampling (like perf does)

            numbers?
            do you have any numbers?
            Steve McIntyre wrote in that report that ~30% of packages with assembly have it for performance reasons

            high-level design?
            it's as possible in asm as it is in any other language
            openmp can be called from assembly code almost as easily as from C
            synchronizing/scheduling threads should be even easier in asm
            anything can be done in asm, even OO code
            furthermore, OO adds to a program's size

            and back to numbers
            i think i remember how i did the 3x3 * 3x1 calculation
            you write the cleanest, most compiler-friendly code in C and compile it however you want (limited to sse2, as i don't have avx)
            i'll write a function that does the same in asm
            we will make a test loop to measure speed
            the input will be... 10k (or 100?) of those matrices, generated before the call
            LD_PRELOAD will be used to eliminate loading costs on call, and to choose the test function
            or?

            also, about GPGPU
            i agree, and always have, that if you've got a lot of data to process in parallel it's best to use a gpu
            but you can't be sure you will always have a gpu in that computer
            and sending data to the gpu and getting it back causes delays, which is a problem when you need real-time results (like sound processing and stuff with networking)

            oh, and that FFMPEG bug (FFMPEG, which i never mentioned; x264 is hosted with the VLC project) was resolved 19 hours after it was reported

            and again, as a reminder, threading in assembly should be quite easy
            although i can't testify to that, since i don't care about it
            but i can testify that calling C programs is easy


            also, about the report, which i've read most of by now;
            replacing RDTSC with gettimeofday? really?
            rewriting glibc in full C?
            removing asm from the kernel?
            removing asm from OGRE?
            btw, OGRE is good C++, mostly because it's documented so well

            and about ARM
            arm has fewer instructions
            you might think that fact will serve a human coding in asm more than a compiler; that is not the case
            the future is some kind of cpu/gpu hybrid, with lots of special instructions, cores and registers
            the advantage a human has over a computer is that a human is really good at recognizing patterns, while a computer has to follow linear logic (test everything against everything, maybe make assumptions to shorten that where possible)
            so yes, C for ARM, ARM for (some) machinery
            and ARM pushing ARM by stating "ARM has better potential performance/watt" has never been proven (at least not without obvious cheating)

            i probably forgot something

            edit: yes, IDE
            i program FASM in SciTE (it has pretty colors) and xterm
            i don't need anything complicated, as there isn't anything complicated about it
            i just run a script that compiles, links (if needed) and runs
            if it fails: gdb ./it

            PS: copy/paste code works most of the time with asm, and FASM has very powerful macros for anything (structs are very easy, for example)
            Last edited by gens; 08 April 2013, 03:16 PM.

            Comment


            • #86
              Too bad antiviruses flagged the crap out of those programs.

              Comment


              • #87
                Originally posted by gens View Post
                where to start...

                messing with rcx (or whatever register you use for a counter) in a loop is like writing
                Code:
                        for( i=0; i<66; i++) {
                                i=0;
                        }
                it's as dumb in asm as it is in C
                Code:
                std::string dictionary[66] = { /* ... */ };
                std::string textToFind = "assembly";
                for (int i = 0; i < 66; i++) {
                    if (textToFind == dictionary[i]) break;
                }
                The equality comparison does contain a loop, and maybe it's just me, but this code doesn't look that dumb. As for your C code, any compiler can remove the loop as dead code (since it has no side effects).
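                A minimal sketch of that dead-code point; the volatile variant is my own contrast, showing when the compiler must keep the loop:
                Code:
                // no side effects: an optimizing compiler is allowed to drop this loop entirely
                void dead() {
                    for (int i = 0; i < 66; i++)
                        i = 0;  // the original "dumb" counter reset
                }
                
                volatile int sink;
                
                // the volatile store is an observable side effect, so this loop
                // must be kept (and it really does spin forever)
                void alive() {
                    for (int i = 0; i < 66; i++) {
                        i = 0;
                        sink = i;
                    }
                }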

                Originally posted by gens View Post
                LTO is good for 32-bit code, as it removes the "stack for parameter passing" part
                but on amd64 (and with 32-bit fastcall) parameters are passed through registers, so calling any function is as fast as a jmp (1 tick if it's a short jump; a long one doesn't matter, as it happens with LTO too)

                PGO is good, yes, as good as your benchmark/test case
                the only downside is that it adds overhead while profiling, so it can't be as accurate as sampling (like perf does)
                But PGO is not used to produce profiling reports (as sampling profilers do); it is used to give the compiler the hints you said it cannot have: whether it is advantageous to invert an if condition to be cache friendly, or to devirtualize a method (putting in a guard, as in polymorphic inline caching) where it sees a benefit. Of course, more accurate profiling costs more CPU. Do you have a solution to propose for making PGO "sampling aware"?
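                A minimal sketch of such a guarded devirtualization, with hypothetical Shape/Circle types; this is roughly the transformation a PGO-informed compiler performs internally, written out as source:
                Code:
                #include <typeinfo>
                
                struct Shape  { virtual double area() const = 0; virtual ~Shape() {} };
                struct Circle : Shape {
                    double r;
                    explicit Circle(double radius) : r(radius) {}
                    double area() const { return 3.14159 * r * r; }
                };
                
                double guarded_area(const Shape &s) {
                    // profile data said the receiver is almost always a Circle:
                    // guard on the dynamic type, then make a direct, inlinable call
                    if (typeid(s) == typeid(Circle))
                        return static_cast<const Circle &>(s).Circle::area();
                    return s.area();  // rare case: normal virtual dispatch
                }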

                Originally posted by gens View Post
                numbers?
                do you have any numbers?
                Steve McIntyre wrote in that report that ~30% of packages with assembly have it for performance reasons

                high-level design?
                it's as possible in asm as it is in any other language
                openmp can be called from assembly code almost as easily as from C
                synchronizing/scheduling threads should be even easier in asm
                anything can be done in asm, even OO code
                furthermore, OO adds to a program's size
                I didn't say that it's not possible to write OOP in assembly (did I!?). What I said is simply that, given today's compiler maturity, assembly is a losing target for performance. In fact, your numbers show it: 60 packages (30% of the 200 listed) is in fact ridiculously small. My Ubuntu has 40,000+ packages; even if we talk about just 20,000 packages that may target ARM64, we are talking about 0.6% (or 0.3% for full Ubuntu Main+Universe). And those packages are not written as one void method() { asm { ... } } after another; just the core, performance-sensitive parts are assembly, right?

                Originally posted by gens View Post
                and back to numbers
                i think i remember how i did the 3x3 * 3x1 calculation
                you write the cleanest, most compiler-friendly code in C and compile it however you want (limited to sse2, as i don't have avx)
                i'll write a function that does the same in asm
                we will make a test loop to measure speed
                the input will be... 10k (or 100?) of those matrices, generated before the call
                LD_PRELOAD will be used to eliminate loading costs on call, and to choose the test function
                or?
                I will do this, but in a later post (so stay tuned). Also, just so you consider the target of your code: if your "matrix multiplication" point is meant to show that you can (almost) always do better with assembly, note that if you have a 2D polygon to display on screen, you can use glLoadMatrix and glMultMatrix, which use 0 (zero) CPU on any of today's video cards. I would bet that unless you are on the plain VGA driver, you get this operation in hardware.
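                A minimal sketch of that fixed-function path (classic OpenGL 1.x; glLoadMatrixf expects a 4x4 column-major array):
                Code:
                #include <GL/gl.h>
                
                void apply_transform(const GLfloat m[16], const GLfloat extra[16]) {
                    glMatrixMode(GL_MODELVIEW);
                    glLoadMatrixf(m);      // replace the current matrix...
                    glMultMatrixf(extra);  // ...and compose another on top: the driver/GPU
                                           // does the multiplication, and every vertex drawn
                                           // afterwards is transformed with zero CPU cost here
                }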

                Originally posted by gens View Post
                also, about GPGPU
                i agree, and always have, that if you've got a lot of data to process in parallel it's best to use a gpu
                but you can't be sure you will always have a gpu in that computer
                and sending data to the gpu and getting it back causes delays, which is a problem when you need real-time results (like sound processing and stuff with networking)

                oh, and that FFMPEG bug (FFMPEG, which i never mentioned; x264 is hosted with the VLC project) was resolved 19 hours after it was reported
                Yes, it was solved, and I think it helped that the reporter gave both the problem and the solution in the first part of the description, right? But I agree, assembly can be fixed; it is just an extra burden (here, for the MMX implementation), repeated for as many targets as are faulty, whereas if there were only a C/C++ implementation, there would be (in case of a bug) just one target to fix.
                Originally posted by gens View Post
                and again, as a reminder, threading in assembly should be quite easy
                although i can't testify to that, since i don't care about it
                but i can testify that calling C programs is easy


                also, about the report, which i've read most of by now;
                replacing RDTSC with gettimeofday? really?
                rewriting glibc in full C?
                removing asm from the kernel?
                removing asm from OGRE?
                btw, OGRE is good C++, mostly because it's documented so well

                and about ARM
                arm has fewer instructions
                you might think that fact will serve a human coding in asm more than a compiler; that is not the case
                the future is some kind of cpu/gpu hybrid, with lots of special instructions, cores and registers
                the advantage a human has over a computer is that a human is really good at recognizing patterns, while a computer has to follow linear logic (test everything against everything, maybe make assumptions to shorten that where possible)
                so yes, C for ARM, ARM for (some) machinery
                and ARM pushing ARM by stating "ARM has better potential performance/watt" has never been proven (at least not without obvious cheating)
                ARM has many reasons for its good (or better) performance/watt, and they don't rely only on how good the CPU is at crunching numbers, where Intel seems to be the leader for now. If you're talking about a datacenter, there are power bills, cooling, and other things like that. As most (company file and network) servers sit idle, using ARM may be the better choice even if a Xeon can supposedly run circles around ARM. If you're waiting on your networked SSD, you don't care much how fast your CPU can compute.

                Lastly, you made a straw man: I never said to remove or rewrite anything. "Rewriting Linux in C, C++, Java" or whatever is not on my mind. I'm one of the Mono proponents on this forum, and I never wrote "rewrite all of Gtk+ as Gtk# because Mono is great and C# code can be written to be fast". What I say is that from practical, maintenance, development-cost, time-to-market, etc. standpoints, there is no point in using assembly for "most Linux software", as the title of this article/forum goes. I have written most of my code in C# for years, and so far I haven't faced a performance bottleneck I couldn't work around. In fact, on one project I worked on (search for NaroCAD, which is open source, btw), I took a C component and made it OOP/C# with no (visible) performance implications. But my reason for rewriting it in C# was maintenance, not performance. Performance for me is not an end, though for some projects it may be the means.
                As for your question why glibc is not written in full C: in fact a big chunk of it is; look, for example, inside the math folder of its source tree.
                Originally posted by gens View Post
                i probably forgot something

                edit: yes, IDE
                i program FASM in SciTE (it has pretty colors) and xterm
                i don't need anything complicated, as there isn't anything complicated about it
                i just run a script that compiles, links (if needed) and runs
                if it fails: gdb ./it

                PS: copy/paste code works most of the time with asm, and FASM has very powerful macros for anything (structs are very easy, for example)
                The IDE you describe is a joke by today's standards. Does it have code navigation? It is typical for me to work with 30-40 projects at the same time and to jump to the class I need in, let's say, less than 2-3 seconds (it's faster than that in the IDE I'm working with).

                The other part is that, as fancy as macros are (C has them too), IDEs offer much more, including custom debugging rules, adjusting the build system, searching through dependencies, and code completion. Does Fasm have any relevant code completion? (I used Fasm some 10 years ago, so I'm not really up to date.)

                About your challenge
                What is the use case for creating 100,000 matrices per vertex and multiplying them? Shouldn't it be just the reverse: 100,000 vertices with one 3x3 matrix? (3x3 matrices are for 2D transformations as far as I know, so you would want to transform your polygons with one specific transformation.)

                Comment


                • #88
                  Originally posted by ciplogic View Post
                  About your challenge
                  What is the use case for creating 100,000 matrices per vertex and multiplying them? Shouldn't it be just the reverse: 100,000 vertices with one 3x3 matrix? (3x3 matrices are for 2D transformations as far as I know, so you would want to transform your polygons with one specific transformation.)
                  A simple (naive) C version would finish this in less than 3 ms on my machine, without trying anything more clever. So my concern remains: where would you use this kind of coding?

                  This is the naive coding I'm talking about:

                  Code:
                  void compute()
                  {
                      int len = 10000000;
                      // matrices and globalResult are assumed globals, e.g.:
                      //   float (*matrices)[3][3];  float globalResult[3];
                      auto vertex = new float[3];   // the "Vertex3": 3 floats
                      vertex[0] = 0;
                      vertex[1] = 100;
                      vertex[2] = 200;
                      auto result = new float[3];
                      for (auto i = 0; i < len; i++)
                      {
                          auto &matrix = matrices[i];
                  
                          result[0] = matrix[0][0]*vertex[0] + matrix[0][1]*vertex[1] + matrix[0][2]*vertex[2];
                          result[1] = matrix[1][0]*vertex[0] + matrix[1][1]*vertex[1] + matrix[1][2]*vertex[2];
                          result[2] = matrix[2][0]*vertex[0] + matrix[2][1]*vertex[1] + matrix[2][2]*vertex[2];
                  
                          vertex[0] = result[0];
                          vertex[1] = result[1];
                          vertex[2] = result[2];
                      }
                      globalResult[0] = vertex[0];
                      globalResult[1] = vertex[1];
                      globalResult[2] = vertex[2];
                      delete[] result;
                      delete[] vertex;
                  }
                  Just picking Release in Visual Studio 2010, this code (for 10,000,000 items) runs in 250 ms on my machine (just 1 core used), Windows, 32-bit target. Anyway, I am still wondering what use case you're referring to in which code like this would be useful.

                  Mono on a Linux machine, for this very same code, gives (for 1 core, 64-bit Linux this time) 273 ms with default options and 290 ms with O=all and/or the LLVM backend. I haven't tested the C++ code there yet, as I haven't picked the GCC options and I feel a bit lazy. Also, microbenchmarks are almost always silly. I did think of a use case for your many multiplied matrices, for example a game replay where you want to combine all the sequences faster, but that *should* multiply the matrices together first and only later multiply the vertex once with the combined matrix (using the property that matrix multiplication is associative). That means far less computation for both vertices and matrices. And in the same use case (with matrices/vertices) you could save snapshots every 200 ticks, sparing yourself 10,000,000 computations done just for the sake of writing them in assembly!

                  So my concerns remain: aren't you doing it wrong in the first place? If you want to display the points in OpenGL, shouldn't you use glLoadMatrix from the start, making the computation CPU-free? And if it is, say, a case where you have a list of matrices applied on top of a vertex (you name the use case), then even without storing them in a CPU-friendly way, it seems fairly fast for most practical purposes, doesn't it?

                  Comment


                  • #89
                    So my code to optimize is:
                    Code:
                    using System;
                    using System.Collections.Generic;
                    
                    namespace SillyBenchmarks
                    {
                    	class MainClass
                    	{
                    		List<float[]> matrices = new List<float[]>();
                    		List<float[]> snapshots = new List<float[]>();
                    
                    		float [] globalResult;
                    		
                    
                    		void ComputeSnapshots(int step)
                    		{
                    			for(var i=0; i<matrices.Count;i+=step)
                    			{
                    				var snapshot = Identity();
                    				for(var j = i; j<Math.Min(matrices.Count, i+step); j++)
                    				{
                    					var result = Identity();
                    					Multiply(ref result, snapshot, matrices[j]);
                    					snapshot = result;
                    				}
                    				snapshots.Add(snapshot);
                    			}
                    		}
                    
                    		static void Multiply (ref float[] result, float[] src, float[]dest)
                    		{
                    			for(var i=0;i<3;i++)
                    				for(var j=0;j<3;j++)
                    			{
                    				var sum = 0.0f;
                    				for(var k=0;k<3;k++)
                    				{
                    					sum += src[i*3+k]*dest[k*3+j];
                    				}
                    				result[i*3+j]=sum;
                    			}
                    		}
                    		static float[] Identity()
                    		{
                    			var result = new float[9];
                    			result[0] = 1.0f;
                    			result[4] = 1.0f;
                    			result[8] = 1.0f;
                    			return result;
                    		}
                    
                    		static void MultiplyVertexWithMatrix (ref float[] result, ref float[] matrix, ref float[] vertex)
                    		{
                    			result [0] = matrix [0 * 3 + 0] * vertex [0] + matrix [0 * 3 + 1] * vertex [1] + matrix [0 * 3 + 2] * vertex [2];
                    			result [1] = matrix [1 * 3 + 0] * vertex [0] + matrix [1 * 3 + 1] * vertex [1] + matrix [1 * 3 + 2] * vertex [2];
                    			result [2] = matrix [2 * 3 + 0] * vertex [0] + matrix [2 * 3 + 1] * vertex [1] + matrix [2 * 3 + 2] * vertex [2];
                    		}
                    		void Compute(int step)
                    		{
                    			var vertex = new float[3];
                    		
                    			var result = new float[3];
                    			vertex[0] = 0;
                    		    vertex[1] = 100;
                    		    vertex[2] = 200;
                    			int len = matrices.Count;
                    			var pos =0;
                    		    for(var i=0;i<len;i+=step)
                    		    {
                    		        var matrix = snapshots[pos++];
                    
                    				MultiplyVertexWithMatrix (ref result, ref matrix, ref vertex);
                    
                    		        vertex[0] = result[0];
                    		        vertex[1] = result[1];
                    		        vertex[2] = result[2];
                    		    }
                    			globalResult = new float[3];
                    		    globalResult[0] = vertex[0];
                    		    globalResult[1] = vertex[1];
                    		    globalResult[2] = vertex[2];
                    
                    		}
                    
                    		public static void Main (string[] args)
                    		{
                    			
                    			var app = new MainClass();
                    			var len = 10000000;
                    			for(var i =0;i<len;i++)
                    				app.matrices.Add(Identity());
                    			var step = 200;
                    			app.ComputeSnapshots(step);
                    
                    			var start = Environment.TickCount;
                    			app.Compute(step);
                    			var end = Environment.TickCount;
                    			Console.WriteLine ("Time: {0}", end-start);
                    		}
                    	}
                    }
                    With the creation of the matrices done, I did the grouping of computations I told you about (using the property that matrix multiplication is associative), and this is code that in practice multiplies 10,000,000 matrices with a vector. The precomputation of the snapshots can take long, but many games do this behind the "Loading" screen. After the snapshotting, even if it's not incredibly fast, I see a 10x speedup for 1% more memory usage. In fact, this caching could be done as you save the replay, with no (or minimal) impact on performance.

                    So right now I'm curious about my point: it seems that C# can be optimized for your use case without any SSE-aware code, just with a smart enough algorithm. It runs what you see here in about 15 ms, and when I use 200 items per snapshot, it goes under 10 ms.

                    How fast would you get using assembler?

                    To reproduce my figures on your machine, just save the code in a Main.cs, run: gmcs Main.cs, and run it with 'mono Main.exe'. I'm not asking you to use any weird flag like -O=unsafe or --llvm, or to tune the GC.
                    Last edited by ciplogic; 09 April 2013, 04:42 AM.

                    Comment


                    • #90
                      Originally posted by frign View Post
                      I am itching to give my 2 cents here, as there is even more room for improvement, despite it being quite minor compared to your initial proposal (note: the CODE tag can be handy):

                      Code:
                      static (u)int_fastQ_t somechip_interp_flat(struct somechip_shader_ctx *ctx, int input)
                      {
                           (u)int_fastQ_t r; // I don't know the range of r,
                                             // but it could be determined in this case
                                             // and limited to 16 bits or even 8
                      
                           struct some_gpu_bytecode_alu alu;
                      
                           memset(&alu, 0, sizeof(struct some_gpu_bytecode_alu));
                      
                           alu.inst = SOME_ALU_INSTRUCTION_INTERP_LOAD_P0;
                      
                           alu.dst.sel = ctx->shader->input[input].gpr;
                           alu.dst.write = 1;
                      
                           alu.src[0].sel = SOME_ALU_SRC_PARAM_BASE + ctx->shader->input[input].lds_pos;
                      
                           for (uint_fast8_t i = 0; i < 4; i++) {
                                alu.dst.chan = i;
                                alu.src[0].chan = i;
                      
                                if (i == 3)
                                     alu.last = 1;
                                r = some_alu_bytecode_add_alu(ctx->bc, &alu);
                                if (unlikely(r))
                                     break;
                           }
                           return r;
                      }
                      Sadly, the order of the for loop is fixed. Looking at "int input", but not knowing its data range, one might also be able to reduce its size accordingly.

                      More improvements would actually require knowing more about the specific project (I would be happy if you could PM me the project's name; I really would like to know (and hope it's not Intel)).

                      Did you commit your changes?

                      Best regards

                      FRIGN
                      Yes, commit those changes. Sometimes things get written in a (sloppy) way and overlooked for years. It's always nice having stuff cleaned up.

                      However, I'm not so sure about (u)int_fast* usage in the kernel. Even if it really is faster/better, replacing it everywhere would be quite a mission. Unfortunately, right now (3.7.10) there is only ONE reference in the entire kernel, and I don't know if it will even stay there; it's been there for a while, I believe.

                      Code:
                      grep int_fast * -R
                      drivers/staging/tidspbridge/dynload/cload.c:    uint_fast32_t sum, temp;

                      Comment
