Is Assembly Still Relevant To Most Linux Software?


  • Originally posted by frign View Post
    Can you give me proof or a specific example? Judging from the years I have worked on low-level C programming, smaller integers are faster than the standard ones, and the compiler just can't know the range of the very integers it has to work with. It is just impossible.

    The only compiler I know of that is capable of this is a standardised Ada compiler like GNAT-GCC, provided you work out the ranges in your code properly, but not GCC. The language itself doesn't allow it!
    Maybe I misunderstood the article you posted (about modulo and date time), but the numbers are:
    uint32_t days, hours, minutes, seconds;
    ARM Cortex: 384 cycles

    and then
    uint8_t hours, minutes, seconds;
    ARM Cortex: 434 cycles

    finally
    uint_fast8_t hours, minutes, seconds;
    ARM Cortex: 384 cycles

    So I assume that for the ARM Cortex, uint_fast8_t just maps to uint32_t, because that's the fastest container that holds at least 8 bits on this platform.

    About the compiler stuff, I just thought the compiler could do the same thing (when you declare an int, it's free to use either 16, 32 or 64 bits for it depending on I-don't-know-what). It couldn't do it across any interface, because of binary compatibility, but it could do it for internal variables, I guess. Because uint16_t is guaranteed to be 16 bits, but uint_fast16_t, short, and int are not. But those are only random thoughts.
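    A minimal sketch of those guarantees in C (my example, not from the article; the printed sizes are implementation-dependent, which is the point):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* uint16_t is exactly 16 bits; uint_fast16_t only promises "at
           least 16 bits, whatever is fastest here" and may well be wider. */
        printf("uint16_t      : %zu bytes\n", sizeof(uint16_t));
        printf("uint_fast16_t : %zu bytes\n", sizeof(uint_fast16_t));
        printf("int           : %zu bytes\n", sizeof(int));
        return 0;
    }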
    Last edited by erendorn; 09 April 2013, 03:39 PM.



    • idk about ARM, but x86... depends

      you wanna make the job easy for the cpu, so loading/saving is fastest with native sizes
      that's why it's good to pack, though idk if a compiler will (i guess it will, best to test)

      anyway, about calculations
      there are speed/latency/throughput tables here

      MOV r8, m8
      is an 8bit load from memory
      (MOVZX also loads 8 bits, but zero-extends into a wider register)

      IDIV r16
      is 16bit integer division (it takes one operand and divides DX:AX by it)

      etc.
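      a small C example of the load thing (mine, just to illustrate - check your own compiler's output):

      #include <stdint.h>

      /* on x86, gcc typically turns this 8bit load into a MOVZX
         (zero-extend into a full register) to avoid partial-register
         stalls - compilers may differ */
      uint32_t load_byte(const uint8_t *p)
      {
          return *p;
      }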
      Last edited by gens; 09 April 2013, 04:00 PM.



      • Take a look at the other test results

        Originally posted by erendorn View Post
        Maybe I misunderstood the article you posted (about modulo and date time), but the numbers are:
        uint32_t days, hours, minutes, seconds;
        ARM Cortex: 384 cycles

        and then
        uint8_t hours, minutes, seconds;
        ARM Cortex: 434 cycles

        finally
        uint_fast8_t hours, minutes, seconds;
        ARM Cortex: 384 cycles

        So I assume that for the ARM Cortex, uint_fast8_t just maps to uint32_t, because that's the fastest container that holds at least 8 bits on this platform.

        About the compiler stuff, I just thought the compiler could do the same thing (when you declare an int, it's free to use either 16, 32 or 64 bits for it depending on I-don't-know-what). It couldn't do it across any interface, because of binary compatibility, but it could do it for internal variables, I guess. Because uint16_t is guaranteed to be 16 bits, but uint_fast16_t, short, and int are not. But those are only random thoughts.
        Yes, you spotted that correctly, but you only took into account the values for the ARM processor.
        On the ARM architecture the modulo operator is really fast when it comes to working with 32 bit integers. Thus, it should be regarded as a special case which was easily resolved using uint_fast8_t.
        What do we learn from it?
        Unless we strictly need a _real_ 8 bit integer, we should use uint_fast8_t or int_fast8_t for unsigned and signed values respectively, because they automatically resolve even the quirks of specific architectures.
        That it doesn't make a difference in this special case is definitely noteworthy, but when you take into consideration the other cases, where the smaller integer size did bring speed benefits, I think it is still better than using 32 bit integers:

        uint32_t
        AVR: 18,720 cycles
        MSP430: 14,805 cycles
        ARM Cortex: 384 cycles

        uint8_t
        AVR: 14,400 cycles
        MSP430: 11,457 cycles
        ARM Cortex: 434 cycles

        uint_fast8_t
        AVR: 14,400 cycles
        MSP430: 11,595 cycles [the extra ~140 cycles can be dismissed]
        ARM Cortex: 384 cycles

        Ultimately, this shows that reducing integer size _and_ letting the compiler choose bigger ones where more effective is the best way to go. The last result for the MSP430 is an exception and might be due to a compiler regression, which can happen pretty often.
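        For illustration, here is a sketch of what such a conversion could look like (my reconstruction, not the article's exact code; the uint_fast16_t for days is an assumption):

        #include <stdint.h>

        void split_time(uint32_t t, uint_fast16_t *days, uint_fast8_t *hours,
                        uint_fast8_t *minutes, uint_fast8_t *seconds)
        {
            /* The _fast types let each platform pick its preferred width:
               32 bits on the ARM Cortex, 8 bits on the AVR. */
            *seconds = t % 60u; t /= 60u;
            *minutes = t % 60u; t /= 60u;
            *hours   = t % 24u; t /= 24u;
            *days    = (uint_fast16_t)t;
        }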

        Best regards

        FRIGN



        • Originally posted by frign View Post
          That it doesn't make a difference in this special case is definitely noteworthy, but when you take into consideration the other cases, where the smaller integer size did bring speed benefits, I think it is still better than using 32 bit integers:
          FRIGN
          You're right, but I don't think it's an exception: it's completely rational that an 8 or 16 bit processor would suffer from computing on 32 bit values (and using long long on the ARM processor would probably add a bunch of instructions). On the other hand, I suspect most 32 bit or higher processors don't gain much from using smaller integers.
          Basically, if you target embedded, take care of sizes; otherwise it's probably not worth it (well, no harm done in using the _fast versions).



          • Too vague

            Originally posted by erendorn View Post
            You're right, but I don't think it's an exception: it's completely rational that an 8 or 16 bit processor would suffer from computing on 32 bit values (and using long long on the ARM processor would probably add a bunch of instructions). On the other hand, I suspect most 32 bit or higher processors don't gain much from using smaller integers.
            Basically, if you target embedded, take care of sizes; otherwise it's probably not worth it (well, no harm done in using the _fast versions).
            I rather suspect that this is one of the many quirks of the ARM architecture you only find out about by studying the compiler and architecture manuals.
            Speculating about integer sizes and their respective performance in combination is too vague for my tastes, so I leave this to the compiler engineers, who are the real experts in this regard.

            Taking care of sizes is definitely worth it! Maybe not when you are already bloating your program with Glib and excessive use of C++ classes anyway, but in many other cases, when certain parts of a program are executed very often, taking care of the signedness and size of variables is _critical_!
            The example of a driver section has already been given, where inefficient use of variables cut off a lot of potential. In the end, Linux users are the ones who suffer, because those minor issues can quickly add up.

            In this regard, thinking about this critical topic is essential for efficient software.



            • Originally posted by gens View Post
              damn, now i have to write a function

              xmms shuffles are like sudoku so i'll write one tomorrow when me head clears up

              and ye, i was thinking of just the raw brute force multiplication as it is a bit hard for a compiler since shuffling it to fit nicely needs a bit of planning
              like just a function compute(pointer, pointer, how_many)

              also interpreting the title "Is Assembly Still Relevant To Most Linux Software?" is, to be honest, not that easy
              if you count the 1% gained (guessing) from assembly in shared libraries, then it is relevant
              if you count things written directly in a program, then probably not that much (overall) except in the kernel and such low level things

              and again, assembly is not really to be used when not needed
              and its not as hard as everybody says
              I didn't ask you to write a function, but:
              - to give a valid case where you multiply a point with many matrices and not many points with one matrix
              - if it is the former case, why not set a glLoadMatrix (or glMultMatrix), which is computed with zero CPU if it is about to be displayed on screen
              - the single use case I thought would count: cases where, let's say, you have a replay, and your objects are an original object and a sequence of matrices. In this case, why not group the matrix multiplications, with very little extra computation to build them (they can be done in real time), and multiply the snapshots? Not to claim I'm such a smart guy: the idea comes from how Mercurial (hg) computes diffs yet keeps a full file snapshot when the threshold makes it worth the computation time. This is in fact why Python, an interpreted language/platform, can be the core of one of the fastest source code management systems (SCMs). This last case shows something else: you can have huge computations and, even using a not-so-optimizing compiler, get the performance you want; just use the algorithm best suited to your task. I redid the computations: the speedup was 20x, not 10x; it would be just 10x if the original solution were optimized as much as possible (like either tuning the C++ compiler or using mono -O=unsafe, which removes the bounds checking). Can you tune assembly to give a speedup based on a variable that you set as a global variable?

              So my argument still remains: where do you use this kind of coding such that you cannot use even a not-so-optimal algorithm and still have your code run properly? Also, your code will speed up all SSE2-capable machines (which would be great), but people with an Athlon XP (no SSE2 there) or people with a fancy tablet would not benefit from any of your optimizations.

              "the 1% and is still relevant"
              Is less, as you noticed, and I've told you, it makes sense at places to write assembly. Also, please read the kernel source, is not only assembly. Is only C, some binary blogs (that are basically assembly) and some assembly for atomics and setting some ports. Without wanting to misqoute you, this was and is my view:
              "if you count things written directly in a program, then probably not that much (overall) except in the kernel and such low level things"

              Let's take one case where assembly is listed in these projects and it doesn't make sense (to me): ioquake3. It uses QVM bytecode that is precompiled by a modified portable C compiler. At load time the VM parses the bytecode and generates assembly from it. Today we know more techniques and have access to more technologies to do this faster: LLVM does just this - it optimizes on the fly in a portable way. LLVM has its own disadvantages (mostly that it is a big library and has to be bundled). Big as in a few megs, so it would not work on a phone for a small game; but most packages that use it are a hundred+ MB to download anyway. I don't want to say that they *should* use LLVM (so as not to be misquoted later as giving advice to others without writing the code myself), but certainly LLVM could make the resulting scripting code visibly faster while at the same time giving independence from assembly.

              I also said in previous posts that many things are optimized in the graphics drivers (like glPushMatrix) and most users don't care how this is achieved; so if a video card supposedly doesn't support something in hardware, many things are today optimized through assembly. I see no point in removing it (at least as long as it gives more than a few percent speedup and there is no maintenance burden). Today, again, the tendency seems to be to support these features using LLVM (which to me seems a good trend).

              Going back to the software most people write: if this software doesn't have to create something very dynamic and very optimized on the fly (like a browser that has to deliver very fast JavaScript, and so eventually relies on assembly), I can hardly see where you could use assembly in applications. Even in the cases where assembly is used, like, say, the IonMonkey JS engine, it is used only at a very late stage, not to speed up the compiler but to generate better backend assembly.

              So dear gens, give me a sane use case where you would use your multi-matrix multiply on a vertex and why it couldn't be optimized with a good algorithm (grouping computations is one solution; most likely, if it were a replay where you can jump anywhere in the matrix list, a "key frame" would probably be set every second, so computations would only run from the keyframe up to the queried fraction of the second - very few computations to start with).
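              To make the grouping argument concrete, here is a minimal C sketch (mine, purely illustrative - not gens' code, and the names are hypothetical): since matrix multiplication is associative, a chain of matrices can be collapsed once and then applied to every vertex.

              typedef struct { float m[4][4]; } mat4;

              /* Multiply two 4x4 matrices: r = a * b. */
              static mat4 mat4_mul(const mat4 *a, const mat4 *b)
              {
                  mat4 r;
                  for (int i = 0; i < 4; i++)
                      for (int j = 0; j < 4; j++) {
                          float s = 0.0f;
                          for (int k = 0; k < 4; k++)
                              s += a->m[i][k] * b->m[k][j];
                          r.m[i][j] = s;
                      }
                  return r;
              }

              /* Collapse a chain of n matrices into one: transforming v
                 vertices then costs v matrix-vector products instead of v*n. */
              static mat4 mat4_chain(const mat4 *chain, int n)
              {
                  mat4 acc = chain[0];
                  for (int i = 1; i < n; i++)
                      acc = mat4_mul(&acc, &chain[i]);
                  return acc;
              }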



              • Originally posted by frign View Post
                I don't know if I can take you seriously any more. You still might not get the concept, I think.
                What's with the aggression? I've posted 2 or 3 times in the last few pages. I think you may have me confused with someone else, frign.
                Of course there are native 128 bit variables, why shouldn't there be any? There are also native 128 bit variables available on 32 bit processors! Gosh!
                Originally posted by frign View Post
                128 bit variables: They are even supported by the processor itself, so you don't need compiler extensions. You just need to address the right register and store the variable in it; if your compiler doesn't already implement this behaviour properly and requires extensions, I would recommend you switch to another one which does!

                128 bit pointers: Don't make sense. Why?

                I hope you get my point: If your processor can only handle address-lengths of 64 bits size, 128 bit addresses (--> 128 bit pointers) don't factually bloody make sense.
                Let's please call them what they are: registers. There are native 128 bit registers on modern CPUs (NOT ON ALL 32 BIT PROCESSORS!). AVR32 probably does not have anything wider either (I guess!!).
                So your CPU may or may not have 128 bit REGISTERS. This is not a problem for the developer, however. The compiler (let's only talk about gcc please) will translate your uint128 to whatever IS available on your CPU.
                Let's take the normal 32 bit x86-architecture with a single simple core, because it is simple to explain:
                Simpler, yes, but that doesn't count for 'all' 32 bit architectures! But let's go with this.
                There are specific registers to store specific integers of specified size. I'll list them for your convenience:
                1. 8 bit --> Registers AL, AH, ...
                2. 16 bit --> Registers AX, ...
                3. 32 bit --> Registers EAX, ...
                4. 64 bit --> Registers MM0, ...
                5. 128 bit --> Registers XMM0, ...


                As you can see, both big and small integers are natively integrated into the CPU, even though we can only have 32 bit addresses.
                Yes, you are talking about register sizes. And addressing. That is still not what this is about, and this is where you make your big thinking mistake, in two ways.
                First of all, ignoring the more important part for now, the ALU: yes, I totally agree with you that working with 8 bits on this example CPU can be faster for certain tasks (again, we are ignoring the ALU). It can be faster to load the register (it could pack 4 values and load 4 registers at once, for example), just as a 128 bit register will actually be slower to load/unload. And yes, the compiler can STORE a 128 bit value natively in a register without the CPU having to do nasty 32 bit to 128 bit conversion math. So a small speedup there.

                But here it comes, the point you overlook. You probably guessed it by now: the ALU. Generally speaking, we have 1 ALU per CPU, and that ALU has a certain bit width it performs its calculations with. I think it is safe to say that an 8 bit AVR has an 8 bit ALU, whereas a 32 bit x86 has a 32 bit ALU. Let's ignore SSE etc. extensions please, as I'm not up to speed on how that works in hardware, but I think it is very safe to assume the ALU is not involved, which is what causes the speedup. I guess we could consider it an additional mini-ALU that only knows very specific small tasks and operates on those 128 bit registers? But please let's not focus on the extensions.

                Now this 32 bit ALU is what we'd call the CPU's native bit width, and thus all calculations are performed in 32 bits. Let's say you want to add AL and AH and store the result in AX: the CPU has to read AH, CONVERT it to 32 bit (which may even be free of charge in cycle cost, but that depends on your architecture), then read AL, converting it to 32 bit (yes yes yes, it has to convert it; the ALU can only do math in its native format), and then do the addition. After that is done, it has to convert the 32 bit result back into a 16 bit register to place into AX; this conversion is probably a little more expensive (the overflow flag etc. comes to mind).

                Now if the compiler puts an 8 bit char into one 32 bit register (say EAX) and the resulting 8 bit int into another (say EBX), the ALU can read those registers without conversion, do the calculation and store the result right away, shaving off some cycles - and THAT is where your speedup comes from: working with the processor's NATIVE bit width, NOT register sizes.

                Using large integers eats up more RAM, granted, but it is not faster to use them in any way. I put it this way: Storing each integer as a 64 bit integer on 64 bit systems doesn't bring you benefits and there are specific registers for all sizes!
                It brings benefits to use smaller integers, because they are native to the CPU!
                And that myth is now hopefully debunked. So while yes, it will save on memory traffic and storage capacity, it can be slower (and is) due to the processor having to do magic to actually use those registers. And yes, you can micro-optimize your application for certain calculations to make better use of those registers and thus get an overall speedup despite the cost (e.g. using all 8 registers to do everything on the CPU (max 8 registers in this example), rather than loading everything into the 4 32 bit registers and having to read/write memory to obtain the new values). (If this specific point doesn't make sense I'll gladly type out an example.)
                Even better, 8 bit AVR and 16 bit MSP-430 doesn't mean we peak at 8 bits. I honestly have never worked with these processors, but I am sure they do support 8 bit integers. Scaling those up to greater lengths depends on the architecture, but you are not forcibly locked to the specified maximum address space.
                I have worked with 8 bit AVRs but immediately admit I'm not sure about the addressing capacity; I believe it was 16 bit, Wikipedia knows more.
                The ALU in this 8-bitter is, as explained above, 8 bit. So while this CPU can even work with 64 bit ints (and probably even with 128 bit ints), it will be at a cost. So yes, it does not mean we 'peak' at 8 bits; the ALU can do 64 bit math, but only in 8 bit chunks. The compiler could never 'optimize' this, as the smallest size IS the native width in this specific case.
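                To put that point in C (my sketch, not from this thread; the exact instructions emitted depend on the target):

                #include <stdint.h>

                /* The uint8_t version must wrap at 256, so on a 32 bit ALU the
                   compiler typically has to mask/extend the result; the
                   uint_fast8_t version is allowed to stay at the native width. */
                uint8_t sum_u8(uint8_t a, uint8_t b)
                {
                    return a + b; /* result truncated back to 8 bits */
                }

                uint_fast8_t sum_fast(uint_fast8_t a, uint_fast8_t b)
                {
                    return a + b; /* whatever width the platform picked */
                }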

                Originally posted by frign View Post
                Yes, you spotted that correctly, but you only took into account the values for the ARM processor.
                That it doesn't make a difference in this special case is definitely noteworthy, but when you take into consideration the other cases, where the smaller integer size did bring speed benefits, I think it is still better than using 32 bit integers:
                <snip>
                Ultimately, this shows that reducing integer size _and_ letting the compiler choose bigger ones where more effective is the best way to go. The last result for the MSP430 is an exception and might be due to a compiler regression, which can happen pretty often.
                Now you are almost speaking sensibly. The ARM isn't more efficient due to the modulo operator being in hardware (yes, that too, but let's forget that). What happened there is: the ARM is a NATIVE 32 bit CPU. It operates best on 32 bit registers. If you FORCE it to use an 8 bit value (by using uint8_t you do; it is 'guaranteed'), it will have to do those conversions talked about earlier. If you use uint_fast8_t, the compiler will do the smart thing and use a 32 bit register, hence the speed improvement. This is the reason why uint32_t is equally fast.

                The same logic holds for uint8_t and uint_fast8_t vs uint32_t on the AVR. 32 bits is NOT efficient for the 8 bit ALU, whereas 8 bits is.

                Also, using the C99 integer types will bring you the flexibility you expected: there are ways to implement integers of _at least_ n bit size (int_leastn_t), of the fastest at-least-n-bit size (int_fastn_t), or of a standard n bit size (int8_t). stdint.h already handles that for you and you don't even need special compilers to optimize in this regard.
                Now we're getting somewhere. I still think we don't need a special type, the compiler should be able to figure this out on its own.

                I as a developer KNOW my variable will fit in an 8 bit value. I thus rightfully declare it as u8 myvar;. The compiler should have the freedom to use whatever fits best on the CPU: either scale it up to 32 bit, or keep it as 8 bit if it means more registers can be used simultaneously (see above).

                Also, stdint.h isn't available in the kernel, so I say: let the compiler worry about that. It knows best what registers need to be loaded with what. A developer will never know how the registers are set up at a certain point in his application. If it IS needed for performance reasons, you can do some inline asm (btw, I am all for asm for optimization).
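                (For reference, the kernel carries its own fixed-width types instead; this is roughly what <linux/types.h> boils down to on most architectures - paraphrased, not the verbatim header:)

                typedef unsigned char      u8;
                typedef unsigned short     u16;
                typedef unsigned int       u32;
                typedef unsigned long long u64;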
                In this regards, storing certain datatypes in those native registers is not a big hurdle! Need to store an unsigned integer of at least 128 bit size? No problem, just use uint_least128_t and you are fine.
                Originally posted by frign View Post
                Can you give me proof or a specific example? Judging from the years I worked at low-level C-programming, smaller integers are faster than the standard ones and the compiler just can't know the range of the very integers it has to work with. It is just impossible.
                The only compiler I know capable of this is a standardised Ada-Compiler like GNAT-GCC, given the condition you work out the ranges in your code properly, but not GCC. The language itself doesn't allow it!
                Yes, you are fine; the compiler worries about the innards. The compiler knows (if you use -march etc.) what sizes the registers are. It knows there is no 128 bit register on this CPU, and it knows it will have to do some voodoo magic to split up your 128 bit var into 2x 64 bit registers (or whatever IS available on your architecture). Yes, the compiler does NOT know how much of a certain variable's range is used; as you say, it can't (well, not always, and it probably isn't allowed by the standard to know). You can HELP the compiler a little, however. YOU know it will fit into 8 bits, so YOU declare it as uint8_t. The compiler still does not know if it is 1 bit, 3 bits or 5 bits. But it does know it is 8 bits, and thus can use 16 bits, 32 bits or 64 bits as it sees fit.
                You are pointing out issues which do not exist! Integer sizes do not magically scale up or down depending on which architecture you are; it depends on the specific implementation of the architecture you are using.
                *G* I always simply stated, the compiler SHOULD scale up where needed.

                I hope this was clear enough already!
                I hope it did indeed clear up things to you too

                Sorry for the extremely large post, and if I'm wrong in certain areas do feel free to point them out (with proof), but don't nitpick on minor details that are not related or do not matter.
                Last edited by oliver; 10 April 2013, 05:10 AM. Reason: fix url tags



                • Originally posted by frign View Post
                  I rather suspect that this is one of the many quirks of the ARM architecture you only find out about by studying the compiler and architecture manuals.
                  Speculating about integer sizes and their respective performance in combination is too vague for my tastes, so I leave this to the compiler engineers, who are the real experts in this regard.

                  Taking care of sizes is definitely worth it! Maybe not when you are already bloating your program with Glib and excessive use of C++ classes anyway, but in many other cases, when certain parts of a program are executed very often, taking care of the signedness and size of variables is _critical_!
                  The example of a driver section has already been given, where inefficient use of variables cut off a lot of potential. In the end, Linux users are the ones who suffer, because those minor issues can quickly add up.

                  In this regard, thinking about this critical topic is essential for efficient software.
                  And this doesn't happen in proprietary software? Now you're really just talking horseshit. Really bad horseshit.

                  I have worked with proprietary (embedded) software. I have seen leaked (later GPLed) source and still work with it. There is so much crap produced because of the 'fast fast deadlines, nobody sees it anyway' mentality that it would make you want to stab your eyes out.

                  Yes, there is still a lot of bad code EVEN in open source out there. But you know what? It is waiting to be found, ready to be optimized. This specific radeon example was a hack job to get something working fast. It was probably overlooked since then (well, someone found it now). It just needs someone to step up and submit a patch. It is not impossible to fix; nobody _needs_ to suffer. The radeon driver (while production ready, I guess) is still under heavy development (or slightly abandoned). There simply aren't enough man-hours to spend on these performance optimizations. Once radeon (and the others) are feature complete and stable, I'm sure more interest will be put into performance optimizations, which should be spotted with profiling. It's only a matter of time, and this was a piss-poor example just to spread crap.



                  • Originally posted by oliver View Post
                    And this doesn't happen in proprietary software? Now you're really just talking horseshit. Really bad horseshit.

                    I have worked with proprietary (embedded) software. I have seen leaked (later GPLed) source and still work with it. There is so much crap produced because of the 'fast fast deadlines, nobody sees it anyway' mentality that it would make you want to stab your eyes out.

                    Yes, there is still a lot of bad code EVEN in open source out there. But you know what? It is waiting to be found, ready to be optimized. This specific radeon example was a hack job to get something working fast. It was probably overlooked since then (well, someone found it now). It just needs someone to step up and submit a patch. It is not impossible to fix; nobody _needs_ to suffer. The radeon driver (while production ready, I guess) is still under heavy development (or slightly abandoned). There simply aren't enough man-hours to spend on these performance optimizations. Once radeon (and the others) are feature complete and stable, I'm sure more interest will be put into performance optimizations, which should be spotted with profiling. It's only a matter of time, and this was a piss-poor example just to spread crap.
                    This is why there are a lot of script kiddies who read an assembly tutorial, and when they see some "3x faster" because it is assembly, they are so happy to say: look, all the code can be made 3x faster. With no connection either to reality or to their willingness to fix it.

                    They are also the guys who most likely don't contribute an assembly-ready implementation to Radeon, or Cairo, or glibc, or whatever library they think would benefit. And as development time is limited, people like them will always be frustrated that the proper sized int isn't used, or that the CPU has too big a latency, instead of contributing to the projects. I'm really curious where any of the guys here (who at least are assembly fans) have shown their "expertise".

                    I have worked with both OSS and proprietary software, and of course the quality of the code is driven by the people and the policies for writing code. Being FOSS doesn't mean better written. Sure, FOSS has different advantages (including the four freedoms of the GPL, or the ability to fix things after the fact), but blaming code that somebody else has written is, I think, a bad practice if you don't contribute back. I do remember saying (in some posts) that C++ is bad, but not because of its implementations; and even though my preferences did not include C++, that made me commit to/support other platforms I think have a future. (I don't want to make any self-promotion.)

                    So in short: which of you who consider yourselves "assembly friendly guys" have contributed to glibc, GNU as, FASM or whatever? Could you say where you contributed (in brief)? I mean, at least we would all benefit from your work in one form or another, right?



                    • Clearing it up

                      Originally posted by oliver View Post
                      Let's please call them what they are: registers. There are native 128 bit registers on modern CPUs (NOT ON ALL 32 BIT PROCESSORS!).
                      <snip>
                      *G* I always simply stated, the compiler SHOULD scale up where needed.

                      Sorry for the extremely large post, and if I'm wrong in certain areas do feel free to point them out (with proof), but don't nitpick on minor details that are not related or do not matter.
                      Thanks for marking your article up with bold text! People really should consider doing that more regularly.

                      Okay, first point: 128 bit variables and 128 bit pointers.
                      I was framing it around which datatypes are stored in which specific registers on the CPU. You might also call it datatypes stored in specialised registers, like data registers and address registers. But because there are so many other types of registers, I wanted to make clear what I meant, which would have been partly lost if I had just used the word "register".
                      Still, addresses of 128 bit length aren't required on systems with a native address size of 64 bit.

                      Looking at specific 128 bit data registers, for instance, it definitely depends on the architecture whether you have such data registers available. If not, an int_least128_t would still be possible in the sense that big datatypes of n bit size can be implemented in software, which is slower, though.
                      Nevertheless, common architectures like x86 (32 bit) natively support 128 bit data registers (with the SSE extension). I did not say this applied to _all_ 32 bit processors.

                      I didn't focus on the ALU aspect, because this case seemed trivial to me. It may be different for the AVR, but we do have n-bit ALUs on almost all modern architectures, which means there is no to very little overhead in the ALU handling variables narrower than the native address size.
                      Granted, back in the old days this was critical to know, because the bottleneck was definitely the processor.
                      The point you overlook is the fact that the processor has a cache to manage! Today, in the days of symmetric multiprocessing and branch prediction, memory is the new bottleneck. And if you insist on using huge datatypes equal to the native address size of the processor you are using, you shouldn't be surprised at how many cache misses you get.
                      You are talking about "magic" tasks the CPU has to perform to pass variables off to the ALU, which in itself is not as inflexible as you claim it to be, but you actually support my position:
                      Small datatypes are faster to process; big datatypes might be efficient in the sense of being native to the address size, but in reality they are too large to handle.

                      So, where are we heading? If we leave aside software-implemented big datatypes (because there is an individual approach to that for literally every architecture), the n-bit ALUs have no problem handling smaller integers, because the smaller registers are a subset of the bigger ones; when the CPU has to use an 8 bit register, it limits itself to the low 8 bits of the 64 bit register.
                      Using larger integers increases memory overhead, on the other hand, and as we know, a memory access is ~200x slower than an instruction cycle on the CPU.
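                      A trivial sketch of that memory argument in C (mine; the numbers are only illustrative):

                      #include <stdint.h>

                      #define N 1000000

                      uint32_t wide[N];   /* ~4 MB of traffic when streamed */
                      uint8_t  narrow[N]; /* ~1 MB - a quarter of the cache lines */

                      /* Summing either array does the same arithmetic, but the
                         narrow one touches far fewer cache lines. */
                      uint32_t sum_narrow(void)
                      {
                          uint32_t s = 0;
                          for (uint_fast32_t i = 0; i < N; i++)
                              s += narrow[i];
                          return s;
                      }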

                      Still, I should note that it is possible for the memory overhead to be lower than the overhead of the CPU->ALU conversions, but I find it hard to believe. Show this to me and you are perfectly right about your claims, and I am wrong on my part.


                      Talking ARM, the same case applies: uint_fast8_t basically means that you need an integer variable which is at least 8 bits wide and the fastest among the available choices. I am not an ARM expert, but your explanation is completely valid. Still, insofar as this added efficiency is based on a SIMD extension, and as we decided not to talk about CPU extensions, this aspect falls outside the CPU-ALU relation anyway.


                      So, I need to come to a conclusion here, because it is definitely a complex and interesting topic, and I don't want this post to become tl;dr.

                      You could have saved yourself the many words and only given your final statement:
                      the compiler SHOULD scale up where needed.
                      Because I definitely agree on that!
                      What you must understand is that the typedefs in stdint.h are just recommendations for certain architectures. If you have the time, you can go check it out.
                      And of course, they are not in any way forced instructions to actually handle variables at the specified least size on the low level. Those typedefs exist to give the compiler the freedom to use smaller data registers for variables where that promises a speed advantage.
                      If the processor is better at handling larger integers, then the compiler and the editors of stdint.h know that, because they've studied the architecture manuals, and accordingly add the freedom, for example, for smaller datatypes to still be stored in larger data registers, as in the ASM example you gave earlier.
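                      For example, a 64 bit glibc picks roughly the following (paraphrased, not the verbatim header; the exact typedefs vary per architecture, which is the whole point):

                      /* uint_fast8_t stays narrow, the others are widened: */
                      typedef unsigned char     uint_fast8_t;
                      typedef unsigned long int uint_fast16_t;
                      typedef unsigned long int uint_fast32_t;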

                      I don't say you are wrong in this area in any way; you prove to be an expert on microcontrollers. What you must realize is that today's CPUs have sadly shifted to being very efficient in themselves, but very inefficient when it comes to memory I/O. Keeping your variables small reduces that cost, unless you can show me that small datatypes are in fact stored in memory at the native address length or more.

                      As always, if you see a misconception on my side, please show me accordingly,
                      but don't nitpick on minor details that are not related or do not matter.

