Our Last Time Benchmarking Ubuntu 32-bit vs. 64-bit


  • #71
    Originally posted by jacob View Post

    It doesn't work that way. GPRs are used to store a program's live variables, which are always few and don't increase in number as available RAM grows. It is true that all things being equal, more GPRs = better but in practice it's more complicated than that. During a context switch, the CPU must dump its entire internal state to RAM and reload another state from RAM. Just going from eight 32-bit GPRs to sixteen 64-bit GPRs means that the volume of transferred data is multiplied by 4, which means more bus cycles, more latency, more cache pressure etc.
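
    To put rough numbers on that, here is a quick Python sketch; it counts only the integer GPR file, ignoring flags, FP and SIMD state, so real context switches move even more data:

    Code:
    # Architectural GPR state that must be saved/restored per context switch.
    # Only the integer register file is counted; flags, segment, FP and SIMD
    # state make the real totals larger.
    old = 8 * 32 // 8     # eight 32-bit GPRs   -> 32 bytes
    new = 16 * 64 // 8    # sixteen 64-bit GPRs -> 128 bytes
    print(old, new, new // old)   # 32 128 4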

    Then there is code density. Given a common, simple two-operand instruction such as add r8, r9, with 16 GPRs each register is coded on 4 bits, which means that the two operands are coded using exactly 1 byte. With 32 GPRs you would need 10 bits, which means 2 bytes with 6 "wasted" bits. This doesn't matter much on RISC machines where code is sparse and there is lots of "waste" no matter what, but on a CISC it would undermine one of the main advantages of variable instruction length. On the other hand, RISC always needs more registers to do the same thing and broadly speaking, 32 GPRs on a RISC = 16 GPRs on a CISC.
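
    The operand-field arithmetic can be checked mechanically; a small sketch:

    Code:
    import math

    # Bytes needed for the operand field of a two-operand register-register
    # instruction, as a function of the number of GPRs.
    for gprs in (16, 32):
        bits = math.ceil(math.log2(gprs))          # 4 vs 5 bits per register
        operand_bytes = math.ceil(2 * bits / 8)    # 1 byte vs 2 bytes
        wasted = operand_bytes * 8 - 2 * bits      # 0 vs 6 unused bits
        print(gprs, bits, operand_bytes, wasted)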

    Also, the smaller the number of logical GPRs compared to physical, the easier it is for the instruction decoder to maintain a mapping without running out of available physical registers. By being able to feed the pipelines without having to wait for registers to become available, it can take full advantage of SMT and speculative execution, both of which are great performance features.

    Basically when designing a new ISA, you have to try to find the best possible compromise between all those pros and cons and 16 logical GPRs is a good number in that regard for a CISC processor.
    So the point of your first paragraph really is an architectural detail, one that I believe all the major processor manufacturers, Intel and AMD included, are fully capable of designing around. And the point of your second paragraph really is a compiler detail, one that I believe all major OSS compilers, GCC and LLVM included, are fully capable of optimizing away.

    But the point of your third paragraph is rock solid and I totally agree with it. The front end on such a monstrosity would be horrendous. It would probably have to be a native RISC interface, and I guess that right there is the biggest and most obvious reason why 16 GPRs is where we're at.

    EDIT: As far as modern x86 CPUs go, I think it's well past due for a new round of "simplification".
    Last edited by duby229; 02 October 2017, 07:20 PM.



    • #72
      Originally posted by duby229 View Post

      So the point of your first paragraph really is an architectural detail, one that I believe all the major processor manufacturers, Intel and AMD included, are fully capable of designing around.
      Sure they can. You can always transfer more data faster by using more silicon, bigger caches, greater bus bandwidth etc. In other words, by having much more expensive CPUs and much more expensive motherboards. Would it be worth it? 8 GPRs was clearly not enough and going to 16 makes a helluva difference. But the result of going from 16 to 32 would be much less clear and probably not worth the cost.

      Originally posted by duby229 View Post
      And the point of your second paragraph really is a compiler detail, one that I believe all major OSS compilers, GCC and LLVM included, are fully capable of optimizing away.
      Ummm no. To select one register out of 16 you need 4 bits; to select one out of 32 you need 5 bits, and no compiler in the world is going to change that. The point is that, everything else being equal, 16 GPRs have the advantage that for binary register-register ops the two operands fit nicely in exactly one byte, without wasting unused bits, while with 32 GPRs you would need two bytes, with six bits left unused. That encoding scheme is hardwired into the CPU; compilers have absolutely no discretion there.

      Originally posted by duby229 View Post
      But the point of your third paragraph is rock solid and I totally agree with it. The front end on such a monstrosity would be horrendous. It would probably have to be a native RISC interface, and I guess that right there is the biggest and most obvious reason why 16 GPRs is where we're at.

      EDIT: As far as modern x86 CPUs go, I think it's well past due for a new round of "simplification".
      I'm not sure what kind of "front end" you are referring to. All modern CPUs have a front-end that decodes instructions (from one or several software threads), translates them into native operations, maps the logical GPRs used by the code onto the physical registers available, and then puts the translated code in a queue from which it is fed into execution slots. Having a "native RISC interface" would not change anything about this principle, and it is a Good Thing that it is this way. There was one attempt to "simplify" things by getting rid of that front-end and compiling directly into the core's "native" code. It was called Itanium, and it wasn't exactly a resounding success.
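
      For illustration, a toy model of that renaming step (the sizes are made up and freeing registers at instruction retirement is omitted):

      Code:
      # Toy sketch of register renaming in a CPU front-end: every write to a
      # logical GPR is mapped to a fresh physical register from a free list.
      LOGICAL, PHYSICAL = 16, 48    # illustrative sizes, not any real core's
      free = list(range(PHYSICAL))
      rat = {r: free.pop(0) for r in range(LOGICAL)}   # register alias table

      def rename_write(logical_reg):
          # Allocate a new physical register for a write; stall if none free.
          if not free:
              raise RuntimeError("rename stall: out of physical registers")
          rat[logical_reg] = free.pop(0)
          return rat[logical_reg]

      # Two back-to-back writes to r3 land in different physical registers,
      # which is what lets independent uses of r3 execute out of order.
      print(rename_write(3), rename_write(3))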



      • #73
        Originally posted by jacob View Post

        Sure they can. You can always transfer more data faster by using more silicon, bigger caches, greater bus bandwidth etc. In other words, by having much more expensive CPUs and much more expensive motherboards. Would it be worth it? 8 GPRs was clearly not enough and going to 16 makes a helluva difference. But the result of going from 16 to 32 would be much less clear and probably not worth the cost.

        Ummm no. To select one register out of 16 you need 4 bits; to select one out of 32 you need 5 bits, and no compiler in the world is going to change that. The point is that, everything else being equal, 16 GPRs have the advantage that for binary register-register ops the two operands fit nicely in exactly one byte, without wasting unused bits, while with 32 GPRs you would need two bytes, with six bits left unused. That encoding scheme is hardwired into the CPU; compilers have absolutely no discretion there.

        I'm not sure what kind of "front end" you are referring to. All modern CPUs have a front-end that decodes instructions (from one or several software threads), translates them into native operations, maps the logical GPRs used by the code onto the physical registers available, and then puts the translated code in a queue from which it is fed into execution slots. Having a "native RISC interface" would not change anything about this principle, and it is a Good Thing that it is this way. There was one attempt to "simplify" things by getting rid of that front-end and compiling directly into the core's "native" code. It was called Itanium, and it wasn't exactly a resounding success.
        Oh come on now. You're just nitpicking what I said. Trust me, if I think 16 GPRs isn't even close to enough, you can bet I think 32 isn't either. Like I said, it's an architectural detail (that the compiler can optimize for). And no duh, when I say front end it's the same thing you mean by front end. How you describe a front end is exactly how I describe a front end, so what was the point in making that point?

        EDIT: And about Itanium, it was an EPIC architecture, that very much resembled VLIW.... Everybody could have told you there was no way it could perform well as a general purpose processor. Hell even I knew it was doomed. It was VLIW for crying out loud. It was a "no shit" moment.
        Last edited by duby229; 02 October 2017, 07:56 PM.



        • #74
          Originally posted by sdack View Post
          You're basically warming up an old argument: RISC vs. CISC.
          Huh? I don't see how anything I wrote that you quoted has anything to do with "RISC vs. CISC".



          • #75
            Originally posted by jacob View Post
            It doesn't work that way. GPRs are used to store a program's live variables, which are always few and don't increase in number as available RAM grows. It is true that all things being equal, more GPRs = better but in practice it's more complicated than that.
            This.

            With 32 GPRs you would need 10 bits, which means 2 bytes with 6 "wasted" bits.
            No. At least on a typical RISC ISA with 32-bit instructions, the instructions are 32-bit aligned, yes (one reason why RISC decoders can be simpler than CISC decoders), but the fields inside the instruction, including the register addresses, are tightly packed; there is no need for them to be byte-aligned.
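
            RISC-V's R-type format is a concrete example; its three 5-bit register fields sit at bit offsets 7, 15 and 20, none of them byte-aligned:

            Code:
            # RISC-V R-type encoding of "add x1, x2, x3": the three 5-bit
            # register fields sit at bit offsets 7, 15 and 20, packed
            # tightly rather than byte-aligned.
            opcode, rd, funct3, rs1, rs2, funct7 = 0b0110011, 1, 0, 2, 3, 0
            insn = (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
                   | (funct3 << 12) | (rd << 7) | opcode
            print(hex(insn))   # 0x3100b3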

            Basically when designing a new ISA, you have to try to find the best possible compromise between all those pros and cons and 16 logical GPRs is a good number in that regard for a CISC processor.
            Yes.



            • #76
              Originally posted by duby229 View Post
              EDIT: As far as modern x86 CPUs go, I think it's well past due for a new round of "simplification".
              Isn't this more of ARM's target market?



              • #77
                Originally posted by torsionbar28 View Post
                Isn't this more of ARM's target market?
                Not at all, ARM's target market is low power mobile devices. More like PowerPC really, except that ISA has a monstrous front end too.



                • #78
                  Originally posted by jabl View Post
                  No. At least on a typical RISC ISA with 32-bit instructions, the instructions are 32-bit aligned, yes (one reason why RISC decoders can be simpler than CISC decoders), but the fields inside the instruction, including the register addresses, are tightly packed; there is no need for them to be byte-aligned.
                  It doesn't matter if it's byte-aligned. On a RISC all instructions will be say 32 bits, no matter how many of these bits are used to select registers. Simplifying slightly, on a CISC with 16 GPRs, an op like MOV Rx, Ry can be coded on two bytes only: one byte for the opcode + flags and one byte for the 2 registers. If you have 32 GPRs you need 10 bits for the 2 operands, so your instruction would suddenly become 3 bytes (1 byte for opcode + flags, 1 byte for the 1st operand + 3 bits of the 2nd operand and 1 byte for the remaining 2 bits, plus 6 unused bits). The code would be less dense, that was the point I was trying to make.
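
                  A sketch of those two hypothetical encodings (the formats here are invented for illustration, not taken from any real ISA):

                  Code:
                  # Sketch of the hypothetical MOV encodings above: one opcode
                  # byte plus the two register numbers packed into as few bytes
                  # as they fit. Invented format for illustration only.
                  def encode_mov(rx, ry, reg_bits):
                      opcode = 0x89                      # arbitrary placeholder
                      operands = (rx << reg_bits) | ry   # pack both registers
                      n = -(-2 * reg_bits // 8)          # ceil(): 1 or 2 bytes
                      return bytes([opcode]) + operands.to_bytes(n, "big")

                  print(len(encode_mov(7, 9, 4)))   # 2 bytes total, 16 GPRs
                  print(len(encode_mov(7, 9, 5)))   # 3 bytes total, 32 GPRs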



                  • #79
                    Originally posted by duby229 View Post
                    Oh come on now. You're just nitpicking what I said. Trust me, if I think 16 GPRs isn't even close to enough, you can bet I think 32 isn't either. Like I said, it's an architectural detail (that the compiler can optimize for). And no duh, when I say front end it's the same thing you mean by front end. How you describe a front end is exactly how I describe a front end, so what was the point in making that point?
                    Not meaning to offend, but based on your writing so far I don't "trust you" when it comes to discussing CPU design. First of all, do you have any numbers to back up your claim that 16 or 32 registers are "not even close to enough"? What would be "enough" then, and why? How many programs do you know that routinely have more than 16 or 32 live variables simultaneously?

                    Secondly, the number of GPRs is far from being an architectural detail, as you seem to think. On the contrary, it's the central part of the ISA definition that sets in stone the various tradeoffs the CPU settles on. It must make sense with regard to other aspects of the ISA (in particular, broadly speaking, fewer registers = more addressing modes needed). The CPU's internal dispatch algorithms, the compilers' register allocation algorithms and, most importantly, the platform's ABI specification are all directly impacted by it. There is a reason why CPU manufacturers don't add 32 or 64 new GPRs with each new version of their chips, and no, it's not because they are too stupid.

                    Originally posted by duby229 View Post
                    EDIT: And about Itanium, it was an EPIC architecture, that very much resembled VLIW.... Everybody could have told you there was no way it could perform well as a general purpose processor. Hell even I knew it was doomed. It was VLIW for crying out loud. It was a "no shit" moment.

                    The Itanium instruction set was very similar to what Core iX and Xeon cores use internally. The whole concept was to get rid of dynamic translation, dispatch and register renaming and let the compiler do it all. Of course I agree wholeheartedly that it was a monumentally dumb idea and can't imagine how someone could become convinced that it could possibly work. But the point remains that your constant insistence that the compiler could somehow "optimize for" an arbitrarily chosen number of GPRs (whatever that means in your mind) is bollocks, because the number of GPRs has a direct impact on cost, performance, ISA and ABI definition, none of which has anything to do with a given compiler's optimisation capabilities.



                    • #80
                      Originally posted by jabl View Post
                      Huh? I don't see how anything I wrote that you quoted has anything to do with "RISC vs. CISC".
                      A part of the argument for RISC was the introduction of more registers.
