
ARM Launches "Facts" Campaign Against RISC-V


  • Originally posted by Weasel View Post
@oiaohm Why include P4 in your example? I already admitted it was abysmally bad, and that's one of the reasons: it tried to be a RISC. Didn't work out too well, did it?
It was the 486 that went RISC, so before the P4. You keep saying the P4 was when Intel tried to be RISC; this is wrong.

    Originally posted by Weasel View Post
    Obviously, it will have to go to two execution units, that's normal physics, at least for memory operands (but not only).
The 386 and before had a memory-register architecture, not load/store; you don't have two execution units there.
    https://upload.wikimedia.org/wikiped...386DX_arch.png
Take a close look at the diagram of an i386.

"Obviously it will have to go to two execution units" is wrong. On the i386, all instructions are converted to single μops; the i386 is the last CISC processor Intel made that did that, and the last one that was a pure memory-register architecture. Since then you have had a load/store architecture. Instructions turning into multiple operations is taken straight from RISC. It is when Intel goes from the i386 to the i486 that Intel x86 ceases to be a pure CISC design.
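The memory-register vs load/store split being argued about here can be sketched in code. This toy decoder (plain Python; the instruction and μop formats are invented for illustration, not Intel's real decode logic) shows why an operation with a memory destination cannot stay a single μop on a load/store core:

```python
# Toy illustration (not real Intel decode logic): how a memory-operand
# instruction splits into load/store-style micro-ops on a load/store core.

def decode(instr):
    """Decode a simplified (op, dest, src) instruction into micro-ops.

    A register-register op stays a single ALU micro-op; an op with a
    memory destination becomes load + ALU + store.
    """
    op, dest, src = instr
    if dest.startswith("["):  # memory destination, e.g. "[0x1000]"
        return [("load", dest), (op, "tmp", src), ("store", dest)]
    return [(op, dest, src)]  # register-register: one micro-op

print(decode(("add", "[0x1000]", "eax")))  # three micro-ops
print(decode(("add", "ebx", "eax")))       # one micro-op
```

A register-register add stays one μop, while an x86-style `add [mem], reg` has to become load, then ALU op, then store on any load/store microarchitecture.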

Remember the RISC-V div instruction I showed: it had one half for load and one half for store. Note that you have 8 μop ports yet a pipeline 12 stages long on Skylake. There is a problem with that count.

    Originally posted by Weasel View Post
    That doesn't make the instruction RISC-like. Division for example has that, too. Why do RISC CPUs have division then?
Also, if you look at the 486 through Pentium 4, divide is 2 μops, yet on Skylake it is 4 μops. What is going on here? A lot of instructions increase their μop usage in x86 when Intel adds out-of-order execution to try to fix the oversized pipeline they made in the Pentium 4.

Implementing out-of-order execution on a RISC load/store architecture does in fact increase the number of μops for particular operations like divide: monitoring μops are added so the pipeline can find out when a task is done and send a new task to a slow execution unit. Yes, on an out-of-order RISC you expect the μop count to increase for some instructions.


The number of μops for each execution port: p0 means a μop to execution port 0; p01 means a μop that can go to either port 0 or port 1; p0 p1 means two μops going to port 0 and port 1, respectively.
This is from the Skylake tables, where it clearly states each port gets at least 1 μop.


    Some operations are not counted here if they do not go to any execution port or if the counters are inaccurate
I love this one: so we have a port that takes a pair of μops like RISC-V, yet thanks to this disclaimer we get to count them as 1, because we told you the counters could be inaccurate.

    Originally posted by Weasel View Post
    Anyway, I was going to write more stuff (you can look for yourself at non-memory related instructions), but honestly I'm tired of this thread so have at it.
Really, it is that you don't understand what out-of-order RISC μop usage looks like. The μop usage of div on registers and other things exactly matches what you would expect to see in an out-of-order RISC CPU going back to 1978, when IBM did the first out-of-order RISC.

Frankly, thank you for not writing more; it would have been wrong, because you would not have been looking at those μops to see which ones were monitoring μops for operating the out-of-order pipeline. You can expand out descriptions of every one of the x86 ports' μops if you have knowledge of out-of-order RISC CPUs, and everything in the operation then makes full sense.

Weasel, you are trying to weasel your way out. You claimed items were 1 μop when they were not. As I said, CISC/RISC instructions are macro-ops, meaning they are commonly processed as multiple μops inside the CPU; this becomes particularly true in out-of-order processors. None of the common out-of-order processors are based around a memory-register architecture; they are based around a load/store architecture using pipelines. Even the old 5-stage pipeline in the 486 shows clear load/store stages.

If you get the pipeline stage list for current Intel processors you will also find load/store stages, and what goes into particular ports has to be 2 μops, not one. That makes the document you have been attempting to quote to win an insanely rough guide, with mostly incorrect information once you get down to the nuts and bolts.

Basically, if someone attempted to build a clone of an Intel x86 processor following the Intel documentation for the P3 and newer exactly, they would be attempting the impossible, and that is intentional. This goes back to Intel being cloned in the 386/486/586 era by parties using Intel's own documentation.

I also love how Intel has renamed different load/store things; for example, fetch/retire is used for one load/store pair. If you do get a list of Intel pipeline operation names, you will notice the load/store operations are laid out exactly how you would expect for the out-of-order RISC systems IBM did in the 1970s.



    • Originally posted by oiaohm View Post
Also, if you look at the 486 through Pentium 4, divide is 2 μops, yet on Skylake it is 4 μops. What is going on here? A lot of instructions increase their μop usage in x86 when Intel adds out-of-order execution to try to fix the oversized pipeline they made in the Pentium 4.
I was asking about a RISC CPU having a DIVISION instruction. If it did increase in μops, then it means a RISC CPU can't have division, since it goes to multiple ports; do you see what you're saying now? Stop side-stepping the facts you're confronted with and trying to "weasel" your way out of my arguments.

      You can't have it both ways.

Previously you said that a RISC CPU can implement division in 1 clock cycle and other clueless bullshit, which is literally physically impossible. Now you say that a division is multi-μop, a macro-op, and goes to multiple execution units despite only having register operands. Which makes it a CISC instruction.

So which the fuck is it?

      Originally posted by oiaohm View Post
      You claimed items were 1 uop when they were not. As I said CISC/RISC instructions are maco ops as in they commonly are processed as multi uops inside the cpu as this comes particularly true in out of order processors.
If you send a μop to two execution units, it's still 1 μop.
      Last edited by Weasel; 08-07-2018, 08:33 AM.



      • Originally posted by Weasel View Post
I was asking about a RISC CPU having a DIVISION instruction. If it did increase in μops, then it means a RISC CPU can't have division, since it goes to multiple ports; do you see what you're saying now? Stop side-stepping the facts you're confronted with and trying to "weasel" your way out of my arguments.
This is not understanding the basics of load/store architecture. Load/store architecture means that a load is 1 μop and a store is another 1 μop. So a division on a load/store machine is always at least 2 μops.
This is the traditional 5-stage RISC pipeline:
IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write-back
So each instruction decodes into at least 3 μops: 1 μop for EX, 1 μop for MEM and 1 μop for WB. MEM is the load/store to memory, EX loads from registers, WB stores to registers. It does not take much thinking to see that with write-through registers you can drop to a 4-deep pipeline instead of 5 by splitting the MEM access in two.

More modern out-of-order designs have write-through in the MEM stage, or a split MEM, to allow more instructions to be stacked into a pipeline cycle.

A compacted cycle of a 1-deep-pipeline RISC: IF, EX and the MEM read access sit next to each other on the rising clock edge; ID, WB and the MEM write go on the falling edge. It is simple to do a RISC 5 deep.


        Originally posted by Weasel View Post
Previously you said that a RISC CPU can implement division in 1 clock cycle and other clueless bullshit, which is literally physically impossible. Now you say that a division is multi-μop, a macro-op, and goes to multiple execution units despite only having register operands. Which makes it a CISC instruction.
But it is possible to implement division in 1 clock cycle if you limit your clock speed or run dynamic clocking. A lot of old RISC designs exploited dynamic clocking, where on a divide the clock only stepped forward once the divide was complete. The traditional 5-stage RISC pipeline, putting the memory operation in the middle, gave a delay: a 5-stage pipeline takes 5 clock cycles from start to finish, and a 12-stage pipeline takes 12 clock cycles, of course.
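The pipeline arithmetic here can be sketched directly (plain Python, assuming an idealized in-order pipeline with no stalls):

```python
def pipeline_cycles(depth, n_instructions):
    """Cycles for n instructions through an ideal, stall-free pipeline:
    the first result appears after `depth` cycles, then one more
    instruction completes every cycle after that."""
    return depth + (n_instructions - 1)

# one instruction end to end: latency equals pipeline depth
print(pipeline_cycles(5, 1))    # classic 5-stage RISC
print(pipeline_cycles(12, 1))   # deeper pipeline, longer latency
# steady state: throughput approaches 1 instruction per clock
print(pipeline_cycles(5, 1000))
```

This is why a deeper pipeline hurts single-instruction latency while the sustained rate stays near one instruction per cycle either way.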

        Originally posted by Weasel View Post
If you send a μop to two execution units, it's still 1 μop.
The Skylake x86 documentation clearly states that each execution unit gets its own μop. In traditional RISC, each execution unit gets its own μop, and each execution unit is assigned to process at particular points in the pipeline/clock oscillation.

It was traditional CISC, like the i386 with its memory-register architecture, that sent a single μop to multiple execution units; the i386 was the last Intel chip to do this, except for the IA-64 Itanium that flopped. Sending a single μop to multiple execution units means having to keep those execution units synced, which turns out to be problematic, particularly as clock speeds go up. Load/store architecture is more tolerant of timing issues.

The early Berkeley and IBM RISC designs were the first to be macro-op, where RISC instructions are fairly simple, much like RISC-V.


        MULDIV multiplier multiplicand MUL/MULH[[S]U] dest OP

The above is the multiplier layout in RISC-V. This is like IBM's first RISC instruction set. The first half of the RISC instruction here is μop 1: MULDIV setting values in the MULDIV hardware. The second half, MUL/MULH, is μop 2: getting the results back to registers. μop 1 loads into the processing unit; μop 2 stores the result back into registers. Of course other μops, for pipeline delays and other things, can be added in the middle with RISC.

Load/store architecture means a μop either loads or it stores, never both. If you are doing a load and a store, it is always at least 2 μops on a load/store machine. Of course, a processing unit can take fused μops, where 2 μops are sent in at once, and a load/store design can appear CISC-like when many μops are being run at once.

Weasel, you have one thing very wrong. RISC from the start was macro-ops, where an instruction decoded into multiple μops. It was CISC that attempted, and failed, with the pure single-μop model; we do not see CISC chips being made any more that use it. MIPS from Stanford, Berkeley RISC and IBM's RISC have all never used a single μop per RISC instruction working on registers.



I honestly don't understand where you come up with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.

You keep saying out-of-order... you realize RISC does not equal out-of-order and has nothing to do with it? (By itself, sure, any CPU can use out-of-order execution.) So it's completely pointless as long as you equate out-of-order with RISC and think it's not CISC. And no, I don't care which CPU was the first out-of-order CPU; it doesn't matter even if it was RISC, it just happened to be so. (Maybe because RISC is easier to design, and back then technology was much more primitive.)

But whatever, now you'll copy-paste info about out-of-order and how pipelines work and so on, still unable to answer a simple question about div, then maybe some history about the first OoO CPU and so on, none of which has anything to do with CISC vs RISC as per their definitions... lol



          • Originally posted by Weasel View Post
I honestly don't understand where you come up with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.

You keep saying out-of-order... you realize RISC does not equal out-of-order and has nothing to do with it? (By itself, sure, any CPU can use out-of-order execution.) So it's completely pointless as long as you equate out-of-order with RISC and think it's not CISC. And no, I don't care which CPU was the first out-of-order CPU; it doesn't matter even if it was RISC, it just happened to be so. (Maybe because RISC is easier to design, and back then technology was much more primitive.)

But whatever, now you'll copy-paste info about out-of-order and how pipelines work and so on, still unable to answer a simple question about div, then maybe some history about the first OoO CPU and so on, none of which has anything to do with CISC vs RISC as per their definitions... lol
Load/store architecture is way simpler to make out-of-order than memory-register architecture, because μops do one thing at a time, so you can queue them on processing units/ports and trigger those units/ports at particular times.

RISC without out-of-order or instruction stacking historically had a problem that CISC designs back then did not.

Memory-register architecture can, with a single μop, directly perform an alteration on memory and registers at once. That single massive μop leads to high sync requirements, and syncing makes connection lengths inside the silicon, and out to the RAM chips, very important. Load/store needs multiple μops being processed at the same time to do the same thing. RISC was designed for load/store, where you are producing multiple simple μops.


            (by itself, sure any CPU can use out-of-order execution).
In fact, not all CPU architecture designs are suitable for out-of-order. Take the one-instruction-set computer (OISC): most of these cannot be made out-of-order because branch prediction is insanely hard to do on them.

CISC really does not lend itself to out-of-order. Take your instruction decoder reading ahead: in RISC it is quite simple to group non-conflicting loads and stores together for bulk processing. RISC was designed for out-of-order. Most CISC instruction sets like x86 were designed for in-order execution, using one μop to do many different things in a memory-register architecture, so they require more processing to turn instructions back into load/store information and produce what bulk processing requires.
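The "group non-conflicting loads and stores" idea can be sketched as a toy scheduler (plain Python; the op format and the conflict rule are simplified assumptions for illustration, not any real core's scheduling logic):

```python
def conflicts(a, b):
    """Two memory ops conflict when they touch the same address and at
    least one is a store (two loads of the same address are safe)."""
    return a[1] == b[1] and "store" in (a[0], b[0])

def group_ops(ops):
    """Greedily pack an in-order stream of (kind, address) ops into
    issue groups of mutually non-conflicting loads/stores that could
    be processed in bulk."""
    groups, current = [], []
    for op in ops:
        if any(conflicts(op, other) for other in current):
            groups.append(current)   # conflict: close the current group
            current = []
        current.append(op)
    if current:
        groups.append(current)
    return groups

stream = [("load", 0x10), ("load", 0x20),  # independent: same group
          ("store", 0x10),                 # conflicts with load 0x10
          ("load", 0x30)]                  # independent of the store
print(group_ops(stream))                   # two issue groups
```

The point of the sketch: when every op is a bare load or store, spotting independence is a simple address comparison; a memory-register instruction has to be cracked apart first before the same check can be applied.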

The reality is RISC was designed to sit on an architecture that uses multiple μops. Next, RISC CPUs were the first with pipelines, so you get a processing rate of 1 instruction per clock cycle even if an instruction takes longer than 1 clock cycle, as long as it takes less than the full pipeline time to complete.

Also, some RISC designs have dynamic clocking. The reason RISC-V and POWER chips don't need an extra nop instruction after a jump/branch is that the pipeline on those can take 2 cycles to complete a jump, yet this can appear as a single clock count. So it is really simple to do a divide in one clock tick in a RISC design when the clock tick is controlled by how fast the instruction is processed, due to pipelining.

Finally, what is the fastest transistor? You say it is electrically impossible to do a divide at 4 GHz. The fastest transistor in 2001, by IBM, switched at 210 GHz. So a divider built in a CPU using the fastest transistor technology has no problem running at 4 GHz, particularly when you free-run it. The problem is that this requires a more expensive silicon production method and paying IBM for patents on that method. IBM has been able to produce very fast math circuits for particular uses, like high-end military-grade software-defined radios; yes, those need to perform very fast multiplication and division.

The reasons why multiply and divide are slow are:
1) Heat: free-running these circuits generates a lot of heat. (Free-running means running without a clock: just feed power into the circuit and wait for the result.)
2) Cost of production, due to requiring a more expensive plant design. A slower-silicon 5 nm plant is already in the billions; this high-speed stuff doubles that cost again.
3) Cost of production, due to requiring more expensive materials for the silicon chips themselves.
4) Higher failure rates when producing the high-speed parts.

So yes, we could have a 1-clock-cycle 512-bit-wide divide for a CPU running at 4 GHz, as long as we were willing to pay a lot more for CPUs, have higher power bills and have larger cooling problems. Everything is a trade-off. The reality is we can go a lot faster than we do today, but we don't in desktop CPUs. The CPUs made for high-end software-defined radio go insanely fast, over 100 GHz, with insane cooling requirements, like diamond bonded directly to the silicon chip to draw heat away so it doesn't melt. Nothing I have said is impossible: IBM for quite some time with their RISC chips got div/mul into 1 clock cycle by making those units out of faster technology than anything else in the chip.



            • Originally posted by Weasel View Post
I honestly don't understand where you come up with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.
              Maybe he's a bot.



              • Originally posted by coder View Post
                Maybe he's a bot.
                Really I am not. The information I am giving is not random.

There are strict differences between a load/store architecture and a memory-register architecture.



                • Originally posted by oiaohm View Post
So yes, we could have a 1-clock-cycle 512-bit-wide divide for a CPU running at 4 GHz
                  You can't, gates have physical delays. They are already unclocked. There was a topic about this very thing on an asm forum that I lurk in. See this post and enlighten yourself.

You realize the speed of light is a thing, right? (Approx. 300 million m/s.) 210 GHz means it can only travel about 1.4 mm in a straight line before the next clock (I'm not sure how this works, but it's probably half of that, since it's a square wave), and that's excluding any delays, which are very significant (this is the absolute limit). I know you didn't say to have the entire CPU clocked that high, but no, it's not just impractical, it's probably physically impossible unless you specialize it just for that in the lab.
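The back-of-the-envelope figure here is easy to check (a quick Python sketch of the arithmetic; `travel_mm` is just an illustrative helper):

```python
C = 3.0e8  # speed of light, m/s (approx., as in the post)

def travel_mm(clock_hz):
    """Distance light covers in one full clock period, in millimetres."""
    return C / clock_hz * 1000.0

# at 210 GHz a signal moving at light speed covers about 1.43 mm per
# full period, and half that per half-cycle of a square wave -- the
# millimetre-scale figure quoted above
print(travel_mm(210e9))
```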

                  tl;dr it doesn't matter if you make it unclocked. The input to it is available only for one clock duty cycle.
                  Last edited by Weasel; 08-10-2018, 08:24 AM.



                  • Originally posted by oiaohm View Post
                    Really I am not.
                    A bot would say that.



                    • Originally posted by Weasel View Post
                      You can't, gates have physical delays. They are already unclocked. There was a topic about this very thing on an asm forum that I lurk in. See this post and enlighten yourself.

You realize the speed of light is a thing, right? (Approx. 300 million m/s.) 210 GHz means it can only travel about 1.4 mm in a straight line before the next clock (I'm not sure how this works, but it's probably half of that, since it's a square wave), and that's excluding any delays, which are very significant (this is the absolute limit). I know you didn't say to have the entire CPU clocked that high, but no, it's not just impractical, it's probably physically impossible unless you specialize it just for that in the lab.
Silicon features are measured in nm, so take 28 nm features: 0.000028 mm × 512 = 0.014336 mm. Multiply that by 10 to allow for some wiring and you are still under 0.2 mm of in-circuit travel distance for a 512-wide divide circuit done at 28 nm, from entry to the divide circuit to exit. The biggest cause of slowness is the chemistry used in the gates, which causes quite significant switching delays.

The speed of light is not the barrier; the barrier is gate chemistry. In 2001 IBM had the chemistry to have gates switching at 210 GHz. This is SiGe, and it has been pushed even faster: 798 GHz in 2014. 798 GHz switching allows for a 64-bit divide working at 12 GHz, a 128-bit divide working at 6 GHz and a 256-bit divide working at 3 GHz (which is faster than the general clock speed of a lot of Intel chips). Please note that was 2014; since then SiGe has been pushed even faster with different chemistry.
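The ratios behind these figures are easy to check with back-of-the-envelope arithmetic (plain Python; this only computes the clock-period budget in gate delays, and says nothing about whether a real divider's critical path actually fits in that budget):

```python
def gate_delays_per_cycle(switch_hz, clock_hz):
    """How many gate switching delays fit in one clock period, taking
    one gate delay as 1 / switch_hz."""
    return switch_hz / clock_hz

# 798 GHz gate switching against the clock rates quoted above
print(gate_delays_per_cycle(798e9, 12e9))  # budget for the 64-bit claim
print(gate_delays_per_cycle(798e9, 6e9))   # 128-bit at 6 GHz
print(gate_delays_per_cycle(798e9, 3e9))   # 256-bit at 3 GHz
```

So the claim boils down to roughly 66 gate delays per cycle at 12 GHz, doubling as the clock halves; whether a given divider design fits its critical path into that many delays is the real engineering question.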

At 4 GHz, using the latest SiGe chemistry and 7 nm production, completing a 512-bit-wide divide every clock cycle is really easy. You have time to spare.

In 2001 IBM could do a 64-bit divide at 3 GHz (using high-speed SiGe gates, at the time limited to 210 GHz); of course the PowerPC chip of 2001 was only running at 1.2 GHz. See, no problem doing a divide inside 1 CPU clock cycle; you just have to make your divider and multiplier out of the right chemistry, many times faster than every other part in the chip, but there is a price to pay for this. Yes, that was 180 nm tech. High-end PowerPC supercomputer chips were known for being insanely hot and power-hungry, but very fast at maths compared to the general PowerPC chips.

Weasel, basically your physical-limit claims are total bull crap. In fact, we are not exactly sure how much faster SiGe can go. The problem is that SiGe at high speed produces a heck of a lot of heat, with quite a decent amount of electrical loss.

One of the funny problems you hear RISC-V hardware makers talking about is that when doing complex maths on x86, less than 5 percent of the ALU's processing capability is in fact being used. That comes from the gap between the real speed a well-made ALU doing divide/multiply and other functions can run at, and the fact that the CPU clock speed is so many times slower than the ALU processing speed.

Intel does not use the fastest SiGe chemistry; they use the SiGe chemistry with lower heat production and electrical losses, and the price is slower switching speed. Everything is a trade-off. So yes, Intel's numbers show divide and multiply taking multiple clock cycles, but they do not have to; it is a choice at the silicon production stage of what chemistry you make the multiplier/divider out of. Do you have a chip that requires high power, produces more heat, requires more cooling and can divide/multiply very quickly (in the 4 GHz clock-speed range), or do you run divide/multiply slower, require a quarter of the power (for total processing) and generate between 1/8 and 1/16 of the heat? Think about it: x86 chips get hot enough without spewing out 8 to 16 times more heat.

