ARM Launches "Facts" Campaign Against RISC-V


  • Originally posted by Weasel View Post
    "uops fused domain" and then "1" for a "r,m" operation -> https://www.agner.org/optimize/instruction_tables.pdf

    The u in uop comes from the symbol for micro (μ), you know.

    So yes it's 1 uop, this is a fact, and I don't give a shit what you claim.
    Let's go through that file and read it.

    The μop definitions.
    INTEL:
    The number of μops from the decoder or ROM. A μop that goes to multiple units is counted as one.

    Hang on, that is not a valid definition of a μop; that is in fact the definition of a macro-op. A μop goes to a single processing unit. Going to multiple processing units means it should be counted as multiple μops. You find this on pages 298, 308 and 318.


    Page 289: the Intel Atom documentation was questioned, so they corrected it in future Intel write-ups.
    The number of μops from the decoder or ROM.
    Notice that this is where the newer definitions start.

    Pages 278/267, Intel Pentium 4 w. EM64T:
    Number of μops issued from instruction decoder and stored in trace cache.


    "μops fused domain" is another way of saying macro-op.

    Page 247, the Skylake definition of terms:
    μops each port:
    The number of μops for each execution port. p0 means a μop to execution port 0. p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to port 0 and 1, respectively.


    Originally posted by Weasel View Post
    So yes it's 1 uop, this is a fact, and I don't give a shit what you claim.
    Really, it's not what I claim; it's that you have not read the document you are quoting correctly, because it proves your claim is bogus.

    If you had read the definition carefully you would have found the problem, Weasel. On Skylake it is 1 μop per port used, and each port is triggered at a particular point in the pipeline.

    Let's go down to CRC32 r,m, which you claimed was CISC. It gets converted into p1 p23; p23 is a RISC-style load from memory. Now look at CRC32 r,r: it is just p1, 1 μop, exactly as it would be on a RISC.
    So CRC32 r,m gets converted to one processing op using port 1 and one load-from-memory op using port 2 or 3. This is not reading very CISC-like. Please note that ports 2/3 populate registers with values from memory on Skylake, be they registers you can see as a programmer or ones you cannot.

    CRC32 r,m takes 2 μops on Skylake, and it would take 2 μops on a RISC processor as well. Skylake performs fusion so these μops are processed in one pipeline cycle.
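    A minimal C sketch of the two forms being argued about (my own example, not from Agner Fog's document; the port numbers in the comments are the Skylake figures quoted above, and the compile flag is an assumption about your toolchain):

    ```c
    /* Sketch only: assumes an SSE4.2-capable compiler, e.g. build with -msse4.2. */
    #include <nmmintrin.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int crc = 0, value = 0x12345678u;
        unsigned int buf[1] = { 0xCAFEBABEu };

        /* Register source: likely compiles to CRC32 r,r -> 1 uop on port 1 (p1). */
        crc = _mm_crc32_u32(crc, value);

        /* Memory source: the compiler may emit CRC32 r,m ->
         * p1 for the CRC calculation plus p23 for the load. */
        crc = _mm_crc32_u32(crc, buf[0]);

        printf("crc = %08x\n", crc);
        return 0;
    }
    ```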

    Weasel, basically if you had read the document you were quoting correctly you would have worked out that internally Intel Skylake is pure RISC with fusion. Intel Skylake is a CISC instruction set being converted to RISC by the instruction decoder, giving a CISC-like appearance by using hidden registers and never working directly on memory, so it is without question RISC.

    "uops fused domain" is nothing more than performing part of μop fusion so as to run as many μops as possible per pipeline cycle.

    So a smart RISC instruction decoder that reads ahead and stacks the load operation with the processing operation in the same pipeline cycle generates the same performance pattern as Intel's x86 processors.

    Weasel, like it or not, at the core of CPU design the RISC load/store architecture and pipelines won. CISC is now just a pretence kept up by the instruction decoder.

    You can see Intel's attempts over the years to hide that the core is RISC by creating new terms containing the word μops, or by using invalid definitions of μops, so they did not have to admit the fact. The Skylake documentation is where Intel is somewhat truthful.

    If you have a pure RISC instruction set you don't need a complex instruction decoder generating RISC on the fly, so you only have to focus on μop fusion to get performance; RISC is therefore simpler to build than CISC. There is no magic performance advantage for CISC: most of the CPUs people point at as well-performing CISC are in fact RISC designs at their core.

    CISC is a method of attempting to compress your instructions to consume less memory and hopefully make filling the μop slots simple, but that does not work out, because to fill the usable μops per pipeline cycle you end up having to read instructions ahead anyhow. Take that CRC32 r,m: it only uses 2 of the possible 8 μop slots, so you need to read the next instruction anyhow to find out whether you could process it in this pipeline cycle as well (a toy sketch of this read-ahead follows below). So be it CISC or RISC, to have a high-performing CPU you have to implement instruction read-ahead to fill the available μop slots per pipeline cycle.
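    As mentioned, here is a toy read-ahead sketch in C (entirely my own model, not Intel's scheduler; the port names are illustrative): decoded uops are tagged with the port they need, and each "pipeline cycle" the front end reads ahead and issues at most one uop per port.

    ```c
    #include <stdio.h>

    enum { P0, P1, P23, NPORTS };              /* illustrative port names */

    struct uop { const char *name; int port; };

    int main(void)
    {
        /* e.g. CRC32 r,m (p23 load + p1 calc) followed by independent work */
        struct uop stream[] = {
            { "crc32 load", P23 }, { "crc32 calc", P1 },
            { "add r,r",    P0  }, { "next load",  P23 },
        };
        int n = (int)(sizeof stream / sizeof stream[0]);
        int issued = 0, cycle = 0;

        while (issued < n) {
            int busy[NPORTS] = { 0 };
            printf("cycle %d:", cycle++);
            for (int i = issued; i < n; i++) { /* read ahead in program order */
                if (busy[stream[i].port])
                    break;                     /* port taken: slot filling stops */
                busy[stream[i].port] = 1;
                printf("  %s", stream[i].name);
                issued++;
            }
            printf("\n");
        }
        return 0;
    }
    ```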

    CISC on paper seems like a good idea until you implement it and find out it does not work that well. Keep your instruction set simple and it can be simply compressed, getting most of the size saving of CISC, in fact more saving than CISC in most cases. Reading multiple instructions ahead, which you have to do anyhow with CISC to fill the μop slots, kills all the theoretical advantages of CISC once a RISC does the same.

    Comment


    • @oiaohm Why include P4 in your example? I already admitted it was abysmally bad, and that's one of the reasons, because it tried to be a RISC; didn't work out too well, did it?

      Obviously, it will have to go to two execution units, that's normal physics, at least for memory operands (but not only). That doesn't make the instruction RISC-like. Division for example has that, too. Why do RISC CPUs have division then?

      Anyway, I was going to write more stuff (you can look for yourself at non-memory related instructions), but honestly I'm tired of this thread so have at it.

      Comment


      • Originally posted by Weasel View Post
        @oiaohm Why include P4 in your example? I already admitted it was abysmally bad, and that's one of the reasons, because it tried to be a RISC; didn't work out too well, did it?
        It was the 486 that went RISC, so before the P4. You keep saying the P4 was when Intel tried to be RISC; this is wrong.

        Originally posted by Weasel View Post
        Obviously, it will have to go to two execution units, that's normal physics, at least for memory operands (but not only).
        The 386 and before had a memory-register architecture design, not load/store; you don't have two execution units.
        https://upload.wikimedia.org/wikiped...386DX_arch.png
        Take a close look at the diagram of an i386.

        "Obviously it will have to go to two execution units" is wrong. On the i386 all instructions are converted to single μops; the i386 is the last CISC processor Intel made that did that, and the last one that was pure memory-register architecture. Since then you have had a load/store architecture. Instructions turning into multiple operations is taken straight from RISC half-instructions. It is when Intel goes from i386 to i486 that Intel x86 ceases to be a pure CISC design.

        Remember the RISC-V div instruction I showed. It had one half for load and one half for store. Note that you have 8 uop ports yet a pipeline 12 deep on Skylake; there is a problem with this count.

        Originally posted by Weasel View Post
        That doesn't make the instruction RISC-like. Division for example has that, too. Why do RISC CPUs have division then?
        Also, if you look from the 486 to the Pentium 4 the divide is 2 uops, yet on Skylake it is 4 uops. What is going on here? A lot of instructions increase in uop usage in x86 when Intel does out-of-order execution to attempt to fix the oversized pipeline they made in the Pentium 4.

        There is an effect of implementing out-of-order execution in a RISC load/store architecture: it in fact increases the number of uops for particular operations like divide. Monitoring uops are added so the pipeline can find out when a task is done and send a new task to a slow execution unit. Yes, in an out-of-order RISC you expect the uop count to increase on some instructions.


        The number of μops for each execution port. p0 means a μop to execution port 0. p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to port 0 and 1, respectively.
        This is from the Skylake section, where it clearly states that each port gets at least 1 μop.


        Some operations are not counted here if they do not go to any execution port or if the counters are inaccurate
        I love this one: so we have a port that takes a pair of μops like RISC-V, yet due to this disclaimer we get to count them as 1, because we told you the counters could be inaccurate.

        Originally posted by Weasel View Post
        Anyway, I was going to write more stuff (you can look for yourself at non-memory related instructions), but honestly I'm tired of this thread so have at it.
        Really, it's that you are clueless and don't understand what out-of-order RISC uop usage looks like; otherwise you would see that the uop usage of div on registers and other things matches exactly what you would expect to see in an out-of-order RISC CPU, going back to 1978 when IBM did the first out-of-order RISC.

        In reality, thank you for not writing more stuff; it would have been wrong, because you would not have been looking at those uops to see which ones were monitoring uops for the operation of the out-of-order pipeline. Yes, you can expand out descriptions of the uops on every one of the x86 ports if you have knowledge of RISC out-of-order CPUs, and everything in the operation then makes full sense.

        Weasel, you are trying to weasel your way out. You claimed items were 1 uop when they were not. As I said, CISC/RISC instructions are macro-ops, in that they are commonly processed as multiple uops inside the CPU, and this becomes particularly true in out-of-order processors. All the common out-of-order processors are not based around memory-register architecture but around load/store architecture using pipelines. Even the old 5-stage pipeline in the 486 shows clear load/store stages.

        If you get the pipeline stage list for current Intel processors you will also find load/store stages, and what goes into particular ports has to be 2 uops, not one. That makes the complete document you have been attempting to quote to win an insanely rough guide, with mostly incorrect information once you get down to the nuts and bolts.

        Basically, if someone attempted to build a clone of an Intel x86 processor by following the Intel documentation for the P3 and newer exactly, they would be attempting the impossible, and this is intentional. This goes back to Intel being cloned at the 386/486/586 by parties using Intel's own documentation.

        I also love how Intel has renamed different load/store things, like fetch/retire being used for one load/store pair. If you do get a list of Intel pipeline operation names you will notice the load/store operations lay out exactly how you would expect from the out-of-order RISC systems IBM did in the 1970s.

        Comment


        • Originally posted by oiaohm View Post
          Also, if you look from the 486 to the Pentium 4 the divide is 2 uops, yet on Skylake it is 4 uops. What is going on here? A lot of instructions increase in uop usage in x86 when Intel does out-of-order execution to attempt to fix the oversized pipeline they made in the Pentium 4.
          I was asking about a RISC CPU having a DIVISION instruction. If it did increase in uops then it means a RISC CPU can't have division, since it goes to multiple ports; do you see what you're saying now? Stop side-stepping the facts you're confronted with and trying to "weasel" your way out of my arguments.

          You can't have it both ways.

          Previously you said that a RISC CPU can implement division in 1 clock cycle and other clueless bullshit which is literally impossible in physics. Now you say that a division is multi-uop and macro op and goes to multiple execution units despite only having register operands. Which makes it a CISC instruction.

          So which one the fuck is it?

          Originally posted by oiaohm View Post
          You claimed items were 1 uop when they were not. As I said CISC/RISC instructions are maco ops as in they commonly are processed as multi uops inside the cpu as this comes particularly true in out of order processors.
          If you send an uop to two execution units it's still 1 uop.
          Last edited by Weasel; 08-07-2018, 08:33 AM.

          Comment


          • Originally posted by Weasel View Post
            I was asking about a RISC CPU having a DIVISION instruction. If it did increase in uops then it means a RISC CPU can't have division, since it goes to multiple ports; do you see what you're saying now? Stop side-stepping the facts you're confronted with and trying to "weasel" your way out of my arguments.
            This is not understanding the basics of load/store architecture. Load/store architecture means that a load is 1 uop and a store is another 1 uop. So a division on a load/store machine is always 2 uops.
            This is a traditional 5-stage RISC pipeline:
            IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back
            So each instruction decodes into at least 3 uops: 1 uop for EX, 1 uop for MEM and 1 uop for WB. MEM is the load/store from memory, EX loads from registers, WB stores to registers. It does not take much thinking to see that with write-through registers you can drop to a 4-deep pipeline instead of 5 by splitting the MEM access in two.

            More modern out-of-order designs have write-through in the MEM stage, or split MEM, to allow more instructions to be stacked into a pipeline cycle.

            A compacted-cycle 1-deep pipeline RISC puts Fetch, EX and the MEM read access next to each other on the up clock, and ID and the WB/MEM write on the down clock. It is simple to do a RISC 5 deep.
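            A tiny C illustration of that textbook 5-stage timing (my own sketch, nothing vendor-specific): each instruction takes 5 cycles end to end, but once the pipeline is full one instruction completes every cycle.

            ```c
            /* Prints which stage each instruction occupies per clock cycle. */
            #include <stdio.h>

            int main(void)
            {
                const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
                const int nstages = 5, ninstr = 4;

                printf("        ");
                for (int c = 0; c < nstages + ninstr - 1; c++)
                    printf("c%-4d", c);
                printf("\n");

                for (int i = 0; i < ninstr; i++) {
                    printf("instr %d ", i);
                    for (int c = 0; c < nstages + ninstr - 1; c++) {
                        int s = c - i;              /* stage this instr is in */
                        printf("%-5s", (s >= 0 && s < nstages) ? stages[s] : ".");
                    }
                    printf("\n");
                }
                return 0;
            }
            ```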


            Originally posted by Weasel View Post
            Previously you said that a RISC CPU can implement division in 1 clock cycle and other clueless bullshit which is literally impossible in physics. Now you say that a division is multi-uop and macro op and goes to multiple execution units despite only having register operands. Which makes it a CISC instruction.
            But it is possible to implement division in 1 clock cycle if you limit your speed or are running dynamic clocking. A lot of old RISCs exploited dynamic clocking, where on a divide the clock only stepped forward when the divide was complete. The traditional 5-stage RISC pipelining, putting the memory operation in the middle, gave a delay. So a 5-long pipeline takes 5 clock cycles from beginning to end; a 12-long pipeline takes 12 clock cycles, of course.

            Originally posted by Weasel View Post
            If you send an uop to two execution units it's still 1 uop.
            The Skylake x86 documentation clearly states that each execution unit gets its own uop. In traditional RISC each execution unit gets its own uop, and each execution unit is assigned to process at particular points in the pipeline/clock oscillation.

            It was traditional CISC like the i386, using memory-register architecture, that sent a single uop to multiple execution units; yes, the i386 was the last Intel chip to do this, except for the IA-64 Itanium that flopped. Sending a single uop to multiple execution units means having to keep those execution units synced, and that turns out to be problematic, particularly as clock speeds go up. Load/store architecture is more tolerant of timing issues.

            Berkeley's and IBM's early RISC were the first to use macro-ops, where the RISC instructions are fairly simple and are like RISC-V's.


            funct7=MULDIV | rs2=multiplier | rs1=multiplicand | funct3=MUL/MULH[[S]U] | rd=dest | opcode=OP

            The above is the multiplier layout in RISC-V. This is like IBM's first RISC instruction set. The first half of the RISC instruction here is uop 1, MULDIV, setting values in the MULDIV hardware, and the second half, MUL/MULH, is uop 2, getting the results back into registers. Uop 1 loads into the processing unit; uop 2 stores the result back into the registers. Of course other uops for the pipeline, to delay and do other things, can be added in the middle with RISC.

            Load/store architecture means a uop either loads or it stores, never both. If you are doing a load and a store it is always at least 2 uops on a load/store machine (see the sketch after this paragraph). Of course a processing unit can take in fused uop code where 2 uops are sent in at once, and of course a load/store machine can appear CISC-like when many uops are being run at once.
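            Here is the sketch mentioned above, in C (the uop names and the 3-uop split are my own illustration of the load/store argument, not Intel's actual Skylake decomposition):

            ```c
            /* How a load/store machine would break an x86-style read-modify-write
             * on memory into uops, each of which either loads, computes, or
             * stores -- never more than one of those. */
            #include <stdio.h>

            typedef enum { UOP_LOAD, UOP_ALU, UOP_STORE } kind_t;
            typedef struct { kind_t kind; const char *desc; } uop_t;

            int main(void)
            {
                static const char *kind_name[] = { "LOAD ", "ALU  ", "STORE" };

                /* x86 "add [mem], reg" expressed as load/store-style uops: */
                const uop_t add_mem_reg[] = {
                    { UOP_LOAD,  "tmp   <- [mem]"     },  /* load half      */
                    { UOP_ALU,   "tmp   <- tmp + reg" },  /* the actual add */
                    { UOP_STORE, "[mem] <- tmp"       },  /* store half     */
                };

                for (unsigned i = 0; i < sizeof add_mem_reg / sizeof add_mem_reg[0]; i++)
                    printf("uop %u: %s %s\n", i, kind_name[add_mem_reg[i].kind],
                           add_mem_reg[i].desc);
                return 0;
            }
            ```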

            Weasel, you have one thing very wrong. RISC from the start was macro-ops, where an instruction decoded into multiple uops. It was CISC that attempted, and failed, with the pure single uop. We do not see CISC chips being made any more using the single-uop model. Stanford's MIPS, Berkeley RISC and IBM's RISC have all never used a single uop per RISC instruction working on registers.

            Comment


            • I honestly don't understand where you come up with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.

              You keep saying out-of-order... you realize RISC does not equal out-of-order and has nothing to do with it? (by itself, sure any CPU can use out-of-order execution). So it's completely pointless as long as you equate out-of-order with RISC and think it's not CISC. And no, I don't care which CPU was the first out-of-order CPU, it doesn't matter even if it was RISC, it just happened to be so. (maybe because RISC are easier to design, back then technology was much more primitive)

              But whatever, now you'll copy paste info about out-of-order and how pipelines work and so on, can't answer a simple question about div, then maybe some history about first OOO CPU and so on, none of which have anything to do with CISC vs RISC as per their definitions... lol

              Comment


              • Originally posted by Weasel View Post
                I honestly don't understand where you come up with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.

                You keep saying out-of-order... you realize RISC does not equal out-of-order and has nothing to do with it? (by itself, sure any CPU can use out-of-order execution). So it's completely pointless as long as you equate out-of-order with RISC and think it's not CISC. And no, I don't care which CPU was the first out-of-order CPU, it doesn't matter even if it was RISC, it just happened to be so. (maybe because RISC are easier to design, back then technology was much more primitive)

                But whatever, now you'll copy paste info about out-of-order and how pipelines work and so on, can't answer a simple question about div, then maybe some history about first OOO CPU and so on, none of which have anything to do with CISC vs RISC as per their definitions... lol
                Load/store architecture is way simpler to make out-of-order than memory-register architecture, because uops do one thing at a time, so you can queue them on processing units/ports and trigger those units/ports at particular times.

                RISC without out-of-order or instruction stacking historically had a problem that CISC designs back then did not.

                Memory-register architecture can, with a single uop, directly perform an alteration on memory and registers. That single massive uop leads to high sync requirements, and sync makes the connection lengths inside the silicon, and out to the RAM chips, very important. Load/store needs multiple uops being processed at the same time to do the same thing. RISC was designed for load/store, where you are producing multiple simple uops.


                (by itself, sure any CPU can use out-of-order execution).
                In fact not all CPU architecture designs are suitable for out-of-order. Take the one-instruction-set computer (OISC): most of these cannot be made out-of-order because branch prediction is insanely hard to do on them.

                CISC really does not lend itself to out-of-order. With your instruction decoder reading ahead in RISC it is quite simple to group non-conflicting loads and stores into groups for bulk processing; RISC was designed for out-of-order. Most CISC instruction sets like x86 were designed for in-order, using one uop to do many different things in a memory-register architecture, so they require more processing to turn back into load/store information and produce what bulk processing needs.

                The reality is RISC was designed to sit on an architecture that uses multiple uops. Next, RISC CPUs were the first with pipelines, so you get a processing rate of 1 instruction per clock cycle even if an instruction takes longer than 1 clock cycle, as long as it takes less than the pipeline time to complete.

                Also, some RISCs have dynamic clocking. The reason RISC-V and POWER chips don't need an extra nop instruction after a jump/branch is that the pipeline on those can take 2 cycles to complete a jump, yet this can appear as a single clock count. So it is really simple to do a divide in one clock tick in a RISC design when the clock tick is controlled by how fast the instruction is processed, due to pipelines.

                Finally, what is the fastest transistor? You say it is electrically impossible to do a divide at 4 GHz. The fastest, in 2001, by IBM: 210 GHz. So a divider built in a CPU using the fastest transistor technology has no problem running at 4 GHz, particularly when you free-clock it. The problem is this requires a more expensive silicon production method and paying IBM for patents on that production method. IBM has been able to produce very fast math circuits for particular usages, like high-end military-grade software-defined radios; yes, those need to perform very fast multiplication and division.

                The reasons why multiply and divide are slow are:
                1) Heat: free-running these circuits generates a lot of heat. (Free running is running without a clock: just feed power into the circuit and wait for the result.)
                2) Cost of production due to requiring a more expensive plant design. For slower silicon at 5 nm the plant is already in the billions; this high-speed stuff is double that cost again.
                3) Cost of production due to requiring more expensive materials for the silicon chips themselves.
                4) Failure rates in production are higher when making the high-speed parts.

                So yes, we could have a 1-clock-cycle 512-bit-wide divide for a CPU running at 4 GHz, as long as we were willing to pay a lot more for CPUs, have higher power bills and have larger cooling problems. Everything is a trade-off. The reality is we can go a lot faster than we do today, but we don't in desktop CPUs; the CPUs made for high-end software-defined radio go insanely fast, over 100 GHz, with insane cooling requirements, like diamond directly bonded to the silicon chip to draw heat away so it doesn't melt. Nothing I have said is impossible: IBM for quite some time with their RISC chips got the div/mul done in 1 clock cycle by making those units out of faster technology than anything else in the chip.

                Comment


                • Originally posted by Weasel View Post
                  I honestly don't understand where you come up with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.
                  Maybe he's a bot.

                  Comment


                  • Originally posted by coder View Post
                    Maybe he's a bot.
                    Really, I am not. The information I am giving is not random.

                    There are strict differences between load/store architecture and memory-register architecture.

                    Comment


                    • Originally posted by oiaohm View Post
                      So yes, we could have a 1-clock-cycle 512-bit-wide divide for a CPU running at 4 GHz
                      You can't, gates have physical delays. They are already unclocked. There was a topic about this very thing on an asm forum that I lurk in. See this post and enlighten yourself.

                      You realize the speed of light is a thing, right? (Approx. 300 million m/s.) 210 GHz means it can only travel about 1.4 mm in a straight line before the next clock (I'm not sure how this works, but it's probably half of that, since it's a square wave), and that's excluding any delays, which are very significant (this is the absolute limit). I know you didn't say to have the entire CPU clocked that high, but no, it's not just impractical, it's probably physically impossible unless you specialize it just for that in the lab.
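                      Plugging in the numbers from this paragraph (just the raw arithmetic, ignoring gate and wire delays):

                      ```latex
                      % distance light travels in one 210 GHz clock period
                      \frac{c}{f} = \frac{3 \times 10^{8}\ \mathrm{m/s}}{2.1 \times 10^{11}\ \mathrm{Hz}} \approx 1.4\ \mathrm{mm}
                      ```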

                      tl;dr it doesn't matter if you make it unclocked. The input to it is available only for one clock duty cycle.
                      Last edited by Weasel; 08-10-2018, 08:24 AM.

                      Comment
