**Weasel** · 07 August 2018, 08:31 AM

Originally posted by oiaohm

Also, if you look at the 486 through the Pentium 4, the divide is 2 uops, yet on Skylake it is 4 uops. What is going on here? A lot of instructions increased in uop usage on x86 when Intel did out-of-order execution to attempt to fix the oversized pipeline they made in the Pentium 4.

I was asking about RISC CPU having a DIVISION instruction. If it did increase in uop then it means RISC CPU can't have division since it goes to multiple ports, do you see what you're saying now? Stop side-stepping the facts you're confronted with and try to "weasel out" your way out of my arguments.

You can't have it both ways.

Previously you said that a RISC CPU can implement division in 1 clock cycle and other clueless bullshit which is literally impossible in physics. Now you say that a division is multi-uop and macro op and goes to multiple execution units despite only having register operands. Which makes it a CISC instruction.

So which one the fuck is it?

Originally posted by oiaohm

You claimed items were 1 uop when they were not. As I said, CISC/RISC instructions are macro ops, as in they are commonly processed as multiple uops inside the CPU, and this is particularly true in out-of-order processors.

If you send an uop to two execution units it's still 1 uop.

**oiaohm** · 08 August 2018, 10:47 PM

Originally posted by Weasel

I was asking about RISC CPU having a DIVISION instruction. If it did increase in uop then it means RISC CPU can't have division since it goes to multiple ports, do you see what you're saying now? Stop side-stepping the facts you're confronted with and try to "weasel out" your way out of my arguments.

This is not understanding the basics of load/store architecture. Load/store architecture means that a load is 1 uop and a store is another 1 uop. So a division on a load/store architecture is always 2 uops.
This is a traditional 5-step RISC pipeline:
IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back
So each instruction decodes into at least 3 uops: 1 uop for EX, 1 uop for MEM and 1 uop for WB. MEM is load/store from memory, EX is load from registers, WB is store to registers. It does not take much thinking to see that with write-through registers you can drop to a 4-deep pipeline instead of a 5 by splitting MEM access in two.
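To make the five stages listed above concrete, here is a minimal sketch of an ideal in-order pipeline with no stalls or hazards. It only shows which stage each instruction occupies on each cycle. The function names are made up for illustration, and it deliberately says nothing about how a real core counts uops.

```python
# Ideal 5-stage pipeline: one instruction enters per cycle, no stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index, cycle, stages=STAGES):
    """Return the stage instruction `instr_index` is in at `cycle`, or None."""
    offset = cycle - instr_index  # instruction i enters IF at cycle i
    if 0 <= offset < len(stages):
        return stages[offset]
    return None


def render(num_instructions, stages=STAGES):
    """Build a text diagram, one row per instruction, one column per cycle."""
    n_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        cells = [(stage_at(i, c, stages) or "").ljust(4) for c in range(n_cycles)]
        rows.append(f"i{i}: " + "".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(render(4))
```

Each row is shifted one column to the right of the previous one, which is the overlap that lets a pipelined core start a new instruction every cycle.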

More modern out-of-order designs have write-through in the MEM stage, or split MEM, to allow more instructions to be stacked into a pipeline cycle.

In a compacted cycle of a 1-deep pipeline RISC, fetch, EX and MEM read access sit next to each other on the up clock, and ID, WB and MEM write sit on the down clock. It's simple to do a RISC 5 deep.

Originally posted by Weasel

Previously you said that a RISC CPU can implement division in 1 clock cycle and other clueless bullshit which is literally impossible in physics. Now you say that a division is multi-uop and macro op and goes to multiple execution units despite only having register operands. Which makes it a CISC instruction.

But it is possible to implement division in 1 clock cycle if you limit your speed or are running dynamic clocking. A lot of old RISC designs exploited dynamic clocking, where on a divide the clock stepped forward when the divide was complete. The traditional 5-step RISC pipelining, putting the memory operation in the middle, gave a delay. So a 5-long pipeline takes 5 clock cycles from beginning to end. A 12-long pipeline takes 12 clock cycles, of course.
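The latency figures above (5 cycles for a 5-deep pipeline, 12 for a 12-deep one) can be separated from throughput with a small calculation. This is a sketch under the assumption of an ideal pipeline with no stalls; the helper names are hypothetical.

```python
def latency_cycles(depth):
    """Cycles for a single instruction to go from first to last stage."""
    return depth


def cycles_for(n_instructions, depth):
    """Total cycles to finish n back-to-back instructions with no stalls."""
    return depth + n_instructions - 1


def throughput(n_instructions, depth):
    """Average instructions completed per cycle over the whole run."""
    return n_instructions / cycles_for(n_instructions, depth)


if __name__ == "__main__":
    for depth in (5, 12):
        print(
            f"depth={depth}: latency={latency_cycles(depth)} cycles, "
            f"1000 instructions take {cycles_for(1000, depth)} cycles, "
            f"throughput={throughput(1000, depth):.3f} per cycle"
        )
```

Under these assumptions the per-instruction latency grows with depth, while the throughput over a long run approaches one instruction per cycle for either depth.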

Originally posted by Weasel

If you send an uop to two execution units it's still 1 uop.

Skylake x86 documentation clearly states that each execution unit gets its own uop. In traditional RISC each execution unit gets its own uop, and each execution unit is assigned to process at particular points in the pipeline/clock oscillation.

It was traditional CISC like the i386, using memory-register architecture, that sent a single uop to multiple execution units. Yes, the i386 was the last Intel chip to do this, except for the IA64 Itanium that flopped. Sending a single uop to multiple execution units means having to keep those execution units synced, which turns out to be problematic, particularly as clock speeds go up. Load/store architecture is more tolerant of timing issues.

Berkeley and IBM early RISC were the first to be macro-op, where RISC instructions are fairly simple and like RISC-V.

MULDIV multiplier multiplicand MUL/MULH[[S]U] dest OP

The above is a multiplier layout in RISC-V. This is like IBM's first RISC instruction set. The first half of the RISC instruction here is uop 1, MULDIV, setting values in the MULDIV hardware, and the second half, MUL/MULH, is uop 2, getting results back to registers. uop 1 is the load into processing; uop 2 is the store of the result back into registers. Of course other uops for the pipeline to delay and other things can be added in the middle with RISC.

Load/store architecture means a uop either loads or it stores; it never does both. If you are doing a load and a store, it's always at least 2 uops on a load/store architecture. Of course a processing unit can take in fused uop code where 2 uops are sent in at once. Of course a load/store design can appear CISC-like when many uops are being run at once.

Weasel, you have 1 thing very wrong. RISC from the start was macro ops, where an instruction decoded to multiple uops. It was CISC that attempted and failed with pure single uop. We do not see CISC chips being made any more using the single-uop model. MIPS from Stanford, Berkeley and IBM RISC have all never used a single uop per RISC instruction working on registers.

**Weasel** · 09 August 2018, 12:31 PM

I honestly don't understand where you come with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.

You keep saying out-of-order... you realize RISC does not equal out-of-order and has nothing to do with it? (by itself, sure any CPU can use out-of-order execution). So it's completely pointless as long as you equate out-of-order with RISC and think it's not CISC. And no, I don't care which CPU was the first out-of-order CPU, it doesn't matter even if it was RISC, it just happened to be so. (maybe because RISC are easier to design, back then technology was much more primitive)

But whatever, now you'll copy paste info about out-of-order and how pipelines work and so on, can't answer a simple question about div, then maybe some history about first OOO CPU and so on, none of which have anything to do with CISC vs RISC as per their definitions... lol

**oiaohm** · 09 August 2018, 08:56 PM

Originally posted by Weasel

I honestly don't understand where you come with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.

You keep saying out-of-order... you realize RISC does not equal out-of-order and has nothing to do with it? (by itself, sure any CPU can use out-of-order execution). So it's completely pointless as long as you equate out-of-order with RISC and think it's not CISC. And no, I don't care which CPU was the first out-of-order CPU, it doesn't matter even if it was RISC, it just happened to be so. (maybe because RISC are easier to design, back then technology was much more primitive)

But whatever, now you'll copy paste info about out-of-order and how pipelines work and so on, can't answer a simple question about div, then maybe some history about first OOO CPU and so on, none of which have anything to do with CISC vs RISC as per their definitions... lol

Load/store architecture is way simpler to make out-of-order than memory-register architecture, due to uops doing 1 thing at a time, so you can queue them on processing units/ports and trigger those units/ports at particular times.

RISC without out-of-order or instruction stacking historically had a problem that CISC designs back then did not.

Memory-register architecture can, with a single uop, directly perform an alteration on memory and registers. This single massive uop leads to high sync requirements, and sync makes connection lengths inside the silicon and out to the RAM chips very important. Load/store needs multiple uops being processed at the same time to do the same thing. RISC was designed for load/store, where you are producing multiple simple uops.

(by itself, sure any CPU can use out-of-order execution).
In fact not all CPU architecture designs are suitable for out-of-order. Take one instruction set computers (OISC): most of these cannot be made out-of-order because branch prediction is insanely hard to do on them.

CISC really does not lend itself to out-of-order. Take your instruction interpreter reading ahead: in RISC it's quite simple to group non-conflicting loads and stores into groups for bulk processing. RISC was designed for out-of-order. Most CISC instruction sets like x86 were designed for in-order, using 1 uop to do many different things in memory-register architecture, so they require more processing to turn back into load/store information to produce the information required for bulk processing.

Reality is RISC was designed to sit on an architecture that uses multiple uops. Next, RISC CPUs were the first with pipelines, so you get a processing rate of 1 instruction per clock cycle even if an instruction takes longer than 1 clock cycle, as long as it takes less than the pipeline time to complete.

Also some RISC has dynamic clocking. The reason RISC-V and Power chips don't have an extra nop instruction after a jump/branch is that the pipeline on those can take 2 cycles to complete if it is a jump, yet this can appear as a single clock count. So it's really simple to do a divide in one clock tick in a RISC design when the clock tick is controlled by how fast the instruction is processed. Due to pipelines

Finally, what is the fastest transistor? You say it's electrically impossible to do a divide at 4 GHz. The fastest in 2001, by IBM, was 210 GHz. So a divider built into a CPU using the fastest transistor technology has no problem running at 4 GHz, particularly when you free-clock it. The problem is this requires a more expensive silicon production method and paying IBM for patents on that production method. IBM has been able to produce very fast math circuits for particular usages like high-end military-grade software-defined radios; yes, those need to perform very fast multiplication and division.

The reasons why multiply and divide are slow are:
1) Heat: free-running these circuits generates a lot of heat. (Free-running is running without a clock: just feed power into the circuit and wait for the result.)
2) Cost of production due to requiring a more expensive plant design. For slower silicon at 5 nm the plant is in the billions; this high-speed stuff is double that cost again.
3) Cost of production due to requiring more expensive materials for the silicon chips themselves.
4) Failure rates in production are higher when making the high-speed parts.

So yes, we could have a 1 clock cycle 512-bit-wide divide for a CPU running at 4 GHz, as long as we were willing to pay a lot more for CPUs, have higher power bills and have larger cooling problems. Everything is a trade-off. Reality is we can go a lot faster than we do today, but we don't in desktop CPUs. The CPUs made for high-end software-defined radio go insanely fast, over 100 GHz, with insane cooling requirements like diamonds directly connected to the silicon chip to draw heat away so it doesn't melt. Nothing I have said is impossible: IBM for quite some time with their RISC chips got div/mul into 1 clock cycle by making those out of faster technology than anything else in the chip.

**coder** · 09 August 2018, 11:24 PM

Originally posted by Weasel

I honestly don't understand where you come with so much stuff that has nothing to do with my point. It's like you copy-paste random CPU info from somewhere every time I ask you a question.

Maybe he's a bot.

**oiaohm** · 09 August 2018, 11:28 PM

Originally posted by coder

Maybe he's a bot.

Really I am not. The information I am giving is not random.

There are strict differences between load store architecture and memory register architecture.

**Weasel** · 10 August 2018, 08:20 AM

Originally posted by oiaohm

So yes, we could have a 1 clock cycle 512-bit-wide divide for a CPU running at 4 GHz

You can't, gates have physical delays. They are already unclocked. There was a topic about this very thing on an asm forum that I lurk in. See this post and enlighten yourself.

You realize the speed of light is a thing, right? (approx. 300 million m/s). 210 Ghz means it can only travel 1mm in a straight line before the next clock (I'm not sure how this works, but it's probably half of that, since it's a square wave), and that's excluding any delays which are very significant (this is the absolute limit). I know you didn't say to have the entire CPU clocked that high, but no it's not just impractical, it's probably physically impossible unless you specialize it just for that in the lab.
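The back-of-envelope figure above can be checked with a few lines of Python. This is only a sketch: it uses the vacuum speed of light as an upper bound, the function name is made up for illustration, and real on-chip signals propagate considerably slower than this.

```python
# Vacuum speed of light in metres per second (exact by definition).
C_M_PER_S = 299_792_458.0


def light_distance_per_cycle_mm(freq_hz):
    """Distance light travels in vacuum during one clock period, in mm."""
    period_s = 1.0 / freq_hz
    return C_M_PER_S * period_s * 1000.0


if __name__ == "__main__":
    for label, freq in (("210 GHz", 210e9), ("4 GHz", 4e9)):
        full = light_distance_per_cycle_mm(freq)
        print(f"{label}: {full:.2f} mm per cycle, {full / 2:.2f} mm per half cycle")
```

At 210 GHz this comes out at roughly 1.4 mm per full cycle and about 0.7 mm per half cycle, which is in line with the "1 mm, probably half of that" estimate in the post.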

tl;dr it doesn't matter if you make it unclocked. The input to it is available only for one clock duty cycle.

**coder** · 10 August 2018, 09:51 PM

Originally posted by oiaohm

Really I am not.

A bot would say that.

**oiaohm** · 12 August 2018, 11:32 PM

Originally posted by Weasel

You can't, gates have physical delays. They are already unclocked. There was a topic about this very thing on an asm forum that I lurk in. See this post and enlighten yourself.

You realize the speed of light is a thing, right? (approx. 300 million m/s). 210 Ghz means it can only travel 1mm in a straight line before the next clock (I'm not sure how this works, but it's probably half of that, since it's a square wave), and that's excluding any delays which are very significant (this is the absolute limit). I know you didn't say to have the entire CPU clocked that high, but no it's not just impractical, it's probably physically impossible unless you specialize it just for that in the lab.

Silicon features are measured in nm. So take 28 nm features: 0.000028 mm * 512 = 0.014336 mm. Let's multiply that by 10 to allow for some wiring. You are still under 0.2 mm of in-circuit travel distance for a 512-wide divide circuit done at 28 nm, from entry to the divide circuit to exit. The biggest cause of slowness is the chemistry used in the gates, which causes quite significant switching delays.
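The arithmetic in the paragraph above can be reproduced as follows. This is a sketch of the post's own simplification (one feature width per bit, times a flat wiring factor), not a model of how a real divider is laid out; the function and parameter names are hypothetical.

```python
NM_TO_MM = 1e-6  # 1 nm = 0.000001 mm


def path_length_mm(feature_nm, bits, wiring_factor=1.0):
    """Estimated signal path: one feature width per bit, scaled for wiring."""
    return feature_nm * NM_TO_MM * bits * wiring_factor


if __name__ == "__main__":
    bare = path_length_mm(28, 512)
    with_wiring = path_length_mm(28, 512, wiring_factor=10)
    print(f"bare: {bare:.6f} mm, with 10x wiring allowance: {with_wiring:.5f} mm")
```

With these inputs the estimate is 0.014336 mm bare and 0.14336 mm with the 10x allowance, so it stays under the 0.2 mm quoted in the post.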

Speed of light is not the barrier; the barrier is gate chemistry. In 2001 IBM had the chemistry to have gates switching at 210 GHz. This is SiGe, and it has been pushed even faster: 798 GHz in 2014. 798 GHz switching allows for a 64-bit divide working at 12 GHz, a 128-bit divide working at 6 GHz and a 256-bit divide working at 3 GHz (that is faster than the general clock speed of a lot of Intel chips). Please note that is 2014. Since then SiGe has been pushed even faster with different chemistry.
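The 12 / 6 / 3 GHz figures above are consistent with dividing the switching frequency by the operand width, that is, assuming one switching period per operand bit. That is a reading of the post's numbers and an assumption, not a statement about how real divider circuits are built; the function name is made up for illustration.

```python
def divide_rate_ghz(switch_ghz, bits):
    """Divide rate under the assumed model: one switching period per bit."""
    return switch_ghz / bits


if __name__ == "__main__":
    # 798 GHz is the 2014 figure from the post, 210 GHz the 2001 figure.
    for switch_ghz in (798.0, 210.0):
        for bits in (64, 128, 256, 512):
            rate = divide_rate_ghz(switch_ghz, bits)
            print(f"{switch_ghz:.0f} GHz switching, {bits}-bit divide: {rate:.2f} GHz")
```

Running it prints the rate for each width, so the same assumption can be applied to any other width or switching frequency mentioned in the thread.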

Under 4 GHz, using the latest chemistry for SiGe and 7 nm production, completing a 512-bit-wide divide every 4 GHz clock cycle is really easy. You have time to spare.

In 2001 IBM could do a 64-bit divide at 3 GHz (using high-speed SiGe gates, at the time limited to 210 GHz); of course the PowerPC chip of 2001 was only running at 1.2 GHz. See, no problem to do a divide inside 1 CPU clock cycle. You just have to make your divider and multiplier out of the right chemistry, one that is many times faster than every other part in the chip, but there is a price to pay for this. Yes, that was 180 nm tech. High-end PowerPC supercomputer chips were known for being insanely hot and power hungry but very fast at maths compared to the general PowerPC chips.

Weasel, basically your physical limit claims are total bull crap. In fact we are not exactly sure how fast SiGe can go. The problem is SiGe at high speed produces a heck of a lot of heat with quite a decent amount of electrical loss.

One of the funny problems you hear RISC-V hardware makers talking about is that when doing complex maths on x86, less than 5 percent of the ALU processing unit's capabilities are in fact being used. That is the gap between the real speed a well-made ALU doing divide/multiply and other functions can run at and the fact that CPU clock speed is so many times slower than the ALU processing speed.

Intel does not use the fastest SiGe chemistry; they use the SiGe chemistry with lower heat production and electrical losses, and the price is slower switching speed. Everything is a trade-off. So yes, Intel's numbers show divide and multiply taking multiple clock cycles, but they do not have to; it's a choice at the silicon production stage of what chemistry you are going to make the multiplier/divider out of. Do you have the chip require high power, produce more heat, require more cooling and be able to divide/multiply very quickly (in the 4 GHz clock speed range), or do you run divide/multiply slower and require 1/4 the power (for total processing) and generate between 1/8 and 1/16 the heat? Think about it: x86 chips get hot enough without spewing out 8 to 16 times more heat.

ARM Launches "Facts" Campaign Against RISC-V