ARM Launches "Facts" Campaign Against RISC-V


  • Originally posted by oiaohm View Post
    A RISC CPU with uop fusion can read something from memory and do something else at the same time. Yes, each single uop can only do one operation, but nothing says you cannot process two or more uops in one clock cycle under the IBM or Berkeley definition of RISC.
    Are you stupid or something?

    "Modern" RISC means "load/store" architecture. You cannot in good conscience do something while also loading or storing to memory and still claim it's RISC.

    Fusion is CISC by definition. The fact that RISCs do fusion means that RISC CPUs are CISC internally for fuck's sake. Hence proving my point that RISC CPUs are not RISC anymore but hybrids, like ARM. Because CISC "won" and every RISC CPU worth its salt is CISC-like internally.

    It's not CISC that's RISC-like internally, I just proved it. It's RISC that's CISC, because they "fuse" simple RISC-like instructions into complex uops that do way more than one thing at once.

    Deal with it.

    Comment


    • Originally posted by oiaohm View Post
      In high-performance real-time chips, like mips32/mips64-based ones, you will find the instruction latency...
      So, find me a document specifying that the latency of this instruction is one cycle. Anything else is just a waste of everybody's time.

      Originally posted by oiaohm View Post
      Really, coder, I think you are incompetent, lacking the basics, and want to make me out as a liar because that is simpler than admitting you were clueless on the topic.
      I will gladly admit that I'm wrong, but only when you've provided clear evidence of even one of the many production CPUs you claim have accomplished this feat.

      Comment


      • Originally posted by Weasel View Post
        You realize we're not talking about divisions (and multiplications?) by constants, right? Why don't you show a multiplication that works with any multiplicand (a register, not a constant), huh?

        Every decent compiler in existence will "translate" divisions by constants into multiplications anyway, so your point is completely retarded and useless. A strawman.
        My thoughts, exactly. Thank you.
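
        For what it's worth, here is roughly the kind of strength reduction a compiler does for an unsigned divide by a constant (a sketch only, shown for x / 10 on x86-64; the magic constant and shift come from the usual reciprocal-multiplication trick and vary with the divisor):
        Code:
        mov rax, rdi                  ; x arrives in rdi (System V calling convention)
        mov rdx, 0xCCCCCCCCCCCCCCCD   ; magic reciprocal, roughly 2^67 / 10
        mul rdx                       ; 128-bit product in rdx:rax
        mov rax, rdx                  ; keep the high half...
        shr rax, 3                    ; ...and shift: rax = x / 10, no div instruction used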

        Comment


        • Originally posted by Weasel View Post
          You realize we're not talking about divisions (and multiplications?) by constants, right? Why don't you show a multiplication that works with any multiplicand (a register, not a constant), huh?
          These are divider studies.


          There is a study done into one in 2016.

          Here is another study from 1999.

          This form of divider first turned up in 1991. I was a little out on the 30-years bit; it's 27 years. Asynchronous designs without clocks avoid a set of problems that multipliers and dividers suffer from, and also let them run as fast as the gates run. The asynchronous multiplier is older.

          Originally posted by Weasel View Post
          "Modern" RISC means "load/store" architecture. You cannot in good conscience do something while also loading or storing to memory and still claim it's RISC.
          That is where you are in trouble. The 1975 IBM RISC, the first one, was doing loads and stores to memory while doing operations. The 1975 IBM RISC had RISC fusion. Load/store architecture does not say you cannot be loading and storing to memory and loading and storing from the operation unit in the same cycle; it just says you have a load cycle and a store cycle, and that makes it a load/store architecture.

          Fusion in RISC was forgotten by the 1980 implementations of RISC.

          Originally posted by Weasel View Post
          Fusion is CISC by definition. The fact that RISCs do fusion means that RISC CPUs are CISC internally for fuck's sake. Hence proving my point that RISC CPUs are not RISC anymore but hybrids, like ARM. Because CISC "won" and every RISC CPU worth its salt is CISC-like internally.
          This is not right, because fusion did not come from CISC. Fusion is the ability to run multiple instructions at once.

          Originally posted by Weasel View Post
          It's not CISC that's RISC-like internally, I just proved it. It's RISC that's CISC, because they "fuse" simple RISC-like instructions into complex uops that do way more than one thing at once.
          Fusion is not in fact unique to CISC or RISC. Also, IBM RISC fusion was not converting simple RISC-like instructions into complex uops. RISC-V fusion does not make complex uops either. RISC fusion is just limited out-of-order execution of simple uops, with the out-of-order part being that multiple uops execute at the same time if they will not conflict in hardware.

          What RISC-V Rocket with fusion is doing is like what the P5 Pentium does, and the same as various IBM 1975 RISC experiments did: instead of just loading the current instruction, load the next few instructions along as well and see if you can in fact execute two or more at the same time, because they will not be conflicting in hardware operation.

          The P5 can only do 2 of its CISC instructions max in a single cycle. Fusion is not a uop problem; fusion defines how you process instructions. Fusion is the way your instruction set is processed, so multiple CISC/RISC instructions are being processed instead of single ones. Load/store architecture does not forbid multiple uops executing at the same time in the same clock cycle. A fixed-clock RISC without fusion will see only 1 uop at a time.
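
          For example (x86, purely illustrative; the registers are arbitrary), two simple, independent instructions like these could go down the P5's U and V pipes in the same clock cycle, subject to its pairing rules:
          Code:
          mov eax, [esi]   ; simple load, pairable
          add ebx, ecx     ; independent ALU op, can pair in the other pipe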

          Really, fusion is not a feature that means CISC. Fusion comes from the 1975 IBM RISC experiments; it first appears in out-of-order IBM RISC designs from 1975.

          A proper complex uop is a uop doing multiple complex instructions for one uop code. RISC fusion is multiple simple single-feature uops being executed in the same clock cycle using the same load/store events. Fusion is a form of out-of-order execution. Fusion is not a CISC feature. Fusion is a feature most RISC implementations have been horribly missing.

          Thing to remember: a lot of CPU design is not new. Items like fusion that were done in 1975 got forgotten until they were redone with the P5 hardware.

          Comment


          • Originally posted by oiaohm View Post
            These are divider studies.


            There is a study done into one in 2016.

            Here is another study from 1999.

            This form of divider first turned up in 1991. I was a little out on the 30-years bit; it's 27 years. Asynchronous designs without clocks avoid a set of problems that multipliers and dividers suffer from, and also let them run as fast as the gates run. The asynchronous multiplier is older.
            I have no idea how any of that has anything to do with (or proves?) it being done in 1 clock cycle.

            Not only does it use async division (clock cycles are synchronous), but their fastest time is 3.7 nanoseconds. On a 4 GHz CPU that's about 15 clock cycles. You realize this, right? And that's for their fastest. Sure, they used a comparatively old process node for their experiment, but the lack of a newer experiment is not proof.

            So it looks like you've just proven yourself wrong even more.

            Originally posted by oiaohm View Post
            That is where you are in trouble. The 1975 IBM RISC, the first one, was doing loads and stores to memory while doing operations. The 1975 IBM RISC had RISC fusion. Load/store architecture does not say you cannot be loading and storing to memory and loading and storing from the operation unit in the same cycle; it just says you have a load cycle and a store cycle, and that makes it a load/store architecture.
            Wrong. I never said you can't do it in the same cycle. I said you can't do it in the same instruction/operation. This is literally the definition of load/store architecture. Stop wasting time with nonsense.

            Quote just in case:
            Originally posted by Wikipedia
            For instance, in a load/store approach both operands and destination for an ADD operation must be in registers. This differs from a register memory architecture (for example, a CISC instruction set architecture such as x86) in which one of the operands for the ADD operation may be in memory, while the other is in a register.
            Originally posted by oiaohm View Post
            This is not right, because fusion did not come from CISC. Fusion is the ability to run multiple instructions at once.
            NO. That has nothing to do with fusion: that's simply processing stuff in parallel (out of order).

            Fusion means that you fuse multiple instructions or uops into one uop. You fuse multiple instructions into one "internal instruction" for the CPU.

            Read the quote from Wikipedia above: load/store architecture must have ALL operands as registers. ALL. (except for dedicated load/store instructions)

            Fusion makes just one uop that has a memory operand, therefore proving that the CPU internals are CISC by this very definition. You cannot get around this, sorry. A uop doesn't have to run in 1 clock cycle: what matters is that it's only one instruction (operation). Period.

            I don't care if an IBM RISC processor did it in the 1970s. It only proves it had to do it to compete with CISC. So in fact, modern RISC CPUs are CISC internally.

            Comment


            • I looked through both of those papers, and the one from 1999 mentions 3.7 ns to finish. That's about 270 MHz.

              The second article, from 2016, mentions delays for division of up to 16 cycles; 16 * 0.27 GHz = ~4.3 GHz, which is more or less the speed of modern CPUs. In other words, a divider that takes 16 cycles but runs at a roughly 16x higher clock ends up with about the same latency.

              I tried to find how many cycles a division takes on intel.com but couldn't find anything.

              Did I miss something, or is one clock cycle for division only possible up to about 270 MHz?

              Comment


              • This is just throwing crap at the wall, and let me tell you that none of it is sticking.

                The reason you cannot find a production CPU which implemented single-cycle integer division is that none has. You make claims you can't back up, and then try to wear us down by talking in circles. I'm done with this farce.

                Comment


                • Originally posted by sjekkel View Post
                  I looked through both of those papers, and the one from 1999 mentions 3.7 ns to finish. That's about 270 MHz.
                  ...using an asynchronous architecture. I remember, around that time, reading that it could be the next big trend in silicon design, but AFAIK, it hasn't caught on. Possibly, modern CPUs adopted techniques that narrowed the gap, and perhaps asynch designs are more difficult to scale.

                  Originally posted by sjekkel View Post
                  The second article, from 2016, mentions delays for division of up to 16 cycles; 16 * 0.27 GHz = ~4.3 GHz, which is more or less the speed of modern CPUs.

                  I tried to find how many cycles a division takes on intel.com but couldn't find anything.
                  Intel used to publish this information, but I'm not sure if they still do. You could refer to this reference Weasel posted:



                  It's a single doc with all the info from decades' worth of x86 CPUs, so be sure you're looking at the correct generation. Note that Kaby Lake and Coffee Lake have the same microarchitecture as Skylake.

                  Anyway, according to the above doc, that 16-cycle figure is roughly in line with modern Intel and AMD CPUs.

                  Originally posted by sjekkel View Post
                  Did I miss something, or is one clock cycle for division only possible up to about 270 MHz?
                  Well, that's clearly indexed to 1990s-era process technology. Note that some advantage is derived from the asynchronous implementation. Also, it appears to be an entire chip devoted to implementing a single divider. So, that's a lot of variables to consider when trying to map it to modern CPUs. That's why I've been explicit in focusing on oiaohm's claim that
                  MIPS, RISC-V and many other RISC chips have been using a very old circuit design that performs divide and multiply on 64-bit and 128-bit numbers in 1 clock cycle.
                  He needs to provide some kind of evidence that any production chip ever accomplished this feat, or else concede the point.

                  I don't care if it's possible with impractically low clock speeds and large silicon footprint. I always claimed it wasn't practical - not that it was impossible.
                  Last edited by coder; 22 July 2018, 10:32 PM.

                  Comment


                  • Originally posted by Weasel View Post
                    Not only does it use async division (clock cycles are synchronous), but their fastest time is 3.7 nanoseconds. On a 4 GHz CPU that's about 15 clock cycles. You realize this, right? And that's for their fastest. Sure, they used a comparatively old process node for their experiment, but the lack of a newer experiment is not proof.
                    That is 3.7 nanoseconds on an FPGA, and an old FPGA at that. Getting numbers on production high-performance silicon is hard; most of it is under NDA.

                    So it looks like you've just proven yourself wrong even more.

                    Originally posted by Weasel View Post
                    Wrong. I never said you can't do it in the same cycle. I said you can't do it in the same instruction/operation. This is literally the definition of load/store architecture. Stop wasting time with nonsense.

                    Quote just in case:
                    For instance, in a load/store approach both operands and destination for an ADD operation must be in registers. This differs from a register memory architecture (for example, a CISC instruction set architecture such as x86) in which one of the operands for the ADD operation may be in memory, while the other is in a register.
                    That is the catch. IBM in 1975 played with RISC with write-through registers.

                    Originally posted by Weasel View Post
                    Read the quote from Wikipedia above: load/store architecture must have ALL operands as registers. ALL. (except for dedicated load/store instructions)
                    With write-through registers in RISC this is still true. It's an out-of-order optimisation.


                    Originally posted by Weasel View Post
                    Fusion means that you fuse multiple instructions or uops into one uop. You fuse multiple instructions into one "internal instruction" for the CPU.
                    Fusion does not mean 1 uop. That is not what RISC-V is up to.

                    Cortex-A72 is the latest iteration of ARM's largest CPU core, although it's ... per clock cycle and issue up to eight micro-ops

                    Notice something here: CPUs don't have to run 1 uop per cycle. You also find that Intel breaks CISC instructions down into multiple micro-ops, and if there is no conflict between two x86 instructions' micro-ops they can be executed in 1 clock cycle.
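
                    To illustrate the decode side (a hedged sketch, not tied to any particular Intel microarchitecture): one register-memory x86 instruction commonly gets split into separate load and ALU micro-ops internally, roughly like this.
                    Code:
                    add rax, [rbx]   ; one x86 instruction...
                    ; ...decoded internally into roughly two micro-ops:
                    ;   uop 1: load tmp <- [rbx]
                    ;   uop 2: add  rax <- rax + tmp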

                    Originally posted by Weasel View Post
                    Fusion makes just one uop that has a memory operand, therefore proving that the CPU internals are CISC by this very definition. You cannot get around this, sorry. A uop doesn't have to run in 1 clock cycle: what matters is that it's only one instruction (operation). Period.
                    With write-through registers, RISC gets very interesting.

                    Clock high: MMU storing into registers; processing reading out of registers.
                    Clock low: MMU reading out of registers; processing writing into registers.

                    This is still load/store. Everything is still in the registers, just applied out of order. With the write-through registers that IBM's and Berkeley's early out-of-order RISC work had, things get more than interesting. (The ring of registers in Berkeley RISC designs comes from this; you don't see it in MIPS.)

                    Please note these are still 2 or more uops.

                    As soon as you go out of order with RISC, things get different.

                    Doing out-of-order securely, as Intel, ARM and the rest are finding out, is quite complex. CISC instructions make life harder, in fact.

                    Without out-of-order execution you see:
                    load into reg1 from memory
                    load into reg2 from memory
                    add reg1 reg2 into reg3
                    store reg3 to memory.
                    This being quite long in RISC is mostly down to the fact that the RISCs most people have seen are in-order versions. Out-of-order with write-through registers sees the two loads, the add and the store pretty much happening in 1 cycle, if the pipeline is smart enough, of course. This does not make 1 single uop; it works out which instructions can in fact be executed at exactly the same time by the hardware and does that. It also works out which areas need to be write-through instead of read/write.

                    Really, the big thing is that this form of compaction with write-through registers for out-of-order RISC does not reduce it down to 1 instruction. It is a multiple-instructions-executing-at-once solution.

                    Notice something: if you don't taint your RISC instruction set with CISC instructions, then when you have memory-access problems you only have to deal with the load and store circuits for the MMU. With CISC, how many instructions do you have that are performing memory operations, and how many could be wrong?

                    RISC gets CISC-like performance by going the other way, basically. Instead of attempting to make single uops (CISC) perform magic, work out which uops you can in fact execute all at once. Yes, write-through registers are a heck of an optimisation. The result of RISC with out-of-order alterations is that it starts having performance like a CISC, but it purely is not one. CISC would have bypassed having the stuff in registers; RISC will have its registers be write-through.

                    A good out-of-order RISC still has MMU -> registers -> operation -> registers -> MMU, just with this happening on a compacted time frame, being 1 cycle instead of something like 4 for the add example above. Of course, when people see this stuff taking the same time as 1 instruction, you get some idiots thinking this is CISC, when it is not; it is the 1975 IBM RISC optimisation experiment that was repeated in 1980 by Berkeley and appears again in the BOOM RISC-V out-of-order processor and the fusion optimisations for the Rocket RISC-V.

                    Please note write-through registers are not inside MIPS or ARM.

                    Comment


                    • Originally posted by oiaohm View Post
                      That is 3.7 nanoseconds on an FPGA, and an old FPGA at that. Getting numbers on production high-performance silicon is hard; most of it is under NDA.

                      So it looks like you've just proven yourself wrong even more.
                      Such a clown. Lack of proof is not proof, so no, you've proven yourself even more wrong.

                      You can come up with any conspiracy bullshit you want and then claim "God exists, we just don't have the proof because of conspiracies" as some sort of proof. That doesn't make it any more than just a claim. It has zero proof.

                      Originally posted by oiaohm View Post
                      Fusion does not mean 1 uop. That is not what RISC-V is up to.

                      Cortex-A72 is the latest iteration of ARM's largest CPU core, although it's ... per clock cycle and issue up to eight micro-ops

                      Notice something here: CPUs don't have to run 1 uop per cycle. You also find that Intel breaks CISC instructions down into multiple micro-ops, and if there is no conflict between two x86 instructions' micro-ops they can be executed in 1 clock cycle.
                      What the fuck has that to do with fusion? That's how many uops you can issue in a clock cycle, which has absolutely nothing to do with fusion. It's a completely different thing, orthogonal, and it exists completely independently of fusion.

                      And yes you can execute uops in parallel if they don't depend on each other. That's also a completely orthogonal thing. Again, absolutely nothing to do with fusion.

                      Fusion is when you "fuse" (surprise!) two or more uops into less, typically just one. e.g. you fuse two uops (load from memory, perform some calculation) into just 1 operation that the CPU understands natively (perform that calculation directly on the memory input). This is obviously much more efficient, and it's CISC. So RISC CPUs are CISC internally, because CISC is just better.

                      If uop issue rate (your first buzzword), parallel execution of independent uops (your second buzzword), and fusion were axes of a space, they'd form a complete 3D coordinate system: none of them has anything to do with the others, and each works independently.

                      Yes, you can execute two uops in parallel. That doesn't make them fused. Let's use assembly instructions for illustration. For example, consider this scalar code:
                      Code:
                      add rax, rbx
                      add rcx, rdx
                      These are both 1 uop each. Notice how the second one does not depend on the first's output (rax), so they can and will be executed in parallel. They're still two instructions/uops though.

                      Here's a fused example:
                      Code:
                      paddq xmm0, xmm1
                      Which is just one instruction/uop that does TWO additions in parallel, since xmm0 and xmm1 contain two 64-bit integers each. So it's clearly better. Fusion is similar, but it's more than just doing stuff in parallel: it produces operations that the CPU can understand natively, not just for vectors. Example:
                      Code:
                      add rax, [rbx]
                      is a single uop, and the equivalent sequence gets fused into a single uop even on a RISC (where you have to load from [rbx] with a separate instruction).
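
                      On the x86 side, macro-fusion is the well-documented analogue: on most Core-family CPUs the decoders turn a compare plus an immediately following conditional branch into one fused uop (with some restrictions on which combinations qualify). Purely as an illustration:
                      Code:
                      cmp rax, rbx   ; compare...
                      jne retry      ; ...and branch: typically decoded as ONE fused uop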

                      You really have no idea what you're talking about.
                      Last edited by Weasel; 29 July 2018, 08:39 AM.

                      Comment
