Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source
Originally posted by lkcl
interesting. this sounds very much like an optimised barrel processor: i say "optimised", where barrel processors are normally fair-scheduling, you're talking about instant swapping between regfiles.
whereas, a hybrid CPU, being effectively "a standard SMP arrangement with extra opcode bells and whistles" would need a linux kernel OS context-switch.
Originally posted by lkcl
the summary takes 30 seconds. a full debrief takes 7 hours.
this is why we're extending the PowerISA regfile to 128 FP and 128 INT regs.
Originally posted by lkcl
Jeff Bush's Nyuzi paper, nyuzipass2015, already made this abundantly clear, hence why 128 FP and 128 INT regs. you absolutely cannot have the LOAD-processing-STORE loop interrupted by register spill.
(edit: well.. you can... but the power consumption penalty would terminate all and any possibility of having a commercially-viable processor. logically therefore, you don't do that!)
Originally posted by lkcl
ah no, not quite. the vector instruction is basically not really a vector at all, it's a "for-loop from 0 to VL-1 whilst the PC is not advanced i.e. it's a bit like a SUB-PC". conceptually it sits in between instruction decode and instruction issue.
it therefore shoves *elements* into the multi-issue execution engine.
now, if the VL is e.g. 4 and there is room for e.g. 8-wide multi-issue, then the instruction decode does *not* stop with that first vector instruction, it goes, "hmm if i decode the next instruction as well i can shove an extra 4 elements into the 8-wide multi-issue"
and at *that* point it will go "ok i can't do any more in this cycle"
but because all the Computation Units are pipelines (except DIV) then on the next cycle guess what? next instruction decode gets 8 more free issue slots, and off we go again.
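The decode-to-issue packing described above can be sketched as a toy model. Note this is purely illustrative: `ISSUE_WIDTH`, the queue format, and the function name are my own assumptions, not Libre-SOC code.

```python
# Sketch (not the actual Libre-SOC decoder): the decoder keeps consuming
# vector instructions, expanding each into VL element-level operations,
# until the multi-issue window of the current cycle is full.
ISSUE_WIDTH = 8  # assumed issue width for illustration

def pack_issue_window(decode_queue, vl):
    """Greedily turn decoded vector instructions into element-level
    issue slots, stopping only when the window is full."""
    slots = []
    while decode_queue and len(slots) + vl <= ISSUE_WIDTH:
        insn = decode_queue.pop(0)
        # the "sub-PC" for-loop: one element op per slot, elements 0..VL-1
        slots.extend((insn, elt) for elt in range(vl))
    return slots

# two vector adds with VL=4 fill an 8-wide window in one cycle;
# the third instruction waits for the next cycle's free slots
queue = ["vadd.1", "vadd.2", "vadd.3"]
window = pack_issue_window(queue, vl=4)
```

On the next cycle (the pipelines having advanced), the same packing runs again on the remaining queue, which matches the "8 more free issue slots, and off we go again" behaviour described above.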
I think I'm starting to get it. The X.org presentation helped. So the decoder sees the vector instruction, and knows the element width, and will pack an FU/registers SIMD-style, and also continue into a neighboring FU. I suppose a matrix takes care of unaligned/incomplete widths, but that might be harder on alternative implementations.
And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width? Hence the proposal for the official extension.
Anyways, to be more specific, say you've entered vector mode and the next instructions in the queue are:
load x to r13
load Vx to r0...r3
load Vy to r4...r7
mult r13, r4...r7 to r17...r20
add r4...r7, r17...r20 to r13...r16
store r13...r16 to Vz
And say Vy is in the L1 cache, but Vx isn't. It seems like the load request for Vx could clog up your load/store units even though there's potential for a better schedule. Or can loads overtake other loads in the pipeline?
Originally posted by lkcl
memory load-store is basically exactly as it would be for a multi-issue superscalar out-of-order load-store, but most first-time processor architects wouldn't dream of creating a 6 to 8 multi-issue load-store microarchitecture. even BOOM has only just recently added 2 simultaneous load-stores.
to cope with the kind of memory load anticipated, i had to spend several months with Mitch Alsup on comp.arch last year, to get enough of an understanding of how to do it.
Originally posted by lkcl
yes. and a minimum 256-bit L2 cache data path, plus 4 "striped" L1 caches. absolutely mental. *nobody* in open hardware has tried designing something like this as a first processor! everyone does like 32-bit L1 cache data paths, or 64-bit, maybe.
Originally posted by lkcl
not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:
Vector A fits into R0 R1 R2 R3
Vector B fits into R4 R5 R6 R7
result C is to go into R8 R9 R10 R11
the data paths between R0, R4, R8, R12, R16 (etc) are immediate and direct. likewise between R1, R5, R9, .... etc.
therefore this takes 1 clock cycle to read or write, and there are 4 such "paths" between regfiles, so all *four* sets of vector ops (R8=R4+R0, R9=R5+R1) all do not interfere with each other.
however let us say that you make the "mistake" of doing this:
Vector A fits into R0 R1 R2 R3
Vector B fits into R4 R5 R6 R7
result C is to go into R9 R10 R11 R12
now although the reads (A, B) work fine, the result R0+R4, needing to go into R9, is in the *wrong lane* and must be dropped into the "cyclic buffer". it will be a *three* cycle latency before it gets written to the regfile.
otherwise we have to have a full crossbar (12 or 16 way READ and 8 or 10 WRITE) and that's just completely insane.
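The striping scheme above can be modelled in a few lines. This is an illustrative sketch only: the assumption that register N lives in lane N % 4 is inferred from the R0/R4/R8/R12 example, and the 1-cycle vs 3-cycle figures are taken directly from the text; the function names are mine.

```python
# Toy model of the 4-way striped regfile: a result writes in 1 cycle
# only if its lane matches the operand lanes, otherwise it takes the
# 3-cycle cyclic-buffer path.
NLANES = 4
FAST_CYCLES, BUFFERED_CYCLES = 1, 3

def lane(reg):
    # assumption: register N is striped into lane N % 4
    return reg % NLANES

def write_latency(src_a, src_b, dest):
    """Per-element write latency for dest = src_a + src_b."""
    if lane(dest) == lane(src_a) == lane(src_b):
        return FAST_CYCLES       # direct path, e.g. R8 = R4 + R0
    return BUFFERED_CYCLES       # wrong lane: result goes via the buffer

# aligned case from the text: A in R0-R3, B in R4-R7, C in R8-R11
aligned = [write_latency(a, b, d)
           for a, b, d in zip(range(0, 4), range(4, 8), range(8, 12))]
# the "mistake" case: C in R9-R12, every element lands in the wrong lane
misaligned = [write_latency(a, b, d)
              for a, b, d in zip(range(0, 4), range(4, 8), range(9, 13))]
```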
Originally posted by lkcl
they're not quite ZOHLsm but yes if you cognitively disconnect "decode" from "issue" then consider SV to be "a compressed version of decode", we can still have multi-issue decode and multi-issue execution.
yeah there is so much to get done before considering doing that, although hilariously we considered overloading "branch" as a way to "start threads".
we have to do a full from-scratch redesign, in particular taking into account Condition Registers in PowerISA. sigh. https://bugs.libre-soc.org/show_bug.cgi?id=213
Originally posted by lkcl
yes. i mean, there's a reason why "scalar" (normal, SMP) CPUs have scratch registers for context-switching (MIPS, RISC-V in particular) it's to get fast context-switches. where you start to include bank-switching of full register sets, including SPRs, it starts to make me twitchy about implementing something like that in a hybrid context. pure (dedicated) GPU, no problem.
Originally posted by WorBlux
Your register and cache video helped explain some of it, but I still don't think I fully understand the 6600 overall. 7 hours seems optimistic just for that aspect.
I do occasionally find myself on comp.arch, just trying to glean interesting tidbits when I can.
Poor DIV, always the black sheep of the family. But I can now see how it'd be easy to multiplex the issue.
I think I'm starting to get it. The X.org presentation helped. So the decoder sees the Vector instruction, and knows the element width, and will pack a FU/registers SIMD style,
and also continue into a neighboring FU.
I suppose a Matrix takes care of unaligned/incomplete widths,
but that might be harder on alternative implementations.
And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width?
Hence the proposal for the official extension.
this is the primary reason why we dropped RISC-V, because they failed, persistently and regularly, under their legal responsibilities under Trademark Law, to respond to reasonable in-good-faith requests to be included in the enhancement of the RISC-V ISA *without* completely compromising our business objectives.
moving on
Anyways, to be more specific, say you've entered vector mode and the next instructions in the queue are:
load x to r13
load Vx to r0...r3
load Vy to r4...r7
mult r13, r4...r7 to r17...r20
add r4...r7, r17...r20 to r13...r16
store r13...r16 to Vz
And say Vy is in the L1 cache, but Vx isn't. It seems like the load request for Vx could clog up your load/store units even though there's potential for a better schedule. Or can loads overtake other loads in the pipeline?
i.e. the fallback is "these LD/STs are going to be done sequentially if we *can't* find opportunities for parallelism" rather than "assume everything's done in parallel and whoops we missed some, wark, data-corruption"
Indeed, it does look like a very ambitious project, even more so once you drill into the details.
It's a real shame RISC-V wasn't more accommodating.
moving on...
I can see why you're using a python flavor to do it.
best to just stick with a modern OO programming language entirely.
"not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:"
Does that mean the entire pipeline has to stall while waiting on the buffer? If so... ouch.
couple of notes:
1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)
2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".
Overloading the branches for predicates does seem pretty clever though.
https://bugs.libre-soc.org/show_bug.cgi?id=213#c48
Originally posted by WorBlux
Understandable, I've got some ideas, but keep getting stuck on details. Maybe normal prediction and prefetch will be plenty good in practice. And maybe not stuffing all the load units full from a single load-vector source instruction.
the nice thing about the predication is, it drops on top of the SIMD masks, and from there through to regfile byte-write-enables. no matter the element width, it's all good. it means that for a 64 bit operation, writing to the regfile we need to raise 8x byte-level write lines, but that's standard practice for SRAMs in L1 and L2 caches so cell library developers are going "yawn" at that (small) innovation.
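The predicate-to-byte-write-enable mapping just described can be sketched as follows. The function and its calling convention are my own illustration of the idea, not project code.

```python
# Sketch: map per-element predicate bits down to the per-byte
# write-enable lines of one 64-bit register (8 byte lanes). Each
# predicated-true element raises elwidth/8 of the byte lanes.
def byte_write_enables(pred_bits, elwidth_bits):
    """Return the 8-bit byte-enable mask for one 64-bit register."""
    bytes_per_elt = elwidth_bits // 8
    mask = 0
    for i, p in enumerate(pred_bits):
        if p:  # predicate true: enable that element's byte lanes
            mask |= ((1 << bytes_per_elt) - 1) << (i * bytes_per_elt)
    return mask

# 8x INT8 elements, alternating predicate -> alternating byte lanes
m8 = byte_write_enables([1, 0, 1, 0, 1, 0, 1, 0], 8)
# 2x INT32 elements, only the low element enabled -> low 4 byte lanes
m32 = byte_write_enables([1, 0], 32)
```

The point in the text is that no matter the element width, everything bottoms out in the same 8 byte-level write lines, which SRAM cell libraries already support.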
First off, thanks for the reply, it's clarified quite a few things for me.
Originally posted by lkcl
well, if you try to slam 64x FP64 operations into the engine then yes you're going to run out of registers. if however you try 64 INT8 operations those will get spread out across 8x SIMD ALUs taking 8 64-bit registers each, which is... tolerable.
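The arithmetic behind the quote above is simple enough to write down. This helper is illustrative only, not project code:

```python
# How many 64-bit registers a vector operation of VL elements at a
# given element width occupies: narrow elements pack into fewer regs.
def regs_needed(vl, elwidth_bits, regwidth_bits=64):
    elts_per_reg = regwidth_bits // elwidth_bits
    return -(-vl // elts_per_reg)   # ceiling division

fp64_regs = regs_needed(64, 64)   # 64x FP64: one element per register
int8_regs = regs_needed(64, 8)    # 64x INT8: eight elements per register
```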
Originally posted by lkcl
ah no. the issue engine is independent, the Reservation Stations are independent and their latches (called "Nameless Registers" in augmented-6600 terminology) act as buffers. as long as you still have RSes to reserve, the issue engine does not stall, and the RSes are *not* dependent on the Register File(s) for resource allocation. however the *moment* any given instruction cannot reserve a required RS, *then* you must stall.
couple of notes:
1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)
2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".
1. So several FUs might share a pipelined ALU, so long as each can track and buffer its results? But once an FU is issued an instruction, it has to track it through to commit?
I may just have to go read that book. Are Mitch's addendum chapters publicly available?
Originally posted by lkcl
that was for RISC-V. OpenPOWER ISA, everything is based around Condition Registers. so, i am advocating that we simply vectorise those (and increase their number to 64 or 128)
https://bugs.libre-soc.org/show_bug.cgi?id=213#c48
One really bad idea: ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.
One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think about it. If you were only ever doing 8x SIMD, maybe.
Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.
Originally posted by WorBlux
First off, thanks for the reply, it's clarified quite a few things for me.
So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways).
Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.
Rename because of the dual FU-FU and FU-Reg DMs.
If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.
B has a Read-after-Write hazard on A, C has a Write-after-Read hazard on B. yes absolutely, C can go ahead in parallel, create the result, and once the WaR hazard is dropped by B, the "hold" goes away.
C is then allowed to raise "Write_Request", C will (at some point) be notified "ok, RIGHT NOW, you must put data, RIGHT NOW, on this clock cycle, for one cycle only, the data you want writing to the regfile". this is the "GO_WRITE" signal, and following that GO_WRITE (the cycle after), C absolutely must drop its Write_Request (because it's done its write). that "drop" of the Write_Request also goes into the FU-FU and FU_Regs Dependency Matrices to say "i no longer have a dependency: i'm totally done, no longer busy, and therefore free to be issued another instruction".
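The Write_Request / GO_WRITE handshake just described can be modelled as a toy state machine. The signal names come from the text; the class structure and method names are my own illustration, not Libre-SOC code.

```python
# Toy model: a Computation Unit computes its result immediately, holds
# it in a latch while a WaR hazard is live, raises Write_Request, and
# on GO_WRITE puts the data on the bus for exactly one cycle, then
# drops the request and its busy flag so it can be re-issued.
class CompUnit:
    def __init__(self):
        self.busy = False
        self.write_request = False
        self.war_hold = False
        self.result = None

    def issue(self, a, b):
        self.busy = True
        self.result = a + b        # compute in parallel, hold in latch
        self.war_hold = True       # old dest value still has readers

    def drop_war_hazard(self):     # reader (B) has pulled its read
        self.war_hold = False
        self.write_request = True  # now allowed to ask for the regfile

    def go_write(self, regfile, dest):
        assert self.write_request and not self.war_hold
        regfile[dest] = self.result   # data valid for this cycle only
        self.write_request = False    # drop request: totally done
        self.busy = False             # free to take a new instruction

regs = {3: 111}          # old r3 value, still being read by RS B
c = CompUnit()
c.issue(2, 5)            # C computes 7 in parallel, result held
c.drop_war_hazard()      # WaR hazard gone: raise Write_Request
c.go_write(regs, 3)      # GO_WRITE: r3 updated, C goes idle
```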
it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it.
1. So several FUs might share a pipelined ALU,
so long as each can track and buffer its results? But once an FU is issued an instruction, it has to track it through to commit?
the only thing that's slightly odd in the Concurrent Computation Unit case is: the FU is *not* the pipeline, it's the RS connected *to* the Pipeline. or, put another way, 4x RSes connected to a shared (mutexed) pipeline is actually *FOUR* separate Function Units.
it got very painful, explaining on comp.arch that it was important to computer science that there be different terminology to distinguish "hazard-aware Function Unit" from "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".
eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".
Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".
would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?
I may just have to go read that book. Are Mitch's addendum chapters publicly available?
I'm looking at this CR thing for a while now, digging into that bug report, and the Power ISA specification, and not really getting any great ideas.
One really bad idea: ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.
One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think about it. If you were only ever doing 8x SIMD, maybe.
btw there are no bad ideas at this stage.
Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.
Last edited by lkcl; 18 October 2020, 01:12 AM.
Originally posted by lkcl
well, what i'm hoping is that the significant work being done on llvm for RISC-V, ARM SVE/2, and other companies with variable-length VL, will hit mainline well in advance, such that all we need to do is a minimalist amount of porting work to add SV.
Originally posted by lkcl
the FU-Regs matrix covers the information about which registers a given FU needs to read or write, whilst the FU-FU matrix preserves the *result* ordering dependency. interestingly, FU-FU preserves a DAG (Directed Acyclic Graph)
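The two-matrix split quoted above can be sketched minimally. The data layout and method names here are my own illustration; only the roles of the matrices come from the text.

```python
# Minimal sketch: FU-Regs records which registers each Function Unit
# reads/writes; FU-FU records "FU i must wait for FU j's result".
# Because waits are only ever created against earlier-issued FUs that
# have not yet written, the FU-FU matrix describes a DAG.
class Scoreboard:
    def __init__(self, n_fus, n_regs):
        self.reads  = [[0] * n_regs for _ in range(n_fus)]  # FU-Regs
        self.writes = [[0] * n_regs for _ in range(n_fus)]  # FU-Regs
        self.waits  = [[0] * n_fus for _ in range(n_fus)]   # FU-FU

    def issue(self, fu, rd_regs, wr_regs):
        for r in rd_regs:
            self.reads[fu][r] = 1
            for other in range(len(self.waits)):
                # RaW: we read a register another FU has yet to write,
                # so our result ordering depends on that FU
                if other != fu and self.writes[other][r]:
                    self.waits[fu][other] = 1
        for r in wr_regs:
            self.writes[fu][r] = 1

sb = Scoreboard(n_fus=4, n_regs=8)
sb.issue(0, rd_regs=[1, 2], wr_regs=[3])   # FU0: r3 <- r1 op r2
sb.issue(1, rd_regs=[3], wr_regs=[4])      # FU1 reads FU0's result
```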
Originally posted by lkcl
it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it.
....
it got very painful, explaining on comp.arch that it was important to computer science that there be different terminology to distinguish "hazard-aware Function Unit" from "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".
eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".
Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".
would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?
Originally posted by lkcl
tricky, isn't it? now you see why it took 18 months to design SV (and implement the simulator).
Originally posted by WorBlux
Ya it's such a huge advance in API, I'm surprised it took 4 iterations of mainstream SIMD before chip designers said "ya, that's a problem, I admit"
btw you may be intrigued to know that a number of people working for Northrop Grumman, and others who used Cray supercomputers, were significant contributors to RVV.
I think DAG translates to Dataflow diagram, for those who are less mathematically astute.
I can believe it. After the RISC revolution it seems everyone ran off in the same direction, with most going down the CPU lane and a few going VLIW.
SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.
250.
the RVV equivalent is *14*
A lot of iterative improvements, like 4% better branch predictors or an iterative improvement on LRU cache eviction. But fundamental design choices seemed to be stamped in steel (with million-dollar+ lithography masks, anyways). If we can credit OpenRISC/RISC-V with anything, it's reviving interest and tool-chains in the open commons.
Ya certainly. I'm also seeing why prior designs pushed SIMD and let the compiler deal with all the edge cases.
dead simple, right?
wark-wark
Especially if you're on a tight timeline trying to hit that next process node first. Not being on a tight schedule or tied to a particular node gives you the time and creative space to get it right. Also, little wonder why you, Mitch, and Ivan Godard all end up in the same places.
Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all types of operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.
but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, tag the register from that point onwards, and are carried right the way to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.
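The "polymorphic width" idea described above can be illustrated with a toy model. The data representation here is entirely my own invention, purely to show the concept of one untyped ADD whose behaviour comes from tags attached at LOAD:

```python
# Toy model: LOAD attaches a width tag to the value; a single ADD
# derives its wrap-around behaviour from the tags, not from the opcode,
# and the tag is carried forward toward the eventual STORE.
def load(raw, width_bits, signed=True):
    return {"value": raw, "width": width_bits, "signed": signed}

def add(a, b):
    assert a["width"] == b["width"]          # tags must agree
    mask = (1 << a["width"]) - 1
    s = (a["value"] + b["value"]) & mask     # wrap at the tagged width
    return {"value": s, "width": a["width"], "signed": a["signed"]}

x = load(250, 8)
y = load(10, 8)
z = add(x, y)    # one ADD opcode; 8-bit wrap comes from the loads
```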
me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD
Originally posted by lkcl
SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.
250.
the RVV equivalent is *14*
Originally posted by lkcl
Mitch does all his designs at the gate level. he studied the 6600 very early on, recognised its genius, and based the Motorola 68000 on it. his expertise developed from there.
Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all types of operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.
but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, tag the register from that point onwards, and are carried right the way to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.
Of course they are changing and refining all the time. But it is a very CISC instruction set. Not only is each member binary-incompatible, every FU slot has a different binary encoding (and set of supported ops), which is one of the ways they get away with such a wide issue.
This, combined with the conceptual conveyor belt, means nobody except the crypto guys is going to be writing anything in raw assembler. I think their plan for hardware initialization is to include a Forth interpreter in the ROM. Not unheard of, but a very different approach from the current mainstream.
Originally posted by lkcl
me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD