Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • lkcl
    replied
    Originally posted by WorBlux View Post

    The micro-watt mmu is surprisingly readable. Still only following about a third of it (it was a help to look up what a radix tree was, i'm not particularly a maths guy). The note about the (unimplemented) hypervisor capability just being *this* times 2 was interesting. As I said i've got a lot of reading and learning to do before I can make full sense of lower-level microarchitectural details. I can tell it's interacting with the TLB and doing a lot of shifts to pull a physical address out of a process layer table (that's actually laid out as a multi-level radix tree in memory).

    microwatt's code is... stunning. it's actually an exercise in mastery of programming and communication. its developers have been world-leading experts on and contributors to powerpc for 25 years: Paul Mackerras amusingly freaked out an Apple store employee when the first cyan iMacs came out by pressing the well-known BSD boot key combination to put the machine into verbose console debug logging mode, and finding that it worked.

    you may find this useful:

    https://github.com/power-gem5/gem5/b...lk_example.txt

    there aren't many up-to-date OpenPOWER ISA simulators out there that can boot a modern ppc64le kernel (dolphin, pearpc and others remain 32-bit and do not boot modern 32-bit linux distros), let alone ones with readable source code as well (qemu's ppc64 source is excluded from that list due to its focus on JIT compilation), whereas the (experimental) POWER port of the cycle-accurate gem5 simulator is, by contrast, pretty readable.



  • WorBlux
    replied
    Originally posted by lkcl View Post
    that's a great sign! it means they have engineers who focus on what they're doing and getting that right.


    i strongly recommend looking at microwatt's source code, here, in this case mmu.vhdl. it's pretty readable.



    ah, turns out endianness (both) is ridiculously simple to implement. Paul Mackerras added dual endianness in about 80 lines of VHDL because the infrastructure was already there. LOADs in the OpenPOWER ISA are byte-reversible so it was a trivial matter of hooking in there. he also added 32-bit mode in about 80 lines as well!


    thanks for the reminder, got it.
    The micro-watt mmu is surprisingly readable. Still only following about a third of it (it was a help to look up what a radix tree was, i'm not particularly a maths guy). The note about the (unimplemented) hypervisor capability just being *this* times 2 was interesting. As I said i've got a lot of reading and learning to do before I can make full sense of lower-level microarchitectural details. I can tell it's interacting with the TLB and doing a lot of shifts to pull a physical address out of a process layer table (that's actually laid out as a multi-level radix tree in memory).
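
    Roughly, the shape of that walk, as a toy Python sketch (made-up level count and field widths, nothing to do with microwatt's actual VHDL or the real POWER radix format):

        # toy multi-level radix page-table walk: peel index bits off the
        # virtual address, one table level at a time, until a leaf is hit
        PAGE_SHIFT = 12        # 4 KiB pages (assumed for the sketch)
        BITS_PER_LEVEL = 9     # 9-bit index per level (assumed)
        LEVELS = 4

        def radix_walk(memory, root, vaddr):
            """memory: dict of table-entry address -> (is_leaf, value)."""
            table = root
            shift = PAGE_SHIFT + BITS_PER_LEVEL * (LEVELS - 1)
            for _ in range(LEVELS):
                index = (vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1)
                is_leaf, value = memory[table + index]
                if is_leaf:
                    # value is the physical page number; re-attach the offset
                    return (value << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
                table = value              # descend to the next level
                shift -= BITS_PER_LEVEL
            raise ValueError("no leaf entry found - page fault")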

    And the endianness win is good news; if only they were all that simple (though you'd risk getting bored, I suppose)



  • lkcl
    replied
    Originally posted by WorBlux View Post
    And the wiki is way behind current development.
    that's a great sign! it means they have engineers who focus on what they're doing and getting that right.


    This sent me down a rabbit hole of reading about the radix MMUs and hypervisor page tables.
    i strongly recommend looking at microwatt's source code, here, in this case mmu.vhdl. it's pretty readable.


    and trying to figure out if you had nailed down the endian mode yet.
    ah, turns out endianness (both) is ridiculously simple to implement. Paul Mackerras added dual endianness in about 80 lines of VHDL because the infrastructure was already there. LOADs in the OpenPOWER ISA are byte-reversible so it was a trivial matter of hooking in there. he also added 32-bit mode in about 80 lines as well!
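
    (a toy Python model of why byte-reversible LOADs make that nearly free - illustrative only, not Paul's VHDL:)

        # a core that always fetches memory bytes in one fixed order only
        # needs a conditional byte-reverse on LOAD to support both endian modes
        def load(memory, addr, size, little_endian):
            data = bytes(memory[addr + i] for i in range(size))
            if not little_endian:
                data = data[::-1]          # the byte-reverse path
            return int.from_bytes(data, "little")

        mem = {0: 0x12, 1: 0x34, 2: 0x56, 3: 0x78}
        assert load(mem, 0, 4, little_endian=True)  == 0x78563412
        assert load(mem, 0, 4, little_endian=False) == 0x12345678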

    BTW. I did send you a pm.
    thanks for the reminder, got it.



  • WorBlux
    replied
    Originally posted by lkcl View Post
    i'm reasonably sure i saw a talk in which Ivan said that the ISA is small and compact, which would tend to suggest that those are pseudo-ops. don't know.
    Yes, the encoding is very compact; the ISA itself, though, isn't so much.

    http://millcomputing.com/wiki/Cores/Gold
    http://millcomputing.com/wiki/Cores/Gold/Encoding

    23 bits for the exu slot that can handle floating-point ops, 19 for the slots that can't. They need to address at most 2 sources, at 5 bits each, leaving 13/9 bits to encode the instruction; at optimal packing that's still quite a few instructions, with immediates cutting the space down some. And of course the whole thing is still in development, and with their specification process the final number of ops is still to be determined. And the wiki is way behind current development.


    Originally posted by lkcl View Post
    ah we do have to be careful that things are not so radical that adoption does not happen. however in speaking with Mendy from the OpenPOWER Foundation she very kindly clarified how the new (v3.1B) Platforms work.

    the misunderstanding in the open source community is that the platforms are defined by what is in IBM POWERnn processors. the code for glibc6 for example specifically has "#ifdef POWER9...."

    Mendy explained to me that the "Embedded Floating Point" platform is a *minimum* requirement not a *hard and only* requirement. therefore we can classify our processor as "Embedded FP", thus avoid adding VSX, yet then go on to add a RADIX MMU which is not specifically required for the Embedded Platform.

    the problem comes in making sure that the glibc6 and GNU/Linux Distros are aware that there are more linux-capable processors out there than IBM POWER9 and POWER10.
    Well, that's all well and good, and it wouldn't be the first processor that required a little finer slice of #ifdef.

    This sent me down a rabbit hole of reading about the radix MMUs and hypervisor page tables, and trying to figure out if you had nailed down the endian mode yet. Looks like the POWER transition really tossed a lot of things in the air. Hopefully your final design won't take as long as the Mill.

    BTW. I did send you a pm.



  • lkcl
    replied
    Originally posted by WorBlux View Post
    Which kind of makes you wonder how much performance is being left on the table by clogging your caches with code rather than data...
    indeed. with a hybrid design we are in a tricky situation of context switching between two different workloads, CPU and GPU. the point of SV is to "compactify" the vector opcodes to minimise cache thrashing.

    I'm fairly sure that on the Mill the load specifies width, and the instruction provides type (and that a pointer is a hardware type distinct from an unsigned integer) http://millcomputing.com/wiki/Instruction_Set
    i'm reasonably sure i saw a talk in which Ivan said that the ISA is small and compact, which would tend to suggest that those are pseudo-ops. don't know.

    One advantage of working for yourself, I suppose.
    ah we do have to be careful that things are not so radical that adoption does not happen. however in speaking with Mendy from the OpenPOWER Foundation she very kindly clarified how the new (v3.1B) Platforms work.

    the misunderstanding in the open source community is that the platforms are defined by what is in IBM POWERnn processors. the code for glibc6 for example specifically has "#ifdef POWER9...."

    Mendy explained to me that the "Embedded Floating Point" platform is a *minimum* requirement not a *hard and only* requirement. therefore we can classify our processor as "Embedded FP", thus avoid adding VSX, yet then go on to add a RADIX MMU which is not specifically required for the Embedded Platform.

    the problem comes in making sure that the glibc6 and GNU/Linux Distros are aware that there are more linux-capable processors out there than IBM POWER9 and POWER10.





  • WorBlux
    replied
    Originally posted by lkcl View Post
    SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.

    250.

    the RVV equivalent is *14*
    Which kind of makes you wonder how much performance is being left on the table by clogging your caches with code rather than data...

    Originally posted by lkcl View Post
    Mitch does all his designs at the gate level. he studied the 6600 very early on, recognised its genius, and based the Motorola 68000 on it. his expertise developed from there.

    Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all types of operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.

    but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, tag the register from that point onwards, and are carried right the way to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.
    I'm fairly sure that on the Mill the load specifies width, and the instruction provides type (and that a pointer is a hardware type distinct from an unsigned integer) http://millcomputing.com/wiki/Instruction_Set

    Of course they are changing and refining all the time. But it is a very CISC instruction set. Not only is each family member binary-incompatible with the others, every FU slot has a different binary encoding (and set of supported ops), which is one of the ways they get away with such a wide issue.

    This, combined with the conceptual conveyor belt, means nobody except the crypto guys is going to be writing anything in raw assembler. I think their plan for hardware initialization is to include a Forth interpreter in the ROM. Not unheard of, but a very different approach from the current mainstream.

    Originally posted by lkcl View Post
    me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD
    One advantage of working for yourself, I suppose.



  • lkcl
    replied
    Originally posted by WorBlux View Post
    Ya it's such a huge advance in API, I'm surprised it took 4 iterations of mainstream SIMD before chip designers said "ya, that's a problem, I admit"
    all they had to do was look at the Cray architecture! Cray did vectors large enough that the regfile needed to be held in external ultra-fast SRAM.

    btw you may be intrigued to know that a number of people working for Northrop Grumman, and others who used Cray supercomputers, were significant contributors to RVV.

    I think DAG translates to Dataflow diagram, for those who are less mathematically astute.
    (and like blockchains which are all DAGs)

    I can believe it. After the RISC revolution it seems everyone ran off in the same direction, with most going down the CPU lane and a few going VLIW.
    realistically it has to be said that the only highly commercially successful VLIW processor is the TI DSP series. these are VLIW-double-interleaved, so really good for audio stereo processing. they're also typically programmed in assembler.

    SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.

    250.

    the RVV equivalent is *14*
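
    (to make the contrast concrete, a toy strip-mined loop in Python pseudocode, in the spirit of the "SIMD considered harmful" daxpy example - not actual VSX or RVV code:)

        # vector-length style: one code path; "setvl" caps VL each iteration
        def vec_add(a, b, max_vl=8):
            out, i = [], 0
            while i < len(a):
                vl = min(max_vl, len(a) - i)        # hardware picks the cap
                out += [a[i + j] + b[i + j] for j in range(vl)]
                i += vl
            return out

        # SIMD style: fixed 4-wide main loop plus a scalar tail, and a whole
        # new copy of this for every new register width the vendor adds
        def simd_add(a, b):
            out, i = [], 0
            while i + 4 <= len(a):
                out += [a[i + j] + b[i + j] for j in range(4)]
                i += 4
            while i < len(a):                        # tail / edge-case handling
                out.append(a[i] + b[i])
                i += 1
            return out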

    A lot of iterative improvements, like 4% better branch predictors or an iterative improvement on LRU cache eviction. But fundamental design choices seemed to be stamped in steel (with million-dollar+ lithography screens, anyway). If we can credit openrisc/RISC-V with anything, it's bringing renewed interest and toolchains to the open commons.
    well we can, but only because they got there before IBM had finished the preparatory work for opening PowerISA. Hugh said that he was hilariously contacted by tons of people saying, "duuude, OPF should totally do what RISCV is doing" and he had to tell them that a small team had been preparing exactly that, quietly, for 10 years

    Ya certainly. I'm also seeing why prior designs pushed SIMD and let the compiler deal with all the edge cases.
    indeed. it's veeery seductive. slap another opcode on, drop in a new ALU, and as far as the architecture is concerned, the SIMD opcode is literally no different from any other scalar operation. 2xFP32 is one opcode, 1xFP64 is another, neither the ISA nor architecture knows or cares.

    dead simple, right?

    wark-wark

    Especially if you're on a tight timeline trying to hit that next process node first. Not being on a tight schedule or tied to a particular node gives you the time and creative space to get it right. Also, little wonder why you, Mitch, and Ivan Godard all end up in the same places.
    Mitch does all his designs at the gate level. he studied the 6600 very early on, recognised its genius, and based the Motorola 68000 on it. his expertise developed from there.

    Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all types of operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.

    but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, tag the register from that point onwards, and are carried right the way to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.
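
    (a toy Python model of that "polymorphic" idea - width and type tagged at LOAD, a single ADD opcode, the tag carried through to STORE; purely illustrative, not the Mill's actual belt semantics:)

        from dataclasses import dataclass

        @dataclass
        class Operand:
            value: object      # payload (int or float)
            width: int         # bits, tagged at LOAD time
            kind: str          # "int" or "fp", tagged at LOAD time

        def LOAD(value, width, kind):
            return Operand(value, width, kind)

        def ADD(a, b):
            # one opcode: width/type ride along with the operands, so there is
            # no FP32-ADD vs INT16-ADD distinction in the instruction set
            assert (a.width, a.kind) == (b.width, b.kind)
            if a.kind == "int":
                mask = (1 << a.width) - 1
                return Operand((a.value + b.value) & mask, a.width, "int")
            return Operand(a.value + b.value, a.width, "fp")

        x, y = LOAD(7, 16, "int"), LOAD(9, 16, "int")
        print(ADD(x, y))       # Operand(value=16, width=16, kind='int')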

    me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD




  • WorBlux
    replied
    Originally posted by lkcl View Post
    well, what i'm hoping is that the significant work being done on llvm for RISC-V, ARM SVE/2, and other companies with variable-length VL, will hit mainline well in advance, such that all we need to do is a minimal amount of porting work to add SV.
    Ya it's such a huge advance in API, I'm surprised it took 4 iterations of mainstream SIMD before chip designers said "ya, that's a problem, I admit"
    Originally posted by lkcl View Post
    the FU-Regs matrix covers the information about which registers a given FU needs to read or write, whilst the FU-FU matrix preserves the *result* ordering dependency. interestingly, FU-FU preserves a DAG (Directed Acyclic Graph)
    I think DAG translates to Dataflow diagram, for those who are less mathematically astute.
    Originally posted by lkcl View Post
    it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it

    ....

    it got very painful on comp.arch to explain that it is important to computer science to have distinct terminology for "hazard-aware Function Unit" as opposed to "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".

    eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".

    Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".

    would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?
    I can believe it. After the RISC revolution it seems everyone ran off in the same direction, with most going down the CPU lane and a few going VLIW. A lot of iterative improvements, like 4% better branch predictors or an iterative improvement on LRU cache eviction. But fundamental design choices seemed to be stamped in steel (with million-dollar+ lithography screens, anyway). If we can credit openrisc/RISC-V with anything, it's bringing renewed interest and toolchains to the open commons.
    Originally posted by lkcl View Post
    tricky, isn't it? now you see why it took 18 months to design SV (and implement the simulator).
    Ya certainly. I'm also seeing why prior designs pushed SIMD and let the compiler deal with all the edge cases. Especially if you're on a tight timeline trying to hit that next process node first. Not being on a tight schedule or tied to a particular node gives you the time and creative space to get it right. Also, little wonder why you, Mitch, and Ivan Godard all end up in the same places.



  • lkcl
    replied
    Originally posted by WorBlux View Post
    First off, thanks for the reply, it's clarified quite a few things for me.



    So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways).
    pretty much, yeah. i mean, the compiler will know the register allocation / usage, and normally would shove out a batch of SIMD instructions (4x 4-wide SIMD to do 16 operations), whereas with SV it would issue *one* scalar operation with VL=16, *knowing* that this means that 16 registers will be needed.
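
    (schematically, with hypothetical mnemonics and register numbers - not the actual SV encoding:)

        # SIMD view: four explicit 4-wide operations, each naming its own registers
        simd_code = [
            "vadd v0, v4, v8",
            "vadd v1, v5, v9",
            "vadd v2, v6, v10",
            "vadd v3, v7, v11",
        ]

        # SV view: set VL, then ONE scalar add; the decoder expands it over
        # 16 consecutive registers starting at the named ones
        sv_code = ["setvl 16", "add r16, r32, r48"]

        def sv_expand(dest, a, b, vl):
            # what that single add conceptually turns into
            return [f"add r{dest + i}, r{a + i}, r{b + i}" for i in range(vl)]

        print(sv_expand(16, 32, 48, 16))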

    Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.
    well, what i'm hoping is that the significant work being done on llvm for RISC-V, ARM SVE/2, and other companies with variable-length VL, will hit mainline well in advance, such that all we need to do is a minimal amount of porting work to add SV.


    Rename because of the dual FU-FU and FU-Reg DMs.
    the FU-Regs matrix covers the information about which registers a given FU needs to read or write, whilst the FU-FU matrix preserves the *result* ordering dependency. interestingly, FU-FU preserves a DAG (Directed Acyclic Graph)

    If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.
    pretty much

    B has a Read-after-Write hazard on A, C has a Write-after-Read hazard on B. yes absolutely, C can go ahead in parallel, create the result, and once the WaR hazard is dropped by B, the "hold" goes away.

    C is then allowed to raise "Write_Request", and C will (at some point) be notified "ok, RIGHT NOW, you must put data, RIGHT NOW, on this clock cycle, for one cycle only, the data you want writing to the regfile". this is the "GO_WRITE" signal, and following that GO_WRITE (the cycle after), C absolutely must drop its Write_Request (because it's done its write). that "drop" of the Write_Request also goes into the FU-FU and FU-Regs Dependency Matrices to say "i no longer have a dependency: i'm totally done, no longer busy, and therefore free to be issued another instruction".

    it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it.
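
    (a very stripped-down Python model of just the bookkeeping, using the r3 example from above - none of the actual GO_READ/GO_WRITE signalling:)

        # FU-Regs: which registers each FU reads/writes.  FU-FU: which other
        # FUs this one must wait for.  waits_on gates the regfile write, NOT
        # the computation - C can still compute and hold the result in its latches.
        class Scoreboard:
            def __init__(self, n_fus):
                self.fu_reads  = [set() for _ in range(n_fus)]   # FU-Regs (read)
                self.fu_writes = [set() for _ in range(n_fus)]   # FU-Regs (write)
                self.waits_on  = [set() for _ in range(n_fus)]   # FU-FU matrix

            def issue(self, fu, reads, writes):
                self.fu_reads[fu], self.fu_writes[fu] = set(reads), set(writes)
                for other in range(len(self.waits_on)):
                    if other == fu:
                        continue
                    if set(reads) & self.fu_writes[other]:    # RaW hazard
                        self.waits_on[fu].add(other)
                    if set(writes) & self.fu_reads[other]:    # WaR hazard
                        self.waits_on[fu].add(other)
                    if set(writes) & self.fu_writes[other]:   # result ordering
                        self.waits_on[fu].add(other)

            def done(self, fu):
                self.fu_reads[fu].clear(); self.fu_writes[fu].clear()
                for deps in self.waits_on:
                    deps.discard(fu)          # the "drop" releases waiters

        sb = Scoreboard(n_fus=3)
        sb.issue(0, reads=[1, 2], writes=[3])   # A writes r3
        sb.issue(1, reads=[3],    writes=[4])   # B reads r3: RaW on A
        sb.issue(2, reads=[5, 6], writes=[3])   # C writes r3: WaR on B, ordering on A
        print(sb.waits_on)                      # [set(), {0}, {0, 1}]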

    1. So several FUs might share a pipelined ALU,
    yes. Mitch Alsup calls this "Concurrent Computation Units". basically if you have a 4-long pipeline, you have at least 4 "RSes" and you schedule one (and only one) of them to let it get data into the front of that pipeline, in each clock cycle.

    so long as it can track and buffer the results. But once an FU is issued an instruction it has to track it to commit?
    it's non-negotiably critical that they do so. failure to keep track of results is guaranteed, 100%, to result in data corruption.

    the only thing that's slightly odd in the Concurrent Computation Unit case is: the FU is *not* the pipeline, it's the RS connected *to* the Pipeline. or, put another way, 4x RSes connected to a shared (mutexed) pipeline is actually *FOUR* separate Function Units.
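
    (schematically, in toy Python: four RSes multiplexed onto one shared 4-stage pipeline, one granted entry per cycle, each tracking its own result - illustrative only:)

        PIPE_DEPTH = 4

        class SharedPipe:
            def __init__(self):
                self.stages = [None] * PIPE_DEPTH   # (rs_id, value) or None

            def tick(self, new_entry=None):
                result = self.stages[-1]            # whatever falls off the end
                self.stages = [new_entry] + self.stages[:-1]
                return result

        rs_operands = {0: (1, 2), 1: (3, 4), 2: (5, 6), 3: (7, 8)}
        pipe, pending = SharedPipe(), [0, 1, 2, 3]
        for cycle in range(8):
            grant = pending.pop(0) if pending else None          # mutex: one RS per cycle
            entry = (grant, sum(rs_operands[grant])) if grant is not None else None
            out = pipe.tick(entry)                 # (ALU work folded in, for brevity)
            if out is not None:
                rs_id, value = out
                print(f"cycle {cycle}: RS{rs_id} gets its result back: {value}")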

    it got very painful on comp.arch to explain that it is important to computer science to have distinct terminology for "hazard-aware Function Unit" as opposed to "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".

    eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".

    Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".

    would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?


    I may just have to go read that book. Are Mitch's addendum chapters publicly available?
    yes, if you send me your email address (PM me) and indicate that you agree that if you share the files with anyone else you must ask them to credit Mitch Alsup if they use any of the material, and to (recursively) require the same conditions of those follow-on recipients.


    I've been looking at this CR thing for a while now, digging into that bug report and the Power ISA specification, and not really getting any great ideas.


    One really bad idea - ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.

    One start to an idea was to expand the CR bit fields into byte fields (plus mask). Also seems more terrible the more I think of it. If you were only ever doing 8x SIMD, maybe.
    tricky, isn't it? now you see why it took 18 months to design SV (and implement the simulator).

    btw there are no bad ideas at this stage.

    Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.

    Last edited by lkcl; 18 October 2020, 01:12 AM.



  • WorBlux
    replied
    First off, thanks for the reply, it's clarified quite a few things for me.

    Originally posted by lkcl View Post
    well, if you try to slam 64x FP64 operations into the engine then yes you're going to run out of registers. if however you try 64 INT8 operations those will get spread out across 8x SIMD ALUs taking 8 64-bit registers each, which is... tolerable.
    So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways). Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.
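
    (the arithmetic behind the quoted point, as a tiny sketch: 64 FP64 elements need 64 whole 64-bit registers, while 64 INT8 elements pack into 8:)

        # how many 64-bit registers a vector of VL elements occupies
        def regs_needed(vl, elwidth_bits):
            return (vl * elwidth_bits + 63) // 64

        print(regs_needed(64, 64))   # 64 x FP64 -> 64 registers (ouch)
        print(regs_needed(64, 8))    # 64 x INT8 -> 8 registers (tolerable)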

    Originally posted by lkcl View Post
    ah no. the issue engine is independent, the Reservation Stations are independent and their latches (called "Nameless Registers" in augmented-6600 terminology) act as buffers. as long as you still have RSes to reserve, the issue engine does not stall, and the RSes are *not* dependent on the Register File(s) for resource allocation. however the *moment* any given instruction cannot reserve a required RS, *then* you must stall.

    couple of notes:

    1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)

    2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming", and it's only through an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".
    Rename because of the dual FU-FU and FU-Reg DMs. If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.

    1. So several FUs might share a pipelined ALU, so long as it can track and buffer the results. But once an FU is issued an instruction it has to track it to commit?

    I may just have to go read that book. Are Mitch's addendum chapters publicly available?


    Originally posted by lkcl View Post

    that was for RISC-V. OpenPOWER ISA, everything is based around Condition Registers. so, i am advocating that we simply vectorise those (and increase their number to 64 or 128)

    https://bugs.libre-soc.org/show_bug.cgi?id=213#c48
    I've been looking at this CR thing for a while now, digging into that bug report and the Power ISA specification, and not really getting any great ideas.

    One really bad idea - ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.

    One start to an idea was to expand the CR bit fields into byte fields (plus mask). Also seems more terrible the more I think of it. If you were only ever doing 8x SIMD, maybe.

    Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.



