Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source
Originally posted by lkcl
interesting. this sounds very much like an optimised barrel processor: i say "optimised", where barrel processors are normally fair-scheduling, you're talking about instant swapping between regfiles.
whereas, a hybrid CPU, being effectively "a standard SMP arrangement with extra opcode bells and whistles" would need a linux kernel OS context-switch.
Originally posted by lkcl
the summary takes 30 seconds. a full debrief takes 7 hours.
this is why we're extending the PowerISA regfile to 128 FP and 128 INT regs.
Originally posted by lkcl
Jeff Bush's Nyuzi paper, nyuzipass2015, already made this abundantly clear, hence why 128 FP and 128 INT regs. you absolutely cannot have the LOAD-processing-STORE loop interrupted by register spill.
(edit: well.. you can... but the power consumption penalty would terminate all and any possibility of having a commercially-viable processor. logically therefore, you don't do that!)
Originally posted by lkcl
ah no, not quite. the vector instruction is basically not really a vector at all, it's a "for-loop from 0 to VL-1 whilst the PC is not advanced i.e. it's a bit like a SUB-PC". conceptually it sits in between instruction decode and instruction issue.
it therefore shoves *elements* into the multi-issue execution engine.
now, if the VL is e.g. 4 and there is room for e.g. 8-wide multi-issue, then the instruction decode does *not* stop with that first vector instruction, it goes, "hmm if i decode the next instruction as well i can shove an extra 4 elements into the 8-wide multi-issue"
and at *that* point it will go "ok i can't do any more in this cycle"
but because all the Computation Units are pipelines (except DIV) then on the next cycle guess what? next instruction decode gets 8 more free issue slots, and off we go again.
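The decode-to-issue packing described above can be sketched as a toy model. Note this is purely illustrative: `ISSUE_WIDTH`, the queue format, and the function name are my own assumptions, not Libre-SOC code.

```python
# Sketch (not the actual Libre-SOC decoder): the decoder keeps consuming
# vector instructions, expanding each into VL element-level operations,
# until the multi-issue window of the current cycle is full.
ISSUE_WIDTH = 8  # assumed issue width for illustration

def pack_issue_window(decode_queue, vl):
    """Greedily turn decoded vector instructions into element-level
    issue slots, stopping only when the window is full."""
    slots = []
    while decode_queue and len(slots) + vl <= ISSUE_WIDTH:
        insn = decode_queue.pop(0)
        # the "sub-PC" for-loop: one element op per slot, elements 0..VL-1
        slots.extend((insn, elt) for elt in range(vl))
    return slots

# two vector adds with VL=4 fill an 8-wide window in one cycle;
# the third instruction waits for the next cycle's free slots
queue = ["vadd.1", "vadd.2", "vadd.3"]
window = pack_issue_window(queue, vl=4)
```

On the next cycle (the pipelines having advanced), the same packing runs again on the remaining queue, which matches the "8 more free issue slots, and off we go again" behaviour described above.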
I think I'm starting to get it. The X.org presentation helped. So the decoder sees the vector instruction, and knows the element width, and will pack an FU/registers SIMD-style, and also continue into a neighboring FU. I suppose a matrix takes care of unaligned/incomplete widths, but that might be harder on alternative implementations.
And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width? Hence the proposal for the official extension.
Anyways, to be more specific, say you've entered vector mode and the next instructions in the queue are:
load x to r13
load Vx to r0...r3
load Vy to r4...r7
mult r13, r4...r7 to r17...r20
add r4...r7, r17...r20 to r13...r16
store r13...r16 to Vz
And say Vy is in the L1 cache, but Vx isn't. It seems like the load request for Vx could clog up your load/store units even though there's potential for a better schedule. Or can loads overtake other loads in the pipeline?
Originally posted by lkcl
memory load-store is basically exactly as it would be for a multi-issue superscalar out-of-order load-store, but most first-time processor architects wouldn't dream of creating a 6 to 8 multi-issue load-store microarchitecture. even BOOM has only just recently added 2 simultaneous load-stores.
to cope with the kind of memory load anticipated, i had to spend several months with Mitch Alsup on comp.arch last year, to get enough of an understanding of how to do it.
Originally posted by lkcl
yes. and a minimum 256-bit L2 cache data path, plus 4 "striped" L1 caches. absolutely mental. *nobody* in open hardware has tried designing something like this as a first processor! everyone does like 32-bit L1 cache data paths, or 64-bit, maybe.
Originally posted by lkcl
not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:
Vector A fits into R0 R1 R2 R3
Vector B fits into R4 R5 R6 R7
result C is to go into R8 R9 R10 R11
the data paths between R0, R4, R8, R12, R16 (etc) are immediate and direct. likewise between R1, R5, R9, .... etc.
therefore this takes 1 clock cycle to read or write, and there are 4 such "paths" between regfiles, so all *four* sets of vector ops (R8=R4+R0, R9=R5+R1) all do not interfere with each other.
however let us say that you make the "mistake" of doing this:
Vector A fits into R0 R1 R2 R3
Vector B fits into R4 R5 R6 R7
result C is to go into R9 R10 R11 R12
now although the reads (A, B) work fine, the result R0+R4, needing to go into R9, is in the *wrong lane* and must be dropped into the "cyclic buffer". it will be a *three* cycle latency before it gets written to the regfile.
otherwise we have to have a full crossbar (12 or 16 way READ and 8 or 10 WRITE) and that's just completely insane.
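The striping scheme above can be modelled in a few lines. This is an illustrative sketch only: the assumption that register N lives in lane N % 4 is inferred from the R0/R4/R8/R12 example, and the 1-cycle vs 3-cycle figures are taken directly from the text; the function names are mine.

```python
# Toy model of the 4-way striped regfile: a result writes in 1 cycle
# only if its lane matches the operand lanes, otherwise it takes the
# 3-cycle cyclic-buffer path.
NLANES = 4
FAST_CYCLES, BUFFERED_CYCLES = 1, 3

def lane(reg):
    # assumption: register N is striped into lane N % 4
    return reg % NLANES

def write_latency(src_a, src_b, dest):
    """Per-element write latency for dest = src_a + src_b."""
    if lane(dest) == lane(src_a) == lane(src_b):
        return FAST_CYCLES       # direct path, e.g. R8 = R4 + R0
    return BUFFERED_CYCLES       # wrong lane: result goes via the buffer

# aligned case from the text: A in R0-R3, B in R4-R7, C in R8-R11
aligned = [write_latency(a, b, d)
           for a, b, d in zip(range(0, 4), range(4, 8), range(8, 12))]
# the "mistake" case: C in R9-R12, every element lands in the wrong lane
misaligned = [write_latency(a, b, d)
              for a, b, d in zip(range(0, 4), range(4, 8), range(9, 13))]
```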
Originally posted by lkcl
they're not quite ZOHLsm but yes if you cognitively disconnect "decode" from "issue" then consider SV to be "a compressed version of decode", we can still have multi-issue decode and multi-issue execution.
yeah there is so much to get done before considering doing that, although hilariously we considered overloading "branch" as a way to "start threads".
we have to do a full from-scratch redesign, in particular taking into account Condition Registers in PowerISA. sigh. https://bugs.libre-soc.org/show_bug.cgi?id=213
Originally posted by lkcl
yes. i mean, there's a reason why "scalar" (normal, SMP) CPUs have scratch registers for context-switching (MIPS, RISC-V in particular) it's to get fast context-switches. where you start to include bank-switching of full register sets, including SPRs, it starts to make me twitchy about implementing something like that in a hybrid context. pure (dedicated) GPU, no problem.
Originally posted by WorBlux
Your register and cache video helped explain some of it, but I still don't think I fully understand the 6600 overall. 7 hours seems optimistic just for that aspect.
I do occasionally find myself on comp.arch, just trying to glean interesting tidbits when I can.
Poor DIV, always the black sheep of the family. But I can now see how it'd be easy to multiplex the issue.
I think I'm starting to get it. The X.org presentation helped. So the decoder sees the Vector instruction, and knows the element width, and will pack a FU/registers SIMD style,
and also continue into a neighboring FU.
I suppose a Matrix takes care of unaligned/incomplete widths,
but that might be harder on alternative implementations.
And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width?
Hence the proposal for the official extension.
this is the primary reason why we dropped RISC-V, because they failed, persistently and regularly, under their legal responsibilities under Trademark Law, to respond to reasonable in-good-faith requests to be included in the enhancement of the RISC-V ISA *without* completely compromising our business objectives.
moving on
Anyways, to be more specific, say you've entered vector mode and the next instructions in the queue are:
load x to r13
load Vx to r0...r3
load Vy to r4...r7
mult r13, r4...r7 to r17...r20
add r4...r7, r17...r20 to r13...r16
store r13...r16 to Vz
And say Vy is in the L1 cache, but Vx isn't. It seems like the load request for Vx could clog up your load/store units even though there's potential for a better schedule. Or can loads overtake other loads in the pipeline?
i.e. the fallback is "these LD/STs are going to be done sequentially if we *can't* find opportunities for parallelism" rather than "assume everything's done in parallel and whoops we missed some, wark, data-corruption"
Indeed, it does look like a very ambitious project, even more so once you drill into the details.
It's a real shame RISC-V wasn't more accommodating.
moving on...
I can see why you're using a python flavor to do it.
best to just stick with a modern OO programming language entirely.
"not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:"
Does that mean the entire pipeline has to stall while waiting on the buffer? If so... ouch.
couple of notes:
1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)
2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".
Overloading the branches for predicates does seem pretty clever though.
https://bugs.libre-soc.org/show_bug.cgi?id=213#c48
Originally posted by WorBlux
Understandable, I've got some ideas, but keep getting stuck on details. Maybe normal prediction and prefetch will be plenty good in practice. And maybe not stuffing all the load units full from a single load-vector source instruction.
the nice thing about the predication is, it drops on top of the SIMD masks, and from there through to regfile byte-write-enables. no matter the element width, it's all good. it means that for a 64 bit operation, writing to the regfile we need to raise 8x byte-level write lines, but that's standard practice for SRAMs in L1 and L2 caches so cell library developers are going "yawn" at that (small) innovation.
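The predicate-to-byte-write-enable mapping just described can be sketched as follows. The function and its calling convention are my own illustration of the idea, not project code.

```python
# Sketch: map per-element predicate bits down to the per-byte
# write-enable lines of one 64-bit register (8 byte lanes). Each
# predicated-true element raises elwidth/8 of the byte lanes.
def byte_write_enables(pred_bits, elwidth_bits):
    """Return the 8-bit byte-enable mask for one 64-bit register."""
    bytes_per_elt = elwidth_bits // 8
    mask = 0
    for i, p in enumerate(pred_bits):
        if p:  # predicate true: enable that element's byte lanes
            mask |= ((1 << bytes_per_elt) - 1) << (i * bytes_per_elt)
    return mask

# 8x INT8 elements, alternating predicate -> alternating byte lanes
m8 = byte_write_enables([1, 0, 1, 0, 1, 0, 1, 0], 8)
# 2x INT32 elements, only the low element enabled -> low 4 byte lanes
m32 = byte_write_enables([1, 0], 32)
```

The point in the text is that no matter the element width, everything bottoms out in the same 8 byte-level write lines, which SRAM cell libraries already support.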
First off, thanks for the reply, it's clarified quite a few things for me.
Originally posted by lkcl
well, if you try to slam 64x FP64 operations into the engine then yes you're going to run out of registers. if however you try 64 INT8 operations those will get spread out across 8x SIMD ALUs taking 8 64-bit registers each, which is... tolerable.
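The arithmetic behind the quote above is simple enough to write down. This helper is illustrative only, not project code:

```python
# How many 64-bit registers a vector operation of VL elements at a
# given element width occupies: narrow elements pack into fewer regs.
def regs_needed(vl, elwidth_bits, regwidth_bits=64):
    elts_per_reg = regwidth_bits // elwidth_bits
    return -(-vl // elts_per_reg)   # ceiling division

fp64_regs = regs_needed(64, 64)   # 64x FP64: one element per register
int8_regs = regs_needed(64, 8)    # 64x INT8: eight elements per register
```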
Originally posted by lkcl
ah no. the issue engine is independent, the Reservation Stations are independent and their latches (called "Nameless Registers" in augmented-6600 terminology) act as buffers. as long as you still have RSes to reserve, the issue engine does not stall, and the RSes are *not* dependent on the Register File(s) for resource allocation. however the *moment* any given instruction cannot reserve a required RS, *then* you must stall.
couple of notes:
1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)
2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".
1. So several FUs might share a pipelined ALU, so long as each can track and buffer its results? But once an FU is issued an instruction, it has to track it through to commit?
I may just have to go read that book. Are Mitch's addendum chapters publicly available?
Originally posted by lkcl
that was for RISC-V. OpenPOWER ISA, everything is based around Condition Registers. so, i am advocating that we simply vectorise those (and increase their number to 64 or 128)
https://bugs.libre-soc.org/show_bug.cgi?id=213#c48
One really bad idea: ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.
One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think about it. If you were only ever doing 8x SIMD, maybe.
Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.
Originally posted by WorBlux
First off, thanks for the reply, it's clarified quite a few things for me.
So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways).
Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.
Rename because of the dual FU-FU and FU-Reg DMs.
If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.
B has a Read-after-Write hazard on A, C has a Write-after-Read hazard on B. yes absolutely, C can go ahead in parallel, create the result, and once the WaR hazard is dropped by B, the "hold" goes away.
C is then allowed to raise "Write_Request", C will (at some point) be notified "ok, RIGHT NOW, you must put data, RIGHT NOW, on this clock cycle, for one cycle only, the data you want writing to the regfile". this is the "GO_WRITE" signal, and following that GO_WRITE (the cycle after), C absolutely must drop its Write_Request (because it's done its write). that "drop" of the Write_Request also goes into the FU-FU and FU_Regs Dependency Matrices to say "i no longer have a dependency: i'm totally done, no longer busy, and therefore free to be issued another instruction".
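The Write_Request / GO_WRITE handshake just described can be modelled as a toy state machine. The signal names come from the text; the class structure and method names are my own illustration, not Libre-SOC code.

```python
# Toy model: a Computation Unit computes its result immediately, holds
# it in a latch while a WaR hazard is live, raises Write_Request, and
# on GO_WRITE puts the data on the bus for exactly one cycle, then
# drops the request and its busy flag so it can be re-issued.
class CompUnit:
    def __init__(self):
        self.busy = False
        self.write_request = False
        self.war_hold = False
        self.result = None

    def issue(self, a, b):
        self.busy = True
        self.result = a + b        # compute in parallel, hold in latch
        self.war_hold = True       # old dest value still has readers

    def drop_war_hazard(self):     # reader (B) has pulled its read
        self.war_hold = False
        self.write_request = True  # now allowed to ask for the regfile

    def go_write(self, regfile, dest):
        assert self.write_request and not self.war_hold
        regfile[dest] = self.result   # data valid for this cycle only
        self.write_request = False    # drop request: totally done
        self.busy = False             # free to take a new instruction

regs = {3: 111}          # old r3 value, still being read by RS B
c = CompUnit()
c.issue(2, 5)            # C computes 7 in parallel, result held
c.drop_war_hazard()      # WaR hazard gone: raise Write_Request
c.go_write(regs, 3)      # GO_WRITE: r3 updated, C goes idle
```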
it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it.
1. So several FUs might share a pipelined ALU,
so long as each can track and buffer its results? But once an FU is issued an instruction, it has to track it through to commit?
the only thing that's slightly odd in the Concurrent Computation Unit case is: the FU is *not* the pipeline, it's the RS connected *to* the Pipeline. or, put another way, 4x RSes connected to a shared (mutexed) pipeline is actually *FOUR* separate Function Units.
it got very painful, explaining on comp.arch that it was important to computer science that there be different terminology to distinguish "hazard-aware Function Unit" from "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".
eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".
Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".
would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?
I may just have to go read that book. Are Mitch's addendum chapters publicly available?
I'm looking at this CR thing for a while now, digging into that bug report, and the Power ISA specification, and not really getting any great ideas.
One really bad idea: ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.
One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think about it. If you were only ever doing 8x SIMD, maybe.
btw there are no bad ideas at this stage.
Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.
Last edited by lkcl; 18 October 2020, 01:12 AM.
Originally posted by lkcl
well, what i'm hoping is that the significant work being done on llvm for RISC-V, ARM SVE/2, and other companies with variable-length VL, will hit mainline well in advance, such that all we need to do is a minimalist amount of porting work to add SV.
Originally posted by lkcl
the FU-Regs matrix covers the information about which registers a given FU needs to read or write, whilst the FU-FU matrix preserves the *result* ordering dependency. interestingly, FU-FU preserves a DAG (Directed Acyclic Graph)
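The two-matrix split quoted above can be sketched minimally. The data layout and method names here are my own illustration; only the roles of the matrices come from the text.

```python
# Minimal sketch: FU-Regs records which registers each Function Unit
# reads/writes; FU-FU records "FU i must wait for FU j's result".
# Because waits are only ever created against earlier-issued FUs that
# have not yet written, the FU-FU matrix describes a DAG.
class Scoreboard:
    def __init__(self, n_fus, n_regs):
        self.reads  = [[0] * n_regs for _ in range(n_fus)]  # FU-Regs
        self.writes = [[0] * n_regs for _ in range(n_fus)]  # FU-Regs
        self.waits  = [[0] * n_fus for _ in range(n_fus)]   # FU-FU

    def issue(self, fu, rd_regs, wr_regs):
        for r in rd_regs:
            self.reads[fu][r] = 1
            for other in range(len(self.waits)):
                # RaW: we read a register another FU has yet to write,
                # so our result ordering depends on that FU
                if other != fu and self.writes[other][r]:
                    self.waits[fu][other] = 1
        for r in wr_regs:
            self.writes[fu][r] = 1

sb = Scoreboard(n_fus=4, n_regs=8)
sb.issue(0, rd_regs=[1, 2], wr_regs=[3])   # FU0: r3 <- r1 op r2
sb.issue(1, rd_regs=[3], wr_regs=[4])      # FU1 reads FU0's result
```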
Originally posted by lkcl
it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it.
....
it got very painful, explaining on comp.arch that it was important to computer science that there be different terminology to distinguish "hazard-aware Function Unit" from "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".
eeeeventually after a week of quite painful discussion the term "Phase-aware Function Unit" occurred to me, as a way to distinguish this from "when people start treating pipelines as synonymous with FUs".
Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".
would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?
Originally posted by lkcl
tricky, isn't it? now you see why it took 18 months to design SV (and implement the simulator).
Originally posted by WorBlux
Ya it's such a huge advance in API, I'm surprised it took 4 iterations of mainstream SIMD before chip designers said "ya, that's a problem, I admit"
btw you may be intrigued to know that a number of people working for Northrop Grumman, and others who used Cray supercomputers, were significant contributors to RVV.
I think DAG translates to Dataflow diagram, for those who are less mathematically astute.
I can believe it. After the RISC revolution it seems everyone ran off in the same direction, with most going down the CPU lane and a few going VLIW.
SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.
250.
the RVV equivalent is *14*
A lot of iterative improvements, like 4% better branch predictors or an iterative improvement on LRU cache eviction. But fundamental design choices seemed to be stamped in steel (with million-dollar+ lithography masks, anyways). If we can credit OpenRISC/RISC-V with anything, it's reviving interest and tool-chains in the open commons.
Ya certainly. I'm also seeing why prior designs pushed SIMD and let the compiler deal with all the edge cases.
dead simple, right?
wark-wark
Especially if you're on a tight timeline trying to hit that next process node first. Not being on a tight schedule or tied to a particular node gives you the time and creative space to get it right. Also, little wonder why you, Mitch, and Ivan Godard all end up in the same places.
Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all types of operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.
but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, tag the register from that point onwards, and are carried right the way to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.
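The "polymorphic width" idea described above can be illustrated with a toy model. The data representation here is entirely my own invention, purely to show the concept of one untyped ADD whose behaviour comes from tags attached at LOAD:

```python
# Toy model: LOAD attaches a width tag to the value; a single ADD
# derives its wrap-around behaviour from the tags, not from the opcode,
# and the tag is carried forward toward the eventual STORE.
def load(raw, width_bits, signed=True):
    return {"value": raw, "width": width_bits, "signed": signed}

def add(a, b):
    assert a["width"] == b["width"]          # tags must agree
    mask = (1 << a["width"]) - 1
    s = (a["value"] + b["value"]) & mask     # wrap at the tagged width
    return {"value": s, "width": a["width"], "signed": a["signed"]}

x = load(250, 8)
y = load(10, 8)
z = add(x, y)    # one ADD opcode; 8-bit wrap comes from the loads
```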
me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD
Originally posted by lkcl
SIMD just... gaah. i think early in this thread i posted the "SIMD considered harmful" article, but that really doesn't sink in, as it's an academic exercise. where it really hits home is when you count the number of VSX handcoded assembly instructions in a recent glibc6 patch to POWER9.
250.
the RVV equivalent is *14*
Originally posted by lkcl
Mitch does all his designs at the gate level. he studied the 6600 very early on, recognised its genius, and based the Motorola 68000 on it. his expertise developed from there.
Ivan's team took a radically different approach in the Mill, where you actually do static scheduling by the compiler onto a "conveyor belt". all types of operations have known (fixed) completion times and so the compiler has all the information it needs. this does mean that the compiler *specifically* has to target a particular architecture. no two different Mill archs are binary compatible.
but... the Mill ISA? woow. there is no FP32 ADD or INT16 ADD, there is just... ADD. the size and type come from the LOAD operation, tag the register from that point onwards, and are carried right the way to STORE. ultra, ultra efficient and beautifully simple ISA, hardly any opcodes at all. terminology: polymorphic widths and operations. i wish i could use that but the problem is it is such a deviation from PowerISA it will be hard to get it in.
Of course they are changing and refining all the time. But it is a very CISC instruction set. Not only is each member binary-incompatible, every FU slot has a different binary encoding (and set of supported ops), which is one of the ways they get away with such a wide issue.
This, combined with the conceptual conveyor belt, means nobody except the crypto guys is going to be writing anything in raw assembler. I think their plan for hardware initialization is to include a Forth interpreter in the ROM. Not unheard of, but a very different approach from the current mainstream.
Originally posted by lkcl
me i just don't waste time reinventing things that are already beautiful and elegant, or tolerate things that are not. like SIMD