Originally posted by WorBlux
Whereas a hybrid CPU, being effectively "a standard SMP arrangement with extra opcode bells and whistles", would need a Linux kernel OS context-switch.
And I've tried to find more on your implementation of Simple-V, but can't quite find what exactly is going on.
However, if you're striding across very large vectors you can't keep it all in cache,
and I suspect you may even have a hard time streaming from memory fast enough.
(edit: well... you can... but the power-consumption penalty would terminate any possibility of a commercially-viable processor. Logically, therefore, you don't do that!)
You say a vector instruction will essentially stop instruction decode and other execution until the vector op is complete.
It therefore shoves *elements* into the multi-issue execution engine.
Now, if the VL is e.g. 4 and there is room for e.g. 8-wide multi-issue, then instruction decode does *not* stop with that first vector instruction: it goes, "hmm, if I decode the next instruction as well I can shove an extra 4 elements into the 8-wide multi-issue",
and at *that* point it will go "ok, I can't do any more in this cycle".
But because all the Computation Units are pipelined (except DIV), on the next cycle, guess what? Instruction decode gets 8 more free issue slots, and off we go again.
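To make that concrete, here is a minimal Python sketch of the decode loop just described, assuming a hypothetical 8-wide issue window and VL=4 (the names and the queue model are illustrative only, not actual Libre-SOC internals):

```python
# illustrative model only: decode keeps pulling vector instructions
# while all VL of their elements still fit in this cycle's issue slots.
ISSUE_WIDTH = 8  # assumed multi-issue width

def issue_cycle(insn_queue, vl):
    """Return the (instruction, element) pairs issued this cycle."""
    slots = []
    while insn_queue and len(slots) + vl <= ISSUE_WIDTH:
        insn = insn_queue.pop(0)
        slots.extend((insn, el) for el in range(vl))  # VL element ops
    return slots

queue = ["vadd", "vmul", "vsub"]
# cycle 1: two VL=4 instructions fill all 8 slots (vadd and vmul together)
print(issue_cycle(queue, vl=4))
# cycle 2: the pipelines have freed 8 slots, so vsub issues next
print(issue_cycle(queue, vl=4))
```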
But it seems like at that point you are committed, and if you're on a memory stall, there's no painless early-out or swap/resume built in. Presumably you have to break it up internally into register-sized load/stores, but it's not clear if these can commit/pause/resume independently.
To cope with the kind of memory load anticipated, I had to spend several months with Mitch Alsup on comp.arch last year to get enough of an understanding of how to do it.
To cover memory latency I'd expect a lot of loads in flight and a lot of places to put them.
I do see you have a proposal to bank/divide vector registers, and that's maybe closer to what I'm thinking: assigning a bank to a specific op. Then when you hit a stall you can switch to a vector op going on a different bank; if an op is active on it, try to continue it, or if the bank is empty, look at the scoreboard and try to find another vector op.
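A rough Python sketch of that bank-switching idea, under assumed names (Bank, VectorOp, the stalled flag, and the ready-op list are all hypothetical, not from the Simple-V spec):

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class VectorOp:
    name: str
    stalled: bool = False  # e.g. waiting on a memory load

@dataclass
class Bank:
    active_op: VectorOp | None = None  # at most one in-flight op per bank

def pick_next_op(banks, ready_ops):
    """Each cycle: continue an unstalled in-flight op if one exists,
    otherwise start a ready vector op on an idle bank."""
    for bank in banks:
        if bank.active_op and not bank.active_op.stalled:
            return bank.active_op            # keep making progress
    for bank in banks:
        if bank.active_op is None and ready_ops:
            bank.active_op = ready_ops.pop(0)
            return bank.active_op            # issue onto an idle bank
    return None                              # everything stalled: bubble

banks = [Bank(VectorOp("vld", stalled=True)), Bank()]
print(pick_next_op(banks, [VectorOp("vadd")]).name)  # -> vadd
```

The lane example below shows why that bank alignment matters: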
Vector A fits into R0 R1 R2 R3
Vector B fits into R4 R5 R6 R7
result C is to go into R8 R9 R10 R11
The data paths between R0, R4, R8, R12, R16 (etc.) are immediate and direct; likewise between R1, R5, R9, etc.
Therefore a read or write takes 1 clock cycle, and there are 4 such "paths" between regfiles, so all *four* sets of vector element ops (R8=R4+R0, R9=R5+R1, and so on) do not interfere with each other.
However, let us say that you make the "mistake" of doing this:
Vector A fits into R0 R1 R2 R3
Vector B fits into R4 R5 R6 R7
result C is to go into R9 R10 R11 R12
Now, although the reads (A, B) work fine, the result of R0+R4, needing to go into R9, is in the *wrong lane* and must be dropped into the "cyclic buffer": it will be a *three*-cycle latency before it gets written to the regfile.
Otherwise we would have to have a full crossbar (12- or 16-way READ and 8- or 10-way WRITE), and that's just completely insane.
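As a quick sanity-check of those numbers, here is a hypothetical model of the 4-lane regfile above (lane = reg % 4, with the 1-cycle direct path and the 3-cycle cyclic-buffer detour taken from the example, not from any spec):

```python
NUM_LANES = 4  # R0/R4/R8/... share lane 0, R1/R5/R9/... lane 1, etc.

def lane(reg):
    return reg % NUM_LANES

def result_write_latency(src_a, src_b, dest):
    """1 cycle if the result stays in its own lane; otherwise it is
    dropped into the cyclic buffer: 3 cycles before the regfile write."""
    if lane(dest) == lane(src_a) == lane(src_b):
        return 1
    return 3

# aligned:   A=R0..R3, B=R4..R7, C=R8..R11 -> every element in-lane
print([result_write_latency(a, a + 4, a + 8) for a in range(4)])  # [1, 1, 1, 1]
# misaligned: C=R9..R12 -> R0+R4 must land in R9, the wrong lane
print([result_write_latency(a, a + 4, a + 9) for a in range(4)])  # [3, 3, 3, 3]
```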
I guess the TL;DR question would be: is there a reason the decoder can't issue one of the vector zero-overhead loops alongside subsequent instructions, potentially even out of multiple vector loops at once?
SMT would be a way to do this within the core, but it is heavy and involves OS support for swapping. Maybe some way to spawn asynchronous threadlets?
Also, most of the discussion of Simple-V centers on RISC-V and not on POWER, so it's hard to tell what's essential to the idea and what came about simply for better RISC-V integration.