Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • #81
    Originally posted by WorBlux View Post

    SIMT would be quite a beast, but isn't exactly the GPU feature I'm talking about. For example, a single nVidia Ampere SM has a 256k register file that can be divided among up to 32 thread blocks. (SIMT being found as an optimization within the thread block) But because the register files are statically allocated to thread blocks, the SM's internal scheduler can quickly flip between thread blocks to cover memory latency/stalls.
    interesting. this sounds very much like an optimised barrel processor: i say "optimised", where barrel processors are normally fair-scheduling, you're talking about instant swapping between regfiles.

    whereas, a hybrid CPU, being effectively "a standard SMP arrangement with extra opcode bells and whistles" would need a linux kernel OS context-switch.

    And I've tried to find more on your implementation of simple-V, but can't quite find what exactly is going on.
    the summary takes 30 seconds. a full debrief takes 7 hours.

    However if you're striding across very large vectors you can't keep it all in cache,
    this is why we're extending the PowerISA regfile to 128 FP and 128 INT regs.

    and I suspect you may even have a hard time streaming from memory fast enough.
    Jeff Bush's Nyuzi paper, nyuzipass2015, already made this abundantly clear, hence why 128 FP and 128 INT regs. you absolutely cannot have the LOAD-processing-STORE loop interrupted by register spill.

    (edit: well.. you can... but the power consumption penalty would terminate all and any possibility of having a commercially-viable processor. logically therefore, you don't do that!)

    You say a vector instruction will essentially stop instruction decode and other execution until the vector op is complete,
    ah no, not quite. the vector instruction is basically not really a vector at all, it's a "for-loop from 0 to VL-1 whilst the PC is not advanced i.e. it's a bit like a SUB-PC". conceptually it sits in between instruction decode and instruction issue.

    it therefore shoves *elements* into the multi-issue execution engine.

    now, if the VL is e.g. 4 and there is room for e.g. 8-wide multi-issue, then the instruction decode does *not* stop with that first vector instruction, it goes, "hmm if i decode the next instruction as well i can shove an extra 4 elements into the 8-wide multi-issue"

    and at *that* point it will go "ok i can't do any more in this cycle"

    but because all the Computation Units are pipelines (except DIV) then on the next cycle guess what? next instruction decode gets 8 more free issue slots, and off we go again.
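the "for-loop in between decode and issue" idea above can be sketched in a few lines of Python. this is a hypothetical illustration only (the function and tuple shapes are invented for this sketch, not Libre-SOC code), showing how a vector op with VL=4 expands into elements that fill an 8-wide issue window, leaving room for the next decoded instruction:

```python
# Hypothetical sketch of SV's "SUB-PC" element loop: a vector instruction
# is decoded once, then expanded into VL scalar *elements* which are fed
# into a multi-issue window until this cycle's slots run out.
# (Simplified: real hardware can also partially issue a vector op.)

def issue_cycle(queue, vl, issue_width=8):
    """Fill up to issue_width slots from decoded instructions.

    queue: list of (opname, is_vector) tuples; a vector op expands
    into vl element-operations, a scalar op into one.
    Returns (element ops issued this cycle, leftover queue).
    """
    issued = []
    remaining = []
    slots = issue_width
    for i, (op, is_vector) in enumerate(queue):
        n = vl if is_vector else 1
        if n <= slots:
            # shove the elements into the multi-issue window
            issued.extend((op, el) for el in range(n))
            slots -= n
        else:
            # "ok i can't do any more in this cycle"
            remaining = queue[i:]
            break
    return issued, remaining

# VL=4, 8-wide issue: two vector adds fit in one cycle, the third waits.
ops = [("vadd1", True), ("vadd2", True), ("vadd3", True)]
issued, left = issue_cycle(ops, vl=4)
```

on the next cycle the pipelined Computation Units free another 8 slots, so `left` gets the same treatment, exactly as the reply describes.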


    but it seems like at that point you are committed, and if you're on a memory stall, there's no painless early-out or swap/resume built in. Presumably you have to break it up internally into register-sized load/stores, but it's not clear if these can commit/pause/resume independently.
    memory load-store is basically exactly as it would be for a multi-issue superscalar out-of-order load-store, but most first-time processor architects wouldn't dream of creating a 6 to 8 multi-issue load-store microarchitecture. even BOOM has only just recently added 2 simultaneous load-stores.

    to cope with the kind of memory load anticipated, i had to spend several months with Mitch Alsup on comp.arch last year, to get enough of an understanding of how to do it.

    To cover memory latency I'd expect a lot of loads in flight and a lot of places to put it.
    yes. and a minimum 256 bit L2 cache data path, plus 4 "striped" L1 caches. absolutely mental. *nobody* in the open hardware has tried designing something like this as a first processor! everyone does like 32-bit L1 caches or 64-bit, maybe.

    I do see you have a proposal to bank/divide vector registers and that's maybe closer to what I'm thinking, assigning a bank to a specific op. Then when you hit a stall, you can switch to a vector op going on a different bank, and if an op is active on it try to continue it, or if the bank is empty, look at the scoreboard and try to find another vector op.
    not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:

    Vector A fits into R0 R1 R2 R3
    Vector B fits into R4 R5 R6 R7
    result C is to go into R8 R9 R10 R11

    the data paths between R0, R4, R8, R12, R16 (etc) are immediate and direct. likewise between R1, R5, R9, .... etc.

    therefore this takes 1 clock cycle to read or write, and there are 4 such "paths" between regfiles, so all *four* sets of vector ops (R8=R4+R0, R9=R5+R1) all do not interfere with each other.

    however let us say that you make the "mistake" of doing this:

    Vector A fits into R0 R1 R2 R3
    Vector B fits into R4 R5 R6 R7
    result C is to go into R9 R10 R11 R12

    now although the reads (A, B) work fine, the result R0+R4, needing to go into R9, it is in the *wrong lane* and must be dropped into the "cyclic buffer". it will be a *three* cycle latency before it gets written to the regfile.

    otherwise we have to have a full crossbar (12 or 16 way READ and 8 or 10 WRITE) and that's just completely insane.
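the lane rule above is easy to state in code. a minimal sketch (function names are hypothetical, invented for illustration): lane number is simply the register number modulo 4, a write is 1-cycle "direct" only when sources and destination share a lane, otherwise the result detours through the cyclic buffer:

```python
# Hypothetical sketch of the 4-way "striped" regfile described above:
# R0,R4,R8,... share lane 0; R1,R5,R9,... share lane 1; and so on.

def lane(reg):
    """Which of the 4 stripes a register belongs to."""
    return reg % 4

def write_latency(src_a, src_b, dest):
    """1 cycle for an in-lane (direct-path) write; 3 cycles when the
    result lands in the wrong lane and must go via the cyclic buffer."""
    if lane(src_a) == lane(src_b) == lane(dest):
        return 1
    return 3

# aligned:    R8 = R0 + R4  -> all lane 0, direct path, 1 cycle
# misaligned: R9 = R0 + R4  -> result in lane 1, cyclic buffer, 3 cycles
```

this is why the C = R9..R12 layout in the second example pays a 3-cycle penalty while R8..R11 does not.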



    I guess the TL;DR question would be: is there a reason the decoder can't issue one of the vector zero-overhead loops alongside subsequent instructions, potentially even out of multiple vector-loops at once?
    they're not quite ZOHLs (zero-overhead loops) but yes if you cognitively disconnect "decode" from "issue" then consider SV to be "a compressed version of decode", we can still have multi-issue decode and multi-issue execution.

    SMT would be a way to do this within the core but is heavy and involves OS support for swapping. Maybe some way to spawn asynchronous threadlets?
    yyeah there is so much to get done before considering doing that, although hilariously we considered overloading "branch" as a way to "start threads".

    Also most of the discussion of simple-V centers on RISC-V, and not on POWER so it's hard to tell what's essential to the idea and what came about simply for better RISC-V integration.
    we have to do a full from-scratch redesign, in particular taking into account Condition Registers in PowerISA. sigh. https://bugs.libre-soc.org/show_bug.cgi?id=213
    Last edited by lkcl; 09 October 2020, 05:23 PM.

    Comment


    • #82
      Originally posted by xfcemint View Post

      I absolutely disagree with what you are implying here.

      - You don't know and it is hard to predict the use-cases for this kind of SoC. Historically, we know: faster is better. We shouldn't settle on 25 GFLOPS; that's way too low, that is way outdated.
      at the time i wrote that, i had a potential customer wanting a smartwatch processor. when i told him that the processor was likely to be able to do 25 GFLOPs and was 4-core he slightly freaked out because the power consumption had to be *under* 2.0 watts. 25 GFLOPs was *way* too much.



      - Instead of focusing on a single number, like 25 GFLOPS, I propose designing a solution that can cover a wide range of performance requirements.
      that's exactly what we've done.
      Last edited by lkcl; 09 October 2020, 06:44 PM.

      Comment


      • #83
        Originally posted by xfcemint View Post

        I like it. I was thinking of something similar. Dividing the register file into 'stripes' seems fine.

        What I don't get: what's the point of having 128 + 128 registers in a GPU core? Isn't that way too much?
        not in the slightest bit. look up the stats on Broadcom VideoCore IV, MALI 400, Vivante and AMDGPU. they're all in the 128 to 256 register range.

        And, of course, I assume you are already planning on cutting out the entire FP64,
        no, of course not, because (just as you say in the next sentence) we have to be OpenPOWER ISA compliant

        full-precision sin, cos, and other useless stuff from the GPU cores.
        we have two sets to balance against: Vulkan compliance, and OpenPOWER compliance. interestingly neither of those specify the level of performance. it's the customers that determine, with their buying decisions, whether you have to hit certain power-performance targets.

        Or, if you must keep it to remain POWER-ISA compatible, then those operations would just be emulated/transcoded/microcoded.
        we can't go ridiculously low performance: if it sucks, nobody buys. however yes, FP64 will not get "ultimate highest priority" here. we'll do maybe one 64-bit DIV FSM but provide far more (parallel) 32-bit DIV pipelines / FSMs for example.

        So, I would guess that most calculations would be executed as 4 x FP16 or 2xFP32 on 64-bit registers. That's where the horsepower should be concentrated, I think.
        yyep, you're getting it

        Comment


        • #84
          Originally posted by xfcemint View Post

          4) Complicates the design significantly. In particular, the switch-over operation may have a lot of dependencies.
          yes. i mean, there's a reason why "scalar" (normal, SMP) CPUs have scratch registers for context-switching (MIPS, RISC-V in particular) it's to get fast context-switches. where you start to include bank-switching of full register sets, including SPRs, it starts to make me twitchy about implementing something like that in a hybrid context. pure (dedicated) GPU, no problem.

          Comment


          • #85
            Originally posted by lkcl View Post

            interesting. this sounds very much like an optimised barrel processor: i say "optimised", where barrel processors are normally fair-scheduling, you're talking about instant swapping between regfiles.

            whereas, a hybrid CPU, being effectively "a standard SMP arrangement with extra opcode bells and whistles" would need a linux kernel OS context-switch.
            Very similar; most barrel processors have some way to accelerate the switch. The Sparc T3 was all about memory access in the context of a database. But I don't think GPUs even need that, probably just a couple architectural registers of an offset and bounds. No clue on the exact details of the scheduler and I suspect it's a guarded secret.

            Originally posted by lkcl View Post
            the summary takes 30 seconds. a full debrief takes 7 hours.

            this is why we're extending the PowerISA regfile to 128 FP and 128 INT regs.
            Your register and cache video helped explain some of it, but I still don't think I fully understand the 6600 overall. 7 hours seems optimistic just for that aspect. I do occasionally find myself on comp.arch sometimes just trying to glean interesting tidbits when I can.

            Originally posted by lkcl View Post

            Jeff Bush's Nyuzi paper, nyuzipass2015, already made this abundantly clear, hence why 128 FP and 128 INT regs. you absolutely cannot have the LOAD-processing-STORE loop interrupted by register spill.

            (edit: well.. you can... but the power consumption penalty would terminate all and any possibility of having a commercially-viable processor. logically therefore, you don't do that!)
            Indeed, multiple round trips make no sense at all.
            Originally posted by lkcl View Post
            ah no, not quite. the vector instruction is basically not really a vector at all, it's a "for-loop from 0 to VL-1 whilst the PC is not advanced i.e. it's a bit like a SUB-PC". conceptually it sits in between instruction decode and instruction issue.

            it therefore shoves *elements* into the multi-issue execution engine.

            now, if the VL is e.g. 4 and there is room for e.g. 8-wide multi-issue, then the instruction decode does *not* stop with that first vector instruction, it goes, "hmm if i decode the next instruction as well i can shove an extra 4 elements into the 8-wide multi-issue"

            and at *that* point it will go "ok i can't do any more in this cycle"

            but because all the Computation Units are pipelines (except DIV) then on the next cycle guess what? next instruction decode gets 8 more free issue slots, and off we go again.
            Poor DIV, always the black sheep of the family. But I can now see how it'd be easy to multiplex the issue.

            I think I'm starting to get it. The X.org presentation helped. So the decoder sees the Vector instruction, and knows the element width, and will pack a FU/registers SIMD style, and also continue into a neighboring FU. I suppose a Matrix takes care of unaligned/incomplete widths, but that might be harder on alternative implementations.

            And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width? Hence the proposal for the official extension.

            Anyways to be more specific, say you've entered into vector mode and the next instructions in the queue are
            load x to r13
            load Vx to r0...3
            load Vy to r4...7
            mult r13, r4...7 to r17...20
            add r4...7, r17...20 to r13...16
            store r13...16 to Vz

            And say Vy is in L1 cache, but Vx isn't. Seems like the load request for Vx could clog up your load/store units even though there's potential for a better schedule. Or can loads overtake other loads in the pipeline?



            Originally posted by lkcl View Post


            memory load-store is basically exactly as it would be for a multi-issue superscalar out-of-order load-store, but most first-time processor architects wouldn't dream of creating a 6 to 8 multi-issue load-store microarchitecture. even BOOM has only just recently added 2 simultaneous load-stores.

            to cope with the kind of memory load anticipated, i had to spend several months with Mitch Alsup on comp.arch last year, to get enough of an understanding of how to do it.
            Indeed, it does look like a very ambitious project, even more so once you drill into the details. It's a real shame RISC-V wasn't more accommodating.

            Originally posted by lkcl View Post

            yes. and a minimum 256 bit L2 cache data path, plus 4 "striped" L1 caches. absolutely mental. *nobody* in the open hardware has tried designing something like this as a first processor! everyone does like 32-bit L1 caches or 64-bit, maybe.
            I can see why you're using a python flavor to do it.
            Originally posted by lkcl View Post

            not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:

            Vector A fits into R0 R1 R2 R3
            Vector B fits into R4 R5 R6 R7
            result C is to go into R8 R9 R10 R11

            the data paths between R0, R4, R8, R12, R16 (etc) are immediate and direct. likewise between R1, R5, R9, .... etc.

            therefore this takes 1 clock cycle to read or write, and there are 4 such "paths" between regfiles, so all *four* sets of vector ops (R8=R4+R0, R9=R5+R1) all do not interfere with each other.

            however let us say that you make the "mistake" of doing this:

            Vector A fits into R0 R1 R2 R3
            Vector B fits into R4 R5 R6 R7
            result C is to go into R9 R10 R11 R12

            now although the reads (A, B) work fine, the result R0+R4, needing to go into R9, it is in the *wrong lane* and must be dropped into the "cyclic buffer". it will be a *three* cycle latency before it gets written to the regfile.

            otherwise we have to have a full crossbar (12 or 16 way READ and 8 or 10 WRITE) and that's just completely insane.
            Does that mean the entire pipeline has to stall while waiting on the buffer? If so... ouch.

            Originally posted by lkcl View Post

            they're not quite ZOHLs (zero-overhead loops) but yes if you cognitively disconnect "decode" from "issue" then consider SV to be "a compressed version of decode", we can still have multi-issue decode and multi-issue execution.


            yyeah there is so much to get done before considering doing that, although hilariously we considered overloading "branch" as a way to "start threads".



            we have to do a full from-scratch redesign, in particular taking into account Condition Registers in PowerISA. sigh. https://bugs.libre-soc.org/show_bug.cgi?id=213
            Overloading the branches for predicates does seem pretty clever though.

            Comment


            • #86
              Originally posted by lkcl View Post

              yes. i mean, there's a reason why "scalar" (normal, SMP) CPUs have scratch registers for context-switching (MIPS, RISC-V in particular) it's to get fast context-switches. where you start to include bank-switching of full register sets, including SPRs, it starts to make me twitchy about implementing something like that in a hybrid context. pure (dedicated) GPU, no problem.
              Understandable, I've got some ideas, but keep getting stuck on details. Maybe normal prediction and prefetch will be plenty good in practice. And maybe not stuffing all the load units full from a single load-vector source instruction.

              Comment


              • #87
                Originally posted by WorBlux View Post
                Your register and cache video helped explain some of it, but still don't think I fully understand the 6600 overall. 7 Hours seems optimistic just for that aspect.
                ah yes: the 6600 and its precise-exception augmentations took me 5 months to understand. SimpleV's specification details, which are ISA-independent (nothing to do with the 6600), "only" took 7 hours.

                I do occasionally find myself on comp.arch sometimes just trying to glean interesting tidbits when I can.
                it's pretty high traffic and people love deviating

                Poor DIV, always the black sheep of the family. But I can now see how it'd be easy to multiplex the issue.
                jacob came up with a "combined" algorithm that covers DIV, SQRT and R-SQRT in the same unit(s). this gives us something like a 50% increase in silicon area for a *combined* unit but then a 2/3 reduction in the *number* of such units required.

                I think I'm starting to get it. The X.org presentation helped. So the decoder sees the Vector instruction, and knows the element width, and will pack a FU/registers SIMD style,
                yes

                and also continue into a neighboring FU.
                yes, by *automatically* "masking out" the elements that don't fit that particular back-end SIMD unit, so that the programmer *does NOT* have to get into SIMD setup/teardown Hell

                I suppose a Matrix takes care of unaligned/incomplete widths,
                not quite: the masking takes care of it. as far as the actual ALU is concerned it doesn't care if it's been told to do 1x64 op, 2x32 ops, 4x16 ops or 8x8 ops (masked or unmasked).
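the masking can be sketched in a few lines (a hypothetical illustration, not Libre-SOC code): split VL elements across back-end SIMD operations, and for the final, partial operation simply enable only the elements that exist, so the ALU always sees a full-width op:

```python
# Hypothetical sketch: VL elements spread across fixed-width SIMD ALU ops,
# with the tail-end elements *automatically* masked out, so the programmer
# never writes SIMD setup/teardown code.

def simd_ops_with_masks(vl, simd_width):
    """Return one element-enable bitmask per back-end SIMD operation."""
    masks = []
    done = 0
    while done < vl:
        n = min(simd_width, vl - done)   # elements in this SIMD op
        masks.append((1 << n) - 1)       # low n element lanes enabled
        done += n
    return masks

# VL=10 on 4-wide SIMD units: two full ops, one op with only 2 lanes live.
# simd_ops_with_masks(10, 4) -> [0b1111, 0b1111, 0b0011]
```

the ALU itself doesn't care whether it was told 1x64, 2x32, 4x16 or 8x8; the mask is what keeps the dead lanes from writing.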

                but that might be harder on alternative implementations.
                you're telling me.

                And if I'm not wrong, at the end of the day, the limit of VL is based on the number of architectural registers available and their width?
                well, if you try to slam 64x FP64 operations into the engine then yes you're going to run out of registers. if however you try 64 INT8 operations those will get spread out across 8x SIMD ALUs taking 8 64-bit registers each, which is... tolerable.
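the register-pressure arithmetic here is worth making explicit. a one-liner sketch (hypothetical helper, for illustration): the number of 64-bit registers a vector occupies is VL times the element width, divided by 64 and rounded up:

```python
import math

# Hypothetical sketch of SV register occupancy: elements are packed
# into 64-bit registers, so narrow elements consume far fewer regs.

def regs_needed(vl, elwidth_bits, reg_bits=64):
    """64-bit registers consumed by a vector of VL elements."""
    return math.ceil(vl * elwidth_bits / reg_bits)

# 64x FP64 -> 64 registers (half the 128-entry regfile: you run out fast)
# 64x INT8 -> 8 registers  (8 elements per 64-bit register: tolerable)
```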

                Hence the proposal for the official extension.
                ah the reason for the official extension proposal is because we simply cannot be the de-facto "hard fork" maintainers of u-boot, coreboot, linux kernel, gcc, llvm, debian distro, fedora distro (which Redhat will object to for Trademark reasons anyway), i mean the resources to do all that would be absolutely mental.

                this is the primary reason why we dropped RISC-V, because they failed, persistently and regularly, under their legal responsibilities under Trademark Law, to respond to reasonable in-good-faith requests to be included in the enhancement of the RISC-V ISA *without* completely compromising our business objectives.

                moving on

                Anyways to be more specific, say you've entered into vector mode and the next instructions in the queue are
                load x to r13
                load Vx to r0...3
                load Vy to r4...7
                mult r13, r4..7 to R17...20
                add r4...7, r17...20 to r13...r16
                store r13..16 to Vz

                And say Vy is in L1 cache, but Vx is isn't. Seems like load request for Vx could clog up you're load/store units even though theres potential tor a better scheduale. Or can loads overtake other loads in the pipeline?
                if we have enough LD/ST Reservation Stations (6-12 depending on required throughput), then yes. and as long as the memory locations are non-overlapping in the lower 12 bits, yes. that's a Mitch Alsup trick which saves hugely on address-compare XOR gates. by only comparing the bottom 12 bits of the address against all other addresses (bear in mind that's an O(N^2) algorithm so is one HELL of a lot of gates if you have say 8 or 12 LD/ST RSes) you may end up "overzealously" catching some addresses that *might* not overlap in their upper bits, but wherever the bottom 12 bits differ you have *definitely* found an opportunity for parallelism.

                i.e. the fallback is "these LD/STs are going to be done sequentially if we *can't* find opportunities for parallelism" rather than "assume everything's done in parallel and whoops we missed some, wark, data-corruption"
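the low-12-bit comparison is a one-liner, sketched here for illustration (the function name is invented): it is deliberately *conservative*, so false positives only lose parallelism, never correctness:

```python
# Hypothetical sketch of the Mitch Alsup partial-address-compare trick:
# only the bottom 12 bits of two LD/ST addresses are compared. If they
# match, the ops are (conservatively) serialised; if they differ, the
# full addresses *cannot* be equal, so the ops may safely run in parallel.

def may_overlap(addr_a, addr_b, mask_bits=12):
    """Conservative overlap test using only the low mask_bits bits."""
    mask = (1 << mask_bits) - 1
    return (addr_a & mask) == (addr_b & mask)

# 0x1000 vs 0x2000: low 12 bits both zero -> flagged as overlapping
#   (a false positive: we merely forgo parallelism, correctness is kept)
# 0x1000 vs 0x1008: low 12 bits differ -> definitely safe to parallelise
```

comparing 12 bits instead of 64 cuts the XOR-gate count of the O(N^2) all-pairs comparison by better than 5x.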

                Indeed, it does look like a very ambitious project, even more so once you drill into the details.
                ahh... yah

                It's a real shame RISC-V wasn't more accommodating.
                hey they did us a favour. who wants to create a processor where the people behind it are spiteful, vengeful, arrogant d***s?

                moving on...

                I can see why you're using a python flavor to do it.
                it would be absolute hell and require 5x the engineers to not do this with OO techniques. or... we could... but we'd need to treat VHDL / Verilog as a "machine code target" with auto-generators (written probably in python) that used templates in VHDL/Verilog and filled in the gaps (size of element width) etc. the maintainability and readability of such an effort would be hell (i've tried).

                best to just stick with a modern OO programming language entirely.


                "not quite: the plan is to "stripe" the register file so that vectors are optimal, and to provide a cyclic ring-buffer for scalar workloads that don't quite fit that. example:"

                Does that mean the entire pipeline has to stall while waiting on the buffer? If so... ouch.
                ah no. the issue engine is independent, the Reservation Stations are independent and their latches (called "Nameless Registers" in augmented-6600 terminology) act as buffers. as long as you still have RSes to reserve, the issue engine does not stall, and the RSes are *not* dependent on the Register File(s) for resource allocation. however the *moment* any given instruction cannot reserve a required RS, *then* you must stall.

                couple of notes:

                1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)

                2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".


                Overloading the branches for predicates does seem pretty clever though.
                that was for RISC-V. OpenPOWER ISA, everything is based around Condition Registers. so, i am advocating that we simply vectorise those (and increase their number to 64 or 128)



                Comment


                • #88
                  Originally posted by WorBlux View Post

                  Understandable, I've got some ideas, but keep getting stuck on details. Maybe normal prediction and prefetch will be plenty good in practice. And maybe not stuffing all the load units full from a single load-vector source instruction.


                  the nice thing about the predication is, it drops on top of the SIMD masks, and from there through to regfile byte-write-enables. no matter the element width, it's all good. it means that for a 64 bit operation, writing to the regfile we need to raise 8x byte-level write lines, but that's standard practice for SRAMs in L1 and L2 caches so cell library developers are going "yawn" at that (small) innovation.
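the mask-to-byte-enable expansion described above can be sketched directly (a hypothetical illustration, names invented): each live element of width N bytes raises N contiguous byte-write lines on the 8-byte-wide register port:

```python
# Hypothetical sketch: a per-element predicate mask is expanded into
# per-byte write-enable lines for one 64-bit (8-byte) register, the
# same byte-granular write lines that L1/L2 SRAM cells already provide.

def byte_write_enables(elmask, elwidth_bytes, reg_bytes=8):
    """Expand an element predicate mask to an 8-bit byte-enable value."""
    enables = 0
    for el in range(reg_bytes // elwidth_bytes):
        if (elmask >> el) & 1:
            # raise elwidth_bytes contiguous byte lanes for this element
            enables |= ((1 << elwidth_bytes) - 1) << (el * elwidth_bytes)
    return enables

# one 64-bit element, predicate on:   all 8 byte lanes raised (0xFF)
# 4x 16-bit elements, mask 0b0101:    byte lanes 0-1 and 4-5 raised (0x33)
```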

                  Comment


                  • #89
                    First off, thanks for the reply, it's clarified quite a few things for me.

                    Originally posted by lkcl View Post
                    well, if you try to slam 64x FP64 operations into the engine then yes you're going to run out of registers. if however you try 64 INT8 operations those will get spread out across 8x SIMD ALUs taking 8 64-bit registers each, which is... tolerable.
                    So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways). Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.

                    Originally posted by lkcl View Post
                    ah no. the issue engine is independent, the Reservation Stations are independent and their latches (called "Nameless Registers" in augmented-6600 terminology) act as buffers. as long as you still have RSes to reserve, the issue engine does not stall, and the RSes are *not* dependent on the Register File(s) for resource allocation. however the *moment* any given instruction cannot reserve a required RS, *then* you must stall.

                    couple of notes:

                    1) 6600 is not a pipelined architecture: it's a parallel-processing architecture where the Computation Units (ALUs) can be pipelines or FSMs or bits of wet string for all it cares. therefore, if the Function Units can't get a word in to read/write from the Regfiles, such that their stuff hangs around in the Reservation Stations, *then* you get a stall (because no free RSes). so that increased latency (because of the cyclic buffer between RSes and Regfiles) means that you may have to increase the number of RSes to compensate (that's if you care about the non-vector path... which we don't)

                    2) Thornton and Cray were so hyper-intelligent and it was so early that they solved problems that they didn't know existed (or would become "problems" for other architects). consequently they didn't even notice that the RS "latches" were a form of "Register Renaming" and it's only an extensive retrospective analysis and comparison against the Tomasulo Algorithm that i even noticed that the RS latches are directly equivalent to "Register renaming". even Patterson, one of the world's leading academics, completely failed to notice this, angering and annoying the s*** out of Mitch Alsup enough for Mitch to write two supplementary chapters to Thornton's book, "Design of a Computer".
                    Rename because of the dual FU-FU and FU-Reg DMs. If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.

                    1. So several FUs might share a pipelined ALU, so long as it can track and buffer the results. But once an FU is issued an instruction it has to track it to commit?

                    I may just have to go read that book. Are Mitch's chapters of addendum publicly available?


                    Originally posted by lkcl View Post

                    that was for RISC-V. OpenPOWER ISA, everything is based around Condition Registers. so, i am advocating that we simply vectorise those (and increase their number to 64 or 128)


                    I'm looking at this CR thing for a while now, digging into that bug report, and the Power ISA specification, and not really getting any great ideas.

                    One really bad idea - Ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus it doesn't really help with GT/LT/EQ.

                    One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think of it. If you were only ever doing 8x SIMD, maybe.

                    Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.



                    Comment


                    • #90
                      Originally posted by WorBlux View Post
                      First off, thanks for the reply, it's clarified quite a few things for me.



                      So the API supports it, but in practice the compiler (mesa) will tune VL sizes to the hardware (but I guess this is what mesa does for everything Vulkan/OpenGL/OpenCL anyways).
                      pretty much, yeah. i mean, the compiler will know the register allocation / usage, and normally would shove out a batch of SIMD instructions (4x 4-wide SIMD to do 16 operations), whereas with SV it would issue *one* scalar operation with VL=16, *knowing* that this means that 16 registers will be needed.

                      Even for native binary code, it seems more viable to include code paths for different 2^n bit optimal chunks, rather than trying to deal with every SIMD opcode/intrinsic under the sun. GCC could easily future-proof code by including optimized paths for yet unseen sizes, but you'd never be able to do that with unreleased SIMD extensions.
                      well, what i'm hoping is that the significant work being done on llvm for RISC-V, ARM SVE/2, and other companies with variable-length VL, will hit mainline well in advance, such that all we need to do is a minimalist amount of porting work to add SV.


                      Rename because of the dual FU-FU and FU-Regs DMs.
                      the FU-Regs matrix covers the information about which registers a given FU needs to read or write, whilst the FU-FU matrix preserves the *result* ordering dependency. interestingly, FU-FU preserves a DAG (Directed Acyclic Graph).

                      If RS A wants to write to r3, RS B wants to read r3 from RS A, and RS C also wants to write to r3, there's no reason RS C can't go ahead and do its operation and keep the result on its output latches while waiting for RS A to finish and RS B to pull its read. I hope I'm starting to get it.
                      pretty much

                      B has a Read-after-Write hazard on A, C has a Write-after-Read hazard on B. yes absolutely, C can go ahead in parallel, create the result, and once the WaR hazard is dropped by B, the "hold" goes away,

                      C is then allowed to raise "Write_Request". C will (at some point) be notified: "ok, RIGHT NOW, you must put data, RIGHT NOW, on this clock cycle, for one cycle only, the data you want writing to the regfile". this is the "GO_WRITE" signal, and following that GO_WRITE (the cycle after), C absolutely must drop its Write_Request (because it's done its write). that "drop" of the Write_Request also goes into the FU-FU and FU-Regs Dependency Matrices to say "i no longer have a dependency: i'm totally done, no longer busy, and therefore free to be issued another instruction".
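that request/grant handshake can be modelled in a handful of lines (a toy sketch — class and signal names are my own, not from the Libre-SOC codebase): the FU raises write_request once its result sits on the output latch, the arbiter asserts go_write for exactly one cycle, and the FU then drops the request and signals the Dependency Matrices that it is free.

```python
class ToyFU:
    """Minimal model of one Function Unit's write-side handshake."""
    def __init__(self):
        self.busy = False
        self.write_request = False
        self.result = None

    def issue(self, result):
        self.busy = True
        self.result = result
        self.write_request = True   # result is ready on the output latch

    def clock(self, go_write):
        # go_write is asserted by the arbiter for ONE cycle only
        if self.write_request and go_write:
            data = self.result        # data driven to the regfile this cycle
            self.write_request = False  # request dropped after GO_WRITE...
            self.busy = False           # ...telling the DMs: free to re-issue
            return data
        return None

fu = ToyFU()
fu.issue(42)
assert fu.clock(go_write=False) is None and fu.busy   # still waiting
assert fu.clock(go_write=True) == 42                  # the one-cycle write
assert not fu.busy and not fu.write_request           # dependency cleared
```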

                      it's ultimately really quite compact and beautifully elegant, very little actual silicon, just hell to explain. even Thornton, in "Design of a Computer", specifically says this (around p.126 i think). i've found it usually takes several weeks for people to grasp the basics, and about 3 months to truly get it.

                      1. So several FUs might share a pipelined ALU,
                      yes. Mitch Alsup calls this "Concurrent Computation Units". basically if you have a 4-long pipeline, you have at least 4 "RSes" and you schedule one (and only one) of them to let it get data into the front of that pipeline, in each clock cycle.

                      so long as it can track and buffer the results. But once an FU is issued an instruction it has to track it to commit?
                      it's non-negotiably critical that they do so. failure to keep track of results is 100% guaranteed to result in data corruption.

                      the only thing that's slightly odd in the Concurrent Computation Unit case is: the FU is *not* the pipeline, it's the RS connected *to* the Pipeline. or, put another way, 4x RSes connected to a shared (mutexed) pipeline is actually *FOUR* separate Function Units.
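a rough sketch of that arrangement (all names hypothetical, a simplification of Mitch Alsup's scheme): several RSes share one 4-stage pipeline, at most one RS is granted the pipeline's front stage per clock cycle, and once the pipe fills, results emerge one per cycle.

```python
PIPELINE_DEPTH = 4

def run(ready_rses, cycles):
    """Advance the shared (mutexed) pipeline for `cycles` clock ticks."""
    pipe = [None] * PIPELINE_DEPTH   # the one shared pipeline
    pending = list(ready_rses)       # RSes whose operands are ready
    completed = []
    for _ in range(cycles):
        done = pipe.pop()            # a result leaves the final stage
        if done is not None:
            completed.append(done)   # its RS would now raise Write_Request
        # grant the front stage to exactly ONE waiting RS this cycle
        pipe.insert(0, pending.pop(0) if pending else None)
    return completed

# 4 RSes feeding the 4-deep pipe: the first result emerges once the pipe
# has filled, then one result completes per cycle thereafter
assert run(["RS0", "RS1", "RS2", "RS3"], 8) == ["RS0", "RS1", "RS2", "RS3"]
```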

                      it got so painful to explain on comp.arch that it became clear computer science needs distinct terminology to separate "hazard-aware Function Unit" from "pipeline". we had quite a lot of idiots absolutely categorically insist that "FU equals pipeline, shuttup you moron".

                      eeeeventually, after a week of quite painful discussion, the term "Phase-aware Function Unit" occurred to me, as a way to stop people treating pipelines as synonymous with FUs.

                      Phase-aware to mean "the FU is aware at all times and carries with it the responsibility for tracking and notifying of its operands and its result".

                      would you believe it, there is no modern industry-standard term for "Phase-aware Function Unit"?


                      I may just have to go read that book. Are Mitch's chapters of addendum publicly available?
                      yes, if you send me your email address (PM me) and agree that, if you share the files with anyone else, you must ask them to credit Mitch Alsup if they use any of the material, and to (recursively) place those same conditions on any follow-on recipients.


                      I'm looking at this CR thing for a while now, digging into that bug report, and the Power ISA specification, and not really getting any great ideas.


                      One really bad idea - Ignore the CR and add a byte of mask at the bottom of each GPR. But of course that would make register spill/save a nightmare. Plus doesn't really help with GT/LT/EQ.

                      One start to an idea was to expand the CR bit field into byte fields (plus mask). Also seems more terrible the more I think of it. If you were only ever doing 8x SIMD maybe.
                      tricky, isn't it? now you see why it took 18 months to design SV (and implement the simulator).

                      btw there are no bad ideas at this stage.

                      Oh Well, I know that's not helpful at all, but it was fun to think about. I'll be sure to follow your progress anyways.

                      Last edited by lkcl; 18 October 2020, 01:12 AM.

                      Comment
