Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • #41
    Originally posted by xfcemint View Post

    The simplest way to design a GPU is probably the CPU+vector ISA extensions path. So, from the perspective of simplifying the project, ISA extensions are not such a bad idea at all.
    we do need to be very careful. the reason for the hybrid architecture is to be able to cut out the "CPU-userspace-kernelspace-serialisation-PCIe-deserialisation-GPU-execution *and back again*" insanity.

    the Khronos Group is currently working on adding ray tracing to Vulkan, and when the presenter doing the XDC2020 talk said, "now you can call this API recursively", everyone on the IRC channel went "oink, did he really say that?"

    and the reason is because everyone there knows the full implications for the driver stack in a traditional split CPU-GPU architecture: they're going to have to create an RPC mechanism across that inter-processor bridge - one that can only be safely done if protected by the linux kernel - that can now do recursion for god's sake!

    think about that for a minute. the insanity of doing full serialisation-deserialisation of function call parameters from CPU userspace, jumping to Linux kernelspace and sending serialised function calls over to a GPU which unpacks them at the other end - just went recursive?? they're going to have to "mirror" the state of a stack! no wonder NVidia charges so much damn money for their GPUs!

    whereas with the hybrid architecture, we just... make the function call. ray-tracing is recursive? so what. it stays entirely in userspace. it's a *userspace* recursive function call and a *userspace* stack: it doesn't even go into a linux kernel context-switch because the Kazan Vulkan driver (and the MESA one) are *entirely in userspace*. the 3D GPU opcodes we're adding are called... *from userspace*. they're called from a shader binary that was compiled by the SPIR-V compiler inside the Vulkan driver... *but they're called from userspace*.
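
    to make that concrete, a trivial sketch (illustrative Python only, made-up numbers): in the hybrid model a recursive ray-trace really is just an ordinary userspace function call using the ordinary stack, nothing more.

    Code:
        # minimal sketch: recursion stays entirely in userspace.
        # in a split CPU-GPU design every one of these calls would have to be
        # marshalled through the kernel driver and across PCIe, and back again.
        def trace_ray(depth):
            if depth == 0:
                return 1.0                        # "hit the background"
            return 0.5 * trace_ray(depth - 1)     # reflect and recurse

        print(trace_ray(4))   # 0.0625 - four levels of recursion, no kernel involved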

    this cuts man-years off the development cycle, makes end-user application development simpler and easier to debug, and much more. and is literally an order of magnitude simpler to implement.

    however - all of the ISA extension additions are predicated on "approval" from the OpenPOWER Foundation, through the development of a "yes you can isolate these custom extensions behind an escape-sequence system" extension that *itself* has to be properly reviewed and ratified. absolutely nobody can simply drop a set of unauthorised custom modifications to the OpenPOWER ISA without also expecting to have an army of IBM lawyers drop a legal ton of bricks on their head.

    and that's why we're also going to the trouble of making sure that there is a justification for *other* OpenPOWER Foundation Members to use (and therefore support) the ISA extensions. adding IEEE754 sin, cos and atan2 to the scalar PowerISA can be viewed as useful in HPC environments, for example. so it's a long road ahead.



    • #42
      Originally posted by xfcemint View Post
      I think the GPU core should be different from the CPU core. They can be very similar, they can both be based on POWER ISA, but they should be different.
      ahh... ahh... i like it! i don't think anyone's suggested that before. i don't know why. it's a natural extension of the big.little idea.

      Originally posted by xfcemint View Post
      Then you can cut down on the number of transistors in the GPU core, make the core smaller, simpler and more efficient. Keep in only what is necessary for a GPU. Then you can keep the CPU core bigger, faster and more power hungry.
      yeah. no this is really exciting. i mean, originally (like, only 3 days ago) i was thinking, in big.little you could have the little cores with only say 8k Instruction-Cache and massively deep back-end SIMD ALUs (still with the Vector front-end though), but what hadn't occurred to me was to *drop* parts of the PowerISA on those cores which aren't strictly needed.

      i need to think about that. the reason is because there are currently only 4 "Platforms" in the OpenPOWER v3.1B Specification: AIX-compliant, UNIX-compliant, Embedded and Embedded-no-FPU. what you describe - which is a damn good idea - doesn't really fit any of those. i may have to raise this with the OpenPOWER Foundation, so thank you!



      • #43
        Originally posted by xfcemint View Post

        I would rate it as success as soon as they have it on FPGA, even with just the CPU working, without any GPU extensions. So, when this FPGA can run something like DOSbox with Quake, that is a success.
        yeah, i have the litex BIOS running in FPGA, including initialising the DDR3 DRAM; the only major thing left before running a linux OS is the MMU. at that point, it's Doom all the way.


        Originally posted by xfcemint View Post
        Also, this CPU is designed with an OoO scheduler? Wow, that's already freaking amazing! If it additionally has some kind of GPU acceleration capabilities, by whatever means - that's super fantastic. If that existed and was open source, some company would just take it and etch it on silicon - if only to create an RPi competitor.
        ok so i have the _pieces_ in place, thanks to Mitch Alsup (6 months studying the 6600 architecture and his augmentations): the Computation Units are planned around that. last year i had a prototype up and running, including shadowing, which is how you do precise exceptions and pull back anything that's issued after a branch-point. i'll need about 4-6 weeks clear, doing nothing else, to get that added in, and it's not time to do that just yet.

        in the meantime it's running a very VERY basic FSM using the "pieces" that are already designed, prepared and tested *in advance* to have the 6600 Dependency Matrices dropped in and connected to them.



        • #44
          Originally posted by xfcemint View Post

          Glad to be of help.

          Also, I can see one issue there, which is also a suggestion: I don't see any need for a complex OoO scheduler in the GPU cores. Even a superscalar issue is probably too much. I mean, an OoO scheduler will just waste transistors and power. So the best is probably to replace it with some simpler scheduler, which needs additional work to be designed.
          the *Tomasulo* algorithm, if made multi-issue, would indeed be an O(N^2) power increase and also an O(N^2) increase in design complexity. however i spent *6 months* with Mitch Alsup, one of the world's leading experts in commercial-grade CPU design, learning how to do this properly.

          the multi-issue superscalar aspect is "the" chosen way to not just get vectors in, it's also there to make sure that resources are properly utilised. imagine that you have a VL=3 or VL=12, which is standard fare for XYZ matrices and vectors. but... vectors of length 3 don't fit into SIMD of depth 4, do they? you *always* run at only 75% "lane" utilisation, *unless* you waste yet more CPU cycles reorganising 3x4 data into 4x3 data, or, as they did in MALI, actually add dedicated 3x4 matrix opcodes, which makes life even more hellish for programmers than GPU programming already is.

          for our engine, all operations basically boil down to scalar multi-issue, so: on the 1st clock cycle the first XYZ row of the 3x4 gets thrown into the first 3 slots of the 4-wide multi-issue execution engine, *and the 1st element of the 2nd row as well*. on the next clock cycle, elements Y and Z of the 2nd row plus elements X and Y of the *3rd* row get thrown in, and finally the last remaining elements fit cleanly into the 3rd clock cycle.
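
          a tiny sketch of that schedule (illustrative Python only - not the actual issue logic):

          Code:
              # 4 rows of XYZ = 12 scalar operations, fed into 4 issue slots per cycle.
              # no transposes, no dedicated matrix opcode: 100% slot utilisation.
              ops = [(row, "XYZ"[col]) for row in range(4) for col in range(3)]
              for cycle in range(0, len(ops), 4):
                  print("cycle", cycle // 4, ops[cycle:cycle + 4])

              # cycle 0 [(0, 'X'), (0, 'Y'), (0, 'Z'), (1, 'X')]
              # cycle 1 [(1, 'Y'), (1, 'Z'), (2, 'X'), (2, 'Y')]
              # cycle 2 [(2, 'Z'), (3, 'X'), (3, 'Y'), (3, 'Z')]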

          see how easy that was? where is the special hard-coded patented 3x4 matrix opcode? where are the horrendous messy cycle-wasting matrix transpose instructions? completely gone, not even needed.

          point being that i actually thought about this - in some significant detail. trying the above without an OoO multi-issue engine would actually be far more technically difficult.



          • #45
            Originally posted by xfcemint View Post

            About 3x4 matrix opcodes - aren't GPUs mostly bound by texture shader performance? Why would a texture shader need matrices at all (I don't know- I never wrote a single shader. I wrote lots of CUDA code, but not for graphics). I would imagine that a texture shader mostly needs multiply-add and bilinear filtering. Lots of that, and no need for complex matrix opcodes.
            Every stage of a GPU is fully programmable, from initial generation of meshes from raw data, through projecting them from 3D space onto 2D with depth, through to selecting the colour for each visible pixel based on textures and calculated light positions. Getting a triangle on screen is basically like writing several CUDA programs that each do a specific part of the 3D image display pipeline. The complexity of graphics vs CUDA is that the GPU needs to schedule hundreds of different programs dynamically, whereas CUDA is usually a single parallel workload.

            In any case, it is important to note that even with CUDA you do not have lots of independent cores running. The GPU cores are organised in groups that all run the same program, accessing the same shared data (uniforms), with different inputs to the same opcodes. Using conditional if/else logic or variable-repetition-count loops stalls the other cores in the group into doing NOPs until everything is executing the same code again. This is why you can fit 100 GPU cores for every CPU core onto your silicon.
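
            A rough model of that lockstep behaviour (illustrative Python only, not the actual hardware mechanism): every lane in the group executes both sides of a branch, and a per-lane mask decides whose result is kept, which is exactly why divergence wastes throughput.

            Code:
                # sketch of SIMT lockstep: the whole group runs both branch paths;
                # the mask turns the "wrong" path into effective NOPs per lane.
                def simt_if(mask, then_fn, else_fn, lanes):
                    then_results = [then_fn(x) for x in lanes]   # all lanes run this
                    else_results = [else_fn(x) for x in lanes]   # ...and all run this too
                    return [t if m else e
                            for m, t, e in zip(mask, then_results, else_results)]

                lanes = [1, 2, 3, 4]
                mask = [x % 2 == 0 for x in lanes]
                print(simt_if(mask, lambda x: x * 10, lambda x: -x, lanes))
                # [-1, 20, -3, 40] - both paths were executed by every lane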



            • #46
              Originally posted by xfcemint View Post
              Oh, here is another thing that I just thought of:


              So, perhaps you can just do some simplification of the current OoO design, to cut the number of transistors. It doesn't have to be in-order execution at all, just in-order issue and single issue will probably cut out a significant number of transistors.
              if we do it carefully (creatively) we can get away with around 50,000 gates for the out-of-order dependency matrices. a typical 64-bit multiplier is around 15,000 gates, and the DIV/SQRT/RSQRT pipeline i think was... 50,000 gates, possibly higher (it covers all 3 of those functions). we need 4 of those 64-bit multipliers, plus some more ALUs...

              see how those 50,000 gates for the dependency matrices don't look so big? and given that they're one-hot encoded, the power consumption is pretty small.
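
              putting those rough figures side by side (same estimates as above, just added up):

              Code:
                  # rough gate budget, using the estimates above (not measured numbers)
                  dep_matrices   = 50_000          # out-of-order dependency matrices
                  multipliers    = 4 * 15_000      # four 64-bit multipliers
                  div_sqrt_rsqrt = 50_000          # shared DIV/SQRT/RSQRT pipeline
                  compute = multipliers + div_sqrt_rsqrt
                  print(compute, compute / dep_matrices)
                  # 110000 2.2 - over twice the scheduler cost, before counting the other ALUs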

              GPUs typically have something insane like 30% of the entire ASIC dedicated to computation. in a "scalar" CPU it's more like... 2% (!) - even the register files take up more than that!



              • #47
                Originally posted by xfcemint View Post

                I've heard of the Tomasulo algorithm. I have a very rough idea of what it does. I'm not a real CPU designer, I do it just as a hobby.
                the youtube videos on it that are the top hits are pretty good, they make it really clear. once that's understood, i wrote a page on how to topologically "morph" to an (augmented) 6600 design. it basically involves changing all binary-address lookups (CAMs in particular) into *unary* (one-bit, one-hot) tables, which has the distinct advantage of far less power consumption to make a match (a single AND gate activates rather than a massive suite of XOR gates), and also allows multi-hot, which is, ta-daaa, how you do multi-issue with virtually no extra hardware: https://libre-soc.org/3d_gpu/archite...ransformation/
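
                a toy illustration of that binary-vs-unary point (nothing like the real circuit, purely to show the gate-level difference):

                Code:
                    # toy model: matching a 6-bit binary tag needs a full comparator
                    # (XOR every bit, then combine); matching one-hot rows is one AND.
                    def binary_match(tag_a, tag_b, bits=6):
                        return all(((tag_a >> i) & 1) == ((tag_b >> i) & 1)
                                   for i in range(bits))     # ~6 XORs plus a NOR tree

                    def onehot_match(row_a, row_b):
                        return bool(row_a & row_b)           # a single AND per cell

                    print(binary_match(5, 5))                # True
                    print(onehot_match(0b000100, 0b000100))  # True
                    # "multi-hot" also works: several positions matched in one go,
                    # which is where cheap multi-issue comes from.
                    print(onehot_match(0b000110, 0b010110))  # True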



                • #48
                  Originally posted by xfcemint View Post
                  I don't even know the correct names: I think the per-pixel shader is actually the one that applies the post-processing effect on the final image.

                  The most important one is the shader that samples textures. What is it called? It is generally run as a single ray per pixel, but you can have multiple rays per pixel to do some antialiasing. I had to go to Wikipedia: apparently it's called a fragment shader or pixel shader, confusingly.
                  There are people on this forum much better qualified to go into that level of detail than me, but Microsoft have a nice diagram here: https://docs.microsoft.com/en-us/win...ith-directx-12

                  Originally posted by xfcemint View Post
                  Well, I don't see a big complexity there. When an SM is done with one thing, the GPU scheduler runs another thing on it. If it runs 16x8 blocks of pixels in simple screen order, it will do fine, but to get additional 10% performance maybe it can try blocks covering the same triangle.
                  The pixel shader normally runs per triangle, but that is the basic model: if you are transforming vertices, do 1000 at a time; if you are calculating the colour of pixels, do 1000 at a time; and so on. My understanding is that the scheduling is where the real smarts are (especially loading caches at the right time and so forth). Anyone can put 5000 cores on a chip, but the complexity is in getting them all work to do.

                  Originally posted by xfcemint View Post

                  What I don't get is: what's the implementation of register crossbar and the register bus? I thought that tri-state busses are to be avoided on ICs. So how does it manage that huge crossbar with just MUXes and DEMUXes? Maybe it's just a lot of transistors for that crossbar.
                  I have a couple of friends working in the industry whom you'd enjoy talking to, but you have exceeded my knowledge now.



                  • #49
                    Originally posted by xfcemint View Post
                    The most important is the shader that samples textures.
                    yeah, here we will need a special opcode that takes an array of 4 pixel values, (N,M), (N+1,M), (N,M+1), (N+1,M+1), and an xy pair from 0.0 to 1.0. the pixel value returned (ARGB) will be the linear interpolation between the 4 incoming pixel values, according to the xy coordinates.

                    trying that in software only rather than having a single-cycle (or pipelined) opcode was exactly why Larrabee failed.
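
                    for reference, a minimal software model of what that opcode would compute, per colour channel (the names and the exact opcode form are placeholders, not the final spec):

                    Code:
                        # bilinear interpolation of 4 neighbouring pixel values, x, y in [0.0, 1.0]
                        # p00=(N,M), p10=(N+1,M), p01=(N,M+1), p11=(N+1,M+1), one channel each
                        def texlerp(p00, p10, p01, p11, x, y):
                            top    = p00 * (1.0 - x) + p10 * x     # interpolate along row M
                            bottom = p01 * (1.0 - x) + p11 * x     # interpolate along row M+1
                            return top * (1.0 - y) + bottom * y    # then between the two rows

                        print(texlerp(0.0, 1.0, 0.0, 1.0, 0.25, 0.5))   # 0.25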


                    Originally posted by xfcemint View Post
                    Yeah, I guessed that, and it's absolutely the same with CUDA threads. You have to avoid thread DIVERGENCE. A conditional instruction *can* (but doesn't have to) split a warp into two parts, then everything needs to be executed twice (or multiple times).


                    Originally posted by xfcemint View Post
                    What I don't get is: what's the implementation of register crossbar and the register bus? I thought that tri-state busses are to be avoided on ICs. So how does it manage that huge crossbar with just MUXes and DEMUXes? Maybe it's just a lot of transistors for that crossbar.
                    basically yes. and it's something that can be avoided with "striping".

                    if you have to add vectors of length 4 all the time, you *know* that A[0] = B[0] + C[0] is never going to interact with A[3] = B[3] + C[3].

                    therefore what you do is: you *stripe* the register file (into 4 "lanes") so that R0 can *never* interact with R1,R2,R3, but ONLY with R4, R8, R12, R16 etc. likewise R1 can *never* interact with anything other than R5, R9, R13, R17 etc.

                    of course that's a bit s*** for general-purpose computing, so you add some slower data paths (maybe a shift register or a separate broadcast bus) but at least you didn't have to have a massive 4x4 64-bit crossbar taking up thousands of gates and bristling with wires.

                    turns out that one of the major problems for crossbars is not the number of MUXes, it's the number of wires in and out.
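
                    a sketch of that lane rule (illustrative only - 4 lanes is just the example, the real lane count and register-file layout are a design decision):

                    Code:
                        LANES = 4
                        def lane(regnum):
                            return regnum % LANES    # R0, R4, R8, ... share lane 0, and so on

                        # a vec4 add A = B + C with lane-aligned vectors never crosses lanes,
                        # so no full 4x4 crossbar is needed for the common case:
                        B, C, A = 4, 8, 12           # base register numbers
                        for i in range(4):
                            assert lane(B + i) == lane(C + i) == lane(A + i) == i
                        # cross-lane moves take the slower path (shift register / broadcast bus)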



                    • #50
                      Originally posted by lkcl View Post

                      25 years ago i got such bad RSI (known as carpal tunnel in the U.S.) that i had to minimise typing. it got so bad that one day i couldn't get into my house because i couldn't turn the key in the lock.

                      like the "pavlov dog", if it actually physically hurts to stretch your fingers just to reach a shift key, pretty soon you stop doing it. however when it comes to proper nouns, sometimes i find that the respect that i have for such words "over-rides" the physical pain that it causes me to type the word.
                      Sounds painful - sorry to hear.

                      Best of luck with the project!

