Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source

  • #51
    Originally posted by xfcemint View Post

    As I understand it, the GPU-alike solution is to have a special texture unit, which is a completely separate unit. The texture unit does the bilinear interpolation.

    Your solution is a special opcode for bilinear interpolation. Your opcode takes a cycle, it uses resources in the OoO scheduler and uses CPU registers to store pixel RGB values. Pixel RGB values have to be separately loaded (from texture cache? Or from a general-purpose L1 cache?), which wastes even more cycles. You don't want to waste those cycles because the bottleneck of your solution is the instruction issue throughput.

    In comparison, a separate texture unit needs 0 additional cycles to do bilinear filtering (since the inputs are texture sample coordinates x,y in texture coordinate space). A separate texture unit has direct access to memory. The downside is that a texture unit needs lots of multipliers (for bilinear filtering) and it needs its own texture cache. So, that is a lot of transistors for a texture unit, but the utilization is usually excellent, much better than in your current solution.
    ok, so there are two things here:

    1) it looks like you know quite a bit more than me about GPU texturisation (i'm learning!), where when we get to it i was counting on input from Jacob (designer of Kazan), Mitch Alsup (one of the architects behind Samsung's recent GPU), and so on.

    2) how 6600-style OoO works. this bit i *do* know about, and i forgot to mention something: namely that the way it works is, every "operation" is monitored for its start and completion, and it *doesn't matter* whether it's a FSM (like a DIV unit), a single-stage pipeline, a multi-stage pipeline, or an early-out variable-length pipeline: the only thing that matters is that every operation is "monitored" from start to finish, 100% without fail.

    consequently, what you describe (the texture unit, with its cache), *can be slotted in as a Function Unit into the 6600 OoO architecture*.

    in fact, we could if necessary add many more than just one of them.


    For Larabee, if Intel didn't at least do a separate opcode for bilinear filtering, well, that is "beyond stupid". I thought they had a texture unit attached to each x86 core, that would be a minimum for a GPU from my point of view.
    not that i am aware of (at least, certainly Nyuzi did not, because jeff deliberately "tracked" and researched the "pure soft GPU" angle).

    It is very hard for me to predict how your solution is going to perform.
    we don't know either! however what we have is a strategy for calculating that (based on Jeff's Nyuzi work, well worth reading), where he shows not only how to measure "pixels / clock" performance, but also how to work out which bits of any given algorithm are contributing to the [lack of] performance.

    he then shows precisely how to optimise the *architecture* to get better performance. and there are some real surprises in it: the L1 cache munches enormous amounts of power, for example. you should be able to track the paper down via this https://www.researchgate.net/publica...pen_source_GPU




    You have a big OoO scheduler and a rather small ALU compared to a GPU.
    ah no, you misunderstand: setting the sizes, capabilities and quantities of each ALU is a matter of dialing a parameter in a python dictionary. we literally change the number of MUL pipelines with a single line of code, and the ALU(s) go ballistic - and so does the gate count.
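
    to give a flavour of what "dialing a parameter" means in practice, here is a minimal sketch (the dictionary keys and the helper are made up for illustration, they are *not* the actual Libre-SOC source): the per-ALU quantities are just data that the HDL reads at build time.

    # illustrative sketch only: key names are hypothetical, not the real
    # Libre-SOC configuration. the point: ALU quantities are parameters.
    core_config = {
        "num_int_alus": 2,   # simple integer add/sub/logic pipelines
        "num_mul_pipes": 1,  # change this 1 to a 2 and re-synthesise
        "num_div_fsms": 1,   # FSM-based divider Function Units
        "num_fp_pipes": 1,   # FP ALUs (also SIMD-partitionable)
    }

    def build_function_units(cfg):
        """one Function Unit per configured ALU; each one gets monitored
        start-to-finish by the 6600-style Dependency Matrices."""
        fus = []
        for kind, key in [("ALU", "num_int_alus"), ("MUL", "num_mul_pipes"),
                          ("DIV", "num_div_fsms"), ("FP", "num_fp_pipes")]:
            fus += ["%s%d" % (kind, i) for i in range(cfg[key])]
        return fus

    print(build_function_units(core_config))
    # -> ['ALU0', 'ALU1', 'MUL0', 'DIV0', 'FP0']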

    the question then becomes: *should* you do that, what's the business case for doing so, and will people pay money for the resultant product?

    That kind of CPU has very limited compute resources to compete with a GPU. You try to compensate by using triple instruction issue to get better utilization of ALU units.
    exactly, but more than that, we can crank up the number *of* ALUs - of each different type - to dial in the performance according to what we find out when we get to run benchmarks, just like Jeff Bush did.

    Well, if you already have an OoO scheduler, at least your solution looks relatively simple to design.
    eexxaaactlyyy. because we have the pieces in place, we're not going "omg, omg, we backed ourselves into a corner with this stupid in-order design, why the hell did we do that, argh, we now have to chuck out everything we've developed over the past NN months and start again".

    by tackling head-on what's normally considered to be a "hard" processor design we've given ourselves the design flexibility to just go "yawn, did we need to up the number of DIV units again? ahh let me just stretch across to the keyboard and change that 2 to a 3, ahh that was really difficult"

    What can I say? The simplest way to increase the compute capability of your CPU is SIMD.
    if you look in the new POWER10 architecture - which is a *splutter* 8-way multi-issue, they actually put down *two* separate and distinct 128-bit VSX ALUs / pipelines.

    we can do exactly the same thing.

    we can put down *MULTIPLE* SIMD ALUs.

    we do *NOT* have to do the insanity of increasing the SIMD width from 64 bit to 128 bit to 256 bit to 512 bit.

    we have the option - the flexibility - to put down 1x 64-bit SIMD ALU, 2x 64-bit SIMD ALUs, 3x, 4x, 8x - and have the OoO Execution Engine take care of it.

    (actually, what's more likely to happen is, when we come to do high-performance CPU-GPUs, because the Dependency Matrices increase in size O(N^2), is that we'll put down 1x 64-bit SIMD ALU, 1x 128-bit SIMD ALU, 1x 256-bit SIMD ALU and so on)

    and, remember: all of this complexity - whatever happens at the back-end - is entirely hidden from the developer with a "VL" front-end, all exactly the same programs. *no* need to even know that there's NNN back-end SIMD units.

    In that case, you need to look at the common shader code and how can it be compiled for your CPU. Shaders can be run massively in parallel, so how is your CPU going to exploit that?
    see above. and, also, remember, if it gets architecturally too complex for a single CPU, we just increase the number of cores on the SMP NOC instead (look up OpenPITON, it's one of the options we can use, to go up to 500,000 cores).


    Think about adding a separate texture unit, connected to the ALU. Texture unit has its own cache and direct access to memory.
    ok so if it has memory access, then that's a little more complex, because the LDs / STs also have to be monitored by Dependency Matrices. yes, really: in an OoO architecture you cannot let LDs / STs go unmonitored either, because otherwise you get memory corruption.

    studying and learning about this (and properly implementing it) took, i think, about 2 of those 5 months of learning the augmented 6600 from Mitch Alsup.



    I'm going to think some more about it, but I think that I don't have sufficient knowledge and experience on this subject to be of much further help. It is too hard to see how the shader code is going to be compiled and what amount of compute capability per transistor your CPU can provide.
    we honestly don't know yet (and can only have an iterative strategy to "see what happens"). in practical terms we're still at the "let's get the scalar core operational" phase along with "planning the pieces in advance ready for adding GPU stuff", preparing the groundwork for entering that "iterative feedback loop" phase (just like Jeff Bush did on Nyuzi). for which, actually, this conversation has been fantastic preparation, very grateful for the opportunity.

    Comment


    • #52
      Originally posted by xfcemint View Post
      The problem with adding a texture unit is that it is a lot of work.

      It is much, much easier to just use a special instruction for bilinear filtering.

      So, for start, perhaps it is a better idea to not use a texture unit.
      this sounds exactly like the kind of useful strategy that would get us reasonable performance without a full-on texture unit; as a hybrid processor it fits better, and it's also much more along the lines of the RISC strategy. thank you for the suggestion, i've documented it here https://bugs.libre-soc.org/show_bug.cgi?id=91
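
      as a reference for what such a bilinear-filter instruction has to compute, here is a minimal software model (pure python, purely illustrative; fetch_texel is a hypothetical stand-in for the texel load, which in hardware would come from a texture cache or the L1 data cache):

      import math

      # minimal bilinear-filter reference model (illustrative only).
      # fetch_texel(i, j) returns one texel (e.g. an RGBA tuple) at
      # integer texture coordinates - a stand-in for the hardware load.
      def bilinear(fetch_texel, x, y):
          i, j = math.floor(x), math.floor(y)
          fx, fy = x - i, y - j                    # fractional weights
          t00 = fetch_texel(i,     j)
          t10 = fetch_texel(i + 1, j)
          t01 = fetch_texel(i,     j + 1)
          t11 = fetch_texel(i + 1, j + 1)
          # lerp horizontally, then vertically, per colour channel
          return tuple((t00[c]*(1-fx) + t10[c]*fx) * (1-fy) +
                       (t01[c]*(1-fx) + t11[c]*fx) * fy
                       for c in range(len(t00)))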

      Comment


      • #53
        Originally posted by xfcemint View Post

        You can do that in hardware?
        you can do absolutely anything you want in hardware, it's just a matter of thinking it up and having the time/funding to do it. the fact that nobody else has ever thought of instructions in this fashion (as being a hardware type of software API and developing a sort-of "compression") is... well... *shrug*

        we can do it in Simple-V by analysing the "element width over-ride" and the "vector length".

        * elwidth override says "i know this is supposed to be FPADD64 however for all intents and purposes i want it to be FP16 for this upcoming instruction"

        * VL override says "i know this is supposed to be scalar FPADD64^wFPADD16 however for all intents and purposes i want you to shove as many (up to VL) instructions into the execution units as you can possibly jam"

        what then happens is: the "SIMD-i-fier" goes, "hmm, that's FP16, and VL is 8, so hmmm i can break that down into two lots of SIMD 4x FP16s and jam them into these two free SIMD FP ALUs".
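
        a very rough sketch of that "SIMD-i-fier" decision (numbers and names are illustrative only, not the actual implementation): given VL and the element width it simply chops the vector into chunks that fit the free SIMD ALUs.

        # illustrative "SIMD-i-fier": split VL elements of width elwidth
        # (in bits) into SIMD chunks. a 64-bit SIMD ALU holds 64 // elwidth
        # elements, so VL=8 at FP16 becomes two 4-wide operations.
        def simdify(vl, elwidth, simd_width=64):
            lanes = simd_width // elwidth        # elements per SIMD op
            ops, start = [], 0
            while start < vl:
                n = min(lanes, vl - start)
                ops.append((start, n))           # (first element, count)
                start += n
            return ops

        print(simdify(vl=8, elwidth=16))         # -> [(0, 4), (4, 4)]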

        I didn't know that is possible. I have never heard of a front-end that can fuse instructions into SIMD for back-end.
        there are a lot of innovations that are possible in SimpleV which have never been part of any publicly-documented academic or commercial micro-architecture. twin predication being one of them.

        just because nobody else has thought of it does not mean that it is not possible: it just means... nobody else has ever thought of it.

        That looks just too crazy to me. Even doing this in a software compiler is a serious problem.
        which is precisely why we're not even remotely considering it in a software compiler. that is known to be insane.

        remember: *all of this as far as the programmer is concerned is hidden behind that Vector Front-end ISA*. the programmer *only* has to think in terms of setting the Vector Length and setting the elwidth.


        You must mean: the shader compiler fuses a small number of shader thread instances into a single CPU thread to create opportunities for using SIMD. This one CPU thread can be called a warp, since it actually handles a few shader instances simultaneously.
        no, i mean that there's a very simple analysis, just after instruction decode phase - a new phase - which analyses the Vector Length and the Element Width "context". and if they are appropriately set (elwidth = 8 bit / 16 bit / 32 bit) and if there are free available SIMD ALUs, multiple operations are pushed into the ALUs.

        it's very straightforward but is sufficiently involved that it may have to be done as its own pipeline stage. on first iterations of implementations however i will try to avoid doing that because it will introduce yet another stage of latency. we just have to see.

        Comment


        • #54
          Originally posted by xfcemint View Post

          Word "gates" is ambigous to me. Could mean: CMOS implementation of AND, OR, NOT logic gates. Also, there are two possible versions of those: the ones with obligatory complement output, or without it. "Gates" could also mean the total number of transistors.

          By the numbers you are posting, I guess you are speaking of CMOS gates, with about 4 transistors per gate.
          industry-standard average is around 2 transistors per CMOS gate, yes: one to pull "HI", the other to pull "LO".

          If you can fit an entire GPU core in less than 5 million transistors, you are flying. So, I would do about one million transistors for a decoder plus the OoO engine, one million for L1 instructions, one million for L1 data. Then see what ALU units you need to maximize compute power.
          allow me to divide those numbers back into gates (divide by 2). 500k gates for a decoder and the OoO engine is off by an order of magnitude. the PowerISA decoder is (as long as we do not have to do VSX) around the 5,000 to 8,000 gate mark, and the OoO engine should be around 50k.

          L1 caches are.... hmm let's do a quick google search

          * https://www.reddit.com/r/ECE/comment..._size_in_gate/
          * https://www.researchgate.net/post/Ho...0nm_technology


          so a standard SRAM cell is 6 transistors per bit (aka 3 "gates").

          the other post says you can expect about 55% "efficiency" (50-ish percent "crud" for addressing). so let's say a 32k cache: that's x3 for gates, x2 for "crud", which gives 192k gates (which if you reaaally want to do it in transistors is 384k transistors)

          yyeah you weren't far off if you were expecting 64k L1 cache sizes
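
          a quick sanity-check of that arithmetic (following the rule of thumb above - 6-transistor SRAM cell = 3 "gates" per bit-cell, roughly doubled for the addressing "crud", ~2 transistors per gate - and treating the 32k figure as the number of bit-cells, exactly as in the estimate):

          # back-of-the-envelope check of the estimate above. assumptions:
          # 6T SRAM cell = ~3 "gates" per bit-cell, ~2x addressing "crud",
          # ~2 transistors per gate on average.
          def cache_gate_estimate(cells, gates_per_cell=3, crud_factor=2):
              gates = cells * gates_per_cell * crud_factor
              return gates, gates * 2              # (gates, transistors)

          print(cache_gate_estimate(32 * 1024))    # -> (196608, 393216)
          # i.e. roughly 192k gates / 384k transistors, as above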


          That is all going to run very hot, so you need very low clocks for GPU cores. Save power in any way you can.
          yes. we came up with a way to open the latches between pipeline stages - it turns out IBM invented this somewhere around 1990. now of course you can just use something called "clock gating", however for the initial 180nm ASIC nmigen does not support clock gating, nor do we have a cell for it, nor does coriolis2 support the concept.

          so quite a lot of development work needed there before we can use clock gating, and in the meantime we can use that older technique of a "bypass" on the pipeline latches.
          Last edited by lkcl; 20 September 2020, 07:01 AM.

          Comment


          • #55
            Originally posted by xfcemint View Post
            For Larabee, if Intel didn't at least do a separate opcode for bilinear filtering, well, that is "beyond stupid". I thought they had a texture unit attached to each x86 core, that would be a minimum for a GPU from my point of view.
            Larrabee doesn't have fixed function logic for rasterization, interpolation, blending as they are not a bottleneck. But it has fixed function texture units with dedicated 32KB caches, to handle all the filtering methods and texture formats I guess.

            If Libre-SOC is serious about the GPU part they'll need a texture unit too. And it might be quite a bit of work to implement one, unless there are open source designs out there that can be used.

            Comment


            • #56
              Originally posted by xfcemint View Post

              No, I was expecting 8-16 KiB cache size. We are somehow counting the transistors differently, that is what causes this confusion. I can't figure out how you can do a two-transistor CMOS AND gate, no way. It's 4 transistors, or 8 transistors, or even more with power gating.
              there are three industry-standard terms: cells, gates and transistors. nobody thinks in terms of transistors (not any more) except people designing cell libraries. to create a typical CMOS "gate" you need two transistors: one to pull HI and one to pull LO. just that alone gives you a NOT "gate". if you put two of those in series you get a NAND "gate".

              something like that.

              anyway, look at the diagram again: you'll see 6 "actual transistors" (not gates, 6 *transistors*).

              It gets even worse if you go into using more complicated gates than just AND and OR. Like, what about a custom CMOS XOR gate?
              8 "transistors" according to this, although if you google the number of "gates" the answer comes up "5". which fits with my in-memory recollection.


              Or a three-input gate for half adders?
              a half-adder is 2 logic gates (note the different terminology): one XOR plus one AND. 12 for a full adder.


              however if you measure in "transistors" it's 20 transistors, but it depends on the design choices (do you include carry-propagation?)
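
              for illustration, the half-adder really is just that XOR plus that AND - here it is as a generic textbook sketch in nmigen (the HDL the project already uses; this is not Libre-SOC code) - and what synthesis then turns it into, in transistors, is exactly where the counting gets fuzzy:

              # textbook half-adder (illustrative): one XOR for the sum,
              # one AND for the carry - the "2 logic gates".
              from nmigen import Elaboratable, Module, Signal

              class HalfAdder(Elaboratable):
                  def __init__(self):
                      self.a = Signal()
                      self.b = Signal()
                      self.sum = Signal()
                      self.carry = Signal()

                  def elaborate(self, platform):
                      m = Module()
                      m.d.comb += [
                          self.sum.eq(self.a ^ self.b),    # XOR gate
                          self.carry.eq(self.a & self.b),  # AND gate
                      ]
                      return m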


              basically the relationship between "gates" and "transistor count" is quite involved, hence why the *average* rule of thumb - after all optimisations have been run during synthesis - is around 2x.

              it's quite a rabbit-hole, i wouldn't worry about it too much

              Comment


              • #57
                Originally posted by log0 View Post

                If Libre-SOC is serious about the GPU part they'll need a texture unit too. And it might be quite a bit of work to implement one, unless there are open source designs out there that can be used.
                i think... i think this was the sort of thing that the research projects (MIAOW, FlexGripPlus) very deliberately left out, precisely because, as you and xfcemint point out, they're quite involved and a whole research project on their own. we *might* find something in not-a-GPLGPU: Jeff Bush pointed out that although it's fixed-function (based on PLAN9) there is some startling similarity to modern GPU design still in there.

                appreciate the heads-up, i am taking notes and making sure this goes into the bugtracker which is now cross-referenced to this discussion, so thank you log0

                Comment


                • #58
                  Originally posted by xfcemint View Post
                  Here is an even better variation:

                  A SAMPLE instruction takes as inputs:
                  - a pixel format of the texture
                  - the address of the texture in memory
                  - texture pitch
                  - (x,y) sample coordinates, floating point, in the texture native coordinate space

                  The result is an RGB(A) sample.

                  Then, you also need a separate instruction to help computing the (x,y) sample coordinates, because they likely need to be converted to texture coordinate space.

                  yes, i believe this is (probably - we have to confirm) along the lines of what Jacob and Mitch hinted at (except without the texture cache idea: i think we were planning to use the standard L1 data cache. will make sure the idea is discussed).
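
                  to make the address side of that concrete: for a given pixel format and pitch, the texel fetch a SAMPLE op implies is just base-plus-pitch arithmetic (a hypothetical sketch; the actual operand encoding is still to be decided):

                  # hypothetical sketch of the addressing a SAMPLE op implies:
                  # texel (i, j) of a texture at 'base', rows 'pitch' bytes apart.
                  def texel_address(base, pitch, bytes_per_pixel, i, j):
                      return base + j * pitch + i * bytes_per_pixel

                  # e.g. an RGBA8888 texture (4 bytes/pixel) with 1024-byte rows:
                  print(hex(texel_address(0x8000_0000, 1024, 4, i=3, j=2)))
                  # -> 0x8000080c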

                  I'm very proud to have contributed to this project of yours. I think that the basic concepts are sound and it is a good idea. I would love to have an open-source GPU. So, good luck.
                  really appreciated xfcemint. it's... compelling, isn't it? once you get into it, the opportunity to like, y'know, actually have something that *you* designed actually go into silicon, rather than waiting around for a billion-dollar company to do it and leave you out, not in the slightest bit consulted - it's addictive, isn't it?




                  Comment


                  • #59
                    Thanks for all the good ideas!

                    Originally posted by xfcemint View Post
                    Now the fun part:
                    All the texture caches on a SoC are connected by their own bus. If there is a cache miss at one cache, the load request doesn't fall back to main memory yet. Instead, it broadcasts the read request on the cache bus, and other caches can respond if the requested cache line is cached by them. If other caches have the requested data, it is transmitted over the cache bus and loaded into the requesting cache.

                    Only when all other caches produce a cache miss does the load request fall back to main memory.
                    rather than having a shared texture cache, I think it might be more beneficial to instead use the shared L2 cache, since that way, non-GPU software can still use all the cache space, rather than only having half as much space due to the rest of the available area being used for a separate L2 texture cache. This also allows textures to use a much larger cache when running GPU tasks, improving performance.

                    Having dedicated L1 caches per-core for textures seems like it could be a good idea, though on the other hand, using that area for a bigger data cache with more ports might be more beneficial since it would be used for non-texture loads/stores too. That all needs to be tested somehow.

                    Comment


                    • #60
                      Originally posted by xfcemint View Post
                      Oh, another thing:

                      If you need some kind of hub/arbiter/controller on the cache bus, you can drop in one of your POWER CPU cores to act as a hub for the entire bus. I hope that it can achieve a sufficiently high request throughput. I think it can.
                      That seems waay too complex for the relatively simple task of controlling a bus -- the libre-soc power core is likely to be >50k gates, whereas a bus controller finite-state-machine is likely to be <1k gates
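
                      for scale, here is roughly what such a bus-controller finite-state-machine looks like in nmigen (a generic two-requester sketch with assumed signal names, nothing Libre-SOC-specific) - it synthesises to a handful of flip-flops and gates, nowhere near a full core:

                      # generic bus-arbiter FSM sketch (illustrative only;
                      # signal names are assumptions, not Libre-SOC code).
                      from nmigen import Elaboratable, Module, Signal

                      class BusArbiter(Elaboratable):
                          def __init__(self):
                              self.req = Signal(2)   # request lines from two caches
                              self.gnt = Signal(2)   # one-hot grant back to them

                          def elaborate(self, platform):
                              m = Module()
                              with m.FSM():
                                  with m.State("IDLE"):
                                      with m.If(self.req[0]):
                                          m.next = "GRANT0"
                                      with m.Elif(self.req[1]):
                                          m.next = "GRANT1"
                                  with m.State("GRANT0"):
                                      m.d.comb += self.gnt[0].eq(1)
                                      with m.If(~self.req[0]):
                                          m.next = "IDLE"
                                  with m.State("GRANT1"):
                                      m.d.comb += self.gnt[1].eq(1)
                                      with m.If(~self.req[1]):
                                          m.next = "IDLE"
                              return m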

                      Comment
