Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • lkcl
    replied
    Originally posted by xfcemint View Post
    Here is an even better variation:

    A SAMPLE instruction takes as inputs:
    - a pixel format of the texture
    - the address of the texture in memory
    - texture pitch
    - (x,y) sample coordinates, floating point, in the texture native coordinate space

    The result is an RGB(A) sample.

    Then, you also need a separate instruction to help compute the (x,y) sample coordinates, because they likely need to be converted into texture coordinate space.

    yes, i believe this is (probably - we have to confirm) along the lines of what Jacob and Mitch hinted at (except not with the texture cache idea: i think we were planning to use the standard L1 data cache; will make sure the idea is discussed).

    I'm very proud to have contributed to this project of yours. I think that the basic concepts are sound and it is a good idea. I would love to have an open-source GPU. So, good luck.
    really appreciated, xfcemint. it's... compelling, isn't it? once you get into it - the opportunity to actually have something that *you* designed go into silicon, rather than waiting around for a billion-dollar company to do it and leave you out, not in the slightest bit consulted - it's addictive, isn't it?
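
    as a purely illustrative model of the inputs and outputs the SAMPLE instruction above describes (a single hypothetical RGBA8888 "pixel format", nearest-neighbour fetch; names are made up, and a filtering variant would blend the 2x2 footprint around the sample point - this is a sketch, not the actual instruction):

```python
# minimal model of the proposed SAMPLE semantics (illustrative only)
def sample(mem: bytes, fmt: str, base: int, pitch: int, x: float, y: float):
    assert fmt == "RGBA8888"                 # one assumed format, 4 bytes/texel
    tx, ty = int(round(x)), int(round(y))    # texture-native coordinates
    off = base + ty * pitch + tx * 4         # pitch = bytes per texture row
    r, g, b, a = mem[off:off + 4]
    return (r, g, b, a)                      # the RGB(A) sample result
```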






  • lkcl
    replied
    Originally posted by log0 View Post

    If Libre-SOC is serious about the GPU part they'll need a texture unit too. And it might be quite a bit of work to implement one, unless there are open source designs out there that can be used.
    i think... i think this was the sort of thing that the research projects (MIAOW, FlexGripPlus) very deliberately left out, precisely because, as you and xfcemint point out, texture units are quite involved and a whole research project on their own. we *might* find something in not-a-GPLGPU: Jeff Bush pointed out that although it's fixed-function (based on PLAN9) there is some startling similarity to modern GPU design still in there.

    appreciate the heads-up, i am taking notes and making sure this goes into the bugtracker which is now cross-referenced to this discussion, so thank you log0



  • lkcl
    replied
    Originally posted by xfcemint View Post

    No, I was expecting 8-16 KiB cache size. We are somehow counting the transistors differently, that is what causes this confusion. I can't figure out how you can do a two-transistor CMOS AND gate, no way. It's 4 transistors, or 8 transistors, or even more with power gating.
    there are three industry-standard terms: cells, gates and transistors. nobody thinks in terms of transistors any more except people designing cell libraries. to create a typical CMOS "gate" you need two transistors: one to pull HI and one to pull LO. just that alone gives you a NOT "gate"; put two of those in series and you get a NAND "gate".

    something like that.

    anyway, look at the diagram again: you'll see 6 "actual transistors" (not gates, 6 *transistors*). so i

    It gets even worse if you go into using more complicated gates than just AND and OR. Like, what about a custom CMOS XOR gate?
    8 "transistors" according to this, although if you google the number of "gates" the answer comes up "5". which fits with my in-memory recollection.


    Or a three-input gate for half adders?
    a half-adder is 2 logic gates (note the different terminology): one XOR plus one AND. 12 for a full adder.


    however if you measure in "transistors" it's 20 transistors, but it depends on the design choices (do you include carry-propagation?)


    basically the relationship between "gates" and "transistor count" is quite involved, hence why the *average* rule of thumb - after all optimisations have been run during synthesis - is around 2x.

    it's quite a rabbit-hole, i wouldn't worry about it too much
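
    to make the terminology above concrete, here is a tiny python sketch: a half adder as two logic gates, a full adder built from two of them, and the rough gates-to-transistors conversion using the ~2x post-synthesis rule of thumb mentioned above (the exact transistor count always depends on the cell library, so treat the ratio as an average, not a law):

```python
# half adder expressed in logic gates: one XOR plus one AND (2 logic gates)
def half_adder(a: int, b: int):
    return a ^ b, a & b          # (sum, carry)

# full adder: two half adders plus an OR to merge the two carries
def full_adder(a: int, b: int, cin: int):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, cin)
    return s2, c1 | c2           # (sum, carry-out)

# rough conversion using the ~2x post-synthesis rule of thumb from the post;
# real transistor counts depend entirely on the cell library chosen
def gates_to_transistors(gates: int, ratio: float = 2.0) -> int:
    return int(gates * ratio)
```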



  • log0
    replied
    Originally posted by xfcemint View Post
    For Larrabee, if Intel didn't at least do a separate opcode for bilinear filtering, well, that is "beyond stupid". I thought they had a texture unit attached to each x86 core; that would be a minimum for a GPU from my point of view.
    Larrabee doesn't have fixed-function logic for rasterization, interpolation or blending, as those are not a bottleneck. But it has fixed-function texture units with dedicated 32KB caches, to handle all the filtering methods and texture formats I guess.

    If Libre-SOC is serious about the GPU part they'll need a texture unit too. And it might be quite a bit of work to implement one, unless there are open source designs out there that can be used.



  • lkcl
    replied
    Originally posted by xfcemint View Post

    Word "gates" is ambigous to me. Could mean: CMOS implementation of AND, OR, NOT logic gates. Also, there are two possible versions of those: the ones with obligatory complement output, or without it. "Gates" could also mean the total number of transistors.

    By the numbers you are posting, I guess you are speaking of CMOS gates, with about 4 transistors per gate.
    industry-standard average is around 2 transistors per CMOS gate, yes: one to pull "HI", the other to pull "LO".

    If you can fit an entire GPU core in less than 5 million transistors, you are flying. So, I would do about one million transistors for a decoder plus the OoO engine, one million for L1 instructions, one million for L1 data. Then see what ALU units you need to maximize compute power.
    allow me to divide those numbers back into gates (divide by 2). 500k gates for a decoder and the OoO engine is off by an order of magnitude. the PowerISA decoder is (as long as we do not have to do VSX) around the 5,000 to 8,000 gate mark, and the OoO engine should be around 50k.

    L1 caches are.... hmm let's do a quick google search

    * https://www.reddit.com/r/ECE/comment..._size_in_gate/
    * https://www.researchgate.net/post/Ho...0nm_technology


    so that's 6 transistors per SRAM bit cell (aka 3 "gates").

    the other post says you can expect about 55% "efficiency" (50-ish percent "crud" for addressing). so let's say a 32k cache: that's x3 for gates, x2 for "crud", so that's 192k gates (which if you reaaally want to do it in transistors is 384k transistors).

    yeah, you weren't far off if you were expecting 64k L1 cache sizes
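
    the back-of-envelope arithmetic above as a quick sketch (treating the "32k" as the number of stored bits, 6 transistors / 3 "gates" per SRAM bit cell, and a 2x factor for addressing "crud", exactly as in the estimate above - an estimate, not a synthesis result):

```python
# back-of-envelope L1 cache sizing, following the estimate in the post
bits          = 32 * 1024       # "32k cache", counted here as stored bits
gates_per_bit = 3               # 6-transistor SRAM cell ~= 3 "gates"
crud_factor   = 2               # ~50% overhead for addressing / tags / control

gates       = bits * gates_per_bit * crud_factor
transistors = gates * 2         # ~2 transistors per gate rule of thumb

print(gates, transistors)       # 196608 (~192k gates), 393216 (~384k transistors)
```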


    That is all going to run very hot, so you need very low clocks for GPU cores. Save power in any way you can.
    yes, we came up with a way to open the latches between pipeline stages; turns out IBM invented this somewhere around 1990. now of course you can just use something called "clock gating", however for the initial 180nm ASIC nmigen does not support clock gating, nor do we have a cell for it, nor does coriolis2 support the concept.

    so quite a lot of development work needed there before we can use clock gating, and in the meantime we can use that older technique of a "bypass" on the pipeline latches.
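
    as a rough illustration of that interim "hold the pipeline latch" idea (not real Libre-SOC code, and not true clock gating: the register simply keeps its previous value when the enable is low), something along these lines in nmigen:

```python
from nmigen import Elaboratable, Module, Signal

class StageReg(Elaboratable):
    """pipeline register that only captures when 'en' is high,
       standing in for clock gating until the tooling supports it."""
    def __init__(self, width):
        self.i  = Signal(width)
        self.o  = Signal(width)
        self.en = Signal()                  # "open the latch" when there is work

    def elaborate(self, platform):
        m = Module()
        with m.If(self.en):
            m.d.sync += self.o.eq(self.i)   # holds previous value when en is low
        return m
```
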
    Last edited by lkcl; 20 September 2020, 07:01 AM.



  • lkcl
    replied
    Originally posted by xfcemint View Post

    You can do that in hardware?
    you can do absolutely anything you want in hardware, it's just a matter of thinking it up and having the time/funding to do it. the fact that nobody else has ever thought of instructions in this fashion (as a hardware-level kind of software API, with a sort-of "compression") is... well... *shrug*

    we can do it in Simple-V by analysing the "element width over-ride" and the "vector length".

    * elwidth override says "i know this is supposed to be FPADD64 however for all intents and purposes i want it to be FP16 for this upcoming instruction"

    * VL override says "i know this is supposed to be scalar FPADD64^wFPADD16 however for all intents and purposes i want you to shove as many (up to VL) instructions into the execution units as you can possibly jam"

    what then happens is: the "SIMD-i-fier" goes, "hmm, that's FP16, and VL is 8, so hmmm i can break that down into two lots of SIMD 4x FP16s and jam them into these two free SIMD FP ALUs".

    I didn't know that is possible. I have never heard of a front-end that can fuse instructions into SIMD for back-end.
    there are a lot of innovations that are possible in SimpleV which have never been part of any publicly-documented academic or commercial micro-architecture. twin predication being one of them.

    just because nobody else has thought of it does not mean that it is not possible: it just means... nobody else has ever thought of it.

    That looks just too crazy to me. Even doing this in a software compiler is a serious problem.
    which is precisely why we're not even remotely considering it in a software compiler. that is known to be insane.

    remember: *all of this as far as the programmer is concerned is hidden behind that Vector Front-end ISA*. the programmer *only* has to think in terms of setting the Vector Length and setting the elwidth.


    You must mean: the shader compiler fuses a small number of shader thread instances into a single CPU thread to create opportunities for using SIMD. This one CPU thread can be called a warp, since it actually handles a few shader instances simultaneously.
    no, i mean that there's a very simple analysis, just after the instruction decode phase - a new phase - which analyses the Vector Length and the Element Width "context". if they are appropriately set (elwidth = 8 bit / 16 bit / 32 bit) and there are free available SIMD ALUs, multiple operations are pushed into the ALUs.

    it's very straightforward but is sufficiently involved that it may have to be done as its own pipeline stage. on first implementation iterations, however, i will try to avoid doing that because it will introduce yet another stage of latency. we just have to see.
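
    a toy model of that post-decode analysis (names are hypothetical and the real "SIMD-i-fier" is far more involved): given VL, the element width and the SIMD lane width of the back-end ALUs, decide how many element operations to jam into each SIMD issue.

```python
# toy model of the "SIMD-i-fier" decision described above (names hypothetical)
def simdify(vl: int, elwidth_bits: int, simd_alu_bits: int = 64):
    lanes_per_alu = simd_alu_bits // elwidth_bits    # e.g. 64 // 16 = 4x FP16
    groups, remaining = [], vl
    while remaining > 0:
        n = min(lanes_per_alu, remaining)            # one SIMD issue per group
        groups.append(n)
        remaining -= n
    return groups

# VL=8, FP16 elements, 64-bit SIMD ALUs -> two issues of 4x FP16 each
print(simdify(8, 16))        # [4, 4]
```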



  • lkcl
    replied
    Originally posted by xfcemint View Post
    The problem with adding a texture unit is that it is a lot of work.

    It is much, much easier to just use a special instruction for bilinear filtering.

    So, for start, perhaps it is a better idea to not use a texture unit.
    this sounds exactly like the kind of useful strategy that would get us reasonable performance without the full-on texture-unit approach; as a hybrid processor it fits better, and it's also much more in line with the RISC strategy. thank you for the suggestion, i've documented it here: https://bugs.libre-soc.org/show_bug.cgi?id=91



  • lkcl
    replied
    Originally posted by xfcemint View Post

    As I understand it, the GPU-alike solution is to have a special texture unit, which is a completely separate unit. Texture unit does the bilinear interpolation.

    Your solution is a special opcode for bilinear interpolation. Your opcode takes a cycle, it uses resources in the OoO scheduler and uses CPU registers to store pixel RGB values. Pixel RGB values have to be separately loaded (from texture cache? Or from a general-purpose L1 cache?), which wastes even more cycles. You don't want to waste those cycles because the bottleneck of your solution is the instruction issue throughput.

    In comparison, a separate texture unit needs 0 additional cycles to do bilinear filtering (since the inputs are texture sample coordinates x,y in texture coordinate space). A separate texture unit has direct access to memory. The downside is that a texture unit needs lots of multipliers (for bilinear filtering) and it needs its own texture cache. So, that is a lot of transistors for a texture unit, but the utilization is usually excellent, much better than in your current solution.
    ok, so there's two things here:

    1) it looks like you know quite a bit more than me about GPU texturisation (i'm learning!). when we get to it i'll be counting on input from Jacob (designer of Kazan), Mitch Alsup (one of the architects behind Samsung's recent GPU), and so on.

    2) how 6600-style OoO works. this bit i *do* know about, and i forgot to mention something: the way it works is that every "operation" is monitored for its start and completion, and it *doesn't matter* if it's a FSM (like a DIV unit), a single-stage pipeline, a multi-stage pipeline, or an early-out variable-length pipeline: the only thing that matters is that every operation is "monitored" from start to finish, 100% without fail.

    consequently, what you describe (the texture unit, with its cache), *can be slotted in as a Function Unit into the 6600 OoO architecture*.

    in fact, we could if necessary add many more than just one of them.
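
    a very rough sketch of the "everything is a monitored Function Unit" idea (class and signal names are made up for illustration, not the actual Libre-SOC scoreboard code): the scheduler only cares about issue and completion, so an FSM divider, a pipeline, or a texture unit with its own cache all plug in the same way.

```python
# toy model: the 6600-style scheduler only tracks "busy from issue to done",
# regardless of what the Function Unit is internally (names are illustrative)
class FunctionUnit:
    def __init__(self, name, latency):
        self.name, self.latency = name, latency
        self.busy, self.countdown = False, 0

    def issue(self, op):
        assert not self.busy                # Dependency Matrices prevent this
        self.busy, self.countdown = True, self.latency

    def tick(self):                         # called once per clock
        if self.busy:
            self.countdown -= 1
            if self.countdown == 0:
                self.busy = False           # completion signalled, result written

# a texture-sampling unit is "just another FU" as far as the scoreboard goes
units = [FunctionUnit("fpmul", 4), FunctionUnit("div", 12),
         FunctionUnit("texture", 8)]
```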


    For Larrabee, if Intel didn't at least do a separate opcode for bilinear filtering, well, that is "beyond stupid". I thought they had a texture unit attached to each x86 core; that would be a minimum for a GPU from my point of view.
    not that i am aware of (at least, certainly Nyuzi did not, because jeff deliberately "tracked" and researched the "pure soft GPU" angle).

    It is very hard for me to predict how your solution is going to perform.
    we don't know either! however what we have is a strategy for calculating that (based on Jeff's Nyuzi work, well worth reading), where he shows not only how to measure "pixels / clock" performance, but also how to work out which bits of any given algorithm are contributing to the [lack of] performance.

    he then shows precisely how to optimise the *architecture* to get better performance. and there are some real surprises in it: the L1 cache munches enormous amounts of power, for example. you should be able to track the paper down via this: https://www.researchgate.net/publica...pen_source_GPU
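
    the metric itself is trivial to compute; the interesting part of that methodology is attributing cycles to each phase of the algorithm. a sketch with placeholder numbers (purely illustrative, not measurements):

```python
# pixels/clock plus a per-phase cycle breakdown, Nyuzi-benchmark style
# (the numbers below are placeholders, not measurements)
cycles_by_phase = {"vertex": 120_000, "rasterise": 80_000,
                   "shade": 400_000, "blend/writeback": 60_000}
pixels_rendered = 640 * 480

total_cycles = sum(cycles_by_phase.values())
print("pixels/clock:", pixels_rendered / total_cycles)
for phase, cyc in cycles_by_phase.items():
    print(f"{phase}: {100 * cyc / total_cycles:.1f}% of cycles")
```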




    You have a big OoO scheduler and a rather small ALU compared to a GPU.
    ah no, you misunderstand: setting the sizes, capabilities and quantities *of* each ALU is a matter of dialing a parameter in a python dictionary. we literally change the number of MUL pipelines with a single line of code, and the ALUs go ballistic and so does the number of gates.

    the question then becomes: *should* you do that, what's the business case for doing so, and will people pay money for the resultant product?

    That kind of CPU has very limited compute resources to compete with a GPU. You try to compensate by using triple instruction issue to get better utilization of ALU units.
    exactly, but more than that, we can crank up the number *of* ALUs - of each different type - to dial in the performance according to what we find out when we get to run benchmarks, just like Jeff Bush did.
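
    purely as an illustration of "dialing a parameter in a python dictionary" (the key names and values here are invented, not the actual Libre-SOC configuration):

```python
# hypothetical per-core ALU mix: bump a number, re-run synthesis, re-benchmark
alu_config = {
    "int":   2,     # integer ALU pipelines
    "mul":   1,     # change this 1 to a 2 and the gate count follows
    "div":   1,
    "fp64":  1,
    "ld_st": 2,
}
```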

    Well, if you already have an OoO scheduler, at least your solution looks relatively simple to design.
    eexxaaactlyyy. because we have the pieces in place we're not going "omg, omg we backed ourselves into a corner with this stupid in-order design, why the hell did we do that, arg arg we now have to chuck out everything we've developed over the past NN months and start again".

    by tackling head-on what's normally considered to be a "hard" processor design we've given ourselves the design flexibility to just go "yawn, did we need to up the number of DIV units again? ahh let me just stretch across to the keyboard and change that 2 to a 3, ahh that was really difficult"

    What can I say? The simplest way to increase the compute capability of your CPU is SIMD.
    if you look at the new POWER10 architecture - which is a *splutter* 8-way multi-issue - they actually put down *two* separate and distinct 128-bit VSX ALUs / pipelines.

    we can do exactly the same thing.

    we can put down *MULTIPLE* SIMD ALUs.

    we do *NOT* have to do the insanity of increasing the SIMD width from 64 bit to 128 bit to 256 bit to 512 bit.

    we have the option - the flexibility - to put down 1x 64-bit SIMD ALUs, 2x 64-bit SIMD ALUs, 3x, 4x, 8x - and have the OoO Execution Engine take care of it.

    (actually, what's more likely to happen is, when we come to do high-performance CPU-GPUs, because the Dependency Matrices increase in size O(N^2), is that we'll put down 1x 64-bit SIMD ALU, 1x 128-bit SIMD ALU, 1x 256-bit SIMD ALU and so on)

    and, remember: all of this complexity - whatever happens at the back-end - is entirely hidden from the developer behind the "VL" front-end, and all programs stay exactly the same. *no* need to even know that there are NNN back-end SIMD units.

    In that case, you need to look at the common shader code and how can it be compiled for your CPU. Shaders can be run massively in parallel, so how is your CPU going to exploit that?
    see above. and, also, remember, if it gets architecturally too complex for a single CPU, we just increase the number of cores on the SMP NOC instead (look up OpenPITON, it's one of the options we can use, to go up to 500,000 cores).


    Think about adding a separate texture unit, connected to the ALU. Texture unit has its own cache and direct access to memory.
    ok so if it has memory access, then that's a little more complex, because the LDs / STs also have to be monitored by Dependency Matrices. yes, really: in an OoO architecture you cannot let LDs / STs go unmonitored either, because otherwise you get memory corruption.

    studying and learning about this (and properly implementing it) i think took about 2 out of those 5 months of learning about augmented 6600 from Mitch Alsup.



    I'm going to think some more about it, but I think that I don't have sufficient knowledge and experience on this subject to be of much further help. It is too hard to see how the shader code is going to be compiled and how much compute capability per transistor your CPU can provide.
    we honestly don't know yet (and can only have an iterative strategy to "see what happens"). in practical terms we're still at the "let's get the scalar core operational" phase along with "planning the pieces in advance ready for adding GPU stuff", preparing the groundwork for entering that "iterative feedback loop" phase (just like Jeff Bush did on Nyuzi). for which, actually, this conversation has been fantastic preparation, very grateful for the opportunity.



  • ermo
    replied
    Originally posted by lkcl View Post

    25 years ago i got such bad RSI (known as carpal tunnel in the U.S.) that i had to minimise typing. it got so bad that one day i couldn't get into my house because i couldn't turn the key in the lock.

    like the "pavlov dog", if it actually physically hurts to stretch your fingers just to reach a shift key, pretty soon you stop doing it. however when it comes to proper nouns, sometimes i find that the respect that i have for such words "over-rides" the physical pain that it causes me to type the word.
    Sounds painful - sorry to hear.

    Best of luck with the project!



  • lkcl
    replied
    Originally posted by xfcemint View Post
    The most important is the shader that samples textures.
    yeah, here we will need a special opcode that takes an array of 4 pixel values - (N,M), (N+1,M), (N,M+1), (N+1,M+1) - and an xy pair from 0.0 to 1.0. the pixel value returned (ARGB) will be the bilinear interpolation between the 4 incoming pixel values, according to the xy coordinates.

    trying that in software only rather than having a single-cycle (or pipelined) opcode was exactly why Larrabee failed.
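
    a scalar model of that proposed opcode's semantics (a packed 32-bit ARGB encoding is assumed here purely for illustration; the actual register layout is still to be decided):

```python
# model of the proposed opcode: 4 neighbouring ARGB pixels already in
# registers, plus (x, y) in 0.0..1.0, returning the bilinearly blended pixel
def bilerp_argb(p00: int, p10: int, p01: int, p11: int,
                x: float, y: float) -> int:
    """p00=(N,M), p10=(N+1,M), p01=(N,M+1), p11=(N+1,M+1); x, y in 0.0..1.0"""
    out = 0
    for shift in (24, 16, 8, 0):                    # A, R, G, B channels
        def ch(p):                                  # extract one 8-bit channel
            return (p >> shift) & 0xFF
        top = ch(p00) * (1 - x) + ch(p10) * x       # blend along x, row M
        bot = ch(p01) * (1 - x) + ch(p11) * x       # blend along x, row M+1
        out |= (int(round(top * (1 - y) + bot * y)) & 0xFF) << shift
    return out
```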


    Yeah, I guessed that, and it's absolutely the same with CUDA threads. You have to avoid thread DIVERGENCE. A conditional instruction *can* (but doesn't have to) split a warp into two parts, then everything needs to be executed twice (or multiple times).


    What I don't get is: what's the implementation of register crossbar and the register bus? I thought that tri-state busses are to be avoided on ICs. So how does it manage that huge crossbar with just MUXes and DEMUXes? Maybe it's just a lot of transistors for that crossbar.
    basically yes. and it's something that can be avoided with "striping".

    if you have to add vectors of length 4 all the time, you *know* that A[0] = B[0] + C[0] is never going to interact with A[3] = B[3] + C[3].

    therefore what you do is: you *stripe* the register file (into 4 "lanes") so that R0 can *never* interact with R1,R2,R3, but ONLY with R4, R8, R12, R16 etc. likewise R1 can *never* interact with anything other than R5, R9, R13, R17 etc.

    of course that's a bit s*** for general-purpose computing, so you add some slower data paths (maybe a shift register or a separate broadcast bus) but at least you didn't have to have a massive 4x4 64-bit crossbar taking up thousands of gates and bristling with wires.

    turns out that one of the major problems for crossbars is not the number of MUXes, it's the number of wires in and out.
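
    the striping rule itself is simple enough to show as a sketch (4 lanes assumed, matching the example above):

```python
# 4-lane striped register file: a register can only reach operands in its
# own lane, so R0 pairs with R4, R8, R12... and R1 with R5, R9, R13...
NUM_LANES = 4

def lane_of(regnum: int) -> int:
    return regnum % NUM_LANES

def same_lane(rd: int, rs: int) -> bool:
    # True: the operation can use the cheap in-lane path; False: it has to
    # take the slower broadcast / shift-register path mentioned above
    return lane_of(rd) == lane_of(rs)

print(same_lane(0, 4), same_lane(0, 3))    # True False
```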

