Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • #71
    Originally posted by xfcemint View Post

    You can do that in hardware?
    you can do absolutely anything you want in hardware, it's just a matter of thinking it up and having the time/funding to do it. the fact that nobody else has ever thought of instructions in this fashion (as being a hardware type of software API and developing a sort-of "compression") is... well... *shrug*

    we can do it in Simple-V by analysing the "element width over-ride" and the "vector length".

    * elwidth override says "i know this is supposed to be FPADD64 however for all intents and purposes i want it to be FP16 for this upcoming instruction"

    * VL override says "i know this is supposed to be scalar FPADD64^wFPADD16 however for all intents and purposes i want you to shove as many (up to VL) instructions into the execution units as you can possibly jam"

    what then happens is: the "SIMD-i-fier" goes, "hmm, that's FP16, and VL is 8, so hmmm i can break that down into two lots of SIMD 4x FP16s and jam them into these two free SIMD FP ALUs".
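
    something like this, as a rough python sketch (the 64-bit SIMD lane width and the function name are purely illustrative assumptions, not the actual implementation):

```python
# hypothetical sketch of the "SIMD-i-fier" packing decision: the names and
# the 64-bit lane width are illustrative assumptions, not Libre-SOC code.
def simdify(elwidth_bits, vl, simd_lane_bits=64):
    """split VL element operations into as few SIMD issues as possible."""
    lanes_per_alu = simd_lane_bits // elwidth_bits   # e.g. 64 // 16 = 4x FP16
    issues = []
    remaining = vl
    while remaining > 0:
        n = min(lanes_per_alu, remaining)            # fill one SIMD ALU slot
        issues.append(n)
        remaining -= n
    return issues

print(simdify(16, 8))   # FP16 with VL=8 -> [4, 4]: two 4x FP16 operations
```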

    I didn't know that was possible. I have never heard of a front-end that can fuse instructions into SIMD for the back-end.
    there are a lot of innovations that are possible in SimpleV which have never been part of any publicly-documented academic or commercial micro-architecture. twin predication being one of them.

    just because nobody else has thought of it does not mean that it is not possible: it just means... nobody else has ever thought of it.

    That looks just too crazy to me. Even doing this in a software compiler is a serious problem.
    which is precisely why we're not even remotely considering it in a software compiler. that is known to be insane.

    remember: *all of this as far as the programmer is concerned is hidden behind that Vector Front-end ISA*. the programmer *only* has to think in terms of setting the Vector Length and setting the elwidth.


    You must mean: the shader compiler fuses a small number of shader thread instances into a single CPU thread to create opportunities for using SIMD. This one CPU thread can be called a warp, since it actually handles a few shader instances simultaneously.
    no, i mean that there's a very simple analysis, just after instruction decode phase - a new phase - which analyses the Vector Length and the Element Width "context". and if they are appropriately set (elwidth = 8 bit / 16 bit / 32 bit) and if there are free available SIMD ALUs, multiple operations are pushed into the ALUs.

    it's very straightforward but is sufficiently involved that it may have to be done as its own pipeline stage. on first iterations of implementations however i will try to avoid doing that because it will introduce yet another stage of latency. we just have to see.



    • #72
      Originally posted by xfcemint View Post

      Word "gates" is ambiguous to me. Could mean: CMOS implementation of AND, OR, NOT logic gates. Also, there are two possible versions of those: the ones with an obligatory complement output, or without it. "Gates" could also mean the total number of transistors.

      By the numbers you are posting, I guess you are speaking of CMOS gates, with about 4 transistors per gate.
      industry-standard average is around 2 transistors per CMOS gate, yes: one to pull "HI", the other to pull "LO".

      If you can fit an entire GPU core in less than 5 million transistors, you are flying. So, I would do about one million transistors for a decoder plus the OoO engine, one million for L1 instructions, one million for L1 data. Then see what ALU units you need to maximize compute power.
      allow me to divide those numbers back into gates (divide by 2). 500k gates for a decoder and the OoO engine is off by an order of magnitude. the PowerISA decoder is (as long as we do not have to do VSX) around the 5,000 to 8,000 gate mark, and the OoO engine should be around 50k.

      L1 caches are.... hmm let's do a quick google search

      * https://www.reddit.com/r/ECE/comment..._size_in_gate/
      * https://www.researchgate.net/post/Ho...0nm_technology


      so that's 6 transistors per SRAM bit cell (aka 3 "gates").

      the other post says you can expect about 55% "efficiency" (50-ish percent "crud" for addressing). so let's say 32k cache, that's x3 for gates, x2 for "crud" so that's 192k gates (which if you reaaally want to do it in transistors is 384k transistors)
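
      here's the same back-of-the-envelope sum written out (treating "32k" as 32k bit cells is my assumption here, along with the 6T cell and the 2x "crud" factor from those links):

```python
# back-of-the-envelope cache sizing: 6T SRAM bit cell = 3 "gates",
# roughly 2x overhead for addressing/tags ("crud"), ~2 transistors per gate.
cells       = 32 * 1024          # assumption: "32k" counted as bit cells
gates       = cells * 3          # 6 transistors per cell = 3 "gates"
with_crud   = gates * 2          # addressing/tag/decode overhead
transistors = with_crud * 2      # ~2 transistors per "gate"

print(f"{with_crud // 1024}k gates, {transistors // 1024}k transistors")
# -> 192k gates, 384k transistors
```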

      yyeah you weren't far off if you were expecting 64k L1 cache sizes


      That is all going to run very hot, so you need very low clocks for GPU cores. Save power in any way you can.
      yes, we came up with a way to open the latches between pipeline stages (turns out IBM invented this somewhere in 1990). now of course you can just use something called "clock gating", however for the initial 180nm ASIC nmigen does not support clock gating, nor do we have a cell for it, nor does coriolis2 support the concept.

      so quite a lot of development work needed there before we can use clock gating, and in the meantime we can use that older technique of a "bypass" on the pipeline latches.
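
      roughly speaking, the "bypass" idea looks like this in nmigen (a minimal sketch with made-up signal names, not the actual libre-soc code):

```python
from nmigen import Elaboratable, Module, Signal, Mux

class BypassableStage(Elaboratable):
    """pipeline register whose latch can be "opened": when `bypass` is high,
    data flows straight through combinatorially instead of being registered."""
    def __init__(self, width=64):
        self.i      = Signal(width)
        self.o      = Signal(width)
        self.bypass = Signal()

    def elaborate(self, platform):
        m = Module()
        latched = Signal.like(self.i)
        m.d.sync += latched.eq(self.i)                            # normal registered path
        m.d.comb += self.o.eq(Mux(self.bypass, self.i, latched))  # latch "opened"
        return m
```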
      Last edited by lkcl; 09-20-2020, 07:01 AM.



      • #73
        Originally posted by lkcl View Post

        this sounds exactly like the kind of useful strategy that would get us some reasonable performance without a full-on approach; as a hybrid processor it would fit better, and it's also much more along the RISC strategy. thank you for the suggestion, i've documented it here https://bugs.libre-soc.org/show_bug.cgi?id=91
        From my point of view, the benefit of using special instructions instead of separate hardware units is:

        - instructions get better utilization by OoO scheduler
        - instructions get better utilization by compiler scheduling
        - instructions can easily be executed in a pipeline and in parallel

        The downside:
        - Separate hardware is faster. It acts as an additional execution unit.
        - Separate hardware uses less OoO engine resources

        I think that a separate texture unit would be a better solution, but not by much, so you can skip it. But I do recommend implementing a separate texture cache, shared by all the GPU cores.



        • #74
          Originally posted by lkcl View Post
          industry-standard average is around 2 transistors per CMOS gate, yes: one to pull "HI", the other to pull "LO".

          yyeah you weren't far off if you were expecting 64k L1 cache sizes
          No, I was expecting 8-16 KiB cache size. We are somehow counting the transistors differently, that is what causes this confusion. I can't figure out how you can do a two-transistor CMOS AND gate, no way. It's 4 transistors, or 8 transistors, or even more with power gating.

          It gets even worse if you go into using more complicated gates than just AND and OR. Like, what about a custom CMOS XOR gate? Or a three-input gate for half adders?



          • #75
            Originally posted by xfcemint View Post
            For Larrabee, if Intel didn't at least do a separate opcode for bilinear filtering, well, that is "beyond stupid". I thought they had a texture unit attached to each x86 core; that would be a minimum for a GPU from my point of view.
            Larrabee doesn't have fixed-function logic for rasterization, interpolation or blending, as they are not a bottleneck. But it has fixed-function texture units with dedicated 32KB caches, to handle all the filtering methods and texture formats I guess.

            If Libre-SOC is serious about the GPU part they'll need a texture unit too. And it might be quite a bit of work to implement one, unless there are open source designs out there that can be used.



            • #76
              Originally posted by xfcemint View Post

              No, I was expecting 8-16 KiB cache size. We are somehow counting the transistors differently, that is what causes this confusion. I can't figure out how can you do a two transistor CMOS AND gate, no way. It's 4 transistors, or 8 transistors, or even more with power gating.
              there are three industry-standard terms: cells, gates and transistors. nobody thinks in terms of transistors (not any more) except people designing cell libraries. to create a typical CMOS "gate" you need two transistors: one to pull HI and one to pull LO. just that alone gives you a NOT "gate". if you put two of those in series you get a NAND "gate".

              something like that.

              anyway, look at the diagram again: you'll see 6 "actual transistors" (not gates, 6 *transistors*).

              It gets even worse if you go into using more complicated gates than just AND and OR. Like, what about a custom CMOS XOR gate?
              8 "transistors" according to this, although if you google the number of "gates" the answer comes up "5". which fits with my in-memory recollection.


              Or a three-input gate for half adders?
              half-adder is 2 logic gates (note the different terminology), one XOR plus one AND. 12 for a full adder
              https://www.researchgate.net/post/Ho...and_full_adder

              however if you measure in "transistors" it's 20 transistors, but it depends on the design choices (do you include carry-propagation?)
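
              to make the "logic gates" terminology concrete: a half adder really is just those two gates. here's what it looks like in nmigen (a minimal illustrative sketch, not code from the libre-soc tree):

```python
from nmigen import Elaboratable, Module, Signal

class HalfAdder(Elaboratable):
    """two logic gates: sum = a XOR b, carry = a AND b."""
    def __init__(self):
        self.a     = Signal()
        self.b     = Signal()
        self.sum   = Signal()
        self.carry = Signal()

    def elaborate(self, platform):
        m = Module()
        m.d.comb += [
            self.sum.eq(self.a ^ self.b),    # XOR gate
            self.carry.eq(self.a & self.b),  # AND gate
        ]
        return m
```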


              basically the relationship between "gates" and "transistor count" is quite involved, hence why the *average* rule of thumb - after all optimisations have been run during synthesis - is around 2x.

              it's quite a rabbit-hole, i wouldn't worry about it too much



              • #77
                Originally posted by log0 View Post

                If Libre-SOC is serious about the GPU part they'll need a texture unit too. And it might be quite a bit of work to implement one, unless there are open source designs out there that can be used.
                i think... i think this was the sort of thing that the research projects (MIAOW, FlexGripPlus) very deliberately left out precisely because as you and xfcemint point out, they're quite involved and a whole research project on their own. we *might* find something in not-a-GPLGPU, Jeff Bush pointed out that although it's fixed-function (based on PLAN9) there was some startling similarity to modern GPU design still in there.

                appreciate the heads-up, i am taking notes and making sure this goes into the bugtracker which is now cross-referenced to this discussion, so thank you log0



                • #78
                  Originally posted by lkcl View Post

                  i think... i think this was the sort of thing that the research projects (MIAOW, FlexGripPlus) very deliberately left out precisely because as you and xfcemint point out, they're quite involved and a whole research project on their own. we *might* find something in not-a-GPLGPU, Jeff Bush pointed out that although it's fixed-function (based on PLAN9) there was some startling similarity to modern GPU design still in there.

                  appreciate the heads-up, i am taking notes and making sure this goes into the bugtracker which is now cross-referenced to this discussion, so thank you log0
                  Wait, I have an even better idea.

                  Instead of having three separate instructions to replace a texture unit (bilinear interpolation, coordinate transform, LOAD from texture), you would be better off with a single instruction.

                  You add a custom instruction SAMPLE which does all three of the mentioned things together. Perhaps you can split the coordinate transform out as a separate instruction, if you find that necessary.

                  So a SAMPLE instruction needs all the inputs that I mentioned previously. As a result, it produces a bilinearly interpolated RGB(A?) sample from a texture. Such an instruction would be a great fit for your architecture. It does a lot of work in a single instruction, so that reduces the instruction issue bottleneck and the pressure on registers. It would also be beneficial to have 2-6 units for handling SAMPLE instructions, because it will have long latency. That would enable several SAMPLE instructions to be in flight at the same time.

                  It would work great with the texture cache that I previously described.

                  You can add this to that bug tracker.

                  I'm very proud to have contributed to this project of yours. I think that the basic concepts are sound and it is a good idea. I would love to have an open-source GPU. So, good luck.



                  • #79
                    Here is an even better variation:

                    A SAMPLE instruction takes as inputs:
                    - a pixel format of the texture
                    - the address of the texture in memory
                    - texture pitch
                    - (x,y) sample coordinates, floating point, in the texture native coordinate space

                    The result is an RGB(A) sample.

                    Then, you also need a separate instruction to help compute the (x,y) sample coordinates, because they likely need to be converted to texture coordinate space.
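
                    To make the semantics concrete, here is a rough software model of what such a SAMPLE instruction could compute. The clamp-at-the-edges behaviour and the assumption that texels are already decoded into RGBA tuples (i.e. the pixel-format handling is omitted) are illustrative choices, not part of the proposal:

```python
def sample_bilinear(texture, pitch, width, height, x, y):
    """Software model of a hypothetical SAMPLE instruction.

    texture: flat, row-major list of (r, g, b, a) texels, `pitch` texels per
    row; (x, y) are floating-point coordinates in texture space."""
    def texel(tx, ty):
        tx = max(0, min(width - 1, tx))       # clamp at the edges (illustrative)
        ty = max(0, min(height - 1, ty))
        return texture[ty * pitch + tx]

    x0, y0 = int(x), int(y)                   # integer texel coordinates
    fx, fy = x - x0, y - y0                   # fractional weights

    def lerp(a, b, t):
        return tuple(ai + (bi - ai) * t for ai, bi in zip(a, b))

    top    = lerp(texel(x0, y0),     texel(x0 + 1, y0),     fx)
    bottom = lerp(texel(x0, y0 + 1), texel(x0 + 1, y0 + 1), fx)
    return lerp(top, bottom, fy)              # bilinearly filtered RGBA sample
```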




                    • #80
                      Some textures (maybe all textures) are tiled on triangles (the texture has finite size, but it is tiled to get an infinite texture in both the x and y axes).

                      To support that option, the instruction for transforming into texture coordinates can take as inputs the (logical) texture width and height, and then perform the modulo operation on final coordinates (texture space) to produce the tiling effect.

                      Even better, you don't always have to do the modulo operation, as in most cases the quotient will be zero. When the instruction detects that the coordinates are within the given texture's logical width and height, you can save some time and power because the modulo operation is effectively a no-op.
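
                      A short sketch of that wrap-around logic (purely illustrative):

```python
def to_texture_coords(u, v, tex_width, tex_height):
    """Illustrative coordinate transform with tiling (wrap) semantics.

    If the coordinates already fall inside the texture, the modulo is
    skipped entirely, so the common case costs nothing."""
    if 0 <= u < tex_width and 0 <= v < tex_height:
        return u, v                           # in range: modulo is a no-op
    return u % tex_width, v % tex_height      # tile/repeat outside the texture
```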
                      Last edited by xfcemint; 09-20-2020, 10:40 AM.

