Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • #51
    Originally posted by xfcemint View Post

    Glad to be of help.

    Also, I can see one issue there, which is also a suggestion: I don't see any need for a complex OoO scheduler in the GPU cores. Even superscalar issue is probably too much. I mean, an OoO scheduler will just waste transistors and power. So the best option is probably to replace it with some simpler scheduler, which would need additional design work.
    the *Tomasulo* algorithm, if made multi-issue, would indeed incur an O(N^2) power increase and also an O(N^2) increase in design complexity. however i spent *6 months* with Mitch Alsup, one of the world's leading experts in commercial-grade CPU design, learning how to do this properly.

    the multi-issue superscalar aspect is "the" chosen way not just to get vectors in: it is also there to make sure that resources are properly utilised. imagine that you have VL=3 or VL=12, which is standard fare for XYZ matrices and vectors. but... vectors of length 3 don't fit into SIMD of depth 4, do they? you *always* run at only 75% "lane" utilisation, *unless* you waste yet more CPU cycles reorganising 3x4 data into 4x3 data, or, as they did in MALI, actually add dedicated 3x4 matrix opcodes, which makes life even more hell for programmers than GPU programming already is.

    for our engine, because all operations basically boil down to scalar multi-issue, on the 1st clock cycle the first XYZ row of the 3x4 gets thrown into the first 3 slots of the 4-wide multi-issue execution engine, *and the 1st element of the 2nd row as well*. on the next clock cycle, elements Y and Z of the 2nd row plus elements X and Y of the *3rd* row get thrown into the 4-wide multi-issue execution engine, and finally the last remaining elements fit cleanly into the 3rd clock cycle.

    see how easy that was? where is the special hard-coded patented 3x4 matrix opcode? where are the horrendous messy cycle-wasting matrix transpose instructions? completely gone, not even needed.

    point being that i actually thought about this - in some significant detail. trying the above without an OoO multi-issue engine would actually be far more technically difficult.
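
    to make that concrete, here is a quick python sketch (illustrative only, not Libre-SOC code or actual hardware): the 12 element-operations simply stream into the 4-wide issue window, ignoring row boundaries, so no lane is ever left idle.

    Code:
    # illustrative sketch only: the 12 element-ops of a 3x4 matrix
    # packed into a 4-wide multi-issue window, 4 at a time.
    ISSUE_WIDTH = 4
    ops = [f"row{r}.{e}" for r in range(4) for e in "XYZ"]  # VL=12

    for cycle, i in enumerate(range(0, len(ops), ISSUE_WIDTH)):
        print(f"cycle {cycle}: {ops[i:i + ISSUE_WIDTH]}")

    # cycle 0: ['row0.X', 'row0.Y', 'row0.Z', 'row1.X']
    # cycle 1: ['row1.Y', 'row1.Z', 'row2.X', 'row2.Y']
    # cycle 2: ['row2.Z', 'row3.X', 'row3.Y', 'row3.Z']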



    • #52
      Oh, here is another thing that I just thought of:

      It appears to me (it seems obvious) that the ideal kind of instruction scheduler for a GPU would be:
      - In-order issue
      - Single issue
      - Can issue instructions without waiting for previously issued instructions to complete. So, it should have some kind of dependency resolution, and maybe it can simply reuse this functionality from the OoO scheduler design.

      So, perhaps you can just do some simplification of the current OoO design to cut the number of transistors. It doesn't have to be in-order execution at all: just in-order issue and single issue will probably cut out a significant number of transistors.
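
      A toy Python sketch of what I mean (my own illustration, with made-up names, not any real design): in-order, single-issue, but with a per-register busy bit, so issue never waits for earlier instructions to complete.

      Code:
      busy = set()  # destination registers still in flight

      def try_issue(dest, srcs):
          """At most one instruction per cycle, in program order."""
          if dest in busy or any(s in busy for s in srcs):
              return False     # hazard: stall, stay in order
          busy.add(dest)       # mark result in flight and issue
          return True          # no waiting for earlier instructions

      def writeback(dest):
          busy.discard(dest)   # completion may happen out of order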



      • #53
        Originally posted by lkcl View Post
        the multi-issue superscalar aspect is "the" chosen way not just to get vectors in: it is also there to make sure that resources are properly utilised. imagine that you have VL=3 or VL=12, which is standard fare for XYZ matrices and vectors. but... vectors of length 3 don't fit into SIMD of depth 4, do they? you *always* run at only 75% "lane" utilisation, *unless* you waste yet more CPU cycles reorganising 3x4 data into 4x3 data, or, as they did in MALI, actually add dedicated 3x4 matrix opcodes, which makes life even more hell for programmers than GPU programming already is.

        for our engine, because all operations basically boil down to scalar multi-issue, on the 1st clock cycle the first XYZ row of the 3x4 gets thrown into the first 3 slots of the 4-wide multi-issue execution engine, *and the 1st element of the 2nd row as well*. on the next clock cycle, elements Y and Z of the 2nd row plus elements X and Y of the *3rd* row get thrown into the 4-wide multi-issue execution engine, and finally the last remaining elements fit cleanly into the 3rd clock cycle.
        I have just read this. I think I get it. This is a novel approach you are taking; I have never heard of anything similar.

        A 3-wide issue - isn't that too complex for a GPU?

        Isn't it simpler to just add more and wider arithmetic units, even if they have only 75% lane utilization?

        I have to think about it; this is too confusing for me to give an answer quickly. It seems to me at first glance that using multi-issue to get high utilization is not important for a GPU. The problem is that an insufficient number of arithmetic operations is issued on each cycle (3 at most). That's too low for a GPU, and it will create a bottleneck.




        • #54
          Originally posted by lkcl View Post

          the *Tomasulo* algorithm, if made multi-issue, would...

          see how easy that was? where is the special hard-coded patented 3x4 matrix opcode? where are the horrendous messy cycle-wasting matrix transpose instructions? completely gone, not even needed.

          point being that i actually thought about this - in some significant detail. trying the above without an OoO multi-issue engine would actually be far more technically difficult.
          I've heard of the Tomasulo algorithm. I have a very rough idea of what it does. I'm not a real CPU designer, I do it just as a hobby.

          About 3x4 matrix opcodes - aren't GPUs mostly bound by texture shader performance? Why would a texture shader need matrices at all? (I don't know - I never wrote a single shader. I wrote lots of CUDA code, but not for graphics.) I would imagine that a texture shader mostly needs multiply-add and bilinear filtering. Lots of that, and no need for complex matrix opcodes.



          • #55
            Originally posted by xfcemint View Post

            About 3x4 matrix opcodes - aren't GPUs mostly bound by texture shader performance? Why would a texture shader need matrices at all? (I don't know - I never wrote a single shader. I wrote lots of CUDA code, but not for graphics.) I would imagine that a texture shader mostly needs multiply-add and bilinear filtering. Lots of that, and no need for complex matrix opcodes.
            Every stage of a GPU is fully programmable, from initial generation of meshes from raw data, through projecting them from 3D space onto 2D with depth, through to selecting the colour for each visible pixel based on textures and calculated light positions. Getting a triangle on screen is basically like writing several CUDA programs that each do a specific part of the 3D image display pipeline. The complexity of graphics vs CUDA is that the GPU needs to schedule hundreds of different programs dynamically, whereas CUDA is usually a single parallel workload.

            In any case, it is important to note that even with CUDA you do not have lots of independent cores running. The GPU cores are optimised in groups to run the same program, accessing the same shared data (uniforms) on different inputs to the same opcodes. Using conditional if/else logic or variable repetition-count loops stalls the other running cores in the group into doing NOPs until everything is executing the same code again. This is why you can fit 100 GPU cores for every CPU core onto your silicon.



            • #56
              Originally posted by OneTimeShot View Post

              Every stage of a GPU is fully programmable, from initial generation of meshes from raw data, through projecting them from 3D space onto 2D with depth, through to selecting the colour for each visible pixel based on textures and calculated light positions. Getting a triangle on screen is basically like writing several CUDA programs that each do a specific part of the 3D image display pipeline.
              I approximately understand the general architecture of 3D graphics software. It has to apply matrices to transform into world coordinates, then perspective projection, which needs division; do z-ordering (and preferably not with a z-buffer); it has to calculate illumination at the vertices; then each compute unit gets a small part of the screen, where it exploits pixel-level parallelism to create warps. Besides per-ray shaders, you can also write various other shaders for other parts of the pipeline.

              I don't even know the correct names: I think the per-pixel shader is actually the one that applies the post-processing effect on the final image.

              The most important is the shader that samples textures. What is it called? It is generally run as a single ray per pixel, but you can have multiple rays per pixel to do some antialiasing. I had to go to Wikipedia: apparently it's called a fragment shader or pixel shader, confusingly.

              Originally posted by OneTimeShot View Post
              The complexity of graphics vs CUDA is that the GPU needs to schedule hundreds of different programs dynamically, whereas CUDA is usually a single parallel workload.

              Well, I don't see a big complexity there. When an SM is done with one thing, the GPU scheduler runs another thing on it. If it runs 16x8 blocks of pixels in simple screen order, it will do fine, but to get an additional 10% performance maybe it can try blocks covering the same triangle.

              Or some variation of that idea; it's all straightforward enough to guess without knowing anything about it.

              Originally posted by OneTimeShot View Post

              In any case, it is important to note that even with CUDA you do not have lots of independent cores running. The GPU cores are optimised in groups to run the same program, accessing the same shared data (uniforms) on different inputs to the same opcodes. Using conditional if/else logic or variable repetition-count loops stalls the other running cores in the group into doing NOPs until everything is executing the same code again.
              Yeah, I guessed that, and it's absolutely the same with CUDA threads. You have to avoid thread DIVERGENCE. A conditional instruction *can* (but doesn't have to) split a warp into two parts, then everything needs to be executed twice (or multiple times).

              Originally posted by OneTimeShot View Post
              This is why you can fit 100 GPU cores for every CPU core onto your silicon.
              The reason is that a GPU "core" is more a conceptual element than a real one. The basic unit is a multiprocessor. One multiprocessor "has" many "cores", but that's just an illusion: it really consists of a SINGLE instruction decoder and a WIDE ALU. Probably it has one thread-mask register, plus a branch-address and thread-mask STACK, so that it can do that thread-divergence re-execution. That isn't stated anywhere in the CUDA documentation, but I imagine that is the actual hardware implementation. And of course, there is a big... what's the name... crossbar (I guess) on the register bus, so that "registers" can be permuted before being an input to the ALU.

              So you have a simple decoder, and each one decodes for many cores. One simple decoder is connected to a big arithmetic machine. I would guess at least 8 FP32 multipliers per decoder, if not 32 or 64. So a GPU is just one big bunch of arithmetic units and a few transistors to control them.
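
              Here is roughly how I imagine that mask machinery working, as a Python sketch (pure guesswork on my part, not from any NVIDIA documentation): one decoder drives a wide ALU, and a branch splits the active mask so the "warp" executes each side in turn, with lanes outside the mask doing NOPs.

              Code:
              WARP = 8
              out = [0] * WARP

              def run_masked(mask, value):
                  for lane in range(WARP):
                      if mask[lane]:
                          out[lane] = value  # only masked-in lanes do real work

              active = [True] * WARP
              cond = [lane % 2 == 0 for lane in range(WARP)]
              taken = [a and c for a, c in zip(active, cond)]
              other = [a and not c for a, c in zip(active, cond)]
              run_masked(taken, 1)   # first pass: "if" side
              run_masked(other, 2)   # second pass: "else" side, re-executed
              print(out)             # [1, 2, 1, 2, 1, 2, 1, 2]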

              What I don't get is: what's the implementation of the register crossbar and the register bus? I thought that tri-state busses are to be avoided on ICs. So how does it manage that huge crossbar with just MUXes and DEMUXes? Maybe it's just a lot of transistors for that crossbar.



              • #57
                Originally posted by xfcemint View Post
                Oh, here is another thing that I just thought of:


                So, perhaps you can just do some simplification of the current OoO design to cut the number of transistors. It doesn't have to be in-order execution at all: just in-order issue and single issue will probably cut out a significant number of transistors.
                if we do it carefully (creatively) we can get away with around 50,000 gates for the out-of-order dependency matrices. a typical 64-bit multiplier is around 15,000 gates, and the DIV/SQRT/RSQRT pipeline i think was... 50,000 gates, possibly higher (it covers all 3 of those functions). we need 4 of those 64-bit multipliers, plus some more ALUs...

                see how those 50,000 gates for the dependency matrices don't look so big? and given that they're one-hot encoded, the power consumption is pretty small.
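
                quick sanity check on those numbers (they're my own estimates from above, so take with salt):

                Code:
                # back-of-the-envelope using the gate counts quoted above
                dep_matrices = 50_000
                multipliers  = 4 * 15_000      # four 64-bit multipliers
                div_pipe     = 50_000          # DIV/SQRT/RSQRT ("possibly higher")
                compute      = multipliers + div_pipe
                print(dep_matrices / (dep_matrices + compute))  # ~0.31
                # under a third of this subtotal goes on scheduling,
                # and that's before counting the remaining ALUs.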

                GPUs typically have something insane like 30% of the entire ASIC dedicated to computation. in a "scalar" CPU it's more like... 2% (!) where even the register files take up more than that!



                • #58
                  Originally posted by xfcemint View Post

                  I've heard of the Tomasulo algorithm. I have a very rough idea of what it does. I'm not a real CPU designer, I do it just as a hobby.
                  the youtube videos on it that are the top hits are pretty good, and make it really clear. once that's understood, i wrote a page on how to topologically "morph" that into an (augmented) 6600 design. it basically involves changing all binary-address lookups (CAMs in particular) into *unary* (one-bit, one-hot) tables, which has the distinct advantage of far less power consumption to make a match (a single AND gate activates rather than a massive suite of XOR gates), and also allows multi-hot, which is, ta-daaa, how you do multi-issue with virtually no extra hardware: https://libre-soc.org/3d_gpu/archite...ransformation/
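
                  here's a tiny illustrative sketch of the binary-vs-unary matching idea (heavily simplified python, names made up, not taken from the page itself):

                  Code:
                  # binary CAM match: a wide XOR-compare per entry
                  def cam_match(tag, entry):
                      return (tag ^ entry) == 0

                  # unary table match: one AND gate per cell. multiple bits
                  # may be "hot" at once: that's what makes multi-issue
                  # nearly free.
                  def unary_match(hot_a, hot_b):
                      return (hot_a & hot_b) != 0

                  r3       = 1 << 3               # register 3, one-hot
                  r3_and_5 = (1 << 3) | (1 << 5)  # multi-hot: two regs at once
                  print(unary_match(r3, r3_and_5))  # True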



                  • #59
                    Originally posted by xfcemint View Post
                    I don't even know the correct names: I think the per-pixel shader is actually the one that applies the post-processing effect on the final image.

                    The most important is the shader that samples textures. What is it called? It is generally run as a single ray per pixel, but you can have multiple rays per pixel to do some antialiasing. I had to go to Wikipedia: apparently it's called a fragment shader or pixel shader, confusingly.
                    There are people on this forum much better qualified to go into that level of detail than me, but Microsoft have a nice diagram here: https://docs.microsoft.com/en-us/win...ith-directx-12

                    Originally posted by xfcemint View Post
                    Well, I don't see a big complexity there. When an SM is done with one thing, the GPU scheduler runs another thing on it. If it runs 16x8 blocks of pixels in simple screen order, it will do fine, but to get an additional 10% performance maybe it can try blocks covering the same triangle.
                    The pixel shader normally runs by triangle, but I think that is the basic model. If you are transforming vertices, do 1000 at a time; if you are calculating the colour of pixels, do 1000 at a time; etc. My understanding is that the scheduling is where the real smarts are (especially loading caches at the right time and so forth). Anyone can put 5000 cores on a chip, but the complexity is getting them all work to do.

                    Originally posted by xfcemint View Post

                    What I don't get is: what's the implementation of the register crossbar and the register bus? I thought that tri-state busses are to be avoided on ICs. So how does it manage that huge crossbar with just MUXes and DEMUXes? Maybe it's just a lot of transistors for that crossbar.
                    I have a couple of friends working in the industry you'd enjoy talking to, but you have exceeded my knowledge now.



                    • #60
                      Originally posted by xfcemint View Post
                      The most important is the shader that samples textures.
                      yeah, here we will need a special opcode that takes an array of 4 pixel values, (N,M), (N+1,M), (N,M+1), (N+1,M+1), and an xy pair from 0.0 to 1.0. the pixel value returned (ARGB) will be the bilinear interpolation between the 4 incoming pixel values, according to the xy coordinates.

                      trying to do that in software rather than having a single-cycle (or pipelined) opcode was exactly why Larrabee failed.
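
                      a quick software model of that opcode (function name and layout are mine, purely illustrative):

                      Code:
                      def bilerp(p00, p10, p01, p11, x, y):
                          """p00=(N,M), p10=(N+1,M), p01=(N,M+1),
                          p11=(N+1,M+1); each an ARGB tuple,
                          x and y in [0.0, 1.0]."""
                          def lerp(a, b, t):
                              return a + (b - a) * t
                          return tuple(
                              lerp(lerp(a, b, x), lerp(c, d, x), y)
                              for a, b, c, d in zip(p00, p10, p01, p11)
                          )

                      # dead centre of four greyscale texels:
                      print(bilerp((255,)*4, (0,)*4, (0,)*4, (255,)*4,
                                   0.5, 0.5))
                      # (127.5, 127.5, 127.5, 127.5)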


                      Originally posted by xfcemint View Post
                      Yeah, I guessed that, and it's absolutely the same with CUDA threads. You have to avoid thread DIVERGENCE. A conditional instruction *can* (but doesn't have to) split a warp into two parts, then everything needs to be executed twice (or multiple times).

                      Originally posted by xfcemint View Post
                      What I don't get is: what's the implementation of the register crossbar and the register bus? I thought that tri-state busses are to be avoided on ICs. So how does it manage that huge crossbar with just MUXes and DEMUXes? Maybe it's just a lot of transistors for that crossbar.
                      basically yes. and it's something that can be avoided with "striping".

                      if you have to add vectors of length 4 all the time, you *know* that A[0] = B[0] + C[0] is never going to interact with A[3] = B[3] + C[3].

                      therefore what you do is: you *stripe* the register file (into 4 "lanes") so that R0 can *never* interact with R1,R2,R3, but ONLY with R4, R8, R12, R16 etc. likewise R1 can *never* interact with anything other than R5, R9, R13, R17 etc.

                      of course that's a bit s*** for general-purpose computing, so you add some slower data paths (maybe a shift register or a separate broadcast bus) but at least you didn't have to have a massive 4x4 64-bit crossbar taking up thousands of gates and bristling with wires.

                      turns out that one of the major problems for crossbars is not the number of MUXes, it's the number of wires in and out.
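
                      the striping rule, as an illustrative python one-liner or two:

                      Code:
                      LANES = 4

                      def lane(reg):
                          # R0,R4,R8... -> lane 0; R1,R5,R9... -> lane 1; etc.
                          return reg % LANES

                      def fast_path(a, b):
                          # the full-speed datapath only exists within a lane;
                          # cross-lane traffic takes the slower
                          # broadcast/shift route mentioned above
                          return lane(a) == lane(b)

                      print(fast_path(0, 4))   # True:  R0 and R4 share lane 0
                      print(fast_path(0, 1))   # False: R0 never meets R1 directly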

