Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • #61
    Originally posted by lkcl

    25 years ago i got such bad RSI (known as carpal tunnel in the U.S.) that i had to minimise typing. it got so bad that one day i couldn't get into my house because i couldn't turn the key in the lock.

    like the "pavlov dog", if it actually physically hurts to stretch your fingers just to reach a shift key, pretty soon you stop doing it. however when it comes to proper nouns, sometimes i find that the respect that i have for such words "over-rides" the physical pain that it causes me to type the word.
    Sounds painful - sorry to hear.

    Best of luck with the project!



    • #62
      Originally posted by lkcl
      yeah here we will need a special opcode that takes an array of 4 pixel values, (N,M) (N+1,M), (N,M+1), (N+1,M+1), and an xy pair from 0.0 to 1.0. the pixel value returned (ARGB) will be the linear interpolation between the 4 incoming pixel values, according to the xy coordinates.
      As I understand it, the GPU-like solution is to have a special texture unit, which is a completely separate unit. The texture unit does the bilinear interpolation.

      Your solution is a special opcode for bilinear interpolation. Your opcode takes a cycle; it uses resources in the OoO scheduler and uses CPU registers to store pixel RGB values. Pixel RGB values have to be separately loaded (from a texture cache? Or from a general-purpose L1 cache?), which wastes even more cycles. You don't want to waste those cycles, because the bottleneck of your solution is the instruction issue throughput.

      In comparison, a separate texture unit needs 0 additional cycles to do bilinear filtering (since the inputs are texture sample coordinates x,y in texture coordinate space). A separate texture unit has direct access to memory. The downside is that a texture unit needs lots of multipliers (for bilinear filtering) and it needs its own texture cache. So, that is a lot of transistors for a texture unit, but the utilization is usually excellent, much better than in your current solution.

      As for Larrabee, if Intel didn't at least do a separate opcode for bilinear filtering, well, that is "beyond stupid". I thought they had a texture unit attached to each x86 core; that would be a minimum for a GPU from my point of view.

      It is very hard for me to predict how your solution is going to perform. You have a big OoO scheduler and a rather small ALU compared to a GPU.
      That kind of CPU has very limited compute resources to compete with a GPU. You try to compensate by using triple instruction issue to get better utilization of ALU units.

      Well, if you already have an OoO scheduler, at least your solution looks relatively simple to design.

      What can I say? The simplest way to increase the compute capability of your CPU is SIMD. In that case, you need to look at the common shader code and how it can be compiled for your CPU. Shaders can be run massively in parallel, so how is your CPU going to exploit that?

      Think about adding a separate texture unit, connected to the ALU. The texture unit has its own cache and direct access to memory.

      I'm going to think some more about it, but I think that I don't have sufficient knowledge and experience on this subject to be of much further help. It is too hard to see how the shader code is going to be compiled and how much compute capability per transistor your CPU can provide.



      • #63
        Originally posted by xfcemint

        As I understand it, the GPU-like solution is to have a special texture unit, which is a completely separate unit. The texture unit does the bilinear interpolation.

        Your solution is a special opcode for bilinear interpolation. Your opcode takes a cycle; it uses resources in the OoO scheduler and uses CPU registers to store pixel RGB values. Pixel RGB values have to be separately loaded (from a texture cache? Or from a general-purpose L1 cache?), which wastes even more cycles. You don't want to waste those cycles, because the bottleneck of your solution is the instruction issue throughput.

        In comparison, a separate texture unit needs 0 additional cycles to do bilinear filtering (since the inputs are texture sample coordinates x,y in texture coordinate space). A separate texture unit has direct access to memory. The downside is that a texture unit needs lots of multipliers (for bilinear filtering) and it needs its own texture cache. So, that is a lot of transistors for a texture unit, but the utilization is usually excellent, much better than in your current solution.
        ok, so there's two things here:

        1) it looks like you know quite a bit more than me about GPU texturisation (i'm learning!). when we get to that part i was counting on input from Jacob (designer of Kazan), Mitch Alsup (one of the architects behind Samsung's recent GPU), and so on.

        2) how 6600-style OoO works. this bit i *do* know about, and i forgot to mention something: namely that the way it works is, every "operation" is monitored for its start and completion, and it *doesn't matter* if it's an FSM (like a DIV unit), a single-stage pipeline, a multi-stage pipeline, or an early-out variable-length pipeline: the only thing that matters is that every operation is "monitored" from start to finish, 100% without fail.

        consequently, what you describe (the texture unit, with its cache), *can be slotted in as a Function Unit into the 6600 OoO architecture*.

        in fact, we could if necessary add many more than just one of them.
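        to illustrate (a rough behavioural sketch in plain python - *not* the actual nmigen code, and the class/signal names here are made up): the Dependency Matrices only ever see "busy" and "done", regardless of what kind of Function Unit sits behind them.

            class FunctionUnit:
                """anything that starts, runs for some cycles, then signals done."""
                def __init__(self, name, latency):
                    self.name = name
                    self.latency = latency
                    self.busy = False
                    self.count = 0

                def issue(self):
                    # the scoreboard raises "go" only once read/write hazards are clear
                    self.busy = True
                    self.count = self.latency

                def tick(self):
                    # the Dependency Matrices watch just busy/done: they never need
                    # to know whether this is a DIV FSM, a MUL pipeline or a texture unit
                    if self.busy:
                        self.count -= 1
                        if self.count == 0:
                            self.busy = False
                            return True    # "done" strobe back to the scoreboard
                    return False

            # a texture unit slots in exactly like any other Function Unit:
            units = [FunctionUnit("DIV", 12), FunctionUnit("MUL", 3),
                     FunctionUnit("TEXTURE", 20)]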


        As for Larrabee, if Intel didn't at least do a separate opcode for bilinear filtering, well, that is "beyond stupid". I thought they had a texture unit attached to each x86 core; that would be a minimum for a GPU from my point of view.
        not that i am aware of (at least, certainly Nyuzi did not, because jeff deliberately "tracked" and researched the "pure soft GPU" angle).

        It is very hard for me to predict how your solution is going to perform.
        we don't know either! however what we have is a strategy for calculating that (based on Jeff's Nyuzi work, well worth reading) where he shows how not only to measure "pixels / clock" performance, but also how to work out which bits of any given algorithm are contributing to the [lack of] performance.

        he then shows precisely how to optimise the *architecture* to get better performance. and there are some real surprises in it: the L1 cache munches enormous amounts of power, for example. you should be able to track the paper down via this https://www.researchgate.net/publica...pen_source_GPU




        You have a big OoO scheduler and a rather small ALU compared to a GPU.
        ah no, you misunderstand: setting the sizes, capabilities and quantities *of* each ALU is a matter of dialing a parameter in a python dictionary. we literally change the number of MUL pipelines with a single line of code, and the ALUs go ballistic and so does the number of gates.
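        purely as an illustration (this is *not* the real configuration dictionary, the names here are made up - it's just the shape of the idea): change one number and the build instantiates more pipelines.

            alu_config = {
                "mul":     {"count": 2, "width": 64},   # change 2 -> 4 here and the
                "div":     {"count": 1, "width": 64},   # build puts down more pipelines
                "shift":   {"count": 2, "width": 64},
                "logical": {"count": 2, "width": 64},
            }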

        the question then becomes: *should* you do that, what's the business case for doing so, and will people pay money for the resultant product?

        That kind of CPU has very limited compute resources to compete with a GPU. You try to compensate by using triple instruction issue to get better utilization of ALU units.
        exactly, but more than that, we can crank up the number *of* ALUs - of each different type - to dial in the performance according to what we find out when we get to run benchmarks, just like Jeff Bush did.

        Well, if you already have an OoO scheduler, at least your solution looks relatively simple to design.
        eexxaaactlyyy, where because we have the pieces in place we're not going "omg, omg we backed ourselves into a corner with this stupid in-order design, why the hell did we do that, arg arg we now have to chuck out everything we've developed over the past NN months and start again".

        by tackling head-on what's normally considered to be a "hard" processor design we've given ourselves the design flexibility to just go "yawn, did we need to up the number of DIV units again? ahh let me just stretch across to the keyboard and change that 2 to a 3, ahh that was really difficult"

        What can I say? The simplest way to increase the compute capability of your CPU is SIMD.
        if you look at the new POWER10 architecture - which is (*splutter*) 8-way multi-issue - they actually put down *two* separate and distinct 128-bit VSX ALUs / pipelines.

        we can do exactly the same thing.

        we can put down *MULTIPLE* SIMD ALUs.

        we do *NOT* have to do the insanity of increasing the SIMD width from 64 bit to 128 bit to 256 bit to 512 bit.

        we have the option - the flexibility - to put down 1x 64-bit SIMD ALUs, 2x 64-bit SIMD ALUs, 3x, 4x, 8x - and have the OoO Execution Engine take care of it.

        (actually, what's more likely to happen, when we come to do high-performance CPU-GPUs, because the Dependency Matrices increase in size O(N^2), is that we'll put down 1x 64-bit SIMD ALU, 1x 128-bit SIMD ALU, 1x 256-bit SIMD ALU and so on)

        and, remember: all of this complexity - whatever happens at the back-end - is entirely hidden from the developer with a "VL" front-end, all exactly the same programs. *no* need to even know that there's NNN back-end SIMD units.
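        as a rough behavioural sketch (plain python, made-up function name, not the actual SV spec): this is the programmer's view - the same loop runs unchanged whether the back-end has one 64-bit SIMD ALU or eight of them.

            def vector_add(a, b, MAXVL=64):
                result = []
                i = 0
                while i < len(a):
                    vl = min(MAXVL, len(a) - i)         # "setvl": hardware picks the chunk
                    for e in range(vl):                 # conceptually ONE vector instruction;
                        result.append(a[i+e] + b[i+e])  # the back-end spreads the elements
                    i += vl                             # over however many SIMD lanes exist
                return result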

        In that case, you need to look at the common shader code and how it can be compiled for your CPU. Shaders can be run massively in parallel, so how is your CPU going to exploit that?
        see above. and, also, remember, if it gets architecturally too complex for a single CPU, we just increase the number of cores on the SMP NOC instead (look up OpenPITON, it's one of the options we can use, to go up to 500,000 cores).


        Think about adding a separate texture unit, connected to the ALU. The texture unit has its own cache and direct access to memory.
        ok so if it has memory access, then that's a little more complex, because the LDs / STs also have to be monitored by Dependency Matrices. yes, really: in an OoO architecture you cannot let LDs / STs go unmonitored either, because otherwise you get memory corruption.

        studying and learning about this (and properly implementing it) i think took about 2 out of those 5 months of learning about augmented 6600 from Mitch Alsup.



        I'm going to think some more about it, but I think that I don't have sufficient knowledge and experience on this subject to be of much further help. It is too hard to see how the shader code is going to be compiled and how much compute capability per transistor your CPU can provide.
        we honestly don't know yet (and can only have an iterative strategy to "see what happens"). in practical terms we're still at the "let's get the scalar core operational" phase along with "planning the pieces in advance ready for adding GPU stuff", preparing the groundwork for entering that "iterative feedback loop" phase (just like Jeff Bush did on Nyuzi). for which, actually, this conversation has been fantastic preparation, very grateful for the opportunity.



        • #64
          Originally posted by lkcl
          ok so if it has memory access, then that's a little more complex, because the LDs / STs also have to be monitored by Dependency Matrices. yes, really: in an OoO architecture you cannot let LDs / STs go unmonitored either, because otherwise you get memory corruption.

          studying and learning about this (and properly implementing it) i think took about 2 out of those 5 months of learning about augmented 6600 from Mitch Alsup.
          A texture unit has read-only access to memory. All textures are basically just a huge array of constants. I think the OoO unit doesn't need to monitor that, because there is absolutely no need to monitor constants. Even if a texture read gets corrupted, there is no problem: it's just one pixel having a wrong color. Nobody notices that.

          You need to have at most one texture unit per core. In your case, it will be exactly one texture unit, for simplification.

          A texture unit can be pipelined or not pipelined. A non-pipelined unit would accept a SIMD request to produce about 8 samples.

          The inputs for a request are:
          - texture address,
          - texture width and height in pixels, pitch,
          - pixel format,
          - (x,y) sample coordinates, x8 for 8 samples
          - optionally, a transformation matrix for x,y coordinates

          In some texture units, all the samples in a single request must be from the same texture. I think that is not strictly necessary, but it probably reduces the complexity of the texture unit.
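          As a sketch (illustrative field names only, not a real specification), a single non-pipelined request carrying the inputs listed above could look like this:

              from dataclasses import dataclass
              from typing import List, Optional, Tuple

              @dataclass
              class TextureRequest:
                  base_address: int                   # where the texture starts in memory
                  width: int                          # texture width in pixels
                  height: int                         # texture height in pixels
                  pitch: int                          # bytes per row (of blocks)
                  pixel_format: int                   # RGBA8, a compressed block format, ...
                  coords: List[Tuple[float, float]]   # the 8 (x, y) sample coordinates
                  xform: Optional[Tuple[float, float,
                                        float, float]] = None   # optional 2x2 matrix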

          A texture unit usually stores a block of 4x4 pixels in a single cache line. The textures in GPU memory use the same format: 4x4 pixel blocks. Textures might also use a Lebesgue (Z-order) curve. So, there are 16 pixels in a block, but they don't have to be in RGBA format. "Pixel format" can be something really crazy. That's how texture compression works. It reduces memory bandwidth.
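          To show the addressing consequence of 4x4 blocking, here is a sketch assuming an uncompressed 4 bytes/pixel format and a 64-byte cache line (a compressed format would look different):

              def texel_address(base, x, y, width_in_blocks, bytes_per_block=64):
                  bx, by = x // 4, y // 4                   # which 4x4 block the pixel is in
                  block_index = by * width_in_blocks + bx   # row-major block order here;
                  # a real unit might order blocks along a Z-order / Lebesgue curve instead
                  offset_in_block = ((y % 4) * 4 + (x % 4)) * 4   # 4 bytes per pixel
                  return base + block_index * bytes_per_block + offset_in_block

          The four neighbouring pixels of a bilinear tap usually land in the same 64-byte block, which is the whole point of the layout.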

          The problem of adding a texture unit to your design is to figure out how to keep it utilized, because shaders don't do texture sampling all the time. When shaders are doing something else, the texture unit is doing nothing, wasting power.
          What latency should the texture unit have? Should it be a low-latency, SIMD, non-pipelined design, or a high-latency, pipelined design?








          • #65
            The problem with adding a texture unit is that it is a lot of work.

            It is much, much easier to just use a special instruction for bilinear filtering.

            So, for a start, perhaps it is a better idea not to use a texture unit.



            • #66
              Originally posted by lkcl
              and, remember: all of this complexity - whatever happens at the back-end - is entirely hidden from the developer with a "VL" front-end, all exactly the same programs. *no* need to even know that there's NNN back-end SIMD units.
              You can do that in hardware? I didn't know that was possible. I have never heard of a front-end that can fuse instructions into SIMD for the back-end. That looks just too crazy to me. Even doing this in a software compiler is a serious problem.

              You must mean: the shader compiler fuses a small number of shader thread instances into a single CPU thread to create opportunities for using SIMD. This one CPU thread can be called a warp, since it actually handles a few shader instances simultaneously.
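              As a toy illustration of that fusion (not any particular compiler's output): four independent fragment-shader invocations of r = a * b + c become one loop body that the back-end can map onto SIMD lanes.

                  def fused_warp_mad(a4, b4, c4):
                      # a4, b4, c4 each hold the inputs of 4 shader instances ("a warp of 4")
                      return [a4[i] * b4[i] + c4[i] for i in range(4)]   # one MAD per lane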



              • #67
                You can replace the entire functionality of a texture unit by a few special instructions in your CPU:

                1. a custom instruction for bilinear filtering (you already have that)
                2. an instruction for 2x2 matrix transform
                3. an instruction to load a 2x2 pixel block from a texture.

                About item 3, you can do some complex stuff there if you want. For example, you can postulate that textures are stored as 4x4 blocks of pixels, aligned, and the instruction has to handle that. The additional complexity is that the instruction may need to load pixel data from multiple pixel blocks.
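                For item 1, a software model of the intended result (matching the opcode lkcl described earlier: four neighbouring ARGB pixels plus an (x, y) fraction in 0.0..1.0) could be a reference sketch like this, not the hardware design:

                    def bilerp_pixel(p00, p10, p01, p11, x, y):
                        # p00=(N,M), p10=(N+1,M), p01=(N,M+1), p11=(N+1,M+1); each is (a, r, g, b)
                        out = []
                        for c in range(4):                         # interpolate each ARGB channel
                            top = p00[c] * (1.0 - x) + p10[c] * x  # blend along the top row
                            bot = p01[c] * (1.0 - x) + p11[c] * x  # blend along the bottom row
                            out.append(top * (1.0 - y) + bot * y)  # then blend between the rows
                        return tuple(out)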



                • #68
                  There is one more thing required to replace the functionality of a texture unit: a texture cache. A texture cache should be shared by all GPU cores on a die; the cache is read-only, and it has direct access to memory. Each CPU-GPU core has special LOAD instruction(s) to load data from the texture cache.

                  A texture cache does not need to be very fast (as long as your OoO engine can find other stuff to do while waiting for data from the cache). A benefit of a texture cache is that it reduces the required bandwidth to main memory.



                  • #69
                  Originally posted by xfcemint
                    The problem with adding a texture unit is that it is a lot of work.

                    It is much, much easier to just use a special instruction for bilinear filtering.

                    So, for a start, perhaps it is a better idea not to use a texture unit.
                    this sounds exactly like the kind of useful strategy that would get us some reasonable performance without a full-on approach. as a hybrid processor it would fit better, and it's also much more along the lines of the RISC strategy. thank you for the suggestion, i've documented it here https://bugs.libre-soc.org/show_bug.cgi?id=91



                    • #70
                      Originally posted by lkcl

                      if we do it carefully (creatively) we can get away with around 50,000 gates for the out-of-order dependency matrices. a typical 64-bit multiplier is around 15,000 gates, and the DIV/SQRT/RSQRT pipeline i think was... 50,000 gates, possibly higher (it covers all 3 of those functions). we need 4 of those 64-bit multipliers, plus some more ALUs...
                      The word "gates" is ambiguous to me. It could mean a CMOS implementation of AND, OR, NOT logic gates. Also, there are two possible versions of those: the ones with an obligatory complement output, or the ones without it. "Gates" could also mean the total number of transistors.

                      By the numbers you are posting, I guess you are speaking of CMOS gates, with about 4 transistors per gate.

                      If you can fit an entire GPU core in less than 5 million transistors, you are flying. So, I would do about one million transistors for a decoder plus the OoO engine, one million for L1 instructions, one million for L1 data. Then see what ALU units you need to maximize compute power.
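                      A back-of-envelope check, using the gate counts you quoted and the ~4 transistors per gate rule of thumb (rough estimates only):

                          GATES = {
                              "dependency_matrices": 50_000,
                              "mul64_x4":            4 * 15_000,
                              "div_sqrt_rsqrt":      50_000,
                          }
                          TRANSISTORS_PER_GATE = 4

                          ooo_plus_alus = sum(GATES.values()) * TRANSISTORS_PER_GATE   # = 640,000
                          budget = 5_000_000
                          left_over = budget - ooo_plus_alus - 1_000_000 - 1_000_000   # minus L1I, L1D
                          print(ooo_plus_alus, left_over)   # ~0.64M so far, ~2.36M transistors left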

                      That is all going to run very hot, so you need very low clocks for GPU cores. Save power in any way you can.

