Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source

  • Here is a solution that is in line with your thinking, but I think it is not very good, for many reasons. The primary problem is the increased complexity.

    A GPU core is capable of executing 2 warps in a time-multiplexing fashion. Only one warp starts executing. When one warp stalls due to a fetch from main memory, this is somehow detected.

    Then the current warp is paused, waiting for the memory fetch, and in the meantime the second warp executes.

    Downsides and problems:
    - requires practically doubling the number of registers
    - requires the entire shader program to fit in L1i
    - requires both warps to need the same data from L1d
    - requires additional complexity in the scheduler
    - requires additional program counter and a mechanism to switch to the other warp's program counter
    - many other complexities added

    And the worst:
    - Adds less than 20% additional performance, at the cost of great additional complexity and design problems. Also note: you can't run the cores at a faster clock speed because you have a limited power budget.

    I suggest not even considering this complex variant.
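
    To make the switching idea concrete, here is a rough Python sketch of the mechanism (purely illustrative: the two-warp limit, the fixed memory latency and the toy "programs" are all made up, not a claim about the real design):

    # Toy model: two warps share one core, and the scheduler switches to the
    # other warp whenever the current one stalls on a main-memory fetch.
    # All numbers are invented for illustration.

    MEM_LATENCY = 20  # pretend a fetch from main memory takes 20 cycles

    class Warp:
        def __init__(self, name, program):
            self.name = name
            self.program = program      # list of "ALU" / "MEM" ops
            self.pc = 0                 # each warp needs its own program counter
            self.stalled_until = 0      # cycle at which a pending fetch completes

        def done(self):
            return self.pc >= len(self.program)

    def run(warps):
        cycle = issued = 0
        current = 0
        while not all(w.done() for w in warps):
            w = warps[current]
            if w.done() or w.stalled_until > cycle:
                # current warp is stalled or finished: the extra scheduler
                # logic tries to switch to the other warp's program counter
                other = warps[(current + 1) % len(warps)]
                if not other.done() and other.stalled_until <= cycle:
                    current = (current + 1) % len(warps)
                cycle += 1
                continue
            op = w.program[w.pc]
            w.pc += 1
            issued += 1
            if op == "MEM":
                w.stalled_until = cycle + MEM_LATENCY
            cycle += 1
        print(f"{cycle} cycles, {issued / cycle:.0%} issue-slot utilization")

    # one warp alone vs. two warps hiding each other's stalls; note that this
    # toy ignores every cost listed above, so it overstates the benefit
    run([Warp("A", ["ALU", "MEM"] * 8)])
    run([Warp("A", ["ALU", "MEM"] * 8), Warp("B", ["ALU", "MEM"] * 8)])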

    Comment


    • Isn't it simpler to design a simpler core, and just add more cores? How about that solution? That would be my solution.

      Comment


      • Originally posted by xfcemint View Post
        Isn't it simpler to design a simpler core, and just add more cores? How about that solution? That would be my solution.
        That could work, assuming GPU instructions are included; it's just less area efficient because of the large number of instruction decoders, icaches, schedulers, etc. compared to the area dedicated to ALUs and the data path. That's the route Esperanto took with their 4096-core behemoth.

        Comment


        • Originally posted by programmerjake View Post

          That could work, assuming GPU instructions are included; it's just less area efficient because of the large number of instruction decoders, icaches, schedulers, etc. compared to the area dedicated to ALUs and the data path. That's the route Esperanto took with their 4096-core behemoth.
          It is very hard to estimate the optimal area to devote to the various CPU units.

          Yes, when you have a simpler GPU, then it might be wasting area on an instruction decoder. On the other hand, the compiler can actually take advantage of this. It enables the compiler to schedule instructions better.

          I estimate that a GPU with an OoO engine is actually not wasting area on an instruction decoder, because the decoder will get good utilization. I'm assuming that the instruction decoder is sufficiently simple. So, no x86.

          Comment


          • Whether a simple GPU is wasting area on instruction caches... well, again, it is hard to tell.

            I would say that a GPU with an OoO engine should have about 8 KiB of L1i, and it is such a good match that it doesn't really matter if it is a bit smaller or bigger. So, the L1i size is quite flexible there.

            If you really need a smaller L1i, then the optimal design is something like 1 KiB L1i with 32 KiB L2i (and some good prediction in the prefetcher). That's a lot of additional work for a design, while the benefits are minimal.

            Comment


            • Ok, a simpler GPU core is wasting area on 8 KiB L1i in the sense that the data in L1i would be duplicated across many cores. But, how much area is actually lost for this simplification of the design? I think not much.

              Comment


              • Wait a second. The GPU core is supposed to use vectorized instructions, as I understood earlier. That means the core is quite fat already, and definitely not wasting area on an instruction decoder or schedulers. In the presence of vectorized instructions, even triple issue is excessive. Vectorized instructions seem like a good idea to me.

                With vectorized instructions, the GPU core is fat enough.

                Comment


                • Originally posted by xfcemint View Post
                  Wait a second. The GPU core is supposed to use vectorized instructions, as I understood earlier. That means the core is quite fat already, and definitely not wasting area on an instruction decoder or schedulers. In the presence of vectorized instructions, even triple issue is excessive. Vectorized instructions seem like a good idea to me.

                  With vectorized instructions, the GPU core is fat enough.
                  Part of why the processor can decode and execute multiple instructions per clock is that we want it to also be decent at CPU tasks. The other part is that vectorized instructions are not the only kind of GPU instructions; there are also scalar operations that need to run for things like computing addresses, loop counters, execution mask housekeeping (for implementing SPIR-V's SIMT machine model) and more.
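
                  To give a feel for that scalar housekeeping, here is a rough Python sketch of how a per-lane if/else might be lowered under an execution mask. The lane count and helper names are invented purely for illustration; this is not actual compiler output, just a picture of why scalar mask, address and counter work sits next to the vector ALU ops.

                  # Rough sketch: a per-lane if/else lowered to masked vector ops.
                  # The loop counter, address computation and execution-mask
                  # updates are all scalar work beside the vector ALU instructions.
                  # Assumes len(xs) is a multiple of LANES.

                  LANES = 4  # invented warp width

                  def vec_merge(mask, a, b):
                      # vector select: a[i] where the lane's mask bit is set, else b[i]
                      return [a[i] if (mask >> i) & 1 else b[i]
                              for i in range(LANES)]

                  def kernel(xs, base_addr):
                      out = []
                      for i in range(0, len(xs), LANES):   # scalar: loop counter
                          addr = base_addr + i * 4         # scalar: address computation
                          v = xs[i:i + LANES]              # vector load (conceptually from addr)
                          cmp = sum(1 << lane for lane in range(LANES)
                                    if v[lane] > 0)        # vector compare -> lane mask
                          exec_mask = (1 << LANES) - 1     # scalar: all lanes active
                          then_mask = exec_mask & cmp      # scalar: mask housekeeping
                          else_mask = exec_mask & ~cmp
                          doubled = [x * 2 for x in v]     # vector ALU op ("then" side)
                          negated = [-x for x in v]        # vector ALU op ("else" side)
                          res = vec_merge(then_mask, doubled, [0] * LANES)
                          res = vec_merge(else_mask, negated, res)
                          out += res
                      return out

                  # positive lanes get doubled, the rest get negated
                  print(kernel([3, -1, 5, -2], base_addr=0x1000))  # -> [6, 1, 10, 2]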

                  Comment


                  • Originally posted by programmerjake View Post

                    Part of why the processor can decode and execute multiple instructions per clock is that we want it to also be decent at CPU tasks. The other part is that vectorized instructions are not the only kind of GPU instructions; there are also scalar operations that need to run for things like computing addresses, loop counters, execution mask housekeeping (for implementing SPIR-V's SIMT machine model) and more.
                    Well, as I said earlier, you definitely need somewhat different GPU and CPU cores. You can't have identical cores; that would be "beyond stupid". The CPU core is a big, fat, power-hungry core.

                    For the GPU, you have to tune the decode capability and the issue width to match the GPU's complexity. So, I propose reducing the issue width to single or double issue, and the decode capability to one or two decoded instructions per clock.

                    Then you can't say that the GPU is wasting area on a decoder, since the decode capability is matched with the rest of the GPU.

                    Comment


                    • Originally posted by lkcl View Post


                      so that's 6 transistors (aka 3 "gates").

                      the other post says you can expect about 55% "efficiency" (50-ish percent "crud" for addressing). so let's say 32k cache, that's x3 for gates, x2 for "crud" so that's 192k gates (which if you reaaally want to do it in transistors is 384k transistors)

                      yyeah you weren't far off if you were expecting 64k L1 cache sizes
                      So, 1 cache bit is 6 transistors plus another 5 for "crud". That is 11 transistors per bit, or 88 transistors per byte. An 8 KiB cache would have 88 * 8192 transistors, that is 0.7 million transistors.

                      So, you miscalculated badly there. Did you forget to turn bytes into bits?

                      8 KiB cache requires 0.7 million transistors, just as I said.

                      On a 45 nm process, you can fit about 2 GPU cores and one CPU core, so this will mostly be a toy for developers. That is OK for getting some real hardware into developers' hands for testing (if you have the extra funding to print it). In order for this SoC to really make it, it will require a 28 nm or better process.
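
                      For anyone who wants to check the arithmetic, here it is spelled out as a tiny Python calculation. The 6-transistor bit cell and the roughly 55% efficiency (about 5 extra "crud" transistors per bit) are taken from the quoted post; everything else follows from that.

                      BIT_CELL = 6        # 6T SRAM cell, from the quoted post
                      CRUD_PER_BIT = 5    # ~55% efficiency -> ~5 extra transistors per bit

                      def cache_transistors(size_bytes):
                          bits = size_bytes * 8                 # the bytes-to-bits step
                          return bits * (BIT_CELL + CRUD_PER_BIT)

                      for kib in (8, 32):
                          t = cache_transistors(kib * 1024)
                          print(f"{kib} KiB: ~{t / 1e6:.2f} million transistors")

                      # prints roughly 0.72 million for 8 KiB and 2.88 million for 32 KiB,
                      # which is where the 0.7 million figure above comes from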

                      Comment
