Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • The consequence of the design you have selected - a hybrid CPU/GPU - is that the GPU cores are inherently going to be quite large. It is therefore extremely important to try to decrease the size of the GPU cores.

    But, of course, you can't make one GPU core component very small while another stays big. So you have an OoO engine (which is OK), and that dictates certain minimum cache sizes.



    • Let's say you have 32 GPU cores running at 800 MHz.

      A 720p screen has one million pixels, but the GPU must actually run 2 million rays due to various inefficiencies. That's 120 million rays per second at 60 FPS.

      32 GPU cores at 800 MHz are going to do 25 billion clocks. Only half of the frame time is reserved for running the fragment shaders, so there are 12.5 billion clocks reserved for fragment shaders. That's around 100 clocks per fragment shader (100 clocks per ray). The triple-issue OoO engine is going to successfully issue slightly more than one instruction per clock on average, so the shader (ray) can have slightly more than 100 instructions executed.

      100 instructions per fragment shader on average, that seems about right to me.

      All estimates I did are very rough.
      Last edited by xfcemint; 23 September 2020, 10:03 AM.
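
      The budget arithmetic above can be re-run in a few lines; every input below is the poster's rough assumption, not a measured figure:

```python
# Re-running the fragment-shader clock budget from the post above.
# All inputs are the poster's rough assumptions, not measured figures.
cores = 32
clock_hz = 800e6
fps = 60
rays_per_frame = 2e6            # ~1M visible pixels, ~2x for inefficiencies

rays_per_second = rays_per_frame * fps      # 120e6 rays/s
total_clocks = cores * clock_hz             # 25.6e9 clocks/s
shader_clocks = total_clocks / 2            # half the frame time for fragment shaders

clocks_per_ray = shader_clocks / rays_per_second
print(f"~{clocks_per_ray:.0f} clocks per ray")
```

      With the unrounded 25.6 billion clocks this comes out to about 107 clocks per ray, consistent with the "around 100" estimate.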



      • All the GPU manufacturers are betting that it is impossible to design a sufficiently small and sufficiently efficient OoO GPU core. So, their cores are not OoO. They are so self-assured about this that they are not even making an attempt.

        I think they are very wrong. A small, efficient OoO core would wipe the floor with their idiotic designs.



        • Originally posted by xfcemint View Post
          For the GPU design, you have to consider that the SAMPLE instruction (or anything similar) needs to occasionally stall because the data needs to be fetched from main memory. That would be a long stall.
          the typical solution here is to ensure that there are multiple SAMPLE pipelines / FSMs. stalling only occurs in an OoO design when there are no free units, i.e. when every unit is currently occupied creating results (remember that all Computation Units *must* monitor results categorically WITHOUT FAIL from start to finish).

          therefore all that need be done is to calculate the desired (target) SAMPLEs/sec rate, take the length of time taken for any one SAMPLE instruction, divide those numbers and that tells you how many such SAMPLE Computation Units are needed.

          of course, you then need to crank up the data paths to cope...
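
          The sizing rule described above can be sketched numerically; the target rate and latency used below are hypothetical placeholders, not figures from the thread:

```python
import math

def sample_units_needed(target_samples_per_sec, sample_latency_cycles, clock_hz):
    """Number of SAMPLE Computation Units needed to sustain a target
    completion rate, given the latency (in cycles) of one SAMPLE."""
    completions_per_unit = clock_hz / sample_latency_cycles
    return math.ceil(target_samples_per_sec / completions_per_unit)

# Hypothetical numbers: 120M samples/s target, 20-cycle SAMPLE, 800 MHz clock.
units = sample_units_needed(120e6, 20, 800e6)
print(units)  # 3
```

          The same function shows how quickly the unit count grows if the average latency rises, which is the point contested in the replies below.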



          • Originally posted by xfcemint View Post
            Let's say you have 32 GPU cores running at 800 MHz.
            that's a lot!

            if they all have dual-issue and can do 2x FP32 SIMD, that's 4x FMACs per clock, i.e. 8 FLOPs/clock, which at 800mhz is 6.4 GFLOPs per core. 32 cores is 205 GFLOPs which is... woo!
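
            spelling that arithmetic out (the core configuration - dual-issue, 2-wide FP32 SIMD - is the assumption stated above):

```python
# Peak-FLOPs arithmetic for the assumed core configuration.
issue_width = 2          # dual-issue
simd_fp32 = 2            # 2x FP32 SIMD lanes
flops_per_fmac = 2       # one FMAC = a multiply plus an add
clock_hz = 800e6
cores = 32

flops_per_core = issue_width * simd_fp32 * flops_per_fmac * clock_hz
total_flops = flops_per_core * cores
print(f"{flops_per_core / 1e9:.1f} GFLOPs/core, {total_flops / 1e9:.0f} GFLOPs total")
```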



            • Originally posted by lkcl View Post

              the typical solution here is to ensure that there are multiple SAMPLE pipelines / FSMs. stalling only occurs in an OoO design when there are no free units, i.e. when every unit is currently occupied creating results (remember that all Computation Units *must* monitor results categorically WITHOUT FAIL from start to finish).

              therefore all that need be done is to calculate the desired (target) SAMPLEs/sec rate, take the length of time taken for any one SAMPLE instruction, divide those numbers and that tells you how many such SAMPLE Computation Units are needed.

              of course, you then need to crank up the data paths to cope...
              Nope. It can't possibly work out that way. Your assumptions are too optimistic. You have to really understand this issue if you want to successfully design a GPU.

              The time required to execute a SAMPLE instruction is extremely variable. There are three cases:

              1. All the required texture data is in L1. The SAMPLE instruction executes in less than 10 cycles.
              2. Some of the required data is in a non-local L1 or in L2. The SAMPLE instruction requires 15-50 cycles, about 20 on average.
              3. Some of the required data is in main memory. SAMPLE then requires more than 100 cycles on average, with up to 400 cycles in pathological cases, or even more if memory chokes up.

              The OoO engine will easily cover case 1.

              In case 2, the OoO engine can do a good job of keeping the core occupied, but some loss of execution throughput (say, 50%) is to be expected. That's still excellent: the OoO engine helps a lot to hide the high latency of the SAMPLE instruction.

              In case 3, the OoO engine will certainly run out of things to do. No OoO engine can keep scheduling in the face of such a long latency. No compiler can help either; it is impossible. Nothing helps. The GPU just stalls, and that's it. You have to accept that this problem exists. An OoO engine is not a magic wand for everything.

              When you calculate how many texture memory loads are required for a 720p screen at 60 Hz, you get some really astonishing (big) numbers.

              The total loss of execution throughput (due to stalls) will be around 15-20% for an 800 MHz GPU. Bumping up the GPU clock produces a higher loss, and lowering the clock reduces it. At 1500 MHz you are going to have a 30% execution throughput loss due to stalls. That is one reason why GPUs must run slow (the other is to reduce power consumption).

              Therefore, your "typical solution" is of absolutely no help in this case. Stalling doesn't occur in an OoO engine only when there are no free execution units; it also occurs when the dependency tracker is full (for example: no more slots for new instructions, no more free registers, or too many branches/too much speculation, which discards most of the results).
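
              The three cases can be folded into a latency-weighted toy model; the hit-rate fractions and hidden-latency figures below are illustrative guesses, not numbers from the post:

```python
# Toy model: expected exposed stall cycles per SAMPLE, as a
# hit-rate-weighted sum over the three cases discussed above.
# Fractions and "hidden" cycle counts are illustrative guesses.
# (case, fraction_of_samples, avg_latency_cycles, cycles_OoO_can_hide)
cases = [
    ("L1 hit",      0.90,  10,  10),   # fully hidden by the OoO engine
    ("L2 hit",      0.08,  20,  10),   # partially hidden
    ("main memory", 0.02, 150,  30),   # mostly an unavoidable stall
]

exposed = sum(frac * max(0, lat - hidden) for _, frac, lat, hidden in cases)
print(f"expected exposed stall cycles per SAMPLE: {exposed:.1f}")
```

              Even with main-memory misses at only 2% of samples, the exposed latency is dominated by case 3, which is the core of the argument above.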



              • Originally posted by lkcl View Post

                that's a lot!

                if they all have dual-issue and can do 2x FP32 SIMD, that's 4x FMACs per clock, i.e. 8 FLOPs/clock, which at 800mhz is 6.4 GFLOPs per core. 32 cores is 205 GFLOPs which is... woo!
                GeForce 560 Ti:
                - design year: 2011
                - process: 40nm
                - die size: 360 mm2
                - shader clock: 1800 MHz
                 - FMA throughput: 1263 GFLOPS

                 So, 32 POWER GPUs @800MHz is, what, one sixth of the compute power of a GeForce 560 Ti.

                That's about just the right amount of compute power required for a modern SoC, about 200 GFLOPs.

                 I can get you some more numbers, but it works out that if you fab the SoC on a 14nm process, you get about 1.5 billion transistors on 70 mm2. If each GPU core takes about 7 million transistors, that's about 230 million transistors just for the 32 GPU cores.

                The power budget is about 0.15 W per GPU. That would be the hardest thing to satisfy.

                You can also go for 28nm process, 16 GPUs, 100 mm2. That would also be sufficiently cheap and sufficiently small.

                I mean, you don't have to produce this, but if you want someone to put it on silicon, these numbers must work out, otherwise it gets uneconomical.
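
                 A quick sanity-check of those budget numbers (the transistor density is back-solved from the "1.5 billion on 70 mm2" figure above, not an independent datum):

```python
# Sanity-check of the area/transistor budget quoted above.
die_mm2 = 70
mtr_total = 1500                 # 1.5 billion transistors, in millions
mtr_per_mm2 = mtr_total / die_mm2   # ~21 MTr/mm^2, back-solved, 14nm assumed
gpu_cores = 32
mtr_per_gpu = 7                  # millions of transistors per GPU core

gpu_mtr = gpu_cores * mtr_per_gpu
gpu_share = gpu_mtr / mtr_total
print(f"GPU cores use {gpu_mtr} MTr, {gpu_share:.0%} of the die budget")
```

                 So the 32 GPU cores would take roughly 15% of the transistor budget, leaving the rest for caches, interconnect, and the CPU side.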



                • I'm counting the transistors my way when I say "5 million transistors". Your way of counting the transistors produces wrong numbers for this kind of estimate.



                  • Originally posted by xfcemint View Post
                    3. Some of the required data is in main memory. SAMPLE then requires more than 100 cycles on average, with up to 400 cycles in pathological cases, or even more if memory chokes up.

                    The OoO engine will easily cover case 1.
                    ok, so here there would be an exception (standard "miss") which would trigger an OS handler, which would pre-fetch the data and in the meantime context-switch to an alternative task. that's if there isn't a "pre-fetch" process that pre-loads the required memory, making sure that it's in the L2 cache in advance.

                    it's complex but manageable, basically.



                    • Originally posted by lkcl View Post

                      ok, so here there would be an exception (standard "miss") which would trigger an OS handler, which would pre-fetch the data and in the meantime context-switch to an alternative task. that's if there isn't a "pre-fetch" process that pre-loads the required memory, making sure that it's in the L2 cache in advance.

                      it's complex but manageable, basically.
                      No, that likely won't work. It will just waste even more time than letting the GPU stall. Just entering the OS wastes many cycles, far more than the ~100 of an average stall.

                      A context switch would be another few hundred cycles lost on swapping out the registers, refilling the L1i and L1d caches, etc.

                      GPU stalls waste only about 20% of processing power at 800 MHz. It is best to just let it stall; 20% is not much.

                      Also, it is hard to predict which data needs to be prefetched. The optimistic estimate is that the OS can guess correctly 50% of the time. This is a complex prediction, certainly not for an initial design. Maybe after a few years of experience with the device.

                      The GPU will stall. Nothing will help. No prediction, no OS, no task switching, no compiler optimizations. You have to let it stall.

                      Generally, I can see your basic error pattern: you are too optimistic a person. That jeopardizes this project's chances of success.

                      The OoO engine can't solve all the problems. You have to be more realistic.
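
                      The stall-vs-software argument above comes down to comparing two cycle counts; the OS-entry and context-switch figures below are the rough estimates from this post, not measurements:

```python
# Comparing the cost of just stalling vs. handling a miss in software,
# using the rough cycle counts argued in the post above.
avg_stall_cycles = 100           # typical main-memory SAMPLE stall
os_entry_cycles = 150            # trap + OS handler entry/exit (assumed)
context_switch_cycles = 300      # register swap + L1i/L1d refill (assumed)

software_path = os_entry_cycles + context_switch_cycles
print("just stall:", avg_stall_cycles, "cycles")
print("OS + context switch:", software_path, "cycles")
```

                      Under these assumptions the software path costs several times the stall it is trying to hide, which is why the post concludes the GPU should simply be allowed to stall.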

