Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • #71
    Originally posted by xfcemint View Post
    My proposal is not like Esperanto. My proposal actually makes your design closer to the standard design.
    standard design for a CPU? or standard design for a GPU?

    if we do a standard design for a GPU, then we are wasting our time because it will fail as a CPU (there's already plenty in the market).

    if we do a standard design for a CPU, then we are also wasting our time because we'd be competing directly against a massively-entrenched market *and* adding man-decades to the software driver development process.

    this is one of the weird things about a hybrid design and it's technically extremely challenging, needing to take into account the design requirements of what is normally two completely separate specialist designs (three if we include the Video processing).

    my point about Esperanto - and Aspex - is that if you go "too specialist" (non-SMP, NUMA, SIMT) then it becomes unviable as a general-purpose processor, and there's no point trying to follow a *known* failed product strategy when we're specifically targeting dual (triple) workloads of CPU, GPU *and* VPU.

    Aspex was damn lucky that they got bought by Ericsson, who needed a dedicated specialist high-bandwidth solution for coping with the insane workloads of cell tower baseband processing.



    • #72
      Originally posted by xfcemint View Post

      Well, there is some truth in what you are saying, but overall: false.
      ... how do you know that? i'm slightly concerned - how can i put it - that you're putting forward an unverified "belief position" without consulting me or looking at the source code and the design plans.

      Imagine all the communication hubs, caches, wrong bus widths, wrong kinds of interconnects and throughput mismatches that need to be re-designed. The only things you can keep are possibly the execution units, because their number is made flexible by the OoO engine. Everything else will be wrong.
      "everything else will be wrong" only if the designer has not thought through the issues and taken them into account. the conversation is taking a very strange turn, xfcemint, i hope you don't mind me saying that.

      i've already planned ahead for parameterised massively-parallel data paths, parameterisable multi-issue, and parameterised register bus widths. i can't say that everything is covered because it's still early days.
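
      (purely as an illustrative aside, this is roughly what "parameterised" means in practice in nmigen, the python HDL the project uses - a toy sketch, not the actual Libre-SOC source:)

```python
# Toy sketch only (not Libre-SOC code): an nmigen module whose datapath
# width is a construction-time parameter, so the same RTL can be built
# for 32-, 64- or 128-bit register buses.
from nmigen import Elaboratable, Module, Signal

class ParamAdder(Elaboratable):
    def __init__(self, width=64):          # width is the parameter
        self.a = Signal(width)
        self.b = Signal(width)
        self.o = Signal(width)

    def elaborate(self, platform):
        m = Module()
        # combinatorial add; the result is truncated to the configured width
        m.d.comb += self.o.eq(self.a + self.b)
        return m

# the same class instantiated at two different widths:
narrow = ParamAdder(width=32)
wide   = ParamAdder(width=128)
```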

      You have to decide on the GPU design, and you have maybe a few months of time to do so.
      where did you get the mistaken impression that we have only a few months to make a decision? i have to be honest: there's been a significant change in the conversation, today, going from positive and really valuable contributions when this started (last week?), to a position of negative-connotation assumption, today. can i ask: have you been speaking privately, off-forum to individuals who view this project in a negative light? or, perhaps, you just woke up from a reaaaally good night out and haven't slept much?

      Ok, you didn't go into it so far (obviously), as you were doing a CPU. That is OK. Now you have to insert the GPU into the evaluations. A GPU must be optimized for bandwidth, not for serial execution like a CPU. The optimal kind of execution model for a GPU is massive parallelism. A GPU is a different beast. It requires a different kind of thinking.
      indeed it does, as we are learning.

      please do remember this is very early days. we've yet to get into the same "feedback" loop that Jeff Bush outlined in his work. honestly, that's really the time where we can begin to get "real numbers" and start to properly evaluate whether the architectural decisions made in the first phase need to be adjusted.



      • #73
        Originally posted by OneTimeShot View Post

        Yes I can see all that. All of the things you are not doing are critical in building a GPU. What you are doing is building a CPU with a custom vector extension because you don't like AVX-512. We know in advance that software emulation GPU performance and power usage is going to be terrible. A general purpose CPU core has too many transistors to replace a specialized GPU core.

        At the end of the day, the world doesn't need another CPU with vector extensions. Those already exist, and we already have the performance SIMD provides to CPU graphics work when running software Mesa. If you want to build a GPU, here is literally the first thing that came up when searching for GPU designs on open cores: https://opencores.org/projects/flexgripplus

        It looks like it comes from the University of Massachusetts (sorry it's written in "hardcoded non-OO" Verilog) and it has all the bits you'd expect to need in a GPU (it looks like it's more compute than graphics oriented):
        - SMP Controllers
        - Pipeline execution
        - Customised maths libraries
        - Execution Schedulers
        - RAM management

        At the end of the day, have fun with whatever you're doing I guess. Just don't promise anyone anything you can't deliver, and don't bother real hardware developers too much because until you have built the things listed above, or you have extensive game engine knowledge, you can't really offer much experience.
        The flexgrip is specifically designed to emulate nvidia hardware and uses the nvidia toolchain, probably a non-starter for a commercial project. It's also soft-core only, leveraging the FPGA architecture heavily (using its DSPs and chunks of distributed RAM).

        Anyway, the Libre-SOC is targeting about 10 GFLOPS/W, while nvidia's Maxwell gets 23 GFLOPS/W on 28 nm (based on the 750 Ti). It's definitely going to be a challenge. A newer low-power node and a lower clock will help some, but even if that gives you 2x, this design still requires a 2x improvement over prior vector engines. Perhaps not impossible, but a big challenge.

        Personally I love the idea and the architectural simplicity/transparency versus either shuffling everything over PCIe or dealing with shared memory. Hell, even if it only hits 5 GFLOPS/W and is libre, that's useful to me.


        Originally posted by xfcemint View Post
        When you calculate how many texture memory loads are required for a 720p screen at 60 Hz, you get some really astonishing (big) numbers.

        The total loss of execution throughput (due to stalls) will be around 15-20% in the case of an 800 MHz GPU. Bumping up the GPU clock just produces a higher loss, and lowering the clocks reduces the loss. At 1500 MHz you are going to have a 30% execution throughput loss due to stalls. That is one reason why GPUs must run slow (the other one is to reduce power consumption).

        Therefore, your "typical solution" is of absolutely no help in this case. Stalling in an OoO engine doesn't occur only when there are no free execution units; it also occurs when the dependency tracker is full (for example: no more slots for new instructions, no more free registers, or too many branches / too much speculation, which discards most of the results).
        (Warning, this is just complete amateur guessing.) Indeed, the common methods to hide this latency are SMT on the CPU, and large register files statically allocated to thread blocks on the GPU. I'm wondering how well the scoreboard can deal with SMT? You could of course duplicate all the registers and the scoreboard's internal state and keep the same number of wires, in some sort of coarse threading scheme. Or perhaps you could just duplicate the registers, and use some sort of window/thread dependency in the scoreboard to do a more fine-grained multi-threading. Then adaptive round-robin with some feedback to the decoder could avoid/mitigate the worst of the stalls.
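
        A rough back-of-envelope illustration of the "astonishing numbers" in the quote above - every factor here is an assumption picked for illustration, not a measured figure:

```python
# Back-of-envelope texture traffic for the quoted 720p @ 60 Hz case.
# All the per-pixel factors below are assumptions for illustration only.
width, height, fps = 1280, 720, 60
overdraw         = 2      # assumed average overdraw
texels_per_pixel = 4      # assumed bilinear filtering: 4 texel fetches/sample
bytes_per_texel  = 4      # assumed RGBA8 textures

pixels_per_sec  = width * height * fps
fetches_per_sec = pixels_per_sec * overdraw * texels_per_pixel
bytes_per_sec   = fetches_per_sec * bytes_per_texel

print(f"{pixels_per_sec / 1e6:.1f} Mpixels/s shaded")
print(f"{fetches_per_sec / 1e6:.0f} M texel fetches/s")
print(f"{bytes_per_sec / 1e9:.2f} GB/s of texture reads before any caching")
```

        Numbers in that range (hundreds of millions of texel fetches per second) are exactly why the latency has to be hidden rather than waited out.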



        • #74
          Originally posted by xfcemint View Post

          I would like to thank you for this conversation. It is not every day that someone like me, an amateur hobbyist CPU designer, has a chance to talk to a real hardware designer.
          hey i am an amateur too, i just got lucky that NLnet were happy to back this

          The problem is that the hybrid CPU-GPU idea is your starting point. Apparently, you are not going to give it up easily, despite the existence (in my view) of very obvious arguments against it.
          the decision to do a hybrid processor is driven not by how much better the hardware will be, but by how absolutely insane and complex driver development becomes for split CPU-GPU designs.

          if we go the "traditional" GPU route we LITERALLY add 5-10 man-years to the completion time, and, worse than that, cut out the opportunity for "long-tail" development.

          So, maybe we just disagree.

          Well, my advice to you is to reconsider it again.

          I suggest asking other GPU hardware designers and even experienced CUDA programmers about this issue. I have a feeling that they will all side with me.
          we did. they didn't. at SIGGRAPH 2018, Atif from Pixilica gave a BoF talk. the room was packed. he then went to a Bay Area meetup, and described his plans for a hybrid CPU-GPU architecture. *very experienced* Intel GPU engineers told him that they were delighted at this hybrid approach, saying that it was exactly the kind of shake-up that the GPU industry needs.

          the advantages of a hybrid architecture go well beyond what can be achieved with a set-in-stone proprietary GPU. "unusual" and innovative algorithms can be developed and tried out.

          in particular, the fact that you have to go userspace-RPCserialisation-kernelspace-SHAREDMEMORY-to-GPU-RPCdeserialisation-GPUexecution on EVERY SINGLE OPENGL call makes programming spectacularly difficult to debug

          and now the Khronos Group is adding ray-tracing, this is RECURSIVE! recursive mirrored stacks where you have to have a full-blown recursive RPC subsystem on both the CPU and the GPU! absolutely insane.

          whereas for ray-tracing on a hybrid CPU-GPU? it's just a userspace function call. the only recursion done is on the standard userspace stack.

          focussing exclusively on speed, speed, speed at the hardware level is how the current insanity in driver development got to where it is, now.
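
          (a toy sketch of that point, with entirely made-up names rather than any real API: on a hybrid design a recursive ray bounce is just another stack frame.)

```python
# Toy illustration only - hypothetical names, not a real API. On a hybrid
# CPU-GPU, a recursive ray bounce is an ordinary userspace function call:
# no RPC serialisation, no kernel crossing, no mirrored GPU-side stack.
import random

def shade(depth, max_depth=4):
    if depth >= max_depth:
        return 0.0                      # recursion bottoms out on the stack
    if random.random() > 0.5:           # stand-in for a real intersection test
        return 0.1                      # "sky" contribution, no bounce
    local = 0.5                         # stand-in for local surface shading
    # on a split CPU/GPU design this bounce becomes a nested RPC round-trip;
    # here it is literally just another call frame on the userspace stack:
    bounced = 0.8 * shade(depth + 1, max_depth)
    return local + bounced

print(shade(depth=0))
```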
          Last edited by lkcl; 01 October 2020, 05:19 PM.



          • #75
            Originally posted by WorBlux View Post

            (Warning, this is just complete amateur guessing.) Indeed, the common methods to hide this latency are SMT on the CPU, and large register files statically allocated to thread blocks on the GPU. I'm wondering how well the scoreboard can deal with SMT? You could of course duplicate all the registers and the scoreboard's internal state and keep the same number of wires, in some sort of coarse threading scheme. Or perhaps you could just duplicate the registers, and use some sort of window/thread dependency in the scoreboard to do a more fine-grained multi-threading. Then adaptive round-robin with some feedback to the decoder could avoid/mitigate the worst of the stalls.
            adding SIMT opens up a whole can-o-worms that will need an entire separate research project to add. hyperthreading might be possible to (sanely) add via virtualisation/indirection of the register file (to get numbers down to sane levels). at that point "thread context" becomes part of the (virtual) regfile lookup table.
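
            (a very rough sketch of what "thread context becomes part of the (virtual) regfile lookup table" could mean - purely illustrative, not the actual design:)

```python
# Rough sketch, illustrative only: each (thread, architectural register) pair
# is indirected through a lookup table to a physical register.  adding a
# second hardware thread then means more table entries, not a second set of
# datapath wires.
NUM_PHYS_REGS = 256

class VirtualRegfile:
    def __init__(self, threads, arch_regs):
        free = list(range(NUM_PHYS_REGS))
        # (thread, architectural reg) -> physical register number
        self.table = {(t, r): free.pop()
                      for t in range(threads) for r in range(arch_regs)}
        self.phys = [0] * NUM_PHYS_REGS

    def read(self, thread, reg):
        return self.phys[self.table[(thread, reg)]]

    def write(self, thread, reg, value):
        self.phys[self.table[(thread, reg)]] = value

rf = VirtualRegfile(threads=2, arch_regs=32)
rf.write(thread=1, reg=3, value=0xdead)
print(hex(rf.read(1, 3)), rf.read(0, 3))   # thread 0's r3 is untouched
```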



            • #76
              hi xfcemint, briefly: see https://youtu.be/FxFPFsT1wDw?t=17022 - the talk by jason ekstrand on the upcoming vulkan ray-tracing API. yes: on ##xdc2020 (freenode), when he mentioned it, three separate people went "oink, did he really say the API can be called recursively??"

              yes i took on board the non-uniform processing idea. your input helped expand the idea that we'd been mulling over for some time (not in detail) and i took note of the input you gave last week.

              apologies for not engaging more on this recently: although i really want to, we have the Dec 2 tape-out deadline to focus on.



              • #77
                Originally posted by xfcemint View Post
                Here is one thing that I would like to know... but I have no clue.

                How many gates / transistors can you put on some modern, reasonably inexpensive FPGA?
                FPGAs don't work in terms of raw transistors. Rather, most of the logic is done by programming LUTs (multi-input lookup tables), and there's no single ratio between the two. As an example, IBM released a slightly cut-down A2I (power7/bluegene) at just under 200,000 LUTs.

                And 250k-300k LUTs is a reasonably accessible FPGA size, while very small ones (80k range) can be had quite cheaply.



                • #78
                  They are working on this, and have already booted it up on an FPGA.



                  • #79
                    Originally posted by lkcl View Post

                    adding SIMT opens up a whole can-o-worms that will need an entire separate research project to add. hyperthreading might be possible to (sanely) add via virtualisation/indirection of the register file (to get numbers down to sane levels). at that point "thread context" becomes part of the (virtual) regfile lookup table.
                    SIMT would be quite a beast, but isn't exactly the GPU feature I'm talking about. For example, a single nVidia Ampere SM has a 256k register file that can be divided among up to 32 thread blocks (SIMT being found as an optimization within the thread block). But because the register files are statically allocated to thread blocks, the SM's internal scheduler can quickly flip between thread blocks to cover memory latency/stalls.

                    And I've tried to find more on your implementation of simple-V, but can't quite find what exactly is going on. However if you're striding across very large vectors you can't keep it all in cache, and I suspect you may even have a hard time streaming from memory fast enough.

                    You say a vector instruction will essentially stop instruction decode and other execution until the vector op is complete, but it seems like at that point you are committed, and if you're on a memory stall, there's no painless early-out or swap/resume built in. Presumably you have to break it up internally into register-sized loads/stores, but it's not clear whether these can commit/pause/resume independently.

                    To cover memory latency I'd expect a lot of loads in flight and a lot of places to put them. I do see you have a proposal to bank/divide the vector registers, and that's maybe closer to what I'm thinking: assign a bank to a specific op. Then when you hit a stall, you can switch to a vector op going on in a different bank - if an op is active on it, try to continue it, or if the bank is empty, look at the scoreboard and try to find another vector op.

                    Maybe this virtual reg file is a big thing I'm overlooking.

                    I guess the TL;DR question would be: is there a reason the decoder can't issue one of the vector zero-overhead loops alongside subsequent instructions, potentially even out of multiple vector loops at once? SMT would be a way to do this within the core, but it's heavy and involves OS support for swapping. Maybe some way to spawn asynchronous threadlets?
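
                    Purely as a thought experiment (not a claim about how Libre-SOC actually issues anything), something like this toy model is what I have in mind - expand each vector loop into per-element micro-ops and let a round-robin scheduler interleave elements from several loops:

```python
# Toy model only: expand each "zero-overhead vector loop" into per-element
# micro-ops, then round-robin between the loops so a stall on one element
# stream need not block the others.
from collections import deque

def vector_loop(name, vl):
    """Yield one micro-op label per element of a VL-long vector op."""
    for element in range(vl):
        yield f"{name}[{element}]"

ready = deque([vector_loop("vld", 4), vector_loop("vfma", 4)])
issue_order = []
while ready:
    loop = ready.popleft()
    try:
        issue_order.append(next(loop))   # issue one element, then rotate
        ready.append(loop)
    except StopIteration:
        pass                             # that vector op has completed

print(issue_order)   # elements of the two loops interleaved round-robin
```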


                    Also, most of the discussion of simple-V centers on RISC-V and not on POWER, so it's hard to tell what's essential to the idea and what came about simply for better RISC-V integration.



                    • #80
                      Originally posted by lkcl View Post

                      the decision to do a hybrid processor is driven not by how much better the hardware will be, but by how absolutely insane and complex driver development becomes for split CPU-GPU designs.

                      if we go the "traditional" GPU route we LITERALLY add 5-10 man-years to the completion time, and, worse than that, cut out the opportunity for "long-tail" development.
                      Indeed, the iterative approach makes more sense for a small team, and worst case you end up adding some simple accelerators / co-processors / execution lanes for specific bottlenecks. And the Power architecture in particular is fairly amenable to this.

                      And for further context, 25 GFLOPS (16-bit) is about half of what the PS3 could do. This isn't meant to run AAA games.
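
                      As a quick sanity check tying that to the ~10 GFLOPS/W target mentioned earlier (the implied power budget is my own inference from the thread's figures, not an official spec):

```python
# Tie the two numbers from this thread together; the power budget is
# inferred, not quoted from any official Libre-SOC documentation.
target_gflops_per_watt = 10.0   # efficiency target mentioned earlier
target_gflops_16bit    = 25.0   # "about half of what the PS3 could do"

implied_power_watts = target_gflops_16bit / target_gflops_per_watt
print(f"implied GPU power budget: ~{implied_power_watts:.1f} W")   # ~2.5 W
```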

                      Originally posted by lkcl View Post


                      and now the Khronos Group is adding ray-tracing, this is RECURSIVE! recursive mirrored stacks where you have to have a full-blown recursive RPC subsystem on both the CPU and the GPU! absolutely insane.

                      whereas for ray-tracing on a hybrid CPU-GPU? it's just a userspace function call. the only recursion done is on the standard userspace stack.

                      focussing exclusively on speed, speed, speed at the hardware level is how the current insanity in driver development got to where it is, now.
                      I don't think the stack is mirrored. The CPU side has to set up and compile shader groups into a shader description table, which is buffered somewhere on the GPU, and the GPU is responsible for managing the I/O variables that are explicitly passed. Callable shaders are recursive, but can only call within the shared group, and there are hardware/implementation-defined limits on depth.
                      Also, a callable shader may be returning to / jumping into a different shader context than the one it was spawned in (the call may generate a new shader invocation, and the return yet another). The I/O variables are explicitly read/write and you overwrite in place to pass data back up. There also seems to be a limited number of variables that can be passed, which would help in stack construction/traversal. And the callable shader just gets callable data in, which means all the calls in the chain have to send the right "in" format, putting quite a bit more work on the developer who is making deep arbitrary call chains.

                      That being said, implementing this on a CPU is fairly straightforward, while the GPU is relying on black magic and a lot of sub-layers. Then again, a couple of million rays per second is simply out of the question on a CPU. Not that there aren't more efficient rendering techniques that a CPU opens up and a GPU would simply choke on.

