Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source


  • Dr.N0
    replied
    They are working on this, and have already booted it up on an FPGA.



  • WorBlux
    replied
    Originally posted by xfcemint View Post
    Here is one thing that I would like to know... but I have no clue.

    How many gates / transistors can you put on some modern, reasonably inexpensive FPGA?
    FPGAs don't work in terms of raw transistors. Rather, most of the logic is done by programming LUTs (multi-input lookup tables), and there is no single conversion ratio between the two. As an example, IBM released a slightly cut-down A2I (POWER7/BlueGene) at just under 200,000 LUTs.

    And 250k-300k LUTs is a reasonably accessible FPGA size, while very small parts (80k range) can be had quite cheaply.
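
    As a quick back-of-the-envelope check (using only the ball-park LUT figures quoted above; the specific parts are not named, and LUT counts are not transistor counts), one might sanity-check whether a core of a given size fits a given FPGA like this:

        # rough fit check using only the ball-park figures quoted above;
        # the LUT-to-transistor ratio varies by design, so only LUTs are compared.
        a2i_luts = 200_000                       # IBM A2I soft-core, "just under 200k LUTs"

        fpga_parts = {
            "small / cheap FPGA":         80_000,
            "reasonably accessible FPGA": 275_000,   # middle of the 250k-300k range
        }

        for part, capacity in fpga_parts.items():
            fits = a2i_luts <= capacity
            print(f"{part}: {a2i_luts / capacity:.0%} of LUTs used"
                  f" -> {'fits' if fits else 'does not fit'}")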



  • lkcl
    replied
    hi xfcemint, briefly: see https://youtu.be/FxFPFsT1wDw?t=17022 - the talk by jason ekstrand on the upcoming vulkan ray-tracing API. yes: on ##xdc2020 (freenode), when he mentioned it, three separate people went "oink, did he really say the API can be called recursively??"

    yes i took on board the non-uniform processing idea. your input helped expand the idea that we'd been mulling over for some time (not in detail) and i took note of the input you gave last week.

    apologies for not engaging more on this recently: although i really want to, we have the Dec 2 tape-out deadline to focus on.



  • lkcl
    replied
    Originally posted by WorBlux View Post

    (Warning, this is just complete amateur guessing.) Indeed, the common methods to hide this latency are SMT on the CPU, and large statically allocated register ranges assigned to thread blocks on the GPU. I'm wondering how well the scoreboard can deal with SMT? You could of course duplicate all the registers and the scoreboard state within itself and keep the same number of wires in some sort of coarse threading scheme. Or whether you could just duplicate the registers, and use some sort of window/thread dependency in the scoreboard to do more fine-grained multi-threading. Then adaptive round-robin with some feedback to the decoder could avoid/mitigate the worst of the stalls.
    adding SIMT opens up a whole can-of-worms that will need an entire separate research project. hyperthreading might be possible to (sanely) add via virtualisation/indirection of the register file (to get the register numbers down to sane levels). at that point "thread context" becomes part of the (virtual) regfile lookup table.
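
    purely as an illustrative sketch of that idea (sizes and names below are invented, this is not the actual Libre-SOC implementation): the (thread, architectural register) pair indexes a small lookup table that yields a physical register, so adding a hardware thread grows the table rather than the number of scoreboard wires:

        # illustrative sketch only: a virtualised register file where "thread
        # context" is folded into the lookup, as described above.  sizes and
        # names are made up for the example.
        NUM_THREADS = 2       # hypothetical hardware threads
        ARCH_REGS   = 32      # architectural registers visible per thread
        PHYS_REGS   = 128     # shared physical register file

        class VirtualRegFile:
            def __init__(self):
                self.phys = [0] * PHYS_REGS
                # static partitioning for the sketch: (thread, arch reg) maps to
                # a distinct physical register.  a real design could make this
                # table dynamic (i.e. renaming) instead.
                self.lookup = {(t, r): t * ARCH_REGS + r
                               for t in range(NUM_THREADS)
                               for r in range(ARCH_REGS)}

            def read(self, thread, reg):
                return self.phys[self.lookup[(thread, reg)]]

            def write(self, thread, reg, value):
                self.phys[self.lookup[(thread, reg)]] = value

        rf = VirtualRegFile()
        rf.write(0, 5, 0xAAAA)            # thread 0, r5
        rf.write(1, 5, 0xBBBB)            # thread 1, r5: lands in a different physical reg
        assert rf.read(0, 5) != rf.read(1, 5)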



  • lkcl
    replied
    Originally posted by xfcemint View Post

    I would like to thank you for this conversation. It is not every day that someone like me, an amateur hobbyist CPU designer, has a chance to talk to a real hardware designer.
    hey i am an amateur too, i just got lucky that NLnet were happy to back this

    The problem is that the hybrid CPU-GPU idea is your starting point. Apparently, you are not going to give it up easily, despite the existence (in my view) of very obvious arguments against it.
    the decision to do a hybrid processor is driven not by how much better the hardware will be but by how absolutely insane and complex driver development becomes for split CPU-GPU designs.

    if we go the "traditional" GPU route we LITERALLY add 5-10 man-years to the completion time, and, worse than that, cut out the opportunity for "long-tail" development.

    So, maybe we just disagree.

    Well, my advice to you is to reconsider it again.

    I suggest asking other GPU hardware designers and even experienced CUDA programmers about this issue. I have a feeling that they will all side with me.
    we did. they didn't. at SIGGRAPH 2018, Atif from Pixilica gave a BoF talk. the room was packed. he then went to a Bay Area meetup, and described his plans for a hybrid CPU-GPU architecture. *very experienced* Intel GPU engineers told him that they were delighted at this hybrid approach, saying that it was exactly the kind of shake-up that the GPU industry needs.

    the advantages of a hybrid architecture go well beyond what can be achieved with a set-in-stone proprietary GPU. "unusual" and innovative algorithms can be developed and tried out.

    in particular, the fact that you have to go userspace - RPC serialisation - kernelspace - SHARED MEMORY - GPU - RPC deserialisation - GPU execution on EVERY SINGLE OpenGL call makes programming spectacularly difficult to debug.

    and now that the Khronos Group is adding ray-tracing, this is RECURSIVE! recursive mirrored stacks where you have to have a full-blown recursive RPC subsystem on both the CPU and the GPU! absolutely insane.

    whereas for ray-tracing on a hybrid CPU-GPU? it's just a userspace function call. the only recursion done is on the standard userspace stack.
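
    to make the contrast concrete, here is a deliberately simplified sketch (not the real Vulkan API, and not Libre-SOC code) of what "just a userspace function call" means: the recursion for secondary rays is ordinary stack recursion, with no serialisation or kernel/GPU round-trip per bounce:

        # highly simplified illustration, not the real API: on a hybrid CPU/GPU
        # a recursive ray trace is just an ordinary function call, so reflection
        # bounces recurse on the normal userspace stack.
        MAX_DEPTH = 3                          # arbitrary bounce limit for the sketch

        def intersect_scene(ray):
            # stand-in for a real BVH traversal; returns a fake hit record
            if ray > 2:
                return None                    # ray escaped the "scene"
            return {"base_colour": 0.1, "reflective": True, "reflection_ray": ray + 1}

        def trace_ray(ray, depth=0):
            hit = intersect_scene(ray)         # runs on the same core(s), no RPC
            if hit is None:
                return 0.0                     # background
            colour = hit["base_colour"]
            if hit["reflective"] and depth < MAX_DEPTH:
                # secondary ray: simply call trace_ray() again -- no
                # serialisation, no kernel crossing, no GPU-side mirrored stack
                colour += 0.5 * trace_ray(hit["reflection_ray"], depth + 1)
            return colour

        print(trace_ray(0))                    # recursion bounded by MAX_DEPTH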

    focussing exclusively on speed, speed, speed at the hardware level is how the current insanity in driver development got to where it is, now.
    Last edited by lkcl; 01 October 2020, 05:19 PM.



  • WorBlux
    replied
    Originally posted by OneTimeShot View Post

    Yes I can see all that. All of the things you are not doing are critical to building a GPU. What you are doing is building a CPU with a custom vector extension because you don't like AVX-512. We know in advance that software-emulated GPU performance and power usage are going to be terrible. A general-purpose CPU core has too many transistors to replace a specialized GPU core.

    At the end of the day, the world doesn't need another CPU with vector extensions. Those already exist, and we already have the performance SIMD provides to CPU graphics work when running software Mesa. If you want to build a GPU, here is literally the first thing that came up when searching for GPU designs on open cores: https://opencores.org/projects/flexgripplus

    It looks like it comes from the University of Massachusetts (sorry it's written in "hardcoded non-OO" Verilog) and it has all the bits you'd expect to need in a GPU (it looks like it's more compute than graphics oriented):
    - SMP Controllers
    - Pipeline execution
    - Customised maths libraries
    - Execution Schedulers
    - RAM management

    At the end of the day, have fun with whatever you're doing I guess. Just don't promise anyone anything you can't deliver, and don't bother real hardware developers too much because until you have built the things listed above, or you have extensive game engine knowledge, you can't really offer much experience.
    The flexgrip is specifically designed to emulate Nvidia hardware and uses the Nvidia toolchain, probably a non-starter for a commercial project. It's also soft-core only, leveraging the FPGA architecture heavily (using its DSPs and chunks of distributed RAM).

    Anyway, the Libre SoC is targetting about 10 GFLOPS/W, while Nvidia's Maxwell gets 23 GFLOPS/W on 28 nm (based on the 750 Ti). It's definitely going to be a challenge. A newer low-power node and a lower clock will help some, but even if that gives you a 2x improvement, this design still requires a 2x improvement over prior vector engines. Perhaps not impossible, but a big challenge.

    Personally I love the idea and the architectural simplicity/transparency vs either shuffling everything over PCIe or dealing with shared memory. Hell, even if it only hits 5 GFLOPS/W and is libre, that's useful to me.
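
    As a back-of-the-envelope version of that comparison (the only inputs are the figures quoted above, plus the assumed 2x gain from a newer node and lower clock):

        # rough efficiency comparison; the 2x node/clock factor is an assumption,
        # not a measurement.
        maxwell_gflops_per_w = 23.0   # GTX 750 Ti class, 28 nm
        libresoc_target      = 10.0   # Libre SoC target
        node_and_clock_gain  = 2.0    # assumed benefit of a newer low-power node / lower clock

        # efficiency the vector engine itself must deliver before the node gain
        required_baseline = libresoc_target / node_and_clock_gain
        print(f"architecture must deliver ~{required_baseline:.0f} GFLOPS/W on its own")
        print(f"target is {libresoc_target / maxwell_gflops_per_w:.0%} of Maxwell's efficiency")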


    Originally posted by xfcemint View Post
    When you calculate how many texture memory loads are required for a 720p screen at 60 Hz, you get some really astonishing (big) numbers.

    The total loss of execution throughput (due to stalls) will be around 15-20% in the case of an 800 MHz GPU. Bumping up the GPU clock just produces a higher loss, and lowering the clock reduces the loss. At 1500 MHz you are going to have a 30% execution throughput loss due to stalls. That is one reason why GPUs must run slow (the other one is to reduce power consumption).

    Therefore, your "typical solution" is of absolutely no help in this case. Stalling in an OoO engine doesn't occur only when there are no free execution units; it also occurs when the dependency tracker is full (for example: no more slots for new instructions, no more free registers, or too many branches/too much speculation, which discards most of the results).
    (Warning, this is just complete amateur guessing.) Indeed, the common methods to hide this latency are SMT on the CPU, and large statically allocated register ranges assigned to thread blocks on the GPU. I'm wondering how well the scoreboard can deal with SMT? You could of course duplicate all the registers and the scoreboard state within itself and keep the same number of wires in some sort of coarse threading scheme. Or whether you could just duplicate the registers, and use some sort of window/thread dependency in the scoreboard to do more fine-grained multi-threading. Then adaptive round-robin with some feedback to the decoder could avoid/mitigate the worst of the stalls.
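
    For a rough sense of the "astonishing (big) numbers" behind the 720p / 60 Hz figure quoted above, here is a minimal estimate; the texels-per-pixel and bytes-per-texel values are assumptions for illustration (bilinear filtering, 32-bit texels, a single texture layer, no overdraw), not figures from the post:

        # back-of-the-envelope texture-fetch estimate for 720p at 60 Hz;
        # texels_per_pixel and bytes_per_texel are illustrative assumptions.
        width, height, fps = 1280, 720, 60
        texels_per_pixel   = 4        # bilinear filter = 4 texel reads per pixel
        bytes_per_texel    = 4        # RGBA8

        pixels_per_second = width * height * fps
        texel_fetches     = pixels_per_second * texels_per_pixel
        bytes_per_second  = texel_fetches * bytes_per_texel

        print(f"pixels/s:        {pixels_per_second:,}")      # ~55 million
        print(f"texel fetches/s: {texel_fetches:,}")          # ~221 million
        print(f"texture bytes/s: {bytes_per_second/1e9:.1f} GB/s (before caching)")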



  • lkcl
    replied
    Originally posted by xfcemint View Post

    Well, there is some truth in what you are saying, but overall: false.
    ... how do you know that? i'm slightly concerned - how can i put it - that you're putting forward an unverified "belief position" without consulting me or looking at the source code and the design plans.

    Imagine all the communication hubs, caches, wrong bus widths, wrong kinds of interconnects and throughput mismatches that need to be re-designed. The only things you can keep are possibly the execution units, because their number is made flexible by the OoO engine. Everything else will be wrong.
    "everything else will be wrong" only if the designer has not thought through the issues and taken them into account. the conversation is taking a very strange turn, xfcemint, i hope you don't mind me saying that.

    i've already planned ahead for parameterised massively parallel data paths, parameterisable multi-issue, and parameterised register bus widths. i can't say that everything is covered because it's still early days.
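
    purely as an illustration of the kind of knobs that "parameterised" implies (names and defaults below are hypothetical, not the actual Libre-SOC configuration interface):

        # hypothetical parameter block, illustrating the kind of knobs a
        # parameterised design exposes; not the real Libre-SOC configuration.
        from dataclasses import dataclass, replace

        @dataclass
        class CoreParams:
            issue_width: int = 2       # multi-issue width
            regfile_ports: int = 4     # register-bus width / port count
            fp_pipelines: int = 2      # parallel FP data paths
            l1_cache_kib: int = 32

        baseline = CoreParams()
        # "dialling the parameters" for a (hypothetical) customer request:
        wider = replace(baseline, issue_width=4, regfile_ports=8, fp_pipelines=4)
        print(baseline)
        print(wider)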

    You have to decide on the GPU design, and you have maybe a few months of time to do so.
    where did you get the mistaken impression that we have only a few months to make a decision? i have to be honest: there's been a significant change in the conversation today, going from positive and really valuable contributions when this started (last week?) to a position of negative-connotation assumptions. can i ask: have you been speaking privately, off-forum, to individuals who view this project in a negative light? or, perhaps, did you just wake up from a reaaaally good night out and haven't slept much?

    Ok, you didn't go into it so far (obviously), as you were doing a CPU. That is OK. Now you have to insert the GPU into the evaluations. A GPU must be optimized for bandwidth, not for serial execution like a CPU. The optimal kind of execution model for a GPU is massive parallelism. A GPU is a different beast. It requires a different kind of thinking.
    indeed it does, as we are learning.

    please do remember this is very early days. we've yet to get into the same "feedback" loop that Jeff Bush outlined in his work. honestly, that's really the time where we can begin to get "real numbers" and start to properly evaluate whether the architectural decisions made in the first phase need to be adjusted.



  • lkcl
    replied
    Originally posted by xfcemint View Post
    My proposal is not like Esperanto. My proposal actually makes your design closer to the standard design.
    standard design for a CPU? or standard design for a GPU?

    if we do a standard design for a GPU, then we are wasting our time because it will fail as a CPU (there are already plenty on the market).

    if we do a standard design for a CPU, then we are also wasting our time because we'd be competing directly against a massively-entrenched market *and* adding man-decades to the software driver development process.

    this is one of the weird things about a hybrid design and it's technically extremely challenging, needing to take into account the design requirements of what is normally two completely separate specialist designs (three if we include the Video processing).

    my point about Esperanto - and Aspex - is that if you go "too specialist" (non-SMP, NUMA, SIMT) then it becomes unviable as a general-purpose processor and there's no point trying to follow a *known* failed product strategy when we're specifically targetting dual (triple) workloads of CPU, GPU *and* VPU.

    Aspex was damn lucky that they got bought by Ericsson, who needed a dedicated specialist high-bandwidth solution for coping with the insane workloads of cell tower baseband processing.



  • lkcl
    replied
    Originally posted by xfcemint View Post

    Again, strange framing.

    From my point of view, the consequence of your current design is that you will be unable to satisfy your customer's demands.
    then we change the parameters of the design *to* meet the customer's need.

    You can't design a GPU, and then say "we will adapt to customers' demands". The right way is to first figure out what the customers want, and then design the GPU.
    to emphasise again, as i don't believe you've fully grasped it: what i learned from that extremely knowledgeable salesman is that it's an iterative process, where yes, we do exactly that. you start somewhere, then say to the customer, "this is what we've got, what would *you* like?" and they tell you.

    then you do - in fact - exactly what you expect not to be possible - which is to change the design.

    and you then return to them for a follow-up meeting and say, "we've changed the design to what you want: can you confirm this is what you want?"

    now, knowing this *in advance*, i have in fact made damn sure that yes, we can "dial the parameters" to meet whatever a given customer might reasonably ask for.

    if they want a 4096-core processor with a multi-billion transistor count it might get a little hairy at this early stage; however, if they're prepared to pay to get what they want i am hardly going to say "no, go away".

    now, fortunately, we can in fact start from somewhere, because we're funded by NLnet to do exactly that. NLnet's giving us this opportunity to get the project off the ground where normally we would be scrabbling around on a "day job", working at this only on weekends and nights, stretching it out over years.

    (the other thing i haven't mentioned publicly is that we do in fact have a potential customer whom we're listening to, as to what they would like to have).



  • lkcl
    replied
    Originally posted by xfcemint View Post
    It enables getting higher performance from an old process like 45nm.
    we'll only use an old process like 45nm if it delivers what the customer wants at a cost that the customer can pay.

