Announcement

**lkcl** · 22 September 2020, 11:38 AM

Originally posted by xfcemint View Post

When I put some number and estimates in, it appears that it needs to be an enormously wide bus.

yyeah just for the LD/STs for Vector Processing, to even manage 4x 32-bit LD/STs we need a whopping 256-bit-wide LD/ST total bandwidth into L1 caches and out. that's because 4x 32-bit LD/ST is 2x 64-bit but you will have, when pipelined, 2x 64-bit LDs *AND* 2x 64-bit STs happening simultaneously!

the bandwidth numbers are quite mental. and to think, most open source projects do single-issue 32-bit wishbone accesses and they're really happy!

**lkcl** · 23 September 2020, 01:04 PM

Originally posted by xfcemint View Post

For the GPU design, you have to consider that the SAMPLE instruction (or anything similar) needs to occasionaly stall because the data needs to be fetched from main memory. That would be a long stall.

the typical solution here is to ensure that there are multiple SAMPLE pipelines / FSMs. stalling only occurs in an OoO design when there are no free units that are not currently occupied creating results (remember that all Computation Units *must* monitor results categorically WITHOUT FAIL from start to finish).

therefore all that need be done is to calculate the desired (target) SAMPLEs/sec rate, take the length of time taken for any one SAMPLE instruction, divide those numbers and that tells you how many such SAMPLE Computation Units are needed.

of course, you then need to crank up the data paths to cope...

**lkcl** · 23 September 2020, 01:09 PM

Originally posted by xfcemint View Post

Let's say you have 32 GPU cores running at 800 MHz.

that's a lot!

if they all have dual-issue and can do 2x FP32 SIMD that's 4x FMACs @ 800mhz which is 8x FLOPs * 800mhz which is 6.4 GFLOPs. 32 cores is 205 GFLOPs which is... woo!

**lkcl** · 24 September 2020, 11:37 AM

Originally posted by xfcemint View Post

3. Some of the required data is main memory. SAMPLE then requires more than 100 cycles on average, with up to 400 cycles in pathological cases, or even more in case of memory choke up.

The OoO engine will easily go over the case 1.

ok, so here there would be an exception (standard "miss") which would trigger an OS handler, which would pre-fetch the data and in the meantime context-switch to an alternative task. that's if there isn't a "pre-fetch" process that pre-loads the required memory, making sure that it's in the L2 cache in advance.

it's complex but manageable, basically.

**programmerjake** · 24 September 2020, 02:00 PM

Originally posted by xfcemint View Post

Isn't it simpler to design a simpler core, and just add more cores? How about that solution? That would be my solution.

That could work, assuming GPU instructions are included, it's just less area efficient because of the large amount of instruction decoders, icaches, schedulers, etc. compared to the area dedicated to ALUs and data path. That's the route Esperanto took with their 4096-core behemoth.

**programmerjake** · 24 September 2020, 05:45 PM

Originally posted by xfcemint View Post

Wait a second. The GPU core is supposed to use vectorized instructions, as I previously understood. That means the core is quite fat already, definitively not wasting area on an instruction decoder or schedulers. In the presence of vectorized instructions, even the tripple issue is excessive. Vectorized instructions seem like a good idea to me.

With vectorized instructions, the GPU core is fat enough.

Part of why the processor can decode and execute multiple instructions per clock is that we want it to also be decent at CPU tasks. The other part is that vectorized instructions are not the only kind of GPU instructions, there are also scalar operations that need to be run for things like computing addresses, loop counters, execution mask housekeeping (for implementing SPIR-V's SIMT machine model) and more.

**lkcl** · 27 September 2020, 04:28 AM

Originally posted by programmerjake View Post

That could work, assuming GPU instructions are included, it's just less area efficient because of the large amount of instruction decoders, icaches, schedulers, etc. compared to the area dedicated to ALUs and data path. That's the route Esperanto took with their 4096-core behemoth.

the result unfortunately being that as a general-purpose compute engine, it wasn't useful. i learned this lesson when working for Aspex Semiconductors, which was a massively-parallel SIMT array of 2-bit processors. 4096 per chip in 2003: they were already planning the 65536 ALU version even then. with each ALU being around 3,000 gates you'd be into the tens to hundreds of millions of ALUs for a big 7nm core.

regardless: the cost and complexity of software development of this type of arrangement was so high that it was measured in DAYS per line of assembly code, and unfortunately, the Esperanto core falls into the same category: ultra-specialist, non-SMP.

the only reason that "traditional" GPUs sell is because it's a well-established proven market.

**lkcl** · 27 September 2020, 04:39 AM

Originally posted by xfcemint View Post

Oh, right, the plan is to put 4 hybrid CPU-GPU cores on a 45nm die. That is a tight fit, but probably doable by reducing the core and caches below optimal size.

you misunderstand: we'll do what the market - the customer - the person or persons with the money and the belief - says what they want. to do otherwise is to be hopelessly unrealistic and naive, and not recognising that this is a business.

i worked with an extremely skilled salesman, it was an education in itself to watch him present things to the customer, then ask questions: my role was to sit there and say "yes that's possible" - pretty much the only time i would speak during the entire meeting, and we would arrange a second meeting in a few months. i would be tasked during the intervening time with delivering on what i'd said was possible.

to try to *tell* the customer "yeah you want a NN nm core with not enough cache and core count for your actual needs", that's not going to fly, is it? so why would we try to do that?

doing business involves listening to what the customer wants, working out if it's possible, and having iterative conversations to deliver on their needs.

**lkcl** · 27 September 2020, 04:40 AM

Originally posted by xfcemint View Post

It enables getting higher performance from an old process like 45nm.

we'll only use an old process like 45nm if it delivers what the customer wants at a cost that the customer can pay.

**lkcl** · 27 September 2020, 06:36 AM

Originally posted by xfcemint View Post

Again, strange framing.

From my point of view, the consequence of your current design is that you will be unable to sasisfy your customer's demands.

then we change the parameters of the design *to* meet the customer's need.

You can't design a GPU, and then say "we will adapt to customers' demands". The right ways is to first figure out what the customers want, then you design a GPU.

to emphasise again, as i don't believe you've fully grasped it: what i learned from that extremely knowledgeable salesman is that it's an iterative process, where yes, we do exactly that. you start somewhere, then say to the customer, "this is what we've got, what would *you* like?" and they tell you.

then you do - in fact - do exactly what you expect not to be possible - which is to change the design.

and you then return to them for a follow-up meeting and say, "we've changed the design to what you want: can you confirm this is what you want?"

now, knowing this *in advance* i have in fact made damn sure that yes, we can in fact "dial the parameters" to reasonably meet whatever a given customer might reasonably ask for.

if they want a 4096 core processor with a multi-billion transistor count it might get a little hairy at this early stage, however if they're prepared to pay to get what they want i am hardly going to say "no, go away"

now, fortunately, we can in fact start from somewhere, because we're funded by NLnet to do exactly that. NLnet's giving us this opportunity to get the project off the ground where normally we would be scrabbling around on a "day job", working at this only on weekends and nights, stretching it out over years.

(the other thing i haven't mentioned publicly is that we do in fact have a potential customer to whom we're listening to, as to what they would like to have).

Announcement

Libre-SOC Still Persevering To Be A Hybrid CPU/GPU That's 100% Open-Source

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment