**coder** · 20 February 2019, 02:55 AM

Also, on the subject of cache coherency, consider how much floating-point performance IBM's Cell managed. In just an 8+1-core chip, they managed over 100 GFLOPS, more than a decade ago. The 8 SPE cores had only 128-bit vector engines, like your design, and were in-order; the single PPE was in-order with 2-way SMT.

Cell (processor) - Wikipedia

https://en.wikipedia.org/wiki/Cell_(microprocessor)

The secret? The Cell used scratch-pad memory - not cache. This made it notoriously difficult to program, but then they weren't using OpenCL, which would've significantly eased the burden on programmers of managing data movement, among other things.
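As a rough sanity check on the "over 100 GFLOPS" figure, theoretical peak throughput can be estimated from core count, vector width and clock. This is only a sketch: the 3.2 GHz clock and the assumption of one fused multiply-add (2 flops) per 32-bit lane per cycle are not stated in the post, and `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(cores, vector_bits, clock_ghz, element_bits=32, flops_per_lane=2):
    """Theoretical peak, assuming every lane retires one FMA (2 flops) per cycle."""
    lanes = vector_bits // element_bits
    return cores * lanes * flops_per_lane * clock_ghz

# Assumed figures: 8 vector cores, 128-bit engines, 3.2 GHz clock.
cell_peak = peak_gflops(cores=8, vector_bits=128, clock_ghz=3.2)
print(f"{cell_peak:.1f} GFLOPS single-precision peak")
```

Sustained throughput on real workloads would be lower than this peak, which is consistent with a quoted figure of "over 100".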

**lkcl** · 20 February 2019, 05:24 AM

Originally posted by ldesnogu

Jeremy Bennett found that ARM Thumb is denser than RISC-V compressed: https://fosdem.org/2019/schedule/event/riscvcompact/

oo that's an extremely valuable and insightful analysis. it would be particularly interesting to see it repeated for RV64C, as there's something odd about the arm64 that increases code size and requires 50% larger L1 I-cache to compensate.

Also I guess you know a cache is not a CAM. You only need as many comparators as the number of ways of your cache and even then you can use way predictors. Anyway even if you completely removed the I-cache you'd not gain 40% of power.

appreciated. well, we can't remove the I-cache, that's for sure.
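To illustrate the point about comparators, here is a toy tag lookup for a set-associative cache. It is a minimal sketch with hypothetical names and sizes (64 sets, 4 ways, 64-byte lines), not a model of any particular design: each lookup indexes one set and compares only that set's ways, where a fully associative CAM would compare against every line.

```python
class SetAssociativeCache:
    """Toy N-way set-associative cache (tags only); all sizes are assumptions."""

    def __init__(self, num_sets=64, ways=4, line_bytes=64):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        # tags[set][way] holds the tag stored in that way, or None if empty
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.comparisons = 0

    def _split(self, addr):
        line = addr // self.line_bytes
        return line % self.num_sets, line // self.num_sets  # (set index, tag)

    def fill(self, addr, way):
        index, tag = self._split(addr)
        self.tags[index][way] = tag

    def lookup(self, addr):
        index, tag = self._split(addr)
        # Only the ways of ONE set are compared: `ways` comparators in total.
        self.comparisons += self.ways
        return tag in self.tags[index]

cache = SetAssociativeCache()
cache.fill(0x1000, way=0)
hit = cache.lookup(0x1000)    # same set, matching tag
miss = cache.lookup(0x2000)   # same set, different tag
print(hit, miss, cache.comparisons)
```

A way predictor, as mentioned in the quote, would reduce the comparisons further by checking a single predicted way first.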

**lkcl** · 20 February 2019, 05:32 AM

Originally posted by coder

The secret? The Cell used scratch-pad memory - not cache. This made it notoriously difficult to program, but then they weren't using OpenCL, which would've significantly eased the burden on programmers of managing data movement, among other things.

times move on, eh?

yyeah we will need to add a scratch memory area as well: its primary purpose will be as a direct target for batches of 4xFP32 (A,R,G,B) to be converted in a single cycle to a batch of 32-bit 8/8/8/8 ARGB pixels. (edit: see https://www.phoronix.com/forums/foru...37#post1081537 for additional uses)
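For reference, the conversion described above looks like this in software. It is a sketch only: the 0xAARRGGBB channel order, the [0.0, 1.0] input range, the clamping and the round-to-nearest behaviour are all assumptions, and `pack_argb8888` and `convert_batch` are hypothetical names. The hardware would do the whole batch in one cycle rather than looping.

```python
def pack_argb8888(a, r, g, b):
    """Pack four floats in [0.0, 1.0] into one 32-bit 0xAARRGGBB pixel."""
    def to_u8(x):
        x = min(max(x, 0.0), 1.0)      # clamp out-of-range values
        return int(x * 255.0 + 0.5)    # round to the nearest 8-bit level
    return (to_u8(a) << 24) | (to_u8(r) << 16) | (to_u8(g) << 8) | to_u8(b)

def convert_batch(pixels):
    """Software stand-in for the batch conversion of (A, R, G, B) tuples."""
    return [pack_argb8888(*p) for p in pixels]

batch = convert_batch([(1.0, 1.0, 0.0, 0.0), (1.0, 0.0, 0.5, 1.0)])
print([f"{p:#010x}" for p in batch])
```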

that's just one of the things that we've established will be needed in order to achieve the goal. AndyChow: we don't have all the answers, we don't know everything: there are still areas where we don't know what we don't know. and y'know what? that's okay. we'll find out (sooner rather than later being better), and when we do, we'll iteratively improve until the goal - the target - *is* reached. it may take 1 year, it may take 2, it may take 3: we'll keep at it.

**ldesnogu** · 20 February 2019, 11:08 AM

Originally posted by lkcl

oo that's an extremely valuable and insightful analysis. it would be particularly interesting to see it repeated for RV64C, as there's something odd about the arm64 that increases code size and requires 50% larger L1 I-cache to compensate.

For AArch64, I agree that 64-bit RISC-V compressed would be smaller. But what do you think is odd about it?

Anyway I'm still unconvinced that reducing Icache size by 25% will lead to 40% less power; in fact I'm rather confident it's wrong. But that doesn't mean you can't achieve that goal of 40% less power overall, I will just wait for proof :-)

**lkcl** · 20 February 2019, 11:40 AM

Originally posted by ldesnogu

For AArch64, I agree that 64-bit RISC-V compressed would be smaller. But what do you think is odd about it?

Anyway I'm still unconvinced that reducing Icache size by 25% will lead to 40% less power; in fact I'm rather confident it's wrong. But that doesn't mean you can't achieve that goal of 40% less power overall, I will just wait for proof :-)

me too

Jeff Bush's Nyuzi paper was particularly informative in this regard, as power-performance in GPUs is critically related to a lot of factors, one of the heaviest being getting data through the L1 / L2 caches.

http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf

it's pretty essential to have enough registers such that the data is kept in the register file (after LD) until it is absolutely necessary to push it back out (STORE), and even then, if that can be avoided it would be better. this is why most GPUs have a minimum of 128 floating-point registers.

as GPUs are pretty much proprietary, it's not exactly like we can examine an existing design's source code. we can study work such as MIAOW (which is a parallel compute engine, not a GPU), and Nyuzi, and the available documentation on Broadcom Videocore IV... but ultimately, we just have to get on with it, generate some verilog, synthesise it and see what happens.

**coder** · 20 February 2019, 08:08 PM

Originally posted by lkcl

it's not exactly like we can examine an existing design's source code. we can study work such as MIAOW (which is a parallel compute engine, not a GPU),

Did you happen to see my post:

https://www.phoronix.com/forums/foru...16#post1081216

It got delayed by the spam filter, due to the number of links I included. Anyway, perhaps some worthwhile references.

Another, more recent example of a highly efficient (i.e. GFLOPS/W) architecture that reminded me of Cell is:

Sunway SW26010 - Wikipedia

https://en.wikipedia.org/wiki/Sunway_SW26010

Its main compute cores have I-cache, but scratch pad RAM for data.

**programmerjake** · 20 February 2019, 09:17 PM

Originally posted by lkcl

times move on, eh?

yyeah we will need to add a scratch memory area as well: its primary purpose will be as a direct target for batches of 4xFP32 (A,R,G,B) to be converted in a single cycle to a batch of 32-bit 8/8/8/8 ARGB pixels.

I am planning on using the scratchpad for the portion of the framebuffer and z-buffer that is currently being worked on. so, it's much more than just a target for rgba 8888 pixels.
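A rough sketch of that idea, with every name and size here hypothetical: the tile's colour and depth values live in small local buffers (standing in for the scratchpad) while fragments are depth-tested, and the finished tile is written to the external framebuffer once. The "smaller z is closer" convention and the 4x4 tile are assumptions made only to keep the example short.

```python
TILE = 4  # hypothetical tile edge in pixels; a real tile would be larger

def render_tile(fragments, tile_x, tile_y, framebuffer, width):
    """Depth-test fragments into tile-local buffers, then write back once."""
    # These two lists stand in for the on-chip scratchpad.
    color = [0] * (TILE * TILE)
    depth = [float("inf")] * (TILE * TILE)

    for x, y, z, argb in fragments:
        lx, ly = x - tile_x, y - tile_y
        if not (0 <= lx < TILE and 0 <= ly < TILE):
            continue  # fragment belongs to another tile
        i = ly * TILE + lx
        if z < depth[i]:  # smaller z = closer (assumed convention)
            depth[i], color[i] = z, argb

    # One write-back per tile row to the external framebuffer.
    for ly in range(TILE):
        row = (tile_y + ly) * width + tile_x
        framebuffer[row:row + TILE] = color[ly * TILE:(ly + 1) * TILE]

width, height = 8, 8
fb = [0] * (width * height)
frags = [(5, 1, 0.9, 0xFF0000FF), (5, 1, 0.2, 0xFFFF0000), (0, 0, 0.1, 0xFFFFFFFF)]
render_tile(frags, tile_x=4, tile_y=0, framebuffer=fb, width=width)
print(hex(fb[1 * width + 5]))
```

The depth buffer never leaves the local buffers in this sketch, which is where much of the external memory traffic saving would come from.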

**coder** · 20 February 2019, 10:32 PM

Originally posted by programmerjake

I am planning on using the scratchpad for the portion of the framebuffer and z-buffer that is currently being worked on. so, it's much more than just a target for rgba 8888 pixels.

In neural network inferencing, it's tremendously useful to have some fast, on-chip memory.

For the sake of security, I would make it private to each core. That way, you needn't take the hit of accessing it through an MMU.

**Spacefish** · 23 October 2020, 11:03 AM

So the 720p 25fps limit is mainly dictated by the memory controller? As there are no good high perf Open controller IPs?

**lkcl** · 23 October 2020, 01:20 PM

Originally posted by Spacefish

So the 720p 25fps limit is mainly dictated by the memory controller?

no, it was - when that particular customer specified their requirements to us - a way to fit within their specified power requirements. you are correct in that 720p @ 25fps results in a certain memory controller bandwidth demand (easily calculated: 1280 x 720 x 25 x 3 bytes (RGB 888) = roughly 70 megabytes/sec).
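The arithmetic spelled out, as a small sketch (`scanout_bandwidth` is a hypothetical helper, and it counts only the reads needed to scan the framebuffer out to the display, not the writes that fill it):

```python
def scanout_bandwidth(width, height, fps, bytes_per_pixel):
    """Bytes per second needed just to read the framebuffer out for display."""
    return width * height * fps * bytes_per_pixel

bw = scanout_bandwidth(1280, 720, 25, 3)   # RGB 888 = 3 bytes per pixel
print(f"{bw:,} bytes/s = {bw / 1e6:.1f} MB/s")
```

That gives 69,120,000 bytes per second, in line with the figure quoted above.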

this in turn becomes the basis for computing the power draw for that particular customer's needs, and further, the amount of processing power required to keep the framebuffer occupied can also be computed.

As there are no good high perf Open controller IPs?

richard herveille's roalogic RGBTTL controller is unlimited. i.e. the limits are in the amount of memory bandwidth on the one side and the pinout (PHY pads) bandwidth on the other. https://github.com/RoaLogic/vga_lcd

on the latter there's additionally nothing stopping you from doing a conversion (internally) from RGB/TTL into eDP, DVI and so on (effectively subsuming the role of a TI TFP410a for example)

Libre RISC-V GPU Aiming For 2.5 Watt Power Draw Continues Being Plotted
