Libre RISC-V Open-Source Effort Now Looking At POWER Instead Of RISC-V


  • #31
    Originally posted by lkcl View Post

    turns out that enjoy-digital has a PCIe Controller - that just leaves the PHY, which is the bit that has me concerned: aside from DDR3/4 RAM, there's nothing on the ASIC that i was planning to run above around 150mhz for the first version.

    if however you can get hold of a PCIe PHY, then yes, we can put it in. however not for the test chip, because it's 180nm.

    what *would* work would be to use a Lattice ECP5G as a gateway (communicating using some form of parallel bus e.g. xSPI or SDRAM). the ECP5G already has the balanced differential PHY drivers needed to do PCIe, and someone is actually working on it: https://github.com/enjoy-digital/litepcie/issues/20
    There is an existing parallel interface standard for connecting PCIe PHYs to the datalink layer, the "PIPE" specification: https://www.intel.com/content/dam/ww...ctures-3.1.pdf

    In fact, due to the similar PHY design shared between PCIe, USB3, SATA and DisplayPort, PIPE has been adapted to support all of these, so common interfaces are available for PHYs. Though these are intended for intra-chip use, inter-chip use may be feasible. There are also standalone PCIe PHY chips, but only at x1; still, studying their interfaces may be helpful: https://www.nxp.com/docs/en/data-sheet/PX1011B.pdf

    As far as I can tell your project doesn't have an IRC channel, getting one might be a good idea. There's much that can be discussed here.



    • #32
      Originally posted by Qaridarium

      I think 14nm will be very cheap in 2020 because then IBM power10 will have 7nm and also Intel will have 7nm for all products.

      so all the 14nm fabs will be free for low-cost manufacturing.
      Remember that many ARM CPUs are still 28nm, so it may be a while until 14nm is "free".

      Originally posted by lkcl View Post

      when you compute the data transfer rate generated by 4k, it's 8.3 million pixels per frame. let's say 30 fps, that's now 250 million pixels per second. let's say 16 bpp, that's 2 bytes - now that's 500 mbytes/sec, just for the pixel data.

      DDR3 @ 800mhz is a nice low-cost RAM rate, 32 bit wide, the power budget is around 300mW with DDR3L. you get 4 bytes so it's 3200 mbytes/sec. FIFTEEN PERCENT of the data bandwidth is taken up by a 4k frame @ 30 fps, 16bpp!

      if you went to 60fps, it would be 30%. if you went to 60fps 32bpp, it would be a whopping SIXTY PERCENT of the data bandwidth taken up just feeding the framebuffer, at 2000 mbytes/sec.
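
      as a rough sanity check of those figures, here's a minimal Python sketch that mirrors the same simplifying assumptions (800 million transfers/sec on a 32-bit bus, full frames re-sent every refresh):

      PIXELS_PER_FRAME = 3840 * 2160            # ~8.3 Mpixels for one 4k frame
      RAM_BW = 800e6 * 4                        # 800 M transfers/s x 4 bytes = 3200 Mbytes/sec

      for fps, bytes_per_px in [(30, 2), (60, 2), (60, 4)]:
          fb_bw = PIXELS_PER_FRAME * fps * bytes_per_px     # bytes/sec for the framebuffer alone
          print(f"{fps} fps @ {8 * bytes_per_px} bpp: "
                f"{fb_bw / 1e6:.0f} Mbytes/sec = {100 * fb_bw / RAM_BW:.1f}% of DDR3 bandwidth")
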
      How about having separate RAM for the framebuffer?



      • #33
        Originally posted by lkcl View Post
        stuff
        Ok. I just wasted some time reading the libre-RISC dev mailing list. There was a whole lot of 20-year-old politics and ranting, and (surprise) no technical content.

        You have *a lot* to prove that this is a real project, and that the RISC-V Foundation isn't right to just ignore you as a weirdo (based on what I read, I suspect that your new contacts at POWER and MIPS are going to come to the same conclusion in due course).

        Here's the thing: I too had a chat about modifying RISC-V for GPU over a beer at the pub with some mates. We came to the conclusion that effectively it would need to be a completely different ISA that would bear little resemblance to RISC-V. That you've been "working" on the project for this long and suddenly you can swap to POWER or MIPS probably indicates that (1) you came to the same conclusion and (2) you haven't actually done any real engineering yet.

        There is nothing stopping you building an FPGA demo of your idea using any processor ISA you like. Let's face it, all you're proposing is to compile a Vulkan software renderer and start to add processor features in a [very likely futile] attempt to make it fast enough.

        I would suggest that you fork the Rocket RISC-V core, and go back to the RISC-V Foundation when you have a rotating cube running on a cheap Xilinx chip.



        • #34
          Originally posted by OneTimeShot View Post
          Here's the thing: I too had a chat about modifying RISC-V for GPU over a beer at the pub with some mates. We came to the conclusion that effectively it would need to be a completely different ISA that would bear little resemblance to RISC-V. That you've been "working" on the project for this long and suddenly you can swap to POWER or MIPS probably indicates that (1) you came to the same conclusion and (2) you haven't actually done any real engineering yet.
          Yup. That's why it is obviously a joke project. I said this in another post, which seems to have been deleted (?), but I stand by that.

          There is nothing stopping you building an FPGA demo of your idea using any processor ISA you like. Let's face it, all you're proposing is to compile a Vulkan software renderer and start to add processor features in a [very likely futile] attempt to make it fast enough.

          I would suggest that you fork the Rocket RISC-V core, and go back to the RISC-V Foundation when you have a rotating cube running on a cheap Xilinx chip.
          Exactly. An entirely unproven approach is not exactly convincing enough to warrant getting an official RISC-V ISA extension. In fact, I don't know *any* standards organisation that works this way.



          • #35
            Originally posted by hlandau View Post
            In fact, due to similar PHY design between PCIe, USB3, SATA, DisplayPort PIPE has been adapted to support all of these. So common interfaces are available for PHYs. Though these are intended for intra-chip use, inter-chip use may be feasible. There are also standalone PCIe PHY chips, but only at x1, but studying their interfaces may be helpful: https://www.nxp.com/docs/en/data-sheet/PX1011B.pdf
            ah, excellent! that's a really good find, thank you. i've added it to the list. 250MHz is well within what's achievable, and the number of pins is not too mad (8-bit TX data bus, 8-bit RX data bus).

            As far as I can tell your project doesn't have an IRC channel, getting one might be a good idea. There's much that can be discussed here.
            i set one up, but it's rarely used - due to the timezone differences we tend to use the lists.



            • #36
              Originally posted by OneTimeShot View Post

              Here's the thing: I too had a chat about modifying RISC-V for GPU over a beer at the pub with some mates. We came to the conclusion that effectively it would need to be a completely different ISA that would bear little resemblance to RISC-V.
              yes. we studied AMDGPU, and the Mali Midgard (Panfrost) docs, as well as MIAOW, Nyuzi and others, and also talked a lot with Mitch Alsup (on comp.arch). swizzle, which is extremely high-priority, takes *twenty bits* to properly specify (4 for dest, 2x4 for src1, 2x4 for src2) - it's *28* for a 3-src operation! no wonder GPU ISAs are 64 to 128 bit!
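
              as a quick sanity check of those sizes, a minimal Python sketch, assuming a 4-bit destination write-mask plus a 2-bit x/y/z/w component selector per lane, 4 lanes per source operand:

              DEST_MASK_BITS   = 4          # one write-enable bit per .xyzw lane
              SRC_SWIZZLE_BITS = 4 * 2      # 4 lanes, 2-bit component selector each

              for n_src in (2, 3):
                  total = DEST_MASK_BITS + n_src * SRC_SWIZZLE_BITS
                  print(f"{n_src}-source op needs {total} swizzle bits")   # -> 20 and 28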

              the only approach we could come up with that meets the requirement (of not doing a full custom ISA rewrite) is Simple-V, which is an *ISA-INDEPENDENT* scheme - a "context" system (register tagging).

              SV takes *ANY* ISA, lifts up the skirts, and shoves in a hardware-for-loop around *scalar* instructions, pausing the Program Counter whilst the register numbers are incremented on the *same* scalar instruction, pushed into the instruction queue.
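
              to illustrate the idea only - this is not the actual SV decoder, just a conceptual Python sketch of the hardware-for-loop over one scalar instruction:

              def sv_issue(op, rd, rs1, rs2, vl, regs):
                  """One *scalar* instruction, re-issued vl times with incrementing
                  register numbers while the Program Counter stays put."""
                  for i in range(vl):                       # the hardware-for-loop
                      regs[rd + i] = op(regs[rs1 + i], regs[rs2 + i])

              # example: an ordinary scalar add behaves as a vector add of length 4
              regs = list(range(32))
              sv_issue(lambda a, b: a + b, rd=8, rs1=16, rs2=24, vl=4, regs=regs)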

              therefore, yes, we can just move to PowerPC. with some futzing about. which is what we're evaluating and thinking through.

              I would suggest that you fork the Rocket RISC-V core, and go back to the RISC-V Foundation when you have a rotating cube running on a cheap Xilinx chip.
              we looked at it, and - as you also determined - the amount of engineering required is so high that it would be a near-rewrite. Chisel3 is so impossible to comprehend that we decided not to go with it. we also looked at the Shakti cores, however they are in BSV, and although the verilog output is usable, the compilers are proprietary, and we'd (again) need to do a near-rewrite.
              Last edited by lkcl; 21 October 2019, 10:02 AM.



              • #37
                Originally posted by archsway View Post

                How about having separate RAM for the framebuffer?
                we looked at it (after all, discrete GPUs have their own DRAM, typically GDDR), here's how it went:

                each DDR3/DDR4 32-bit RAM interface needs around 100 pins: 80 (or so) for signalling, and about 20 (or so) for power/GND. you can see here:
                https://libre-riscv.org/shakti/m_class/pinouts/ - see section 2.2 "DDR3"

                so if we added 2x DDR3/4 interfaces, it would no longer be a USD $4, 300-pin ASIC in a power budget of 2.5 watts, it would be a USD $5-6, 400-pin ASIC in a power budget of 3+ watts.
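
                a rough, illustrative Python tally of the pin count (assuming the 300-pin figure implies roughly 200 pins for everything other than the DDR interfaces - the dollar and watt figures above are estimates, not vendor data):

                SIGNAL_PINS, PWR_GND_PINS = 80, 20           # rough, per 32-bit DDR3/4 interface
                pins_per_if = SIGNAL_PINS + PWR_GND_PINS     # ~100 pins each

                for n_if in (1, 2):
                    total = 200 + n_if * pins_per_if         # ~200 pins for the rest of the SoC
                    print(f"{n_if}x DDR3/4 interface(s): ~{total}-pin package")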

                also the licensing costs of the DDR3/4 PHY would be doubled, from USD $1-2m to USD $2-4m.

                DDR PHYs are *insanely* expensive to license.

                now, that's not to say that we *can't* do this - if a sponsor or customer comes forward and is prepared to put up the money, anything is possible. if they don't, we'll proceed with a test ASIC that's achievable in a reasonable budget and work our way up, minimising risk as much as possible along the way.
                Last edited by lkcl; 21 October 2019, 10:39 AM.



                • #38
                  Originally posted by madscientist159 View Post

                  Yes, this makes sense, and for what we're looking at the power budget isn't as critical. I know the original application was mobile, but I just wonder if there's any way to do this where the same chip can do double duty and fill the void that seems to exist right now with our current desktops -- even if the mobile flavor of the chip has one of the DDR controllers lasered off, it'd still be better to have the capability in the design I would think?
                  yes, from what i gather, you only pay for RTLs that you actually use (bring out to actual pins). DDR3/4 PHY layouts can be duplicated: they are a pre-done "thing" that you license per geometry per foundry (USD 0.5 to 1 million for a 28nm TSMC DDR PHY; a different company will have one - a completely different product - for 20nm GF, and so on), and you only actually pay the license fee if you *use* it.

                  so we can _do_ it.... just please do not be shocked at the DDR3/4 PHY licensing costs: these are just "normal" costs for the ASIC industry.

                  we will however have to make sure that the internal data buses (we're going with Wishbone, to avoid ARM patent licensing issues) are wide enough to cope with the extra load.


                  oh, i don't know if you spotted, from the last update: we re-discovered a technique used by IBM which halves (or doubles, depending on perspective) the pipeline length. every other pipeline "latch" is made "transparent" (a combinatorial bypass Mux is added).

                  normally, if you have say a 10-stage pipeline, and you halve the clock rate, the latency is (obviously) doubled. however, if you have "transparent latches" on every other stage, you can open those up and *HALVE* the pipeline length to 5 stages. each stage will now take (appx) twice as long to stabilise due to gate "ripple", but the clock rate is *already low enough to cope*.

                  if you start to increase the clock rate so much that the 5-stage pipeline is at risk of over-running the gate ripple and not having enough time to stabilise, you just pause things for a few cycles to let the pipelines clear, then break the transparent latches, put it back to a 10-stage pipeline, and now it's safe to go twice as fast.

                  it's a bit of a mind-bender because in each configuration (5-stage @ 800MHz vs 10-stage @ 1.6GHz) the actual instruction latency (completion time in ns) is identical.

                  basically what i am saying is that we have a technique whereby the former clock rate of 800MHz can easily be doubled to 1.6GHz, by breaking everything down into much shorter pipeline stages and having these alternate transparent latches.
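
                  a minimal nmigen-style sketch of the idea (illustrative only, not the project's actual code): every other pipeline register gets a combinatorial bypass mux, so asserting `bypass` makes that stage transparent and halves the effective pipeline depth.

                  from nmigen import Elaboratable, Module, Signal, Mux

                  class TransparentStage(Elaboratable):
                      def __init__(self, width=64):
                          self.i      = Signal(width)  # data from the previous stage
                          self.o      = Signal(width)  # data to the next stage
                          self.bypass = Signal()       # 1 = transparent (combinatorial), 0 = normal latch

                      def elaborate(self, platform):
                          m = Module()
                          latched = Signal.like(self.i)
                          m.d.sync += latched.eq(self.i)   # the ordinary pipeline register
                          m.d.comb += self.o.eq(Mux(self.bypass, self.i, latched))
                          return m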

                  so we can actually do desktop-level speeds.

                  no, we do not want, at this stage, to try going to 3ghz, sorry



                  • #39
                    Originally posted by Qaridarium
                    I also think POWER is better than RISC-V

                    also it is better strategy to make big-gpu for desktop and server
                    first, you can't make a big GPU from either POWER or RISC-V as such - your choices are only different grades of small. second, how big the GPU is doesn't depend on the POWER-vs-RISC-V choice; it's determined by the implementation, which can be of varying speed.



                    • #40
                      Originally posted by DMJC View Post
                      I just wish we could get the code to SGI IRIX opened up. Now THAT would make MIPS a much more interesting platform. I'm a bit surprised that they (Libre) haven't chosen MIPS for a GPU since it has a very long/proven track record of usage in graphics.
                      Now that SGI has faded into HPE, that hope is gone.

