Libre RISC-V GPU Aiming For 2.5 Watt Power Draw Continues Being Plotted

  • #21
    It seems you and everyone else on the team have everything under control and are thinking about things long term.

    I wish you nothing but good luck in making this a pretty cool toy

    Originally posted by programmerjake View Post
    To mitigate spectre-class bugs, we have a speculation fence instruction and we are designing so that speculation isn't visible outside a core, so we don't have speculative cache fills unless we have a mechanism that ensures that they aren't visible to other cores while they are speculative.
    It's good to see there is thought being put into making things nice and secure!
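    As a toy illustration of the class of bug that fence guards against, here is a software model of a Spectre-v1 style gadget. The names (`gadget`, `load`, `SECRET`) are invented for illustration, and real speculation happens in hardware and cannot be demonstrated from Python, so the "transient" load is modelled explicitly; this is not a description of the project's actual design.

```python
# Toy software model of a Spectre-v1 gadget leaking through the cache.
SECRET = [42, 7, 99]
PUBLIC = [1, 2, 3, 4]
cache = set()                       # lines touched by any load, incl. transient ones

def load(table_id, index):
    cache.add((table_id, index))    # every load fills a cache line
    return index

def gadget(index, fence=False):
    if index < len(PUBLIC):
        return load("public", index)            # architecturally allowed
    if not fence:
        # A CPU without a speculation fence may already have issued the
        # out-of-bounds load before the branch resolved; the value is
        # discarded, but the cache fill remains observable to an attacker.
        load("secret", index - len(PUBLIC))
    return None                                  # with a fence, nothing leaks

gadget(5)                            # no fence: transient fill happens
leaked = ("secret", 1) in cache
cache.clear()
gadget(5, fence=True)                # fence blocks the transient load
safe = ("secret", 1) not in cache
```

    The design goal quoted above (speculation not visible outside a core) amounts to never letting the transient `cache.add` on the mispredicted path become observable.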



    • #22
      Originally posted by programmerjake View Post
      One thing to keep in mind is that we are aiming for the whole SoC to be around 2-3mm2 in 28nm, so it is much smaller than the RPI v3's SoC, and hopefully less expensive. We're aiming for around $4 per chip.
      Please note that I did already edit that post with (hopefully) more accurate specs on the quad-A53 cluster of Pi v3. Its NEON throughput isn't as high as I expected. It seems to do about 1 fp32 MAC per cycle per core, rather than 4.

      However, when you're quoting the area of the Pi's entire SoC, you need to acknowledge its 28.8 GFLOPS GPU, hardware video decoder, etc.

      Originally posted by programmerjake View Post
      One part that is different than just a pure CPU is that we are supporting more FP-div performance than needed for a CPU, to handle perspective projection.
      A fast reciprocal operation is certainly pretty useful in graphics. I don't know if that would give you enough precision to entirely forgo fp-div, but there are a lot of cases where a full division would be overkill. I would also recommend a fast sqrt. Some chips I've seen even have a dedicated instruction for 1/sqrt.

      IIRC, AMD had an interesting approach with 3DNow!, where the programmer could trade off execution time vs. accuracy for such operations.
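      As a concrete sketch of that time-versus-accuracy dial, here is the classic fp32 fast reciprocal-square-root (the well-known 0x5f3759df bit trick plus Newton-Raphson refinement), where the iteration count plays the same role as choosing between a cheap estimate and extra refinement steps. This is an illustrative software version, not what any particular GPU implements in hardware.

```python
import struct

def fast_rsqrt(x, iterations=1):
    """Approximate 1/sqrt(x) for positive, finite fp32-range x."""
    # Reinterpret the float's bits as an integer and form a cheap
    # initial estimate (the famous 0x5f3759df magic constant).
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # Each Newton-Raphson step roughly doubles the number of good bits,
    # so `iterations` is a direct speed/accuracy tradeoff.
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

      One iteration already lands within roughly 0.2% of the true value, which is enough for many graphics uses; a second iteration buys several more bits.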

      Originally posted by programmerjake View Post
      We are adding more cores before making the ALUs wider because we also intend for the cores to act as a CPU, where 4 cores is definitely better than 1 core for non-FP stuff. We are supporting variable-length vectors in the ISA, up to 256 elements (in my SVprefix proposal), so that will help improve ALU utilization and reduce power usage.
      Wider vectors will scale more efficiently than more cores. Exhibit A) Intel's HD Graphics uses 2x 4-wide SIMD per core, with lots of cores, whereas AMD uses far fewer cores with far wider (4x 16-wide) SIMD. Guess whose APUs have better graphics performance...
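      To make the variable-length-vector point concrete, here is a minimal software model of RVV/SV-style strip-mining, assuming the 256-element cap from the quoted SVprefix proposal. The `setvl` name mirrors the RISC-V vector idiom; the model itself is illustrative, not the project's code.

```python
MAXVL = 256   # element cap, per the SVprefix proposal quoted above

def setvl(remaining):
    # The hardware grants a vector length up to MAXVL; software simply
    # asks for "what's left" and loops until the array is consumed.
    return min(remaining, MAXVL)

def vec_add(a, b):
    """Add two equal-length arrays using vector-length-agnostic strip-mining."""
    out = []
    i = 0
    while i < len(a):
        vl = setvl(len(a) - i)      # one "vector op" per loop trip
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out
```

      The same binary loop works unchanged whether the hardware implements 4 lanes or 256, which is the ALU-utilization and portability argument being made.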

      Originally posted by programmerjake View Post
      We are also aiming for dual-issue OoO execution to help non-FP execution.
      Hmmm... it's starting to sound less and less like a GPU. I would add SMT before going OoO. I'm not aware of any GPUs that are OoO, but probably all are now using SMT.

      TBH, I don't know much about mobile GPUs, so that might be where I'm wrong. But I doubt it...

      I'm not necessarily saying you should build a copycat of a typical GPU. But, neither should you ignore the wisdom in their designs.
      Last edited by coder; 19 February 2019, 03:56 AM.



      • #23
        Originally posted by coder View Post
        Hmmm... it's starting to sound less and less like a GPU. I would add SMT before going OoO. I'm not aware of any GPUs that are OoO, but probably all are now using SMT.

        TBH, I don't know much about mobile GPUs, so that might be where I'm wrong. But I doubt it...

        I'm not necessarily saying you should build a copycat of a typical GPU. But, neither should you ignore the wisdom in their designs.
        Indeed. Studying the failed Larrabee is a must.



        • #24
          That site is full of pearls like this:

          With RISC-V being 40% more power efficient than x86 or ARM, this is very reasonably achievable.
          Being enthusiastic is really nice. But being credible is also needed. I wonder where that claim comes from.



          • #25
            Originally posted by ldesnogu View Post
            Being enthusiastic is really nice. But being credible is also needed. I wonder where that claim comes from.
            working on fixing that



            • #26
              Originally posted by wizard69 View Post

              Well power is always a concern in engineering a chip. GPUs are a perfect example here as the manufacturers can vary the number of execution units (among other things) to hit a power specification. I’m pretty sure the primary goal Apple has when designing their A series chips is to operate under a fixed power level. Frankly there are many situations where power is a driving factor in chip design.

              It is a completely different question as to why this guy chose the power levels alluded to. Especially when you consider the relatively low performance goals. Frankly I suspect by the time the chip is in silicon it will be woefully outdated.
              I absolutely get what you are saying. But in your example of the GPU, I assume they first figure out a design, then implement a working unit, and only in the final steps tweak the design to hit a power specification. I doubt they plan their power target without first having figured out how to make execution units that function how they want.

              So while power might be the driving factor in chip design, I think they first have a chip design that works, plus tons of experience, and then they improve that design to be either power efficient, or fast, or whatever.

              Again, I have no idea how chips are designed. But it does seem to me like those details are figured out in the middle/end phases, after you have a proof-of-concept working prototype. And in many cases (Intel, AMD), power reduction was an evolutionary process after years of experience and expertise. Here, we seem (IMO) to have a vaporware concept, and already the TDP and GFLOPS numbers are known quantities? It's a red flag in my head.

              Also, and I wish some expert would come in to analyse this project, but doesn't making such a chip require hundreds of thousands, if not millions, of man-hours?



              • #27
                Originally posted by wizard69 View Post

                Well power is always a concern in engineering a chip. GPUs are a perfect example here as the manufacturers can vary the number of execution units (among other things) to hit a power specification.
                exactly. in this design, we can dial up the number of ports to the register file, increase the number of instructions issued per clock, increase the internal bus bandwidth to compensate, and also increase the number of cores.

                It is a completely different question as to why this guy chose the power levels alluded to. Especially when you consider the relatively low performance goals. Frankly I suspect by the time the chip is in silicon it will be woefully outdated.
                the reason for the minimum specification and the extremely tight power budget is to meet a sponsor's technical requirements for an ultra-low-power application. anything exceeding those requirements will be extremely nice; however, if the sponsor is to put up the funding, their requirements need to be met.

                if additional sponsors, clients, or investors happen to come forward, we can look at meeting alternative requirements.



                • #28
                  Originally posted by ldesnogu View Post
                  That site is full of pearls like this:

                  With RISC-V being 40% more power efficient than x86 or ARM, this is very reasonably achievable.

                  Being enthusiastic is really nice. But being credible is also needed. I wonder where that claim comes from.
                  from the published academic literature on RISC-V. it's down to the reduced instruction-cache size from the use of Compressed instructions: a 20-25% reduction in code size means a smaller I-cache is needed, which in turn results in an approximately 40% power reduction. CAMs (Content-Addressable Memories) are power-suckers.



                  • #29
                    Originally posted by AndyChow View Post
                    already the TDP and GFLOPS numbers are known quantities?
                    no, they are goals. when the goals are achieved, we can contact the client and say "hey client, goal's been met, how about that $?"

                    Also, and I wish some expert would come in to analyse this project,
                    it would be even nicer if they helped to actually get it done as well.

                    but doesn't making such a chip require hundreds of thousands, if not millions, of man-hours?
                    a CISC chip like an x86, with billions of transistors? yes absolutely.

                    jacob wrote a simple rv32 design in around two weeks flat. it comes in at around 1,000 lines of verilog. the shakti team leader told me that with bluespec he could easily write a decent design in around six weeks.

                    now, that's just the core. the peripherals are a different matter, and to help deal with that, i went to visit the shakti team in india, and spent several weeks capturing their knowledge in an "auto fabric generator". you specify the interfaces at a high level, write them out in a TSV file, run a command, and the code, complete with an AXI4-Lite interface, is *literally* auto-generated within half a second.

                    it had taken the team literally man-years to write that same code by hand.
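                    As a rough sketch of what such a TSV-driven generator does, here is a toy version: each row names a peripheral and its register count, and a bus-slave stub is emitted per row. The column names, input, and output format are invented for illustration; the real tool emits actual AXI4-Lite fabric code, not comment stubs.

```python
import csv
import io

# Hypothetical two-column spec: peripheral name and number of
# memory-mapped registers. The real tool's schema is much richer.
SPEC = "name\tregs\nuart0\t4\ngpio0\t2\n"

def generate(tsv_text):
    """Expand each TSV row into a (simplified) bus-slave stub."""
    out = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter='\t'):
        n = int(row['regs'])
        out.append(f"module {row['name']}_slave;  // {n} memory-mapped registers")
        for i in range(n):
            # 32-bit registers on a word-aligned, AXI4-Lite-style address map
            out.append(f"  // {row['name']}_r{i} at offset 0x{i * 4:02x}")
        out.append("endmodule")
    return "\n".join(out)

rtl = generate(SPEC)
```

                    The point of the approach is exactly this kind of leverage: the repetitive per-peripheral bus plumbing is mechanical, so it is generated rather than hand-written.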

                    in addition, we have a significant advantage over any proprietary corporation: the freedom to talk to absolutely anyone, anywhere in the world. no proprietary company would ever let its employees talk to experts over the internet, let alone publish the full details of their work *as it was being developed*: that would be viewed as corporate suicide. so of *course* they have to spend billions of dollars and millions of man-hours.



                    • #30
                      Originally posted by lkcl View Post

                      no, they are goals. when the goals are achieved, we can contact the client and say "hey client, goal's been met, how about that $?"
                      Well, if you actually deliver hardware, you can
                      Shut up and take my money

