Libre RISC-V GPU Aiming For 2.5 Watt Power Draw Continues Being Plotted


  • #31
    Originally posted by AndyChow
    Also, and I wish some expert would come in to analyse this project, but doesn't making such a chip require hundreds of thousands, if not millions, of man-hours?
    I wrote firmware for a semiconductor company some years ago. My knowledge is a bit dated, but perhaps still relevant. First, many hardware designs use rather high-level languages, similar to software programming languages. This also opens the door for scripts to be used to automatically generate some repetitive structures (perhaps not unlike the fabric generator lkcl mentions below).
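
    As a toy illustration (the register names and the emitted Verilog are invented for the example, not from any real project), this is the kind of repetitive structure a short script can stamp out:

    Code:
    # toy generator: emit a bank of identical memory-mapped registers
    # instead of hand-writing each one (names and offsets invented)
    REGS = ["ctrl", "status", "irq_mask", "irq_pend"]

    def reg_bank(regs, width=32):
        lines = []
        for offset, name in enumerate(regs):
            lines.append(f"reg [{width - 1}:0] {name};  // offset 0x{offset * 4:02x}")
        return "\n".join(lines)

    print(reg_bank(REGS))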

    Second, there is (or used to be) a distinction between how a small company designs an ASIC and how the big boys do it. Little guys would design in HDL (Hardware Description Language) and run some synthesis tool to generate a lower-level representation that references a library of the actual gate structures, provided by the fab. The Big Boys would do a so-called "full custom" chip, where they would design much more by hand. The difference is a significant multiple in power/performance. That's a big reason for the differential in engineering hours required by leading edge chip vendors vs. some small shop cranking out special-purpose ASICs. Now, I could certainly believe the synthesis and layout tools have progressed since then.

    BTW, speaking of tools and ASIC libraries, are those going to be open source? If so, how good are the open source synthesis tools?

    If not, doesn't it make the effort that much less useful? I mean, open source software is something I can compile for myself. Whereas, with open source hardware, I just have to take someone's word that a chip was synthesized from some particular revision of the source - I have no way of being absolutely sure, and no practical way of modifying it even if I wanted to. And from a security perspective, even the hardware designers can't be sure that someone at the fab didn't monkey with their RTL.



    • #32
      Originally posted by lkcl
      now, that's just the core. the peripherals are a different matter, and to help deal with that, i went to visit the shakti team in india, and spent several weeks capturing their knowledge with an "auto fabric generator". you specify the interfaces at a high level, write them out in a TSV file, run a command, and the code, complete with an AXI4Lite interface, is *literally* auto-generated within half a second.

      it had taken the team literally man-years to write that same code by hand.
      Okay, so even if you publish the input and output, I take it their tool is proprietary? Seems less than ideal.
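
      For what it's worth, here is a hypothetical sketch of what such a TSV-driven generator might look like; the column names and emitted output are invented, since the real tool isn't published:

      Code:
      # hypothetical TSV-driven interface generator (columns invented)
      import csv

      def generate(tsv_path):
          with open(tsv_path) as f:
              for row in csv.DictReader(f, delimiter="\t"):
                  name, base, width = row["name"], row["base_addr"], row["width"]
                  print(f"// AXI4-Lite slave stub: {name} at {base}, {width}-bit")
                  # a real generator would emit ports, an address decoder
                  # and the read/write handshake logic here

      generate("peripherals.tsv")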

      Also, are you designing your own memory controller, or stuff like that? If not, will it be proprietary IP?

      Originally posted by lkcl
      in addition, we have a significant advantage over any proprietary corporation: the freedom to talk to absolutely anyone, anywhere in the world. no proprietary company would ever let its employees talk to experts over the internet, let alone publish the full details of their work *as it was being developed*,
      Huh? There are loads of companies developing open source software who do exactly that.

      Sure, the economics of hardware design has made it less open, but your statement didn't seem limited to just hardware companies.



      • #33
        Originally posted by coder
        The Big Boys would do a so-called "full custom" chip, where they would design much more by hand.
        sigh, i'm actually much more comfortable with gate-level design, and in talking with mitch alsup i was able to follow what he was saying.

        BTW, speaking of tools and ASIC libraries, are those going to be open source? If so, how good are the open source synthesis tools?
        they're surprisingly extensive: yosys is the main one. the shakti team, when i met them 8 months ago, were just about to embark on parallel tracks: a full custom layout of a 180nm chip using entirely libre tools, separate from and side-by-side with entirely proprietary tools, in order to make a proper comparison.
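
        as a rough idea of what the libre flow looks like, here's a minimal sketch driving yosys from python (top.v and cells.lib are placeholder file names, not from the shakti project):

        Code:
        # minimal sketch: drive yosys from python. file names are
        # placeholders; a real flow adds many more steps.
        import pathlib
        import subprocess

        # read RTL, run generic synthesis, map flip-flops and
        # combinational logic to the fab's cell library, write netlist
        script = """
        read_verilog top.v
        synth -top top
        dfflibmap -liberty cells.lib
        abc -liberty cells.lib
        write_verilog -noattr netlist.v
        """
        pathlib.Path("synth.ys").write_text(script)
        subprocess.run(["yosys", "-s", "synth.ys"], check=True)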

        the motivation being that for security-sensitive applications (such as a processor used in India's Fast Breeder Nuclear Reactor Programme), they absolutely cannot, under any circumstances, trust proprietary tools made by a foreign power.

        If not, doesn't it make the effort that much less useful? I mean, open source software is something I can compile for myself. Whereas, with open source hardware, I just have to take someone's word that a chip was synthesized from some particular revision of the source - I have no way of being absolutely sure, and no practical way of modifying it even if I wanted to. And from a security perspective, even the hardware designers can't be sure that someone at the fab didn't monkey with their RTL.
        well, there we have to think through the consequences for the fab if they were found to have been involved in tampering with a design. the loss of business would be catastrophic.



        • #34
          Originally posted by coder
          Also, are you designing your own memory controller, or stuff like that? If not, will it be proprietary IP?
          sigh, a DDR3/4 PHY is the biggest bitch here. it involves analog layout that has to be customised to the fab and the node, and the layout alone takes a YEAR. i've received a quote of USD 600,000 to do a libre DDR3 PHY layout. that's excluding the design, and excluding DDR4. chances are, then, realistically, that we'll need to license a proprietary LPDDR/DDR3/4 PHY. i don't like it.

          so one of the things we'll do is add several HyperRAM interfaces, all running DDR (300MHz). that way, if we get a customer that absolutely insists on not using proprietary PHYs, it's ok. yes, there's a libre version of HyperRAM available.

          Huh? There are loads of companies developing open source software who do exactly that.
          exactly. cooperation has finally become a respected business model.



          • #35
            Originally posted by lkcl
            ...
            Thanks for your replies. I do hope you're successful.

            The way I see it, GPUs are all about power-efficiency. The more efficient the design, the better it scales and the faster you can clock it. SMT designs are inherently more power-efficient than OoO, due to the latter's scheduling-logic overhead (which scales nonlinearly). And SMT is potentially better at hiding the long latencies of modern memories (especially if you're not going to burn even more power on speculatively executing results that might just get discarded). I assume SMT is simpler to design and easier to pipeline, as well.

            Finally, I think security concerns with SMT can be mitigated by limiting core-sharing to threads from the same process. In graphics workloads, there's enough concurrency that this shouldn't be a problem.



            • #36
              Originally posted by coder
              Thanks for your replies. I do hope you're successful.
              appreciated

              The way I see it, GPUs are all about power-efficiency. The more efficient the design, the better it scales and the faster you can clock it. SMT designs are inherently more power-efficient than OoO, due to the latter's scheduling-logic overhead (which scales nonlinearly). And SMT is potentially better at hiding the long latencies of modern memories (especially if you're not going to burn even more power on speculatively executing results that might just get discarded). I assume SMT is simpler to design and easier to pipeline, as well.

              Finally, I think security concerns with SMT can be mitigated by limiting core-sharing to threads from the same process. In graphics workloads, there's enough concurrency that this shouldn't be a problem.
              SIMT, as i understand it, is basically a fancy name for predicated (maskable) SIMD. i've seen, experienced and heard enough about SIMT to know that it's a complete pig to program.
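
              (a toy numpy model of predicated simd, just to make "maskable" concrete, and not from our design: every lane computes, and the per-lane mask decides which results are kept.)

              Code:
              # toy model of predicated (masked) SIMD: all lanes execute,
              # the per-lane mask selects which results are committed.
              import numpy as np

              x = np.array([1.0, 4.0, 9.0, -1.0])
              mask = x >= 0.0                              # per-lane predicate
              out = np.where(mask, np.sqrt(np.abs(x)), x)  # masked-off lanes keep x
              print(out)                                   # [ 1.  2.  3. -1.]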

              we'd like to do something slightly different, particularly given that this has to be a hybrid processor that's capable of general-purpose workloads.

              one of the reasons i believe OoO is inefficient is the excessive use of CAMs, particularly in the Reorder Buffer of the Tomasulo Algorithm. alternatively, if a Scoreboard system is deployed instead (a system which has been completely misunderstood by the academic community), an "Architectural Register File" (another CAM) is added, which maps architectural register numbers onto a much larger set of internal registers.
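
              (a toy software model of why a CAM is a power-sucker: in silicon, the tag is compared against *every* entry in parallel, every cycle. a sketch only, not our design.)

              Code:
              # toy CAM lookup: in hardware, every one of these comparisons
              # happens simultaneously each cycle, so power grows with the
              # number of entries.
              def cam_lookup(entries, tag):
                  return [i for i, entry in enumerate(entries) if entry == tag]

              print(cam_lookup([7, 3, 7, 1], 7))  # [0, 2]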

              i spent over 10 weeks interacting with mitch alsup on comp.arch, and not only did i learn a huge amount about the CDC 6600, we also came up with a scheme that provides precise exceptions, operand forwarding, nameless registers *and historic rollback* and much more, all *without* requiring *any* CAMs. i documented this for the crowdsupply page. a preview of the updates that's going out is here, if you're interested to see them before they're published officially: https://git.libre-riscv.org/?p=crowd....git;a=summary

              basically there's quite a lot of design work going on, including input from the software development side. we're not doing "just the hardware and then writing the software", and neither are we doing "just the software and then designing the hardware": we're doing iterative feedback on *everything*.



              • #37
                Originally posted by AndyChow

                Well, if you actually deliver hardware, you can "Shut up and take my money".
                it'll be a couple of years yet



                • #38
                  Originally posted by lkcl
                  from the published academic literature on RISC-V. it's down to the reduced instruction-cache size, from the use of Compressed instructions. a 20-25% reduction in code size results in a smaller I-cache being needed, which in turn results in an approximately 40% power reduction. CAMs (Content-addressable Memory) are power-suckers
                  Jeremy Bennett found that ARM Thumb is denser than RISC-V compressed: https://fosdem.org/2019/schedule/event/riscvcompact/

                  Also, I guess you know a cache is not a CAM. You only need as many comparators as the number of ways of your cache, and even then you can use way predictors. Anyway, even if you completely removed the I-cache, you wouldn't gain 40% in power.
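
                  To illustrate (parameters invented): in an n-way set-associative cache, only the tags of the selected set are compared, so the comparator count equals the number of ways, not the number of cache lines.

                  Code:
                  # sketch of an n-way set-associative lookup: only `ways`
                  # tag comparisons per access, regardless of cache size.
                  def cache_hit(tags, addr, ways, sets, line_bytes=64):
                      index = (addr // line_bytes) % sets  # select the set
                      tag = addr // (line_bytes * sets)    # remaining bits
                      return any(tags[index][w] == tag for w in range(ways))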



                  • #39
                    Originally posted by lkcl
                    SIMT, as i understand it, is basically a fancy name for predicated (maskable) SIMD. i've seen, experienced and heard enough about SIMT to know that it's a complete pig to program.
                    Uh, it's an Nvidia marketing term, really. SIMD is normally predicated, but Nvidia added a few sensible extensions to SIMD that support the idea of treating each lane as a conjoined thread. They're spinning this as if it's some fancy new thing, but IMO it's really a small evolution of standard SIMD.

                    Originally posted by lkcl
                    we'd like to do something slightly different, particularly given that this has to be a hybrid processor that's capable of general-purpose workloads.
                    @ldesnogu had a good point about Larrabee (AKA Xeon Phi), the ultimate incarnation of which was basically Knights Landing (KNL).

                    It extended Intel's low-power, 2-way OoO Silvermont core with 4-way SMT and dual 512-bit vector engines. These cores were interconnected using the forerunner of Skylake-SP's mesh interconnect. Add 16 GB of HMC2 in-package DRAM and a 384-bit DDR4 interface, and you've got a hot mess of x86 floating-point horsepower that was still only about half as fast as the Nvidia P100 GPU against which it launched.

                    Its main benefits of x86 backward compatibility and large memory support (384 GB of DDR4) weren't enough to save it.

                    Originally posted by lkcl
                    which maps architectural register numbers onto a much larger set of internal registers.
                    Another neat thing about GPUs is that they can break ISA-level backward compatibility potentially every generation. So, there's no need for inefficient things like register renaming.

                    Originally posted by lkcl
                    all *without* requiring *any* CAMs.
                    I think you're wise to be wary of CAMs.

                    On a related note, cache coherency is a huge obstacle to scaling multi-core designs. These guys claim 60% of the power burned in modern CPUs is lost in the cache hierarchy:
                    They also claim that "Data movement now requires over 40x more energy than an actual calculation." Quite eye-opening.

                    Going back to Larrabee, I found this analysis of its predecessor, from a decade ago, where they state:
                    As data transfer on chip costs significant energy, larger caches will be required to keep the data local. Maintaining coherency across many cores is a significant challenge as well. Hardware costs and increased coherency traffic on the mesh will pose hurdles for completely hardware-based coherent systems. Instead, future terascale processors will explore message-passing architectures. Special on-die, message-passing hardware is very efficient for core-to-core communication, making software-based coherency with hardware assists a viable solution for the future.
                    However, I think the lure of x86 compatibility doomed them to implement full hardware-based coherency in Xeon Phi.



                    • #40
                      Originally posted by AndyChow
                      Also, and I wish some expert would come in to analyse this project, but doesn't making such a chip require hundreds of thousands, if not millions, of man-hours?
                      I have been involved in CPU design teams for 20 years. You can achieve great results with small teams, but of course you likely won't be competitive with higher-end chips. And you still need millions of dollars to pay for development and masks. Parallella is a good example of that: https://www.parallella.org/2017/01/2...-from-scratch/

                      I really want them to succeed, but some of the claims made here sound fishy, as I pointed out in previous messages in this thread.

