Tachyum Gets FreeBSD Running On Their Prodigy ISA Emulation Platform For AI / HPC


  • #21
    Originally posted by Ladis View Post
    it was also a "morphing" CPU which had the same performance as Intel's, but at a fraction of the power consumption and in a smaller/cheaper chip.
    I hate to go off-topic, but that's hardly an accomplishment.
    I don't know of any modern CPU as slow and power-hungry as Intel's chips. The mad dash to replace them with ARM is similar to how we're desperate to replace the dilapidated Xorg stack. We simply can't go on like this; if ARM hadn't shown up when it did, computing would have been set back another couple of decades (after the setbacks Intel already put us through; thank god AMD at least picked up the slack).

    Comment


    • #22
      Originally posted by Ironmask View Post

      I hate to go off-topic, but that's hardly an accomplishment.
      I don't know of any modern CPU as slow and power-hungry as Intel's chips. The mad dash to replace them with ARM is similar to how we're desperate to replace the dilapidated Xorg stack. We simply can't go on like this; if ARM hadn't shown up when it did, computing would have been set back another couple of decades (after the setbacks Intel already put us through; thank god AMD at least picked up the slack).
      At the time, all software was closed source (and nobody knew about Linux), so it was a huge achievement: you could run the same x86 Windows OS and software. What good were ARM and the like to common people back then, when you couldn't run the OS, programs, and games everybody used?

      PS: Intel (and x86) didn't become more efficient by improving their architecture, but just by using a newer, smaller node. They were leading in chip fabrication at the time (and that lengthened the life of x86).

      Comment


      • #23
        One size fits all usually fits none.

        Comment


        • #24
          Originally posted by atomsymbol
          I think/believe the main points of the Prodigy CPU are:
          • Matrix multiplication acceleration instructions:
          On this point, the T864 datasheet claims "8 Tflops High Performance Computing" and "131 Tflops AI training and inference". The T16128, with 2x the cores, simply doubles those figures.

          In comparison:
          • AMD's MI250X (shipping since late last year) offers: 45.3 fp64 vector TFLOPS, 90.5 matrix TFLOPS, and 362 AI TFLOPS (fp16 or BF16).
          • Nvidia's A100 (launched 2020) offers: 9.7 fp64 vector TFLOPS and 312 AI TFLOPS (fp16).
          • Nvidia's H100 (coming in Q3) offers: 30 fp64 vector TFLOPS, 60 fp64 matrix TFLOPS, and 1000 AI TFLOPS (fp16).

          So, it can't touch GPUs, on any workloads suitable for them. Even if you ran them at lower clocks to rein in their power dissipation, they'd still run circles around it.
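          For a rough sense of scale, here's a quick back-of-envelope comparison of the quoted peak figures (vendor/datasheet numbers taken at face value; real sustained throughput will differ):

# Rough comparison of quoted peak figures; all numbers are vendor claims,
# not measurements, and the T16128 figures are simply 2x the T864 datasheet.
chips = {
    "Tachyum T16128": {"fp64": 16.0, "ai": 262.0},
    "AMD MI250X":     {"fp64": 45.3, "ai": 362.0},
    "Nvidia A100":    {"fp64": 9.7,  "ai": 312.0},
    "Nvidia H100":    {"fp64": 30.0, "ai": 1000.0},
}

base = chips["Tachyum T16128"]
for name, s in chips.items():
    print(f"{name:>15}: {s['fp64']:6.1f} fp64 TFLOPS ({s['fp64'] / base['fp64']:.1f}x), "
          f"{s['ai']:6.0f} AI TFLOPS ({s['ai'] / base['ai']:.1f}x)")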

          Originally posted by atomsymbol
          • VLIW instruction set architecture: up to 8 micro-ops per cycle to ALUs/load/store/branch/compare
          Huh. The datasheets for T864 and T16128 are both claiming "4 instructions per clock up to 4GHz", which hardly impresses.

          This is the typical problem you see with chip startups. They look at the spec sheets of current-gen GPUs and CPUs, go off and design a chip that can beat them, but forget that they'll be competing against chips two generations better by the time they get anything to market.

          The only recent HPC processor I've seen that's really impressed me (though A64FX deserves a special mention) is from Preferred Networks (PFN).

          PEZY-SC2 was pretty cool, including their mad TCI wireless in-package DRAM interface that outmatched any HBM2 of its day. Sadly, they don't seem to have done much since getting busted for financial fraud.

          Comment


          • #25
            Originally posted by sophisticles View Post
            One size fits all usually fits none.
            I remember when APUs got popular and my friend was like "wow it's two in one!" and I was like "it's the worst of both worlds, a bad CPU combined with a bad GPU" and later they were like "yeah you were right".
            Nobody in this thread seems to be able to find out anything about what this is even supposed to be. I'm guessing they found out what an FPGA is and are marketing it as a CPU that can optimize itself to do anything? So, snake oil.
            Now, if you want real innovation in AI chips, some startup (whose name I forget) is using SSD technology to encode artificial neural networks into solid-state chips. That one is actually going to be revolutionary.

            Comment


            • #26
              Originally posted by Ironmask View Post
              I remember when APUs got popular and my friend was like "wow it's two in one!" and I was like "it's the worst of both worlds, a bad CPU combined with a bad GPU" and later they were like "yeah you were right".
              I disagree. APUs had potential, but the software support simply wasn't there. If you had some floating-point heavy workload, even the small iGPUs traditionally had a lot more compute power than the CPU cores. Lately, CPU cores have done a lot to catch up, but iGPUs still pack considerably more FLOPS/W.

              The biggest limitation APUs face is memory bandwidth. DDR5 raises the bar on that, but still not enough. You'd have to go to in-package memory, like Apple, to scale them up to anything meaningful. For laptops, I guess you could use soldered-down GDDR6, because many are already using soldered RAM, anyhow.
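              To put rough, illustrative numbers on the bandwidth point (the peak-FLOPS and bandwidth figures below are round assumptions for the sake of argument, not measured specs):

# Roofline-style estimate: how many FLOPs per byte of memory traffic a kernel
# needs before an iGPU stops being memory-bound. All figures are assumptions.
igpu_peak_tflops = 4.0                        # assumed iGPU peak fp32 throughput
mem_configs_gbs = {
    "dual-channel DDR5":        80.0,         # assumed
    "soldered GDDR6":           400.0,        # assumed
    "in-package (Apple-style)": 800.0,        # assumed
}

for name, bw in mem_configs_gbs.items():
    flops_per_byte = igpu_peak_tflops * 1e12 / (bw * 1e9)
    print(f"{name:>25}: ~{flops_per_byte:5.1f} FLOPs per byte needed to stay compute-bound")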

              Originally posted by Ironmask View Post
              I'm guessing they found out what an FPGA is and are marketing it as a CPU that can optimize itself to do anything? So, snake oil.
              No, for two reasons. First, you couldn't hit their quoted performance numbers with a CPU built on a normal FPGA. Second, they provide an FPGA-based development platform, which implies the real product clearly isn't one.

              Comment


              • #27
                Originally posted by coder View Post
                This sounds better than it is. You talk about these things as if they're comparable, but unless you're using a really small deep learning model, the overhead of shipping your data over PCIe or CXL is negligible, by comparison with the time that inferencing takes. We're talking about microseconds vs. milliseconds, at least.
                It sounds better than it is because currently, unless you can afford highly specialized systems, you have to work around this limitation, and people accept it. PCIe is definitely a bottleneck for many HPC cases. I mean, look at IBM's latest Z-series mainframes: they put their AI accelerator directly on the chip precisely to remove that overhead and got something like a 20x improvement (partly because the on-chip AI processor has extremely fast access to IBM's new CPU cache architecture).

                For this reason, having a "general" ISA that brings what would traditionally be separate accelerator chips onto the same die has merit. I can't really comment on how viable or realistic this is, but it's definitely a real problem in HPC.
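                As a toy illustration of both sides of this argument (the batch size, link speed, dataset size, and model latency below are invented for the example):

# Toy comparison of PCIe transfer time vs. compute time. The small-batch case
# supports the "transfer is negligible" view; the large-dataset case shows why
# PCIe becomes the bottleneck when data must be streamed over it repeatedly.
pcie4_x16_gbs = 32.0                      # ~usable bandwidth of a PCIe 4.0 x16 link

# Case 1: small inference batch (assumed 8 fp32 224x224 RGB images, ~4.8 MB)
batch_bytes = 8 * 3 * 224 * 224 * 4
inference_ms = 5.0                        # assumed per-batch model latency
transfer_ms = batch_bytes / (pcie4_x16_gbs * 1e9) * 1e3
print(f"inference batch: {transfer_ms * 1000:.0f} us of transfer vs {inference_ms:.0f} ms of compute")

# Case 2: HPC-style working set (assumed 500 GB) that can't fit in device memory
dataset_gb = 500.0
passes = 10                               # assumed number of full passes over the data
stream_s = dataset_gb * passes / pcie4_x16_gbs
print(f"streaming {dataset_gb:.0f} GB x {passes} passes over PCIe: ~{stream_s / 60:.1f} min of pure transfer")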

                Comment


                • #28
                  Originally posted by atomsymbol
                  The probability of that being true is very close to zero. Or do you believe that a person with 88 patents doesn't know technology? https://patents.google.com/?inventor=Radoslav+Danilak
                  Not saying this is the case, but considering how much of a problem bullshit patents are, I wouldn't read too much into the number alone.

                  Comment


                  • #29
                    Originally posted by coder View Post
                    What technology? I visited their website, but it doesn't really say how they do it. Did you look through patents or something?
                    They indeed lack any technical detail on their website. It seems their processor is a VLIW architecture, like Itanium 20 years ago.

                    While today's x86 and ARM CPUs expose only dozens of logical registers, they are implemented using the Tomasulo algorithm and actually have hundreds of physical registers, any of which can feed the execution units. Routing data from those hundreds of registers to the execution units is a huge power drain.

                    VLIW architectures are not OoO and do not use Tomasulo's algorithm. Rather, instruction scheduling is done by the compiler, and they have an order of magnitude fewer physical registers. VLIW architectures, typically found in DSPs, can be 100x more power-efficient than x86 and ARM; however, they tend to be pretty bad at general-purpose code. Itanium failed because it stalled a lot on general-purpose code. If someone figured out how to avoid stalling on general-purpose code with a VLIW architecture, they would be sitting on a gold mine. That is precisely what Tachyum claims to have done.

                    It is not clear how Tachyum achieves this, but I think it is totally possible. Take a look at the Mill. They have a very innovative, VLIW-ish architecture and lots of presentations detailing how it works. They have addressed the stalling problem and are directly targeting general-purpose code. While they don't have a physical chip yet, the presentations are sound and will probably convince you this is indeed possible. For example, they separate the issue and retirement of loads: you can issue a load at the beginning of a function, but the value is not actually loaded until several instructions later, and you can even do writes in between. They also have a completely redesigned branch predictor, which can predict branches without even looking at the code, by predicting exits of basic blocks instead of predicting jumps.
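                    As a toy way to picture that deferred-load idea (this is just my own Python sketch of the concept; the slot counts and names are invented, not Mill or Tachyum code):

# Toy model of a compiler-scheduled deferred load: the load is issued early,
# independent work (including a store to another address) runs under its
# latency, and the value only "retires" several instruction slots later.
LOAD_LATENCY = 3                 # invented latency, in instruction slots

memory = {0x10: 42}
in_flight = []                   # (retire_slot, address) for issued loads

def issue_load(addr, slot):
    """Start the load now; its result becomes visible LOAD_LATENCY slots later."""
    in_flight.append((slot + LOAD_LATENCY, addr))

def retire_loads(slot):
    """Return the values of all loads whose latency has elapsed by this slot."""
    done = [(s, a) for (s, a) in in_flight if s <= slot]
    for entry in done:
        in_flight.remove(entry)
    return [memory[a] for (_, a) in done]

issue_load(0x10, slot=0)         # slot 0: load hoisted to the top of the "function"
memory[0x20] = 7                 # slots 1-2: unrelated work, including a store
values = retire_loads(slot=3)    # slot 3: the value arrives right when it's needed
print("loaded:", values)         # -> loaded: [42]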

                    That said, Tachyum has been saying "we hope for tape out later this year" since 2018, so yeah, it smells like vaporware at this point.
                    Last edited by paulpach; 07 April 2022, 10:15 AM.

                    Comment


                    • #30
                      Originally posted by mdedetrich View Post
                      PCIe is definitely a bottleneck for many HPC cases.
                      Yes, if you can't fit your entire dataset in GPU memory, or there's a lot of shuttling back and forth, then agreed. There are obviously motivations for things like PCIe 6.0 and CXL - not all of them networking or storage use cases.

                      This applies mainly to high-end, niche use cases, though. If you try to extrapolate it to APUs and general-purpose workloads, the APU is much more hampered by lack of memory bandwidth than it benefits from being in-package with the CPU cores.

                      Originally posted by mdedetrich View Post
                      I mean, look at IBM's latest Z-series mainframes: they put their AI accelerator directly on the chip precisely to remove that overhead and got something like a 20x improvement (partly because the on-chip AI processor has extremely fast access to IBM's new CPU cache architecture).
                      That's not a benefit vs. being on a separate die. Their reason for putting it on die was likely just to simplify their system architecture and have only one chip to fab instead of 2.

                      Originally posted by mdedetrich View Post
                      For this reason, having a "general" ISA that brings what would traditionally be separate accelerator chips onto the same die has merit.
                      Well, Intel seems to think so. That's why Sapphire Rapids is getting AMX (matrix extensions with 8192-bit tile registers).

                      Meanwhile, AMD is spending its die area on lots more cores. So, we'll see who wins this next round. I think most cloud users with AI workloads are already moving to special-purpose accelerators that can be scaled up far beyond the number of CPUs in a system.

                      But the key thing you're forgetting is the bandwidth problem. AI accelerators are so data-hungry, even HBM isn't enough. The growing trend is to fill up to half of the die with SRAM. That's not something you'd do with a CPU, and note that it's not cache. If you squint your eyes, you can pretend that cache is the same thing, but cache burns a lot more power than directly-addressable SRAM.

                      Comment
