Tachyum Gets FreeBSD Running On Their Prodigy ISA Emulation Platform For AI / HPC


  • coder
    replied
    Originally posted by paulpach View Post
    Reorder buffers (AKA instruction queue) ARE registers. You can't address them in software at all; they are internal to the CPU, but they are registers that hold the instruction and its dependencies.
    I went back a few messages, but the relevance of this point was still unclear to me. How ROB entries are implemented is one thing - and yes, of course they're implemented as registers. That's different than talking about the ISA registers, or the shadow registers typically used in an OoO processor.



  • paulpach
    replied
    Originally posted by coder View Post
    You're just rattling off irrelevant facts, rather than answering the point. First, you do not need unique registers for each reorder buffer entry.
    Reorder buffers (AKA instruction queue) ARE registers. You can't address them in software at all; they are internal to the CPU, but they are registers that hold the instruction and its dependencies.

    Check out an introduction to the Tomasulo algorithm: https://www.youtube.com/watch?v=jyjE6NHtkiA&t=122s
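    For anyone following along, here's roughly what that renaming/buffering looks like. This is a minimal, illustrative Python sketch (all class and method names are mine, and results "execute" instantly for brevity): ROB entries are internal storage the ISA can't name, and operands are forwarded from in-flight entries instead of the architectural register file.

```python
# Toy sketch of Tomasulo-style renaming; not any real microarchitecture.
# ROB entries are internal registers holding each in-flight result.

class ROBEntry:
    def __init__(self, dest):
        self.dest = dest      # architectural register this entry will write
        self.value = None
        self.ready = False

class Core:
    def __init__(self, nregs=8):
        self.regs = [0] * nregs   # architectural register file
        self.rename = {}          # arch reg -> newest producing ROB entry
        self.rob = []             # in-flight entries, in program order

    def _read(self, r):
        # Operands come from a pending ROB entry if one is ready
        # (result forwarding), otherwise from the register file.
        e = self.rename.get(r)
        if e is not None and e.ready:
            return e.value
        return self.regs[r]

    def issue(self, dest, op, src1, src2):
        # Read operands first, then rename dest to a fresh ROB entry;
        # for brevity the result is computed immediately.
        a = self._read(src1)
        b = self._read(src2)
        entry = ROBEntry(dest)
        entry.value = op(a, b)
        entry.ready = True
        self.rob.append(entry)
        self.rename[dest] = entry

    def commit(self):
        # Retire completed entries strictly in program order.
        while self.rob and self.rob[0].ready:
            e = self.rob.pop(0)
            self.regs[e.dest] = e.value
            if self.rename.get(e.dest) is e:
                del self.rename[e.dest]

core = Core()
core.regs[1], core.regs[2] = 3, 4
core.issue(0, lambda a, b: a + b, 1, 2)  # r0 = r1 + r2 (in flight)
core.issue(3, lambda a, b: a * b, 0, 1)  # r3 = r0 * r1, r0 forwarded from ROB
core.commit()
print(core.regs[0], core.regs[3])        # -> 7 21
```

    The point being: software only ever names the 8 architectural registers here, while the ROB silently adds as much internal result storage as there are in-flight instructions.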



  • tuxd3v
    replied
    Originally posted by coder View Post
    I had the same thought, but someone suggested they mean 4 VLIWs per cycle per core. If it's SMT, that at least makes logical sense.

    Then again, they supposedly have some limited amount of OoO, so maybe they don't reschedule slots within a packet, but allow some sort of IA64-like reordering between non-dependent packets.
    In any case, they will have serious problems with the actual samples (something that doesn't come up on an FPGA) due to power envelopes and thermal dissipation.
    They would need a very good power management unit..
    And we still don't know how they will manage 4 GHz across all cores..

    I am not saying that it's bullshit; maybe they found several breakthroughs... it's possible, but highly unlikely.
    As a matter of comparison,
    Intel has been in the business for several decades, they still have a lot of problems to solve, and they have one of the best power management systems in CPUs currently..
    But they still haven't managed to solve the problem of power usage/dissipation when doing SIMD, for example, while maintaining frequency..

    Anyway, we need to wait and see.
    I, at least, expect them to succeed,
    because that will mean technology has been evolving, which is good.

    Originally posted by coder View Post
    Someone mentioned a Hot Chips presentation from 3-4 years ago, so you could look for it if you're really curious. However, it might have only a tenuous relation to their current/upcoming generation of hardware.
    I will look into it, thanks.



  • coder
    replied
    Originally posted by tuxd3v View Post
    I still don't get the architecture that is being used here.. judging by 4 operations per cycle, I would say that it's not VLIW-like..
    So it's some sort of OoO..
    I had the same thought, but someone suggested they mean 4 VLIWs per cycle per core. If it's SMT, that at least makes logical sense.

    Then again, they supposedly have some limited amount of OoO, so maybe they don't reschedule slots within a packet, but allow some sort of IA64-like reordering between non-dependent packets.

    I think it's a fool's game to parse so little information, so I'll not dwell on it further. Someone mentioned a Hot Chips presentation from 3-4 years ago, so you could look for it if you're really curious. However, it might have only a tenuous relation to their current/upcoming generation of hardware.



  • tuxd3v
    replied
    I am a bit skeptical about some things..
    I don't know how they will deliver 4 GHz on a 64-core system... and whether the frequency will remain steady when they start to use SIMD operations..
    Because all we have now is vapourware; even Intel hasn't managed to solve those problems..

    It's very easy to do FPGA development at hundreds of megahertz and extrapolate,
    but the reality is that the final hardware is the real challenge... and there frequency/power consumption will be a problem..

    I still don't get the architecture that is being used here.. judging by 4 operations per cycle, I would say that it's not VLIW-like..
    So it's some sort of OoO..



  • sinepgib
    replied
    Originally posted by coder View Post
    What we need are details.

    Unless & until they disclose more specifics, this degree of skepticism is to be expected. I'm (somewhat) open-minded, but they need to show us how they intend to deliver on these claims.
    Vagueness is key to scams. Just saying.



  • coder
    replied
    Originally posted by Markospox View Post
    Not always with tech and mathematics - this looks good (one may prefer something else, but that's another thing) and it is good.
    What we need are details.

    Unless & until they disclose more specifics, this degree of skepticism is to be expected. I'm (somewhat) open-minded, but they need to show us how they intend to deliver on these claims.



  • coder
    replied
    Originally posted by paulpach View Post
    It really doesn't. Out of Order processors use reorder buffers, ...
    You're just rattling off irrelevant facts, rather than answering the point. First, you do not need unique registers for each reorder buffer entry.

    Second, we're talking about a best-in-class OoO micro-architecture vs. a VLIWish CPU from more than 20 years earlier. If someone wanted competitive efficiency from a VLIW CPU today, they'd likely need even more registers than before. See ELBRUS, for instance.

    I guess the main determining factor in register count is whether you're going to try and cover some level of cache misses, like OoO CPUs can do. If so, then static scheduling would probably need more registers than dynamically-scheduled OoO, to achieve the same level of concurrency.
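    To put a rough number on that: by Little's law, the results a statically scheduled machine must keep live in architectural registers scale as issue width times the latency being covered. The cycle counts below are illustrative assumptions of mine, not figures from any datasheet.

```python
# Back-of-envelope only: illustrative latencies, not measured figures.

def live_results_needed(latency_cycles, issue_width):
    # Little's law: results in flight = throughput x latency
    return latency_cycles * issue_width

# Covering a ~40-cycle cache miss on a 4-wide statically scheduled core
# would need ~160 live results -- far more than 32 ISA registers --
# while an OoO core parks them in its invisible physical register file.
print(live_results_needed(40, 4))  # -> 160
```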

    The other popular option for covering cache misses is SMT, but that still amounts to multiplying the number of ISA registers by a few times, which still pretty much runs into the same routing & switching problem you were concerned about with a large, unified register pool. And the processors more successfully implementing this tactic are mostly GPUs, with SMT on the order of a dozen or so threads. That's a big jump in register pool size! On this point, I should note that Intel eventually added SMT to its IA64 CPUs, though it seems they only went as far as 2 threads per core.

    Originally posted by paulpach View Post
    So they just have a few dozen registers and that is it.
    No, let's try this again.
    • TI C6000 (1997): 32 registers @ 32 bits
    • Philips TriMedia TM1000 (1997): 128 registers @ 32 bits
    • Intel Itanium (2001): 128 int, 128 fp registers @ 64 bits; 64 predicate registers
    • ELBRUS 2000 (2005): 256 registers @ 84 bits

    The obvious trend is obvious.

    Originally posted by paulpach View Post
    a VLIW CPU can use the same amount of logical registers as your everyday x86 or ARM CPU. For example, the modern Texas C6000 series only has 32 general-purpose registers.
    C6000 is 25 years old and nowhere near comparable in performance to modern ARM (which has 32 ISA GPRs). I actually programmed a C6000. We ported the same code to an ARM Cortex-A53 - an in-order, dual-issue core, running at a bit higher clock speed, and it handled the workload with room to spare!

    I won't even go into the horrendous development environment on the C6000, but let's just say the difference between that and ARM was night-and-day. It's the only time I've personally experienced a compiler bug, which it took 2 of us about 3 weeks to hunt down with little more than print statements at our disposal.

    Originally posted by paulpach View Post
    Those are used to increase ILP while resolving data hazards. In a VLIW there are no data hazards at all. All data hazards are resolved by the compiler.
    This is pure marketing spin. Of course VLIW CPUs have data hazards! The only difference is that they do speculative execution via predication. But it still requires extra registers and involves unnecessary work. And just because the registers are exposed in the ISA doesn't change the fact that you need a lot of them to achieve good concurrency.

    The pernicious thing about predication is that you're stuck doing wasted work that a good OoO CPU wouldn't necessarily do, once the branch pattern becomes sufficiently clear. And because predication is so costly, it doesn't scale well to nested branches - something an OoO CPU can manage, with good branch-prediction.
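    To make the "wasted work" point concrete, here's an if-conversion sketch in Python (the function names are mine; on real hardware the select would be a predicated move or cmov): both arms execute unconditionally, each consuming its own destination register, and the predicate merely picks the survivor.

```python
# Illustrative only: models predication, not any specific ISA.

def branchy(p, x):
    # A branching machine does only the taken arm's work.
    if p:
        return x + 1
    return x - 1

def predicated(p, x):
    t = x + 1             # both arms computed unconditionally...
    f = x - 1             # ...each needing its own destination register
    return t if p else f  # predicate selects; the other result is wasted

# Same answers, but the predicated form always burns both arms' work.
assert all(branchy(p, 5) == predicated(p, 5) for p in (True, False))
```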

    Originally posted by paulpach View Post
    The TMS320C6474 uses 4-8 watts and can do 28,800 million instructions per second.
    The i9-12900K uses 50-358 watts and can do 97671 million instructions per second.
    Again, pure marketing spin. This assumes full throughput of 8x 32-bit instructions per core, per cycle, no empty slots, no cache misses, and that no instructions are wasted on false predications. Anyone who's programmed such a device will know this theoretical ideal is an unachievable fantasy. The first challenge is for the compiler to find enough useful work to schedule into all of the slots, which isn't as easy as it sounds, because slots have restrictions on which types of instructions they can hold. So, you'd need just the right instruction mix. Then, even if there's enough instruction-level parallelism in your code, there are other hazards which can affect instruction scheduling, like branch delay slots, instruction latency, and write-back conflicts (i.e. because the register file only has so many write ports, sometimes a slot goes empty because otherwise there'd be too many results getting written back in a subsequent cycle).

    And if we're comparing useful work, then you ought to account for Alder Lake being 64-bit with 256-bit vector instruction support. That 358 W figure was surely measured during heavy AVX2 usage. Furthermore, I have no idea where you got the 97.6 GIPS figure. It's only a little more than 2 instructions per cycle, per core, at their respective base clocks. Its theoretical peak is certainly much higher.

    Oh, and the TMS320C6474 has a total of 64 GPRs per core, split across 2 register files. Not the 32 registers of the original C6000.

    Originally posted by paulpach View Post
    28800 / 8 = 3600 MIPS per watt
    97671 / 358 = 272 MIPS per watt
    Aside from those instructions not being remotely comparable, your misrepresentation also relies on Alder Lake being operated well beyond its peak efficiency point.
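    For the record, here's the quoted arithmetic reproduced, which shows both where the "13x" comes from and why it's apples-to-oranges: the DSP side is a zero-stall theoretical peak (3 cores x 1.2 GHz x 8 slots), while the x86 side pairs an unsourced throughput number with a worst-case power draw.

```python
# Reproducing the quoted numbers, not endorsing them.

c6474_peak_mips = 28_800   # 3 cores x 1.2 GHz x 8 slots, zero stalls assumed
c6474_watts = 8            # upper end of the quoted power range
i9_mips = 97_671           # origin of this figure is unclear
i9_watts = 358             # peak AVX2 draw, far past the efficiency point

dsp_eff = c6474_peak_mips / c6474_watts  # 3600 MIPS/W, on paper only
x86_eff = i9_mips / i9_watts             # ~272 MIPS/W
print(round(dsp_eff / x86_eff))          # -> 13, the quoted ratio
```

    In other words, the ratio compares the DSP's unreachable best case against the x86's worst case; move either assumption and the "13x" evaporates.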

    Originally posted by paulpach View Post
    So you are right, not 100x the performance per watt, more like 13x.
    The obvious troll is obvious.

    Originally posted by paulpach View Post
    The comparison is not exactly fair, since the i9 uses much better node technology, but even with the node disadvantage, it comes out ahead with a massive win.
    It's easy to win competitions on paper, when you can presume perfect efficiency of your side and cherry-pick the least flattering numbers for your competition. The picture you paint is not an honest portrayal.

    Originally posted by paulpach View Post
    Unfortunately, the TMS320C6474 completely sucks for general-purpose code, which is full of branches. It requires carefully crafted assembly that takes advantage of software pipelining and whatnot.
    Had you ever actually used one of these, you'd know that it's utterly impractical to program them in assembly language. You have to use a proprietary (and buggy!) C/C++ compiler, with a very sophisticated optimizer that often relies on extensive pragmas to achieve good performance.

    That's not a bad thing (bugs & proprietary aside), but a statement of fact. Truth be told, I wish GCC gave me more of the kind of expressiveness that compiler did, but then modern OoO CPUs basically don't need software pipelining and even loop unrolling has less benefit for them (especially if you've taken care of any pointer aliasing problems).

    Originally posted by paulpach View Post
    in an OoO CPU, a good 90% of the power usage goes to figuring out exactly which instruction to send to the execution units.
    Though I'm quite familiar with the argument, I can only assume this figure is as fictitious as the rest of your numbers. And whatever it truly is, it's obviously very specific to both the micro-architecture and workload. If anyone knows what the actual figures are, it's probably just the design teams of these OoO CPU cores. They're the ones making the power/area/performance tradeoffs.

    Originally posted by paulpach View Post
    If someone managed to crack the nut and get DSP-like performance with general-purpose code, that would be huge. Did Tachyum do it? I don't know, but I do think this is possible.
    That's not my understanding of their proposition. I believe what they're selling is DSP-like performance on DSP code, but maybe better-than-DSP performance on general-purpose code.

    The problem they face, as I've outlined in prior posts, is that you simply can't beat GPUs at their own game. And it's worth noting that GPUs ditched VLIW, in favor of SIMD + SMT.

    I find it disheartening that you've resorted to such cheap tactics in defense of VLIW, because I really don't consider myself a hater. Had you not overreached to make your points, we could've found ample room for agreement. I had high hopes for IA64, after my prior experience with VLIW. I still think some sort of hybrid approach will ultimately emerge. I think we can all see the absurdity of burning power decoding & scheduling the same instructions over and over, each time some code is executed.

    What I like to imagine is a redrawing of the abstraction boundary between hardware and software. Instead of the fiction of a simplistic ISA that imposes a tremendous burden on the hardware to implement efficiently, I'm in favor of exposing the predictive and dynamic aspects of the hardware for the compiler to utilize, when it deems necessary and appropriate to do so. And I'm not only talking about branch and data predictions, but also cache lookups. For instance, what if you could lock some data in cache, and then address it directly?

    Anyway, these are some of the ideas we could be discussing, if I didn't have to waste so much time & energy debunking the rubbish figures and inferences you seem to think are necessary to get your point across. Although, it turned out to be a good exercise for me, so I don't really mind too much.
    Last edited by coder; 10 April 2022, 01:00 PM.



  • Markospox
    replied
    Originally posted by ldesnogu View Post
    When something sounds too good to be true, it usually isn't.
    Not always with tech and mathematics - this looks good (one may prefer something else, but that's another thing) and it is good.



  • AnonDMR
    replied
    It's not a VLIW processor - this has been said over and over. https://youtu.be/lQ1wUnsh5Qk?t=5174

