Originally posted by paulpach
View Post
Second, we're talking about a best-in-class OoO micro-architecture vs. a VLIWish CPU from more than 20 years earlier. If someone wanted competitive efficiency from a VLIW CPU today, they'd likely need even more registers than before. See ELBRUS, for instance.
I guess the main determining factor in register count is whether you're going to try to cover some level of cache misses, as OoO CPUs can. If so, then static scheduling would probably need more registers than dynamically-scheduled OoO to achieve the same level of concurrency.
The other popular option for covering cache misses is SMT, but that still amounts to multiplying the number of ISA registers by a few times, which pretty much runs into the same routing & switching problem you were concerned about with a large, unified register pool. And the processors most successfully implementing this tactic are GPUs, with SMT degrees on the order of a dozen threads or so. That's a big jump in register pool size! On this point, I should note that Intel eventually added SMT to its IA64 CPUs, though it seems they only went as far as 2 threads per core.
Originally posted by paulpach
View Post
- TI C6000 (1997): 32 registers @ 32 bits
- Philips TriMedia TM1000 (1997): 128 registers @ 32 bits
- Intel Itanium (2001): 128 int, 128 fp registers @ 64 bits; 64 predicate registers
- ELBRUS 2000 (2005): 256 registers @ 84 bits
The obvious trend is obvious.
Originally posted by paulpach
View Post
I won't even go into the horrendous development environment on the C6000, but let's just say the difference between that and ARM was night-and-day. It's the only time I've personally experienced a compiler bug, which took 2 of us about 3 weeks to hunt down with little more than print statements at our disposal.
Originally posted by paulpach
View Post
The pernicious thing about predication is that you're stuck doing wasted work that a good OoO CPU wouldn't necessarily do, once the branch pattern becomes sufficiently clear. And because predication is so costly, it doesn't scale well to nested branches - something an OoO CPU can manage, with good branch-prediction.
Originally posted by paulpach
View Post
And if we're comparing useful work, then you ought to account for Alder Lake being 64-bit with 256-bit vector instruction support. That 358 W figure was surely measured during heavy AVX2 usage. Furthermore, I have no idea where you got the 97.6 GIPS figure. It's only a little more than 2 instructions per cycle, per core, at their respective base clocks. Its theoretical peak is certainly much higher.
Oh, and the TMS320C6474 has a total of 64 GPRs per core, split across 2 register files. Not the 32 registers of the original C6000.
Originally posted by paulpach
View Post
That's not a bad thing (bugs & proprietary aside), but a statement of fact. Truth be told, I wish GCC gave me more of the kind of expressiveness that compiler did, but then modern OoO CPUs basically don't need software pipelining, and even loop unrolling has less benefit for them (especially if you've taken care of any pointer aliasing problems).
Originally posted by paulpach
View Post
The problem they face, as I've outlined in prior posts, is that you simply can't beat GPUs at their own game. And it's worth noting that GPUs ditched VLIW, in favor of SIMD + SMT.
I find it disheartening that you've resorted to such cheap tactics in defense of VLIW, because I really don't consider myself a hater. Had you not overreached to make your points, we could've found ample room for agreement. I had high hopes for IA64, after my prior experience with VLIW. I still think some sort of hybrid approach will ultimately emerge. I think we can all see the absurdity of burning power decoding & scheduling the same instructions over and over, each time some code is executed.
What I like to imagine is a redrawing of the abstraction boundary between hardware and software. Instead of the fiction of a simplistic ISA that imposes a tremendous burden on the hardware to implement efficiently, I'm in favor of exposing the predictive and dynamic aspects of the hardware for the compiler to utilize, when it deems necessary and appropriate to do so. And I'm not only talking about branch and data predictions, but also cache lookups. For instance, what if you could lock some data in cache, and then address it directly?
Anyway, these are some of the ideas we could be discussing, if I didn't have to waste so much time & energy debunking the rubbish figures and inferences you seem to think are necessary to get your point across. Although, it turned out to be a good exercise for me, so I don't really mind too much.