Itanium IA-64 Was Busted In The Upstream, Default Linux Kernel Build The Past Month


  • #31
    Originally posted by vladpetric View Post

    Transmeta's performance was lackluster (actually its competitor was P4, a mediocre core; and it didn't manage to make inroads against that). Code morphing - kinda' cool, though in the end not that helpful (20 years later, we have open as in free instruction sets anyway ...). VLIW - bad.
    No, it really was the Pentium III... which morphed into the Pentium M and later the Conroe and Core architectures. The P4 is exactly what the Transmeta CPUs were good at combating, since you could not fit a P4 in a tablet-sized device. As for the rest of the comment, you clearly have no idea what you are talking about... most CPUs have to do a static form of code morphing anyway, but they cannot do runtime optimizations because of patents, though that is coming.

    The Transmeta CPU is actually an x86 CPU built around a VLIW core: the exposed front end and all the hardware around the VLIW core, including the registers, is essentially an x86 CPU. The whole point was to get rid of hardware implementations of things that could be done in software *better* and with lower power use. Modern CPUs get around some of this by clock gating inactive silicon... another advantage of VLIW is that you can implement fast paths for all the instructions at low cost, since implementing an instruction is the job of the software. By contrast, Intel tends to force instructions to be slow to try to get people to stop using them.
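
    Roughly, the "code morphing" layer is just dynamic binary translation. Here is a toy sketch of the dispatch loop in C (all names and the trivial "translation" are made up for illustration; real code morphing emits host VLIW bundles, not C function pointers):

        /* Toy dynamic-binary-translation ("code morphing") loop.
           Hypothetical names; the "translated" blocks are plain C functions. */
        #include <stdio.h>
        #include <stddef.h>

        #define CACHE_SIZE 64
        #define GUEST_HALT 0xFFFFu

        typedef struct {
            unsigned guest_pc;            /* address of the guest (x86) block  */
            unsigned (*host_code)(void);  /* stand-in for translated host code */
        } TraceEntry;

        static TraceEntry tcache[CACHE_SIZE];
        static size_t tcache_used;

        /* Pretend "translated" blocks: each returns the next guest PC. */
        static unsigned block_at_0(void) { puts("executing block 0"); return 4; }
        static unsigned block_at_4(void) { puts("executing block 4"); return GUEST_HALT; }

        /* Translate on first use; afterwards run straight from the cache. */
        static unsigned (*lookup_or_translate(unsigned pc))(void) {
            for (size_t i = 0; i < tcache_used; i++)
                if (tcache[i].guest_pc == pc)
                    return tcache[i].host_code;   /* hot path: translation cache hit */
            unsigned (*code)(void) = (pc == 0) ? block_at_0 : block_at_4; /* "translate" */
            tcache[tcache_used++] = (TraceEntry){ pc, code };
            return code;
        }

        int main(void) {
            unsigned pc = 0;
            while (pc != GUEST_HALT)              /* main dispatch loop */
                pc = lookup_or_translate(pc)();
            return 0;
        }
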
    Last edited by cb88; 18 January 2021, 07:35 PM.

    Comment


    • #32
      Originally posted by cb88 View Post

      No, it really was the Pentium III... which morphed into the Pentium M and later the Conroe and Core architectures. The P4 is exactly what the Transmeta CPUs were good at combating, since you could not fit a P4 in a tablet-sized device. As for the rest of the comment, you clearly have no idea what you are talking about... most CPUs have to do a static form of code morphing anyway, but they cannot do runtime optimizations because of patents, though that is coming.

      The Transmeta CPU is actually an x86 CPU built around a VLIW core: the exposed front end and all the hardware around the VLIW core, including the registers, is essentially an x86 CPU. The whole point was to get rid of hardware implementations of things that could be done in software *better* and with lower power use. Modern CPUs get around some of this by clock gating inactive silicon... another advantage of VLIW is that you can implement fast paths for all the instructions at low cost, since implementing an instruction is the job of the software. By contrast, Intel tends to force instructions to be slow to try to get people to stop using them.
      If you're gonna pull an ad hominem, at the very least get it right (still an ad hominem, but a correct one). Yeah, I only have a PhD in microarchitecture, with a dissertation cited in two Intel preliminary patents.

      The whole idea that software can do instruction scheduling better has been shown to be patently false (billions of wasted dollars later, I should add). As I said earlier in this thread, compilers (or translation software, in Transmeta's case) can't handle the dynamicity of the memory hierarchy. Stalling when you have misses to main memory on the order of ~500 cycles is a really bad idea (and stalling is what in-order processors do, including VLIW).
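
      To make that concrete, here's a toy C pointer chase (not a benchmark, names invented): a compiler cannot schedule the next load any earlier because its address depends on the previous load, while an OoO core can at least keep running the independent arithmetic under the outstanding miss.

          /* Dependent-load loop: static scheduling cannot hide the miss latency. */
          #include <stdlib.h>
          #include <stdio.h>

          struct node { struct node *next; long payload; };

          long walk(struct node *n, long iters) {
              long sum = 0;
              for (long i = 0; i < iters && n; i++) {
                  sum += n->payload * 3 + i;  /* independent work an OoO window can run ahead on */
                  n = n->next;                /* likely a miss; the next address is unknown until it returns */
              }
              return sum;
          }

          int main(void) {
              enum { N = 1 << 16 };
              struct node *pool = calloc(N, sizeof *pool);
              if (!pool) return 1;
              /* Chain the nodes with a large stride so the walk misses in cache. */
              for (long i = 0; i < N; i++) {
                  pool[i].payload = i;
                  pool[i].next = &pool[(i * 257 + 1) % N];
              }
              printf("%ld\n", walk(&pool[0], N));
              free(pool);
              return 0;
          }
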

      The problem is that many people (primarily compiler writers) don't want to accept that OoO instruction scheduling is a really good idea (even mobile processors and the Atoms do it today), so they completely ignore all the factual evidence. Much like the political parties of today: when the other side says something meaningful, they just ignore it.

      So yeah, make VLIW great again!

      Comment


      • #33
        Originally posted by cb88 View Post

        Actually RDNA1/2 has brought back some of the VLIW design... and it's likely that the same is true of CDNA. They are calling it Super SIMD.
        Can you explain how this is like VLIW? I'm not seeing where RDNA breaks with the GCN ISA and *requires* that compilers now statically schedule independent instruction pairs/bundles. If anything, this sounds like superscalar concepts being mapped to SIMD, where the hardware assumes responsibility for issuing multiple instructions and checking for hazards.

        Comment


        • #34
          Originally posted by Space Heater View Post

          Can you explain how this is like VLIW? I'm not seeing where RDNA breaks with the GCN ISA and *requires* that compilers now statically schedule independent instruction pairs/bundles. If anything, this sounds like superscalar concepts being mapped to SIMD, where the hardware assumes responsibility for issuing multiple instructions and checking for hazards.
          GCN isn't a single ISA anyway... it's really a design architecture; each version of the GCN ISA is not binary compatible with the others in all cases.

          It's kind of pointless to discuss this at the ISA level since we don't know the internals; when the Radeon GPU code leaked a while back, it turned out to be wildly different internally than you would think. You can think of RDNA as VLIW2 with a hardware scheduler that handles the VLIW bit in hardware.

          Comment


          • #35
            Originally posted by Space Heater View Post

            Since when does RISC-V have any form of predicated move/select? They seem to be religiously against predication and claim that branch prediction is always better.
            Again, my mistake. When LibreSOC was looking at proposing a RISC-V vector extension, they considered overloading all the branch instructions as predicates. I got confused for a minute.

            Comment


            • #36
              Originally posted by cb88 View Post

              GCN isn't a single ISA anyway... it's really a design architecture; each version of the GCN ISA is not binary compatible with the others in all cases.

              It's kind of pointless to discuss this at the ISA level since we don't know the internals; when the Radeon GPU code leaked a while back, it turned out to be wildly different internally than you would think. You can think of RDNA as VLIW2 with a hardware scheduler that handles the VLIW bit in hardware.
              The whole concept of VLIW is that it is done at the ISA level, the name "Very Long Instruction Word" provides a strong hint that it relates to the software-hardware interface. VLIW architectures punt the complexities of grouping multiple independent instructions to software, and they do this by exposing hardware details (e.g. bundle size, what permutations of instruction types are allowed in the same bundle etc.) in the ISA. Therefore with VLIW architectures the compiler is responsible for creating bundles of independent instructions, not the hardware. If the hardware is handling "the VLIW bit" then it's not VLIW-like at all, instead it's like a traditional superscalar processor.
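
              A toy illustration of that division of labor, with a made-up 3-wide bundle notation in the comments (not any real ISA): the hardware issues whatever is in a bundle without dependence checks, so the compiler may only group operations it can prove independent and must pad the rest with nops.

                  #include <stdio.h>

                  long f(long a, long b, long c) {
                      long x = a + b;  /* bundle 1: { add x=a+b ; add y=c+1 ; nop }  x and y are independent */
                      long y = c + 1;
                      long z = x * y;  /* bundle 2: { mul z=x*y ; nop ; nop }  z needs both x and y, so it   */
                      return z - a;    /*           cannot share their bundle and the extra slots go empty   */
                  }                    /* bundle 3: { sub r=z-a ; nop ; nop } */

                  int main(void) {
                      printf("%ld\n", f(2, 3, 4));
                      return 0;
                  }

              A superscalar core does the same grouping, but in hardware and at run time, which is why nothing about it needs to show up in the ISA.
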

              Comment


              • #37

                Originally posted by zexelon View Post

                I would suggest this assessment might be a bit unfair. The architecture is actually incredibly elegant, very well implemented (in hardware) and amazingly flexible for future development.

                It had several key failures though:
                1. ...
                2. Turns out it's borderline impossible to write an effective compiler. The whole architecture turns many commonly accepted computer engineering paradigms on their heads... it moved all the scheduling, parallelism, and hardware complexity into the compiler... a genius idea for the hardware engineers, as it theoretically made the chip cheaper to produce. However, it made the compiler severely more complicated to produce... and as several architectures in history have shown, the very best, most amazing CPU turns out to be useless if you can't compile software for it!
                Personally I think point 1 may have been the key one. If they could have made the market excited about it and gotten more CPU designers and manufacturers on board, it would have spread the risk, and development of the compilers would perhaps have progressed further!

                This is all bonus work for the concepts of the RISC-V group... maybe some day we will see an Unobtanium-V group
                I don't know that I'd describe it as elegant. It lacked bold leadership in design and had all of the hallmarks of design by committee. Intel's IA-32 was getting beaten up by RISC-ish designs due to its lack of registers, so IA-64 got 128 general registers and 128 floating-point registers; it wasn't even clear that more than the 32 that RISC chips generally had were needed, but more is more. EPIC sounded neat too, but it might not have been bold enough: it was 3 instructions wide, and instruction fetchers at the time were already starting to get pretty good at multiple dispatch.

                Around that time, profile-guided optimizations to prevent branch mispredicts were sort of the hotness in the compiler world. Merced generally didn't have branch mispredicts because it could just take both branches and throw out the mispredicted results. It drank power like nobody's business.
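
                Roughly what that looks like at the source level, just as an illustration of if-conversion in plain C (not IA-64 code): both sides get computed and a select keeps one result, so there is no branch left to mispredict.

                    #include <stdio.h>

                    long with_branch(long a, long b, int cond) {
                        if (cond)                    /* control flow: one path runs, and a       */
                            return a + b;            /* mispredicted branch costs a pipeline flush */
                        return a - b;
                    }

                    long if_converted(long a, long b, int cond) {
                        long taken     = a + b;      /* both paths execute... */
                        long not_taken = a - b;
                        return cond ? taken : not_taken; /* ...and a select keeps one; compilers
                                                            typically lower this to cmov, or to
                                                            predicated instructions on IA-64 */
                    }

                    int main(void) {
                        printf("%ld %ld\n", with_branch(10, 3, 1), if_converted(10, 3, 0));
                        return 0;
                    }
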

                Building a compiler for it wasn't super difficult; I think that is overblown. Compilers were already ordering and scheduling instructions, and loop unrolling was very popular, but it was a move in a different direction from where compilers had been going, and one that didn't seem to offer up all that much. SGI's guys, and maybe the IBM Toronto compiler group, had already demonstrated solutions to some of the problems compilers had with RISC chips that people thought were too difficult to solve a decade earlier; those RISC chips were simpler and more power efficient, and Itanium never really matched the high end, let alone beat it. Since it was expensive and had good but not great performance, it never really got the compiler attention of other platforms; I think of it as more of a general tooling shortage. Take Java: I think BEA JRockit ran on Itanium, but I don't remember the Sun JVM ever being really current; it was always built for IA-64 quite a bit later. Intel got the Cygnus guys to port stuff to it, but it was kind of abandoned since no one ever used it.

                Maybe the biggest failing was that it wasn't an enterprise-grade part at the time: it didn't have the RAS features that the big buyers needed, and only HP bought in. Without a substantial high-end play, it was just a very expensive part that didn't bring enough performance to justify the cost. And it really sucked at running your existing software. The lesson, which basically Gordon Moore articulated, was that you had to be cheaper and faster to really take over; cheaper *or* faster might win or compete, but you really needed both. Just a couple of years after Merced was released, power-specific parts became a thing, and ARM has made progress by focusing on the power envelope and by making cheaper and more efficient parts.


                Comment


                • #38
                  Nelson Well, I pretty much have to agree with the majority of your assessment here! I think you are spot on with your consideration of the energy failings of Itanium. In recent decades this has become a metric more important than raw performance for a large swath of applications, and you are bang on about the rise of ARM being largely related to this.

                  I don't think a mobile version of Itanium would ever have been a feasible proposition... not without basically a total redesign, and I think the very idea at its core is probably not able to meet the power usage of, say, ARM.

                  Comment


                  • #39
                    In reality, Intel had money out the wazoo at that time; if compiler engineers could have figured out how to run general-purpose code quickly on an in-order VLIW without mountains of NOP slots and I$ abuse, Intel would have hired them.

                    If there is any place in general purpose software for VLIW, it is as a supplement or assist to OoO, rather than a replacement for it.

                    Comment


                    • #40
                      Originally posted by f0rmat View Post

                      That is what I remember, too. What is fascinating to me is that at the time Intel introduced the Itanium, the vastly overwhelming majority of code out there was x86, with some 16-bit thrown in. Why Intel thought that everybody would drop all of their x86 code and move to Itanium was a mystery to me at the time. Especially since, at the time, AMD was providing some serious competition. Not only had they just introduced the first true 64-bit processor, they had also recently shipped the first processor to break the 1 GHz threshold.
                      Itanium was targeted at the legacy UNIX vendors that had Alpha, MIPS, SPARC, PA-RISC, and POWER as live architectures at that point, with 64-bit capability already there or fast approaching on the roadmap. Now we're down to POWER, and the patent pool that was Alpha ended up influencing the Athlon enough that AMD was able to squeeze out the toehold that was AMD64.

                      Comment
