Linux Kernel Orphans Itanium Support, Linus Torvalds Acknowledges Its Death

  • #31
    Originally posted by L_A_G View Post
    The decoder being a big part of the silicon budget hasn't really been true in about a decade at this point. Sure, it took up a lot of space on-die when they began "RISC-ifying" x86 with heavy use of superscalarity and other associated features, but that was a long time ago.
    Citation needed.

    As I mentioned before, the problem with variable-length instructions is that the decoder scales poorly (in both power and area) as you try to decode more instructions in parallel. Also, Intel just keeps adding more and more instructions; modern Intel CPUs support a lot more opcodes than baseline x86-64. That's got to have an impact, too.
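    To make that concrete, here's a toy C sketch (insn_length is a fake stand-in, not a real x86 length decoder): with variable-length encoding, each instruction's start depends on the previous instruction's decoded length, so finding boundaries is a serial chain, whereas a fixed-width ISA lets every boundary be computed independently.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Fake stand-in for "how long is the instruction at p?"  On real x86
     * this itself means examining prefixes, the opcode and ModRM/SIB bytes. */
    static size_t insn_length(const uint8_t *p)
    {
        return 1 + (p[0] & 7);   /* placeholder, NOT real x86 decoding */
    }

    /* Variable-length ISA: instruction N+1's start depends on N's length,
     * so boundary finding is inherently serial. */
    static size_t boundaries_variable(const uint8_t *code, size_t n,
                                      size_t *starts, size_t max)
    {
        size_t count = 0, off = 0;
        while (off < n && count < max) {
            starts[count++] = off;
            off += insn_length(code + off);   /* serial dependency chain */
        }
        return count;
    }

    /* Fixed-width ISA (e.g. 4-byte instructions): every boundary is known
     * up front, so a wide decoder can work on all slots in parallel. */
    static size_t boundaries_fixed(size_t n_bytes, size_t *starts, size_t max)
    {
        size_t count = 0;
        for (size_t i = 0; i * 4 < n_bytes && count < max; i++)
            starts[count++] = i * 4;
        return count;
    }
    ```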

    Comment


    • #32
      Originally posted by coder View Post
      An important point that doesn't get mentioned much - GPUs are in-order! And you're a GPU guy!

      Also, I would point out that VLIW (in-order, for those who don't know) has remained very popular in DSPs and AI processors, mostly on the basis of their applications having tight loops and fairly predictable latencies (with on-chip memories or cache prefetchers able to help).

      Now, I'm not making the argument that Itanium's lack of OoO wasn't a problem. Quite the contrary, I think we're all in agreement that was its primary flaw.
      When you have workloads with highly predictable dataflows - with DSP and AI (i.e. massive matrix multiplication) they're super-predictable; with GPUs they're slightly less predictable, but you compensate with insane parallelism - then sure, it makes a lot of sense.
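      For instance, an inner kernel like this (a generic C sketch, nothing vendor-specific) has a completely static dataflow - fixed trip counts, fixed strides, no data-dependent branches - which is exactly what an in-order/VLIW compiler can software-pipeline without any runtime reordering:

      ```c
      #define N 64

      /* Dense matrix multiply: every address and trip count is known at
       * compile time, so the schedule can be fixed statically. */
      void matmul(const float a[N][N], const float b[N][N], float c[N][N])
      {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) {
                  float acc = 0.0f;
                  for (int k = 0; k < N; k++)
                      acc += a[i][k] * b[k][j];   /* fully predictable dataflow */
                  c[i][j] = acc;
              }
      }
      ```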

      Still, those are to some degree special cases (important special cases, but still special cases IMO), which also require a lot of developer time to get right. The compiler doesn't automagically extract the GPU parallelism from one's generic code.

      And yes, it seems that we're in agreement here.

      Comment


      • #33
        Originally posted by coder View Post
        Again, EPIC is not inherently in-order. That's merely how they implemented it. You could build an out-of-order IA64 CPU, and it would probably be as fast or faster than anything we have today.
        Yep, fair point. One might argue that building an OOO EPIC CPU would throw away most of the benefits of EPIC, but it can't hurt to have some information in the instruction stream about at least a small number of instructions that can be executed in parallel without having to go through hazard analysis in hardware.

        That said, modern CPUs have reached the point where they are typically wider than the set of "explicitly parallel" instructions in a typical Itanium stream, so I'm not sure there would be a benefit from carrying the instruction bundle relationships around while also scheduling around cache/memory responses. It sure would be an interesting project, though.

        Originally posted by coder View Post
        An important point that doesn't get mentioned much - GPUs are in-order! And you're a GPU guy!
        IIRC NVidia experimented with "lightly OOO" shader cores a while back but went back to in-order. Modern GPU ISA code typically uses a mix of cache warming (touching certain memory areas in a separate thread) and explicitly out-of-order instruction sequencing based on the more predictable access times that cache warming provides. It does feel like it might be time to revisit OOO GPU shader cores though.
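        As a rough CPU-side analogy (this is plain C using GCC/Clang's __builtin_prefetch, not actual GPU ISA, and the distance is just an assumed number), the cache-warming idea amounts to touching data a fixed distance ahead so the later accesses see short, predictable latencies:

        ```c
        #include <stddef.h>

        #define WARM_DIST 16   /* assumed prefetch distance, tune per machine */

        float sum_with_warming(const float *data, size_t n)
        {
            float acc = 0.0f;
            for (size_t i = 0; i < n; i++) {
                if (i + WARM_DIST < n)
                    __builtin_prefetch(&data[i + WARM_DIST], 0, 1); /* warm ahead */
                acc += data[i];   /* by now this access is (hopefully) a cache hit */
            }
            return acc;
        }
        ```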
        Last edited by bridgman; 29 January 2021, 03:26 PM.

        Comment


        • #34
          Originally posted by coder View Post
          Again, EPIC is not inherently in-order. That's merely how they implemented it. You could build an out-of-order IA64 CPU, and it would probably be as fast or faster than anything we have today.
          One question: would you issue whole bundles, or independent instructions drawn from bundles, out of order? If independent instructions, then IMHO you basically kill the EPIC idea.
          On the other hand, a completely OoO IA64 might be quite a nice architecture -- somewhere in a parallel universe. :-) Here we have to stick with the amd64 chaos (thanks, AMD, for at least cleaning up a bit of this mess!), ARMv8 and RISC-V.

          Comment


          • #35
            Originally posted by coder View Post
            Again, EPIC is not inherently in-order. That's merely how they implemented it. You could build an out-of-order IA64 CPU, and it would probably be as fast or faster than anything we have today.
            You'd ignore bundles completely in the out-of-order side, and sure ... it would be fine.

            But I don't think it'd give you any speed advantage over x86.

            Itanium is not just EPIC; it has other nuisances, e.g., the really large stacked register file. I call them nuisances, because while not insurmountable, they'd still make you wish you had a classic RISC ISA instead.

            Comment


            • #36
              Originally posted by CommunityMember View Post

              One of the problems with IA64 was that it was too far ahead of compiler technology of the time, and to get good performance advantages with it required compiler capabilities that were not widely available (hand assembly could show impressive results, but that is not practical for large code bases). Another problem with IA64 was that Intel was unwilling to take the leap of faith and fully commit and put it on their most advanced lithography and displace existing (and profitable) x86 processors which were already supply constrained, so all the IA64 processors were a generation or two or more behind in speeds and feeds.
              As Knuth said, the wished-for compilers simply can't be written.

              Intel actually assembled a super-good compiler team for Itanium. I know two people from that team.

              Comment


              • #37
                Originally posted by vladpetric View Post

                As Knuth said, the wished-for compilers simply can't be written.

                Intel actually assembled a super-good compiler team for Itanium. I know two people from that team.
                Yeah I think the biggest gift from VLIW and EPIC was the increased effort from industry and academia in improving compilers.

                Comment


                • #38
                  Originally posted by bridgman View Post
                  That said, modern CPUs have reached the point where they are typically wider than the set of "explicitly parallel" instructions in a typical Itanium stream
                  How many instructions in an IA64 instruction stream do you think are "explicitly parallel", as you put it?

                  Originally posted by bridgman View Post
                  so I'm not sure there would be a benefit from carrying the instruction bundle relationships around while also scheduling around cache/memory responses.
                  I think the instruction-triplet encoding was just a pragmatic measure to reduce the overhead of the templates, but I don't see why you would need to maintain the bundle structure downstream of the decoder.
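                  For reference, the bundle layout (as I recall it) is 128 bits: a 5-bit template plus three 41-bit slots, where the template says which unit types the slots map to and where the stops (instruction-group boundaries) fall - that's the "explicitly parallel" part. A little C sketch of pulling the fields apart:

                  ```c
                  #include <stdint.h>

                  /* IA-64 bundle: 128 bits = 5-bit template + three 41-bit slots. */
                  typedef struct {
                      uint64_t lo;   /* bundle bits  0..63  */
                      uint64_t hi;   /* bundle bits 64..127 */
                  } ia64_bundle;

                  static inline unsigned bundle_template(const ia64_bundle *b)
                  {
                      return (unsigned)(b->lo & 0x1f);                  /* bits 0..4 */
                  }

                  static inline uint64_t bundle_slot(const ia64_bundle *b, int slot)
                  {
                      /* Slots occupy bundle bits 5..45, 46..86 and 87..127. */
                      switch (slot) {
                      case 0:  return (b->lo >> 5) & 0x1ffffffffffULL;
                      case 1:  return ((b->lo >> 46) | (b->hi << 18)) & 0x1ffffffffffULL;
                      default: return (b->hi >> 23) & 0x1ffffffffffULL;
                      }
                  }
                  ```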

                  Originally posted by bridgman View Post
                  It does feel like it might be time to revisit OOO GPU shader cores though.
                  Really? I thought the GPU solution to latency-hiding (besides prefetching and fast on-chip memories) was SMT - just add enough threads that at least one will have work to do while the others block on I/O. If the goal of going out-of-order is to hide cache misses, then you need quite a lot of reordering, which would add a lot of overhead. Everything about the design of GPUs strikes me as optimizing throughput by using as much die area as possible for real computation, at the expense of single-thread performance.
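                  A toy model of that (made-up numbers, a single issue slot, a fixed memory latency) shows why it works: with enough resident threads the long waits overlap, so the core almost always finds something ready to issue.

                  ```c
                  #include <stdio.h>

                  #define COMPUTE 4     /* cycles of work between memory accesses */
                  #define MEM_LAT 100   /* assumed memory latency, in cycles */

                  static double utilization(int nthreads, int cycles)
                  {
                      int wait[64] = {0}, work[64] = {0}, busy = 0;
                      for (int c = 0; c < cycles; c++) {
                          int issued = 0;
                          for (int t = 0; t < nthreads; t++) {
                              if (wait[t] > 0) { wait[t]--; continue; } /* stalled on memory */
                              if (!issued) {                            /* one issue slot/cycle */
                                  issued = 1;
                                  if (++work[t] == COMPUTE) {           /* burst done... */
                                      work[t] = 0;
                                      wait[t] = MEM_LAT;                /* ...go wait on memory */
                                  }
                              }
                          }
                          busy += issued;
                      }
                      return (double)busy / cycles;
                  }

                  int main(void)
                  {
                      for (int n = 1; n <= 32; n *= 2)
                          printf("%2d threads -> %.0f%% issue-slot utilization\n",
                                 n, 100.0 * utilization(n, 100000));
                      return 0;
                  }
                  ```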

                  Anyway, I expect GPU hardware architects are more than capable of running cost/benefit analysis on these sorts of things. But I'd probably look at macro-architecture structures to address stalls before solving it at the micro-level. Maybe some better way of load-balancing across CUs could help, or maybe re-partitioning workers within a set of wavefronts (not sure if I have quite the right terminology). With all these sorts of clever tricks available, it would seem a bit lazy just to reach for the OoO hammer.

                  Comment


                  • #39
                    Originally posted by vladpetric View Post
                    You'd ignore bundles completely in the out-of-order side, and sure ... it would be fine.

                    But I don't think it'd give you any speed advantage over x86.
                    It simplifies the decoder and the analysis needed for reordering. More importantly, the fixed size of the instructions makes it easier and cheaper to widen the decoder.

                    Originally posted by vladpetric View Post
                    Itanium is not just EPIC; it has other nuisances, e.g., the really large stacked register file. I call them nuisances, because while not insurmountable, they'd still make you wish you had a classic RISC ISA instead.
                    The large register file is another advantage, though I'd forgotten about some of its quirks.

                    Comment


                    • #40
                      Originally posted by coder View Post
                      What's an example of a data-dependency that a CPU would use as a runtime instruction scheduling constraint that can't be determined at compile-time?
                      Anything with a branch inside a loop where one path reads through some other dependency (such as memory accessed through a register) and the other path doesn't, especially if the loop condition later depends on the result.

                      If you scheduled this statically, you'd have to assume the lowest common denominator, i.e. the slow memory read, on every iteration, even if the other case happens 90% of the time (and is predicted correctly by the branch predictor). That makes the loop much slower than it needs to be.
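                      A C sketch of the kind of loop I mean (the 10% figure and the array sizes are just assumptions for illustration; it also assumes the table entries are positive so the loop terminates):

                      ```c
                      #include <stddef.h>

                      /* One branch path loads through a pointer (potentially a cache miss),
                       * the other doesn't, and the loop condition depends on the result.
                       * A static schedule must budget for the slow load every iteration;
                       * an OoO core with a branch predictor only pays on the rare path. */
                      long walk(const long *table, const unsigned char *flags, long limit)
                      {
                          long acc = 0;
                          size_t i = 0;
                          while (acc < limit) {             /* condition depends on the result */
                              if (flags[i & 4095])          /* taken rarely, say ~10% of the time */
                                  acc += table[acc & 1023]; /* slow path: load through a pointer */
                              else
                                  acc += 1;                 /* fast common case */
                              i++;
                          }
                          return acc;
                      }
                      ```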

                      I mean, obviously, handling this dynamically requires an OoO CPU, which Itanium was not.

                      Comment
