Intel Joins RISC-V International, Will Help With RISC-V Open-Source Software
Originally posted by jabl View Post
Well, the situation with Itanium was pretty different. Back then R&D suggested that superscalar out-of-order CPUs (both x86 and RISC) would be running into scaling limits, and the solution they came up with was a very wide, compiler-scheduled VLIW. If this plan had worked out, not only would Itanium have left x86 and the various workstation/server RISC CPUs (which were very much alive back then) far behind performance-wise, but Itanium was also protected to the hilt with patents etc., and Intel apparently imagined a world where they would enjoy monopoly profits for a very long time into the future.
Originally posted by jabl View Post
Of course, in retrospect we know this didn't work out, for multiple reasons, primarily that Intel's compiler team was never able to come up with the magic compiler that would make VLIW work great on general-purpose code,
I think one of the problems with IA64 is that Intel took its foot off the gas before the ISA ever really got close to maximizing its potential. For instance, explicit parallelism means you can do some things at runtime that classical VLIW can't, such as OoO and speculative execution. However, Intel never made an IA64 CPU that did either of these things. By the time they would have, the writing was already on the wall that x86-64 was the way forward.
Also, IA64 was never extended with SIMD instructions. This meant that SSE-enabled CPUs quickly surpassed IA64 in floating-point performance, which had been one of its early advantages.
Originally posted by jabl View Post
RISC-V is different, as there's no protective moat like with Itanium and, to a lesser extent, x86-64. They may be interested in RISC-V as a way to hurt ARM, but no, I'm sure they have no intention of helping RISC-V step on their x86 toes.
Originally posted by jabl View Post
Make no mistake about it, where there is a choice of what to push, Intel would certainly choose x86, as they have a dominant position in that ecosystem, with other entrants (except AMD) denied access, whereas in the RISC-V world they'd just be one more seller of CPUs among many others.
Last edited by coder; 09 February 2022, 03:41 AM.
Originally posted by pabloski View Post
To be precise, Intel has already gone RISC, starting with the P6 architecture.
RISC instruction sets make it cheap and easy to use certain high performance CPU implementation techniques.
With the P6, Intel found a difficult and expensive way to apply those techniques to CISC. (Actually, this process started with the 80486.)
And ARM has gone partially CISC, because the ISA has been updated with complex instructions.
At the current scale of integration, with superscalar execution, multiple pipelines, etc., it makes less sense to classify CPU designs as CISC or RISC. CISC exists only in the instruction sets (which get decoded into micro-ops), while old-school RISC doesn't exist at all anymore.
Originally posted by coder View Post
*snip*
Well, if you believe x86 is going to hit a wall before either ARM or RISC-V
And before someone says "but, but, Apple M1!", yes it's an impressive chip, but it's also fabbed on a newer process node than any x86 chip on the market, and is overall a very good microarchitectural design regardless of the ISA.
On the high end, I think the biggest advantage ARM64 and RISC-V enjoy over x86 is the more relaxed memory model that allows some memory optimizations, rather than the core ISA itself.
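As a rough illustration of that point, here is a minimal C11 sketch (my own example, not taken from any poster; the variable names are made up). Under x86's TSO model, two plain stores always become visible in program order, so the hardware has little freedom to buffer or reorder them; under the weaker ARM64/RISC-V (RVWMO) models, the relaxed stores below may be reordered or coalesced unless the programmer explicitly requests ordering, which is exactly the slack the hardware can exploit:
Code:
#include <stdatomic.h>

atomic_int data  = 0;
atomic_int ready = 0;

void producer_relaxed(void)
{
    /* Relaxed stores: ARM64/RISC-V hardware may complete these in either
     * order; x86 (TSO) still keeps them in program order. */
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void producer_ordered(void)
{
    /* To guarantee that a consumer which sees ready==1 also sees data==42 on
     * every ISA, publish with release semantics: on x86 this is still just a
     * plain store, while on ARM64/RISC-V it becomes a store-release or an
     * explicit fence, giving up some of that reordering freedom. */
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_release);
}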
If we're going to see a shift from x86 towards ARM64 or RISC-V, I don't think it will be due to limitations in the x86 ISA itself, but rather due to the more open market model allowing things like cloud providers to design their own chips rather than pay the x86 duopoly tax. Or for countries outside the US that are worried about "digital sovereignty" or US trade restrictions (there seems to be a lot of RISC-V interest in China, for instance).
Originally posted by coder View Post
Yeah, but I think some common subsets will settle out. Just look at what's happening with x86-64 feature levels, as an example of how it can work.
Originally posted by jabl View Post
Sure, ARM64 and RISC-V are in many ways nicer and cleaner than x86, but that matters mostly for things like microcontrollers where the decoder eats up a nontrivial part of the chip area and power. For a beefy wide superscalar OoO core, meh.
Originally posted by jabl View Post
And before someone says "but, but, Apple M1!", yes it's an impressive chip, but it's also fabbed on a newer process node than any x86 chip on the market,
Originally posted by jabl View Post
On the high end, I think the biggest advantage ARM64 and RISC-V enjoy over x86 is the more relaxed memory model that allows some memory optimizations, rather than the core ISA itself.
Originally posted by coder View Post
For beefy, wide superscalar implementations, the x86 decoder becomes a bottleneck and burns a lot of power. Intel and AMD have both had trouble getting it past 4-wide, and even then it's like 1 general-purpose and 3 restricted ports. Alder Lake's Golden Cove has a 5-wide decoder, but we don't know what sorts of restrictions apply to the ports. Meanwhile, Apple is using 8-wide, and it's probably also smaller and more power-efficient.
A recent-ish paper showed that for a Haswell, decoding consumed, depending on the workload, between 3 and 10% of the total core power: https://www.usenix.org/system/files/...aper-hirki.pdf
A simpler decoder like what you might have in a RISC will certainly reduce that number, but nowhere near eliminate it. So we're talking about low single-digit percentage differences here.
Note also that all(?) recent x86 designs as well as many ARM designs use uop caches, so they don't need such wide decoders. The M1 doesn't, and hence it does need a wide decoder.
Also note that for RISC-V with the C extension, which it seems more or less all implementations are adopting, you also have the problem of decoding variable length instructions. Granted it's only 2 or 4 bytes and not anything between 1 and 15(?) bytes like on x86, but it has the same problem of needing to decode previous instructions to know the boundary for the next one. AFAIU the strategy is similar to x86, namely fetch a block of memory and speculatively start decoding from all possible boundaries. That is, every 2 bytes for RVC and every single byte for x86. Then as the length of a preceding instruction is decoded you kill those branches that must be invalid. Sure, it burns more power but in the grand scheme of things probably not that much.
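To make that "decode at every possible boundary, then discard the bogus starts" strategy concrete, here is a small software model in C (just my sketch of the idea, with made-up example bytes and a made-up helper rvc_len; real hardware does this with parallel combinational logic rather than loops):
Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Length in bytes of the (possible) RISC-V instruction starting at p:
 * 4 bytes when the low two bits are 11, otherwise 2 bytes (RVC). */
static int rvc_len(const uint8_t *p)
{
    return ((p[0] & 0x3) == 0x3) ? 4 : 2;
}

int main(void)
{
    /* A 16-byte fetch block of arbitrary example bytes. */
    uint8_t block[16] = {
        0x13, 0x05, 0x10, 0x00,  /* 4-byte instruction */
        0x01, 0x45,              /* 2-byte instruction */
        0x97, 0x02, 0x00, 0x00,  /* 4-byte instruction */
        0x82, 0x80,              /* 2-byte instruction */
        0x01, 0x00, 0x01, 0x00   /* two 2-byte instructions */
    };

    /* Step 1: speculatively compute a length at every 2-byte offset, as if
     * each one were an instruction start (all in parallel in hardware). */
    int len[8];
    for (int i = 0; i < 8; i++)
        len[i] = rvc_len(&block[2 * i]);

    /* Step 2: follow the real boundaries from offset 0 and keep only the
     * starts the chain of lengths actually reaches; the other speculative
     * decodes are the "branches that must be invalid" and get killed. */
    int valid[8];
    memset(valid, 0, sizeof valid);
    for (int off = 0; off < 16; off += len[off / 2])
        valid[off / 2] = 1;

    for (int i = 0; i < 8; i++)
        printf("offset %2d: %d-byte decode %s\n",
               2 * i, len[i], valid[i] ? "kept" : "discarded");
    return 0;
}
The same picture applies to x86, except that step 1 has to run at every single byte and the length computation itself is far messier.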
For RVC at least, one can argue that the C extension is a win overall: even though you spend more power decoding, you save it through better code density, which means less power spent fetching instructions from memory. (If you look at power usage in modern microelectronics, data movement is stupidly expensive compared to logic.)
Perf/W is important not only for mobile, but also servers.
Originally posted by brucehoult View Post
CISC and RISC are properties of instruction sets. The P6 has the same old CISC instruction set.
Originally posted by brucehoult View Post
RISC instruction sets make it cheap and easy to use certain high performance CPU implementation techniques.
Originally posted by brucehoult View Post
Exactly the opposite -- 1985 ARM is a CISC/RISC hybrid. AArch64 (aka ARMv8-A) in 2012 is almost as pure RISC as RISC-V, MIPS, and Alpha -- ARM dumped all the complex stuff, such as load/store multiple.
Originally posted by brucehoult View Post
Not so. RISC-V is very much old-school, back-to-first-principles RISC.
Originally posted by jabl View Post
A recent-ish paper showed that for a Haswell, decoding consumed, depending on the workload, between 3 and 10% of the total core power: https://www.usenix.org/system/files/...aper-hirki.pdf
A simpler decoder like what you might have in a RISC will certainly reduce that number, but nowhere near eliminate it. So we're talking about low single-digit percentage differences here.
How, then, to explain why Intel can't make a competitively fast but low power x86, despite decades of trying to do so?
Originally posted by jabl View Post
Also note that for RISC-V with the C extension, which it seems more or less all implementations are adopting, you also have the problem of decoding variable length instructions. Granted it's only 2 or 4 bytes and not anything between 1 and 15(?) bytes like on x86, but it has the same problem of needing to decode previous instructions to know the boundary for the next one.
Here's one scheme. Make a decoder block that can output either one or two instructions. It looks fundamentally at 32 bits (4 bytes) of the input stream, plus optionally at the preceding 16 bits. Let's call the byte range potentially looked at -2 to +4. There is a 1-bit input from the previous decoder block, if any, called ... let's say UNALIGNED, and a similar 1-bit output to the next stage.
The output from the decoder stage is one of the following possibilities:
- if UNALIGNED in is false then 1) a single 4-byte instruction starting at 0, or 2) two 2-byte instructions starting at 0 and 2, or 3) one 2-byte instruction starting at 0 and UNALIGNED_OUT=true
- if UNALIGNED in is true then 1) a 4-byte instruction starting at -2 and a 2-byte instruction starting at 2, or 2) a 4-byte instruction starting at -2 and UNALIGNED_OUT=true
That's five possibilities.
Or, to put it another way: The first instruction output is a 4-byte instruction starting at -2 or 0, or a 2-byte instruction starting at 0. The second instruction output is either a 2-byte instruction starting at 2, or nothing.
You can propagate the UNALIGNED signal as follows:
Code:
UNALIGNED_OUT = (UNALIGNED | !(bit[0] & bit[1])) & bit[16] & bit[17]
In an FPGA that's all just a single LUT6 anyway.
You can put as many of these decoder blocks in parallel as you want, each looking at 4 bytes of code (and possibly the previous 2 bytes). The UNALIGNED signal has to ripple through them, but it's very fast.
In many designs you could afford to mux either bits -16..15 or bits 0..31 into a single decoder that can deal with either a 16 bit or 32 bit instruction. The second instruction can only ever be a compressed instruction from bits 16..31.
Or, for more speed at the cost of a little more hardware you could have a 32-bit (only) decoder looking at bits -16..15, a 16- or 32-bit decoder looking at bits 0..31, and a 16-bit (only) decoder looking at bits 16..31. You then use the UNALIGNED input to mux the outputs of the decoders. This gives much more time for the UNALIGNED signal to propagate.
Four of these decoder blocks can decode 16-18 bytes of program code into between 4 and 8 instructions (with average close to 6). So UNALIGNED only has to propagate into three decoder blocks during that clock cycle.
Six of these decoder blocks can decode 24-26 bytes of program code into between 6 and 12 instructions, with an average of near 9 instructions. That's almost certainly more than you want to do anyway, because you're very likely to have a taken branch somewhere in those instructions -- even if you can build a back end wide enough to process them.
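For what it's worth, here is a small C model of that decoder block scheme as I read it (my own sketch, not a definitive implementation; is32, decode_block, and the example parcels are all made up, and only the low two bits of each parcel matter). Each block owns the instructions that start inside its 4 bytes, except that a 32-bit instruction spilling across a block boundary is emitted by the following block, which is what the rippling UNALIGNED signal conveys:
Code:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A 16-bit parcel starts a 32-bit instruction when its low two bits are 11. */
static bool is32(uint16_t parcel) { return (parcel & 0x3) == 0x3; }

/* Model of one decoder block. 'lo' and 'hi' are the parcels at bytes 0 and 2
 * of the block, 'prev' is the last parcel of the previous block (the -2..-1
 * bytes this block may also look at), 'base' is the block's byte offset.
 * Returns UNALIGNED_OUT: true when a 32-bit instruction starting at byte 2
 * spills into the next block. */
static bool decode_block(uint16_t prev, uint16_t lo, uint16_t hi,
                         bool unaligned_in, int base)
{
    if (unaligned_in) {
        /* First emit the 32-bit instruction that began 2 bytes before us. */
        printf("  32-bit 0x%08x at byte %d\n",
               (uint32_t)prev | ((uint32_t)lo << 16), base - 2);
        if (is32(hi))
            return true;                  /* 32-bit at byte 2 spills onward */
        printf("  16-bit 0x%04x at byte %d\n", hi, base + 2);
        return false;
    }
    if (is32(lo)) {
        printf("  32-bit 0x%08x at byte %d\n",
               (uint32_t)lo | ((uint32_t)hi << 16), base);
        return false;                     /* bytes 0..3 fully consumed */
    }
    printf("  16-bit 0x%04x at byte %d\n", lo, base);
    if (is32(hi))
        return true;                      /* 32-bit at byte 2 spills onward */
    printf("  16-bit 0x%04x at byte %d\n", hi, base + 2);
    return false;
}

int main(void)
{
    /* 16 bytes (8 parcels, 4 decoder blocks) of made-up code laid out as
     * 16b, 32b, 32b, 16b, 16b, 16b -- six instructions, i.e. close to the
     * average mentioned above. */
    const uint16_t parcels[8] = {
        0x4501,          /* 16-bit */
        0x0513, 0x0010,  /* 32-bit, spans blocks 0 and 1 */
        0x0297, 0x0000,  /* 32-bit, spans blocks 1 and 2 */
        0x8082,          /* 16-bit */
        0x0001, 0x0001   /* two 16-bit */
    };

    bool unaligned = false;  /* nothing spills in ahead of this fetch group */
    for (int b = 0; b < 4; b++) {
        printf("block %d:\n", b);
        unaligned = decode_block(b ? parcels[2 * b - 1] : 0,
                                 parcels[2 * b], parcels[2 * b + 1],
                                 unaligned, 4 * b);
    }
    return 0;
}
In hardware the four blocks evaluate simultaneously and only the one-bit UNALIGNED signal ripples between them, exactly as described above.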