Intel Joins RISC-V International, Will Help With RISC-V Open-Source Software
Originally posted by jabl View Post
Well, the situation with Itanium was pretty different. Back then R&D suggested that superscalar out-of-order CPUs (both x86 and RISC) would be running into scaling limits, and the solution they came up with was a very wide, compiler-scheduled VLIW. If this plan had worked out, not only would Itanium have left x86 and the various workstation/server RISC CPUs (which were very much alive back then) far behind performance-wise, but Itanium was also protected to the hilt with patents etc., and Intel apparently imagined a world where they would enjoy monopoly profits for a very long time into the future.
Originally posted by jabl View Post
Of course, in retrospect we know this didn't work out, for multiple reasons, primarily that Intel's compiler team was never able to come up with the magic compiler that would make VLIW work great on general-purpose code,
I think one of the problems with IA64 is that Intel took its foot off the gas before the ISA ever really got close to maximizing its potential. For instance, explicit parallelism means you can do some things at runtime that classical VLIW can't, such as OoO and speculative execution. However, Intel never made an IA64 CPU that did either of these things. By the time they would have, the writing was already on the wall that x86-64 was the way forward.
Also, IA64 was never extended with SIMD instructions. This meant that SSE-enabled CPUs quickly surpassed IA64 in floating-point performance, which had been one of its early advantages.
Originally posted by jabl View Post
RISC-V is different, as there's no protective moat like with Itanium and, to a lesser extent, x86-64. They may be interested in RISC-V as a way to hurt ARM, but no, I'm sure they have no intention of helping RISC-V step on their x86 toes.
Originally posted by jabl View Post
Make no mistake about it, where there is a choice of what to push, Intel would certainly choose x86, as they have a dominant position in that ecosystem, with other entrants (except AMD) denied access, whereas in the RISC-V world they'd just be one more seller of CPUs among many others.
Last edited by coder; 09 February 2022, 03:41 AM.
Originally posted by pabloski View Post
To be precise, Intel has already gone RISC, starting with the P6 architecture.
RISC instruction sets make it cheap and easy to use certain high performance CPU implementation techniques.
With the P6, Intel found a difficult and expensive way to apply those techniques to CISC. (Actually, this process started with the 80486.)
And ARM has gone partially CISC, because the ISA has been updated with complex instructions.
At the current scale of integration, with superscalar execution, multiple pipelines, etc., it makes less sense to classify CPU designs as CISC or RISC. CISC exists only in the instruction sets (which get decoded into micro-ops), while old-school RISC doesn't exist at all anymore.
Originally posted by coder View Post
*snip*
Well, if you believe x86 is going to hit a wall before either ARM or RISC-V
And before someone says "but, but, Apple M1!", yes it's an impressive chip, but it's also fabbed on a newer process node than any x86 chip on the market, and is overall a very good microarchitectural design regardless of the ISA.
On the high end, I think the biggest advantage ARM64 and RISC-V enjoy over x86 is the more relaxed memory model that allows some memory optimizations, rather than the core ISA itself.
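As a rough illustration of that point, here is a minimal C11 sketch (my own example, not taken from any poster; the variable names are made up). Under x86's TSO model, two plain stores always become visible in program order, so the hardware has little freedom to buffer or reorder them; under the weaker ARM64/RISC-V (RVWMO) models, the relaxed stores below may be reordered or coalesced unless the programmer explicitly requests ordering, which is exactly the slack the hardware can exploit:
Code:
#include <stdatomic.h>

atomic_int data  = 0;
atomic_int ready = 0;

void producer_relaxed(void)
{
    /* Relaxed stores: ARM64/RISC-V hardware may complete these in either
     * order; x86 (TSO) still keeps them in program order. */
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void producer_ordered(void)
{
    /* To guarantee that a consumer which sees ready==1 also sees data==42 on
     * every ISA, publish with release semantics: on x86 this is still just a
     * plain store, while on ARM64/RISC-V it becomes a store-release or an
     * explicit fence, giving up some of that reordering freedom. */
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_release);
}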
If we're going to see a shift from x86 towards ARM64 or RISC-V, I don't think it will be due to limitations in the x86 ISA itself, but rather due to the more open market model allowing things like cloud providers to design their own chips rather than pay the x86 duopoly tax. Or for countries outside the US that are worried about "digital sovereignty" or US trade restrictions (there seems to be a lot of RISC-V interest in China, for instance).
Originally posted by coder View Post
Yeah, but I think some common subsets will settle out. Just look at what's happening with x86-64 feature levels, as an example of how it can work.
Originally posted by jabl View Post
Sure, ARM64 and RISC-V are in many ways nicer and cleaner than x86, but that matters mostly for things like microcontrollers where the decoder eats up a nontrivial part of the chip area and power. For a beefy wide superscalar OoO core, meh.
Originally posted by jabl View Post
And before someone says "but, but, Apple M1!", yes it's an impressive chip, but it's also fabbed on a newer process node than any x86 chip on the market,
Originally posted by jabl View Post
On the high end, I think the biggest advantage ARM64 and RISC-V enjoy over x86 is the more relaxed memory model that allows some memory optimizations, rather than the core ISA itself.
Originally posted by coder View Post
For beefy, wide superscalar implementations, the x86 decoder becomes a bottleneck and burns a lot of power. Intel and AMD have both had trouble getting it past 4-wide, and even then it's like 1 general-purpose and 3 restricted ports. Alder Lake's Golden Cove has a 5-wide decoder, but we don't know what sorts of restrictions apply to the ports. Meanwhile, Apple is using 8-wide, and it's probably also smaller and more power-efficient.
A recent-ish paper showed that for a Haswell, decoding consumed, depending on the workload, between 3 and 10% of the total core power: https://www.usenix.org/system/files/...aper-hirki.pdf
A simpler decoder like what you might have in a RISC will certainly reduce that number, but nowhere near eliminate it. So we're talking about low single-digit percentage differences here.
Note also that all(?) recent x86 designs as well as many ARM designs use uop caches, so they don't need such wide decoders. The M1 doesn't, and hence it does need a wide decoder.
Also note that for RISC-V with the C extension, which it seems more or less all implementations are adopting, you also have the problem of decoding variable length instructions. Granted it's only 2 or 4 bytes and not anything between 1 and 15(?) bytes like on x86, but it has the same problem of needing to decode previous instructions to know the boundary for the next one. AFAIU the strategy is similar to x86, namely fetch a block of memory and speculatively start decoding from all possible boundaries. That is, every 2 bytes for RVC and every single byte for x86. Then as the length of a preceding instruction is decoded you kill those branches that must be invalid. Sure, it burns more power but in the grand scheme of things probably not that much.
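To make that "decode at every possible boundary, then discard the bogus starts" strategy concrete, here is a small software model in C (just my sketch of the idea, with made-up example bytes and a made-up helper rvc_len; real hardware does this with parallel combinational logic rather than loops):
Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Length in bytes of the (possible) RISC-V instruction starting at p:
 * 4 bytes when the low two bits are 11, otherwise 2 bytes (RVC). */
static int rvc_len(const uint8_t *p)
{
    return ((p[0] & 0x3) == 0x3) ? 4 : 2;
}

int main(void)
{
    /* A 16-byte fetch block of arbitrary example bytes. */
    uint8_t block[16] = {
        0x13, 0x05, 0x10, 0x00,  /* 4-byte instruction */
        0x01, 0x45,              /* 2-byte instruction */
        0x97, 0x02, 0x00, 0x00,  /* 4-byte instruction */
        0x82, 0x80,              /* 2-byte instruction */
        0x01, 0x00, 0x01, 0x00   /* two 2-byte instructions */
    };

    /* Step 1: speculatively compute a length at every 2-byte offset, as if
     * each one were an instruction start (all in parallel in hardware). */
    int len[8];
    for (int i = 0; i < 8; i++)
        len[i] = rvc_len(&block[2 * i]);

    /* Step 2: follow the real boundaries from offset 0 and keep only the
     * starts the chain of lengths actually reaches; the other speculative
     * decodes are the "branches that must be invalid" and get killed. */
    int valid[8];
    memset(valid, 0, sizeof valid);
    for (int off = 0; off < 16; off += len[off / 2])
        valid[off / 2] = 1;

    for (int i = 0; i < 8; i++)
        printf("offset %2d: %d-byte decode %s\n",
               2 * i, len[i], valid[i] ? "kept" : "discarded");
    return 0;
}
The same picture applies to x86, except that step 1 has to run at every single byte and the length computation itself is far messier.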
For RVC at least, one can argue that the C extension is a win overall: even though you spend more power decoding, you save it through better code density, which means less power spent fetching instructions from memory. (If you look at power usage in modern microelectronics, data movement is stupidly expensive compared to logic.)
Perf/W is important not only for mobile, but also servers.
Originally posted by brucehoult View Post
CISC and RISC are properties of instruction sets. The P6 has the same old CISC instruction set.
Originally posted by brucehoult View Post
RISC instruction sets make it cheap and easy to use certain high performance CPU implementation techniques.
Originally posted by brucehoult View Post
Exactly the opposite -- 1985 ARM is a CISC/RISC hybrid. AArch64 (aka ARMv8-A) in 2012 is almost as pure RISC as RISC-V, MIPS, and Alpha -- ARM dumped all the complex stuff, such as load/store multiple.
Originally posted by brucehoult View Post
Not so. RISC-V is very much old-school, back-to-first-principles RISC.
Originally posted by jabl View Post
A recent-ish paper showed that for a Haswell, decoding consumed, depending on the workload, between 3 and 10% of the total core power: https://www.usenix.org/system/files/...aper-hirki.pdf
A simpler decoder like what you might have in a RISC will certainly reduce that number, but nowhere near eliminate it. So we're talking about low single-digit percentage differences here.
How, then, to explain why Intel can't make a competitively fast but low power x86, despite decades of trying to do so?
Originally posted by jabl View Post
Also note that for RISC-V with the C extension, which it seems more or less all implementations are adopting, you also have the problem of decoding variable length instructions. Granted it's only 2 or 4 bytes and not anything between 1 and 15(?) bytes like on x86, but it has the same problem of needing to decode previous instructions to know the boundary for the next one.
Here's one scheme. Make a decoder block that can output either one or two instructions. It looks fundamentally at 32 bits (4 bytes) of the input stream, plus optionally at the preceding 16 bits. Let's call the byte range potentially looked at -2 to +4. There is a 1-bit input from the previous decoder block, if any, called ... let's say UNALIGNED, and a similar 1-bit output to the next stage.
The output from the decoder stage is one of the following possibilities:
- if UNALIGNED in is false then 1) a single 4-byte instruction starting at 0, or 2) two 2-byte instructions starting at 0 and 2, or 3) one 2-byte instruction starting at 0 and UNALIGNED_OUT=true
- if UNALIGNED in is true then 1) a 4-byte instruction starting at -2 and a 2-byte instruction starting at 2, or 2) a 4-byte instruction starting at -2 and UNALIGNED_OUT=true
That's five possibilities.
Or, to put it another way: The first instruction output is a 4-byte instruction starting at -2 or 0, or a 2-byte instruction starting at 0. The second instruction output is either a 2-byte instruction starting at 2, or nothing.
You can propagate the UNALIGNED signal as follows:
Code:
UNALIGNED_OUT = (UNALIGNED | !(bit[0] & bit[1])) & bit[16] & bit[17]
In an FPGA that's all just a single LUT6 anyway.
You can put as many of these decoder blocks in parallel as you want, each looking at 4 bytes of code (and possibly the previous 2 bytes). The UNALIGNED signal has to ripple through them, but it's very fast.
In many designs you could afford to mux either bits -16..15 or bits 0..31 into a single decoder that can deal with either a 16 bit or 32 bit instruction. The second instruction can only ever be a compressed instruction from bits 16..31.
Or, for more speed at the cost of a little more hardware you could have a 32-bit (only) decoder looking at bits -16..15, a 16- or 32-bit decoder looking at bits 0..31, and a 16-bit (only) decoder looking at bits 16..31. You then use the UNALIGNED input to mux the outputs of the decoders. This gives much more time for the UNALIGNED signal to propagate.
Four of these decoder blocks can decode 16-18 bytes of program code into between 4 and 8 instructions (with average close to 6). So UNALIGNED only has to propagate into three decoder blocks during that clock cycle.
Six of these decoder blocks can decode 24-26 bytes of program code into between 6 and 12 instructions, with an average of near 9 instructions. That's almost certainly more than you want to do anyway, because you're very likely to have a taken branch somewhere in those instructions -- even if you can build a back end wide enough to process them.
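For what it's worth, here is a small C model of that decoder block scheme as I read it (my own sketch, not a definitive implementation; is32, decode_block, and the example parcels are all made up, and only the low two bits of each parcel matter). Each block owns the instructions that start inside its 4 bytes, except that a 32-bit instruction spilling across a block boundary is emitted by the following block, which is what the rippling UNALIGNED signal conveys:
Code:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A 16-bit parcel starts a 32-bit instruction when its low two bits are 11. */
static bool is32(uint16_t parcel) { return (parcel & 0x3) == 0x3; }

/* Model of one decoder block. 'lo' and 'hi' are the parcels at bytes 0 and 2
 * of the block, 'prev' is the last parcel of the previous block (the -2..-1
 * bytes this block may also look at), 'base' is the block's byte offset.
 * Returns UNALIGNED_OUT: true when a 32-bit instruction starting at byte 2
 * spills into the next block. */
static bool decode_block(uint16_t prev, uint16_t lo, uint16_t hi,
                         bool unaligned_in, int base)
{
    if (unaligned_in) {
        /* First emit the 32-bit instruction that began 2 bytes before us. */
        printf("  32-bit 0x%08x at byte %d\n",
               (uint32_t)prev | ((uint32_t)lo << 16), base - 2);
        if (is32(hi))
            return true;                  /* 32-bit at byte 2 spills onward */
        printf("  16-bit 0x%04x at byte %d\n", hi, base + 2);
        return false;
    }
    if (is32(lo)) {
        printf("  32-bit 0x%08x at byte %d\n",
               (uint32_t)lo | ((uint32_t)hi << 16), base);
        return false;                     /* bytes 0..3 fully consumed */
    }
    printf("  16-bit 0x%04x at byte %d\n", lo, base);
    if (is32(hi))
        return true;                      /* 32-bit at byte 2 spills onward */
    printf("  16-bit 0x%04x at byte %d\n", hi, base + 2);
    return false;
}

int main(void)
{
    /* 16 bytes (8 parcels, 4 decoder blocks) of made-up code laid out as
     * 16b, 32b, 32b, 16b, 16b, 16b -- six instructions, i.e. close to the
     * average mentioned above. */
    const uint16_t parcels[8] = {
        0x4501,          /* 16-bit */
        0x0513, 0x0010,  /* 32-bit, spans blocks 0 and 1 */
        0x0297, 0x0000,  /* 32-bit, spans blocks 1 and 2 */
        0x8082,          /* 16-bit */
        0x0001, 0x0001   /* two 16-bit */
    };

    bool unaligned = false;  /* nothing spills in ahead of this fetch group */
    for (int b = 0; b < 4; b++) {
        printf("block %d:\n", b);
        unaligned = decode_block(b ? parcels[2 * b - 1] : 0,
                                 parcels[2 * b], parcels[2 * b + 1],
                                 unaligned, 4 * b);
    }
    return 0;
}
In hardware the four blocks evaluate simultaneously and only the one-bit UNALIGNED signal ripples between them, exactly as described above.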