Originally posted by In_Mint_Condition
OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs
-
Originally posted by coder
No, I'm pretty sure the last several iterations of IA64 CPUs were only by contractual obligation. I think they gave up on IA64 after about the second generation CPU.
-
Originally posted by coder
What I really want is for Michael to keep covering all tech, whether it's from China, Russia, or anywhere else. I don't think he's easily dissuaded by arguments in the forums, but it would be nice if we can also have intelligent discussions about this tech, and not get bogged down in politics.
Thanks to you and other Russians for being here and helping us understand your cool CPUs.
-
Originally posted by mSparks
Yes, I got that, I just didn't get what you think is the difference between M1 long instruction words that contain multiple RISC instructions that run in parallel and Elbrus VLIW that contain multiple RISC instructions that run in parallel.
You keep mentioning out of order, but these are all parallel instructions, the whole point is there is no order.
Now, there have been CPUs that use a VLIW-like backend, with a frontend converting the instructions of a "traditional" ISA to VLIW-like internal instructions. Like Transmeta and NVIDIA Denver, IIRC. But I have seen no indications anywhere that Apple M1 would be anything like that. Everything I've seen suggests the M1 microarchitecture is a "normal" OoO core design.
-
Originally posted by coder
"APB (automatic prefetch buffer) programmable to deliver RAM contents at given patterns into L2 cache predictably"
Prefetching is essential, even for modern, out-of-order cores. Because even they don't have big enough reorder buffers to hide the latency of a read that has to go all the way out to DRAM. And deep reorder buffers presume you can even find enough work to do that doesn't depend on the missing data.
I mean, if prefetching were possible all the time, we wouldn't need any caches; we would just load a register 1000 cycles before using it.
The tragedy of VLIW is that any single unexpected cache miss stalls the thread for an unknown number of cycles. It has all the cons of OoO, like the need to have some work to do while waiting for data, but all these problems are amplified by VLIW. The number of independent instructions is limited. OoO can use them when they are needed: in case of an L1 cache miss it only needs to cover the L2 latency. With VLIW you have to predict data availability in advance, so you have to cover up to the full RAM access latency. And what if there are two dependent memory accesses, like ptr1->parent.data? Will the compiler have enough independent instructions to hide two RAM accesses?
VLIW proponents usually talk about a magic compiler that will somehow solve all problems. But that is nonsense. Any compiler can only guess code behavior with some degree of certainty. You are talking about PGO, so this is what you mean: PGO just allows the compiler to estimate probabilities more accurately. But the unforgiving nature of VLIW forces the compiler to be more pessimistic. That means that if someone is able to create a state-of-the-art compiler that produces near-perfect code for a 16-way VLIW, then this compiler can easily be adapted to be more optimistic and to rearrange scalar code in such a way that some 24-way OoO superscalar CPU will be able to fill all its pipelines.
OoO is always better than VLIW. And the second tragedy of VLIW: it has no real advantages. The idea behind it was to pack more computational power into the CPU. The classic ALU + control design looked inefficient, so let's add more ALUs. But this problem has already been solved, partly by superscalar, partly by SIMD. VLIW has nothing to offer to justify the problems it brings.
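To make the dependent-load argument above concrete, here is a rough toy cycle count in Python. It is only a sketch under loud assumptions, not a model of Elbrus or of any real core: the function names, the one-cycle-per-unit "independent work", the 64-cycle out-of-order window, and the 4-cycle (L1 hit) and 200-cycle (DRAM) latencies are all made up for illustration.

```python
def static_schedule_cycles(load_latencies, assumed_latency, independent_work):
    """Toy in-order, statically scheduled core (hypothetical model).

    The compiler commits up front to placing up to `assumed_latency` cycles
    of independent work between each load and its first use. If the data
    takes longer than that to arrive, the whole pipeline stalls.
    """
    cycles = 0
    work_left = independent_work
    for actual in load_latencies:  # each load depends on the previous one
        hidden = min(work_left, assumed_latency)
        work_left -= hidden
        cycles += max(actual, hidden)
    return cycles + work_left  # leftover work runs after the chain


def out_of_order_cycles(load_latencies, window, independent_work):
    """Toy out-of-order core (hypothetical model).

    The hardware decides at run time: while a load is outstanding it keeps
    executing independent work, limited by the reorder window and by how
    long the load actually takes.
    """
    cycles = 0
    work_left = independent_work
    for actual in load_latencies:
        hidden = min(work_left, window, actual)
        work_left -= hidden
        cycles += actual
    return cycles + work_left


# Two dependent DRAM misses, 100 cycles of independent work available.
print(static_schedule_cycles([200, 200], assumed_latency=4, independent_work=100))    # 492
print(static_schedule_cycles([200, 200], assumed_latency=200, independent_work=100))  # 400
print(out_of_order_cycles([200, 200], window=64, independent_work=100))               # 400
```

In this toy model, a static schedule that optimistically assumes L1 hits pays the full miss plus the work it could not overlap, while a pessimistic schedule matches the out-of-order count only by spending all independent work on the first load. In every case the two dependent misses still cost 400 cycles, because 100 cycles of independent work cannot cover them.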
-
Originally posted by coder
Let's say you're right. If the point of your posts is primarily to feed their algorithms, it might be self-defeating. You could just end up informing them of which facts to refute.
Presenting some facts would be a nice move in the right direction.
Originally posted by jabl
No, there are no "M1 long instruction words that contain multiple RISC instructions that run in parallel". Just look at any aarch64 manual.
Perhaps you should look at "why the M1 is so fast".
e.g.
https://www.linkedin.com/pulse/refle...ric-kolotyluk/
I have since learned that Apple's M1 incorporates some of the innovations from this product, in particular, the Very Long Instruction Word architecture
Last edited by mSparks; 24 February 2022, 08:01 AM.
-
Originally posted by caligula
Who's the idiot now? He started the genocide a few minutes ago.
My time is too precious to talk to such ignorant idiots!
-
Originally posted by In_Mint_Condition
I would ask if you even knew the Donbass to be independent and not part of Ukraine since 2015...
But surely you knew that?