OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs
Originally posted by coder View Post
Okay, in an out-of-order CPU, we call that General the Scheduler. He continually monitors the marching, and when one column gets held up, he rearranges the soldiers (still ordering them by rank) to keep the parade moving at a good rate. He has a notebook he uses to track which soldiers go where, and that's called the Reorder Buffer.
In a VLIW CPU, the General is the compiler. He tells the soldiers where to go in advance. He knows enough about each soldier to guess which ones might get held up by certain obstacles, and arranges them accordingly.
Not a perfect analogy, but I tried.
: )
You said:
Originally posted by coder View Post
the reorder buffer (also called the ROB). This is a structure used for instruction scheduling, specifically when you're executing them out of order.
Originally posted by mSparks View Post
Isn't that the whole point of VLIW?
Originally posted by coder View Post
No, a reorder buffer is for on-the-fly scheduling of instructions onto execution units. That runs counter to VLIW. In classical VLIW, all scheduling of instructions onto execution units happens at compile time. And that means whenever there's a stall, it affects all execution pipelines, because they all run in lock-step.
And the "whole point" of VLIW is the instructions are sent in batches to the CPU already "out of order" (the left to right in your earlier diagram).
Using a simple example:
ADD a,b
CMP b,c
ADD d,e
3 instructions in 3 clock cycles
"out of order" in a processor with two adders would ideally turn that into
[ADD a,b],[ADD d,e]
then
[CMP b,c]
3 instructions in 2 clock cycles
VLIW just sends the above directly to the processor, rather than relying on the hardware to identify that ADD a,b and CMP b,c must be done in order, while ADD a,b and ADD d,e can be done out of order.
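The bundling described above can be sketched in a few lines. This is a minimal toy, not any real compiler or scheduler: it assumes AT&T-style two-operand semantics, so "ADD a,b" reads a and b and writes b, which is what makes CMP b,c depend on the first ADD.

```python
# Toy sketch of how independent instructions get bundled, whether by a
# VLIW compiler ahead of time or by OoO hardware on the fly.
# Instruction = (mnemonic, source registers, destination or None).

def deps(instrs):
    """For each instruction, the set of earlier indices it must wait for."""
    out = []
    for i, (_op, srcs, dst) in enumerate(instrs):
        needs = set()
        for j in range(i):
            _, psrcs, pdst = instrs[j]
            if pdst is not None and pdst in srcs:
                needs.add(j)            # read-after-write hazard
            if dst is not None and (dst in psrcs or dst == pdst):
                needs.add(j)            # write-after-read/write hazard
        out.append(needs)
    return out

def bundle(instrs, width=2):
    """Greedy list scheduling into bundles of at most `width` slots."""
    d = deps(instrs)
    done, bundles = set(), []
    remaining = list(range(len(instrs)))
    while remaining:
        slot = [i for i in remaining if d[i] <= done][:width]
        bundles.append([instrs[i] for i in slot])
        done |= set(slot)
        remaining = [i for i in remaining if i not in slot]
    return bundles

prog = [("ADD", ("a", "b"), "b"),
        ("CMP", ("b", "c"), None),
        ("ADD", ("d", "e"), "e")]
for cycle, b in enumerate(bundle(prog)):
    print("cycle", cycle, [op for op, *_ in b])
# cycle 0 ['ADD', 'ADD']
# cycle 1 ['CMP']
```

With two adders available, the three instructions issue in two cycles, matching the example above.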
Comment
Originally posted by Khrundel View Post
Yes, I confirm. I just wanted to mention that this is where quantity becomes quality. Consider a perfect VLIW with zero-latency memory: to make it efficient, you need N independent "subtasks", where N is the VLIW width. In the imperfect case, when you need to hide memory latency, for example 10 cycles (a typical L2 hit), you need N*10 instructions that don't depend on data from memory. In the case of L3, that becomes N*40.
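The arithmetic in that quote is simple enough to spell out. A back-of-envelope sketch, using an illustrative issue width of 6 (not a measurement of any specific chip):

```python
# To keep an N-wide VLIW busy across a memory access of L cycles, the
# compiler must find roughly N * L instructions that do not consume the
# loaded value.  Latency numbers are illustrative only.
def independent_work_needed(width, latency_cycles):
    return width * latency_cycles

for name, lat in [("register", 1), ("L2 hit", 10), ("L3 hit", 40)]:
    print(name, independent_work_needed(6, lat))
# register 6
# L2 hit 60
# L3 hit 240
```

The point being: the deeper the memory access, the more independent work the compiler must have found statically.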
Originally posted by Khrundel View Post
Not sure. When you need big caches, you'll have to waste much more energy.
Originally posted by Khrundel View Post
SIMD will show the same performance, and some FPGA or ASIC units will crush VLIW. Or, if you like exotic solutions, Cell-like CPUs can beat VLIW.
The fact you're trying to wish out of existence is that VLIW is alive and well, in the bowels of most smartphone SoCs and AI accelerators. Here's a list of some modern microarchitectures that are VLIW-based:
Originally posted by Khrundel View Post
Yes, R600 was based on VLIW. With SMT you can hide latencies, and GPU tasks tend toward vector math, so it is easy to fill a wide word with independent instructions.
Only Intel, traditionally the weakest player in the PC GPU race, has taken the approach of supporting general vector arithmetic in their GPU ISA.
Originally posted by Khrundel View Post
VLIW can't compete with OoO in general-use single-threaded performance,
Originally posted by Khrundel View Post
it can't compete with scalar SIMD on GPU-like tasks.
Again, if you just step back from your dogma and actually look around at current DSP ISAs, you'll see VLIW almost everywhere.
Originally posted by Khrundel View Post
I can trust Intel on this. They admitted EPIC has no value.
Like it or not, there are plenty of non-technical explanations for the demise of IA64. Intel learned too well the lesson that x86 is king, and that's how we ended up with failed products like Xeon Phi and their x86 cell phone SoCs.
Comment
Originally posted by Khrundel View Post
Thanks for the reminder. Another flaw in the VLIW design is that you can't rearrange your CPU microarchitecture without recompiling everything. You can't just create a more energy-efficient LITTLE core with one FP-ALU removed.
For embedded applications, the software is already nearly always coupled to the hardware. So, this is no liability for them (e.g. DSPs).
Comment
Originally posted by mSparks View Post
But AIUI the "out of order" mechanism takes a list of sequential "in order" instructions and bundles them together into instructions that can run in parallel, "out of order".
Originally posted by mSparks View Post
And the "whole point" of VLIW is the instructions are sent in batches to the CPU already "out of order" (the left to right in your earlier diagram).
Originally posted by mSparks View Post
VLIW just sends the above directly to the processor, rather than relying on the hardware to identify that ADD a,b and CMP b,c must be done in order, and ADD a,b and ADD d,e can be done out of order.
EPIC is the half-way solution that packages the instructions into blocks and then indicates that the second block depends on the first. That gives the CPU the flexibility to reorder things, without the burden of re-deriving those data dependencies each time it sees those instructions.
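The idea can be modeled in a few lines. This is a rough sketch of the concept, not the actual IA-64 encoding; the only real detail borrowed is that IA-64 assembly writes an explicit stop as ";;" between instruction groups:

```python
# EPIC-style instruction groups: everything between two stops is
# declared independent by the compiler, so the hardware can issue the
# whole group in parallel without rediscovering the dependencies.
STOP = ";;"  # IA-64 assembly marks a group boundary with ';;'

stream = ["add r1 = r2, r3",
          "add r4 = r5, r6",
          STOP,                     # next group consumes results above
          "cmp.eq p1 = r1, r4"]

def groups(stream):
    """Split a flat instruction stream into issue groups at stop marks."""
    out, cur = [], []
    for tok in stream:
        if tok == STOP:
            out.append(cur)
            cur = []
        else:
            cur.append(tok)
    if cur:
        out.append(cur)
    return out

for cycle, g in enumerate(groups(stream)):
    print("issue group", cycle, g)
```

The hardware sees two groups: both adds issue together, and the compare waits for the stop boundary.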
Comment
Originally posted by coder View Post
If the CPU were in-order, then whatever comes after that memory read instruction would have to wait until the read completes, even if the next instruction didn't use the result of the read.
Comment
Originally posted by coder View Post
The benefit is that VLIW cores don't need to waste die space on detecting data dependencies or scheduling. Virtually all of the silicon is used for actual computation and memories.
The reorder buffer does in silicon what the VLIW compiler calculates before the instructions are even sent to the processor.
Obviously, not all ROB silicon or VLIW compilers are created equal, but they are doing pretty much the same thing: removing the requirement that every instruction run in strict sequence, one after the other.
Comment
Originally posted by coder View PostNot a perfect analogy, but I tried.
: )
Originally posted by Khrundel View Post
In the same scenario, OoO will stall each "subtask" individually, and for the minimum required time each.
The hardware might simply be unable to find parallelism, due to ROB size limitations and hard-realtime constraints. That's the uphill battle OoO engineers fight, in case someone missed it all the way up to the M1.
Originally posted by Khrundel View Post
Yes, R600 was based on VLIW.
Originally posted by Khrundel View Post
There is no niche for them.
Originally posted by Khrundel View Post
Thanks for the reminder. Another flaw in the VLIW design is that you can't rearrange your CPU microarchitecture without recompiling everything.
I'm writing this from an e2kv4 workstation that started out running v3 ALT back when that was all we had. The 7za b performance difference between v3-on-v4 and v4-on-v4 was about 1% (small but consistent). My ALT Rescue image for v3 runs on anything from v3 to v6, at least in its [only] text mode.
So the MCST folks actually can rearrange their CPU design to some extent, and they can certainly extend it -- which is the typical case for the OoO CPUs you didn't mention directly, but rather implied, it seems.
IOW: can "you" rearrange -- as in mixing up, not just extending -- any ISA and not have to recompile everything? If "obviously not", then what was your point?
Originally posted by mSparks View Post
VLIW just sends the above directly to the processor, rather than relying on the hardware to identify that ADD a,b and CMP b,c must be done in order, and ADD a,b and ADD d,e can be done out of order.
As I've mentioned already, you can have some fun with ce.mentality.rip's Compiler Explorer instance.
Comment
Originally posted by mshigorin View Post
Erm, Elbrus specifically has a bunch of single-bit predicate registers and supports controlled speculative execution with "flagged" branching.
As I've mentioned already, you can have some fun with ce.mentality.rip's Compiler Explorer instance.
We had several pages of someone (not coder, IIRC) trying to convince me that VLIW can only run ops in order (specifically, that it cannot do out-of-order instructions).
Comment
Originally posted by coder View Post
There are ways to mitigate memory latencies. Already mentioned: prefetchers.
Originally posted by coder View Post
Another thing: you can use SMT in a round-robin fashion to halve/quarter/etc. software-visible latencies.
Originally posted by coder View Post
Not true. With prefetchers or DMA engines, you only need caches or on-die memory that's maybe ~2x the size of your working set.
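The "~2x the working set" figure comes from the classic double-buffering pattern, which can be sketched as follows (a toy model; fetch and compute are placeholders for DMA and the actual kernel, and in real hardware the fetch of the next tile overlaps with compute on the current one):

```python
# Double buffering: while the core computes on one buffer, a
# prefetcher/DMA engine fills the other, so on-die memory only needs
# two tiles' worth of capacity -- roughly 2x the working set of a tile.
def process(tiles, fetch, compute):
    buf = [None, None]
    buf[0] = fetch(tiles[0])              # prime the first buffer
    results = []
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            # In hardware this fetch proceeds concurrently with compute.
            buf[(i + 1) % 2] = fetch(tiles[i + 1])
        results.append(compute(buf[i % 2]))
    return results

print(process([1, 2, 3], fetch=lambda t: t * 10, compute=lambda b: b + 1))
# [11, 21, 31]
```

Only two buffers are ever live, regardless of how many tiles stream through.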
Originally posted by coder View Post
So, why are so many AI chips using VLIW cores?
I think, as anywhere, it is more like "we are too small to put all our eggs in one basket, so we start with something more flexible and then replace the hotspot with an ASIC". Are Nvidia's tensor cores based on VLIW?
Originally posted by coder View Post
Why did Intel buy Nervana and Habana (both VLIW) when it already had Altera (FPGA), for many years?
Originally posted by coder View Post
The fact you're trying to wish out of existence is that VLIW is alive and well, in the bowels of most smartphone SoCs and AI accelerators. Here's a list of some modern microarchitectures that are VLIW-based:
Originally posted by coder View Post
This tells me you don't know GPUs as well as you'd like us to believe. If you think their SIMD implementation is just SSE or AVX on steroids,
Comment