OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs


  • Originally posted by coder View Post
    It's a bit hard to explain with words, but try to think of a VLIW instruction stream in terms of a 2D grid.
    Thank you for the reminder. Another flaw in the VLIW design is that you can't rearrange your CPU microarchitecture without recompiling everything. You can't just create a more energy-efficient LITTLE core with one FP-ALU removed

    Comment


    • Originally posted by coder View Post
      Okay, in an Out-of-Order CPU, we call that General the Scheduler. He is continually monitoring the marching, and when one column gets held up, he rearranges the soldiers (while still ordering them by rank) to keep the parade moving at a good rate. He has a notebook, which he uses to track which soldiers go where, and that's called the Reorder Buffer.

      In a VLIW CPU, the General is the compiler. He tells the soldiers where to go, in advance. He knows enough about each soldier to guess which ones might get held up by certain obstacles, and arranges them accordingly.

      Not a perfect analogy, but I tried.
      : )
      Just to recap.
      You Said
      Originally posted by coder View Post
      the reorder buffer (also called ROB). This is a structure used for instruction scheduling, specifically when you're executing them out of order.
      I said
      Originally posted by mSparks View Post
      Isn't that the whole point of VLIW?
      You said
      Originally posted by coder View Post
      No, a reorder buffer is for on-the-fly scheduling of instructions to execution units. That runs counter to VLIW. Classical VLIW is where all scheduling of instructions to execution units happens at compile time. And that means whenever there's a stall, it affects all execution pipelines, because they all run in lock-step.
      But AIUI the "out of order" mechanism takes a list of sequential "in order" instructions and bundles them into groups that can run in parallel - "out of order".

      And the "whole point" of VLIW is the instructions are sent in batches to the CPU already "out of order" (the left to right in your earlier diagram).

      Using a simple example:

      ADD a,b
      CMP b,c
      ADD d,e

      3 instructions in 3 clock cycles

      "out of order" in a processor with two adders would ideally turn that into

      [ADD a,b],[ADD d,e]
      then
      [CMP b,c]

      3 instructions in 2 clock cycles

      VLIW just sends the above directly to the processor rather than relying on the hardware to identify that ADD a,b and CMP b,c must be done in order, and ADD a,b and ADD d,e can be done out of order.
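
      Here's a rough C sketch of that bundling step (made-up instruction set, not any real ISA; I'm assuming the second operand is also the destination, so CMP b,c really does depend on ADD a,b). It's the same dependency check whether a compiler does it ahead of time or hardware does it on the fly:

      /* Pack instructions into issue groups of WIDTH, never putting an
       * instruction in the same (or an earlier) cycle as something it
       * depends on.  RAW/WAW hazards only; WAR ignored for brevity. */
      #include <stdio.h>

      typedef struct { const char *text; char src1, src2, dst; } Insn;

      /* Convention (assumed): "OP x,y" reads x and y and writes y; CMP writes nothing. */
      static Insn prog[] = {
          { "ADD a,b", 'a', 'b', 'b' },
          { "CMP b,c", 'b', 'c',  0  },   /* depends on ADD a,b through b */
          { "ADD d,e", 'd', 'e', 'e' },   /* independent                  */
      };
      enum { N = sizeof prog / sizeof prog[0], WIDTH = 2 };  /* two adders */

      static int depends(const Insn *earlier, const Insn *later)
      {
          return earlier->dst && (earlier->dst == later->src1 ||
                                  earlier->dst == later->src2 ||
                                  earlier->dst == later->dst);
      }

      int main(void)
      {
          int cycle_of[N];
          for (int j = 0; j < N; ++j) {
              int earliest = 0;
              for (int i = 0; i < j; ++i)          /* run after our producers */
                  if (depends(&prog[i], &prog[j]) && cycle_of[i] + 1 > earliest)
                      earliest = cycle_of[i] + 1;
              for (;;) {                           /* find a cycle with a free slot */
                  int used = 0;
                  for (int i = 0; i < j; ++i)
                      if (cycle_of[i] == earliest) ++used;
                  if (used < WIDTH) break;
                  ++earliest;
              }
              cycle_of[j] = earliest;
          }
          for (int j = 0; j < N; ++j)
              printf("cycle %d: %s\n", cycle_of[j], prog[j].text);
          return 0;
      }

      It prints the two ADDs in cycle 0 and the CMP in cycle 1 - the same two-cycle schedule as above.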

      Comment


      • Originally posted by Khrundel View Post
        Yes, I confirm. I just wanted to mention that this is where quantity becomes quality. Consider a perfect VLIW with zero-latency memory: to make it efficient you need N independent "subtasks", where N is the VLIW width. In the imperfect case, when you need to hide memory latency, for example 10 cycles (a typical L2), you need N*10 instructions that don't depend on data from memory. In the case of L3, that becomes N*40.
        There are ways to mitigate memory latencies. Already mentioned: prefetchers. Another option is to use SMT in a round-robin fashion to halve/quarter/etc. the software-visible latencies.
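
        To make the prefetcher point concrete, here's a minimal sketch using GCC/Clang's __builtin_prefetch; the 16-element distance is an illustrative guess, not a tuned value:

        #include <stddef.h>

        #define PREFETCH_DIST 16                     /* illustrative, not tuned */

        float sum_scaled(const float *a, const float *b, size_t n, float k)
        {
            float acc = 0.0f;
            for (size_t i = 0; i < n; ++i) {
                if (i + PREFETCH_DIST < n) {         /* request data well ahead   */
                    __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);  /* 0 = read */
                    __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 1);
                }
                acc += a[i] * k + b[i];              /* work that hides the miss  */
            }
            return acc;
        }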

        Originally posted by Khrundel View Post
        Not sure. When you need big caches you'll have to waste much more energy.
        Not true. With prefetchers or DMA engines, you only need caches or on-die memory that's maybe ~2x the size of your working set.
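
        The ~2x comes from classic double buffering: while the core works on one tile, the DMA engine (or prefetcher) fills the other, so on-die memory only needs to hold two tiles. A sketch of the pattern - start_dma_read()/wait_dma() are hypothetical stand-ins for whatever the platform actually provides, stubbed here with memcpy so it compiles:

        /* Double-buffering sketch: process a large array in TILE-sized chunks
         * using two on-die buffers, i.e. ~2x one tile of working set. */
        #include <stddef.h>
        #include <string.h>

        #define TILE 1024

        static void start_dma_read(float *dst, const float *src, size_t n)
        {
            memcpy(dst, src, n * sizeof *dst);   /* stand-in: a real DMA runs async */
        }
        static void wait_dma(void) { /* stand-in: wait for the outstanding DMA */ }

        static void process_tile(float *tile, size_t n)
        {
            for (size_t i = 0; i < n; ++i)       /* placeholder computation */
                tile[i] *= 2.0f;
        }

        void process_stream(const float *src, size_t n_tiles)
        {
            static float buf[2][TILE];           /* two tiles on-die: ~2x working set */
            if (n_tiles == 0)
                return;
            start_dma_read(buf[0], src, TILE);   /* prime the first buffer            */

            for (size_t t = 0; t < n_tiles; ++t) {
                wait_dma();                      /* tile t is now resident            */
                if (t + 1 < n_tiles)             /* kick off the fetch of tile t+1    */
                    start_dma_read(buf[(t + 1) & 1], src + (t + 1) * TILE, TILE);
                process_tile(buf[t & 1], TILE);  /* compute overlaps the next fetch   */
            }
        }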

        Originally posted by Khrundel View Post
        SIMD will show the same performance, and some FPGA or ASIC units will crush VLIW. Or, if you like exotic solutions, Cell-like CPUs can beat VLIW.
        So, why are so many AI chips using VLIW cores? Why did Intel buy Nervana and Habana (both VLIW) when it had already owned Altera (FPGA) for many years?

        The fact that you're trying to wish out of existence is that VLIW is alive and well, in the bowels of most smartphone SoCs and AI accelerators. Here's a list of some modern micro-architectures that are VLIW-based:


        Originally posted by Khrundel View Post
        Yes, R600 was based on VLIW. With SMT you can hide latencies, and GPU tasks tend toward vector math, so it is easy to fill a wide word with independent instructions.
        This tells me you don't know GPUs as well as you'd like us to believe. If you think their SIMD implementation is just like SSE or AVX on steroids, you're wrong. They have virtually no support for horizontal operations. The way they use SIMD is like in the classic supercomputing context, with the fact that it's implemented using vectors being a mere implementation detail. That's why Nvidia adopted the terminology of talking about SIMD lanes as "threads" and calls the actual instruction stream a "Warp" (while AMD calls it a "Wavefront").

        Only Intel, traditionally the weakest player in the PC GPU race, has taken the approach of supporting general vector arithmetic in their GPU ISA.

        Originally posted by Khrundel View Post
        VLIW can't compete with OoO in general-purpose single-threaded performance,
        Nobody said otherwise.

        Originally posted by Khrundel View Post
        it can't compete with scalar SIMD on GPU-like tasks.
        That only helps if you need to do 64 FFTs at once, or at least FFTs on 64 blocks at a time. If you just need an efficient way to compute an FFT on a single block of data, because you're doing low-latency signal processing, then VLIW still wins.

        Again, if you just step back from your dogma and actually look around at current DSP ISAs, you'll see VLIW almost everywhere.

        Originally posted by Khrundel View Post
        I can trust Intel with this. They admitted EPIC has no value.
        I can't. EPIC is not classical VLIW - it still has runtime scheduling overhead, whether or not they choose to do OoO with it. EPIC failed for a lot of reasons. With x86, you had multiple vendors (3, at the time), whereas if you went IA64, then you were tied to Intel and whatever they felt like charging you. Most of the installed base of software was x86, which meant that as long as faster & 64-bit x86 processors were in the pipeline, there wasn't a big incentive to switch.

        Like it or not, there are plenty of non-technical explanations for the demise of IA64. Intel learned too well the lesson that x86 is king, and that's how we ended up with failed products like Xeon Phi and their x86 cell phone SoCs.

        Comment


        • Originally posted by Khrundel View Post
          Thank you for the reminder. Another flaw in the VLIW design is that you can't rearrange your CPU microarchitecture without recompiling everything. You can't just create a more energy-efficient LITTLE core with one FP-ALU removed
          That's why IA64 used EPIC, not VLIW.

          For embedded applications, the software is already nearly always coupled to the hardware. So, this is no liability for them (e.g. DSPs).

          Comment


          • Originally posted by mSparks View Post
            But AIUI the "out of order" mechanism takes a list of sequential "in order" instructions and bundles them into groups that can run in parallel - "out of order".
            I really think it's best to understand out-of-order CPUs first, with a single instruction dispatch per cycle. Then, it's easy to see how you can use those same techniques with multiple-dispatch.

            Originally posted by mSparks View Post
            And the "whole point" of VLIW is the instructions are sent in batches to the CPU already "out of order" (the left to right in your earlier diagram).
            Yes, the compiler reorders them when the program code is transformed into an executable. No matter who does the reordering or when, there is a weakly-defined partial order based on data dependencies, and that is ultimately the order that cannot be violated without breaking program correctness.

            Originally posted by mSparks View Post
            VLIW just sends the above directly to the processor rather than relying on the hardware to identify that ADD a,b and CMP b,c must be done in order, and ADD a,b and ADD d,e can be done out of order.
            Yes, but it's more than that. VLIW doesn't just say what the dependencies are, it actually tells the CPU to execute the two adds in the first clock cycle, and the comparison in the second. In a way, it's over-specified. The CPU doesn't know why they were scheduled that way, and it lacks the analytic capability to figure it out. That's why the entire core has to freeze, when you try to use some data that hasn't finished reading in from memory. The benefit is that VLIW cores don't need to waste die space on detecting data dependencies or scheduling. Virtually all of the silicon is used for actual computation and memories.
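
            If it helps, here's a toy cycle count of that lock-step stall. The numbers are made up, and the OoO side deliberately ignores ROB capacity - which, to be fair, is the usual counter-argument:

            /* Toy cycle count (not a simulator): one load misses and takes
             * MISS_LATENCY extra cycles somewhere in a trace of WORDS words.
             * Classical VLIW: the whole core waits out the miss.
             * Idealised OoO: only the DEP_WORDS dependent words wait.
             * All numbers are illustrative only. */
            #include <stdio.h>

            #define WORDS        32
            #define MISS_LATENCY 20
            #define DEP_WORDS     3

            int main(void)
            {
                int vliw = WORDS + MISS_LATENCY;        /* every pipeline freezes     */

                int ooo = WORDS;                        /* independent work continues */
                if (MISS_LATENCY + DEP_WORDS > WORDS)   /* ...unless the dependent    */
                    ooo = MISS_LATENCY + DEP_WORDS;     /* chain is the bottleneck    */

                printf("classical VLIW: ~%d cycles\n", vliw);
                printf("idealised OoO : ~%d cycles\n", ooo);
                return 0;
            }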

            EPIC is the half-way solution that packages the instructions into blocks and then indicates that the second block depends on the first. That gives the CPU flexibility to reorder things, without the burden of having to re-derive those data dependencies each time it sees those instructions.
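
            And here's the contrast written out as data structures rather than real machine formats - the 2-slot word and the stop flag (loosely modelled on IA-64's stop bits) are illustrative assumptions:

            #include <stdio.h>

            typedef struct { const char *op; } Insn;

            /* Classical VLIW: the word fixes which slot and which cycle each op
             * runs in; an empty slot is an explicit NOP. */
            typedef struct { Insn slot[2]; } VliwWord;       /* 2 issue slots (assumed) */

            /* EPIC-ish: a flat stream, where 'stop' only marks the end of a group
             * of mutually independent instructions. */
            typedef struct { Insn insn; int stop; } EpicEntry;

            int main(void)
            {
                VliwWord vliw[] = {                          /* scheduled at compile time */
                    { { { "ADD a,b" }, { "ADD d,e" } } },    /* cycle 0: both adders busy */
                    { { { "CMP b,c" }, { "NOP"     } } },    /* cycle 1: one slot wasted  */
                };

                EpicEntry epic[] = {                         /* grouped, not slotted      */
                    { { "ADD a,b" }, 0 },
                    { { "ADD d,e" }, 1 },                    /* stop: next group depends  */
                    { { "CMP b,c" }, 1 },
                };

                for (size_t i = 0; i < sizeof vliw / sizeof vliw[0]; ++i)
                    printf("VLIW word %zu: %-8s | %-8s\n", i, vliw[i].slot[0].op, vliw[i].slot[1].op);
                for (size_t i = 0; i < sizeof epic / sizeof epic[0]; ++i)
                    printf("EPIC: %-8s%s\n", epic[i].insn.op, epic[i].stop ? "  ;; (stop)" : "");
                return 0;
            }

            The VLIW word fixes the slot and the cycle; the EPIC stream only marks where one independent group ends and the next begins, leaving the slot/cycle assignment to the hardware.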

            Comment


            • Originally posted by coder View Post
              If the CPU were in-order, then whatever comes after that memory read instruction would have to wait until the read completes, even if the next instruction didn't use the result of the read.
              To nitpick, in-order processors usually do have some limited OoO ability, like starting to execute the next instruction if it doesn't depend on the result of an in-flight instruction. That's how you can have superscalar in-order processors, and why optimization advice for in-order processors tends to include things like moving loads as early as possible and doing other work while waiting for the load to complete (easier to do on a load-store architecture with plenty of registers than on something like x86-32).
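
              A small C illustration of that advice - compilers usually do this reordering themselves, this just shows what "move loads early" looks like at source level:

              /* On an in-order core that doesn't block on a load until its result
               * is used, version B overlaps the (possibly missing) load of *p with
               * independent work, while version A uses the value immediately and
               * stalls. */

              int version_a(const int *p, int x, int y)
              {
                  int v = *p;              /* load...                                    */
                  int r = v * 3;           /* ...used on the very next instruction: stall */
                  int other = (x + y) * 7; /* independent work happens too late to help   */
                  return r + other;
              }

              int version_b(const int *p, int x, int y)
              {
                  int v = *p;              /* load issued early                           */
                  int other = (x + y) * 7; /* independent work overlaps the load latency  */
                  int r = v * 3;           /* by now the value has (hopefully) arrived    */
                  return r + other;
              }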

              Comment


              • Originally posted by coder View Post

                The benefit is that VLIW cores don't need to waste die space on detecting data dependencies or scheduling. Virtually all of the silicon is used for actual computation and memories.
                Which is pretty much what I meant by "isn't that the whole point of VLIW"

                The reorder buffer does in silicon what the VLIW compiler calculates before the instructions are even sent to the processor.

                Obviously not all ROB silicon or VLIW compilers are created equal, but they are doing pretty much the same thing: removing the requirement that every instruction runs strictly in sequence, one after the other.

                Comment


                • Originally posted by coder View Post
                  Not a perfect analogy, but I tried.
                  : )
                  I'm silently admiring
                  Originally posted by Khrundel View Post
                  In the same scenario, OoO will stall each "subtask" individually, and each only for the minimum required time.
                  Incorrect. You somehow grant OoO inherently perfect scheduling in the first place -- you wondered about "magic", so what's yours here, the decoder?

                  The hardware might simply be unable to find parallelism, due to ROB size limitations and hard real-time constraints. That's what OoO engineers seem to be fighting an uphill battle against, in case anyone missed everything up to the M1.

                  Originally posted by Khrundel View Post
                  Yes, R600 was based on VLIW.
                  Speaking of compilers: I talked with Vadim, who wrote the radeon_si optimizations in Mesa. He told me he was disappointed that AMD engineers couldn't do it decently in a year, so he did it in three months on a bet.

                  Originally posted by Khrundel View Post
                  There is no niche for them.
                  You're free to have your own opinion -- or, well, to share someone else's without reflecting upon it -- but please remember that categorical statements are absolutely wrong!

                  Originally posted by Khrundel View Post
                  Thank you for the reminder. Another flaw in the VLIW design is that you can't rearrange your CPU microarchitecture without recompiling everything.
                  Yet another Maslov-style FUD.

                  I'm writing this from an e2kv4 workstation that started out running v3 ALT back when that was all we had. The 7za b performance difference between v3-on-v4 and v4-on-v4 was about 1% (small but consistent). My ALT Rescue image for v3 runs on anything from v3 to v6, at least in its [only] text mode.

                  So the MCST folks actually can rearrange their CPU design to some extent, and they can certainly extend it -- which is the typical case for OoO CPUs too, something you didn't mention directly but seemed to imply.

                  IOW: can "you" rearrange -- as in mixing up, not just extending -- any ISA and not have to recompile everything? If "obviously not", then what was your point?
                  Originally posted by mSparks View Post
                  VLIW just sends the above directly to the processor rather than relying on the hardware to identify that ADD a,b and CMP b,c must be done in order, and ADD a,b and ADD d,e can be done out of order.
                  Erm, Elbrus specifically has a bunch of single-bit predicate registers and supports controlled speculative execution with "flagged" branching.

                  As I've mentioned already, you can have some fun with ce.mentality.rip's Compiler Explorer instance.
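
                  For anyone wondering what predicate registers buy you, the classic trick is if-conversion: a control dependency becomes a data dependency, so both sides can be scheduled into wide words without waiting on a branch. Plain C below, not Elbrus assembly - it only shows the shape of the transformation a predicating compiler can make:

                  /* Branchy form: the hardware must either predict or stall on the condition. */
                  int clamp_branchy(int x, int limit)
                  {
                      if (x > limit)
                          return limit;
                      return x;
                  }

                  /* If-converted form: compute both candidates, keep one based on a 1-bit
                   * predicate -- the software analogue of a single-bit predicate register. */
                  int clamp_predicated(int x, int limit)
                  {
                      int pred = (x > limit);               /* the "predicate bit"      */
                      return pred * limit + (1 - pred) * x; /* select without a branch  */
                  }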

                  Comment


                  • Originally posted by mshigorin View Post

                    Erm, Elbrus specifically has a bunch of single-bit predicate registers and supports controlled speculative execution with "flagged" branching.

                    As I've mentioned already, you can have some fun with ce.mentality.rip's Compiler Explorer instance.
                    I didn't actually get as far as processor specifics.

                    We had several pages of someone (not coder, IIRC) trying to convince me that VLIW can only run ops in order (specifically, that it cannot execute instructions out of order).

                    Comment


                    • Originally posted by coder View Post
                      There are ways to mitigate memory latencies. Already mentioned: prefetchers.
                      As we already discussed, that won't work.

                      Originally posted by coder View Post
                      Another option is to use SMT in a round-robin fashion to halve/quarter/etc. the software-visible latencies.
                      That is a 100% confession of VLIW's uselessness. If you have so many threads that you can hide latencies, why bother with VLIW in the first place? Just make many 2-way in-order superscalar cores instead of one 16-way VLIW, maybe with some shared units or real SMT on those cores.
                      Originally posted by coder View Post
                      Not true. With prefetchers or DMA engines, you only need caches or on-die memory that's maybe ~2x the size of your working set.
                      And what is the working set size? Less than half of the 96 MB of L3 on a Ryzen 5800X3D?
                      Originally posted by coder View Post
                      So, why are so many AI chips using VLIW cores?
                      Wait a minute. I wasn't aware we were discussing something so distant from a general-purpose CPU. I mean, big matrix multiplications are even more VLIW-friendly than the 4-component vector math in GPU shaders.
                      I think, as anywhere, it is more like "we are too small to put all our eggs in one basket, so we start with something more flexible and then replace the hot spots with an ASIC". Are Nvidia's tensor cores based on VLIW?
                      Originally posted by coder View Post
                      Why did Intel buy Nervana and Habana (both VLIW) when it had already owned Altera (FPGA) for many years?
                      I don't know why Intel buys any particular company with VLIW cores. I suppose it's not for the VLIW technology itself, because Intel already has EPIC, and it wouldn't be a problem to develop another one for an accelerator.
                      Originally posted by coder View Post
                      The fact that you're trying to wish out of existence is that VLIW is alive and well, in the bowels of most smartphone SoCs and AI accelerators. Here's a list of some modern micro-architectures that are VLIW-based:
                      You mean "contain some VLIW-based programmable unit with integrated memory"?

                      Originally posted by coder View Post
                      This tells me you don't know GPUs as well as you'd like us to believe. If you think their SIMD implementation is just like SSE or AVX on steroids,
                      No, it's not like SSE... I mean, yes, they are like SSE, but unlike usual CPUs, which are a hybrid of a scalar and a vector CPU, they are fully SIMD, and that lets them look, from the programmer's perspective, like a scalar CPU executing many threads. But R600 was also VLIW. Well, I suppose the idea wasn't to create VLIW because that kind of architecture is worth something in itself; it was more of an evolutionary approach. Before R600, the GPU had a vector unit for simple vector ops (real vectors: GPUs usually work with 4-component 3D vectors or 4-component colors) plus a scalar unit. The next evolutionary step was to make the vector unit more flexible, allowing a wider range of operations than the usual vector math. They just added the ability to control each component's ALU independently through part of the VLIW instruction, plus the existing scalar unit, and voila: a 5-way VLIW architecture. Nvidia, on the other hand, skipped that step and went straight to scalar cores. Scalar cores are more effective when you have a million "threads".

                      Comment
