OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs

  • Originally posted by coder View Post

    I don't have the experience or expertise to comment on it, but it does have its share of criticisms. I think the overhead is probably too high for it to make sense on any but a limited set of problems.
    Hardly anyone does; it's worth reading that paper to get some understanding of the principles, though.

    E.g. talking about "OoO" - with map reduce being Turing complete, it is technically possible to treat OoO as a map reduce problem: take any pure "in order" set of instructions and send them out to an arbitrary number of workers that complete the entire program in parallel.
    There's massive development overhead in doing something like that, but the perf gains to be had are astronomical.

    So I wouldn't say it's peaked or taken all the low-hanging fruit, and the paradigm will evolve to be unrecognisable from the original paper. But with CPU frequency improvements dead in the water, and manufacturing hitting the limits of physics, it has to be that kind of thinking that will yield any kind of perf improvements over what currently exists.
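
    As a purely illustrative toy (the trace format, the chain-splitting heuristic and the worker pool below are all made up for the sketch, not any real OoO design), the idea can be caricatured in a few lines of Python: split an "in order" trace into independent dependency chains, "map" each chain to a worker, then "reduce" the per-chain register values back into one state.

    # Toy sketch only: a made-up three-address trace is split into independent
    # dependency chains, each chain is "mapped" to a worker that executes it in
    # program order, and the per-chain results are "reduced" into one state.
    from concurrent.futures import ThreadPoolExecutor

    trace = [                      # (dest, op, src1, src2) -- hypothetical format
        ("r1", "+", 1, 2),         # chain A
        ("r2", "*", "r1", 3),      # chain A (depends on r1)
        ("r3", "+", 4, 5),         # chain B, independent of chain A
        ("r4", "*", "r3", 6),      # chain B (depends on r3)
    ]

    def split_into_chains(trace):
        """Group instructions into chains linked by register dependencies."""
        chains, owner = [], {}                 # owner: register -> chain index
        for dest, op, a, b in trace:
            deps = {owner[x] for x in (a, b) if x in owner}
            idx = deps.pop() if deps else len(chains)   # toy: at most one producer
            if idx == len(chains):
                chains.append([])
            chains[idx].append((dest, op, a, b))
            owner[dest] = idx
        return chains

    def run_chain(chain):
        """The 'map' step: execute one chain strictly in order."""
        regs = {}
        for dest, op, a, b in chain:
            x, y = regs.get(a, a), regs.get(b, b)
            regs[dest] = x + y if op == "+" else x * y
        return regs

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_chain, split_into_chains(trace)))

    state = {}                                 # the 'reduce' step: merge results
    for part in partials:
        state.update(part)
    print(state)                               # {'r1': 3, 'r2': 9, 'r3': 9, 'r4': 54}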

    Comment


    • Originally posted by coder View Post
      BTW, with all the recent security vulnerabilities linked to branch prediction and speculative execution, there has been some talk of running security-sensitive applications on in-order cores. And among in-order cores, VLIW is king. At least, if we're talking about general-purpose computation.
      Haven't all these vulnerabilities we've seen the past few years been due to speculation, and not OoO vs. in-order? Of course OoO enables much more aggressive speculation, but that shouldn't affect the basic issue.

      I think some VLIW-type ISAs have features enabling software speculation, like predicate bits, instead of HW branch prediction. But I'm slightly sceptical that the code bloat inherent in these kinds of approaches is worth it compared to fixing microarchitectural side-channel leaks. AFAIU many of the recent issues can largely be fixed in HW; it just takes a long time for such redesigns to percolate out to shipping products.
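
      For what it's worth, here is a rough sketch of what that kind of software speculation amounts to, written as plain Python rather than any real predicated ISA (the function names and the clamp example are invented): the branch is replaced by computing every arm and letting a predicate select the result, which is exactly where the code bloat and wasted work come from.

      # Branchy version: only one side runs, but control flow depends on the data,
      # so hardware would want to predict the branches.
      def clamp_branchy(x, lo, hi):
          if x < lo:
              return lo
          if x > hi:
              return hi
          return x

      # "If-converted" version in the spirit of predicate bits: every arm is
      # computed and predicates pick the answer. No branches to predict, but the
      # work of all paths is always done (the code-bloat / wasted-work trade-off).
      def clamp_predicated(x, lo, hi):
          p_lo = x < lo                        # predicate "bits"
          p_hi = x > hi
          return p_lo * lo + p_hi * hi + (not (p_lo or p_hi)) * x

      assert all(clamp_branchy(x, 0, 10) == clamp_predicated(x, 0, 10)
                 for x in (-3, 7, 15))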

      Comment


      • Originally posted by jabl View Post
        there's a wide variety of microarchitectural features you can employ before bringing in the heavyweight OoO machinery. Like branch prediction, caches, prefetching, superscalar.
        Branch prediction is mostly about speculative execution, which requires OoO. And nobody said we weren't using caches or hardware prefetchers. Superscalar in-order cores aren't really so different from, or better than, VLIW, except that they have more overhead to check which instructions can run in parallel.

        Comment


        • Originally posted by mSparks View Post
          E.g. talking about "OoO" - with map reduce being Turing complete, it is technically possible to treat OoO as a map reduce problem: take any pure "in order" set of instructions and send them out to an arbitrary number of workers that complete the entire program in parallel.
          There are lots of turing-complete systems which are impractical to use in a fully-general way. For instance, to do what you're saying would probably involve an impractical amount of data-movement and synchronization.

          Originally posted by mSparks View Post
          with CPU frequency improvements dead in the water, and manufacturing hitting the limits of physics, it has to be that kind of thinking that will yield any kind of perf improvements over what currently exists.
          We agree on that much: we're approaching a point where the paradigm of wide OoO micro-architectures is going to hit a wall and some re-thinking will be unavoidable.

          Comment


          • Originally posted by coder View Post
            There are lots of turing-complete systems which are impractical to use in a fully-general way. For instance, to do what you're saying would probably involve an impractical amount of data-movement and synchronization.
            There are good reasons the buzzword for all that is "Big Data"
            Even AAA games like MSFS are now (trying to) run against 2 petabytes of scenery data.

            Comment


            • Originally posted by coder View Post
              Branch prediction is mostly about speculative execution, which requires OoO.
              Er, no. Branch prediction doesn't require OoO. Even tiny microcontroller cores like the Cortex-M0 or the SiFive FE310 feature branch predictors.
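
              As a minimal illustration (a toy, not modelled on any particular core): a classic 2-bit saturating counter predictor is just a tiny state machine per branch, and nothing about it needs reordering or renaming.

              # Toy 2-bit saturating counter predictor, indexed by branch address.
              # Real predictors hash in history bits, use several tables, etc.,
              # but none of that machinery depends on out-of-order execution.
              counters = {}        # branch PC -> 0..3 (0/1 = not taken, 2/3 = taken)

              def predict(pc):
                  return counters.get(pc, 1) >= 2

              def update(pc, taken):
                  c = counters.get(pc, 1)
                  counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

              hits = 0
              for taken in [True] * 9 + [False]:   # a loop branch: taken 9x, then exits
                  hits += predict(0x40) == taken
                  update(0x40, taken)
              print(f"{hits}/10 correct")          # 8/10: misses the warm-up and the exit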

              Originally posted by coder View Post
              And nobody said we weren't using caches or hardware prefetchers. Superscalar in-order cores aren't really so different from, or better than, VLIW, except that they have more overhead to check which instructions can run in parallel.
              I think you need that kind of logic anyway, since you need to be able to block execution of instructions until their inputs are ready, which can take a variable number of cycles due to loads from cache vs memory etc.

              I think the main thing VLIW gives you is that each slot in the instruction word is tied to a particular pipeline in the processor, so you don't need the logic to route an instruction to the correct pipeline after decoding. But I don't think that is a big deal in the grand scheme of things.

              Comment


              • Originally posted by coder View Post
                We agree on that much: we're approaching a point where the paradigm of wide OoO micro-architectures is going to hit a wall and some re-thinking will be unavoidable.
                I agree with that, I just don't think VLIW is the solution, except maybe in a few very specific DSP-like workloads.

                In a way, I'm actually kind of pessimistic wrt some magical ISA and/or microarchitecture being able to extract more ILP. The way we currently write software in mainstream programming languages exposes only so much ILP, and a dramatic improvement here will, I think, require a dramatic change in how we approach programming.

                If you look at some of the motivations behind RISC-V, they have sort of come to the same conclusion: that we've reached the end of the road in ISA design for general-purpose code, and that further improvements will come from application-specific acceleration. So they created RISC-V as a basic shared platform on top of which various application-specific accelerators can be built. Like, say, crypto, or vectors, or whatnot.

                And in a way, you can see all this in the various AI HW startups that have popped up like mushrooms. They are not accelerating normal C code, but neural network graphs that expose massive parallelism. Some of these startups have even based their processors on RISC-V, just like the RISC-V founders envisioned.
                Last edited by jabl; 26 February 2022, 02:06 PM.

                Comment


                • Originally posted by mSparks View Post
                  There are good reasons the buzzword for all that is "Big Data"
                  Even AAA games like MSFS are now (trying to) run against 2 petabytes of scenery data.
                  I guess you're talking about Microsoft Flight Sim using real GIS terrain data? What does that have to do with Map/Reduce? If you have a link to some paper or such that explains it, please post it.

                  Comment


                  • Originally posted by jabl View Post
                    Er, no. Branch prediction doesn't require OoO. Even tiny microcontroller cores like the Cortex-M0 or the SiFive FE310 feature branch predictors.
                    I said "Branch prediction is mostly about speculative execution, which requires OoO." The only other thing you can do with it is prefetching the branch target, but you could as easily have a dumb branch target prefetcher which does that.

                    Originally posted by jabl View Post
                    I think you need that kind of logic anyway, since you need to be able to block execution of instructions until their inputs are ready,
                    You just need register scoreboarding to stall on incomplete reads.
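
                    A minimal sketch of that, as a toy model rather than any real core (the program and latencies are invented): the scoreboard is just a per-register "ready at cycle N" table, and in-order issue stalls whenever a source register is still pending.

                    # Toy in-order issue with register scoreboarding: each destination
                    # is marked busy until the cycle its result arrives, and an
                    # instruction stalls until all of its sources are ready.
                    ready_at = {}                  # register -> cycle its value is ready

                    def issue(program, latency):
                        cycle = 0
                        for dest, srcs in program:
                            cycle = max([cycle] + [ready_at.get(r, 0) for r in srcs])
                            ready_at[dest] = cycle + latency[dest]
                            print(f"cycle {cycle:2}: issue {dest} <- {srcs}")
                            cycle += 1             # strictly in-order issue

                    # r1 is a slow load; r2 needs r1 and stalls on the scoreboard;
                    # r3 is independent, yet still waits behind r2 because issue is
                    # in order (exactly what OoO issue would avoid).
                    issue(program=[("r1", []), ("r2", ["r1"]), ("r3", [])],
                          latency={"r1": 10, "r2": 1, "r3": 1})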

                    Originally posted by jabl View Post
                    I think the main thing VLIW gives you is that each slot in the instruction word is tied to a particular pipeline in the processor, so you don't need the logic to route an instruction to the correct pipeline after decoding. But I don't think that is a big deal in the grand scheme of things.
                    There are other hazards that VLIW handles at compile time, such as register bank conflicts when retiring instructions.

                    I think some "little" problems can become bigger, when you scale them up to many execution units. CPU designers want to keep critical path lengths short, which means you want to minimize feedback paths.

                    I've never designed a CPU, though I once wrote firmware for a CPU being designed by people sitting down the hall from me.

                    Comment


                    • Originally posted by coder View Post
                      I said "Branch prediction is mostly about speculative execution, which requires OoO." The only other thing you can do with it is prefetching the branch target, but you could as easily have a dumb branch target prefetcher which does that.
                      Yes, I lumped them together. But yes, in-order cores can be capable of speculative execution based on branch prediction. However, as they don't have the register renaming machinery that OoO cores have, they can't commit speculative work before the branch is resolved. In a core with a multi-stage pipeline they can still do all the other work except the final retire stage.
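
                      A rough sketch of that distinction (a toy, nothing like a real pipeline): speculative results go into a small holding buffer and are only committed, in order, once the branch resolves; a misprediction just throws the buffer away, so no rename registers are needed because nothing was ever committed.

                      # Toy model: results of instructions executed past a predicted
                      # branch sit in a buffer and only become architectural state
                      # once the branch outcome is known.
                      regs = {"r1": 5}
                      pending = []                 # speculative (reg, value), in order

                      def speculate(dest, value):
                          pending.append((dest, value))      # executed, not committed

                      def resolve_branch(prediction_correct):
                          global pending
                          if prediction_correct:
                              for dest, value in pending:    # commit in program order
                                  regs[dest] = value
                          pending = []                       # else: squash, nothing to undo

                      speculate("r2", regs["r1"] * 2)        # work on the predicted path
                      resolve_branch(prediction_correct=False)
                      print(regs)                            # {'r1': 5}: no trace left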

                      Comment
