OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs


  • mSparks
    replied
    Originally posted by coder View Post
    There are lots of Turing-complete systems that are impractical to use in a fully general way. For instance, to do what you're saying would probably involve an impractical amount of data movement and synchronization.
    There are good reasons the buzzword for all of that is "Big Data".
    Even AAA games like MSFS are now (trying to) run against 2 petabytes of scenery data.

  • coder
    replied
    Originally posted by mSparks View Post
    E.g. talking about "OoO" - with map reduce being Turing complete, it is technically possible to treat OoO as a map-reduce problem and take any purely "in order" set of instructions, send them out to an arbitrary number of workers that can complete the entire program in parallel.
    There are lots of Turing-complete systems that are impractical to use in a fully general way. For instance, to do what you're saying would probably involve an impractical amount of data movement and synchronization.

    Originally posted by mSparks View Post
    with CPU frequency improvements dead in the water, and manufacturing hitting the limits of physics, it has to be that kind of thinking that will yield any kind of perf improvements over what currently exists.
    We agree on that much: we're approaching a point where the paradigm of wide OoO micro-architectures is going to hit a wall and some re-thinking will be unavoidable.

  • coder
    replied
    Originally posted by jabl View Post
    there's a wide variety of microarchitectural features you can employ before bringing in the heavyweight OoO machinery: branch prediction, caches, prefetching, superscalar issue.
    Branch prediction is mostly about speculative execution, which requires OoO. And nobody said we weren't using caches or hardware prefetchers. Superscalar in-order cores aren't really so different from, or better than, VLIW, except they have more overhead to check which instructions can run in parallel.
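
    For illustration, a rough Python sketch (hypothetical registers and instruction format, not any real core) of the issue-time check being described here: the hazard test an in-order superscalar core performs in hardware every cycle, which a VLIW design leaves to the compiler when it packs a bundle.

```python
# Sketch of the pairing check a 2-wide in-order superscalar core does at issue
# time. A VLIW compiler performs the equivalent check once, at compile time,
# and encodes the result into the bundle, so the core needs no issue logic.

from dataclasses import dataclass

@dataclass
class Instr:
    dst: str       # destination register
    srcs: tuple    # source registers

def can_dual_issue(first: Instr, second: Instr) -> bool:
    """True if the two instructions have no RAW/WAW hazard between them."""
    raw = first.dst in second.srcs   # second reads what first writes
    waw = first.dst == second.dst    # both write the same register
    return not (raw or waw)

print(can_dual_issue(Instr("r1", ("r2", "r3")), Instr("r4", ("r5", "r6"))))  # True  -> issue together
print(can_dual_issue(Instr("r1", ("r2", "r3")), Instr("r4", ("r1", "r6"))))  # False -> second op waits a cycle
```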

  • jabl
    replied
    Originally posted by coder View Post
    BTW, with all the recent security vulnerabilities linked to branch prediction and speculative execution, there has been some talk of running security-sensitive applications on in-order cores. And among in-order cores, VLIW is king. At least, if we're talking about general-purpose computation.
    Haven't all the vulnerabilities we've seen over the past few years been due to speculation, and not OoO vs. in-order? Of course OoO enables much more aggressive speculation, but that shouldn't affect the basic issue.

    I think some VLIW-type ISAs have features enabling software speculation, such as predicate bits, instead of hardware branch prediction. But I'm slightly sceptical that the code bloat inherent in these kinds of approaches is worth it compared to fixing microarchitectural side-channel leaks. AFAIU many of the recent issues can largely be fixed in HW; it just takes a long time for such redesigns to percolate out to shipping products.
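
    For illustration, a minimal branchless-select sketch in Python, in the spirit of the predicate bits mentioned above (if-conversion); it is not modelled on any specific VLIW ISA.

```python
# Sketch of if-conversion: both "arms" are computed and a predicate selects the
# result, so there is no branch for a hardware predictor to speculate on.
# (Illustrative only; real predicated ISAs apply the predicate per instruction.)

def branchy_abs(x: int) -> int:
    if x < 0:             # a branch the CPU would have to predict
        return -x
    return x

def predicated_abs(x: int) -> int:
    p = int(x < 0)        # predicate bit: 1 if x is negative, else 0
    return p * (-x) + (1 - p) * x   # select the result without a control-flow branch

assert all(branchy_abs(v) == predicated_abs(v) for v in (-3, 0, 7))
```

    Note that both arms get executed either way, which is exactly the extra-work/code-bloat cost being weighed against hardware speculation here.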

  • mSparks
    replied
    Originally posted by coder View Post

    I don't have the experience or expertise to comment on it, but it does have its share of criticisms. I think the overhead is probably too high for it to make sense on any but a limited set of problems.
    Hardly anyone does; it's worth reading that paper to get some understanding of the principles, though.

    E.g. talking about "OoO" - with map reduce being Turing complete, it is technically possible to treat OoO as a map-reduce problem and take any purely "in order" set of instructions, send them out to an arbitrary number of workers that can complete the entire program in parallel.
    Massive development overhead in doing something like that, but the perf gains to be had are astronomical.

    So I wouldn't say it's peaked; it's taken all the low-hanging fruit, and the paradigm will evolve to be unrecognisable from the original paper. But with CPU frequency improvements dead in the water, and manufacturing hitting the limits of physics, it has to be that kind of thinking that will yield any kind of perf improvements over what currently exists.
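
    For illustration only, a minimal plain-Python sketch of the fan-out-and-combine idea for the easy case of an associative reduction; a general in-order instruction stream with true data dependencies would not decompose this cleanly, which is the practicality objection raised in the replies.

```python
# Split an associative "program" (here: a sum) across an arbitrary number of
# workers, then combine the partial results. This is the map/reduce-style
# decomposition described above, in its simplest possible form.

from functools import reduce
from multiprocessing import Pool

def partial_sum(chunk):                  # "map": each worker runs independently
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]       # fan out to 4 workers
    with Pool(4) as pool:
        partials = pool.map(partial_sum, chunks)
    total = reduce(lambda a, b: a + b, partials)  # "reduce": combine partial results
    assert total == sum(data)
```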

  • jabl
    replied
    Originally posted by coder View Post
    Yes, if that's all you ever did. However, if you have a VLIW core, then you can handle a mix of low-ILP code and high-ILP code with predictable access patterns very well. I know the bulk of general computing workloads tends to fall somewhere in between, but I'm just pointing out that it's another niche where VLIW not only doesn't suffer, but might actually have a slight advantage.
    Now you're moving the goalposts. But anyway, if you want to cater to occasional high-ILP code, there's a wide variety of microarchitectural features you can employ before bringing in the heavyweight OoO machinery: branch prediction, caches, prefetching, superscalar issue. And it's not like this is some hypothetical case either; a 2-wide superscalar in-order core featuring branch prediction, caches, etc. is a fairly common low-power core design.

  • coder
    replied
    Originally posted by mSparks View Post
    It does, really well (and to CPUs); it's that paper that led to the explosion in CUDA, the OpenCL follow-up
    Not so sure it's actually responsible for GPGPU. I followed GPU compute in the early days (particularly from 2002-2009), and I didn't see much mention of Map/Reduce.

    Originally posted by mSparks View Post
    MR is a completely different way of thinking about coding
    I missed that you were talking about Map/Reduce. For all that, it seems to have quickly peaked and then the world moved on to other things -- some derivatives, some not.

    I don't have the experience or expertise to comment on it, but it does have its share of criticisms. I think the overhead is probably too high for it to make sense on any but a limited set of problems.

  • coder
    replied
    Originally posted by jabl View Post
    If you have negligible ILP then by definition most of the instruction slots in the VLIW instruction bundles will be NOPs, and you'd be better off with a simpler scalar in-order design.
    Yes, if that's all you ever did. However, if you have a VLIW core, then you can handle a mix of low-ILP code and high-ILP code with predictable access patterns very well. I know the bulk of general computing workloads tends to fall somewhere in between, but I'm just pointing out that it's another niche where VLIW not only doesn't suffer, but might actually have a slight advantage.

  • mSparks
    replied
    Originally posted by coder View Post
    This doesn't mean they can map to GPUs
    It does, really well (and to CPUs); it's that paper that led to the explosion in CUDA, the OpenCL follow-up, and devices like NV's new DGX100 (5 PFlops FP16...). MR is a completely different way of thinking about coding (there is virtually no order or branching at all, which makes your head hurt for those of us who grew up on ordered instructions and branching), but the throughput you get is simply breathtaking.
    Last edited by mSparks; 26 February 2022, 10:30 AM.
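
    For illustration, a tiny plain-Python stand-in for the map/reduce style being described: a per-record "map" step and an associative "reduce" step, with no explicit ordering between records (this is not CUDA or OpenCL code, just a sketch of the programming model).

```python
# Word count in the map/reduce style: the programmer writes only the map and
# the reduce; ordering and distribution are the framework's problem.

from functools import reduce

docs = ["map reduce maps", "reduce combines maps"]

# map: emit (word, 1) pairs independently for every record
pairs = [(word, 1) for doc in docs for word in doc.split()]

# reduce: merge counts; the result is the same whatever order the pairs arrive
# in, because addition is associative and commutative
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts)   # {'map': 1, 'reduce': 2, 'maps': 2, 'combines': 1}
```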

  • jabl
    replied
    Originally posted by coder View Post
    True. Decoding and scheduling instructions increases latency. And for sequential code, this can limit performance, which is an interesting case that Khrundel doesn't address at all. If you have code with negligible concurrency in the instruction stream (typically called ILP), the VLIW CPU will likely deliver higher performance per clock cycle.
    If you have negligible ILP then by definition most of the instruction slots in the VLIW instruction bundles will be NOPs, and you'd be better off with a simpler scalar in-order design.
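
    For illustration, a rough Python sketch (hypothetical 4-wide bundle format, not any real VLIW encoding) of the point above: packing a fully dependent instruction chain into bundles leaves most slots as NOPs.

```python
# Greedy packing of a dependent (low-ILP) chain into 4-wide VLIW bundles.
# Only the RAW hazard within the current bundle is checked, which is enough
# to show the effect: every op lands in its own bundle, the rest is NOPs.

BUNDLE_WIDTH = 4

# Each instruction: (destination, sources). Every op reads the previous result.
chain = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]

bundles = []
for dst, srcs in chain:
    last = bundles[-1] if bundles else None
    written = {op[0] for op in last if op != "NOP"} if last else set()
    if last and "NOP" in last and not (set(srcs) & written):
        last[last.index("NOP")] = (dst, srcs)   # free slot, no dependency: share the bundle
    else:
        bundles.append([(dst, srcs)] + ["NOP"] * (BUNDLE_WIDTH - 1))

for b in bundles:
    print(b)   # each bundle carries one real op and three NOPs
```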
