Intel Begins Working On "Knights Mill" Support For LLVM/Clang


  • #11
    Originally posted by jabl View Post
    There are people who know a lot about microarchitecture and who hold the opinion that the correct approach for data-parallel problems is basically the "traditional vector" ISA, as invented by Seymour Cray back in the 1970s. Unfortunately the microprocessor industry largely ignored this, and so mainstream processors have suffered the scourge of packed-SIMD (itself a 1950s invention) extensions for the past several decades. For modern examples of "proper" vector ISAs, see ARM SVE or the RISC-V vector extension. Though AVX-512 is to some extent getting there too.

    GPUs are good for (some) data-parallel problems not because SIMT is the ultimate programming model but because they architecturally get a lot of other things right (lots of cores plus massive memory bandwidth).

    See https://riscv.org/wp-content/uploads...p-june2015.pdf for some arguments why a real vector ISA is better than either packed-SIMD or GPU-style SIMT programming models.
    "a real vector ISA is better" for SOME tasks, not all. IF your task is explicitly structured to the strict layout of a vector ISA, great. But more and more tasks are not like that. If you;re doing, for example, work that involves computation on huge graphs (so lots of unstructured pointer chasing) you have a real problem...
    The sort of laned machine I described seems more or less optimal for throughput computing. It does good enough on structured problems (like dense linear algebra) although not quite as good as a vector machine. But it is VASTLY more flexible as soon as you leave the world of structured data and algorithms.
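
    To make the contrast concrete, here's a rough C sketch (the names and the HW_MAX_VL value are mine, purely illustrative) of the strip-mined, vector-length-agnostic loop a traditional vector ISA encourages; packed SIMD instead bakes a fixed lane count into the instruction encoding:

    ```c
    #include <stddef.h>

    /* Illustrative only: HW_MAX_VL stands in for whatever maximum vector
     * length the hardware reports (e.g. via RISC-V's vsetvli). */
    enum { HW_MAX_VL = 64 };

    void saxpy(size_t n, float a, const float *x, float *y)
    {
        size_t i = 0;
        while (i < n) {
            /* ask "the hardware" how many elements this pass covers;
             * the same binary works whatever the register size is */
            size_t vl = (n - i < HW_MAX_VL) ? (n - i) : HW_MAX_VL;

            /* one vector instruction's worth of work: a real vector ISA
             * would do this inner loop in a handful of instructions,
             * regardless of vl */
            for (size_t j = 0; j < vl; j++)
                y[i + j] = a * x[i + j] + y[i + j];

            i += vl;
        }
    }
    ```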

    • #12
      Originally posted by name99 View Post

      Unlikely. Not even Intel thinks that these days.
      Next big machine is Sierra. POWER9 + nV, ~0.1 exaflops, maybe mid-2018.
      Last I read, Sierra is expected to clock in at about 125 PFlop/s. https://www.nextplatform.com/2017/10...supercomputer/

      Summit is supposedly quite similar to Sierra but somewhat bigger, around 200 PFlop/s, still far below an exaflop/s.

      After that we get Aurora, which is Intel, supposed to hit an exaflop, but which is "Future Intel Part, NOT Knights Hill" and which appears to be unlike the existing Phi family.

      https://www.nextplatform.com/2017/09...-u-s-released/
      This is ... interesting, but awfully speculative. I'd love to know more, but I guess we'll have to wait until more facts emerge.

      • #13
        Originally posted by name99 View Post

        "a real vector ISA is better" for SOME tasks, not all. IF your task is explicitly structured to the strict layout of a vector ISA, great. But more and more tasks are not like that. If you;re doing, for example, work that involves computation on huge graphs (so lots of unstructured pointer chasing) you have a real problem...
        The sort of laned machine I described seems more or less optimal for throughput computing. It does good enough on structured problems (like dense linear algebra) although not quite as good as a vector machine. But it is VASTLY more flexible as soon as you leave the world of structured data and algorithms.
        Sure, if your software is arranged as a bunch of nodes connected via pointers, that beefy vector unit is going to do diddly squat.
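
        To pick a toy example (mine, not from any real codebase): walking a linked structure is a serial chain of dependent loads, so no amount of lane width helps; the address of node k+1 isn't known until node k has been fetched.

        ```c
        #include <stddef.h>

        struct node {
            int          value;
            struct node *next;
        };

        /* Each load depends on the previous one: there is never a vector's
         * worth of independent addresses to issue, so SIMD lanes sit idle. */
        long sum_list(const struct node *n)
        {
            long sum = 0;
            for (; n != NULL; n = n->next)
                sum += n->value;
            return sum;
        }
        ```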

        I'm going to sound like a Cray fanboy for bringing them up again, but they built a family of machines a bit like what you suggest, specifically for graph-search-type problems (reportedly its development was financed by the NSA). Google "Threadstorm" or "Cray XMT" for more info. Basically it was a bunch of simple in-order scalar cores with very little cache, each core supporting a huge number of HW threads (IIRC 128). These threads were (again, IIRC) not exposed as OS threads; instead a C extension let the programmer fork off these lightweight HW threads (a bit like Cilk). Further, to avoid hotspots, memory addresses were hashed across the entire machine (at the page level, presumably). It turned out to be awesome for these graph-search problems and little else, and IIRC that product line has come to an end.
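
        I don't have the XMT's actual C extension to hand, so here's a loose analogue (all names mine) using OpenMP tasks; tasks on a stock CPU are far heavier than Threadstorm's hardware threads, but the shape of the code is similar: claim a vertex atomically, then fork a cheap thread per outgoing edge and let the thread pool hide the memory latency.

        ```c
        #include <omp.h>

        struct vertex {
            int             visited;  /* claimed with an atomic swap below */
            int             nedges;
            struct vertex **edges;
        };

        static void visit(struct vertex *v)
        {
            int already;
            /* atomically claim the vertex (the XMT used full/empty bits
             * on every memory word for this kind of synchronization) */
            #pragma omp atomic capture
            { already = v->visited; v->visited = 1; }
            if (already)
                return;

            for (int i = 0; i < v->nedges; i++) {
                struct vertex *w = v->edges[i];
                #pragma omp task firstprivate(w)  /* "fork" a cheap thread */
                visit(w);
            }
        }

        void search(struct vertex *root)
        {
            #pragma omp parallel
            #pragma omp single  /* one thread seeds the task pool */
            visit(root);
            /* the implicit barrier at the end of the parallel region
             * waits for all outstanding tasks */
        }
        ```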

        But more generally, it's all about tradeoffs. Sure, a MIMD architecture like you propose allows maximum flexibility, but you pay the cost of instruction fetch, decode, execution, etc. for every data element you work on. It's precisely the amortization of that overhead across many elements that makes vector and SIMT architectures so efficient, for the subset of problems they're appropriate for, of course.
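
        Back-of-envelope, with energy numbers I've simply assumed for illustration (they are not measurements): give the front end a fixed per-instruction cost and the per-element cost collapses as the lane count grows.

        ```c
        #include <stdio.h>

        int main(void)
        {
            /* ASSUMED figures, purely illustrative */
            const double overhead_pj = 50.0; /* fetch/decode/issue, per instruction */
            const double alu_pj      = 5.0;  /* arithmetic, per element */

            for (int lanes = 1; lanes <= 64; lanes *= 8) {
                double per_element = alu_pj + overhead_pj / lanes;
                printf("%2d lane(s): %5.2f pJ per element\n", lanes, per_element);
            }
            return 0;
        }
        ```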
