OpenBLAS 0.3.20 Adds Support For Russia's Elbrus E2000, Arm Neoverse N2/V1 CPUs


  • Originally posted by Khrundel View Post
    Compiler generates code for several independent "threads", like it does for VLIW, except it has more freedom. I think I must clarify myself: "thread" is in quotes because it isn't a real thread; it is a sequence of instructions depending on one another but independent of the other sequences.
    I got your meaning the first time, but I think it'd be much clearer if you'd refer to them as dependent-instruction chains; the term "thread" is already overloaded enough. (There's a small sketch of what I mean at the end of this post.)

    Originally posted by Khrundel View Post
    When, in the late '90s, nobody knew how to spend the transistor budget, VLIW looked like something sane.
    No, they just looked at how much overhead OoO was adding, and observed its poor scaling properties. It's the same reason GPUs offer so much more perf/W and perf/area -- because they are just a sea of in-order cores, with no OoO overhead.

    The only problem with GPUs is that they require more coarse-grained concurrency within the code. When we look at big GPUs, we're talking about literally tens of thousands of "threads" (using Nvidia's definition) or Work Items simultaneously in flight. That's fine for graphics, but not everything maps to that processing model.

    Originally posted by Khrundel View Post
    I think I've already heard this somewhere... Ah, of course, that was the RISC mantra. And now they have decoders too.
    But with much less complexity and fewer constraints. mshigorin has a point that compilers should ideally cater to the decoder of the CPU they're targeting, because x86 decoders can typically decode only one "complex" instruction per cycle, and the definition of "complex" probably changes across CPU generations. That adds a scheduling constraint on the x86 compiler (if it's to keep the decoder running at max throughput).
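
    Here's a toy C++ sketch of what I mean by dependent-instruction chains (purely illustrative; the variables and operations are made up): two chains that are each serial internally but independent of one another, so a VLIW compiler (or an OoO scheduler) is free to issue their instructions side by side.

    ```cpp
    #include <cstdio>

    // Two dependent-instruction chains: each statement depends on the previous
    // one in its own chain, but chain A never reads chain B's values (and vice
    // versa), so the pairs A1/B1, A2/B2, A3/B3 can be issued in the same cycle.
    int main() {
        double a = 1.0, b = 2.0;

        a = a * 1.5;   // A1
        a = a + 3.0;   // A2: needs A1's result
        a = a * a;     // A3: needs A2's result

        b = b - 0.5;   // B1
        b = b * 4.0;   // B2: needs B1's result
        b = b + b;     // B3: needs B2's result

        std::printf("%f %f\n", a, b);
        return 0;
    }
    ```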



    • Originally posted by mSparks View Post
      OTOH, doing it in silicon must add a comparatively large amount of latency between starting to send instructions to the chip, and getting results out of the chip,
      True. Decoding and scheduling instructions increases latency, and for sequential code this can limit performance, which is an interesting case that Khrundel doesn't address at all. If you have code with negligible concurrency in the instruction stream (typically called ILP), a VLIW CPU will likely deliver higher performance per clock cycle.
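
      To make "negligible ILP" concrete, here's a toy C++ example (mine, purely for illustration): a loop-carried recurrence where every iteration needs the previous result, so the whole loop is effectively one long chain of dependent instructions with nothing to run in parallel.

      ```cpp
      #include <cstdio>
      #include <vector>

      // A loop-carried dependency: iteration i reads the value produced by
      // iteration i-1, so there is almost no instruction-level parallelism.
      double serial_recurrence(const std::vector<double>& x) {
          double s = 0.0;
          for (double v : x)
              s = s * 0.99 + v;   // the next iteration can't start until this finishes
          return s;
      }

      int main() {
          std::vector<double> x(1000, 1.0);
          std::printf("%f\n", serial_recurrence(x));
          return 0;
      }
      ```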



      • Originally posted by coder View Post
        That's fine for graphics, but not everything maps to that processing model.
        Slight sidetrack: Jeffrey Dean and Sanjay Ghemawat's 2004 MapReduce (MR) paper "blew the bloody doors off" that theory, showing that pretty much everything people thought couldn't be mapped to "tens of thousands of 'threads' (using Nvidia's definition) or Work Items simultaneously in flight" actually can be. I think it's even been proved Turing-complete by now.



        • Originally posted by mSparks View Post
          Slight sidetrack: Jeffrey Dean and Sanjay Ghemawat's 2004 MapReduce (MR) paper "blew the bloody doors off" that theory, showing that pretty much everything people thought couldn't be mapped to "tens of thousands of 'threads' (using Nvidia's definition) or Work Items simultaneously in flight" actually can be. I think it's even been proved Turing-complete by now.
          This doesn't mean they can map to GPUs, however, because those "threads" are batched into groups of 32 or 64 that have to run almost in lock-step (i.e. taking the same branches, calling the same functions, etc.). You can use predication to give them different effective branching behavior, but if any one of them takes a branch, they all suffer the performance hit of going down that path.

          As for the broader point, I'd just mention that Intel bought an Indian company called Soft Machines, which pioneered what they call the VISC architecture: extracting classical CPU threads from dependent-instruction chains like the ones Khrundel mentioned.
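
          To illustrate the lock-step point, here's a CPU-side toy in C++ (my own sketch, not real CUDA or OpenCL): a 32-wide group computes a per-lane predicate, then walks through both sides of the branch, with each side masked to the lanes that actually wanted it.

          ```cpp
          #include <array>
          #include <cstdio>

          constexpr int kWidth = 32;  // lanes executing in lock-step (a "warp")

          // Simulated predicated execution: the group runs *both* sides of the
          // branch; each lane keeps or ignores the result based on its predicate.
          void branchy_kernel(std::array<int, kWidth>& data) {
              std::array<bool, kWidth> pred{};
              for (int lane = 0; lane < kWidth; ++lane)
                  pred[lane] = (data[lane] % 2 == 0);   // per-lane branch condition

              // "Then" side: executed for the whole group, masked to even lanes.
              for (int lane = 0; lane < kWidth; ++lane)
                  if (pred[lane]) data[lane] *= 10;

              // "Else" side: also executed for the whole group, masked to odd lanes.
              for (int lane = 0; lane < kWidth; ++lane)
                  if (!pred[lane]) data[lane] -= 1;
              // If even one lane takes a side, the group pays for that side's code.
          }

          int main() {
              std::array<int, kWidth> d{};
              for (int i = 0; i < kWidth; ++i) d[i] = i;
              branchy_kernel(d);
              std::printf("%d %d\n", d[0], d[1]);
              return 0;
          }
          ```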




          • BTW, with all the recent security vulnerabilities linked to branch prediction and speculative execution, there has been some talk of running security-sensitive applications on in-order cores. And among in-order cores, VLIW is king. At least, if we're talking about general-purpose computation.



            • Originally posted by coder View Post
              True. Decoding and scheduling instructions increases latency, and for sequential code this can limit performance, which is an interesting case that Khrundel doesn't address at all. If you have code with negligible concurrency in the instruction stream (typically called ILP), a VLIW CPU will likely deliver higher performance per clock cycle.
              If you have negligible ILP, then by definition most of the instruction slots in the VLIW instruction bundles will be NOPs, and you'd be better off with a simpler scalar in-order design.
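
              To picture that, here's the loop-carried recurrence from earlier in the thread with the kind of schedule I mean spelled out as comments in C++ (the 4-slot bundle layout and latencies are invented for illustration; real Elbrus or Itanium encodings differ).

              ```cpp
              // Hypothetical 4-wide VLIW schedule for the loop body  s = s * 0.99 + x[i]:
              //
              //   cycle 0:  [ fmul s, s, 0.99 ] [ nop ] [ nop ] [ nop ]
              //   cycle 1:  [ fadd s, s, x[i] ] [ nop ] [ nop ] [ nop ]
              //   ...
              //
              // Three out of four slots carry NOPs every cycle, so the wide machine
              // buys nothing here; a narrow scalar in-order core does the same work.
              double serial_recurrence_again(const double* x, int n) {
                  double s = 0.0;
                  for (int i = 0; i < n; ++i)
                      s = s * 0.99 + x[i];
                  return s;
              }
              ```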



              • Originally posted by coder View Post
                This doesn't mean they can map to GPUs
                It does, really well (and to CPUs); it's that paper that led to the explosion in CUDA, the OpenCL follow-up, and devices like Nvidia's new DGX100 (5 PFLOPS FP16...). MR is a completely different way of thinking about coding (there is virtually no ordering or branching at all, which, for those of us who grew up on ordered instructions and branching, makes your head hurt), but the throughput you get is simply breathtaking.
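
                As a taste of the style, here's a toy C++17 sketch (mine, not from the paper): a per-element "map" plus an associative "reduce", with no ordering or cross-element branching for the programmer to manage, which is what lets a runtime spread the work over as many lanes or threads as it has. (std::execution::par_unseq may need -ltbb when building with GCC's libstdc++.)

                ```cpp
                #include <cstdio>
                #include <execution>
                #include <numeric>
                #include <vector>

                int main() {
                    std::vector<double> samples(1'000'000, 0.5);

                    // "Map" each element independently, then "reduce" with an associative
                    // operation; the runtime may process elements in any order, on any
                    // number of threads or vector lanes.
                    double sum_of_squares = std::transform_reduce(
                        std::execution::par_unseq,
                        samples.begin(), samples.end(),
                        0.0,
                        [](double a, double b) { return a + b; },   // reduce
                        [](double v) { return v * v; });            // map

                    std::printf("%f\n", sum_of_squares);
                    return 0;
                }
                ```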



                • Originally posted by jabl View Post
                  If you have negligible ILP, then by definition most of the instruction slots in the VLIW instruction bundles will be NOPs, and you'd be better off with a simpler scalar in-order design.
                  Yes, if that's all you ever did. However, if you have a VLIW core, then you can handle a mix of low-ILP code and high-ILP code with predictable access patterns very well. I know the bulk of general computing workloads tend to fall somewhere in between, but I'm just pointing out that it's another niche where VLIW not only doesn't suffer, but might actually have a slight advantage.
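
                  For contrast with the NOP-heavy recurrence above, here's a toy C++ reduction (again just my illustration) where four independent accumulators and a linear, prefetch-friendly access pattern give a VLIW compiler plenty of independent work to pack into each bundle.

                  ```cpp
                  #include <cstdio>
                  #include <vector>

                  // Four independent accumulators break the single loop-carried chain
                  // into four shorter ones; with a predictable linear access pattern,
                  // a VLIW compiler can software-pipeline this and keep its slots busy.
                  double unrolled_sum(const std::vector<double>& x) {
                      double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
                      std::size_t i = 0, n = x.size();
                      for (; i + 4 <= n; i += 4) {
                          s0 += x[i + 0];
                          s1 += x[i + 1];
                          s2 += x[i + 2];
                          s3 += x[i + 3];
                      }
                      for (; i < n; ++i) s0 += x[i];   // leftover elements
                      return (s0 + s1) + (s2 + s3);
                  }

                  int main() {
                      std::vector<double> x(1003, 1.0);
                      std::printf("%f\n", unrolled_sum(x));
                      return 0;
                  }
                  ```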



                  • Originally posted by mSparks View Post
                    It does, really well (and to CPUs); it's that paper that led to the explosion in CUDA, the OpenCL follow-up
                    Not so sure it's actually responsible for GPGPU. I followed GPU compute in the early days (particularly from 2002 - 2009), and I didn't see much mention of Map/Reduce.

                    Originally posted by mSparks View Post
                    MR is a completely different way of thinking about coding
                    I missed that you were talking about Map/Reduce. For all that, it seems to have quickly peaked and then the world moved on to other things -- some derivatives, some not.

                    I don't have the experience or expertise to comment on it, but it does have its share of criticisms. I think the overhead is probably too high for it to make sense on any but a limited set of problems.



                    • Originally posted by coder View Post
                      Yes, if that's all you ever did. However, if you have a VLIW core, then you can handle a mix of low-ILP code and high-ILP code with predictable access patterns very well. I know the bulk of general computing workloads tend to fall somewhere in between, but I'm just pointing out that it's another niche where VLIW not only doesn't suffer, but might actually have a slight advantage.
                      Now you're moving the goalposts. But anyway, if you want to cater to occasional high-ILP code, there's a wide variety of microarchitectural features you can employ before bringing in the heavyweight OoO machinery: branch prediction, caches, prefetching, superscalar execution. And it's not like this is some hypothetical case, either; a 2-wide superscalar in-order core featuring branch prediction, caches, etc. is a fairly common low-power core design.

