Western Digital To Begin Shipping Devices Using RISC-V


  • #31
    Originally posted by oiaohm View Post
    snip..
    I still think that having the vector register state shared between multiple threads would just be too difficult and confusing to handle in a sane way.

    Another explanation could be that the minion cores don't expose the full vector ISA, but rather that they sort-of form the vector lanes of the fat cores? So with 4096 minions and 16 fat cores, that would give something like a fat core with a vector unit having a vector length of 4096/16 = 256, with 64-bit DP 256*64 = 16384 bits. And then depending on the workload, you can run the chip as 16 cores each with a 16384 bit wide vector unit, or then for non-vectorizable workloads allow the minions to "run free" as scalar processors? That would at least drastically reduce the total size of the vector register file (16*32*16384/8 = 1 MB vs. 16 MB).
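    Sanity-checking that arithmetic (a sketch only; the 32 vector registers per core and the 4096/16 lane split are assumptions from the speculation above):

```python
# Speculative register-file sizing for the "minions as vector lanes" model.
# All figures are assumptions taken from the speculation above.
minions = 4096
fat_cores = 16
vector_regs = 32       # assumed architectural vector registers per fat core
lane_bits = 64         # double-precision lane width

lanes_per_fat_core = minions // fat_cores            # 256 lanes
vector_len_bits = lanes_per_fat_core * lane_bits     # 16384 bits per register

# Total vector register state if only the 16 fat cores hold it:
total_bytes = fat_cores * vector_regs * vector_len_bits // 8
print(total_bytes // 2**20, "MB")  # 1 MB, vs. the ~16 MB guessed earlier
```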

    What speaks against this explanation is that they explicitly say (https://www.esperanto.ai/single-post...ISC-V-standard) '4096 “ET-Minion” energy-efficient RISC-V cores each with vector floating point unit'.

    Comment


    • #32
      Originally posted by jabl View Post
      If my previous speculation is correct, and it has something like 16 MB register state, I guess that leaves caches out of the question as I guess there's not much point in having caches unless they are a lot larger than your register file, and there clearly isn't space on the die for hundreds of MB's of cache. I suppose it would make sense to have caches for the scalar stuff (e.g. loop indices, addresses etc.) whereas the vector data would bypass the caches and go straight to memory.
      Compilers still would need some stack area, even if you handcraft the inner loops, getting data in and out will need some framework around that code.
      Also if there ain't a cache for code then you will very badly stress the I/O subsystem, the more participants the more complicated the network between the processors will be.
      Without local storage you won't be able to deal with this amount of CPUs.

      Adding some KB (8-16) of static RAM per Core would likely be easier than broadening all branches of the IO Network. We are definitely looking at a big chip here.
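      For scale, a quick estimate of what that per-core SRAM sums to across the die (assuming the full 4096-core count from the announcement):

```python
# Total on-die scratchpad if each minion core gets a small local SRAM.
# 8-16 KB per core is the rough guess above; 4096 cores per the announcement.
cores = 4096
for kb_per_core in (8, 16):
    total_mb = cores * kb_per_core // 1024
    print(f"{kb_per_core} KB/core -> {total_mb} MB of SRAM total")
```

      Either way it's tens of MB of SRAM on top of everything else, which only reinforces the "big chip" point.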

      Comment


      • #33
        Originally posted by oiaohm View Post
        Western Digital and others are looking at RISC-V because arm cores are not a done deal. Arm instruction sets will only allow you to go so far.
        No. They're not close to exceeding the technical limits of ARM's ISA. This is all about royalties, plain and simple. When you're shipping 1-2 billion of something per year, those royalty costs really add up.

        Comment


        • #34
          Originally posted by oiaohm View Post
          The other interesting point about vector processing distribution is that it can happen transparent to the application code done by the CPU complex itself.
          You mean like GPUs?

          Originally posted by oiaohm View Post
          GPU started using wide SIMD not because it was good but because vector was patented when they started.
          Source?

          And GPUs aren't tied to anything, so the conditions driving earlier architectural decisions don't matter. That's one advantage they have - because their ISA and other architectural details are hidden, they can & do change from one generation to the next. AMD, Nvidia, and Intel have all converged on a combination of SIMD + SIMT. I'm sure it's no accident.

          Comment


          • #35
            Originally posted by jabl View Post
            So Nvidia Volta is supposed to provide about 7 DP Tflop/s. ...
            Nvidia has nothing to fear from these clowns. What Nvidia knows, AMD just learned, and these guys have yet to figure out: it's all about power-efficiency. And having 4096 independent cores can't touch the power efficiency of Nvidia's SMs.

            And that's before you even talk about how to schedule work on them and transfer data, which independent cores must do using software and inefficient cache hierarchies (respectively).

            Comment


            • #36
              Originally posted by discordian View Post
              Compilers still would need some stack area, even if you handcraft the inner loops, getting data in and out will need some framework around that code.
              Also if there ain't a cache for code then you will very badly stress the I/O subsystem, the more participants the more complicated the network between the processors will be.
              Without local storage you won't be able to deal with this amount of CPUs.

              Adding some KB (8-16) of static RAM per Core would likely be easier than broadening all branches of the IO Network. We are definitely looking at a big chip here.
              To clarify, what I meant that it certainly makes sense to have smallish caches for scalar stuff, just that vector load/store would bypass them. Or perhaps it would make sense to have some shared last-level cache that the vector load/stores don't bypass, so that one can communicate between cores without having to go via main memory.

              Comment


              • #37
                Originally posted by coder View Post
                Nvidia has nothing to fear from these clowns. What Nvidia knows, AMD just learned, and these guys have yet to figure out: it's all about power-efficiency.
                That's, uh, quite hyperbolic. The industry ran into the power wall ~15 years ago. Those who didn't realize it either went belly-up or had very deep pockets (say hello to Pentium 4!). Today, I'm quite sure everybody in the industry is acutely aware of the importance of power efficiency.

                Now, Esperanto is a startup, and like most startups in general it's more likely to fail than become spectacularly successful.

                And having 4096 independent cores can't touch the power efficiency of Nvidia's SMs.
                I think it's actually not that far from a GPU. So in Nvidia Volta you start with a "CUDA core", which is essentially an FP32 vector lane. Gang together 16 of them and you have a vector processor with a vector width of 16 (or a width of 8 if you're doing FP64). Nvidia calls this a "sub-core" with a "warp scheduler", but essentially it's a vector processor scheduling vector instructions. Group together 4 sub-cores, add some L1 cache and an interface to the on-chip network, and you have an SM. (In Volta there are additionally two "tensor cores" per sub-core, which are apparently systolic arrays for doing 4x4 FP16 matrix multiplies.)

                So on the Esperanto device there's 4096 cores, which appear to be more independent than a CUDA core. OTOH they are executing mostly vector instructions, like a Nvidia "sub-core", so they can amortize the overhead of execution over the vector width.

                So far we don't know the memory hierarchy of this Esperanto device. Could be that 4 cores share L1 cache, 4 of these "clusters" are grouped together and share L2 cache, a memory controller, and a connection to the on-chip network. So you have 4096/4/4 = 256 of these "super-clusters". Not that terribly different from a Nvidia SM?
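                Counting lanes and groupings makes the comparison concrete (the Volta decomposition follows the description above; the Esperanto clustering is, as noted, pure speculation):

```python
# Volta, per the decomposition above: 16 FP32 "CUDA cores" per sub-core,
# 4 sub-cores per SM.
lanes_per_subcore = 16
subcores_per_sm = 4
fp32_lanes_per_sm = lanes_per_subcore * subcores_per_sm    # 64

# Speculative Esperanto grouping: 4 cores share an L1, and 4 such
# clusters share an L2, a memory controller, and a network port.
minions = 4096
cores_per_cluster = 4
clusters_per_group = 4
cores_per_group = cores_per_cluster * clusters_per_group   # 16
super_clusters = minions // cores_per_group                # 256

print(fp32_lanes_per_sm, cores_per_group, super_clusters)
```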



                Comment


                • #38
                  Originally posted by coder View Post
                  Nvidia has nothing to fear from these clowns. What Nvidia knows, AMD just learned, and these guys have yet to figure out: it's all about power-efficiency. And having 4096 independent cores can't touch the power efficiency of Nvidia's SMs.

                  And that's before you even talk about how to schedule work on them and transfer data, which independent cores must do using software and inefficient cache hierarchies (respectively).
                  If you were to use this chip for graphics, true. But the needs of AI are quite different from what a GPU does (pulling in from a common resource like textures).
                  The competition is more like Google's TPU, which is more power-efficient than GPUs for AI workloads by a significant factor.

                  I agree that the scheduling of the I/O is the interesting part, but you are wrong if you assume that you would need cache hierarchies for that (it would make sense for code, however). Which I/O network is used is only speculation so far.

                  Comment


                  • #39
                    Originally posted by jabl View Post
                    That's, uh, quite hyperbolic. The industry ran into the power wall ~15 years ago. Those who didn't realize it either went belly-up or had very deep pockets (say hello to Pentium 4!). Today, I'm quite sure everybody in the industry is acutely aware of the importance of power efficiency.
                    Talking about GPUs, here. Comparing the most efficient of each vendor's latest generation, GTX 1080 delivers 45.7 GFLOPS/W, while Vega 56 manages only 39.5 (according to official specs). In real world conditions, the gap only widens.

                    It's funny that you mention Pentium 4, because that's exactly what AMD did to Vega. An AMD architect is quoted as saying most of the additional circuitry vs. Fury (which had the same # of "cores") went to things like extra pipeline stages & buffering needed to boost its clock speed. Doh! AMD forgot that GPUs are supposed to run wide & slow. Had they spent the gates on more cores and just focused on improving power efficiency, I'd bet Vega 64 would be close to 1080 Ti-level performance.

                    Comment


                    • #40
                      Originally posted by discordian View Post
                      If you were to use this chip for graphics, true. But the needs of AI are quite different from what a GPU does (pulling in from a common resource like textures).
                      Gosh, and just what do you think neural network weights are? The biggest difference is that they're accessed in a more coherent fashion than the sort of random texture reads GPUs can handle.

                      Comment
