Western Digital To Begin Shipping Devices Using RISC-V


  • #41
    Originally posted by jabl View Post

    RISC-V has been designed specifically for enabling various extensions, by leaving free opcode space in the base ISA. The idea is to have a common base ISA, and hence share all the work for OS and toolchain support, and then add various application-specific acceleration extensions with very little effort compared to doing everything from scratch.
    Well that's the theory.
When you look at the sheer extent of the extension space (how many extensions were talked about at the meet-up last week? twenty? thirty? some insane number) it seems hard to believe that this is a viable ongoing path. Maybe OK if your goal is to stick with the world of very limited high-volume MCUs running bespoke code that's never updated, but it's unlikely to get you out of that ghetto even to the world of Raspberry Pis, let alone the promised land of phones, servers, and desktops.

So what happens next? I have no idea. It is clear that academia has embraced RISC-V for all the reasons you'd expect. It's equally clear that academia wants to generate an ongoing stream of new extensions, again for all the reasons you'd expect; and that the companies making these cores all want their special snowflake to be different from everyone else (and ideally leave things just painful enough to make it not worth doing). Not sure why anyone expected things to turn out differently.

    So do we now get something like the WiFi Alliance, an industry group that POLITICALLY does not set standards, oh no, that's for the RISC-V Foundation, and they take no stance on extensions; but that PRACTICALLY says "here's what WE support --- you get our blessing if you support it and good luck if you want to do anything else"?

    As for the "it's a better ISA", come on. Better based on what?
It's an *adequate*, easy-to-decode/implement ISA, like a hundred other RISC ISAs. It avoids the most unfortunate errors of many of the early RISC ISAs, but it shows no flashes of genius (something that I think is frequently evident in the AArch64 ISA). At every stage, faced with the decision of "what's easiest to implement" vs "what would be the most powerful" (in some sort of sense), RISC-V has gone for ease of implementation. Which is fine for MCUs, but doesn't mean that you have some sort of wonder-ISA, and means that you're starting from behind when it comes to trying to grow up into phones and beyond.
    (The sorts of things I mean are the AArch64 CSel options, the way ARM encodes immediates, load-store pairs, all that sort of stuff that's harder to implement but substantially amplifies the power of your instruction set once you've crossed the implementation hurdle.)
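To make that concrete, here's a minimal C sketch (my illustration, not from any spec; the function names are made up) of the kind of thing I mean: on AArch64 a compiler can lower the ternary below to a single CSEL and the two copies to an LDP/STP pair, whereas base RV64I has neither a conditional select nor load/store pairs and falls back to branches and separate loads/stores.

[CODE]
#include <stdint.h>

/* Hypothetical example. On AArch64 this ternary typically compiles to
 * cmp + csel (branchless); base RV64I has no conditional select, so the
 * same source becomes a compare-and-branch sequence. */
int64_t clamp_at_zero(int64_t x) {
    return (x < 0) ? 0 : x;
}

/* AArch64 can move both elements with one LDP and one STP;
 * base RV64I issues two separate loads and two separate stores. */
void copy_pair(int64_t *dst, const int64_t *src) {
    dst[0] = src[0];
    dst[1] = src[1];
}
[/CODE]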



    • #42
      Originally posted by oiaohm View Post

It's really that you misread it. If this is done as per the RISC-V Vector extension, we are looking at a modern-day Cray from back when they built their own chips.

Since this is RV64I, this means the minion cores' vectors would be 64-bit, not 256-bit, for integer and 128-bit, not 256-bit, for floating point.

Next, the RISC-V Vector extension does not have one vector table per hardware thread. You have one vector table per core, that is it, and when in vector mode that is all you are accessing, so the vector table can be the register table. So you do 256-bit by doing the correct combination of 64-bit instructions.
Also it does not state how many hardware threads sit behind a vector request, or whether it in fact stays on the same core.
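As a rough sketch of what "doing 256 with a combination of 64-bit instructions" looks like in practice (my illustration only, not anything from the spec), a 256-bit add can be built from four 64-bit limb adds with the carry propagated by hand:

[CODE]
#include <stdint.h>

/* Hypothetical sketch: a 256-bit integer held as four 64-bit limbs,
 * added with plain 64-bit instructions by propagating the carry manually. */
typedef struct { uint64_t limb[4]; } u256;   /* limb[0] is least significant */

u256 add256(u256 a, u256 b) {
    u256 r;
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a.limb[i] + b.limb[i];
        uint64_t c1 = s < a.limb[i];       /* carry out of the raw 64-bit add */
        r.limb[i] = s + carry;
        uint64_t c2 = r.limb[i] < s;       /* carry out of adding the carry-in */
        carry = c1 | c2;                   /* at most one of c1, c2 can be set */
    }
    return r;
}
[/CODE]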



      This bit is highly deceptive.
      https://content.riscv.org/wp-content...-spec-v2.2.pdf page 93.

See the trap: 32/4 = 8. So 32 registers is the total after the times 4. You have 8 vector registers to play with in each vector loop; 32 vector registers is the state store size. And depending on the core and the operation, you might be doing more than 4 threads per vector loop cycle. The number of hardware threads behind a vector loop will be whatever the hardware supports and whatever fits in the 32 vector registers of a single core.

The largest option in RISC-V currently is 128-bit. But that is RV128, and most of the specs are suggesting RV64.

So your maths is a little off. 256*32/8 bytes = 1 KiB, and that is per core, so 4 MiB if it were 256-bit. But this is not what we would be looking at; instead 64*32/8 bytes = 0.25 KiB per core, with 4096 cores giving 1 MiB. Once you add everything else, that could be 4 MiB in register files.
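For what it's worth, the arithmetic above checks out; a quick throwaway program (mine, purely to redo those numbers):

[CODE]
#include <stdio.h>

/* Back-of-envelope register-file sizes: bits per register * 32 registers,
 * converted to bytes, then scaled by the core count. */
int main(void) {
    const long regs = 32, cores = 4096;
    long bytes_256 = 256 * regs / 8;   /* 1024 B = 1 KiB per core    */
    long bytes_64  =  64 * regs / 8;   /*  256 B = 0.25 KiB per core */

    printf("256-bit: %ld B/core, %ld MiB over %ld cores\n",
           bytes_256, bytes_256 * cores / (1024 * 1024), cores);   /* 4 MiB */
    printf(" 64-bit: %ld B/core, %ld MiB over %ld cores\n",
           bytes_64,  bytes_64  * cores / (1024 * 1024), cores);   /* 1 MiB */
    return 0;
}
[/CODE]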

When you have 4096 minions to do the processing on, do you really need 256-bit support directly in hardware, or is 64-bit enough, with the minions spending extra processing time doing 256-bit using 64-bit instructions? RISC-V clocks 1.5 GHz on 45 nm, so over 2 GHz is possible at 7 nm. Going up to RV128 would most likely halve the number of cores. It comes down to a question of the number of threads that can run at once vs the largest bit-width operation.

Basically x86 and ARM have gone the SIMD route to support wider and wider processing. RISC-V is going the route, at this stage, of let's just keep on throwing more cores at the problem. It's kind of the cluster vs mainframe battle all over again.
      Just to remind you that putting ARM in the same basket as x86 here is misleading.
      ARM has SVE as its generic vector solution, and SVE has been around (ie standardized, work done on compilers) what, two years or so more than RISC-V's equivalent. I expect SVE hardware will also ship (on Fujitsu if nowhere else --- and who knows what Apple's SVE agenda is...) before RISC-V's equivalent.
      But yeah, Intel seems all-in on AVX-512, and that seems like a bad bet (though, to be honest, what I'd expect for Intel, a company that has shown extraordinary incompetence in ISA design no matter what the field, no matter what the starting point! Pretty much the only aspects of x86-64 that don't suck were where AMD had some limited flexibility to try to patch up the mess, before Intel got back in control.)



      • #43
        Originally posted by discordian View Post
If you would use this chip for graphics, true. But the needs of AI are quite different from what a GPU does (pooling in from a common resource like textures).
The competition is more like Google's TPU, which is more power-efficient than GPUs for AI workloads by some factor.

I agree that the scheduling of the I/O is the interesting part, but you are wrong if you assume that you would need cache hierarchies for that (it would make sense for code, however). What I/O network is used is only speculation so far.
You're spouting buzzwords but not thinking about the issues.
The ultimate performance (e.g. how many multiplies/sec can be done) depends on how many physical multiply units there are. Everything else around those units is wrapping, and you're assuming (with ZERO evidence, and in ignorance of the state of the art and the target domain) that the wrapping will take the form of a standard CPU (particular type of cache, coherence, lots of effort put into the load/store system, etc). MANY other possibilities exist.
GPUs use a possibility that assumes hierarchies of memory that are shared, and limited communication options between "computational lanes"; they use a PC that is shared across many lanes, a register pool that can be divided many ways, and a separate piece of hardware that schedules some huge number of virtual threads onto the computation lanes.

An alternative throughput engine could use almost all these ideas with only very minor modifications, and yet have something that is very much more appropriate for a wide range of tasks; for example, switching from a shared PC across warps to a single PC for each lane would have substantial performance implications.

        You're also assuming that the target market is AI, and once again that's not proven. In particular a feature of Esperanto is the 64-bit FP per minion, and of all the things that are not needed for most AI work, 64-bit FP is pretty high up... (The main reason TPU does so much better than GPUs is that it's doing mainly short integer multiplies, NOT even FP16, let alone FP64 multiplies).
        But there are many throughput tasks beyond just AI.
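To give a feel for why that matters (my own sketch, nothing to do with how Google actually wires the TPU): a short-integer dot product accumulates tiny 8-bit products into a 32-bit register, so each multiplier costs a small fraction of the area and energy of a full FP64 multiplier.

[CODE]
#include <stdint.h>
#include <stddef.h>

/* TPU-style workload: 8-bit operands, products accumulated in 32 bits. */
int32_t dot_int8(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* The same computation in FP64 needs a full double-precision multiplier
 * per lane, which most AI inference simply does not require. */
double dot_fp64(const double *a, const double *b, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
[/CODE]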

        Don't think of this as a scaled up Xeon; that's stupid.
Think instead of a large shared-nothing message-passing supercomputer --- things like BlueGene/P with thousands of cores in separate racks --- and so obviously only able to share memory via explicit and slow operations --- but now shrunk down to a chip. Insofar as code exists that runs usefully on something like BlueGene/P, that same sort of code gets to run usefully on this style of chip. (Obviously not as fast --- 4096 cores is not hundreds of thousands of more performant cores --- but then again, it's a computer that can be owned by the department, without having to beg for expensive time from the government.)
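If it helps, the programming model I have in mind is the familiar MPI one (the sketch below is mine; whether Esperanto exposes anything MPI-like is pure assumption): every core owns its own data and the only way to combine results is an explicit message.

[CODE]
#include <mpi.h>
#include <stdio.h>

/* Shared-nothing style: each rank works on private data and results are
 * combined only through explicit communication, never shared memory. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = rank;   /* stand-in for whatever this core computed */
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %ld\n", size, total);
    MPI_Finalize();
    return 0;
}
[/CODE]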



        • #44
          Originally posted by name99 View Post
When you look at the sheer extent of the extension space (how many extensions were talked about at the meet-up last week? twenty? thirty? some insane number) it seems hard to believe that this is a viable ongoing path. Maybe OK if your goal is to stick with the world of very limited high-volume MCUs running bespoke code that's never updated, but it's unlikely to get you out of that ghetto even to the world of Raspberry Pis, let alone the promised land of phones, servers, and desktops.

So what happens next? I have no idea. It is clear that academia has embraced RISC-V for all the reasons you'd expect. It's equally clear that academia wants to generate an ongoing stream of new extensions, again for all the reasons you'd expect; and that the companies making these cores all want their special snowflake to be different from everyone else (and ideally leave things just painful enough to make it not worth doing). Not sure why anyone expected things to turn out differently.
I was thinking it'd be interesting to specify some sort of namespace for extensions. You could have a portable EXE format, with extension instructions appropriately tagged with some unique namespace prefix. Then, the loader could translate all of the extensions to the opcodes used on the specific hardware on which it's being run. To preserve the speed & power efficiency of loading, these "localized" images could be cached on the device.

          That way, you'd be able to build a chip or soft core that supported any combination of extensions, without concern for collisions in opcode space.
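Something like this purely hypothetical sketch (every name, field, and opcode below is invented; nothing like it exists in the RISC-V specs): the portable image funnels all extension instructions through one reserved opcode plus a namespace field, and the loader rewrites them once into whatever opcodes the local hardware assigned, caching the result for later launches.

[CODE]
#include <stdint.h>
#include <stddef.h>

#define PORTABLE_EXT_OPCODE 0x0bu   /* hypothetical reserved major opcode */

typedef struct {
    uint32_t namespace_id;   /* globally registered ID for the extension    */
    uint32_t local_opcode;   /* major opcode used by the installed hardware */
} ext_mapping;

/* Rewrite one portable instruction word into its localized form. */
static uint32_t localize_insn(uint32_t insn,
                              const ext_mapping *map, size_t n) {
    if ((insn & 0x7fu) != PORTABLE_EXT_OPCODE)
        return insn;                       /* base-ISA instruction: keep as-is */
    uint32_t ns = (insn >> 25) & 0x7fu;    /* hypothetical namespace field */
    for (size_t i = 0; i < n; i++)
        if (map[i].namespace_id == ns)
            return (insn & ~0x7fu) | map[i].local_opcode;
    return insn;                           /* unknown extension: leave it to trap */
}

/* The loader runs this once over the text segment, then caches the
 * localized image so later launches skip the translation pass. */
void localize_image(uint32_t *text, size_t n_insns,
                    const ext_mapping *map, size_t n_map) {
    for (size_t i = 0; i < n_insns; i++)
        text[i] = localize_insn(text[i], map, n_map);
}
[/CODE]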



          • #45
            Originally posted by name99 View Post
            Think instead of a large shared-nothing message-passing supercomputer --- things like BlueGene/P with thousands of cores in separate racks --- and so obviously only able to share memory via explicit and slow operations --- but now shrunk down to a chip. Insofar as code exists that runs usefully on something like BlueGene/P, that same sort of code gets to run usefully on this style of chip.
            But it's not, because the amount of local memory for each of those cores would be minuscule compared to the message-passing supercomputers of yore.

            That aside, I'm amenable to the idea of shared-nothing (or shared-little) architectures. It's a much bigger lift for the software to exploit, but the potential for better scaling and energy efficiency vs. conventional cache hierarchy is too big to ignore.



            • #46
              Originally posted by jabl View Post
              So far we don't know the memory hierarchy of this Esperanto device. Could be that 4 cores share L1 cache, 4 of these "clusters" are grouped together and share L2 cache, a memory controller, and a connection to the on-chip network. So you have 4096/4/4 = 256 of these "super-clusters". Not that terribly different from a Nvidia SM?
Common designs on RISC-V are 1, 4, or 8 cores per complex going to the network on chip. This is why the vector is 4 wide: it matches a standard complex of 4 cores. The network on chip does not prevent more groups working with each other. The vector unit would suggest a 4-core complex in RISC-V.

http://www-personal.umich.edu/~rovin...17celerity.pdf This one here is at 16 nm and was an ASIC with 5 full cores plus 512 minion-style cores. 4096 is just a newer design. Some of the existing ones show quite high performance for very decent power efficiency.

Please note that on a RISC-V network-on-chip design, the network sits between the cache and the memory controller, so each complex does not have its own memory controller; all memory control is done over the network on chip. One of the differences of using RISC-V over ARM is the ability to nuke the memory controller and mess with the complex design. You have things like tagged memory in the memory controller, implementing something like the Linux kernel's RCU at the hardware level.



              • #47
                Originally posted by name99 View Post

You're spouting buzzwords but not thinking about the issues.
The ultimate performance (e.g. how many multiplies/sec can be done) depends on how many physical multiply units there are. Everything else around those units is wrapping, and you're assuming (with ZERO evidence, and in ignorance of the state of the art and the target domain) that the wrapping will take the form of a standard CPU (particular type of cache, coherence, lots of effort put into the load/store system, etc). MANY other possibilities exist.
GPUs use a possibility that assumes hierarchies of memory that are shared, and limited communication options between "computational lanes"; they use a PC that is shared across many lanes, a register pool that can be divided many ways, and a separate piece of hardware that schedules some huge number of virtual threads onto the computation lanes.

An alternative throughput engine could use almost all these ideas with only very minor modifications, and yet have something that is very much more appropriate for a wide range of tasks; for example, switching from a shared PC across warps to a single PC for each lane would have substantial performance implications.

                You're also assuming that the target market is AI, and once again that's not proven. In particular a feature of Esperanto is the 64-bit FP per minion, and of all the things that are not needed for most AI work, 64-bit FP is pretty high up... (The main reason TPU does so much better than GPUs is that it's doing mainly short integer multiplies, NOT even FP16, let alone FP64 multiplies).
                But there are many throughput tasks beyond just AI.

                Don't think of this as a scaled up Xeon; that's stupid.
Think instead of a large shared-nothing message-passing supercomputer --- things like BlueGene/P with thousands of cores in separate racks --- and so obviously only able to share memory via explicit and slow operations --- but now shrunk down to a chip. Insofar as code exists that runs usefully on something like BlueGene/P, that same sort of code gets to run usefully on this style of chip. (Obviously not as fast --- 4096 cores is not hundreds of thousands of more performant cores --- but then again, it's a computer that can be owned by the department, without having to beg for expensive time from the government.)
Learn to read: I compared this to IBM's Cell with its isolated pools of local storage; I explicitly said that coherent caches and a uniform pool of memory are unlikely. Maybe don't jump into a discussion and go for my throat without reading the prior posts? It makes you look rather stupid.

AI was explicitly brought up by Esperanto, so this is one area they obviously target. Also, I have doubts about the "64-bit FP per minion"; at least, I am not sure how this can be extrapolated from the announcement.



                • #48
                  Originally posted by discordian View Post
Learn to read: I compared this to IBM's Cell with its isolated pools of local storage; I explicitly said that coherent caches and a uniform pool of memory are unlikely. Maybe don't jump into a discussion and go for my throat without reading the prior posts? It makes you look rather stupid.

AI was explicitly brought up by Esperanto, so this is one area they obviously target. Also, I have doubts about the "64-bit FP per minion"; at least, I am not sure how this can be extrapolated from the announcement.
                  To be honest, I think I clicked on the wrong comment as the "Quoted" comment when I wrote that!
                  I had a lot to say in response to the previous comments and not much time, and that's what happens...
                  Remember it's usually more likely that someone made a mistake than that they're deliberately engaged in hostile acts against you :-)

