Western Digital To Begin Shipping Devices Using RISC-V


  • #21
    Originally posted by willmore View Post

    I hope that was a typo and that you meant RISC-V. MIPS-V: https://en.wikipedia.org/wiki/MIPS_V#MIPS_V
    Easy to get confused. MIPS is owned by Imagination (an Apple supplier) now, and their last revision was the Warrior, used in LTE modems. It's licensed to just about anyone who makes silicon these days; MediaTek just did a license deal recently.

    Comment


    • #22
      Originally posted by jabl View Post
      4096(!!!) simple in-order cores with the vector extensions, some machine learning extensions, and multiple threads. If successful, this is a monster for HPC/ML-style workloads.
      That would be silly. Think of all the wasted resources decoding the same instructions and routing data movement. There are good reasons why GPUs use wide SIMD, and it's very good for neural networks. And does this thing contain 4112 MMUs? If it does, that's another huge waste. If it doesn't, then it won't be very useful for general-purpose workloads.

      Comment


      • #23
        Originally posted by wizard69 View Post
        I really don't see much of a future here for RISC-V. Maybe a decade ago it might have had a chance, but the industry will slowly take focus off the CPU complex to focus on hardware that offers a bigger gain. AI acceleration is one example where there are plenty of architectures to consider.

        It isn't an issue of RISC-V being good or bad; rather it is the issue of where a company focuses its engineering teams. ARM cores are pretty much a done deal these days, so do you waste time on another core when the differentiator will be in specialized hardware? I just don't see the motivation for most manufacturers.

        Really there are some key differences: the RISC-V RV64GC instruction set is surprisingly dense. It is denser than the x86 or ARM instruction sets, which helps with cache efficiency.

        Western Digital and others are looking at RISC-V because ARM cores are not a done deal. ARM instruction sets will only let you go so far.

        https://www.youtube.com/watch?v=JuJDPbzWpR0 About 5 minutes into this one they start listing faults that appear in other instruction sets like ARM and x86 when you get to doing out-of-order.

        RISC-V is young and is a dense ISA. You have to think we are heading towards the manufacturing limit of silicon, so density will become important. Being able to fit 18 percent more instructions in the CPU cache does make quite a difference to your cache miss rates. As RISC-V ages, the availability of generic chips will become more common.

        Really, RISC-V is going to find markets. The instruction set is way better than MIPS and well and truly competitive with x86 and ARM, including being better than both in many ways. Performance will come down to how good RISC-V CPU complexes end up being, as there is nothing in the instruction set that says x86 or ARM should be faster or more efficient.

        Also, one of the biggest makers of multi-core RISC chips, at 1024+ cores per chip for AI acceleration, is looking at RISC-V.

        Comment


        • #24
          Originally posted by coder View Post
          That would be silly. Think of all the wasted resources decoding the same instructions and routing data movement. There are good reasons why GPUs use wide SIMD, and it's very good for neural networks. And does this thing contain 4112 MMUs? If it does, that's another huge waste. If it doesn't, then it won't be very useful for general-purpose workloads.



          16 full cores and 4096 compact cores, connected by a network on chip.

          For general-purpose workloads, I am not exactly sure; thinking of the number of so-called general-purpose workloads that use OpenGL and the like, this massive chip should be more than able to do OpenGL and the like at a suitable speed without leaving the chip. Also, as long as the workload is not flooding the network on chip, it should not be a major problem. An 8-core x86 chip does not have one MMU per core either. There is still the question of what ratio is in fact required between MMUs, cache, and cores.



          Also, it will depend on whether these are Cray-style vector systems or not. If it is Cray-style, it will get very interesting: you can shove one lot of executable code out, with each CPU told to start at a particular vector index. One of the key points about vectors in an ISA, compared to SIMD, is that the body of a vector loop is self-contained, so it can be distributed between cores. The other interesting point about distributed vector processing is that it can happen transparently to the application code, done by the CPU complex itself.

          Vector instruction sets are lower complexity than SIMD instruction sets, so the decode engine is simpler, and a vector design is smaller than a SIMD design in silicon area to start off with.

          GPUs started using wide SIMD not because it was good, but because vector processing was patented when they started.

          https://www.youtube.com/watch?v=828oMNFGSjg You are also seeing 1024+ RISC-V cores in FPGA. So a 4096 + 16 core system is really turning what is being done in FPGA into an ASIC. It has turned out fairly well to do accelerators in RISC-V on an FPGA, and it should be even more powerful as an ASIC.

          If you are paying a license fee per core and you have 4096+ cores, even a small fee gets large.

          Comment


          • #25
            I think coder does have a point; there's something strange in those numbers. The RISC-V vector extension states that the minimum vector width is 4 elements, and if they're targeting HPC as well, which is what they're saying, they must support 64-bit floats, which means that the vector width is at least 256 bits. Even on 7 nm, I'm finding it difficult to see how they can fit 4096 256-bit vector units on a single chip. Oh, and each of those minion cores has multiple hardware threads, so each thread needs its own register file. With 32 vector registers, each 256 bits, that's 1 kB for the register file (+ a little more for the scalar registers, predicate registers etc., but let's forget about those for the moment). If we assume 4 hw threads per minion core, that's 4096*4*1 kB = 16 MB just for the register files!

            But it doesn't seem to be a typo either; on their own blog they mention the same number: https://www.esperanto.ai/single-post...ISC-V-standard



            Also it will depend on whether these are Cray-style vector systems or not.
            Hmm, I was just about to dismiss this idea out of hand as totally ridiculous, but maybe you're actually on to something. If we decouple the vector ISA width from the execution width (like in the old-school Cray pipelined vector processing), then 4096 cores starts to sound feasible!

            So Nvidia Volta is supposed to provide about 7 DP Tflop/s. As it will probably take at least a year for Esperanto to actually get some product out of the door, let's assume they're targeting 16 DP Tflop/s. Further, let's assume a target clock speed of 2 GHz. So 16e12/2e9 = 8000 flops/cycle. Since it has FMA, it needs 4000 DP floating point pipelines. Thus, each of the 4096 cores needs only one FP execution pipeline, and the vector ISA is used (like in the Cray vector computers) to amortize instruction overhead and to drive memory level parallelism.

            Comment


            • #26
              Originally posted by jabl View Post
              I think coder does have a point; There's something strange in those numbers. The RISC-V vector extension states that the minimum vector width is 4 elements, and if they're targeting HPC as well which is what they're saying, they must support 64-bit floats, which means that the vector width is at least 256 bits. Even on 7 nm, I'm finding it difficult to see how they can fit 4096 256-bit vector units on a single chip. Oh, and each of those minion cores has multiple hardware threads, so each thread needs its own register file. With 32 vector registers, each 256 bits, that's 1 kB for the register file (+ a little more for the scalar registers, predicate registers etc. but lets forget about those for the moment). If we assume 4 hw threads per minion core, that's 4096*4*1 kB = 16 MB just for the register files!

              It's really that you misread it. If this is done as per the RISC-V vector extension, we are looking at a modern-day Cray, from when they built their own chips.

              Since this is RV64I, the minion cores' vectors would be 64-bit, not 256-bit, for integer, and 128-bit, not 256-bit, for floating point.

              Next, the RISC-V vector extension does not have one vector table per hardware thread. You have one vector table per core, that is it, and when in vector mode that is all you are accessing, so the vector table can be the register table. So you do 256-bit by doing the correct combination of 64-bit instructions.
              Also, it does not state how many hardware threads are behind a vector request, or whether it in fact stays on the same core.

              Up to 32 vector data registers, v0-v31, of at least 4 elements each, with variable bits/element (8, 16, 32, 64, 128)
              This bit is highly deceptive.
              https://content.riscv.org/wp-content...-spec-v2.2.pdf page 93.
              17.1 Vector Unit State

              The additional vector unit architectural state consists of 32 vector data registers (v0-v31), 8 vector predicate registers (vp0-vp7), and an XLEN-bit WARL vector length CSR, vl.
              See the trap: 32/4 = 8. So 32 registers is the total after the times 4; you have 8 vector registers to play with in each vector loop. 32 vector registers is the state-store size. And depending on the core and the operation, you might be doing more than 4 threads per vector loop cycle. The number of hardware threads behind a vector loop will be whatever the hardware supports and whatever fits in the 32 vector registers of a single core.

              The largest option in RISC-V currently is 128-bit, but that is RV128, and most of the specs suggest RV64.

              So your maths is a little off. 256 bits × 32 registers / 8 = 1024 B = 1 KiB, and that is per core, so 4 MiB if it were 256-bit. But that is not what we would be looking at; instead, 64 bits × 32 registers / 8 = 0.25 KiB per core, with 4096 cores making 1 MiB. Once you add everything else, that could be 4 MiB in register files.

              When you have 4096 minions to do the processing, do you really need 256-bit support directly in hardware, or is 64-bit enough, with the minions spending extra processing time doing 256-bit using 64-bit instructions? RISC-V clocks 1.5 GHz on 45 nm, so over 2 GHz is possible at 7 nm. Going up to RV128 would most likely halve the number of cores. This becomes a question of how many threads can run at once vs the largest bit-width operation.

              Basically, x86 and ARM have gone the SIMD route to support wider and wider processing. RISC-V, at this stage, is going the route of just throwing more cores at the problem. It's kind of the cluster vs mainframe battle all over again.

              Comment


              • #27
                Originally posted by oiaohm View Post

                It's really that you misread it. If this is done as per the RISC-V vector extension, we are looking at a modern-day Cray, from when they built their own chips.

                Since this is RV64I, the minion cores' vectors would be 64-bit, not 256-bit, for integer, and 128-bit, not 256-bit, for floating point.
                RV64I refers to the properties of the scalar cpu, particularly that it has 64-bit integer registers. The connection between that and the vector unit is that the vector unit must support element sizes of at least MAX(XLEN, FLEN) (section 17.2). Thus, for RV64I the vector unit must support at least 64-bit elements. Further, it is stated quite clearly that the vector unit has 32 architectural registers v0-v31 (section 17.1), and further in section 17.3 "Implementations must provide an MVL of at least four elements for all supported configuration settings.". Thus we have 32 vector registers, each holding at least 4 64-bit (8 byte) elements. 32*4*8 = 1024B = 1kB.

                Next, the RISC-V vector extension does not have one vector table per hardware thread. You have one vector table per core, that is it, and when in vector mode that is all you are accessing, so the vector table can be the register table. So you do 256-bit by doing the correct combination of 64-bit instructions.
                Also, it does not state how many hardware threads are behind a vector request, or whether it in fact stays on the same core.
                Huh? How on earth could anything like this work? Normally at least, if a CPU has several hw threads, the point is to make it look to the software as if there were several CPUs, but in reality multiplex those threads over the same execution hardware. This means that all architecturally visible state (including the register file, obviously) must be replicated!

                Now, it's of course possible that this Esperanto chip doesn't do this, and their hw threads share the architectural state, and the compiler and/or programmer has to ensure they do not step on the toes of other threads (e.g. thread #0 is allocated vector registers v0-v7, thread #1 v8-v15, etc. if 4 hw threads/core is used), but IMNSHO this sounds *really* far-fetched.

                This bit is highly deceptive.
                https://content.riscv.org/wp-content...-spec-v2.2.pdf page 93.

                See the trap: 32/4 = 8. So 32 registers is the total after the times 4; you have 8 vector registers to play with in each vector loop. 32 vector registers is the state-store size. And depending on the core and the operation, you might be doing more than 4 threads per vector loop cycle. The number of hardware threads behind a vector loop will be whatever the hardware supports and whatever fits in the 32 vector registers of a single core.

                The largest option in RISC-V currently is 128-bit, but that is RV128, and most of the specs suggest RV64.

                So your maths is a little off. 256 bits × 32 registers / 8 = 1024 B = 1 KiB, and that is per core, so 4 MiB if it were 256-bit. But that is not what we would be looking at; instead, 64 bits × 32 registers / 8 = 0.25 KiB per core, with 4096 cores making 1 MiB. Once you add everything else, that could be 4 MiB in register files.
                I think you have to explain in a little bit more detail what you mean. Yes, in the RISC-V V extension you can configure your vector register file, and you can specify that you intend to use fewer than the 32 registers, in which case you might (depending on the physical configuration of the vector registers and lanes on the chip) get a bigger maximum vector length. E.g. if you have a 1 kB register file, you might configure it as 32 64-bit registers with a maximum vector length of 4, or as 16 64-bit registers with a maximum length of 8.

                When you have 4096 minions to do the processing, do you really need 256-bit support directly in hardware, or is 64-bit enough, with the minions spending extra processing time doing 256-bit using 64-bit instructions? RISC-V clocks 1.5 GHz on 45 nm, so over 2 GHz is possible at 7 nm. Going up to RV128 would most likely halve the number of cores. This becomes a question of how many threads can run at once vs the largest bit-width operation.
                I think the point must be (the second part of my previous post that you didn't quote) that they are doing it Cray-style, in that the architecturally visible vector length is different from the hw vector length. E.g. each minion core has only a single 64-bit FP unit, but still a vector width of 4 (or why not 8? or 16?). Which just means that the core will use 4 (8? 16?) cycles per vector arithmetic instruction. Similar to e.g. how AMD Zen handles 256-bit AVX with its 128-bit FP unit.

                Comment


                • #28
                  Is the RISC-V vector extension finalised? The last time I read about it, it took a different approach from the "Cray style".
                  Namely, the vector length is not fixed by the ISA, and is not transparent to software.
                  like:

                  int vec_len = read_some_cpu_register();
                  for (int i = 0; i < count / vec_len; i++)
                  {
                      vectype val = load_nextbytes(vec_len);
                      store_nextbytes(vec_len, val * 2);
                  }

                  This would mean the same code would run whether your CPU has a vector size of 1 or 64.
                  Of course, with C/C++ the vectype would then be a problem, and ideally you would require everything to stay in the vector registers.

                  BTW, who claims this thing would use 64-bit floats? That is not needed for AI.

                  The hard part will be keeping those CPUs fed with memory bandwidth; I guess coherent cache will be out of the question. Likely some IBM "Cell" model with a small local memory for a stack?

                  Comment


                  • #29
                    Originally posted by jabl View Post
                    RV64I refers to the properties of the scalar cpu, particularly that it has 64-bit integer registers. The connection between that and the vector unit is that the vector unit must support element sizes of at least MAX(XLEN, FLEN) (section 17.2). Thus, for RV64I the vector unit must support at least 64-bit elements. Further, it is stated quite clearly that the vector unit has 32 architectural registers v0-v31 (section 17.1), and further in section 17.3 "Implementations must provide an MVL of at least four elements for all supported configuration settings.". Thus we have 32 vector registers, each holding at least 4 64-bit (8 byte) elements. 32*4*8 = 1024B = 1kB.
                    The vector unit has only up to 32 registers; not every core has to have that many. Also, you are thinking in fixed values per core.

                    Originally posted by jabl View Post
                    Huh? How on earth could anything like this work? Normally at least, if a cpu has several hw threads the point is to make it look to the software as if there were several cpu's, but in reality multiplex those threads over the same execution hardware. This means that all architecturally visible state (including the register file, obviously) must be replicated!
                    I do understand the strangeness. It is one of the highly warped stunts you find in some of the multi-core, vector-supporting RISC-V FPGA designs: a register memory block shared between 4 to 8 cores. So you can in fact go the other way, where multiple cores appear as one core doing a heck of a load, because one core is able to use the shared register space and the FPU and other parts of the cores around it. So it is a question of how the minion cores are done. If, in this case, the register memory is shared by more than one core and the vector system exploits this, you are looking at a CPU with very different behaviour, where it can run as 8 cores and, when the load suits, run as one. Basically, think of it as one core using the other cores near it as slaves.

                    Since the vector unit is an extension done by different parties, there is a lot of creativity in which one is being used in this case.

                    On the normal, non-RISC-V CPUs you are used to, the concept of the register/vector storage area being shareable between cores is not on the table. So, depending on the RISC-V design, cores can be using space in the register file that other RISC-V cores near them are not using. Once you get past the instruction set, RISC-V provides an insane number of CPU design options.

                    Originally posted by jabl View Post
                    I think you have to explain in a little bit more detail what you mean. Yes, in RISC-V V extension you can configure your vector registers file, and you can specify that you intend to use fewer than the 32 registers in which case you might (depending on the physical configuration of the vector registers and lanes on the chip) get a bigger maximum vector length. E.g. if you have a 1kB register file, you might configure it as 32 64-bit registers with a maximum vector length of 4, or as 16 64-bit registers with a maximum length of 8.
                    The other thing: it's RISC-V, so you may not have 32 64-bit vector registers. You might have 16. This is the thing: it's a minion core, and some RISC-V minion cores don't have floating point at all.

                    Originally posted by jabl View Post
                    I think the point must be (the second part of my previous post that you didn't quote) that they are doing it Cray-style, in that the architecturally visible vector length is different from the hw vector length. E.g. each minion core has only a single 64-bit FP unit, but still a vector width of 4 (or why not 8? or 16?). Which just means that the core will use 4 (8?/16?) cycles per vector arithmetic instruction. Similar to e.g. how AMD Zen handles 256-bit AVX with it's 128-bit FP unit.
                    The big thing here is that a 64-bit RISC-V only has to have 64-bit integer and 128-bit FP at most. So doing 256-bit is pushed back to be coded by the programmer or compiler. There is a limit to how dynamic RISC-V lengths are. This is why we can kind of expect RISC-V to grow wider than 128-bit.

                    Minion cores in the RISC-V world are known as cut-back cores. So it is a question of what they have cut, and whether there is anything odd about their cores, like shared register memory.

                    "Up to" in the RISC-V standard documents has to be read as meaning the hardware may have less, particularly when the instruction set provides a way to query and get a failure if there is not enough. If a core has "minion" in the name, it is not a full core and has been cut back. A full RISC-V has 32 general integer registers; a cut-back RISC-V might have only 16 integer registers, and both are to specification. Thinking each minion has 32 vector registers is presuming a lot. There was no minimum stated, so 1 would pass but not be very useful.

                    It's the thing about RISC-V: an insane level of customisation.

                    Comment


                    • #30
                      Originally posted by discordian View Post
                      Is the RISC-V Vector extension finalised?
                      No. The current version seems to be 0.2. The plan is to have it ratified in 2018.

                      The last time I read about it, it took a different approach from the "Cray style".
                      Namely, the vector length is not fixed by the ISA, and is not transparent to software.
                      Yes, the non-fixed vector length is a difference from the old Cray vector ISA. A good introduction to the differences between "traditional vectors", SIMD, and GPUs is https://riscv.org/wp-content/uploads...p-june2015.pdf . That slide deck is a bit old, but contains a longer and better overview than the newer ones.

                      like:

                      int vec_len = read_some_cpu_register();
                      for (int i = 0; i < count / vec_len; i++)
                      {
                          vectype val = load_nextbytes(vec_len);
                          store_nextbytes(vec_len, val * 2);
                      }
                      Yeah. Or more precisely, something like


                      Code:
                      int count_remaining = count;
                      int vec_len = setvl(count_remaining);
                      while (count_remaining) {
                          vectype val = load_nextbytes(vec_len);
                          store_nextbytes(vec_len, val * 2);
                          count_remaining -= vec_len;
                          vec_len = setvl(count_remaining);
                      }
                      where the setvl instruction sets the vector length to MIN(hardware vector length, count_remaining). That allows the same vectorized main loop to handle the last few iterations, without having to have a separate non-vectorized "tail loop". The Crays had the same; the difference was that the hardware vector length was hardcoded in the ISA.

                      The hard part will be keeping those CPUs fed with memory bandwidth, I guess coherent cache will be out of question. Likely some IBM "Cell" modell with small local memory for a stack?
                      If my previous speculation is correct, and it has something like 16 MB register state, I guess that leaves caches out of the question as I guess there's not much point in having caches unless they are a lot larger than your register file, and there clearly isn't space on the die for hundreds of MB's of cache. I suppose it would make sense to have caches for the scalar stuff (e.g. loop indices, addresses etc.) whereas the vector data would bypass the caches and go straight to memory.

                      Comment
