Intel Begins Teasing Their Discrete Graphics Card


  • Originally posted by coder View Post
    Since you're so interested in power efficiency, you should look up some power efficiency scaling analysis for Skylake-SP. As you scale up core counts, communication consumes more and more of your power budget. And all the queues and buffers you need to hide the latency burn power and die space, as well.
    The same happens with GPUs. Scaling up core counts consumes more and more of your power budget.

    Comment


    • Originally posted by juanrga View Post
      Cascade Lake solves one of the main problems of the Xeon Phi,
      Which problem is that?

      Originally posted by juanrga View Post
      problems have little to do with "cache coherency" and x86 "memory consistency".
      Those are intrinsic to x86, and surely among the reasons Intel is no longer trying to compete against GPUs with it. Instead, they've opted to compete against GPUs with one of their own.

      Comment


      • Originally posted by juanrga View Post
        The same happens with GPUs. Scaling up core counts consumes more and more of your power budget.
        Yes, but GPUs have orders of magnitude fewer cores. What Nvidia calls "cores" aren't. Their SMs are the equivalent of a CPU core. Why does it matter? Because you can batch an entire 32-lane SIMD's worth of memory accesses, thus vastly improving coherency over what you'd have if a comparable number of MIMD cores were all running. The other efficiency you get - and this is big - is that you're only fetching and decoding one instruction stream for all of them.

        For problems that map well to SIMD, I think you can't beat GPUs. Sure, there are going to be cases where 4096 MIMD cores outperform comparably sized GPUs, but those are probably going to be a minority of cases in HPC.

        Anyway, we'll see. I think none of us are privy to enough information to conclusively say otherwise. Personally, I'm ready to be surprised by a large-scale MIMD, like PEZY-SC2. The ultimate test will be whether they sell (in this case, outside of Japan).
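
        A rough way to see the fetch/decode point is to count instruction issues and memory transactions for the same element-wise workload under GPU-style SIMD versus per-element MIMD execution. This is only a back-of-the-envelope sketch in Python; the widths and counts are illustrative assumptions, not measured hardware figures.

# Back-of-the-envelope comparison of instruction issues and memory transactions
# for the same element-wise kernel under GPU-style SIMD vs. per-element MIMD.
# All parameters are illustrative assumptions, not measured hardware figures.

ELEMENTS = 1_000_000      # work items to process
OPS_PER_ELEMENT = 10      # arithmetic instructions needed per element
SIMD_WIDTH = 32           # one warp-style instruction drives 32 lanes
LINE_ELEMS = 16           # elements per memory transaction when accesses coalesce

# SIMD: one fetched/decoded instruction covers SIMD_WIDTH lanes, and
# contiguous lane accesses coalesce into one transaction per cache line.
simd_issues = ELEMENTS * OPS_PER_ELEMENT // SIMD_WIDTH
simd_mem_txn = ELEMENTS // LINE_ELEMS

# MIMD: every core fetches and decodes its own copy of every instruction,
# and uncoordinated accesses can each become their own transaction.
mimd_issues = ELEMENTS * OPS_PER_ELEMENT
mimd_mem_txn = ELEMENTS

print(f"instruction issues:  SIMD {simd_issues:>10,}  MIMD {mimd_issues:>10,}")
print(f"memory transactions: SIMD {simd_mem_txn:>10,}  MIMD {mimd_mem_txn:>10,} (worst case)")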

        Comment


        • Originally posted by coder View Post
          You said they were funding it with "billion dollar budgets". I didn't imagine you were talking about this, because I can assure you they didn't put billions into Esperanto.
          The problem is you don't need to. Esperanto is able to use the Chisel RISC-V templates. Western Digital can and does give multi-million-dollar grants. Each of these grants goes to a company working on a particular area of silicon. Esperanto's job is to work out the interconnects between a massive number of RISC-V cores.

          There is a different company Western Digital funds to work out what instructions should be added to RISC-V to assist with graphics processing.

          Originally posted by coder View Post
          Since you're so interested in power efficiency, you should look up some power efficiency scaling analysis for Skylake-SP. As you scale up core counts, communication consumes more and more of your power budget. And all the queues and buffers you need to hide the latency burn power and die space, as well.
          These are not network-on-chip solutions. Network-on-chip was designed to get rid of most of those queues and buffers, so Skylake-SP does not compare to a high-performance RISC-V design. Also, the way x86 does speculative execution comes back to hurt it.

          Originally posted by coder View Post
          The shocking reality is that GPUs and high-end GPUs are using HBM2 with data bus widths up to 4096-bits. I've not seen or heard of a GPU with an off-chip memory controller, and CPUs haven't had them for a decade.
          Current Gen-Z outside the socket is equal to HBM3. HBM does provide lower latency than Gen-Z. Gen-Z provides much larger RAM capacities because it is not limited to a single socket's size.
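
          For context on the numbers being thrown around here, peak bandwidth for a wide memory bus is just bus width times per-pin data rate. A quick sketch in Python; the pin rates are illustrative assumptions, not figures taken from either specification.

# Peak bandwidth = bus width (bits) * per-pin data rate (Gbit/s) / 8 -> GB/s.
# The pin rates below are illustrative assumptions used only for the arithmetic.

def peak_gb_s(bus_bits, pin_gbit_s):
    """Return peak bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_bits * pin_gbit_s / 8

# A 4-stack HBM2 configuration: 4096-bit bus at ~2.0 Gbit/s per pin.
print(f"HBM2, 4096-bit @ 2.0 Gbit/s: {peak_gb_s(4096, 2.0):.0f} GB/s")

# The same bus width at a faster, HBM3-class pin rate, for comparison.
print(f"4096-bit @ 6.4 Gbit/s:       {peak_gb_s(4096, 6.4):.0f} GB/s")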

          Yes, we have not seen CPUs and GPUs with off-socket memory controllers for a long time. The problem is that for some of the RISC-V prototype accelerators, HBM2 at 4096 bits is not able to provide the transfer rates they need.

          HBM2 is in fact a system-in-package (SiP), meaning that the HBM2 MMU controller is not part of the GPU die. This is also fairly new. Why? Because CPU cores need to be made at the smallest process node possible, while the MMU units controlling DDR need to be made at a larger node due to the analog signal processing the MMU is doing.

          The scary part: alongside HBM's stacking of RAM dies, there is work on 3D network-on-chip for RISC-V. Take the HBM idea of chip stacking and apply it to the 4096+16 core RISC-V prototype and you are looking at a 16384+64 monster with higher RAM requirements than HBM4 can deliver in capacity or speed...

          So we are about to see computing power quadruple in two years, while HBM stays a single-path bus.

          Yes, HBM has let you place DRAM dies in the package, but this is very much a short-term fix. The very idea HBM is based on, chip stacking, is what brings it undone once CPU/GPU/accelerator dies start doing the same thing.

          2020 is looking like a year of big trouble.

          Originally posted by coder View Post
          No, you can use a separate PCIe switch, like what's embedded in PC chipsets and on some adapter cards for hosting multiple M.2 drives in a PCIe slot.
          That still has a parent port.

          Originally posted by coder View Post
          That's cute how you're trying to spin it, but you're ignoring the reality of how they're actually using it.
          No, it's not. NVLink is used between Nvidia's ARM cores and their GPUs in their ARM systems as well. Really, it's you who are being cute and ignoring how Nvidia uses NVLink; the reason NVLink 2 is not used between x86 CPUs and Nvidia GPUs is that x86 CPUs don't support it. It is a GPU/CPU bus.

          Originally posted by coder View Post
          That's only a software limitation.
          Having to go to the CPU for permission in PCIe comes from the fact that the switch has a parent port. It's part of the PCIe design: only the parent port can grant PCIe transfer permissions. PCIe is designed to form a single tree with one parent at the top of everything. This is a PCIe specification limitation. Gen-Z is designed to allow multiple parents to be allocated, each controlling a different area of the bus, so it is a true multi-tree system.
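
          A toy way to picture the topology difference being described: PCIe enumerates into one tree under a single root, while a Gen-Z-style fabric can have several managers owning different regions. The node names below are made up purely for illustration; this is a sketch of the shape of the topologies, not real enumeration code.

# Toy topology sketch: a PCIe-style single-root tree vs. a fabric where more
# than one component can act as a parent/manager. Node names are made up
# purely for illustration; this is not real enumeration code.

pcie = {
    "root_complex": ["switch0"],              # the single parent at the top
    "switch0": ["gpu0", "nvme0", "switch1"],  # every switch has one upstream port
    "switch1": ["gpu1", "nic0"],
}

genz_fabric = {
    # Several components manage/request different regions; no single parent
    # has to grant every transfer.
    "manager_cpu0": ["memory_pool0", "gpu0"],
    "manager_cpu1": ["memory_pool1", "gpu1"],
    "gpu0": ["memory_pool1"],                 # an accelerator reaching memory directly
}

def roots(topology):
    """Nodes that appear only as parents, never as someone's child."""
    children = {c for kids in topology.values() for c in kids}
    return [n for n in topology if n not in children]

print("PCIe roots: ", roots(pcie))          # exactly one root
print("Gen-Z roots:", roots(genz_fabric))   # more than one parent is allowed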

          Originally posted by coder View Post
          Yes, but GPUs have orders of magnitude fewer cores. What Nvidia calls "cores" aren't. Their SMs are the equivalent of a CPU core. Why does it matter? Because you can batch an entire 32-lane SIMD's worth of memory accesses, thus vastly improving coherency over what you'd have if a comparable number of MIMD cores were all running. The other efficiency you get - and this is big - is that you're only fetching and decoding one instruction stream for all of them.
          This is not exactly true once you get into RISC-V with the vector extension.

          The problem here is that with some RISC cores, like the ET-Minion, you can batch an entire 32-lane SIMD's worth of memory accesses in a single core. So the new RISC-V designs don't have the PEZY-SC2 limitation. The RISC-V instruction set is truly general-purpose; it includes features of GPU instruction sets.

          The vector extension in RISC-V gives each core vector SIMD. The way vectors are done in RISC-V, it is also possible to spread vector work MIMD-style, so multiple RISC-V cores act as one. On the network-on-chip this works because instruction transfers just pass down a single lane on the chip, and the next core along starts at a different location in the vector processing.

          Really, the cores in the RISC-V prototypes I have shown have a lot in common with what Nvidia calls Streaming Multiprocessors once you have full-blown vector instructions being fed into them. So it is an instruction design that works as both MIMD and very wide SIMD. The result of vector instructions mixed with network-on-chip designs is dynamically sizeable Streaming Multiprocessors in the RISC-V designs, versus Nvidia/AMD GPUs with fixed-size Streaming Multiprocessors. I am not sure the GPU designs really have the advantage here, as they are nowhere near as flexible.

          Comment


          • Originally posted by oiaohm View Post
            The problem is you don't need to. Esperanto is able to use the Chisel RISC-V templates. Western Digital can and does give multi-million-dollar grants. Each of these grants goes to a company working on a particular area of silicon. Esperanto's job is to work out the interconnects between a massive number of RISC-V cores.

            There is a different company Western Digital funds to work out what instructions should be added to RISC-V to assist with graphics processing.
            You're still off by a couple orders of magnitude in the funding needed for a competitive HPC chip. WD isn't bankrolling any such thing on their own.

            Originally posted by oiaohm View Post
            These are not network-on-chip solutions. Network-on-chip was designed to get rid of most of those queues and buffers, so Skylake-SP does not compare to a high-performance RISC-V design. Also, the way x86 does speculative execution comes back to hurt it.
            I'm talking about inside the cores, so they don't stall. If you just let them stall, then your performance evaporates.

            Originally posted by oiaohm View Post
            Current Gen-Z outside the socket is equal to HBM3. HBM does provide lower latency than Gen-Z. Gen-Z provides much larger RAM capacities because it is not limited to a single socket's size.
            How's it equal to HBM3? You just contradicted yourself - they're alike, except also completely different in all the ways that matter.

            Originally posted by oiaohm View Post
            HBM2 is in fact a system-in-package (SiP), meaning that the HBM2 MMU controller is not part of the GPU die. This is also fairly new. Why? Because CPU cores need to be made at the smallest process node possible, while the MMU units controlling DDR need to be made at a larger node due to the analog signal processing the MMU is doing.
            I'm not sure you understand what a MMU is...

            Originally posted by oiaohm View Post
            Really, it's you who are being cute and ignoring how Nvidia uses NVLink; the reason NVLink 2 is not used between x86 CPUs and Nvidia GPUs is that x86 CPUs don't support it. It is a GPU/CPU bus.
            At the GPU Technology Conference last week, we told you all about the new NVSwitch memory fabric interconnect that Nvidia has created to link multiple...


            Originally posted by oiaohm View Post
            This is not exactly true once you get into RISC-V with the vector extension.

            The problem here is that with some RISC cores, like the ET-Minion, you can batch an entire 32-lane SIMD's worth of memory accesses in a single core.
            Except they won't, because there are so many of them. And they won't have the same level of SMT as GPUs, meaning they'll spend a lot of time stalled. It will have similar issues to Xeon Phi, except without the baggage of x86.
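
            To put a rough number on the stalling concern: by Little's law, the memory requests that must be in flight to keep a memory system busy equal bandwidth times latency. GPUs cover that with thousands of resident warps; a small core with little SMT has to cover its share some other way, or stall. The figures in this sketch are illustrative assumptions, not measurements of any real product.

# Little's law sketch: outstanding requests needed = bandwidth * latency.
# 1 GB/s is numerically 1 byte/ns, so GB/s * ns / bytes-per-request gives
# the number of requests that must be in flight. Figures are illustrative
# assumptions, not measurements of any real product.

MEM_BANDWIDTH_GB_S = 900   # target memory bandwidth, GB/s (= bytes per ns)
MEM_LATENCY_NS = 400       # round-trip memory latency, ns
BYTES_PER_REQUEST = 64     # one cache line per request

in_flight = MEM_BANDWIDTH_GB_S * MEM_LATENCY_NS / BYTES_PER_REQUEST
print(f"outstanding 64-byte requests needed: {in_flight:,.0f}")

# A GPU hides this with thousands of resident warps per chip; a small core
# with only a few hardware threads must stall or rely on big per-core buffers.
THREADS_PER_CORE = 4       # hypothetical outstanding requests a small core sustains
print(f"cores needed at {THREADS_PER_CORE} requests each: {in_flight / THREADS_PER_CORE:,.0f}")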

            Originally posted by oiaohm View Post
            The vector extension in RISC-V gives each core vector SIMD. The way vectors are done in RISC-V, it is also possible to spread vector work MIMD-style, so multiple RISC-V cores act as one. On the network-on-chip this works because instruction transfers just pass down a single lane on the chip, and the next core along starts at a different location in the vector processing.
            MIMD can emulate SIMD at a logical level, but this ignores the key efficiency and performance benefits of SIMD.

            Originally posted by oiaohm View Post
            Really, the cores in the RISC-V prototypes I have shown have a lot in common with what Nvidia calls Streaming Multiprocessors once you have full-blown vector instructions being fed into them.
            They're similar, except in most of the ways that really matter.

            Comment


            • Originally posted by coder View Post
              You're still off by a couple orders of magnitude in the funding needed for a competitive HPC chip. WD isn't bankrolling any such thing on their own.
              That is the thing: WD doesn't need to bankroll it all. This is the RISC-V effect; you have many different groups making their own HPC chips targeting different areas of HPC development, all based around Chisel and RISC-V.

              Originally posted by coder View Post
              I'm talking about inside the cores, so they don't stall. If you just let them stall, then your performance evaporates.
              Due to the bandwidth the network-on-chip provides, the buffers in each core are smaller. There have been 4096-core RISC-V designs done with no more than a normal L1 cache on a network-on-chip, so they even dropped the L2 cache out of the cores. It is a different way of taking on the problem than the one x86 uses. The trap with Xeon Phi was making the buffers too big. Skylake-SP has the same fault as Xeon Phi: it's called bufferbloat.

              Originally posted by coder View Post
              How's it equal to HBM3? You just contradicted yourself - they're alike, except also completely different in all the ways that matter.
              In total bandwidth provided, HBM3 and Gen-Z are equal.

              That is not a very good writeup on CPU placement on the NVSwitch; to see that you need to look at the ARM and PowerPC presentations. Yes, it mentions the PowerPC CPU as the bus controller.

              There is also a problem here when you look at Gen-Z: Gen-Z does not lock you to exactly one switch design.

              Originally posted by coder View Post
              Except they won't, because there are so many of them. And they won't have the same level of SMT as GPUs, meaning they'll spend a lot of time stalled. It will have similar issues to Xeon Phi, except without the baggage of x86.
              Xeon Phi's problems are not just x86. It's the way Intel designed the routers in their NoC.

              Yes, bufferbloat affects NoC designs badly. So the idea of buffers inside cores getting bigger to get higher throughput is backwards. As your buffers get too big, the number of stalled CPU cores increases, because it slowly but surely lowers bandwidth.

              Intel with x86 tried one NoC design. The RISC-V world has so far tried roughly 4000 different NoC designs. Any of them that resemble the Intel design perform badly.

              Do note how they say that increasing buffers in a network system fails to increase throughput. That holds whether it is a wired switch or a network-on-chip.
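
              The bufferbloat claim is easy to illustrate with a toy queue model: once a link is saturated, a bigger buffer does not raise throughput, it only adds queueing delay. This is a minimal discrete-time simulation with made-up arrival and service rates, not a model of any real NoC router.

# Toy discrete-time queue: packets arrive faster than the link can drain them.
# Throughput is capped by the service rate regardless of buffer size; a bigger
# buffer only lets more packets sit and wait, so average delay grows.
# Arrival/service rates and buffer sizes are made up for illustration.
from collections import deque

def run(buffer_size, steps=10_000, arrivals=3, service=2):
    queue, sent, delay_sum = deque(), 0, 0
    for t in range(steps):
        # new packets join the queue unless the buffer is full (then they drop)
        for _ in range(arrivals):
            if len(queue) < buffer_size:
                queue.append(t)
        # the link forwards at most `service` packets per time step
        for _ in range(min(service, len(queue))):
            delay_sum += t - queue.popleft()
            sent += 1
    return sent / steps, delay_sum / max(sent, 1)

for size in (8, 64, 512, 4096):
    throughput, avg_delay = run(size)
    print(f"buffer {size:>5}: throughput {throughput:.2f}/step, avg delay {avg_delay:8.1f} steps")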

              Originally posted by coder View Post
              MIMD can emulate SIMD at a logical level, but this ignores the key efficiency and performance benefits of SIMD.
              Except this is one of the catches: large SIMD units take quite a bit of silicon to implement. If you get your NoC design right, silicon area versus performance starts undermining the SIMD idea.

              Originally posted by coder View Post
              They're similar, except in most of the ways that really matter.
              So true; it requires thinking differently. Can the SM function be done with network packets? The answer is yes. Network-packet control means you don't need as many wires to control the individual modules (warp schedulers) that make up the virtual Streaming Multiprocessors.

              So yes, inside the RISC-V prototypes the small cores = warp schedulers. The SM is a network-on-chip construct. As you said, MIMD can emulate SIMD; when there are no huge buffers causing bufferbloat and you are using a highly effective NoC, it can turn out to be more efficient than wiring up large SIMD circuitry, and more flexible.

              So a 4096/16-core RISC-V, due to the design of the NoC, acts as 4096 warp schedulers grouped into 16 SM units. This configuration could also be changed on the fly, because it is a network-on-chip.

              The idea of dynamically creating blocks of processing units goes back to tile processors, so it's not exactly a new idea to use the network-on-chip to control groups of cores as one unified unit.

              Coder, it's really easy to ignore that these RISC-V designs have more than one CPU type inside, just like a GPU. Just because these prototypes are multi-core and don't have the SM stacked on top of warp schedulers does not mean there was not another way to skin that cat and use less silicon to achieve the same result.

              Originally posted by coder View Post
              They're similar, except in most of the ways that really matter.
              This line of yours compares Skylake-SP and Xeon Phi to massively multi-core RISC-V or PEZY-SC2, yet the RISC-V and PEZY-SC2 designs don't have a bufferbloat problem. All the buffer problems you see with Skylake-SP and Xeon Phi are Intel making the same mistake as the networking people who believed bigger buffers would make things better. Up to a limited size a buffer improves performance; past that it hinders performance and creates stalls.

              Yes, the current Nvidia interconnect design also breaks rules learnt from bufferbloat. Really, how many times are people going to build networks in silicon and ignore what we have already learnt, from standard Ethernet networking, to be an extremely bad idea, and proceed to implement those extremely bad ideas anyway? Networking is networking.

              Comment


              • Originally posted by oiaohm View Post
                Having to go to the CPU for permission in PCIe comes from the fact that the switch has a parent port. It's part of the PCIe design: only the parent port can grant PCIe transfer permissions. PCIe is designed to form a single tree with one parent at the top of everything. This is a PCIe specification limitation.
                Ethernet, InfiniBand, and the handful of high-speed, low-latency interconnects that have been designed for supercomputers and large shared memory systems...

                Comment


                • Originally posted by oiaohm View Post
                  Due to the bandwidth the network-on-chip provides, the buffers in each core are smaller.
                  ...not understanding the distinction between bandwidth and latency.

                  Originally posted by oiaohm View Post
                  In total bandwidth provided, HBM3 and Gen-Z are equal.
                  And that is?

                  Originally posted by oiaohm View Post
                  That is not a very good writeup on CPU placement on the NVSwitch
                  Precisely because the CPU is not on the NVSwitch. Again, you missed the point.

                  Originally posted by oiaohm View Post
                  Yes, bufferbloat affects NoC designs badly. So the idea of buffers inside cores getting bigger to get higher throughput is backwards. As your buffers get too big, the number of stalled CPU cores increases, because it slowly but surely lowers bandwidth.
                  You're confusing buffers in the cores with buffers in the NoC. Buffers in the cores reduce stalling, but burn die space and power. Buffers in the NoC would increase latency, and thereby stalling of the cores.

                  And where is it written how big are the buffers in Intel's NoC? Where is the performance analysis indicating this is a problem?

                  Originally posted by oiaohm View Post
                  So true; it requires thinking differently. Can the SM function be done with network packets? The answer is yes. Network-packet control means you don't need as many wires to control the individual modules (warp schedulers) that make up the virtual Streaming Multiprocessors.

                  So yes, inside the RISC-V prototypes the small cores = warp schedulers.
                  Now you're just throwing words around.

                  Originally posted by oiaohm View Post
                  ... does not mean there was not another way to skin that cat and use less silicon to achieve the same result.
                  I think my whole point is that adding so many additional instruction decoders and schedulers + NoC nodes is not using less silicon or power. If your problem is already a good fit for a GPU, then you really can't do much better.

                  Comment


                  • "And so with ExpressFabric we have taken PCI-Express and extended it on top of the spec so you can do this." << Did not read this bit.
                    Yes, you are able to have the host set up this TCP.

                    RONNIEE Express is only designed to go between CPUs. It is not designed to deal with GPUs, accelerators, or storage.

                    In this video from the OpenFabrics Workshop, Greg Casey from Dell EMC presents "Gen-Z: An Overview and Use Cases": "This session will focus on the new Gen-Z me..."

                    As covered in this video, there is a limit to how much RAM we can connect to a CPU directly.

                    Gen-Z introduces storage-class memory, where you have off-chip media controllers.

                    Originally posted by coder View Post
                    Precisely because the CPU is not on the NVSwitch. Again, you missed the point.
                    No, you are missing the problem. NVSwitch can be used between PowerPC cores. This is covered in different presentations on PowerPC usage.

                    Originally posted by coder View Post
                    You're confusing buffers in the cores with buffers in the NoC. Buffers in the cores reduce stalling, but burn die space and power. Buffers in the NoC would increase latency, and thereby stalling of the cores.
                    I am absolutely not. The RISC-V designs have experimented with only L1 caches: one L1 for instructions and one L1 for data, no L2, no L3, and no extra buffers in the core.

                    Originally posted by coder View Post
                    And where is it written how big are the buffers in Intel's NoC? Where is the performance analysis indicating this is a problem?
                    The E31 Core Complex is a RISC-V core designed for use on a NoC.

                    You will see it has 16 KB of instruction and 64 KB of data at the L1 cache level. No L2, no L3.

                    Xeon Phi: 32 KB L1 + 512 KB L2.
                    Yes, Intel dropped L3 but then proceeded to double L2; the Xeon of the time only had a 256 KB L2.
                    The problem is that you introduce latency searching for whether something is in the L2, time that would be better spent crossing the network to the NoC caches that are the replacement for L3.

                    The fact that Xeon Phi has an L2 shows it has too much buffering to use a NoC effectively. If that L2 is required, the NoC design inside the Xeon Phi is too slow.

                    In a NoC CPU you increase the size of your L1, you lose your L2, and what was L3 becomes caches between the memory controllers and the NoC.

                    The processing cores of NoC CPUs should only have L1 caches if they are going to perform well. If the cores have an L2, they have too much buffering. Going from an L1 miss straight to the NoC caches generates a request latency equal to missing in L2 and having to go to L3 in a normal CPU core.

                    Having an L2 in your CPU cores in fact causes the same problem as bufferbloat in a normal network: you spend time searching a buffer instead of sending the request, creating a stall, when sending the request straight onto the network would have returned the answer before you actually needed it.
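
                    The L2-versus-no-L2 argument can be made concrete as an average-memory-access-time (AMAT) comparison: whether dropping the per-core L2 wins depends on whether its hit rate on L1 misses covers its extra lookup latency. The latencies and hit rates in this sketch are hypothetical, chosen only to show how the trade-off is computed.

# Average memory access time (AMAT) with and without a per-core L2, where an
# L1 miss otherwise goes straight over the NoC to a shared cache (DRAM on a
# miss there is folded into NOC_PATH). All cycle counts and hit rates are
# hypothetical, for illustration only.

L1_HIT = 2            # L1 hit latency, cycles
L1_MISS_RATE = 0.10   # fraction of accesses that miss L1
L2_LOOKUP = 14        # added latency of checking a per-core L2
NOC_PATH = 60         # average cost of going out over the NoC after a miss

def amat(l2_hit_rate=None):
    """AMAT in cycles; pass None to model an L1-only core."""
    if l2_hit_rate is None:
        beyond_l1 = NOC_PATH
    else:
        beyond_l1 = L2_LOOKUP + (1 - l2_hit_rate) * NOC_PATH
    return L1_HIT + L1_MISS_RATE * beyond_l1

print(f"L1-only core:           {amat():.1f} cycles")
for rate in (0.2, 0.4, 0.6):
    print(f"with L2, hit rate {rate:.0%}: {amat(rate):.1f} cycles")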

                    Yes, the other thing is that the L1 of these RISC cores is large and configured differently from the Xeon Phi's. Even MIPS-based chips like the PEZY-SC2 have removed L2 from their mass processing cores and altered the L1 configuration.

                    The buffer in a core that reduces direct core stalling is the L1. L2 and L3 don't make sense in a massive multi-core; in a 4-core system an L1/L2/L3 cache setup makes sense.

                    Really, Xeon Phi was never going to perform well. Also, Xeon Phi has no on-die cache between the memory controllers and the NoC, and no cache in the memory controllers.

                    Xeon Phi is a classic example of the massive multi-core designs that have been given up on.

                    Early RISC-V and MIPS prototypes of massive multi-core just put standard cores with standard L1/L2/L3 on a NoC as if they were a cluster; this turned out to be a huge mistake. Then, just like Intel, they tried removing L3 and making the L2 caches bigger; this does not fix the problem, the same as the bufferbloat mistake. The correct solution is to go exactly the other way: make the total overall buffering smaller (caches and the buffers inside the NoC routers), focus on lifting NoC speed, and only use L1 plus caches on the interfaces to the NoC. Once you do that, you get a massive multi-core system that performs.

                    Originally posted by coder View Post
                    I think my whole point is that adding so many additional instruction decoders and schedulers + NoC nodes is not using less silicon or power. If your problem is already a good fit for a GPU, then you really can't do much better.
                    The problem is I showed you the DARPA numbers and you said the task is not a good fit for a GPU. The reality is that if you stop thinking the wrong way and stop copying Intel and the others who have goofed up their massive multi-core designs, you will see that the buffers shrink.

                    Instruction decoders for RISC-V are not hyper-complex. Processing units in GPUs have queues and schedulers too.


                    Page 13: RISC-V cores with vector processing are pretty much a drop-in for the warp scheduler. I do mean drop-in: able to perform as much floating point per clock cycle.

                    What they call L0 cache there would be what I am referring to as L1 cache.


                    Do note that the Tensor instructions on the ET-Minion are the same thing Nvidia calls Tensor.

                    Basically, that 4096/16-core system is like assembling an Nvidia GPU with the SM ripped out and replaced by a NoC, and the Nvidia instruction set replaced by a RISC-V equivalent. Vector instructions suit a NoC better. Each ET-Minion core is able to process as much as an Nvidia warp scheduler. Each Nvidia warp scheduler has an instruction decoder, so there is no difference in instruction-decoder counts. Communication with the outside world is less complex using a NoC instead of an SM.

                    The ET-Minion cores have a full general-purpose instruction set, so they can do generic processing; they also have full vector support, so they can do wide SIMD processing. The advantage of vectors is that if a request is too large to be processed completely by one core, the part it cannot do can be farmed off to the next core.
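
                    That "farm the remainder off to the next core" idea is essentially strip-mining a long vector across cores. Here is a minimal sketch in plain Python of the dispatch pattern, assuming a hypothetical per-core vector length; it is not real RISC-V vector code, just an illustration of how the work would be split.

# Minimal strip-mining sketch: one long vector operation split across several
# cores, each of which can only handle VLEN elements per pass. VLEN and the
# core count are hypothetical; this models the dispatch pattern only, not
# actual RISC-V vector hardware.

VLEN = 256        # elements one core's vector unit processes per pass (assumed)
NUM_CORES = 4     # neighbouring cores the remainder can be farmed out to

def vector_add(a, b):
    n = len(a)
    out = [0.0] * n
    work_per_core = [0] * NUM_CORES
    strips = [(i, min(i + VLEN, n)) for i in range(0, n, VLEN)]
    # hand the strips to the cores in turn, like passing work down the NoC
    for idx, (lo, hi) in enumerate(strips):
        core = idx % NUM_CORES               # which core takes this strip
        work_per_core[core] += hi - lo
        for j in range(lo, hi):              # that core runs its strip as wide SIMD
            out[j] = a[j] + b[j]
    return out, work_per_core

x = [float(i) for i in range(1000)]
y = [2.0] * 1000
result, split = vector_add(x, y)
assert result[999] == 1001.0
print("elements handled per core:", split)   # e.g. [256, 256, 256, 232]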

                    Comment


                    • Originally posted by coder View Post
                      ...not understanding the distinction between bandwidth and latency.
                      I was wondering the same thing, but then I remembered when he was talking about a 4 clock cycle latency from store-forwarding and used it as a bandwidth argument.

                      Comment
