
Libre RISC-V Snags $50k EUR Grant To Work On Its RISC-V 3D GPU Chip


  • #61
    Originally posted by log0 View Post
    You'll never get the GFLOP/Watt of a modern GPU just by slapping a bunch of Risc-V cores with some simd/vector capability together.

    Just look at AMDs GCN. A compute unit has 64KB local data share, 4KB L1 cache, 16 load/store units, 4 texture filtering units, 4 SIMD units. Each SIMD unit has 16 lanes backed by a huge 64KB register file.

    That's a whole 'nother world compared to this RISC-V "GPU" project.
    Spread over various videos, writings, and mailing list discussions, a picture is beginning to emerge of a suitable microarchitecture.


    Except this Libre RISC-V GPU project is not a normal RISC-V core. You don't find 128 FP and 128 integer registers in a normal RISC-V core; the norm is 16 or 32 registers.

    You are right that normal RISC-V cores would never give the GFLOP/watt of a modern GPU. But the currently targeted GFLOP figure at 28 nm is for 64-bit FLOPs, and once you take that into account and scale the design down, it can certainly compete against AMD GCN. GCN uses a much larger silicon area: the SIMD units in a GCN compute unit are larger than the targeted Libre RISC-V GPU cores, but not proportionally larger in GFLOPs of processing. So in the silicon area of one GCN compute unit you could fit at least 4 of the Libre RISC-V GPU cores.

    So I would not be so sure it's a whole other world. You need to go back to basics and compare performance against silicon area, with allowances for fabric costs; they are not as far apart as it first appears.
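
    As a very rough illustration of that performance-per-area argument, here is a back-of-envelope sketch in Python. Every number in it (CU area, core area, per-unit GFLOPs) is an assumed placeholder rather than a measured figure; substitute whatever die-area and throughput numbers you actually trust.

    ```python
    # Rough back-of-envelope comparison of throughput per unit silicon area.
    # Every figure below is an illustrative placeholder, NOT a measured value;
    # substitute whatever die-area and GFLOP numbers you actually trust.

    def gflops_per_mm2(gflops, area_mm2):
        """Throughput density: GFLOPs delivered per mm^2 of silicon."""
        return gflops / area_mm2

    # hypothetical 28nm-class numbers, purely for the shape of the argument:
    gcn_cu     = {"gflops": 80.0, "area_mm2": 5.5}   # one GCN compute unit (assumed)
    libre_core = {"gflops": 12.0, "area_mm2": 1.3}   # one Libre RISC-V GPU core (assumed)

    print("GCN CU     : %.1f GFLOPs/mm^2" % gflops_per_mm2(**gcn_cu))
    print("Libre core : %.1f GFLOPs/mm^2" % gflops_per_mm2(**libre_core))

    # how many Libre cores fit in the footprint of one compute unit,
    # and what they add up to in that same area:
    fit = int(gcn_cu["area_mm2"] // libre_core["area_mm2"])
    print("%d Libre cores per CU footprint = %.0f GFLOPs" % (fit, fit * libre_core["gflops"]))
    ```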



    • #62
      Originally posted by log0 View Post
      You'll never get the GFLOP/Watt of a modern GPU just by slapping a bunch of Risc-V cores with some simd/vector capability together.
      absolutely. Jeff Bush's Nyuzi paper is the canonical reference here: Nyuzi was a research project to find out precisely *why* Larrabee failed (Intel's team were prohibited by their Marketing Dept from speaking up, which is why you saw Larrabee used as a high-performance compute-cluster ASIC, *NOT* as a GPU).

      after speaking with Mitch Alsup (who designed the Samsung GPU Texture opcodes), we will almost certainly be doing texturisation instructions. follow the trail here: http://bugs.libre-riscv.org/show_bug.cgi?id=91 - we learned from him that texturisation is done through *massive*, regularly-sized Vulkan texture maps, and that the floating-point pixel coordinates are used as the lookup system. if the FP coordinate is not an integer, you also need to look up the *neighbouring* texels and perform interpolation. done in software alone it's really quite horrendous, and it interacts with the LOAD/STORE system in a way that means it's best done not as "standard" LD/ST but as its own "thing", bypassing the standard LD/ST checks needed for general-purpose computing.

      these kinds of decisions are just not needed - at all - in a standard "Parallel Compute Cluster" ASIC.
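
      to make the "neighbouring texels plus interpolation" step concrete, here's a minimal software sketch of a bilinear texel fetch in Python. the texture layout, clamping policy and greyscale-only texels are simplifying assumptions for illustration, not the project's actual design:

      ```python
      # Minimal software sketch of the texel lookup described above: a
      # floating-point (u, v) coordinate addresses the texture, and when it
      # falls between texel centres the four neighbouring texels are fetched
      # and blended (bilinear interpolation).

      import math

      def bilinear_sample(texture, u, v):
          """texture: 2D list of greyscale texel values; (u, v): FP coords in texel space."""
          h, w = len(texture), len(texture[0])
          x0, y0 = int(math.floor(u)), int(math.floor(v))
          fx, fy = u - x0, v - y0                      # fractional parts drive the blend
          x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
          x0, y0 = max(min(x0, w - 1), 0), max(min(y0, h - 1), 0)

          # four neighbouring texels (this is where the extra LOAD traffic comes from)
          t00, t10 = texture[y0][x0], texture[y0][x1]
          t01, t11 = texture[y1][x0], texture[y1][x1]

          top    = t00 * (1 - fx) + t10 * fx
          bottom = t01 * (1 - fx) + t11 * fx
          return top * (1 - fy) + bottom * fy

      tex = [[0.0, 1.0], [1.0, 0.0]]
      print(bilinear_sample(tex, 0.5, 0.5))   # 0.5: equal blend of all four texels
      ```

      note that every single sample touches four memory locations, which is exactly the LD/ST pressure described above.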




      • #63
        Originally posted by oiaohm View Post
        So I would not be so sure it's a whole other world. You need to go back to basics and compare performance against silicon area, with allowances for fabric costs; they are not as far apart as it first appears.
        this concurs with my estimates. if we scaled up to say 256 cores, we'd easily be around the 150W mark, and also be at 64x the performance. so if we manage to hit 12 GFLOPs in the current design within the 2.5W budget (@28nm), that ramps up to 768 GFLOPs at around the 150W mark (@28nm), which is not shabby at all.

        i've got an open bugreport on using openpiton as the NoC http://bugs.libre-riscv.org/show_bug.cgi?id=69 which would give us potential scalability up to 500,000 cores (on and off chip). of course, we'd then also need the memory controller(s) to be able to cope with that... we're into multi-tens-of-millions-of-dollars territory, and i'd rather get the basics up and running first, on this (much smaller) budget. walk before run.
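
        the scaling arithmetic is simple enough to check in a few lines of Python. the 4-core baseline and the purely linear scaling are assumptions (they ignore NoC, memory-controller and clock-distribution overhead), but they reproduce the ballpark figures above:

        ```python
        # Sanity-check of the scaling estimate: take the per-chip figure
        # (12 GFLOPs in ~2.5 W at 28nm, assumed to come from 4 cores) and
        # scale linearly to 256 cores. Linear scaling is optimistic.

        base_cores  = 4        # assumed core count behind the 12 GFLOPs figure
        base_gflops = 12.0
        base_watts  = 2.5

        target_cores = 256
        scale = target_cores / base_cores        # 64x

        print("scale factor :", scale)
        print("GFLOPs       :", base_gflops * scale)   # ~768 GFLOPs
        print("power (W)    :", base_watts * scale)    # ~160 W, near the quoted ~150 W mark
        ```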



        • #64
          Originally posted by lkcl View Post
          after speaking with Mitch Alsup (who designed the Samsung GPU Texture opcodes), we will almost certainly be doing texturisation instructions. follow the trail here: http://bugs.libre-riscv.org/show_bug.cgi?id=91
          ha! cool! jacob's talking in that bugreport about the idea of *auto-generating* the actual texturisation HDL (nmigen) directly from the Vulkan Texturisation API formats, at the same time as developing the SPIR-V to LLVM IR conversion, that will have the very texturisation opcodes in it that are also auto-generated. cool!

          the only fly in the ointment being, it's a frickin lot of work.
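
          for a flavour of what "auto-generating the HDL from the format table" could look like, here's a hypothetical nmigen sketch. the format table, module interface and naming are all invented for illustration; this is not jacob's actual proposal or the project's code:

          ```python
          # Hypothetical sketch: a tiny description of a few Vulkan-style texel
          # formats drives the generation of a per-format unpack module in nmigen.

          from nmigen import Elaboratable, Module, Signal

          # (name, bits-per-channel) -- an assumed, much-simplified "format table"
          FORMATS = {
              "R8G8B8A8_UNORM": (8, 8, 8, 8),
              "R5G6B5_UNORM":   (5, 6, 5, 0),
          }

          class TexelUnpack(Elaboratable):
              """Slice a packed texel word into per-channel signals for one format."""
              def __init__(self, channel_bits):
                  self.channel_bits = channel_bits
                  self.texel = Signal(sum(channel_bits))
                  self.channels = [Signal(b or 1) for b in channel_bits]

              def elaborate(self, platform):
                  m = Module()
                  offset = 0
                  for sig, bits in zip(self.channels, self.channel_bits):
                      if bits:
                          m.d.comb += sig.eq(self.texel[offset:offset + bits])
                          offset += bits
                  return m

          # one unpack module per format, generated straight from the table
          units = {name: TexelUnpack(bits) for name, bits in FORMATS.items()}
          ```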



          • #65
            1.

            RISC-V based linear algebra accelerator for SoC designs
            Claiming faster than GPU

            Tell me, what's this technology all about? What exactly is it useful for & what's its purpose?
            [obviously beyond what's already said in the presentation]
            What happened to the proposal to get this into the formal specs - was it approved?
            Why do you not use this, or do you?
            What's your stand on this?


            2.

            ha! cool! jacob's talking in that bugreport about the idea of *auto-generating* the actual texturisation HDL (nmigen) directly from the Vulkan Texturisation API formats, at the same time as developing the SPIR-V to LLVM IR conversion, that will have the very texturisation opcodes in it that are also auto-generated. cool!

            the only fly in the ointment being, it's a frickin lot of work.
            Talking about the second-stage card - amplified, so to speak - after the €50,000 card is delivered,
            and about super-scaling those processors directly on an FPGA alone:
            both of these to be addressed with the same questions
            - what are the expectations in terms of power, performance, etc., in detail?
            - how much of the biggest Xilinx Virtex Defense Family FPGA would that utilize?
            - & also how many would be required to run in a cluster for really high-performance tasks [CAD, gaming],
            if capable at all - and if not, what will it be capable of & what needs to be done to get there?


            Any other comments on the matter?

