Libre RISC-V Open-Source Effort Now Looking At POWER Instead Of RISC-V


  • #61
    Originally posted by Qaridarium View Post
this is just wrong. it means "many specific cores", which means ASIC transistor logic for every Vulkan command.

many small cores will not do the job if they are not specific ASIC transistor logic.
Wrong, no GPU dedicates transistor logic to each Vulkan command.
    Vulkan is not as simple as a media codec (x264, for example) where you can make an ASIC for that specific algorithm.

    GPUs use cores that are more general-purpose than that, while not as general-purpose as CPU cores.



    • #62
      Originally posted by starshipeleven View Post
Wrong, no GPU dedicates transistor logic to each Vulkan command.
      Vulkan is not as simple as a media codec (x264, for example) where you can make an ASIC for that specific algorithm.

      GPUs use cores that are more general-purpose than that, while not as general-purpose as CPU cores.
      your "Many tiny cores" is just wrong. if it is "Many spezific cores" then it is right.

      "GPUs are using cores that are more general-purpose than that, while not as general-purpose as CPU cores."

      this is right but this does not make your "many tiny cores" right at all.

      and the "many core" was just wrong the last 4+ years...

      because a AMD R9 Fury X has 4096 cores 2015-06-25, 12:22

      A AMD Vega64 has 4096 cores

      A AMD Radeon Pro WX 9100 has 4096 cores

      A Vega 20 7nm: AMD Radeon Instinct MI60 has 4096 cores 2019-02-05(Vega 20 XT)

      THE CORE COUNT DID NOT GO UP IN THE LAST 4+ YEARS

      if your "Many tiny cores" where right then the core count would go up the last 4 years.

      but this is just wrong.

      what they did optimize is the core clock:

      Chip Fiji PRO "GCN 3" Fertigung 28nm Chiptakt 1000MHz

      VS

      Chip Navi 10 XT "AMD RDNA 1.0" Fertigung 7nm (TSMC) Chiptakt 1770MHz, Boost: 2010MHz

      as you can see they go from 1000mhz to 2000mhz in 4 years

      THEY DID NOT INCREASE THE CORE COUNT.

      but you still talk bullshit: "Many tiny cores"....

      the Navi 10 XT only has 2560 cores

      this means the core count even goes down and NOT Up.
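
      the clock-vs-cores trade shows up directly in the headline FP32 numbers, since peak throughput is roughly cores x 2 FLOPs per cycle x clock. here is a back-of-the-envelope sketch using the figures quoted above (boost clock assumed for Navi 10; real sustained clocks differ):

```python
# back-of-the-envelope peak FP32 throughput, using the core counts and
# clocks quoted above (each shader "core" does one FMA = 2 FLOPs per cycle)
def peak_tflops(cores, clock_ghz):
    return cores * 2 * clock_ghz / 1000.0

print(peak_tflops(4096, 1.00))   # Fiji PRO   @ 1000 MHz      -> ~8.2 TFLOPS
print(peak_tflops(2560, 2.01))   # Navi 10 XT @ ~2010 MHz boost -> ~10.3 TFLOPS
```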



      • #63
        Originally posted by Qaridarium View Post
        if your "Many tiny cores" where right then the core count would go up the last 4 years.
Nonsense.

        More cores/processing units mean more transistors. With an increasing number of cores you also need more logic to feed them efficiently. All this means bigger chips, lower yields, higher costs.

        In the end it is a design parameter. If you can push frequency you'll rather go with fewer cores, as it is simply cheaper.

        Btw a GPU "core" is typically just an FP/INT unit, very much like the INT and FP units in a CPU. So calling it a "tiny core" is not wrong.



        • #64
          Originally posted by nokipaike View Post
do you want to create an open-source GPU?
          write an architecture that is already multicore - parallel computing.
          Create a "micro-core that has the minimum of cycles required" for each existing OpenGL / Vulkan API call. Get help from a machine-learning architecture to optimize performance, consumption, memory allocation and semaphores.
          voila, here is your open GPU.
this was the hope that inspired Larrabee. they created an absolutely fantastic "Parallel Compute Engine". unfortunately, the GPU-level performance was so bad that the team was *not allowed* to publish the numbers.

          Jeff Bush from Nyuzi had to research it, and i talked with him over the course of several months: we established that a software-only GPU - with no custom accelerated opcodes - would have only TWENTY FIVE percent of the performance of, say, a MALI 400, for the same silicon die area. that means that a comparably performing software-only GPU would require FOUR times the power (and die area).

          obviously, that's not going to fly.

          in speaking with Mitch Alsup on comp.arch i found out a little bit more about why this is. it turns out one of the reasons is that if you want a "fully accurate" IEEE754 FP unit, to get that extra 0.5 ULP (units in the last place), you need THREE TIMES the silicon area.

          in a GPU you just don't care that much about accuracy, and that's why the Vulkan spec allows much less accurate answers for SQRT, RSQRT, SIN, COS, LOG etc.

          basically there are areas where you are trading speed for accuracy, and these tend to conflict badly with the "accuracy" requirements of traditional "Compute" engines. we are kinda... loonies for even trying. however, if you look at the MIPS-3D ASE (you can still find it online), running instructions twice to get better accuracy is a known technique, and if we plan the ALUs in advance, we can "reprocess" intermediary results using microcoding, and serve *both* markets - GPU (less accurate, less time, less power) and IEEE754 (fully accurate, longer, more power).
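
          to make the "run it twice to refine" idea concrete, here is a minimal sketch in plain Python (not our actual microcode or ALU design; the magic-constant trick is just a stand-in for any cheap first estimate): a low-accuracy reciprocal-square-root estimate gets "reprocessed" with Newton-Raphson steps. the GPU mode would stop after the estimate, the IEEE754-ish mode would spend the extra iterations:

```python
import struct

def rsqrt_estimate(x: float) -> float:
    """Quake-style fast inverse square root: a deliberately low-accuracy
    first estimate (roughly the kind of answer a relaxed GPU opcode gives)."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)                  # magic-constant bit hack
    return struct.unpack('<f', struct.pack('<I', i))[0]

def rsqrt_refined(x: float, steps: int = 1) -> float:
    """Reprocess the estimate: each Newton-Raphson step roughly doubles the
    number of correct bits, trading extra time/power for accuracy."""
    y = rsqrt_estimate(x)
    for _ in range(steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

x = 2.0
print(rsqrt_estimate(x))    # a percent or so off: fine for relaxed GPU accuracy
print(rsqrt_refined(x, 2))  # two extra steps: many more correct bits
print(2.0 ** -0.5)          # reference value, 1/sqrt(2)
```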



          • #65
            Originally posted by starshipeleven View Post
Wrong, no GPU dedicates transistor logic to each Vulkan command.
            Vulkan is not as simple as a media codec (x264, for example) where you can make an ASIC for that specific algorithm.

            GPUs use cores that are more general-purpose than that, while not as general-purpose as CPU cores.
we're finding this out. Jacob has been studying the Vulkan spec for some time, and Mitch Alsup has been helping on comp.arch to keep us on the "straight and narrow" as far as gate count is concerned. yes, you have sin, cos and atan2, but you do *not* bother to put down arctan or arccos etc., because those can be computed to reasonable accuracy in software, just like on any general-purpose processor, and they're so infrequently used that on the face of it it's not worthwhile adding them.
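
            as an illustration of that "do it in software" point (a sketch only; these function names are made up and not from our code), the missing inverse-trig ops reduce to the ones you do keep in hardware:

```python
import math

# assuming hardware gives us atan2(y, x) and sqrt(x), the "long tail"
# inverse-trig ops can be built from them in software:
def my_atan(x):        # atan(x)  = atan2(x, 1)
    return math.atan2(x, 1.0)

def my_acos(x):        # acos(x)  = atan2(sqrt(1 - x^2), x)
    return math.atan2(math.sqrt(1.0 - x * x), x)

def my_asin(x):        # asin(x)  = atan2(x, sqrt(1 - x^2))
    return math.atan2(x, math.sqrt(1.0 - x * x))

print(my_acos(0.5), math.acos(0.5))   # both ~1.0472 (60 degrees)
```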

however, one of the things that we want to provide is the "unusual" stuff - the "long tail" of 3D - so that people can innovate and don't get caught out by the "mass market" GPU focus. and for that, we simply can't predict what people *might* use the Libre GPU for. therefore, we may just have to put the hardware opcodes in anyway. buuut, doing so is... expensive (if they are dedicated units), so one thing we might do is just put in a CORDIC engine and use microcode for anything that's not commonly used. CORDIC is so versatile it can do almost anything, it's really amazing (see the sketch below).

that way, at least there *is* a small performance gain to be had for all the "unpopular" opcodes, and we can see what happens in the market as customers pick up on it.
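
            for the curious, here is a minimal software model of CORDIC in rotation mode computing sin/cos with nothing but shift-and-add style updates. it is purely illustrative of why one iterative engine plus microcode can cover so many opcodes - this is not our RTL, and the iteration count and table here are assumptions:

```python
import math

ITERS  = 24
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERS)]   # atan(2^-i) lookup table
K_INV  = 1.0                                             # reciprocal of the CORDIC gain
for a in ANGLES:
    K_INV *= math.cos(a)

def cordic_sin_cos(theta):
    """Rotation-mode CORDIC: rotate (1, 0) towards angle theta using only
    shift-and-add style updates. Returns (cos(theta), sin(theta)).
    Converges for |theta| <= ~1.74 rad; argument reduction is omitted."""
    x, y, z = 1.0, 0.0, theta
    for i in range(ITERS):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x * K_INV, y * K_INV

c, s = cordic_sin_cos(0.7)
print(c, math.cos(0.7))   # both ~0.7648
print(s, math.sin(0.7))   # both ~0.6442
```

            vectoring mode of the same loop gives atan2 and magnitudes, and hyperbolic variants give exp/log, which is why a single microcoded CORDIC unit is such a cheap way to cover the rarely-used opcodes.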



            • #66
              Originally posted by nokipaike View Post
              if you think about it, it would consume very little. it would be easy to design and build. the power would be all in parallelism.
yeah, but sadly, its general-purpose performance would suck. we want to combine the two tasks, so that you don't *have* two L1 caches, two sets of RAM, two sets of everything-but-slightly-different.

              so we are doing a compromise: it turns out that if you make every other pipeline latch dynamically "transparent", you can turn a 5-stage pipeline into a 10-stage one at the flick of a switch.

              running on low power, at low speed, you open up the gates and two pipeline combinatorial blocks are now connected back-to-back. want to run at desktop-level speeds? close up the gates and you have a 10-stage pipe that can run at 1.6GHz.
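
              a toy behavioural model of that trick, in plain Python rather than the project's actual HDL (all names here are made up), just to show how the same ten latch positions become a 5-stage or 10-stage pipe depending on a mode bit:

```python
# toy behavioural model of the "transparent pipeline latch" trick above
class Latch:
    """A pipeline register that can be switched to pass-through."""
    def __init__(self):
        self.stored = None

    def step(self, data_in, transparent):
        if transparent:
            return data_in                       # gate open: combinatorial pass-through
        out, self.stored = self.stored, data_in  # gate closed: one-cycle delay
        return out

latches = [Latch() for _ in range(10)]           # 10 latch positions

def clock_edge(data_in, high_speed):
    """Advance the whole pipe by one clock edge.
    high_speed=True : all 10 latches clocked -> deep pipe, short combinatorial
                      paths between latches, so the clock can run ~2x faster.
    high_speed=False: every other latch transparent -> effectively a 5-stage
                      pipe with lower cycle-count latency at low clock/power."""
    x = data_in
    for i, latch in enumerate(latches):
        transparent = (not high_speed) and (i % 2 == 1)
        x = latch.step(x, transparent)
        # (the real combinatorial logic between latches is omitted here)
    return x

# push a token through in low-speed mode: it emerges after 5 clock edges
for cycle in range(6):
    print(cycle, clock_edge("token" if cycle == 0 else None, high_speed=False))
```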



              • #67
                Originally posted by log0 View Post
Nonsense.
                More cores/processing units mean more transistors. With an increasing number of cores you also need more logic to feed them efficiently. All this means bigger chips, lower yields, higher costs.
                In the end it is a design parameter. If you can push frequency you'll rather go with fewer cores, as it is simply cheaper.
                Btw a GPU "core" is typically just an FP/INT unit, very much like the INT and FP units in a CPU. So calling it a "tiny core" is not wrong.
well well well, but you have to admit that in the end the core count goes down and not up.
                this means the "many tiny cores" theory does not fit reality, because if it did the core count would always go up.
                the reasons why this is so have multiple variables, like you say: "need more logic to feed them efficiently"
                or "If you can push frequency you'll rather go with fewer cores, as it is simply cheaper."
                this all means the "many tiny cores" theory does not fit reality.

                "Btw a GPU "core" is typically just an FP/INT unit, very much like the INT and FP units in a CPU. So calling it a "tiny core" is not wrong."

                yes... but why not call it a specific core? if "tiny" were the key, you could use the old MOS Technology 6502 8-bit core from the Commodore 64. but as you can read from lkcl, if you only use tiny cores instead of specific cores it results in this: "a software-only GPU - with no custom accelerated opcodes - would have only TWENTY FIVE percent of the performance of, say, a MALI 400, for the same silicon die area."

                in the end the "many tiny cores" people are defeated, and the "tiny core" people are also defeated.

                so we can all agree that we need many specific cores with "custom accelerated opcodes".



                • #68
                  Originally posted by lkcl View Post

this was the hope that inspired Larrabee. they created an absolutely fantastic "Parallel Compute Engine". unfortunately, the GPU-level performance was so bad that the team was *not allowed* to publish the numbers.

                  Jeff Bush from Nyuzi had to research it, and i talked with him over the course of several months: we established that a software-only GPU - with no custom accelerated opcodes - would have only TWENTY FIVE percent of the performance of, say, a MALI 400, for the same silicon die area. that means that a comparably performing software-only GPU would require FOUR times the power (and die area).

                  obviously, that's not going to fly.

                  in speaking with Mitch Alsup on comp.arch i found out a little bit more about why this is. it turns out one of the reasons is that if you want a "fully accurate" IEEE754 FP unit, to get that extra 0.5 ULP (units in the last place), you need THREE TIMES the silicon area.

                  in a GPU you just don't care that much about accuracy, and that's why the Vulkan spec allows much less accurate answers for SQRT, RSQRT, SIN, COS, LOG etc.

                  basically there are areas where you are trading speed for accuracy, and these tend to conflict badly with the "accuracy" requirements of traditional "Compute" engines. we are kinda... loonies for even trying. however, if you look at the MIPS-3D ASE (you can still find it online), running instructions twice to get better accuracy is a known technique, and if we plan the ALUs in advance, we can "reprocess" intermediary results using microcoding, and serve *both* markets - GPU (less accurate, less time, less power) and IEEE754 (fully accurate, longer, more power).

You all tend to exclude neural-network techniques for managing the cores and threads, for energy and performance, as a winning part of the puzzle.

                  It is useless for me to try to explain how the wheel is made; there are those who have already done it and understand it much better than me.
                  This is an interesting video that tries to make explicit where the big players are moving:

                  The FUTURE of Computing Performance
                  https://youtu.be/3PjNgRWmv90



                  • #69
                    Originally posted by nokipaike View Post
You all tend to exclude neural-network techniques for managing the cores and threads, for energy and performance, as a winning part of the puzzle.
                    It is useless for me to try to explain how the wheel is made; there are those who have already done it and understand it much better than me.
                    This is an interesting video that tries to make explicit where the big players are moving:
                    The FUTURE of Computing Performance
                    https://youtu.be/3PjNgRWmv90
no one excludes anything... but for a first generation of open-source GPU, no techniques like neural networks are necessary.

