Apple M1 Open-Source GPU Bring-Up Sees An Early Triangle


  • #21
    Originally posted by rmfx View Post
    Good luck because that will be even more undocumented than Nvidia hardware.
    Actually, Apple has covered the GPU at its WWDC events. It isn’t a manual, or even detailed, but it is better than what Nvidia offers, which is largely nothing. So it is more of a tease of information, but every little bit helps.

    Comment


    • #22
      Originally posted by Alexmitter View Post
      It's a rather simple tile-based GPU, very restrictive but OK for the also very restrictive, backwards-thinking Metal API. Most likely based on the SGI GPUs they licensed for their SoCs.
      The only real benefit would be if that would jumpstart the PowerVR efforts.
      It’s a full fat TBDR GPU with some features not found anywhere else (tile shaders, image blocks etc). There is nothing ‘simple’ about it.

      Comment


      • #23
        Originally posted by Boland View Post

        It’s a full fat TBDR GPU with some features not found anywhere else (tile shaders, image blocks etc). There is nothing ‘simple’ about it.
        You can implement features from simple (Gouraud-shaded triangles are already working) to complex (like going upwards through OpenGL versions in Zink). You may never even reach the last features, but since Linux (its 3D APIs) doesn't use them, you will miss nothing (when running Linux).

        Comment


        • #24
          Originally posted by Ladis View Post

          You can implement features from simple (Gouraud-shaded triangles are already working) to complex (like going upwards through OpenGL versions in Zink). You may never even reach the last features, but since Linux (its 3D APIs) doesn't use them, you will miss nothing (when running Linux).
          Sure. But the person I was replying to was dismissing the GPU as ‘simple’ and ‘backward’ which is simply not true at all. It’s an incredibly fast/power efficient GPU design.

          It would be fantastic if it was supported on Linux.


          Comment


          • #25
            Originally posted by AdrianBc View Post
            Until recently it was believed that achieving a much higher IPC than in current CPUs will cost too much, so the development roadmaps of both Intel and AMD had relatively modest targets of increasing the IPC by only around 20% in each generation, e.g. Skylake => Ice Lake => Alder Lake or Zen 1 => Zen 2 => Zen 3.
            It is much harder for Intel and AMD to increase IPC. You see, in ARM64, every instruction is 32 bits (4 bytes). If you want to decode 8 instructions, you simply load 32 bytes at a time and send them to 8 decoders. Special care is needed for branches, but overall if you want to decode n instructions per clock, you just fetch n * 4 bytes and send them to n decoders. This is a linear problem.

            In x86, an instruction can be anywhere between 1 and 15 bytes long. If you want to decode 8 instructions, you would have to load up to 120 bytes. In order to tell where the second instruction is, you must first decode the first instruction. In order to tell where the third is, you need to decode the first and second. In order to tell where the 4th is, you must decode all prior 3 instructions. This is an exponential problem. In order to decode more than 4, you will need a slower clock and a lot more hardware. I could be wrong, but I believe they decode 4. They use many decoders for every possible position for the 2nd, 3rd, and 4th instructions and then discard the ones that don't pan out.
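
To make the boundary-finding argument above concrete, here is a rough sketch (not from the post). The `toy_length` rule is invented purely for illustration and is not real x86 encoding; the function names are hypothetical too. It only shows that fixed-width starts can be computed independently, while a naive variable-length scan has to visit the instructions in order.

```python
# Toy model only: the length rule below is an assumption for illustration,
# NOT the real x86 instruction format.

def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """Start offsets for a fixed-width ISA: each one is simply i * width."""
    return list(range(0, len(code) - width + 1, width))


def toy_length(first_byte: int) -> int:
    """Hypothetical length rule: low nibble of the first byte, clamped to 1..15."""
    return max(1, min(15, first_byte & 0x0F))


def variable_width_starts(code: bytes) -> list[int]:
    """Start offsets for a variable-length stream.

    Each start depends on the lengths of all earlier instructions,
    so this naive scan is inherently serial.
    """
    starts = []
    pos = 0
    while pos < len(code):
        starts.append(pos)
        pos += toy_length(code[pos])
    return starts
```

For a 32-byte fetch, `fixed_width_starts` yields eight offsets with no data dependency between them, whereas `variable_width_starts` cannot know the third start until it has looked at the first two instructions.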

            The only realistic way to improve IPC in x86 is at the micro-operations level. So the only realistic way to improve performance is to add more and more complex operations which translate to more micro-operations. This is what they have been doing for years. The challenge is that this only helps if compilers take advantage of new instructions.
            Last edited by paulpach; 24 January 2021, 08:05 AM.

            Comment


            • #26
              Originally posted by mdedetrich View Post
              While taking so little power that it's passively cooled?

              I think you are wrong on that one.
              Disable turbo boost on any of the 7nm 15 W SKUs from AMD, and the fans almost never spin up. Even with the 10nm Tiger Lake from Intel, with turbo boost off, the laptop is silent. A passive solution is certainly within the realm of possibility for the current crop of x86-64 parts.

              Comment


              • #27
                Originally posted by paulpach View Post
                It is much harder for Intel and AMD to increase IPC. You see, in ARM64, every instruction is 32 bits (4 bytes). If you want to decode 8 instructions, you simply load 32 bytes at a time and send them to 8 decoders. Special care is needed for branches, but overall if you want to decode n instructions per clock, you just fetch n * 4 bytes and send them to n decoders. This is a linear problem.

                In x86, an instruction can be anywhere between 1 and 15 bytes long. If you want to decode 8 instructions, you would have to load up to 120 bytes. In order to tell where the second instruction is, you must first decode the first instruction. In order to tell where the third is, you need to decode the first and second. In order to tell where the 4th is, you must decode all prior 3 instructions. This is an exponential problem. In order to decode more than 4, you will need a slower clock and a lot more hardware.
                Not sure I agree - yes decoding is more complex for x86-64 than for A64 but a few things to consider:

                - decoding more than 4 instructions/clock would definitely require more hardware but not necessarily a slower clock since AFAIK we tag instruction boundaries in $I which allows independent & parallel extraction & decoding... and even without that pre-tagging an extra pipeline stage would be used instead of a slower clock

                - largest possible x86-64 instructions can be very large, but *average* instruction size is very small, well under 4 bytes

                - most of the large x86-64 instructions are made large by use of immediate operands but A64 does not eliminate their need... A64 is limited to 8 or 16-bit immediates and up to 5 ARM64 instructions are required to accumulate a 64-bit immediate in a register and use it... so the "more instructions per clock" advantage goes away quickly

                - most execution happens out of the macro-op cache - we often clock-gate the instruction decoder off to reduce power - and macro-ops are already fixed length
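
As a rough illustration of the immediate-operand point in the list above (not from the post): one common way to build a 64-bit constant on A64 is a MOVZ followed by up to three MOVK instructions, each carrying a 16-bit chunk. The helper name `materialize_imm64` and the default register are hypothetical, and this sketch ignores the shorter encodings a real assembler would pick (MOVN, logical immediates, literal-pool loads).

```python
def materialize_imm64(value: int, reg: str = "x0") -> list[str]:
    """Return a MOVZ/MOVK sequence that builds a 64-bit constant in `reg`.

    Simplified sketch: only 16-bit chunks via MOVZ/MOVK are considered.
    """
    value &= 0xFFFFFFFFFFFFFFFF
    # Split the constant into four 16-bit chunks, lowest chunk first.
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk != 0]
    if not nonzero:
        return [f"movz {reg}, #0"]
    insns = []
    for i, (shift, chunk) in enumerate(nonzero):
        # MOVZ zeroes the rest of the register; MOVK keeps the other bits.
        op = "movz" if i == 0 else "movk"
        insns.append(f"{op} {reg}, #0x{chunk:x}, lsl #{shift}")
    return insns
```

A constant with all four chunks non-zero, such as `0x123456789ABCDEF0`, needs four instructions here, and a fifth to actually use the value, which matches the "up to 5" count in the post.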

                Originally posted by paulpach View Post
                I could be wrong, but I believe they decode 4. They use many decoders for every possible position for the 2nd, 3rd, and 4th instructions and then discard the ones that don't pan out.
                My understanding is that Intel uses speculative parallel decode (as you describe) but we pre-decode and tag instruction boundaries in the instruction cache instead. After reading from the instruction cache, extraction (aka "alignment") is a separate pipeline stage from decoding but extraction of each instruction is independent of the instructions before it.

                If you think of decoding as a three-stage activity (partial decode going into $I, extract & align coming out of $I, decode pre-aligned instructions) and combine that with the average x86 instruction being smaller than the average ARM instruction, I think that gives a better picture of the processing.
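
A toy sketch of the first two stages described above (not from the post, and not how any real core is implemented): boundaries are tagged once when a line enters the instruction cache, and extraction afterwards relies only on those tags. The `toy_length` rule and the function names are assumptions made up for illustration.

```python
def toy_length(first_byte: int) -> int:
    """Hypothetical length rule (not real x86): low nibble, clamped to 1..15."""
    return max(1, min(15, first_byte & 0x0F))


def predecode(line: bytes) -> list[bool]:
    """Stage 1, done once when the line is filled into the cache:
    walk the bytes serially and tag which offsets start an instruction."""
    marks = [False] * len(line)
    pos = 0
    while pos < len(line):
        marks[pos] = True
        pos += toy_length(line[pos])
    return marks


def extract(line: bytes, marks: list[bool]) -> list[bytes]:
    """Stage 2, on every fetch: slice out instructions using only the tags.

    Each slice needs just its own start and the next tagged offset, so no
    slice depends on decoding the instructions before it.
    """
    starts = [i for i, is_start in enumerate(marks) if is_start]
    ends = starts[1:] + [len(line)]
    return [line[s:e] for s, e in zip(starts, ends)]
```

The serial walk in `predecode` is paid once per cache fill, while `extract` runs on every fetch and could in principle be done for all slots at once.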

                Originally posted by paulpach View Post
                The only realistic way to improve IPC in x86 is at the micro-operations level. So the only realistic way to improve performance is to add more and more complex operations which translate to more micro-operations. This is what they have been doing for years. The challenge is that this only helps if compilers take advantage of new instructions.
                I don't think the macro-ops or ISA necessarily need to change - we just need fast/wide processing at the macro-op level which we have already - 8 fixed-length macro-ops per clock from the cache, although the next stage (dispatch) only currently issues 6 macro-ops per clock.

                I am a bit fuzzy on how large immediate operands are handled at a macro-op level - not sure if they are split into multiple smaller immediate operands to maintain fixed length or if they represent the only exception (albeit a clean one) to the fixed length rule.

                As execution pipelines continue to get wider the decoder width will need to increase at some point to keep up, but I don't think the kind of "wall" you are describing (limiting x86-64 to decoding 4 instructions per clock) actually exists.
                Last edited by bridgman; 25 January 2021, 11:56 AM.

                Comment


                • #28
                  Originally posted by paulpach View Post

                  It is much harder for Intel and AMD to increase IPC. You see, in ARM64, every instruction is 32 bits (4 bytes). If you want to decode 8 instructions, you simply load 32 bytes at a time and send them to 8 decoders. Special care is needed for branches, but overall if you want to decode n instructions per clock, you just fetch n * 4 bytes and send them to n decoders. This is a linear problem.

                  In x86, an instruction can be anywhere between 1 and 15 bytes long. If you want to decode 8 instructions, you would have to load up to 120 bytes. In order to tell where the second instruction is, you must first decode the first instruction. In order to tell where the third is, you need to decode the first and second. In order to tell where the 4th is, you must decode all prior 3 instructions. This is an exponential problem. In order to decode more than 4, you will need a slower clock and a lot more hardware. I could be wrong, but I believe they decode 4. They use many decoders for every possible position for the 2nd, 3rd, and 4th instructions and then discard the ones that don't pan out.

                  The only realistic way to improve IPC in x86 is at the micro-operations level. So the only realistic way to improve performance is to add more and more complex operations which translate to more micro-operations. This is what they have been doing for years. The challenge is that this only helps if compilers take advantage of new instructions.

                  You are of course right about the difficulty of decoding many Intel/AMD instructions simultaneously. There is no doubt that they will never be able to decode in parallel as many instructions as those who implement the ARM ISA.

                  Nevertheless, since Intel Sandy Bridge and AMD Zen 1, Intel & AMD have used the workaround of keeping the decoded instructions in a micro-operation cache. Whenever any code is executed for the second time, i.e. in all loops and procedures, which normally account for the majority of the execution time, they may execute as many instructions in parallel as any other CPU, regardless of the ISA, e.g. up to 8 instructions per cycle in the new Zen 3.

                  This workaround costs extra area & power for the complex x86 decoders and for the micro-op cache, which is larger than an instruction cache storing the same number of instructions. Nonetheless, Intel & AMD should be able to reach an IPC similar to Apple's once they enlarge all the out-of-order supporting structures in the core to sizes comparable with Apple's.
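
The micro-op cache idea described above can be sketched roughly as follows (not from the post). `ToyFrontEnd` and `run_loop` are hypothetical names, the cache is unbounded with no eviction, and the "decode" step is a stand-in string; the only point is that the expensive decoder runs once per address while repeated loop iterations are served from the cache.

```python
class ToyFrontEnd:
    """Counts how often the (expensive) decoder runs with a micro-op cache."""

    def __init__(self) -> None:
        self.uop_cache: dict[int, str] = {}  # assumption: unbounded, no eviction
        self.decodes = 0
        self.cache_hits = 0

    def fetch(self, address: int) -> str:
        """Return the micro-op for `address`, decoding only on a cache miss."""
        if address in self.uop_cache:
            self.cache_hits += 1
            return self.uop_cache[address]
        self.decodes += 1
        uop = f"uop@{address:#x}"  # stand-in for real decoded micro-ops
        self.uop_cache[address] = uop
        return uop


def run_loop(front_end: ToyFrontEnd, body: list[int], iterations: int) -> None:
    """Fetch every instruction address of a loop body `iterations` times."""
    for _ in range(iterations):
        for address in body:
            front_end.fetch(address)
```

Running a three-instruction loop body for 1000 iterations triggers three decodes in this model; every later fetch is a cache hit.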

                  Comment


                  • #29
                    Originally posted by AdrianBc View Post


                    When AMD introduces 5 nm CPUs, those will certainly be faster than whatever CPUs Apple has by then, but at the price of much higher power consumption, exactly as is the case today when comparing desktop Zen 3 CPUs with the Apple M1.

                    Despite their misleading claims during the M1 launch about Apple CPUs being the fastest, Apple will never make the fastest CPUs, because they would gain nothing by doing that.

                    Apple could have easily made a CPU much faster than M1 and much faster than any Intel/AMD, by just designing a larger chip with more cores.

                    However that would have meant having much larger manufacturing costs and a requirement for larger and more expensive cooling systems, both of which could only diminish Apple's profits without bringing them any new customer.

                    Apple CPUs consume less power at a given performance because they achieve that performance at a much lower clock frequency (two thirds) than Intel/AMD.

                    Intel Alder Lake and AMD Zen 4 might achieve an increase in instructions per clock of 20% over Tiger Lake and Zen 3, but that will not be enough to match the IPC of Apple M1, much less the IPC of its successor, so they will still have a lower energy efficiency, even if the top models will be faster than Apple's.


                    Only around 2023 does it become unpredictable which CPU will be the fastest and which will have the highest IPC, because there are no public details of the next-generation projects of Intel and AMD.

                    Until recently it was believed that achieving a much higher IPC than in current CPUs will cost too much, so the development roadmaps of both Intel and AMD had relatively modest targets of increasing the IPC by only around 20% in each generation, e.g. Skylake => Ice Lake => Alder Lake or Zen 1 => Zen 2 => Zen 3.

                    Now, after Apple has demonstrated that higher increases in IPC are possible at a reasonable cost, it is likely that both Intel and AMD have modified their design goals to be more ambitious, in order to catch up with Apple, but a couple of years might pass until a result will be seen.


                    Apple's technical achievement is impressive, but, unfortunately, except for their captive loyal customers, this achievement is worthless.

                    Unlike traditional computer companies, Apple does not publish anything about their processors. In the past, one could learn a lot from the articles published by IBM, Intel, AMD and many other companies that are less important today. No company publishes today as many technical details as they were publishing 10 years ago and far less than they were publishing 20 years ago. Nevertheless, they still publish information about the results of their research, while Apple does not publish anything useful. Whatever Apple might have discovered, they keep that jealously for themselves.

                    Moreover, Apple, despite contrary claims, does not sell any computer. Any Apple computer is not the property of its buyer, because Apple continues to be able to make decisions remotely about how the Apple computer may be used, e.g. whether to allow or not some programs to run. While I was satisfied with an Apple laptop that I had many years ago, when they did not have yet the restrictions of today, I will not buy again an Apple computer, because I use only computers that I own, i.e. which do only exactly what I tell them to do and nothing else.
                    Yeah...I wish I had done more research before taking the leap. First Apple computer. Great to look at; power consumption; performance. The only thing that I can't figure out now is what exactly to do with it. As it is right now, I basically paid $1900 for a chromebook; except with the chromebook, I have more free software options and potential use-cases at a tiny fraction of the cost.

                    Not sure if I should be announcing this in the open, but I had to let out the frustration and disappointment at least in a pseudo-anonymous way. You jump on the bandwagon too quickly, and chances are you'll find out that there's no way to steer.


                    FWIW, the one thing I did like is the improved performance/integration of CUPS. Although, that's apparently to be expected, as they purchased the source code.
                    Last edited by azdaha; 08 February 2021, 03:55 PM.

                    Comment
