Apple M1 Open-Source GPU Bring-Up Sees An Early Triangle
Originally posted by Alexmitter:
It's a rather simple tile-based GPU, very restrictive, but OK for the also very restrictive, backwards-looking Metal API. Most likely based on the PowerVR GPUs they licensed for their SoCs.
The only real benefit would be if it jumpstarted the PowerVR efforts.
Originally posted by Boland:
It's a full-fat TBDR GPU with some features not found anywhere else (tile shaders, image blocks, etc.). There is nothing 'simple' about it.
Originally posted by Ladis:
You can implement from simple (Gouraud-shaded triangles are already working) to complex (like climbing the OpenGL versions in Zink). You may never reach the last features, but since Linux (its 3D APIs) doesn't use them, you will miss nothing (when running Linux).
It would be fantastic if it was supported on Linux.
Originally posted by AdrianBc:
Until recently it was believed that achieving a much higher IPC than in current CPUs would cost too much, so the development roadmaps of both Intel and AMD had relatively modest targets of increasing IPC by only around 20% in each generation, e.g. Skylake => Ice Lake => Alder Lake or Zen 1 => Zen 2 => Zen 3.
In x86, an instruction can be anywhere between 1 byte to 15 bytes. If you want to decode 8 instructions, you would have to load up to 120 bytes. In order to tell where the second instruction is, you must first decode the first instruction. In order to tell where the third is, you need to decode the first and second. In order to tell where the 4th is, you must decode all prior 3 instructions. This is an exponential problem. In order to decode more than 4, you will need a slower clock and a lot more hardware. I could be wrong, but I believe they decode 4. They use many decoders for every possible position for the 2nd, 3rd, and 4th instructions and then discard the ones that don't pan out.
The only realistic way to improve IPC in x86 is at the micro-operations level. So the only realistic way to improve performance is to add more and more complex operations which translate to more micro-operations. This is what they have been doing for years. The challenge is that this only helps if compilers take advantage of new instructions.
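The boundary-finding argument above can be sketched in a few lines. This is a toy illustration only, not a real decoder: the one-byte "length field" rule is invented for the example, and real x86 length decoding is far messier. The point it shows is structural: fixed-length start offsets are known up front, while variable-length starts form a serial dependency chain.

```python
# Toy sketch (not a real decoder): contrast locating instruction
# boundaries in a fixed-length ISA vs a variable-length one.
# The length-encoding rule below is invented for illustration.

def fixed_length_starts(blob: bytes, width: int = 4) -> list[int]:
    # Every start offset is known up front: i * width.
    # All decoders can therefore work in parallel on independent slices.
    return [i * width for i in range(len(blob) // width)]

def variable_length_starts(blob: bytes) -> list[int]:
    # Pretend the first byte of each instruction encodes its length
    # (1..15 bytes). Finding instruction N requires decoding
    # instructions 1..N-1 first, so boundary discovery is serial.
    starts, pos = [], 0
    while pos < len(blob):
        starts.append(pos)
        length = (blob[pos] % 15) + 1  # invented length rule
        pos += length
    return starts

fixed = fixed_length_starts(bytes(32))  # 8 starts, no decoding needed
var = variable_length_starts(bytes([2, 0, 4, 0, 0, 0, 0]))
```
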
Originally posted by mdedetrich:
While taking so little power that it's passively cooled?
I think you are wrong on that one.
Originally posted by paulpach:
It is much harder for Intel and AMD to increase IPC. You see, in ARM64, every instruction is 32 bits (4 bytes). If you want to decode 8 instructions, you simply load 32 bytes at a time and send them to 8 decoders. Special care is needed for branches, but overall, if you want to decode n instructions per clock, you just fetch n * 4 bytes and send them to n decoders. This is a linear problem.
In x86, an instruction can be anywhere between 1 byte to 15 bytes. If you want to decode 8 instructions, you would have to load up to 120 bytes. In order to tell where the second instruction is, you must first decode the first instruction. In order to tell where the third is, you need to decode the first and second. In order to tell where the 4th is, you must decode all prior 3 instructions. This is an exponential problem. In order to decode more than 4, you will need a slower clock and a lot more hardware.
- decoding more than 4 instructions/clock would definitely require more hardware, but not necessarily a slower clock, since AFAIK we tag instruction boundaries in $I, which allows independent & parallel extraction & decoding... and even without that pre-tagging, an extra pipeline stage would be used instead of a slower clock
- largest possible x86-64 instructions can be very large, but *average* instruction size is very small, well under 4 bytes
- most of the large x86-64 instructions are made large by use of immediate operands but A64 does not eliminate their need... A64 is limited to 8 or 16-bit immediates and up to 5 ARM64 instructions are required to accumulate a 64-bit immediate in a register and use it... so the "more instructions per clock" advantage goes away quickly
- most execution happens out of the macro-op cache - we often clock-gate the instruction decoder off to reduce power - and macro-ops are already fixed length
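The large-immediate point in the list above can be made concrete. The sketch below models the semantics of the AArch64 MOVZ (move with zero) and MOVK (move with keep) instructions in Python; the register is just an integer here, and the example constant is arbitrary. Building a full 64-bit constant this way takes four instructions, plus one more to use it.

```python
# Sketch of how AArch64 builds a 64-bit constant from 16-bit pieces,
# modelling MOVZ (move with zero) and MOVK (move with keep) semantics.

def movz(imm16: int, shift: int) -> int:
    # Zero the register, then place the 16-bit immediate at the shift.
    return (imm16 & 0xFFFF) << shift

def movk(reg: int, imm16: int, shift: int) -> int:
    # Keep the rest of the register, replace one 16-bit slice.
    mask = 0xFFFF << shift
    return (reg & ~mask) | ((imm16 & 0xFFFF) << shift)

# Materialising 0x1234_5678_9ABC_DEF0 takes four instructions:
r = movz(0xDEF0, 0)
r = movk(r, 0x9ABC, 16)
r = movk(r, 0x5678, 32)
r = movk(r, 0x1234, 48)
assert r == 0x123456789ABCDEF0
```

By contrast, x86-64 can load the same constant with a single (10-byte) `mov r64, imm64`, which is the trade-off the bullet above is describing.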
Originally posted by paulpach:
I could be wrong, but I believe they decode 4. They use many decoders for every possible position for the 2nd, 3rd, and 4th instructions and then discard the ones that don't pan out.
If you think of decoding as a three-stage activity (partially decode going into $I, extract & align coming out of $I, decode pre-aligned instructions) and combine that with the average x86 instruction being smaller than the average ARM instruction, I think that gives a better picture of the processing.
Originally posted by paulpach:
The only realistic way to improve IPC in x86 is at the micro-operations level. So the only realistic way to improve performance is to add more and more complex operations which translate to more micro-operations. This is what they have been doing for years. The challenge is that this only helps if compilers take advantage of new instructions.
I am a bit fuzzy on how large immediate operands are handled at a macro-op level - not sure if they are split into multiple smaller immediate operands to maintain fixed length or if they represent the only exception (albeit a clean one) to the fixed length rule.
As execution pipelines continue to get wider the decoder width will need to increase at some point to keep up, but I don't think the kind of "wall" you are describing (limiting x86-64 to decoding 4 instructions per clock) actually exists.
Originally posted by paulpach:
It is much harder for intel and AMD to increase IPC. You see, in ARM64, every instruction is 32 bits (4 bytes). If you want to decode 8 instructions, you simply load 32 bytes at a time and send them to 8 decoders. Special care is needed for branches, but overall if you want to decode n instructions per clock, you just fetch n * 4 bytes and send to n decoders. This is a linear problem.
In x86, an instruction can be anywhere between 1 byte to 15 bytes. If you want to decode 8 instructions, you would have to load up to 120 bytes. In order to tell where the second instruction is, you must first decode the first instruction. In order to tell where the third is, you need to decode the first and second. In order to tell where the 4th is, you must decode all prior 3 instructions. This is an exponential problem. In order to decode more than 4, you will need a slower clock and a lot more hardware. I could be wrong, but I believe they decode 4. They use many decoders for every possible position for the 2nd, 3rd, and 4th instructions and then discard the ones that don't pan out.
The only realistic way to improve IPC in x86 is at the micro-operations level. So the only realistic way to improve performance is to add more and more complex operations which translate to more micro-operations. This is what they have been doing for years. The challenge is that this only helps if compilers take advantage of new instructions.
You are of course right about the difficulty of decoding many Intel/AMD instructions simultaneously. There is no doubt that they will never be able to decode as many instructions in parallel as CPUs that implement the ARM ISA.
Nevertheless, since Intel Sandy Bridge and AMD Zen 1, Intel & AMD have used the workaround of keeping decoded instructions in a micro-operation cache, so whenever code is executed a second time, i.e. in all the loops and procedures that normally account for the majority of execution time, they can execute as many instructions in parallel as any other CPU, regardless of ISA, e.g. up to 8 instructions per cycle in the new Zen 3.
This workaround carries an extra cost in the area & power required for the complex x86 decoders and for a micro-op cache that is larger than an instruction cache storing the same number of instructions, but Intel & AMD should nonetheless be able to reach an IPC similar to Apple's once they grow all the out-of-order supporting structures in the core to sizes comparable with Apple's.
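The micro-op cache workaround described above amounts to: pay the expensive variable-length decode once per instruction address, then serve hot loops from a cache of already-decoded, fixed-length micro-ops. A toy model (all names and the fake "decode" result are invented for illustration):

```python
# Toy model of a micro-op cache: the slow variable-length decode
# runs once per instruction address; loop iterations hit the cache.
# "decode" here is a placeholder, not real x86 semantics.

class UopCache:
    def __init__(self) -> None:
        self.cache: dict[int, tuple[str, ...]] = {}
        self.slow_decodes = 0

    def fetch_uops(self, addr: int) -> tuple[str, ...]:
        if addr not in self.cache:
            self.slow_decodes += 1               # legacy decoder path
            self.cache[addr] = (f"uop@{addr}",)  # pretend decode result
        return self.cache[addr]                  # fixed-length micro-ops

front_end = UopCache()
for _ in range(1000):            # a hot loop over 4 instruction addresses
    for addr in (0, 5, 8, 12):
        front_end.fetch_uops(addr)

assert front_end.slow_decodes == 4  # decoded once each, then cached
```

In this model 4000 fetches cost only 4 slow decodes, which is why loopy code largely hides the x86 decode penalty.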
Originally posted by AdrianBc:
When AMD introduces 5 nm CPUs, those will certainly be faster than whatever CPUs Apple has by then, but at the price of much higher power consumption, exactly as it is today when comparing desktop Zen 3 CPUs with the Apple M1.
Despite their misleading claims during the M1 launch about Apple CPUs being the fastest, Apple will never make the fastest CPUs, because they would gain nothing by doing so.
Apple could easily have made a CPU much faster than the M1 and much faster than any Intel/AMD part, simply by designing a larger chip with more cores.
However, that would have meant much higher manufacturing costs and a requirement for larger, more expensive cooling systems, both of which would only diminish Apple's profits without bringing them any new customers.
Apple CPUs consume less power at a given performance because they achieve that performance at a much lower clock frequency (about two thirds of Intel/AMD's).
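The frequency argument above can be put into rough numbers using the standard dynamic-power approximation P ≈ C · V² · f. The ratios below are illustrative only; real DVFS behaviour is more complicated than a single formula.

```python
# Back-of-envelope dynamic-power arithmetic (P ~ C * V^2 * f).
# The ratios are illustrative; real voltage/frequency curves vary.

def relative_dynamic_power(freq_ratio: float, volt_ratio: float) -> float:
    # Power relative to the baseline, for given frequency and
    # voltage ratios; capacitance C cancels out of the ratio.
    return freq_ratio * volt_ratio ** 2

# Running at two thirds of the clock, and assuming voltage can drop
# roughly in proportion, dynamic power falls to about 30%:
p = relative_dynamic_power(2 / 3, 2 / 3)   # 8/27, about 0.30
```

This is why hitting a performance target at a lower clock (with a wider core) is so much more power-efficient than clocking a narrower core higher.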
Intel Alder Lake and AMD Zen 4 might achieve a 20% increase in instructions per clock over Tiger Lake and Zen 3, but that will not be enough to match the IPC of the Apple M1, much less that of its successor, so they will still have lower energy efficiency, even if the top models are faster than Apple's.
Beyond around 2023 it is impossible to predict which CPU will be the fastest and which will have the highest IPC, because there are no public details about Intel's and AMD's next-generation projects.
Until recently it was believed that achieving a much higher IPC than in current CPUs would cost too much, so the development roadmaps of both Intel and AMD had relatively modest targets of increasing IPC by only around 20% in each generation, e.g. Skylake => Ice Lake => Alder Lake or Zen 1 => Zen 2 => Zen 3.
Now that Apple has demonstrated that larger IPC increases are possible at a reasonable cost, it is likely that both Intel and AMD have made their design goals more ambitious in order to catch up, but a couple of years may pass before the results are seen.
Apple's technical achievement is impressive, but, unfortunately, except for their captive loyal customers, this achievement is worthless.
Unlike traditional computer companies, Apple does not publish anything about their processors. In the past, one could learn a lot from the articles published by IBM, Intel, AMD and many other companies that are less important today. No company publishes as many technical details today as they did 10 years ago, and far fewer than 20 years ago. Nevertheless, they still publish information about the results of their research, while Apple publishes nothing useful. Whatever Apple might have discovered, they keep jealously to themselves.
Moreover, despite claims to the contrary, Apple does not sell computers. An Apple computer is not the property of its buyer, because Apple retains the ability to make decisions remotely about how it may be used, e.g. whether or not to allow some programs to run. While I was satisfied with an Apple laptop I had many years ago, before today's restrictions existed, I will not buy an Apple computer again, because I only use computers that I own, i.e. computers that do exactly what I tell them to do and nothing else.
Not sure if I should be announcing this in the open, but I had to let out the frustration and disappointment, at least in a pseudo-anonymous way. Jump on the bandwagon too quickly, and chances are you'll find out that there's no way to steer.
FWIW, the one thing I did like is the improved performance/integration of CUPS. Although that's apparently to be expected, as they purchased the source code.