Originally posted by tildearrow
And it's not black magic. Almost every aspect of what Apple does is known and understood (at least by some of us). The issue is that Apple was willing to implement this stuff.
Some of the basic techniques include:
- splitting what used to be a single object into multiple, optimally sized objects. Apple doesn't have a single "ROB"; it has one object that's somewhat like a ROB, but which points into multiple other objects. One of these is the History File (used to rewind register mappings during misspeculation recovery); another tracks branches; another tracks taken branches (the latter two are not yet fully understood).
- doing as much work as possible as EARLY as possible in the pipeline. Apple "executes" simple branches (unconditional, no link-register involvement) at DECODE time. It executes register moves and initializations at RENAME time. It even executes some loads at RENAME time. Intel started going down this path with the Stack Engine, which is vaguely the same idea -- but then they apparently lost interest and never took the idea further. Meanwhile, Apple has changed the technical details of how it implements these zero-cycle moves at least three times. (The original 2012 version was kinda lame, but did the job. The 2014 version is what I would have imagined as the best way to do it. The 2019 version -- damn, that's clever!)
- doing as much as possible as LATE as possible. Apple has multiple branch predictors (sure, everyone does), but it is willing to go much further down the in-order pipeline to flush misfetched instructions. Of course, if you can catch every misfetch while it's still IN-ORDER (and in-order extends all the way to Rename...) then you don't have to pay the cost of a flush, just a fetch resteer! Being willing to do this means your most sophisticated predictors can deliver a prediction as late as five cycles after the fastest predictors and still have value.
- splitting a single task into multiple subtasks, and doing as many of them as possible as early as possible.
Maybe ISA semantics mean I can't issue this load until some earlier serializing instruction has cleared. OK -- but I can still issue a prefetch for that load's address, so that when I DO execute the load it's lower latency...
Similar idea with the handling of special registers.
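The split-structure idea in the first bullet can be sketched in a toy simulator. Everything here (class names, fields, the recovery policy) is illustrative, not Apple's actual design; only the History File's role -- rewinding register mappings on misspeculation -- comes from the description above.

```python
class HistoryFile:
    """Side structure: logs old register mappings so rename state
    can be rewound on misspeculation."""
    def __init__(self):
        self.entries = []  # (arch_reg, old_phys_reg), oldest first

    def log(self, dest_reg, old_phys):
        self.entries.append((dest_reg, old_phys))
        return len(self.entries) - 1  # index kept in the central record

    def rewind(self, rename_table, from_index):
        # Undo mappings newest-first, back through from_index.
        while len(self.entries) > from_index:
            dest, old_phys = self.entries.pop()
            rename_table[dest] = old_phys


class RetireQueue:
    """Central 'ROB-like' record: one compact entry per instruction,
    pointing into side structures instead of storing everything inline.
    (A real design would also have branch-tracking side structures.)"""
    def __init__(self):
        self.history = HistoryFile()
        self.entries = []  # (pc, history_index) -- deliberately small

    def dispatch(self, pc, dest_reg, new_phys, rename_table):
        hidx = self.history.log(dest_reg, rename_table.get(dest_reg))
        rename_table[dest_reg] = new_phys
        self.entries.append((pc, hidx))

    def recover_to(self, entry_index, rename_table):
        # Misspeculation: rewind this entry and everything younger.
        _pc, hidx = self.entries[entry_index]
        self.history.rewind(rename_table, hidx)
        del self.entries[entry_index:]
```

On recovery, `recover_to` walks the History File backwards restoring each old mapping; the central entries stay small because the bulky per-instruction state lives in the side structures, each sized for its own job.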
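The rename-time moves in the second bullet are, generically, "move elimination": a register move is "executed" at rename by copying the physical-register mapping, consuming no execution slot. A minimal sketch -- the class and policy are assumptions for illustration, not Apple's actual mechanism (which, per the above, has changed several times):

```python
class Renamer:
    """Toy renamer that 'executes' register moves at RENAME time by
    copying the physical-register mapping (generic move elimination)."""
    def __init__(self):
        self.map = {}             # architectural reg -> physical reg
        self.next_phys = 0
        self.exec_slots_used = 0  # ops that needed a real execution slot

    def _alloc(self):
        name = f"p{self.next_phys}"
        self.next_phys += 1
        return name

    def rename(self, op, dst, src=None):
        if op == "mov" and src in self.map:
            # Zero-cycle move: dst simply shares src's physical
            # register. No new register, no execution slot, no latency.
            self.map[dst] = self.map[src]
            return
        # Ordinary op: allocate a fresh destination and execute it.
        self.map[dst] = self._alloc()
        self.exec_slots_used += 1
```

One design consequence: a shared physical register can't be freed until every architectural alias is overwritten, so real implementations need some form of reference tracking -- one reason the technique leaves room for repeated refinement.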
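The resteer-vs-flush tradeoff in the third bullet can be modeled in a few lines. All the numbers (in-order depth, penalties) are invented for illustration:

```python
# Toy model of catching a misfetch while it is still in-order: a slow
# but accurate predictor overrides a fast one some cycles after fetch.
# If the wrong-path instructions have not yet passed Rename, we just
# resteer fetch instead of paying for a full flush.

INORDER_DEPTH = 10   # cycles from Fetch through Rename (assumed)
RESTEER_COST = 3     # refetch from the corrected target (assumed)
FLUSH_COST = 14      # full misprediction flush (assumed)

def recovery_cost(override_delay):
    """Cost of correcting a fast-predictor mistake that a slower
    predictor catches `override_delay` cycles after fetch."""
    if override_delay <= INORDER_DEPTH:
        return RESTEER_COST   # wrong path still in-order: resteer only
    return FLUSH_COST         # wrong path already issued: flush
```

With these (made-up) numbers, a predictor delivering its verdict five cycles late still costs only a resteer -- which is why a slow-but-smart predictor retains value.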
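The early-prefetch idea in the last bullet can be sketched as a toy timing model: the load itself must wait for the serializing instruction, but a prefetch sent early warms the cache so the load hits when it finally executes. The `Cache` class and all latencies are assumptions, not measurements:

```python
MISS_LATENCY = 100   # cycles to fill a line from memory (assumed)
HIT_LATENCY = 4      # cycles for a cache hit (assumed)

class Cache:
    def __init__(self):
        self.lines = set()   # addresses currently resident
        self.inflight = {}   # addr -> cycle at which the line arrives

    def prefetch(self, addr, now):
        if addr not in self.lines:
            self.inflight[addr] = now + MISS_LATENCY

    def load_latency(self, addr, now):
        if addr in self.lines:
            return HIT_LATENCY
        arrival = self.inflight.get(addr)
        if arrival is not None and arrival <= now:
            self.lines.add(addr)  # prefetch already completed
            return HIT_LATENCY
        return MISS_LATENCY       # toy model: ignore partial overlap

cache = Cache()
cache.prefetch(0x1000, now=0)   # sent while the load is still blocked
barrier_clears = 120            # serializing instruction finally retires
lat = cache.load_latency(0x1000, now=barrier_clears)  # hits: cheap
```

The load's issue is delayed either way; the prefetch just overlaps the memory latency with the wait, so the exposed latency shrinks from a miss to a hit.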
Yes, Apple has large caches, wide pipelines, etc. These are all options open to Intel. But they're options that have not been exercised.
Why hasn't Intel gone down this path? Most of this technology was known fifteen years ago. Hell, much of it is based on research that INTEL sponsored...
Well, I have my theories.
But I think the main reason is this: look at the fury and denial whenever Apple's performance is brought up. If most of your customer base cares more about the brand name on the box than about actual performance, if they will go out of their way to avoid learning about alternatives (e.g. by shouting down any voices that try to explain what those alternatives are and how they work), then why bother working hard to make things better? The customer base has already told you they will buy whatever you ship -- and won't buy any alternative. Customers get the Intel they deserve. If customers prioritize the Intel brand (or the x86 brand), and naive GHz, over actual performance, they will get companies that take advantage of that fact.