Amazon Graviton3 vs. Intel Xeon vs. AMD EPYC Performance


  • mdedetrich
    replied
    NobodyXu

    In your quote from Wikipedia, you missed the second part of that sentence (bold for my emphasis):

    However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing throughput of computations per amount of hardware used.
    This refers exactly to my point, which is about making sure a CPU is constantly executing instructions rather than sitting around waiting.

    And yes, hiding memory latency is another reason why SMT is used, but I also explained earlier why that is less of an issue with Apple's M1, and it's also something Apple is not likely to change.



  • coder
    replied
    Originally posted by NobodyXu View Post
    Context: The "most current cases" here refers to intel and amd's x86 cpus, and "the other cases" refers to some researches.
    As pointed out earlier in the thread, there are other production implementations of SMT (several of them are even RISC!), some current:
    • IBM POWER 5+ (2-way)
    • IBM POWER 8+ (8-way)
    • Sun UltraSPARC T-series (4-way)
    • DEC Alpha EV8 (4-way)
    • Intel Itanium 9300+ (2-way)
    • MIPS I6400 & P6600 (2-way)
    • PEZY-SC2 (8-way) - there's another HPC example for you, mdedetrich - and this core was designed for HPC from the ground-up!
    • AFAIK, all current GPUs

    That covers basically all major ISA families of the past 2 decades, aside from RISC-V (which, probably not coincidentally, has also focused on low power). Even ARM is covered, if you count ARM-compatible server cores not designed by ARM itself.

    Originally posted by NobodyXu View Post
    SMT is used to hide memory latency.
    That's only one of the benefits it can provide.



  • coder
    replied
    Originally posted by mdedetrich View Post
    Except it's not speculation if you look at the facts on the ground
    Each design is the summation of a multitude of decisions, many interacting with the others. When you see the end product, you can't simply point to one design decision, in isolation, as if it proves a broader point. More data is needed to draw such a conclusion. All we can say is that the M1 apparently isn't suffering for lack of SMT. Those are the "facts on the ground". Any broader points are indeed speculation.

    Another "fact on the ground" is that the M1 Max has hit the scaling limit, in the Ultra. There aren't more interconnects for it to scale beyond 2 dies, and there's no support for external DRAM, which is needed for it to be a proper replacement for the Mac Pro. So, we don't know what happens when you try to scale up further. Maybe the Firestorm cores will be at a serious deficit, when latencies further increase due to more cores contending for access to external DRAM.

    Originally posted by mdedetrich View Post
    and also understand what specific problem SMT is solving. SMT solves the problem of not being able to (almost) fully utilize a single core by creating the concept of multiple (currently and typically 2) virtual cores that are multiplexed onto a single real core.
    That's a reduction of the broader set of reasons why you get under-utilization (see the sketch after this list):
    • Code with poor ILP
    • Code with high branch-density
    • Code with erratic branch behavior, leading to many mis-predictions
    • Memory latency
    • Front end bottlenecks
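
    To make the first bullet concrete, here's a quick C sketch (purely illustrative, not a benchmark from this thread): the first loop is one long dependency chain, so a wide core has empty issue slots every cycle; the second exposes four independent chains, the kind of work an out-of-order engine, or a sibling SMT thread, can use to fill those slots.

        #include <stdio.h>

        #define N 100000000L

        int main(void)
        {
            double a = 1.0, b = 1.0, c = 1.0, d = 1.0;

            /* Poor ILP: each multiply depends on the previous one, so the
             * core can retire only one multiply per chain latency. */
            for (long i = 0; i < N; i++)
                a *= 1.0000001;

            /* Better ILP: four independent chains can issue in parallel,
             * filling execution ports the serial loop leaves idle -- the
             * same idle slots SMT would hand to another thread. */
            for (long i = 0; i < N; i++) {
                a *= 1.0000001;
                b *= 1.0000002;
                c *= 1.0000003;
                d *= 1.0000004;
            }

            /* Print the results so the compiler can't delete the loops. */
            printf("%g %g %g %g\n", a, b, c, d);
            return 0;
        }

    Compiled with -O2 (and without -ffast-math, so the compiler can't reassociate the FP chains), the second loop does 4x the work of the first in roughly the same wall time on any reasonably wide core.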

    In-package memory and the AArch64 ISA only address the last two of those points. More sophisticated branch prediction and a larger reorder buffer can chip away at the first two, but there are ultimately limits to what you can achieve.

    At some point, you hit diminishing returns by merely scanning & tracking code for dependencies. The nice thing about SMT is that it scales well. This is exactly why & how GPUs use it. In order to scale up, you need to keep the compute units simple and small, and that wouldn't happen if you integrated big & complex out-of-order machinery.

    As an example, we can look at how poorly Xeon Phi fared against GPUs. Its Silvermont Atom cores were updated to use SMT-4 and AVX-512. I can't say the SMT was insufficient, but it was still far less than what its competitors used. Maybe if their cores weren't out-of-order, they could've put the same area into making them SMT-8 and gotten more mileage out of them. Or just added even more of them.

    Originally posted by mdedetrich View Post
    If you look at the single-core benchmarks for the current Apple M1 cores (and since some time has passed, they are quite comprehensive), the performance of the Apple M1 completely blows any competition out of the water,
    But it's also competing on an uneven playing field, if you're interested in a true comparison of the Firestorm microarchitecture. For that, you'd want both CPUs to have similar memory configurations and to be fabricated on similar process nodes.

    Also, you don't know how efficient its pipeline utilization is. Maybe there's untapped potential, in some of those benchmarks, that SMT could unlock. Without this knowledge, you can't say SMT wouldn't be a further asset to the core. If you're trying to reach some broader conclusions about SMT, you can't do it without that data.

    Final point, here is that your data is obsolete. Alder Lake performs well against the M1 Max, even while being on an inferior process node and having external DRAM. So does Zen3, for that matter, but less so & not in floating point.



    Originally posted by mdedetrich View Post
    The only real things specific to Apple (compared to x86-64) are
    • SoC layout (and hence the improved memory)
    • AArch64 ISA (which provides a lot of pipelining improvements that improve single-core IPC).
    That's not a remotely complete list of everything the M1's Firestorm cores do, nor does it include the relative sizes of structures like the reorder buffer.

    Originally posted by mdedetrich View Post
    You mentioned before that mobile devices typically have lower performance because they are trying to be power efficient, but such devices do not have such high IPC
    Those tend to have worse IPC than desktop/server CPUs because they're also generally more cost-constrained than desktop cores. Apple is somewhat of an exception, because it can charge more for its phones than just about anyone else. Sure, there are some expensive phones made by others, but they don't sell in the volume that comparable iPhones do.

    Originally posted by mdedetrich View Post
    the Apple Firestorm cores are an exception here since they have ridiculously high single-core IPC. This is why classing them as a "mobile SKU" is misleading
    It's not misleading because, like all other mobile-first CPU cores, they need to prioritize perf/W above all else. That's a simple fact. Apple can't afford to do anything with them that benefits performance at the expense of perf/W.

    Originally posted by mdedetrich View Post
    If they want more concurrency (which is what SMT actually provides) they can just add more cores, since their single-core IPC is already so high.
    Performance doesn't scale linearly with core count. The more cores you have, the higher your latencies become, and the ability of the reorder buffer to hide them will be exceeded. And unlike ARM's Neoverse cores, Apple's cores are pretty big. So, it can't just compensate by adding more cores than anyone else, such as we saw in Ampere's Altra CPUs.
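
    For intuition on the sub-linear scaling (a textbook identity, not data from this thread): even before the latency effects described above, any serial or contended fraction s of a workload caps the speedup from n cores, per Amdahl's law:

        S(n) = \frac{1}{s + \frac{1 - s}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{s}

    For example, with s = 5% the speedup from 16 cores is only about 9.1x, and no core count gets past 20x; the growing memory latency described above effectively inflates s even further.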

    Originally posted by mdedetrich View Post
    This is the last time I am going to respond to this thread because we are going around in circles,
    I agree that it's pointless to continue, if you're resistant to taking onboard new information.

    Originally posted by mdedetrich View Post
    if you want you can bookmark this post, because I still stand by my statement: as long as Apple doesn't completely change their architecture or use a different ISA, I can fairly safely say that for the next 5-10 years (at least) they are not going to use SMT
    It just makes me feel sad to see someone be so self-assured on the basis of so little information.

    If you'd at least run some performance analysis of the actual vs. theoretical throughput of Apple's cores, then you could actually say something about whether they indeed have untapped potential. Even then, it would still be a statement about that specific implementation, but it'd be better-informed than what you've used as evidence so far. As it stands, all you have is a system-level performance comparison which includes many other variables. It cannot be taken as an absolute statement about SMT, or even about SMT as applied to AArch64 CPUs.
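
    As a sketch of the kind of measurement meant here (the 8-wide, 3.2 GHz "theoretical" figures are stand-in assumptions, roughly what reviewers have reported for Firestorm, not vendor-confirmed; a real analysis would read hardware performance counters, e.g. a profiler's instructions-per-cycle figure, rather than wall-clock time):

        #include <stdio.h>
        #include <time.h>

        #define ITERS 100000000L

        int main(void)
        {
            long x0 = 0, x1 = 0, x2 = 0, x3 = 0;
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < ITERS; i++) {
                /* roughly 4 independent adds + increment + branch per pass */
                x0 += i; x1 += i; x2 += i; x3 += i;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs  = (t1.tv_sec - t0.tv_sec)
                         + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            double insns = 6.0 * ITERS;   /* crude instruction estimate  */
            double peak  = 8.0 * 3.2e9;   /* ASSUMED issue width * clock */

            printf("sum=%ld achieved ~%.1f vs assumed peak %.1f Ginsn/s (%.0f%%)\n",
                   x0 + x1 + x2 + x3, insns / secs / 1e9, peak / 1e9,
                   100.0 * (insns / secs) / peak);
            return 0;
        }

    A large gap between achieved and theoretical throughput on real workloads is exactly the headroom SMT could exploit; a consistently small gap would support the no-SMT-needed position.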

    Originally posted by mdedetrich View Post
    if you look at the HPC/supercomputer/server space that uses ARM, none of their systems as far as I am aware have SMT either
    Are there any examples besides the A64FX? Being a green-oriented government project, they appear to have focused on optimizing perf/W and just shoveled boatloads of money into scaling it up large enough to reach the top of the list.



  • NobodyXu
    replied
    Originally posted by mdedetrich View Post

    Except it's not speculation if you look at the facts on the ground and also understand what specific problem SMT is solving. SMT solves the problem of not being able to (almost) fully utilize a single core by creating the concept of multiple (currently and typically 2) virtual cores that are multiplexed onto a single real core.

    If you look at the single-core benchmarks for the current Apple M1 cores (and since some time has passed, they are quite comprehensive), the performance of the Apple M1 completely blows any competition out of the water; they were even competing against desktop SKUs (which is kind of ridiculous). So what are the possible reasons why the single-core performance is so good?
    • They are clocked a lot higher. As you pointed out, this isn't the case; they are actually clocked lower compared to other CPUs (boosting up to 3.2 GHz).
    • IPC for a single core is much higher than the competition, which is the case.
    Since it's clear that their IPC is so high for a single core (i.e. no hardware concurrency/parallelism), Apple has reached this single-core IPC with a combination of the various techniques that have been mentioned before (branch prediction, OoO execution, caching); however, none of these techniques are specific to the M1. The only real things specific to Apple (compared to x86-64) are
    • SoC layout (and hence the improved memory)
    • AArch64 ISA (which provides a lot of pipelining improvements that improve single-core IPC).
    With such high single-core IPC, SMT provides no benefit as a tool; saying different techniques have "pros and cons" is just handwaving and dismissing critical details. You mentioned before that mobile devices typically have lower performance because they are trying to be power efficient, but such devices do not have such high IPC (that's one of the reasons why they use less power); the Apple Firestorm cores, however, are an exception here, since they have ridiculously high single-core IPC. This is why classing them as a "mobile SKU" is misleading: they are not like any other mobile SKU in the conventional sense. If you were talking about some random Qualcomm Android SoC SKU, you might have a point.

    So with all of this, and assuming that Apple doesn't lower their single-core IPC in a future architecture, which is generally the complete opposite of what CPU designers want to do (unless they somehow manage to increase their clock speeds to ridiculous levels, but with current materials science and CPU design, good luck conventionally cooling a 6 GHz+ CPU), SMT by design gives them almost no benefit. If they want more concurrency (which is what SMT actually provides), they can just add more cores, since their single-core IPC is already so high. If Apple's M1 cores didn't have such high single-core IPC I wouldn't be saying this, but that's not the case.

    This is the last time I am going to respond to this thread, because we are going around in circles, but if you want you can bookmark this post, because I still stand by my statement: as long as Apple doesn't completely change their architecture or use a different ISA, I can fairly safely say that for the next 5-10 years (at least) they are not going to use SMT, even for their high-end desktop Mac Pro when it gets released with their CPUs rather than Intel's. And to rub it in even more: if you look at the HPC/supercomputer/server space that uses ARM, none of their systems, as far as I am aware, have SMT either, and in that space performance is the top priority; since all of the programs running in these spaces are massively concurrent (either they run single multithreaded programs, or they run many programs at once, or both), they would definitely use SMT if it were beneficial.

    The only ARM SKU that I am aware of that has SMT is the Cortex-A65AE, but I am not aware of anyone actually building and then using these SKUs (can someone fill me in here?)
    According to Wikipedia https://en.m.wikipedia.org/wiki/Simu...multithreading :

    Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPU
    and

    However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing throughput of computations per amount of hardware used.
    Context: The "most current cases" here refers to intel and amd's x86 cpus, and "the other cases" refers to some researches.

    I fully agree with what Wikipedia says.

    SMT is used to hide memory latency.

    If an instruction from one of the processes stalls, the CPU core can just fetch and execute one from another process.
    This has nothing to do with IPC; it's just there to hide memory latency.
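
    A toy version of that mechanism (a sketch; thread-to-core pinning is OS-specific and omitted): each thread below is a random pointer chase whose loads mostly miss cache, so a single thread leaves the core stalled most of the time, and on an SMT core the second hardware thread can execute during those stalls.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define NODES (1L << 24)   /* 128 MiB of indices: far bigger than any cache */

        static long *chain;

        static void *chase(void *arg)
        {
            (void)arg;
            long idx = 0;
            for (long i = 0; i < 10000000L; i++)
                idx = chain[idx];   /* each load depends on the last: pure latency */
            return (void *)idx;
        }

        int main(void)
        {
            chain = malloc(NODES * sizeof(long));
            if (!chain)
                return 1;
            for (long i = 0; i < NODES; i++)
                chain[i] = i;
            /* Fisher-Yates shuffle so the chase walks memory randomly. */
            for (long i = NODES - 1; i > 0; i--) {
                long j = rand() % (i + 1);
                long t = chain[i]; chain[i] = chain[j]; chain[j] = t;
            }

            pthread_t a, b;
            pthread_create(&a, NULL, chase, NULL);
            pthread_create(&b, NULL, chase, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            puts("done");
            return 0;
        }

    If the two threads land on the SMT siblings of one physical core, their stalls interleave and combined throughput roughly doubles compared with running them back to back; on a core without SMT you need a second physical core, or a deep enough OoO window plus prefetchers, to get the same effect.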

    I suspect that the reason Apple doesn't use SMT on the M1 is because they put the memory straight onto the SoC using a system-in-a-package design.
    By putting the CPU/GPU and RAM next to each other, they can significantly reduce latency on the interconnect between the RAM modules and the CPU package.

    Plus, they also use LPDDR4X, which performs better than DDR4 while drawing less power.

    Also this:


    unusually large 192 KB of L1 instruction cache and 128 KB of L1 data cache
    With reduced memory latency, an unusually large L1 instruction cache, and an effective OoO engine, Apple might achieve a pretty good way of hiding memory latency, and thus not need SMT in the M1.
    Last edited by NobodyXu; 03 June 2022, 10:28 AM.



  • mdedetrich
    replied
    Originally posted by coder View Post
    This is a basic flaw in your argumentation. SMT is a design feature that has benefits and drawbacks. Apple had certain goals for their Firestorm cores, first among which seemed to be maximizing perf/W, because its primary applications were phones, tablets, and laptops. While the M1 Ultra isn't a laptop CPU, it's also not a high-volume part compared with the rest of the lot. Furthermore, we don't know if it was a factor in the design of the Firestorm, or if Apple's decision to make the Ultra came only after they had enough experience to believe building such a product made sense.

    So, if SMT had little benefit in low core-count applications and didn't pull its weight on the perf/W front, that would be reasonable grounds for them not to use it.

    Unfortunately, we can only speculate. We don't know the real reason(s) they haven't used it. Furthermore, you can't divorce the decision from the context. And the context here is a phone-oriented core - yes, one which they had bigger ambitions for, but not at the expense of its initial/primary application.
    Except it's not speculation if you look at the facts on the ground and also understand what specific problem SMT is solving. SMT solves the problem of not being able to (almost) fully utilize a single core by creating the concept of multiple (currently and typically 2) virtual cores that are multiplexed onto a single real core.

    If you look at the single-core benchmarks for the current Apple M1 cores (and since some time has passed, they are quite comprehensive), the performance of the Apple M1 completely blows any competition out of the water; they were even competing against desktop SKUs (which is kind of ridiculous). So what are the possible reasons why the single-core performance is so good?
    • They are clocked a lot higher. As you pointed out, this isn't the case; they are actually clocked lower compared to other CPUs (boosting up to 3.2 GHz).
    • IPC for a single core is much higher than the competition, which is the case.
    Since it's clear that their IPC is so high for a single core (i.e. no hardware concurrency/parallelism), Apple has reached this single-core IPC with a combination of the various techniques that have been mentioned before (branch prediction, OoO execution, caching); however, none of these techniques are specific to the M1. The only real things specific to Apple (compared to x86-64) are
    • SoC layout (and hence the improved memory)
    • AArch64 ISA (which provides a lot of pipelining improvements that improve single-core IPC).
    With such high single-core IPC, SMT provides no benefit as a tool; saying different techniques have "pros and cons" is just handwaving and dismissing critical details. You mentioned before that mobile devices typically have lower performance because they are trying to be power efficient, but such devices do not have such high IPC (that's one of the reasons why they use less power); the Apple Firestorm cores, however, are an exception here, since they have ridiculously high single-core IPC. This is why classing them as a "mobile SKU" is misleading: they are not like any other mobile SKU in the conventional sense. If you were talking about some random Qualcomm Android SoC SKU, you might have a point.

    So with all of this, and assuming that Apple doesn't lower their single-core IPC in a future architecture, which is generally the complete opposite of what CPU designers want to do (unless they somehow manage to increase their clock speeds to ridiculous levels, but with current materials science and CPU design, good luck conventionally cooling a 6 GHz+ CPU), SMT by design gives them almost no benefit. If they want more concurrency (which is what SMT actually provides), they can just add more cores, since their single-core IPC is already so high. If Apple's M1 cores didn't have such high single-core IPC I wouldn't be saying this, but that's not the case.

    This is the last time I am going to respond to this thread, because we are going around in circles, but if you want you can bookmark this post, because I still stand by my statement: as long as Apple doesn't completely change their architecture or use a different ISA, I can fairly safely say that for the next 5-10 years (at least) they are not going to use SMT, even for their high-end desktop Mac Pro when it gets released with their CPUs rather than Intel's. And to rub it in even more: if you look at the HPC/supercomputer/server space that uses ARM, none of their systems, as far as I am aware, have SMT either, and in that space performance is the top priority; since all of the programs running in these spaces are massively concurrent (either they run single multithreaded programs, or they run many programs at once, or both), they would definitely use SMT if it were beneficial.

    The only ARM SKU that I am aware of that has SMT is the Cortex-A65AE, but I am not aware of anyone actually building and then using these SKUs (can someone fill me in here?)
    Last edited by mdedetrich; 03 June 2022, 05:00 AM.



  • coder
    replied
    Originally posted by bridgman View Post
    I don't think the Silvermont decision re: SMT is applicable to this discussion
    It is, to the extent that it's a mobile x86 core. The theme being that mobile cores are the ones lacking SMT.

    Moreover, when Intel reused Silvermont for the second-gen Xeon Phi (KNL), they re-added SMT-4! That move is in line with the notion that SMT pulls more weight in higher core-count applications, which isn't what Apple's cores are optimized for.

    Originally posted by bridgman View Post
    Silvermont was about going from an in-order core to an out-of-order core and using out-of-order execution to help mask memory latency rather than SMT.
    Uhh... they don't say how many entries its reorder buffer had, but I'm sure it wasn't big enough to cover more than an L1 miss. However, don't forget that modern CPUs have hardware prefetchers, which can help reduce the frequency of L2 misses. This probably factored into their decision to drop SMT.

    I think Silvermont's OoO move wasn't only about solving a single problem. It should've been doing double-duty, covering L1 misses as well as getting better utilization of what limited ALU resources it did have.

    Originally posted by bridgman View Post
    P4 (which has comparable width), where enabling HT was a hit-and-miss thing in terms of performance.
    Ah, the Pentium 4. Among its problems with HT was the lack of any mechanism to keep the threads from thrashing each others' cache. When Intel brought back HT (after leaving it out of the Pentium M and Core/Core 2 products), this is one of the aspects they addressed.

    Originally posted by bridgman View Post
    My recollection is that a typical OOO core usually does a better job of dealing with memory latency than SMT does,
    Depends on how much, right? The cool thing about SMT is that if one thread has sufficiently high locality, it can cover a virtually infinite amount of latency seen by the other(s).

    Again, we need look only at GPUs to see how effective SMT can be at hiding latency. They certainly have the widest SMT implementations of any hardware today, and virtually no reason for it other than latency-hiding. Intel's Gen9 iGPUs had 7-way, GCN supported 64 wavefronts per CU, and Nvidia has supported between 32 and 64 warps per SM. IDK how many Xe supports.



  • coder
    replied
    Originally posted by mdedetrich View Post
    I didn't say they need to, nor did I say that it's always the case
    That response feels a bit insincere, given that it's been a recurring theme of your posts.

    Originally posted by mdedetrich View Post
    This was true until Apple released the M1 cores for laptops, and now their desktops (or mini PCs). I mean, the M1 Pro goes up to 3.2 GHz, which for a laptop is fairly on par.
    5+ years ago, 3.2 GHz peak clocks might've been "on par" for a premium laptop, but no more. And you're disregarding that Apple is clocking lower on a newer process node than either AMD or Intel is using, which makes it even more of an outlier.

    So, yes 3.2 GHz is low, and it's low for a reason. It's low because Apple can afford the extra die space on a wider core, which is (by no coincidence) the way to maximize perf/W.
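
    The nonlinearity behind that trade-off is the standard CMOS dynamic-power relation (textbook scaling, not a figure from this thread):

        P_{dyn} \approx \alpha C V^2 f, \qquad V \propto f \;\Rightarrow\; P_{dyn} \propto f^3

    Since supply voltage has to rise roughly with frequency, power grows close to the cube of the clock, so trading a little clock speed for a wider core recovers the performance at a fraction of the power. That's the wide-and-slow sweet spot for perf/W.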

    Originally posted by mdedetrich View Post
    then get into the fallacy of directly comparing clock speed between different architectures,
    It's a fallacy only if it's used as a proxy for performance. When trying to understand the design decisions made in the CPUs, it's a relevant consideration.

    Originally posted by mdedetrich View Post
    In any case, people need to stop pushing the sentiment that the M1 is a "mobile phone CPU that has lower clocks", because it's not.
    That's a mis-characterization of what I said. Fact: it uses the same Firestorm cores as their A14 phone SoC.

    Originally posted by mdedetrich View Post
    Its clock speed is already well past the "mobile" range.
    Clearly, you haven't been keeping up on phone SoCs. Competing chips, made on a similar process node, run at similar peak clocks.

    Vendor     SoC              Mfg. Process   Peak Clock (GHz)
    Apple      A14              TSMC N5        2.998
    Samsung    Exynos 2100      Samsung 5LPE   2.91
    Mediatek   Dimensity 1200   TSMC N6        3.0

    Originally posted by mdedetrich View Post
    The M1 chip was deliberately designed for high-power products (i.e. pro laptops and faster) and is loosely based on their A-series architecture; it's not like Apple just shoved a mobile SKU into their laptops.
    You're confusing the core IP with the SoC. The Firestorm cores, used in their M1 products, were taken from their A14 phone SoC.

    It's sounding like you know a lot less about the M1 than you seem to think.

    Originally posted by mdedetrich View Post
    This is the basic flaw in your argumentation: if SMT were as beneficial as you implied (relative to die tax and other factors), Apple would have done it.
    This is a basic flaw in your argumentation. SMT is a design feature that has benefits and drawbacks. Apple had certain goals for their Firestorm cores, first among which seemed to be maximizing perf/W, because its primary applications were phones, tablets, and laptops. While the M1 Ultra isn't a laptop CPU, it's also not a high-volume part compared with the rest of the lot. Furthermore, we don't know if it was a factor in the design of the Firestorm, or if Apple's decision to make the Ultra came only after they had enough experience to believe building such a product made sense.

    So, if SMT had little benefit in low core-count applications and didn't pull its weight on the perf/W front, that would be reasonable grounds for them not to use it.

    Unfortunately, we can only speculate. We don't know the real reason(s) they haven't used it. Furthermore, you can't divorce the decision from the context. And the context here is a phone-oriented core - yes, one which they had bigger ambitions for, but not at the expense of its initial/primary application.

    Originally posted by mdedetrich View Post
    I said this before and I'll say it again: Apple's M1 target is desktop class, not mobile.
    That flies in the face of their sales volume, which is disproportionately biased towards laptops. The Firestorm and the M1 CPUs could not afford to pursue performance at the expense of power efficiency.

    Apple surely knows that delivering a knock-out desktop computer while launching an inaugural ARM-based laptop that overheats, thermally-throttles, and chews through battery charge is entirely counterproductive. They needed to deliver a successful laptop, while continuing their success in phones. Their desktop ambitions, if anything, were a stretch goal.

    Originally posted by mdedetrich View Post
    5% is actually quite a bit.
    If the performance benefit is significantly larger than that (and you aren't prioritizing perf/W above all else), then it's still an easy decision.

    Originally posted by mdedetrich View Post
    Let me put it a different way: that figure is large enough that the primary reason for excluding SMT on Silvermont was that a 5% die-size cost is significant. See https://www.anandtech.com/show/6936/...about-mobile/3
    Silvermont is such a simple core that the relative overhead of HT would've surely been larger. However, you're misquoting the article, which actually confirms what I've been saying about SMT having a net perf/W penalty. Silvermont needed to be efficient, because its applications included phones, with a few actually seeing the light of day before Intel pulled the plug on its phone ambitions. It even got into the first-gen MS HoloLens.

    What they said about area is that HT had a similar footprint to Silvermont's reorder buffer, although they don't say how many entries it had.

    Originally posted by mdedetrich View Post
    The amount of die space taken up by SMT is proportional to how many cores you have, so more cores means more die space taken up if you implement SMT. That can add up
    Except that the die area of an entire CPU or SoC is much larger than just the cores. So, while we might be talking 5% per core, the figure is likely down to 2-3% for a server SoC, or much lower for mobile (where the CPU cores occupy a shrinking minority of the entire die).
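
    Spelling out that arithmetic (the core-area fractions here are illustrative assumptions, not die-shot measurements):

        \text{SoC-level cost} = 0.05 \times \frac{A_{cores}}{A_{die}}, \qquad 0.05 \times 0.5 = 2.5\%, \quad 0.05 \times 0.2 = 1\%

    i.e. even at ~5% per core, a server die where cores are half the area pays ~2.5% overall, and a mobile SoC where they're a fifth pays ~1%.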

    Originally posted by mdedetrich View Post
    , so as I said before it's much wiser for Apple to use that die space for something else.
    If you think a couple % larger caches are going to deliver performance benefits on par with SMT, the data we've already seen would seem to contradict that.

    And, as I said before, Apple doesn't seem to be very concerned about minimizing die area. Much less than any of its competitors, for obvious reasons I shouldn't need to repeat.

    Originally posted by mdedetrich View Post
    When it comes to CPUs, anything is on the table if it brings good enough performance. As said before, if SMT was as good as you are implying, Apple would have put it in.
    Not if it hurts perf/W, as mentioned in that Silvermont article.

    Originally posted by mdedetrich View Post
    When you are a CPU company, you never rule anything out unless you have a very good reason and in Apple's case with how many resources they have, if SMT was beneficial they would have done it.
    Apple is not a CPU company. They're a products company. Their CPUs are tied to their product ambitions, which currently are centered around phones and laptops.

    Originally posted by mdedetrich View Post
    I don't know what point you are making here,
    The point I was making is that maybe the reason they moved the LPDDR5 in-package is precisely because the latency of keeping it external was more than Firestorm cores could cope with. We don't know which decision came first. That's why it's not very informative and why you can't transplant their design decisions to another context.
    Last edited by coder; 02 June 2022, 11:16 PM.



  • bridgman
    replied
    Originally posted by mdedetrich View Post
    5% is actually quite a bit. Let me put it a different way: that figure is large enough that the primary reason for excluding SMT on Silvermont was that a 5% die-size cost is significant. See https://www.anandtech.com/show/6936/...about-mobile/3
    I don't think the Silvermont decision re: SMT is applicable to this discussion - previous Atom cores were in-order with SMT, but the purpose of SMT there was to help deal with memory latency. Silvermont is only 2-wide, i.e. not enough execution resources to get a real performance gain from having a second thread use resources that the first thread could not keep busy; you really need a wide core to benefit from SMT.

    Silvermont was about going from an in-order core to an out-of-order core and using out-of-order execution, rather than SMT, to help mask memory latency. Keeping SMT on a 2-wide execution back end would probably have delivered results similar to the P4 (which has comparable width), where enabling HT was a hit-and-miss thing in terms of performance.

    My recollection is that a typical OOO core usually does a better job of dealing with memory latency than SMT does, but I don't think that is always the case. I don't remember any good studies of latency tolerance for SMT/in-order vs no-SMT/OOO off the top of my head but I'm sure they exist.

    coder there you go
    Last edited by bridgman; 02 June 2022, 02:59 PM.



  • mdedetrich
    replied
    Originally posted by coder View Post
    Not sure if you read my previous reply, but I think you're mistaken in assuming that modern x86 CPUs need to execute old programs as efficiently as new ones.

    I happen to know of a specific example, where Skylake reduced MMX performance by approximately half, relative to the generation before it. They did it and nobody complained, because the old software that used MMX already ran fast enough, and newer software is using SSE2 or AVX2.
    I didn't say they need to, nor did I say that it's always the case

    Originally posted by coder View Post
    You've got that backwards. Mobile-oriented cores need to run at lower clock speeds, because power tends to scale nonlinearly with clock speed. Therefore, they need to rely more heavily on IPC for performance, than desktop-oriented cores.

    In Apple's case, vertical integration and their focus on higher-priced products with generous margins also lets them worry less about silicon area than Intel and AMD can afford to do. Intel and AMD (and, to some extent, ARM proper) are all trying to optimize performance per mm^2 of die area, because die area determines price, which is a primary concern of their customers. This leads to smaller cores that rely more on clock speed, for delivering performance. And a micro-architecture that clocks high naturally can't be doing as much work per clock cycle, since your critical path needs to be shorter.
    This was true until Apple released the M1 cores for laptops, and now their desktops (or mini PCs). I mean, the M1 Pro goes up to 3.2 GHz, which for a laptop is fairly on par. While it is true that Intel/AMD laptop SKUs in a similar class can reach higher clock speeds, you then get into the fallacy of directly comparing clock speed between different architectures: one CPU can have a lower clock speed than another, but because of architectural reasons/IPC, the CPU with lower clocks can perform as fast, if not faster (Ryzen, for example, beating Intel with SKUs in the same class at lower clocks).

    In any case, people need to stop pushing the sentiment that the M1 is a "mobile phone CPU that has lower clocks", because it's not. Its clock speed is already well past the "mobile" range. The M1 chip was deliberately designed for high-power products (i.e. pro laptops and faster) and is loosely based on their A-series architecture; it's not like Apple just shoved a mobile SKU into their laptops.

    This is the basic flaw in your argumentation: if SMT were as beneficial as you implied (relative to die tax and other factors), Apple would have done it. Their laptops are designed to compete against Intel/AMD on power; the M1 is not just meant to be a low-clocked mobile CPU, and it's not. Have you even seen the 17 inch M1 Max?

    I said this before and I'll say it again: Apple's M1 target is desktop class, not mobile.

    Originally posted by coder View Post
    As mentioned in my previous post, the best recent source I found is ~5% additional core size. As for complexity, the number of CPUs and GPUs which have successfully implemented it suggests that it's not very difficult.
    5% is actually quite a bit. Let me put it a different way: that figure is large enough that the primary reason for excluding SMT on Silvermont was that a 5% die-size cost is significant. See https://www.anandtech.com/show/6936/...about-mobile/3

    Originally posted by coder View Post
    What does multi-core have to do with it?
    The amount of die space taken up by SMT is proportional to how many cores you have, so more cores means more die space taken up if you implement SMT. That can add up, so as I said before it's much wiser for Apple to use that die space for something else.

    Originally posted by coder View Post
    Except they just reused cores from their A14 phone SoCs. So, SMT mightn't even have been on the table.
    When it comes to CPUs, anything is on the table if it brings good enough performance. As said before, if SMT was as good as you are implying, Apple would have put it in.

    When you are a CPU company, you never rule anything out unless you have a very good reason and in Apple's case with how many resources they have, if SMT was beneficial they would have done it.

    Originally posted by coder View Post
    In fact, that could actually be what pushed them to move memory in-package, in which case you've got cause-and-effect reversed. That's what's so treacherous about drawing so many conclusions from one example, in particular. An infinite number of lines cross a single point, so you can extrapolate in any direction you want.
    I don't know what point you are making here; I was simply stating that there is a laundry list of reasons why SMT is not appropriate for Apple's M1, and that was yet another reason for that list (namely that cache misses on the M1 are not as penalizing compared to other comparable x86-64 systems).
    Last edited by mdedetrich; 02 June 2022, 07:26 AM.



  • bridgman
    replied
    mdedetrich made some good points about the M1 - practically speaking, the M1 is the only Arm64 product competing with x86-64 in the desktop/laptop space, so it makes sense that it would have a wider/deeper implementation (for higher peak IPC) than any of the other high-end Arm64 cores.

