Amazon Graviton3 vs. Intel Xeon vs. AMD EPYC Performance


  • bridgman
    replied
    Originally posted by coder View Post
    Some algorithms simply don't have much ILP. Linked-list following is a prime example of inherently serial code, but there are others. Not only is it serial, but also quite likely to be memory-bound, depending on the size of the list and the degree of heap fragmentation.
    Good point - that might help to explain why you see SMT4 used more in server environments than in desktop.

    Originally posted by coder View Post
    Being a GPU guy, I'm surprised you didn't mention the potential of SMT for mitigating memory latency.
    I'm trying to write shorter posts. May give up soon.

    Leave a comment:


  • coder
    replied
    Originally posted by PerformanceExpert View Post
    V1 has a very wide backend and runs at a lower frequency (so can cover a higher percentage of the memory latency with similar sized ROB). Whichever way you cut it, the end result of the wide and deep OoO engine is 30-40% higher IPC than Zen 3 - quite a feat given its small size and low power.
    Graviton 3 is impressive, for sure. It does use a newer process node and DDR5, which neither of its competitors are on.

    I do wish we could've seen some single thread benchmarks and had more data on the power budgets of the x86 CPUs.

    Leave a comment:


  • coder
    replied
    Originally posted by mdedetrich View Post
    x86/64 has the problem of still needing to support programs that were compiled like 4 decades ago (if not more?), and so it has to run a lot of "poorly" OOO-optimized programs.
    Not sure if you read my previous reply, but I think you're mistaken in assuming that modern x86 CPUs need to execute old programs as efficiently as new ones.

    I happen to know of a specific example: Skylake reduced MMX performance by approximately half relative to the generation before it. They did it and nobody complained, because the old software that used MMX already ran fast enough, and newer software uses SSE2 or AVX2.
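
    For illustration (assuming an x86 target and GCC/Clang-style intrinsics; the function names here are just examples), the same byte-wise add written with the legacy MMX intrinsics versus SSE2 - the old path touches 8 bytes per op and has to manage the shared x87/MMX state, while anything built with a modern toolchain would get the 16-byte SSE2 form:

        #include <stdint.h>
        #include <stddef.h>
        #include <mmintrin.h>   /* legacy MMX intrinsics */
        #include <emmintrin.h>  /* SSE2 intrinsics */

        /* Legacy MMX path: 8 bytes per iteration; assumes 8-byte-aligned
         * pointers and must clear the MMX state afterwards. */
        void add_bytes_mmx(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
        {
            for (size_t i = 0; i + 8 <= n; i += 8) {
                __m64 va = *(const __m64 *)(a + i);
                __m64 vb = *(const __m64 *)(b + i);
                *(__m64 *)(out + i) = _mm_add_pi8(va, vb);
            }
            _mm_empty();  /* leave MMX state so later x87 code still works */
        }

        /* Modern SSE2 path: 16 bytes per iteration, unaligned loads/stores,
         * no shared state to manage. */
        void add_bytes_sse2(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
        {
            for (size_t i = 0; i + 16 <= n; i += 16) {
                __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
                __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
                _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi8(va, vb));
            }
        }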

    Originally posted by mdedetrich View Post
    At least with M1, the IPC is kind of ridiculous if you take into account power budget.
    You've got that backwards. Mobile-oriented cores need to run at lower clock speeds, because power tends to scale nonlinearly with clock speed. Therefore, they need to rely more heavily on IPC for performance than desktop-oriented cores do.
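
    (Rough first-order arithmetic, for illustration: dynamic power is roughly P ≈ C · V² · f, and the supply voltage needed to sustain a given clock rises with f, so power grows closer to f³ than linearly. A core that targets roughly half the clock can therefore spend the saved power budget on a wider, higher-IPC design.)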

    In Apple's case, vertical integration and their focus on higher-priced products with generous margins also let them worry less about silicon area than Intel and AMD can afford to do. Intel and AMD (and, to some extent, ARM proper) are all trying to optimize performance per mm^2 of die area, because die area determines price, which is a primary concern of their customers. This leads to smaller cores that rely more on clock speed for delivering performance. And a micro-architecture that clocks high naturally can't do as much work per clock cycle, since the critical path needs to be shorter.

    Originally posted by mdedetrich View Post
    In other words they don't realize how expensive SMT is to implement both in terms of die budget ... and complexity.
    As mentioned in my previous post, the best recent source I found is ~5% additional core size. As for complexity, the number of CPUs and GPUs which have successfully implemented it suggests that it's not very difficult.

    Originally posted by mdedetrich View Post
    Especially with multicore systems, it's just far simpler to not have to deal with SMT
    What does multi-core have to do with it?

    Originally posted by mdedetrich View Post
    Apple's M1 is a SoC, which means the memory latency is a lot lower (and the bandwidth a lot higher) compared to comparable x86/64 systems.
    Except they just reused cores from their A14 phone SoCs. So, SMT mightn't even have been on the table.

    In fact, that could actually be what pushed them to move memory in-package, in which case you've got cause-and-effect reversed. That's what's so treacherous about drawing so many conclusions from one example, in particular. An infinite number of lines cross a single point, so you can extrapolate in any direction you want.
    Last edited by coder; 01 June 2022, 01:32 PM.

    Leave a comment:


  • PerformanceExpert
    replied
    Originally posted by coder View Post
    ROB size shouldn't be normalized by the core's physical size (i.e. die area), but rather by its backend width and typical memory latency (i.e. in terms of clock cycles).
    V1 has a very wide backend and runs at a lower frequency (so can cover a higher percentage of the memory latency with similar sized ROB). Whichever way you cut it, the end result of the wide and deep OoO engine is 30-40% higher IPC than Zen 3 - quite a feat given its small size and low power.

    Leave a comment:


  • mdedetrich
    replied
    Originally posted by bridgman View Post

    For what it's worth, my take on why/where SMT makes sense is quite different from yours, although we might end up with the same conclusion.
    Actually, I think we have the same arguments; we just emphasized different parts and stated them differently.

    Originally posted by bridgman View Post
    - significantly expand the OOO capabilities of the core to improve the chances of finding enough ready-to-execute micro-ops in a single instruction stream - M1 is the poster child for this but so far the ARM-designed cores are more in line with x86-64. IIRC the Neoverse V-1 core in Graviton3 has a 256 entry re-order buffer, the same as Zen3.
    This is what I mentioned before, albeit maybe without enough emphasis. Since the Apple M1 only supports AArch64, and since (as stated before) the AArch64 ISA by design makes it a lot easier to encode OOO capabilities, all of the programs compiled for Apple's SoCs are extremely pipelinable. x86/64 has the problem of still needing to support programs that were compiled like 4 decades ago (if not more?), and so it has to run a lot of "poorly" OOO-optimized programs.


    Originally posted by bridgman View Post

    Anyways, I think we are in agreement - it's not so much that there are fundamental reasons A64 CPUs will never need SMT but more "they've gotten away without it so far". It's not clear to me that the SW running on A64 is necessarily ever going to demand the kind of high peak IPCs that are taken for granted in the x86-64 world so it's possible that "only doing efficiency cores" might suffice.
    At least with M1, the IPC is kind of ridiculous if you take into account power budget. Granted it's not at the level of high-end Ryzen/Intel K-series parts for single-core IPC, but we are comparing a laptop/small-form-factor chip to desktop SKUs now. While of course there is always the theoretical benefit of SMT, the issue is that people advocating it for ARM (especially Apple's M1) don't quantify their arguments. In other words they don't realize how expensive SMT is to implement both in terms of die budget (which has become increasingly critical over time, especially as desktop CPU power budgets keep rising, so you can't just keep throwing silicon at the problem because of cooling limits) and complexity. Especially with multicore systems, it's just far simpler to not have to deal with SMT and solve the issue with other tools.

    There are also other things that I didn't mention before for the sake of brevity, which I should have, that tilt the scales even further against SMT. Apple's M1 is a SoC, which means the memory latency is a lot lower (and the bandwidth a lot higher) compared to comparable x86/64 systems. This means that even in cases where you get a lot of CPU cache misses and end up having to go out to system RAM, Apple's M1 is a lot faster, which has a tangible effect on mitigating the downsides of CPU cache misses.

    On the note of IPC, and to put things into perspective, the Apple M1 Pro was extremely competitive when it comes to gaming (which is still quite single-core-IPC dependent), even when running games via the Rosetta x86/64 translation layer.

    Having to recompile software is a definite downside; it's not like ARM is absolutely perfect. There is software that is no longer being updated or maintained, and this is where I personally believe x86/64 really shines and why it's still so dominant. This also ties back to my earlier example with the JVM, which is similar: the JVM can still run programs (jars) from the 90s without those jars needing to be recompiled, because a lot of the optimization magic is done by the JVM at runtime. Yes, it's definitely true that it's not just x86/64 that has its own internal microcode; the difference is that x86/64 does a lot of black-magic/black-box style optimizations under the hood, whereas with a typical AArch64 CPU the translation (if done at all) is a lot more direct.

    Although, as a counter-argument, as we see with the M1's Rosetta, if you put support for the most heavily bottlenecked ISA translations into the die itself it can go a long way. You might not be able to run x86/64 programs at native speed, but it's still damn fast, and there is a good overlap in the sense that if software is not actively being maintained you probably don't care too much about its performance (otherwise you would expect maintenance to iterate on performance over time). Actually, I think the biggest impact Apple made with the M1 was showing how far you can go with efficient x86/64 emulation, and you can make a decent argument that Microsoft colossally failing in this area with their ARM machines hasn't helped the ARM laptop/desktop situation. Microsoft partnering with Qualcomm wasn't the best idea here (also due to the exclusivity deal, which is what is not allowing you to run Windows on ARM on Apple's M1 chips).
    Last edited by mdedetrich; 01 June 2022, 05:13 AM.

    Leave a comment:


  • coder
    replied
    Originally posted by PerformanceExpert View Post
    Comparing ROB entries is non-trivial, but Neoverse V1 is much smaller than Zen 3, so the ROB is relatively large for its size (ROB doubled since N1).
    ROB size shouldn't be normalized by the core's physical size (i.e. die area), but rather by its backend width and typical memory latency (i.e. in terms of clock cycles).
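
    As rough illustrative arithmetic (my numbers, not measurements): hiding a 100 ns miss costs ~300 cycles at 3 GHz but only ~260 cycles at 2.6 GHz, so a lower-clocked core needs proportionally fewer in-flight micro-ops - and thus fewer ROB, scheduler, and load/store entries - to cover the same absolute memory latency with the same backend width.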

    Leave a comment:


  • coder
    replied
    Originally posted by bridgman View Post
    The primary value of SMT in a modern CPU is the ability to build a very wide (high peak IPC) core that can efficiently execute both well-optimized code and older / less optimized code.
    Some algorithms simply don't have much ILP. Linked-list following is a prime example of inherently serial code, but there are others. Not only is it serial, but also quite likely to be memory-bound, depending on the size of the list and the degree of heap fragmentation.
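
    A minimal C sketch of that kind of pointer-chasing loop (illustrative only): the address of every load comes out of the previous load, so no amount of backend width or reordering can overlap the iterations, and with a fragmented heap most of those loads also miss in cache.

        #include <stddef.h>

        struct node {
            struct node *next;
            long payload;
        };

        /* Inherently serial: the address of each load is produced by the
         * previous load, so the dependency chain limits throughput to one
         * (possibly cache-missing) load per iteration. */
        long sum_list(const struct node *n)
        {
            long sum = 0;
            while (n != NULL) {
                sum += n->payload;
                n = n->next;
            }
            return sum;
        }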

    When I first got my hands on a Pentium 4 with hyperthreading, I observed a near linear speedup (i.e. nearly 2x), when running two threads each doing something like computing a Fibonacci sequence.
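
    A sketch of that kind of experiment (assumes pthreads; pinning both threads to the two hardware threads of one physical core would be done externally, e.g. with taskset): each thread is one long serial dependency chain, so on its own it leaves most of a wide core idle, and two of them can share the core with little interference.

        #include <pthread.h>
        #include <stdint.h>
        #include <stdio.h>

        /* One long serial dependency chain: every add needs the previous result. */
        static void *fib_chain(void *out)
        {
            uint64_t a = 0, b = 1;
            for (long i = 0; i < 1000000000L; i++) {
                uint64_t c = a + b;   /* depends on the prior iteration */
                a = b;
                b = c;                /* wraps on overflow; the value is irrelevant */
            }
            *(uint64_t *)out = b;
            return NULL;
        }

        int main(void)
        {
            uint64_t r1, r2;
            pthread_t t1, t2;
            pthread_create(&t1, NULL, fib_chain, &r1);
            pthread_create(&t2, NULL, fib_chain, &r2);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("%llu %llu\n", (unsigned long long)r1, (unsigned long long)r2);
            return 0;
        }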

    Being a GPU guy, I'm surprised you didn't mention the potential of SMT for mitigating memory latency.

    Leave a comment:


  • coder
    replied
    Originally posted by mdedetrich View Post
    This is like fundamental ISA design; my source would be to actually study the ISAs and understand what SMT means, and you will very quickly see why this is the case.
    That's rubbish. You are making testable assertions. People who study computer architecture can (and I'm sure do) run simulations and study detailed performance measurements of real code on real hardware, to gain insight into the utilization efficiency of modern microarchitectures. Furthermore, I'm quite certain CPU designers use detailed simulations during the design and tuning of their microarchitectures. Just because you have no data to support your presumptions doesn't mean it's unknowable or that nobody else does.

    Originally posted by mdedetrich View Post
    You asking me for sources on this is like asking for sources as to why Vulkan is a superior API especially when doing multicore game engines, its because the API is designed in a specific way that enables this. You are not going to find a "source" that says this because its kind of "duh its obvious" to anyone that is technical and works with Vulkan.
    This is another flawed statement. There's available data, in the form of performance analysis of software with Vulkan + some other backend (e.g. OpenGL, DX11, etc.). While that's not a perfect experiment, because there are likely other improvements in the Vulkan backend (assuming it's newer than the others), an aggregate of such data should support general conclusions about relative multi-core efficiency.

    Furthermore, one can do comparative analysis of such APIs, to identify specific features and characteristics which have such consequences.

    Originally posted by mdedetrich View Post
    lets explain in a simple way why SMT is not needed for modern ARM and at the same time explain why it is needed for a performant x86/64 core
    You're simply restating your prior assumptions, using more words. This presumes our disagreement was due to me failing to understand your position. I think I understand it well enough.

    The problem here is that you're taking a limited set of data and imputing meaning that it's simply insufficient to support. In other words, you're merely speculating about why SMT hasn't featured more prominently in ARM cores. You're not allowing for the possibility that you're wrong, but that possibility is very real.

    I think there's more to be learned in looking at cases where Intel has and hasn't employed it. Specifically, how none of their E-cores have had it after the original Atom (which was an in-order core with 2-way SMT). A notable exception is the modified Silvermont core that they employed in 2nd Gen (KNL) Xeon Phi, which is an OoO core with 4-way SMT. This suggests the driving factor in whether to employ SMT is probably one of power-efficiency. This aligns with the data you cited about ARM cores, as all of ARM's own cores, as well as Apple's, have been mobile-first.

    Another noteworthy data point is that Xeon Phi scaled up to 72 cores, which is an order of magnitude beyond the scales we see in Phone SoCs.

    Originally posted by mdedetrich View Post
    (note that this is simplified, in reality it is more complex).
    Oversimplification is the enemy. Modern micro-architectures are very complex, as is the business of tuning them. Long gone are the days when designs are committed to silicon without detailed modelling, analysis, and tuning on real software.

    Originally posted by mdedetrich View Post
    The last thing you want is having a CPU' sitting there not executing as many instructions it can in a given clock cycle,
    Don't agree. The amount of energy spent keeping functional units busy needs to be balanced against the overheads involved in letting them idle. This is even more critical for mobile-oriented cores.

    Originally posted by mdedetrich View Post
    Due to a number of factors (general CISC design, not being able to deprecate older badly designed instructions over a 40 year period?) a lot of the instructions that x86/64 execute are not easily pipelinable.
    Newer CPUs can pessimize legacy instructions, because the software that needs to run fastest is nearly always being compiled with modern toolchains that are smart enough not to emit poorly-performing instructions and instruction sequences.

    Originally posted by mdedetrich View Post
    Put in a different way, it's incredibly hard to figure out at the current given time what the control order flow of a running program (composed of x86/64 instructions) will be in the future.
    That's not equivalent to what you said before. You're confusing "pipelinability" with branch prediction.

    Originally posted by mdedetrich View Post
    Now we have this problem whereby they have to work with an instruction set that gives very little information (i.e. a lot of the instructions don't have control order flow information)
    Please provide more specifics. Moreover, "control order flow" appears to be a phrase you invented. Google doesn't seem to know it, at least.

    Originally posted by mdedetrich View Post
    So engineers came up with this ingenious solution which is basically "well we can't do too much above our current techniques to solve this issue of x86/64 ISA not being easily pipelinable
    Please provide examples of these instructions that aren't "easily pipelinable" and explain why compilers can't simply avoid emitting such instructions or sequences.

    Originally posted by mdedetrich View Post
    so let's change the problem around: why don't we instead just give the OS the ability to have two virtual cores that sit on top of a real physical core, so that if one virtual core isn't constantly feeding the real core instructions then it will instead get instructions from the second core". Bingo, you now have SMT.
    You're guessing that's why they used SMT, but there are other possible explanations.

    Originally posted by mdedetrich View Post
    And this is why SMT is largely unnecessary on ARM because simply put current ARM cores (and especially Apple's M1) don't have this problem of not being able to constantly feed the CPU instructions.
    You're ignoring all the reasons besides "pipelinability" that CPU pipelines idle.

    Originally posted by mdedetrich View Post
    the Apple M1's do not have this problem of not feeding the CPU with instructions.
    I think you don't know enough to say that. You probably have no idea what the utilization efficiency is like in the cores of any Apple SoC, much less the Firestorm cores in the M1 SoCs.

    Originally posted by mdedetrich View Post
    If you put this into context of how expensive SMT is in terms of die space (you basically have to implement a multiplexer on the hardware level since you have to co-ordinate instructions from different virtual cores onto a single physical core),
    Modern CPU cores already have register renaming. SMT adds a little overhead, but not as much as you think. David Kanter recently estimated it at about 5% of core area:



    Originally posted by mdedetrich View Post
    Fun fact and another tidbit: Due to the fact that x86/64 instructions generally doesn't contain control order flow modern Intel/AMD cpu's don't actually execute x86/64 directly. Instead they translate at runtime the x86/64 to a microcode which is actually RISC like in design and the microcode is whats executed,
    Fun fact: microcode is nothing new, and it's not always (or traditionally) RISC-like. Micro-ops do tie directly into that lineage, though.

    It's interesting that you pose micro-ops as the solution to "x86/64 instructions generally doesn't contain control order flow", after already positioning SMT as a solution to this same problem.

    I think the main reason Intel adopted micro-ops is due to the complex and multi-faceted nature of x86 instructions. Aspects like memory operands and address arithmetic are easier to manage and optimize, if you break them into separate operations.
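
    As a rough illustration (the actual split is implementation-specific and not publicly documented), an x86-64 add with a memory operand is semantically two simpler steps, and that is more or less how it gets cracked:

        add rax, [rbx + rcx*8 + 16]    ; one x86-64 instruction
        ; roughly cracked into:
        ;   uop 1 (AGU/load port): tmp <- load64(rbx + rcx*8 + 16)
        ;   uop 2 (ALU port):      rax <- rax + tmp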

    Originally posted by mdedetrich View Post
    instructions for AArch64 are a constant 32 bits in size and knowing that instructions are a constant size is incredibly useful
    Variable-length decoding is indeed the root of the "frontend bottleneck" problem, which micro-op caches do a pretty good job of mitigating.

    There are lots of ways to crack this nut, but it is an issue. In Intel's Tremont E-core, they employed two parallel 3-wide decoders that can concurrently decode instruction streams from different branch targets. That's not as good as SMT, but it gets you part way there.

    Leave a comment:


  • bridgman
    replied
    Originally posted by PerformanceExpert View Post
    That's not true, AArch64 still has many conditional instructions such as CSEL, CSET, CCMP etc. These cover the majority of use cases and avoid some of the issues predicated instructions have.
    OK, I see the problem. ARM does not call those "conditional instructions" but rather "unconditionally executed instructions that include condition code information as one of the inputs". Looks like they basically removed all the predication capabilities except for conditional branch but added in a few new "conditional data processing instructions". There are a lot of those opcodes but many of them appear to be aliases of other instructions.

    My statement about conditional branch being the only remaining conditional instruction came directly from ARM programming materials, but apparently some of the terminology changed as well, so there are a few other instructions with "Conditional" in their name.

    Originally posted by PerformanceExpert View Post
    All CPUs decode instructions into internal micro-ops - this is not unique to high-end or OoO cores. Recent Arm designs have a micro-op cache but the M1 proves it is not needed for high-end (I bet that dropping Arm/Thumb helps since that means only 32-bit instructions).
    I'm trying to write more concisely but always end up regretting it. Agree that most if not all CPUs decode instructions into internal micro-ops, but lower end CPUs execute those micro-ops immediately while OOO CPUs schedule and execute the micro-ops independently of the fetch/decode activities.

    Anyways, my point was that mdedetrich's comment about x86-64's use of independently scheduled and executed micro-ops also applies to Arm64.

    Originally posted by PerformanceExpert View Post
    You'd optimize your code already, and compilers improve considerably over time, but people still want better performance. So then you're back to SMT or going deep and wide. Comparing ROB entries is non-trivial, but Neoverse V1 is much smaller than Zen 3, so the ROB is relatively large for its size (ROB doubled since N1).
    Yep, that's fair - although I suspect that a fair amount of the size difference is due to Zen3 being designed to run at significantly higher clocks than N1 on the same fab process.

    I should note that while compilers tend to improve over time, it's less common for SW vendors to recompile and redistribute new binaries unless they are also releasing new or significantly updated SW. It probably is fair to say that x86-64 is impacted by that more than Arm64, simply because the kind of SW that doesn't tend to get new binaries regularly (games) tends to be run a lot more on x86-64 than on Arm64.

    Anyways, I think we are in agreement - it's not so much that there are fundamental reasons A64 CPUs will never need SMT but more "they've gotten away without it so far". It's not clear to me that the SW running on A64 is necessarily ever going to demand the kind of high peak IPCs that are taken for granted in the x86-64 world so it's possible that "only doing efficiency cores" might suffice.

    Leave a comment:


  • PerformanceExpert
    replied
    Originally posted by bridgman View Post
    ARM actually removed nearly all of the conditional instructions as part of the A32->A64 transition. The only conditional instruction remaining is conditional branch. Your statement is correct for A32 and T32, however.
    That's not true, AArch64 still has many conditional instructions such as CSEL, CSET, CCMP etc. These cover the majority of use cases and avoid some of the issues predicated instructions have.
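
    For illustration (exact codegen varies by compiler and flags), a plain C ternary like the one below is typically lowered to a compare plus a single CSEL on AArch64, with no branch at all:

        /* Clamp x from below; compilers targeting AArch64 usually emit
         * something like: cmp x0, x1 / csel x0, x1, x0, lt */
        long clamp_min(long x, long lo)
        {
            return (x < lo) ? lo : x;
        }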

    Originally posted by bridgman View Post
    High end A64 (Arm64) CPUs decode ARM ISA into micro-ops and execute the micro-ops. The latest cores even include a micro-op cache just like x86-64 - they call it the "L0-decoded" cache.
    All CPUs decode instructions into internal micro-ops - this is not unique to high-end or OoO cores. Recent Arm designs have a micro-op cache but the M1 proves it is not needed for high-end (I bet that dropping Arm/Thumb helps since that means only 32-bit instructions).

    Originally posted by bridgman View Post
    For what it's worth, my take on why/where SMT makes sense is quite different from yours, although we might end up with the same conclusion.

    The primary value of SMT in a modern CPU is the ability to build a very wide (high peak IPC) core that can efficiently execute both well-optimized code and older / less optimized code. Making good use of a wide core's execution resources can be done in a few different ways:

    - optimize the code so that a reasonably deep OOO execution engine can find enough ready-to-execute micro-ops in a single instruction stream to keep most/all of the execution resources busy

    - use SMT so that on average you only have to find enough ready-to-execute micro-ops in a single instruction stream to keep 1/2 or 1/4 of the execution resources busy - or put differently you have to find enough ready-to-execute micro-ops in 2 or 4 instruction streams to keep most/all of the execution resources busy

    - significantly expand the OOO capabilities of the core to improve the chances of finding enough ready-to-execute micro-ops in a single instruction stream - M1 is the poster child for this but so far the ARM-designed cores are more in line with x86-64. IIRC the Neoverse V-1 core in Graviton3 has a 256 entry re-order buffer, the same as Zen3.
    You'd optimize your code already, and compilers improve considerably over time, but people still want better performance. So then you're back to SMT or going deep and wide. Comparing ROB entries is non-trivial, but Neoverse V1 is much smaller than Zen 3, so the ROB is relatively large for its size (ROB doubled since N1).

    Leave a comment:
