Amazon Graviton3 vs. Intel Xeon vs. AMD EPYC Performance


  • #51
    Originally posted by mdedetrich View Post
    This is like fundamental ISA design, my source would be actually study the ISA's and understand what SMT means and you will very quickly see why this is the case.
    That's rubbish. You are making testable assertions. People who study computer architecture can (and I'm sure do) run simulations and study detailed performance measurements of real code on real hardware, to gain insight into the utilization efficiency of modern microarchitectures. Furthermore, I'm quite certain CPU designers use detailed simulations during the design and tuning of their microarchitectures. Just because you have no data to support your presumptions doesn't mean it's unknowable or that nobody else does.

    Originally posted by mdedetrich View Post
    You asking me for sources on this is like asking for sources as to why Vulkan is a superior API especially when doing multicore game engines, its because the API is designed in a specific way that enables this. You are not going to find a "source" that says this because its kind of "duh its obvious" to anyone that is technical and works with Vulkan.
    This is another flawed statement. There's available data, in the form of performance analysis of software with Vulkan + some other backend (e.g. OpenGL, DX11, etc.). While that's not a perfect experiment, because there are likely other improvements in the Vulkan backend (assuming it's newer than the others), an aggregate of such data should support general conclusions about relative multi-core efficiency.

    Furthermore, one can do comparative analysis of such APIs, to identify specific features and characteristics which have such consequences.

    Originally posted by mdedetrich View Post
    lets explain in a simple way why SMT is not needed for modern ARM and at the same time explain why it is needed for a performant x86/64 core
    You're simply restating your prior assumptions, using more words. This presumes our disagreement was due to me failing to understand your position. I think I understand it well enough.

    The problem here is that you're taking a limited set of data and imputing meaning that it's simply insufficient to support. In other words, you're merely speculating why SMT hasn't featured more prominently, in ARM cores. You're not allowing for the possibility that you're wrong, but that possibility is very real.

    I think there's more to be learned in looking at cases where Intel has and hasn't employed it. Specifically, how none of their E-cores have had it, after the original Atom (which was an in-order core, with 4-way SMT). A notable exception is the modified Silvermont core that they employed in 2nd Gen (KNL) Xeon Phi, which is an OoO core with 4-way SMT. This suggests the driving factor in whether to employ SMT is probably one of power-efficiency. This aligns with the data you cited about ARM cores, as all of ARM's own cores, as well as Apple's, have been mobile-first.

    Another noteworthy data point is that Xeon Phi scaled up to 72 cores, which is an order of magnitude beyond the scales we see in Phone SoCs.

    Originally posted by mdedetrich View Post
    (note that this is simplified, in reality it is more complex).
    Oversimplification is the enemy. Modern micro-architectures are very complex, as is the business of tuning them. Long gone are the days when designs were committed to silicon without detailed modelling, analysis, and tuning on real software.

    Originally posted by mdedetrich View Post
    The last thing you want is having a CPU' sitting there not executing as many instructions it can in a given clock cycle,
    Don't agree. The amount of energy spent keeping functional units busy needs to be balanced against the overheads involved in letting them idle. This is even more critical for mobile-oriented cores.

    Originally posted by mdedetrich View Post
    Due to a number of factors (general CISC design, not being able to deprecate older badly designed instructions over a 40 year period?) a lot of the instructions that x86/64 execute are not easily pipelinable.
    Newer CPUs can pessimize legacy instructions, because the software that needs to run fastest is nearly always being compiled with modern toolchains that are smart enough not to emit poorly-performing instructions and instruction sequences.

    Originally posted by mdedetrich View Post
    Put in a different way its incredibly hard to figure out at the current given time what the control order flow of a running program (composed of x86/64 instructions) will be in the future.
    That's not equivalent to what you said before. You're confusing "pipelinability" with branch prediction.

    Originally posted by mdedetrich View Post
    Now we have this problem whereby they have to work with an instruction set that gives very little information (i.e. a lot of the instructions don't have control order flow information)
    Please provide more specifics. Moreover, "control order flow" appears to be a phrase you invented. Google doesn't seem to know it, at least.

    Originally posted by mdedetrich View Post
    So engineers came up with this ingenious solution which is basically "well we can't do too much above our current techniques to solve this issue of x86/64 ISA not being easily pipelinable
    Please provide examples of these instructions that aren't "easily pipelinable" and explain why compilers can't simply avoid emitting such instructions or sequences.

    Originally posted by mdedetrich View Post
    so lets change the problem around, why don't we instead just give the OS the ability to have two virtual cores that sit ontop of a real physical core so that if one virtual core isn't constantly feeding the real core instructions then instead it will get instructions from the second core". Bingo you now have SMT.
    You're guessing that's why they used SMT, but there are other possible explanations.

    Originally posted by mdedetrich View Post
    And this is why SMT is largely unnecessary on ARM because simply put current ARM cores (and especially Apple's M1) don't have this problem of not being able to constantly feed the CPU instructions.
    You're ignoring all the reasons besides "pipelinability" that CPU pipelines idle.

    Originally posted by mdedetrich View Post
    the Apple M1's do not have this problem of not feeding the CPU with instructions.
    I think you don't know enough to say that. You probably have no idea what the utilization efficiency is like in the cores of any Apple SoC, much less the Firestorm cores of the M1 SoCs.

    Originally posted by mdedetrich View Post
    If you put this into context of how expensive SMT is in terms of die space (you basically have to implement a multiplexer on the hardware level since you have to co-ordinate instructions from different virtual cores onto a single physical core),
    Modern CPU cores already have register renaming. SMT adds a little overhead, but not as much as you think. David Kanter recently estimated it at about 5% of core area.

    Originally posted by mdedetrich View Post
    Fun fact and another tidbit: Due to the fact that x86/64 instructions generally doesn't contain control order flow modern Intel/AMD cpu's don't actually execute x86/64 directly. Instead they translate at runtime the x86/64 to a microcode which is actually RISC like in design and the microcode is whats executed,
    Fun fact: microcode is not new, nor was it always or traditionally RISC-like. Micro-ops do tie directly into that lineage, though.

    It's interesting that you pose micro-ops as the solution to "x86/64 instructions generally doesn't contain control order flow", after already positioning SMT as a solution to this same problem.

    I think the main reason Intel adopted micro-ops is due to the complex and multi-faceted nature of x86 instructions. Aspects like memory operands and address arithmetic are easier to manage and optimize, if you break them into separate operations.
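
    To make that concrete, here's a tiny sketch of the idea (my own simplification; exactly how any given core cracks an instruction into micro-ops is microarchitecture-specific):

    ```cpp
    // Simplified illustration (an assumption for illustration only): an x86
    // instruction such as
    //     add rax, [rbx + rcx*8]
    // bundles address arithmetic, a load, and an ALU op into one instruction.
    // Written out as separate steps, each piece can be scheduled and tracked
    // independently, which is roughly what splitting into micro-ops buys you.
    #include <cstdint>

    uint64_t add_with_memory_operand(uint64_t rax, const uint64_t* rbx,
                                     uint64_t rcx) {
        const uint64_t* addr = rbx + rcx;  // 1. address generation (rbx + rcx*8 bytes)
        uint64_t loaded = *addr;           // 2. load micro-op
        return rax + loaded;               // 3. integer-add micro-op
    }
    ```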

    Originally posted by mdedetrich View Post
    instructions for AArm64 is a constant 32 bits in size and knowing that instructions are a constant size is incredibly useful
    This gets at the root of x86's "frontend bottleneck" problem, which micro-op caches do a pretty good job of mitigating.

    There are lots of ways to crack this nut, but it is an issue. In Intel's Tremont E-core, they employed two parallel 3-wide decoders that can concurrently decode instruction streams from different branch targets. That's not as good as SMT, but it gets you part way there.



    • #52
      Originally posted by bridgman View Post
      The primary value of SMT in a modern CPU is the ability to build a very wide (high peak IPC) core that can efficiently execute both well-optimized code and older / less optimized code.
      Some algorithms simply don't have much ILP. Linked-list following is a prime example of inherently serial code, but there are others. Not only is it serial, but also quite likely to be memory-bound, depending on the size of the list and the degree of heap fragmentation.
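
      To make the linked-list point concrete, a minimal sketch (illustrative only): every load depends on the pointer produced by the previous one, so there's essentially no ILP for the core to extract, and each hop can miss in cache on a fragmented heap.

      ```cpp
      // Illustrative sketch of an inherently serial, likely memory-bound loop:
      // each iteration needs the pointer produced by the previous load, so the
      // core has almost no ILP to exploit, and every hop can be a cache miss if
      // the nodes are scattered across a fragmented heap.
      #include <cstddef>

      struct Node {
          Node* next;
          long payload;
      };

      long sum_list(const Node* head) {
          long total = 0;
          for (const Node* n = head; n != nullptr; n = n->next) {
              total += n->payload;  // the next load can't start until n->next arrives
          }
          return total;
      }
      ```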

      When I first got my hands on a Pentium 4 with hyperthreading, I observed a near linear speedup (i.e. nearly 2x), when running two threads each doing something like computing a Fibonacci sequence.
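
      For anyone curious, here's roughly how you'd reproduce that kind of experiment today (a sketch of mine, not the original test; the iteration count and the core pinning are assumptions you'd adjust for your machine):

      ```cpp
      // Minimal sketch of the experiment described above: each thread runs a
      // long serial dependency chain, so one thread alone leaves much of a wide
      // core idle. Pin both threads onto the two logical CPUs of a single
      // physical core (e.g. with taskset on Linux) and compare the wall time
      // against running just one thread.
      #include <chrono>
      #include <cstdio>
      #include <thread>

      volatile unsigned long long sink;  // keeps the result from being optimized away

      void serial_chain(unsigned long long iters) {
          unsigned long long a = 0, b = 1;
          for (unsigned long long i = 0; i < iters; ++i) {
              unsigned long long next = a + b;  // every step depends on the previous one
              a = b;
              b = next;
          }
          sink = b;
      }

      int main() {
          const unsigned long long iters = 2000000000ULL;
          auto start = std::chrono::steady_clock::now();
          std::thread t1(serial_chain, iters), t2(serial_chain, iters);
          t1.join();
          t2.join();
          std::chrono::duration<double> elapsed =
              std::chrono::steady_clock::now() - start;
          std::printf("two threads: %.2f s\n", elapsed.count());
      }
      ```

      Pinned to one physical core without SMT, the two-thread run should take roughly twice as long as a single thread; with SMT it should take only slightly longer, which is the near-2x throughput I saw.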

      Being a GPU guy, I'm surprised you didn't mention the potential of SMT for mitigating memory latency.



      • #53
        Originally posted by PerformanceExpert View Post
        Comparing ROB entries is non-trivial, but Neoverse V1 is much smaller than Zen 3, so the ROB is relatively large for its size (ROB doubled since N1).
        ROB size shouldn't be normalized by the core's physical size (i.e. die area), but rather by its backend width and typical memory latency (i.e. in terms of clock cycles).
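
        A back-of-the-envelope sketch of what I mean (all numbers below are illustrative assumptions, not figures for V1 or Zen 3):

        ```cpp
        // Back-of-the-envelope sketch (all numbers are illustrative assumptions,
        // not vendor figures): the out-of-order window needed to cover a memory
        // stall scales with backend width times the stall length in cycles, so
        // raw ROB entry counts don't compare cleanly across cores with different
        // widths and clock speeds.
        #include <cstdio>

        int main() {
            const int width_uops_per_cycle = 8;    // assumed sustained backend width
            const int dram_latency_ns = 90;        // assumed round-trip memory latency
            const double freq_ghz[] = {2.6, 4.9};  // e.g. a server core vs. a desktop core

            for (double f : freq_ghz) {
                int stall_cycles = static_cast<int>(dram_latency_ns * f);
                std::printf("%.1f GHz: a miss costs ~%d cycles, ~%d in-flight micro-ops"
                            " to hide it\n",
                            f, stall_cycles, stall_cycles * width_uops_per_cycle);
            }
        }
        ```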



        • #54
          Originally posted by bridgman View Post

          For what it's worth, my take on why/where SMT makes sense is quite different from yours, although we might end up with the same conclusion.
          Actually I think we have the same arguments, we just emphasized different parts and stated them differently.

          Originally posted by bridgman View Post
          - significantly expand the OOO capabilities of the core to improve the chances of finding enough ready-to-execute micro-ops in a single instruction stream - M1 is the poster child for this but so far the ARM-designed cores are more in line with x86-64. IIRC the Neoverse V-1 core in Graviton3 has a 256 entry re-order buffer, the same as Zen3.
          This is what I mentioned before, albeit maybe not with enough emphasis. Since the Apple M1 only supports AArch64, and since (as stated before) the AArch64 ISA by design makes it a lot easier to exploit OOO capabilities, all of the programs compiled for Apple SoCs are extremely pipelinable. x86/64 still has to support programs compiled as much as four decades ago (if not more?), and so it ends up running a lot of "poorly" OOO-optimized programs.


          Originally posted by bridgman View Post

          Anyways, I think we are in agreement - it's not so much that there are fundamental reasons A64 CPUs will never need SMT but more "they've gotten away without it so far". It's not clear to me that the SW running on A64 is necessarily ever going to demand the kind of high peak IPCs that are taken for granted in the x86-64 world so it's possible that "only doing efficiency cores" might suffice.
          At least with the M1, the IPC is kind of ridiculous if you take the power budget into account. Granted, it's not at the level of the high-end Ryzen/Intel K-series in single-core IPC, but we are comparing a laptop/small-form-factor part to desktop SKUs now. While of course there is always the theoretical benefit of SMT, the issue is that people advocating it for ARM (especially Apple's M1) don't quantify their arguments. In other words, they don't realize how expensive SMT is to implement, both in terms of die budget (which has become increasingly critical over time, especially as we keep increasing the power budget of desktop CPUs, so you can't just keep throwing silicon at the problem now that cooling is an issue) and in terms of complexity. Especially with multicore systems, it's just far simpler not to have to deal with SMT and to solve the issue with other tools.

          There are also other things I didn't mention before, for the sake of brevity, that tilt the scales even further against SMT: Apple's M1 is an SoC, which means the memory latency is a lot lower (and the bandwidth a lot higher) than on comparable x86/64 systems. This means that even in cases where you get a lot of CPU cache misses and end up having to go out to system RAM, the M1 is a lot faster, which has a tangible effect on mitigating the downsides of cache misses.

          On the note of IPC, and to put things into perspective, the Apple M1 Pro was extremely competitive in gaming (which is still quite dependent on single-core performance) even when running games through the Rosetta x86/64 translation layer.

          Having to recompile software is a definite downside; it's not like ARM is absolutely perfect. There is software that is no longer being updated or maintained, and this is where I personally believe x86/64 really shines and why it's still so dominant. This also ties back to my earlier example with the JVM, which is similar: the JVM can still run programs (JARs) from the '90s without those JARs needing to be recompiled, because a lot of the optimization magic is done by the JVM at runtime. Yes, it's definitely true that x86/64 isn't the only ISA with its own internal microcode; the difference is that x86/64 does a lot of black-magic/black-box-style optimization under the hood, whereas with a typical AArch64 CPU the translation (if done at all) is a lot more direct.

          Although, as a counterargument, as we see with the M1's Rosetta, if you put support for the most heavily bottlenecked translated instructions into the die itself, it can go a long way. You might not be able to run x86/64 programs at native speed, but it's still damn fast, and there is a good overlap in the sense that if software is not actively being maintained, you probably don't care too much about its performance (otherwise you would expect maintenance to iterate on performance over time). Actually, I think the biggest impact Apple made with the M1 was showing how far you can go with efficient x86/64 emulation, and you can make a decent argument that Microsoft colossally failing in this area with their ARM machines hasn't helped the ARM laptop/desktop situation. Microsoft partnering with Qualcomm wasn't the best idea here (also due to the exclusivity deal, which is why you can't run Windows on ARM on Apple's M1 chips).
          Last edited by mdedetrich; 01 June 2022, 05:13 AM.



          • #55
            Originally posted by coder View Post
            ROB size shouldn't be normalized by the core's physical size (i.e. die area), but rather by its backend width and typical memory latency (i.e. in terms of clock cycles).
            V1 has a very wide backend and runs at a lower frequency (so can cover a higher percentage of the memory latency with similar sized ROB). Whichever way you cut it, the end result of the wide and deep OoO engine is 30-40% higher IPC than Zen 3 - quite a feat given its small size and low power.



            • #56
              Originally posted by mdedetrich View Post
               x86/64 still has to support programs compiled as much as four decades ago (if not more?), and so it ends up running a lot of "poorly" OOO-optimized programs.
               Not sure if you read my previous reply, but I think you're mistaken in assuming that modern x86 CPUs need to execute old programs as efficiently as new ones.

              I happen to know of a specific example, where Skylake reduced MMX performance by approximately half, relative to the generation before it. They did it and nobody complained, because the old software that used MMX already ran fast enough, and newer software is using SSE2 or AVX2.

              Originally posted by mdedetrich View Post
               At least with the M1, the IPC is kind of ridiculous if you take the power budget into account.
              You've got that backwards. Mobile-oriented cores need to run at lower clock speeds, because power tends to scale nonlinearly with clock speed. Therefore, they need to rely more heavily on IPC for performance, than desktop-oriented cores.

              In Apple's case, vertical integration and their focus on higher-priced products with generous margins also lets them worry less about silicon area than Intel and AMD can afford to do. Intel and AMD (and, to some extent, ARM proper) are all trying to optimize performance per mm^2 of die area, because die area determines price, which is a primary concern of their customers. This leads to smaller cores that rely more on clock speed, for delivering performance. And a micro-architecture that clocks high naturally can't be doing as much work per clock cycle, since your critical path needs to be shorter.
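
              To put rough numbers on that nonlinearity (the voltage-frequency relationship here is an illustrative assumption, not a measurement of any particular core):

              ```cpp
              // Rough sketch of why clocking higher costs disproportionate power:
              // dynamic power is approximately C * V^2 * f, and near the top of the
              // DVFS range voltage has to rise with frequency, so power grows close
              // to the cube of the clock speed. The V ~ f assumption is illustrative.
              #include <cstdio>

              int main() {
                  const double base_ghz = 3.0;  // normalized baseline frequency
                  for (double f = 3.0; f <= 5.0; f += 0.5) {
                      double v = f / base_ghz;               // assume V scales with f
                      double relative_power = v * v * (f / base_ghz);
                      std::printf("%.1f GHz -> ~%.2fx the dynamic power of %.1f GHz\n",
                                  f, relative_power, base_ghz);
                  }
              }
              ```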

              Originally posted by mdedetrich View Post
               In other words, they don't realize how expensive SMT is to implement, both in terms of die budget ... and in terms of complexity.
              As mentioned in my previous post, the best recent source I found is ~5% additional core size. As for complexity, the number of CPUs and GPUs which have successfully implemented it suggests that it's not very difficult.

              Originally posted by mdedetrich View Post
               Especially with multicore systems, it's just far simpler not to have to deal with SMT
              What does multi-core have to do with it?

              Originally posted by mdedetrich View Post
               Apple's M1 is an SoC, which means the memory latency is a lot lower (and the bandwidth a lot higher) than on comparable x86/64 systems.
              Except they just reused cores from their A14 phone SoCs. So, SMT mightn't even have been on the table.

              In fact, that could actually be what pushed them to move memory in-package, in which case you've got cause-and-effect reversed. That's what's so treacherous about drawing so many conclusions from one example, in particular. An infinite number of lines cross a single point, so you can extrapolate in any direction you want.
              Last edited by coder; 01 June 2022, 01:32 PM.



              • #57
                Originally posted by PerformanceExpert View Post
                V1 has a very wide backend and runs at a lower frequency (so can cover a higher percentage of the memory latency with similar sized ROB). Whichever way you cut it, the end result of the wide and deep OoO engine is 30-40% higher IPC than Zen 3 - quite a feat given its small size and low power.
                 Graviton 3 is impressive, for sure. It does use a newer process node and DDR5, which neither of its competitors is on.

                I do wish we could've seen some single thread benchmarks and had more data on the power budgets of the x86 CPUs.



                • #58
                  Originally posted by coder View Post
                  Some algorithms simply don't have much ILP. Linked-list following is a prime example of inherently serial code, but there are others. Not only is it serial, but also quite likely to be memory-bound, depending on the size of the list and the degree of heap fragmentation.
                  Good point - that might help to explain why you see SMT4 used more in server environments than in desktop.

                  Originally posted by coder View Post
                  Being a GPU guy, I'm surprised you didn't mention the potential of SMT for mitigating memory latency.
                  I'm trying to write shorter posts. May give up soon.



                  • #59
                    mdedetrich made some good points about M1 - practically speaking the M1 is the only Arm64 product competing with x86-64 in the desktop/laptop space, so it makes sense that it would have a wider/deeper implementation (for higher peak IPC) than any of the other high end Arm64 cores.



                    • #60
                      Originally posted by coder View Post
                      Not sure if you read my previous reply, but I think you're mistaken in assuming that modern x86 CPUs need to execute old programs as efficiently as new ones.

                      I happen to know of a specific example, where Skylake reduced MMX performance by approximately half, relative to the generation before it. They did it and nobody complained, because the old software that used MMX already ran fast enough, and newer software is using SSE2 or AVX2.
                      I didn't say they need to, nor did I say that it's always the case.

                      Originally posted by coder View Post
                      You've got that backwards. Mobile-oriented cores need to run at lower clock speeds, because power tends to scale nonlinearly with clock speed. Therefore, they need to rely more heavily on IPC for performance, than desktop-oriented cores.

                      In Apple's case, vertical integration and their focus on higher-priced products with generous margins also lets them worry less about silicon area than Intel and AMD can afford to do. Intel and AMD (and, to some extent, ARM proper) are all trying to optimize performance per mm^2 of die area, because die area determines price, which is a primary concern of their customers. This leads to smaller cores that rely more on clock speed, for delivering performance. And a micro-architecture that clocks high naturally can't be doing as much work per clock cycle, since your critical path needs to be shorter.
                      This was true until Apple released the M1 cores for laptops and now their desktops (and mini PCs). I mean, the M1 Pro goes up to 3.2 GHz, which for a laptop is fairly on par. While it is true that Intel/AMD laptop SKUs in a similar class can reach higher clocks, you then get into the fallacy of directly comparing clock speed between different architectures, i.e. one CPU can have a lower clock speed than another but, because of architectural reasons/IPC, the lower-clocked CPU can perform as fast, if not faster (Ryzen, for example, beating Intel SKUs in the same class at lower clocks).

                      In any case, people need to stop pushing the sentiment that the M1 is a "mobile phone CPU that has lower clocks", because it's not. Its clock speed is already well past the "mobile" range. The M1 chip was deliberately designed for higher-power products (i.e. pro laptops and up) and is only loosely based on their A-series architecture; it's not like Apple just shoved a mobile SKU into their laptops.

                      This is the basic flaw in your argument: if SMT were as beneficial as you imply (relative to the die tax and other factors), Apple would have done it. Their laptops are designed to compete head-to-head with Intel/AMD; the M1 is not just meant to be a low-clocked mobile CPU, and it isn't one. Have you even seen the 17 inch M1 Max?

                      I said it before and I'll say it again: Apple's M1 targets the desktop class, not mobile.

                      Originally posted by coder View Post
                      As mentioned in my previous post, the best recent source I found is ~5% additional core size. As for complexity, the number of CPUs and GPUs which have successfully implemented it suggests that it's not very difficult.
                      5% is actually quite a bit. Let me put it a different way: that figure is large enough that it was a primary reason for leaving SMT out of Silvermont. See https://www.anandtech.com/show/6936/...about-mobile/3

                      Originally posted by coder View Post
                      What does multi-core have to do with it?
                      The amount of die space taken up by SMT is proportional to how many cores you have, so more cores means more die space spent if you implement SMT. That can add up, so as I said before, it's much wiser for Apple to use that die space for something else.

                      Originally posted by coder View Post
                      Except they just reused cores from their A14 phone SoCs. So, SMT mightn't even have been on the table.
                      When it comes to CPUs, anything is on the table if it brings good enough performance. As I said before, if SMT were as good as you are implying, Apple would have put it in.

                      When you are a CPU company, you never rule anything out unless you have a very good reason, and in Apple's case, with the resources they have, if SMT were beneficial they would have done it.

                      Originally posted by coder View Post
                      In fact, that could actually be what pushed them to move memory in-package, in which case you've got cause-and-effect reversed. That's what's so treacherous about drawing so many conclusions from one example, in particular. An infinite number of lines cross a single point, so you can extrapolate in any direction you want.
                      I don't know what point you are making here; I was simply stating that there is a laundry list of reasons why SMT is not appropriate for Apple's M1, and that was yet another one for the list (namely that cache misses on the M1 are not as penalizing as on other comparable x86/64 systems).
                      Last edited by mdedetrich; 02 June 2022, 07:26 AM.

