Amazon Graviton3 vs. Intel Xeon vs. AMD EPYC Performance


  • #41
    Originally posted by coder View Post
    This part is almost the very definition of revisionist history.
    "hyperthreading is mainly a result of trying to squeeze out more performance from an older CISC based style ISA that has different word size's for ISA instructions."

    It's revisionist if you simplify my point down until it loses meaning, which is what you did.

    Originally posted by coder View Post
    SMT has a long history, only part of which is overlapped by x86 adoption. As for why they adopted it, you're again fitting your own rationale onto a selective reading of the facts. For the most part, x86 CPUs do quite well at mitigating the frontend bottlenecks by use of micro-op caches. By itself, further mitigating that shouldn't deliver nearly enough benefit to justify it.


    You're deducing that from a limited amount of information. This approach is fraught. We don't know precisely why it hasn't factored bigger into ARM ISA implementations, but I think Apple's cores provide a useful case to examine.
    If you do some research you will notice why SMT hasn't been used much with ARM-based architectures: the ISA doesn't pose the same pipelining issues as x86, which is why the benefit of SMT for ARM is a lot smaller.

    One example in the ISA is conditional execution. In modern ARM, instructions have a feature whereby an individual instruction is not executed unless some condition holds. These instructions can be converted into branch-less sequences, which makes them much easier to pipeline. Modern ARM actually removed a lot of the instructions that relied on predication (the converse of what I just said) because it makes the dependency path of ISA execution very unpredictable.

    When your entire ISA has been designed so that dependency flow is very clear, the benefit of SMT drops massively because everything is trivially pipelinable by design.
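
    A minimal C sketch of the branch-less selection being described (the function name is made up for illustration; whether a compiler actually emits a conditional-select instruction such as CSEL on AArch64 or CMOV on x86-64 depends on the compiler and flags):

    Code:
    /* Branch-free conditional selection. A compiler will often lower the
     * ternary to CSEL on AArch64 or CMOV on x86-64, but it is free to use
     * a branch instead, so treat this purely as an illustration. */
    #include <stdio.h>

    static int max_branchless(int a, int b)
    {
        /* the result is selected through a data dependency, not reached via a branch */
        return (a > b) ? a : b;
    }

    int main(void)
    {
        printf("%d\n", max_branchless(3, 7)); /* prints 7 */
        return 0;
    }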

    Originally posted by coder View Post
    The cores in their M1 SoCs are the same cores they used in earlier phone SoCs. So far, they haven't made dedicated cores only for their M1's.
    I can fairly confidently guarantee that the Apple M1 will not add SMT, even for their higher-end desktop and workstation CPUs. If you have a look at how Apple designed the M1 down to the silicon, it's quite clear this is not going to happen. They basically pushed ARM's RISC design to the extreme in their silicon, to the point where adding SMT would be the complete opposite of the direction they took.


    Originally posted by coder View Post
    I'll say it again: SMT is a technique (or "tool" if you prefer) used to help tackle a variety of issues. In Apple's case, it seems like they found they can keep the backend of their cores adequately busy with a different set of solutions:
    • wide frontend
    • large reorder buffer (indeed, large enough to sometimes even hide full cache hierarchy misses)
    • large caches
    • lower clock speed (which reduces the latency hit of a cache miss, in terms of clock cycles)
    Through these and other tweaks, they don't need SMT, at the scale they've so far been deploying their cores. It does have drawbacks, including some slight power overhead. When mobile is your main focus, anything with a power overhead is immediately a negative.
    And I will say it again: SMT may be a tool, but in the case of ARM it's not a good tool. Even the points in your list demonstrate that you don't understand how Apple designed their silicon. For example, you state "large caches" as if they are always a good thing, but again, a larger cache is necessary for x86-64 because of the difficulty of pipelining CISC-style instructions. For Apple's silicon, it's actually far more beneficial not to have such a massive cache but instead to reduce its latency and improve its speed. A large cache obviously helps (up to a limit), but massive caches are more necessary for x86-64.

    Originally posted by coder View Post
    Apple is also less price-sensitive than others, due to their vertical integration and focus on the premium market. So, they're less concerned with maximizing perf/mm^2 (and, by extension, perf/$) than other CPUs you're comparing against.
    You can argue that any way you want; you could also say that since they are less price-sensitive they have even more ability to research and add SMT into their cores for performance. This is a circular argument.

    Originally posted by coder View Post
    BTW, I can tell you another mobile-first CPU core that doesn't have SMT - Intel's E-cores. And they're x86, with Gracemont having similar IPC as Skylake. So, you can't argue they don't need it by virtue of being slow, or else you'd be arguing that Skylake didn't need it either.
    That's because Intel's E-cores are trading performance for power. I didn't say that SMT is required for x86-64 (that's an obvious strawman); I said that SMT is not necessary for modern ARM because it's a "tool" that is more applicable to solving x86-64's drawbacks when you want to maximize performance, and not every Intel CPU SKU is about maximum performance.
    Last edited by mdedetrich; 30 May 2022, 09:44 AM.



    • #42
      Originally posted by coder View Post
      You're extrapolating from a limited set of examples. That's a flawed experiment.

      A better experiment will be to see how well Intel's E-Core Xeons do, when they launch.
      Well we have several clear datapoints already: Neoverse N1 vs. e.g. Zen 3, and E-cores vs. P-cores. In both cases you can fit 4 small cores in the space of one big SMT-2 core, and in both cases you get twice the performance per area. The only tradeoff is that single-threaded performance is slightly lower.

      Originally posted by coder View Post
      You say that as if that's only because it lacks SMT, which I'm sure you don't really mean. SMT adds just a few % of die area per thread.
      It sounds like you believe the marketing statements... When adding SMT to a core, all resources are effectively halved. You can do a basic low-cost implementation (like early Hyperthreading implementations) but then you get small performance gains which are definitely not worth it. In order to do SMT-2 properly, you need to increase most resources by 25-50% (including L1/L2/L3 caches) and implement competitive sharing without starvation so that both threads can make good progress.

      What gains does SMT get in the latest high-end server chips? EPYC 7763 gets about 14% on SPECINT, Xeon 8380 just 8%! And for FP it is even worse, EPYC gets 5% and Xeon 2.5%... So SMT simply is not area efficient.
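
      For what it's worth, here is the back-of-the-envelope arithmetic behind the perf/area claim above, using only the figures quoted in this thread (the area and performance ratios are the claims under discussion, not measured values):

      Code:
      /* Throughput-per-area comparison using the numbers quoted in this thread:
       * 4 small cores fit in the area of one big SMT-2 core, each small core at
       * roughly half the big core's single-thread performance, and SMT adds ~14%.
       * These are the discussion's claims, not measurements. */
      #include <stdio.h>

      int main(void)
      {
          double big_area   = 1.0;    /* one big SMT-2 core, normalized */
          double big_perf   = 1.0;    /* its single-thread throughput, normalized */
          double smt_uplift = 0.14;   /* ~14% SPECINT gain quoted for EPYC 7763 */

          double small_area = 0.25;   /* claim: 4 small cores per big-core area */
          double small_perf = 0.5;    /* claim: ~half the single-thread performance */

          printf("big SMT-2 core: %.2f perf/area\n", big_perf * (1.0 + smt_uplift) / big_area);
          printf("small core    : %.2f perf/area\n", small_perf / small_area);
          return 0;                   /* prints 1.14 vs 2.00 */
      }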



      • #43
        Originally posted by PerformanceExpert View Post
        SMT
        ARM says hyperthreading can degrade energy efficiency, and that's all. There is no mention of the ISA, because it obviously doesn't matter at all.

        Originally posted by PerformanceExpert View Post
        gets 5% and Xeon 2.5%... So SMT simply is not area efficient
        Look.
        Intel Hyper Threading Performance With A Core i7 On Ubuntu 18.04 LTS - Phoronix

        Hyper Threading Performance & CPU Core Scaling With Intel's Skylake Xeon - Phoronix

        Everything depends on the workload. Sometimes you get a 40% boost, sometimes 0%.



        • #44
          Those are quite old results with low core counts. Here is a more recent test with Zen 2 which runs more benchmarks - average gain of less than 4%. Another with EPYC shows 10-15% gains with Windows and Linux. Yes, you may be lucky if you have workloads that get 40%, but the average gains are pretty marginal on CPUs with high core counts.



          • #45
            Originally posted by mdedetrich View Post
            If you do some research you will notice why SMT hasn't been used much with ARM-based architectures: the ISA doesn't pose the same pipelining issues as x86, which is why the benefit of SMT for ARM is a lot smaller.
            Sounds like you have sources to cite. Either provide evidence or don't attack me for "not doing my research".

            Originally posted by mdedetrich View Post
            One example in the ISA is conditional execution. In modern ARM, instructions have a feature whereby an individual instruction is not executed unless some condition holds.
            Oops, wrong. AArch64 got rid of instruction predication (except for branches). Meanwhile, x86-64 does have cmov.

            Originally posted by mdedetrich View Post
            Modern ARM actually removed a lot of the instructions that relied on predication (the converse of what I just said) because it makes the dependency path of ISA execution very unpredictable.
            Then why are you contradicting yourself?

            Originally posted by mdedetrich View Post
            When your entire ISA has been designed so that dependency flow is very clear, the benefit of SMT drops massively because everything is trivially pipelinable by design.
            Please cite direct evidence of this.

            Originally posted by mdedetrich View Post
            I can fairly confidently guarantee that the Apple M1 will not add SMT, even for their higher-end desktop and workstation CPUs.
            It won't be M1, because that hit the limit of its scalability. It'll be some other SoC. And if they continue reusing phone-oriented cores, then they obviously won't have SMT.

            One thing about SMT is that its benefit increases with core count, because memory latencies do, as well. And when you're not only concerned about perf/W, but also aggregate throughput and perf/$, then SMT becomes much more appealing.
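
            A toy latency-hiding model of that point (the numbers are invented purely for illustration, not taken from any real CPU): the longer a thread spends stalled on memory, the more a second SMT thread has to fill.

            Code:
            /* Toy model: a thread does W cycles of work, then stalls L cycles on a
             * memory miss. One thread keeps the core busy W/(W+L) of the time; a
             * second SMT thread can overlap the first thread's stalls. */
            #include <stdio.h>

            int main(void)
            {
                double W = 100.0;   /* cycles of useful work between misses */
                double L = 300.0;   /* miss latency in cycles */

                double one_thread  = W / (W + L);
                double two_threads = 2.0 * W / (W + L);
                if (two_threads > 1.0)
                    two_threads = 1.0;          /* a core can't be more than 100% busy */

                printf("1 thread : %.0f%% busy\n", 100.0 * one_thread);   /* 25% */
                printf("2 threads: %.0f%% busy\n", 100.0 * two_threads);  /* 50% */
                return 0;
            }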

            Originally posted by mdedetrich View Post
            a larger cache is necessary for x86-64 because of the difficulty of pipelining CISC-style instructions.
            Explain.

            Originally posted by mdedetrich View Post
            For Apple's silicon, it's actually far more beneficial not to have such a massive cache
            Apple SoCs traditionally feature much larger caches than their competitors. Here are three SoCs, all made on the same TSMC 5nm process node (Zen 3 added for comparison):

            Cache   Apple A14          Samsung Exynos 2100   Qualcomm Snapdragon 888   AMD Zen 3
            L1D     128 kB             64 kB                 64 kB                     32 kB
            L2      8 MB per 2 cores   512 kB                1 MB                      512 kB
            L3      ???                4 MB                  4 MB                      4 MB per core

            Originally posted by mdedetrich View Post
            You can argue that any way you want; you could also say that since they are less price-sensitive they have even more ability to research and add SMT into their cores for performance. This is a circular argument.
            That's a strawman argument. Since I didn't say that, posing it and burning it down looks like a diversionary tactic.

            Originally posted by mdedetrich View Post
            That's because Intel's E-cores are trading performance for power.
            Yes, and that's the main reason I'm saying it's not popular among ARM cores. The dominant ARM cores are all mobile-first designs, where perf/W remains the primary objective. Even their server cores are just beefed up mobile cores.



            • #46
              Originally posted by PerformanceExpert View Post
              Well we have several clear datapoints already:
              No, we don't. In case you're not familiar with the scientific method, the goal is to have just one independent variable. It's no good comparing an ARM core without SMT to x86 cores with, because there are too many other factors. That's what I mean by saying your experiment is flawed.

              To test your hypothesis, we need contemporary CPUs of the same ISA, with small non-SMT and big-SMT cores, targeting roughly the same market.

              Furthermore, your steadfast refusal to deny any ARM affiliation calls into question any comparison you draw between ARM and non-ARM cores.

              Originally posted by PerformanceExpert View Post
              It sounds like you believe the marketing statements...
              Not marketing, but technical analysis.

              Originally posted by PerformanceExpert View Post
              When adding SMT to a core, all resources are effectively halved.
              Except not all threads have the same resource demands.

              Originally posted by PerformanceExpert View Post
              you need to increase most resources by 25-50% (including L1/L2/L3 caches)
              You seem to be operating under the premise that SMT allows each thread to run unimpeded, but that's not how it works.

              Originally posted by PerformanceExpert View Post
              What gains does SMT get in the latest high-end server chips? EPYC 7763 gets about 14% on SPECINT, Xeon 8380 just 8%! And for FP it is even worse, EPYC gets 5% and Xeon 2.5%... So SMT simply is not area efficient.
              That's surely workload-dependent, but they sadly didn't break out the sub-scores for SMT vs. non-SMT. SMT is indeed great for low-ILP code. This is one of the first things I tried when I got a Pentium 4 with HT.

              Still, 14% is not a small benefit.
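
              As an illustration of the kind of low-ILP code meant here, a dependent pointer chase, where each load must complete before the next address is even known (hypothetical example, not taken from the benchmarks above):

              Code:
              /* A serial pointer chase: each iteration depends on the previous load,
               * so a wide core runs mostly empty on a single thread. This is the sort
               * of code where SMT tends to help most; actual gains vary by workload. */
              #include <stdlib.h>
              #include <stdio.h>

              struct node { struct node *next; long payload; };

              static long chase(struct node *n, long steps)
              {
                  long sum = 0;
                  while (steps--) {
                      sum += n->payload;
                      n = n->next;      /* the next address comes from the load itself */
                  }
                  return sum;
              }

              int main(void)
              {
                  enum { N = 1 << 16 };
                  struct node *nodes = malloc(N * sizeof *nodes);
                  if (!nodes)
                      return 1;
                  for (long i = 0; i < N; i++) {
                      nodes[i].payload = i;
                      nodes[i].next = &nodes[(i + 1) % N];  /* a ring; a randomized order would add cache misses too */
                  }
                  printf("%ld\n", chase(&nodes[0], 1L << 20));
                  free(nodes);
                  return 0;
              }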



              • #47
                Originally posted by coder View Post
                Sounds like you have sources to cite. Either provide evidence or don't attack me for "not doing my research".
                This is fundamental ISA design. My source would be to actually study the ISAs and understand what SMT means; you will very quickly see why this is the case. Asking me for sources on this is like asking for sources on why Vulkan is a superior API, especially for multi-core game engines: it's because the API is designed in a specific way that enables this. You are not going to find a "source" that says this, because it's a "duh, it's obvious" point to anyone who is technical and works with Vulkan.

                Rather than a source (which you can find yourself), let me explain in a simple way why SMT is not needed for modern ARM and, at the same time, why it is needed for a performant x86-64 core (note that this is simplified; in reality it is more complex). First, for background, we need to explain one of the main problem spaces of modern CPUs: making sure the CPU is constantly fed with instructions to execute. The last thing you want is a CPU sitting there not executing as many instructions as it can in a given clock cycle, especially for the performance SKUs.

                As I explained previously, the ARM ISA literally has control flow embedded directly into the ISA by design (most ARM ISA instructions carry conditional execution where relevant), so let's have a look at x86-64. Due to a number of factors (general CISC design, not being able to deprecate older, badly designed instructions over a 40-year period), a lot of the instructions that x86-64 executes are not easily pipelinable. Put differently, it's incredibly hard to figure out, at any given time, what the control flow of a running program (composed of x86-64 instructions) will be in the future.

                So now we have a problem: engineers have to work with an instruction set that gives very little information (i.e. a lot of the instructions carry no control-flow information), and we want to avoid, as much as possible, a CPU not executing as many instructions as it could. So engineers came up with an ingenious solution, which is basically: "we can't do much beyond our current techniques to solve the problem of the x86-64 ISA not being easily pipelinable, so let's turn the problem around: why don't we give the OS two virtual cores that sit on top of one real physical core, so that if one virtual core isn't constantly feeding the real core instructions, it will get instructions from the second one instead?" Bingo, you now have SMT.
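
                To make the "two virtual cores feeding one physical core" idea concrete, here is a toy simulation (a teaching sketch with invented names, counts and stall patterns, not how real SMT hardware is built):

                Code:
                /* Toy model: two front-end instruction streams share one issue slot per
                 * cycle; when the preferred stream is stalled, the other can still issue.
                 * All numbers here are invented purely for illustration. */
                #include <stdio.h>
                #include <stdbool.h>

                struct stream { const char *name; int remaining; int stall_every; };

                static bool stalled(const struct stream *s, int cycle)
                {
                    return s->stall_every != 0 && cycle % s->stall_every == 0;
                }

                int main(void)
                {
                    struct stream t0 = { "thread0", 6, 3 };  /* stalls every 3rd cycle */
                    struct stream t1 = { "thread1", 6, 0 };  /* never stalls */

                    for (int cycle = 1; t0.remaining > 0 || t1.remaining > 0; cycle++) {
                        struct stream *pick = NULL;
                        if (t0.remaining > 0 && !stalled(&t0, cycle))
                            pick = &t0;                      /* prefer thread0 when it is ready */
                        else if (t1.remaining > 0 && !stalled(&t1, cycle))
                            pick = &t1;                      /* otherwise fill the slot from thread1 */

                        if (pick) {
                            pick->remaining--;
                            printf("cycle %2d: issue from %s\n", cycle, pick->name);
                        } else {
                            printf("cycle %2d: idle\n", cycle);
                        }
                    }
                    return 0;
                }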

                Note that this problem of constantly feeding the CPU with instructions is so important for performance that modern CPUs also implement techniques like speculative, out-of-order execution. Basically, when they hit a branch, the branch predictor guesses which way it will go and the core keeps executing down that path before the condition is actually resolved. Once the condition is known, work from a mispredicted path is simply discarded, and this increases average performance because branch conditions tend to heavily favor one outcome over the other. Tidbit: this same speculative execution is also behind most of the modern CPU security vulnerabilities, because the CPU executes code that architecturally shouldn't run, and by observing the timing side effects of that speculation you can infer properties of a program that you shouldn't be able to see, e.g. when the branch is the result of a password check, timing can leak information about the secret being checked.

                And this is why SMT is largely unnecessary on ARM: simply put, current ARM cores (and especially Apple's M1) don't have this problem of being unable to constantly feed the CPU instructions. Other tools are used to make sure the core is constantly fed (branch prediction, out-of-order execution, etc.), but once all of those techniques are applied, SMT delivers almost no tangible benefit. This is also a result of how the ARM ISA works: its revisions deprecate old instructions that are unsuitable, unlike x86-64, and conveniently enough AArch64 in ARMv8 removed the instructions that have this problem of not being easily pipelinable, and lo and behold, Apple's M1 uses AArch64. In the past there may have been more reason to use SMT, because ARM did have instructions that weren't easily pipelinable, but with modern ARM this is even less the case.

                This is why I am confident that Apple will likely never implement SMT, even in their high-performance workstation CPUs: the Apple M1 does not have this problem of failing to feed the CPU with instructions. They can definitely improve performance, but it will be in other areas (larger caches, increased bandwidth, lower latency, higher clock speeds, better IPC, improved branch prediction/execution techniques, etc.). You can even see in the M1-based Mac minis (which are getting into the desktop space) that performance improvements come in other ways (e.g. fusing cores together). There is still a leftover argument that even with an ISA like AArch64 there may be merit in SMT, because it's impossible to know 100% of the branch paths ahead of time, so SMT can help whenever you have a cache miss from a mispredicted branch, but as mentioned earlier there are far more efficient ways to solve this problem (as evidenced by how powerful the M1 chips are). If you put this in the context of how expensive SMT is in terms of die space (you basically have to implement a multiplexer at the hardware level, since you have to coordinate instructions from different virtual cores onto a single physical core), that die budget is far better spent on other things (cache, accelerators, etc.), which is what the M1 did.

                Fun fact and another tidbit: because x86-64 instructions generally don't carry control-flow information, modern Intel/AMD CPUs don't actually execute x86-64 directly. Instead they translate the x86-64 at runtime into micro-ops, which are RISC-like in design, and the micro-ops are what actually get executed. The mentality behind this is similar to the JVM's JIT: at runtime the CPU tries to determine the best execution strategy, since it's (even) harder to know this ahead of time. There are also other reasons for updatable microcode (e.g. shipping security and errata fixes without physically replacing the silicon; the famous Pentium FDIV bug of the mid-1990s is part of why that capability exists), but this is also one of the main reasons.

                If you have a lot of time and want a real understanding of the problem (and you can code), write an interpreter for AArch64 and you will see while implementing it how easy these pipelining-style optimizations are. Then try to do the same with x86-64 (admittedly the x86-64 ISA is huge, but just do some basic instructions) and you will very quickly see what I am talking about.

                On top of this, there are other features of ARM/AArch64 that make pipelining even easier. For example, AArch64 instructions are a constant 32 bits in size. Knowing that instructions are a constant size is incredibly useful, because ultimately you have to execute these instructions on real hardware with fixed-size buffers, and knowing the exact size of any instruction, and most importantly of any future instruction, avoids problems with fetching and buffering instructions when you don't know how big the next one will be. x86-64 has variable-width instructions, and not knowing where future instructions begin is yet another complication when it comes to optimization.
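
                A tiny sketch of that fixed-width point (the 32-bit words below are arbitrary example values, not meaningful code): with AArch64, instruction i in a fetch block always starts at byte offset 4*i, so several instructions can be located in parallel, whereas with x86-64 you must determine each instruction's length before you know where the next one begins.

                Code:
                /* With a fixed 32-bit encoding, instruction boundaries inside a fetch
                 * block are known up front, so decoders can work on all slots at once. */
                #include <stdint.h>
                #include <stdio.h>

                int main(void)
                {
                    uint32_t fetch_block[4] = { 0xD2800000, 0x91000400, 0xEB01001F, 0x54000041 };

                    for (int i = 0; i < 4; i++) {
                        /* slot i starts at byte 4*i regardless of what the other slots contain */
                        printf("slot %d at byte offset %2d: 0x%08X\n",
                               i, 4 * i, (unsigned)fetch_block[i]);
                    }
                    return 0;
                }
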
                Last edited by mdedetrich; 31 May 2022, 11:09 AM.



                • #48
                  Originally posted by mdedetrich View Post
                  As I explained previously, the ARM ISA literally has control flow embedded directly into the ISA by design (most ARM ISA instructions carry conditional execution where relevant), so let's have a look at x86-64. Due to a number of factors (general CISC design, not being able to deprecate older, badly designed instructions over a 40-year period), a lot of the instructions that x86-64 executes are not easily pipelinable. Put differently, it's incredibly hard to figure out, at any given time, what the control flow of a running program (composed of x86-64 instructions) will be in the future.
                  ARM actually removed nearly all of the conditional instructions as part of the A32->A64 transition. The only conditional instruction remaining is conditional branch. Your statement is correct for A32 and T32, however.

                  Originally posted by mdedetrich View Post
                  Fun fact and another tidbit: because x86-64 instructions generally don't carry control-flow information, modern Intel/AMD CPUs don't actually execute x86-64 directly. Instead they translate the x86-64 at runtime into micro-ops, which are RISC-like in design, and the micro-ops are what actually get executed. The mentality behind this is similar to the JVM's JIT: at runtime the CPU tries to determine the best execution strategy, since it's (even) harder to know this ahead of time. There are also other reasons for updatable microcode (e.g. shipping security and errata fixes without physically replacing the silicon; the famous Pentium FDIV bug of the mid-1990s is part of why that capability exists), but this is also one of the main reasons.
                  High end A64 (Arm64) CPUs decode ARM ISA into micro-ops and execute the micro-ops. The latest cores even include a micro-op cache just like x86-64 - they call it the "L0-decoded" cache.

                  For what it's worth, my take on why/where SMT makes sense is quite different from yours, although we might end up with the same conclusion.

                  The primary value of SMT in a modern CPU is the ability to build a very wide (high peak IPC) core that can efficiently execute both well-optimized code and older / less optimized code. Making good use of a wide core's execution resources can be done in a few different ways:

                  - optimize the code so that a reasonably deep OOO execution engine can find enough ready-to-execute micro-ops in a single instruction stream to keep most/all of the execution resources busy

                  - use SMT so that on average you only have to find enough ready-to-execute micro-ops in a single instruction stream to keep 1/2 or 1/4 of the execution resources busy - or put differently you have to find enough ready-to-execute micro-ops in 2 or 4 instruction streams to keep most/all of the execution resources busy

                  - significantly expand the OOO capabilities of the core to improve the chances of finding enough ready-to-execute micro-ops in a single instruction stream - M1 is the poster child for this but so far the ARM-designed cores are more in line with x86-64. IIRC the Neoverse V-1 core in Graviton3 has a 256 entry re-order buffer, the same as Zen3.
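
                  To make the "find enough ready-to-execute micro-ops in a single instruction stream" idea above concrete, here is a small C illustration (the function names are made up, and a good compiler may already perform this transformation at higher optimization levels): splitting one long dependency chain into several independent ones gives a wide out-of-order core more ready work per cycle.

                  Code:
                  /* Two ways to sum an array. The first is one long dependency chain, so
                   * each add waits for the previous one. The second keeps four independent
                   * partial sums, giving the out-of-order engine more ready micro-ops.
                   * Illustrative only; real gains depend on the core and the compiler. */
                  #include <stdio.h>
                  #include <stddef.h>

                  static double sum_serial(const double *a, size_t n)
                  {
                      double s = 0.0;
                      for (size_t i = 0; i < n; i++)
                          s += a[i];
                      return s;
                  }

                  static double sum_unrolled(const double *a, size_t n)
                  {
                      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
                      size_t i = 0;
                      for (; i + 4 <= n; i += 4) {
                          s0 += a[i];
                          s1 += a[i + 1];
                          s2 += a[i + 2];
                          s3 += a[i + 3];
                      }
                      for (; i < n; i++)   /* handle any leftover elements */
                          s0 += a[i];
                      return (s0 + s1) + (s2 + s3);
                  }

                  int main(void)
                  {
                      double a[1000];
                      for (size_t i = 0; i < 1000; i++)
                          a[i] = (double)i;
                      printf("%.1f %.1f\n", sum_serial(a, 1000), sum_unrolled(a, 1000));
                      return 0;  /* both print 499500.0 */
                  }
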
                  Last edited by bridgman; 31 May 2022, 12:54 PM.



                  • #49
                    Originally posted by bridgman View Post
                    ARM actually removed nearly all of the conditional instructions as part of the A32->A64 transition. The only conditional instruction remaining is conditional branch. Your statement is correct for A32 and T32, however.
                    That's not true, AArch64 still has many conditional instructions such as CSEL, CSET, CCMP etc. These cover the majority of use cases and avoid some of the issues predicated instructions have.
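
                    For readers following along, here is a small C function of the kind AArch64 compilers commonly lower to a CMP + CCMP + CSET sequence instead of branches (the function name is made up, and whether a given compiler actually does this depends on its version and flags):

                    Code:
                    /* A short-circuit condition that can be computed entirely in the flags:
                     * AArch64 compilers often emit CMP + CCMP + CSET for this rather than
                     * branches. Not guaranteed; treat it as an illustration. */
                    #include <stdio.h>

                    static int in_range(int x, int lo, int hi)
                    {
                        /* no data is conditionally written, so no predication is needed */
                        return (x >= lo) && (x <= hi);
                    }

                    int main(void)
                    {
                        printf("%d %d\n", in_range(7, 0, 10), in_range(42, 0, 10)); /* 1 0 */
                        return 0;
                    }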

                    Originally posted by bridgman View Post
                    High end A64 (Arm64) CPUs decode ARM ISA into micro-ops and execute the micro-ops. The latest cores even include a micro-op cache just like x86-64 - they call it the "L0-decoded" cache.
                    All CPUs decode instructions into internal micro-ops - this is not unique to high-end or OoO cores. Recent Arm designs have a micro-op cache but the M1 proves it is not needed for high-end (I bet that dropping Arm/Thumb helps since that means only 32-bit instructions).

                    Originally posted by bridgman View Post
                    For what it's worth, my take on why/where SMT makes sense is quite different from yours, although we might end up with the same conclusion.

                    The primary value of SMT in a modern CPU is the ability to build a very wide (high peak IPC) core that can efficiently execute both well-optimized code and older / less optimized code. Making good use of a wide core's execution resources can be done in a few different ways:

                    - optimize the code so that a reasonably deep OOO execution engine can find enough ready-to-execute micro-ops in a single instruction stream to keep most/all of the execution resources busy

                    - use SMT so that on average you only have to find enough ready-to-execute micro-ops in a single instruction stream to keep 1/2 or 1/4 of the execution resources busy - or put differently you have to find enough ready-to-execute micro-ops in 2 or 4 instruction streams to keep most/all of the execution resources busy

                    - significantly expand the OOO capabilities of the core to improve the chances of finding enough ready-to-execute micro-ops in a single instruction stream - M1 is the poster child for this but so far the ARM-designed cores are more in line with x86-64. IIRC the Neoverse V-1 core in Graviton3 has a 256 entry re-order buffer, the same as Zen3.
                    You'd optimize your code already, and compilers improve considerably over time, but people still want better performance. So then you're back to SMT or going deep and wide. Comparing ROB entries is non-trivial, but Neoverse V1 is much smaller than Zen 3, so the ROB is relatively large for its size (ROB doubled since N1).



                    • #50
                      Originally posted by PerformanceExpert View Post
                      That's not true, AArch64 still has many conditional instructions such as CSEL, CSET, CCMP etc. These cover the majority of use cases and avoid some of the issues predicated instructions have.
                      OK, I see the problem. ARM does not call those "conditional instructions" but rather "unconditionally executed instructions that include condition code information as one of the inputs". Looks like they basically removed all the predication capabilities except for conditional branch but added in a few new "conditional data processing instructions". There are a lot of those opcodes but many of them appear to be aliases of other instructions.

                      My statement about conditional branch being the only remaining conditional instruction came directly from ARM programming materials but apparently some of the terminology changed as well so there are a few other instructions with "Conditional" in their name.

                      Originally posted by PerformanceExpert View Post
                      All CPUs decode instructions into internal micro-ops - this is not unique to high-end or OoO cores. Recent Arm designs have a micro-op cache but the M1 proves it is not needed for high-end (I bet that dropping Arm/Thumb helps since that means only 32-bit instructions).
                      I'm trying to write more concisely but always end up regretting it. Agree that most if not all CPUs decode instructions into internal micro-ops, but lower end CPUs execute those micro-ops immediately while OOO CPUs schedule and execute the micro-ops independently of the fetch/decode activities.

                      Anyways, my point was that mdedetrich's comment about x86-64's use of independently scheduled and executed micro-ops also applies to Arm64.

                      Originally posted by PerformanceExpert View Post
                      You'd optimize your code already, and compilers improve considerably over time, but people still want better performance. So then you're back to SMT or going deep and wide. Comparing ROB entries is non-trivial, but Neoverse V1 is much smaller than Zen 3, so the ROB is relatively large for its size (ROB doubled since N1).
                      Yep, that's fair - although I suspect that a fair amount of the size difference is due to Zen3 being designed to run at significantly higher clocks than N1 on the same fab process.

                      I should note that while compilers tend to improve over time it's less common for SW vendors to recompile and redistribute new binaries unless they are also releasing new or significantly updated SW. It probably is fair to say that x86-64 is impacted by that more than Arm64 simply because the kind of SW that doesn't tend to get new binaries regularly (games) tends to be run a lot more on x86-64 than on Arm64.

                      Anyways, I think we are in agreement - it's not so much that there are fundamental reasons A64 CPUs will never need SMT but more "they've gotten away without it so far". It's not clear to me that the SW running on A64 is necessarily ever going to demand the kind of high peak IPCs that are taken for granted in the x86-64 world so it's possible that "only doing efficiency cores" might suffice.

