Amazon Graviton3 vs. Intel Xeon vs. AMD EPYC Performance


  • coder
    replied
    Originally posted by mdedetrich View Post
It's not revisionist, it's a fact.
    This part is almost the very definition of revisionist history.

"hyperthreading is mainly a result of trying to squeeze more performance out of an older CISC-style ISA that has different word sizes for ISA instructions."

    SMT has a long history, only part of which is overlapped by x86 adoption. As for why they adopted it, you're again fitting your own rationale onto a selective reading of the facts. For the most part, x86 CPUs do quite well at mitigating the frontend bottlenecks by use of micro-op caches. By itself, further mitigating that shouldn't deliver nearly enough benefit to justify it.

    Originally posted by mdedetrich View Post
The number of ARM CPUs that have had SMT in their entire history you could probably count on one hand. The architecture really doesn't need it; people experimented with it and found it isn't necessary.
You're deducing that from a limited amount of information. This approach is fraught. We don't know precisely why it hasn't factored more heavily into ARM ISA implementations, but I think Apple's cores provide a useful case to examine.

    Originally posted by mdedetrich View Post
Which is why Apple M1s in laptops/Mac Minis have SMT....
The cores in their M1 SoCs are the same cores they used in earlier phone SoCs. So far, they haven't made dedicated cores just for the M1.

    Originally posted by mdedetrich View Post
    (and yes before you answer, Apple would have totally put SMT into their M1 processor if the performance increase was worth it).
    I'll say it again: SMT is a technique (or "tool" if you prefer) used to help tackle a variety of issues. In Apple's case, it seems like they found they can keep the backend of their cores adequately busy with a different set of solutions:
    • wide frontend
    • large reorder buffer (indeed, large enough to sometimes even hide full cache hierarchy misses)
    • large caches
    • lower clock speed (which reduces the latency hit of a cache miss, in terms of clock cycles)
Through these and other tweaks, they don't need SMT at the scale they've so far been deploying their cores. SMT does have drawbacks, including some slight power overhead, and when mobile is your main focus, anything with a power overhead is immediately a negative.
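To make the last bullet concrete, here's a minimal Python sketch of the cycle cost of a memory miss at different clock speeds. The 100 ns miss latency and both clock figures are illustrative assumptions, not measurements of any real core.

```python
# Why a lower clock reduces the latency hit of a cache miss, counted
# in cycles. All numbers here are illustrative assumptions.

MISS_LATENCY_NS = 100  # assumed full cache-hierarchy miss latency

def miss_cost_cycles(clock_ghz: float) -> float:
    """Core cycles elapsed while one memory miss is outstanding."""
    return MISS_LATENCY_NS * clock_ghz

low_clock = miss_cost_cycles(3.2)   # mobile-style core
high_clock = miss_cost_cycles(5.0)  # desktop-style core

print(f"3.2 GHz: {low_clock:.0f} cycles per miss")
print(f"5.0 GHz: {high_clock:.0f} cycles per miss")
# A reorder buffer that can cover ~320 cycles of in-flight work hides
# the miss at 3.2 GHz; at 5.0 GHz it would need ~500 cycles of depth.
```

The same wall-clock miss simply costs fewer cycles at a lower clock, which is part of why a deep reorder buffer can stand in for SMT in a lower-clocked design.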

Apple is also less price-sensitive than others, due to their vertical integration and focus on the premium market. So, they're less concerned with maximizing perf/mm^2 (and, by extension, perf/$) than the vendors of the other CPUs you're comparing against.

BTW, I can tell you another mobile-first CPU core that doesn't have SMT: Intel's E-cores. And they're x86, with Gracemont having similar IPC to Skylake. So, you can't argue they don't need it by virtue of being slow, or else you'd be arguing that Skylake didn't need it either.



  • leio
    replied
    Originally posted by coder View Post
    Edit: in newer versions of GCC -mcpu is deprecated! The manual even states that:
    Specifying -march=cpu-type implies -mtune=cpu-type.


    Source: https://gcc.gnu.org/onlinedocs/gcc-1...ml#x86-Options
Don't draw conclusions about AArch64 GCC options from GCC's x86 options documentation. https://gcc.gnu.org/onlinedocs/gcc-1...4-Options.html
-mcpu=native is correct on AArch64, as you initially wrote.
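As a sketch of the per-target flag difference (the `bench.c` file and `-O2` level here are placeholders, not anything from this thread):

```shell
# AArch64: -mcpu sets both the architecture and the tuning model for
# the host CPU, and is the documented, idiomatic choice:
gcc -O2 -mcpu=native -o bench bench.c

# x86-64: -mcpu is a deprecated alias for -mtune; use -march instead,
# which implies -mtune for the same cpu-type:
gcc -O2 -march=native -o bench bench.c
```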



  • dahauns
    replied
    Originally posted by mdedetrich View Post

Well, ARM CPUs don't tend to have hyperthreading, because hyperthreading is mainly a result of trying to squeeze more performance out of an older CISC-style ISA that has different word sizes for ISA instructions. The ARM ISA doesn't have this issue, so AFAIK there isn't any modern ARM CPU with hyperthreading/SMT.
Sooo...because ARM doesn't suffer the same frontend bottlenecks as an "older CISC-style ISA", it doesn't need a feature that helps alleviate bottlenecks everywhere *but* the frontend.
    That's...surreal in its idiocy.



  • PerformanceExpert
    replied
    Originally posted by AdrianBc View Post
Sorry, but you replied without carefully reading the text you were replying to, so you have responded to something that was not said.
You were saying that it's the Cortex-A76 that is slow, rather than the RK3588 board. It's obviously faster and more modern than a Pi 4, but it's still a low-end chip, so it's not an indication of the maximum performance of the A76 - one could make a desktop chip with larger caches and far better performance, something like Ampere Altra.

    I have not compared Cortex-A76 with Graviton 2, but with *Graviton 3*, i.e. with Neoverse V1, which has about the same speed as Cortex-X1 and which is about twice as fast in single-thread as Cortex-A76.
    I have no idea where you get that 2x from...

Graviton 2 uses the same microarchitecture as Cortex-A76, so they would obviously have similar performance when using the same memory system. Cortex-X1 is about 53% faster than Cortex-A76. Arm claims Neoverse V1 is about 48% faster than N1 on SPEC - these tests show 43% on a wide variety of benchmarks. These speedups are all consistent.

The Neoverse N1 variant of Cortex-A76 has better caches, which make it somewhat faster. It may have acceptable single-thread speed compared with server CPUs running at low clock frequencies, but Cortex-A76 is much weaker compared with laptop or desktop CPUs at high clock frequencies: it has only about two thirds of the speed of old Skylake or Zen 2 CPUs, while newer CPUs like Zen 3, Tiger Lake or Alder Lake are 2.5 to 3 times faster in single-thread.
A 5950X is about 2.5 times faster than the Kirin 990 5G on SPEC (see my link). Altra gets 91% of the single-threaded performance of the 7763, so ~65% of the higher-clocked 5950X. Given that the microarchitectures are almost identical, it's the larger L3 cache and better memory system that make Altra so much faster than Kirin. I.e., Altra shows what a desktop chip using Cortex-A76 could achieve.
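To sanity-check the arithmetic, here's a quick Python sketch using the rough figures cited in this thread; the 1.4x single-thread clock ratio between a 5950X and a 7763 is my assumption, not a measured value.

```python
# Cross-check of the approximate speedups cited above. All figures are
# the rough numbers from this thread, not fresh measurements.

x1_over_a76 = 1.53        # Cortex-X1 vs Cortex-A76 (per Arm)
v1_over_n1_claim = 1.48   # Neoverse V1 vs N1, Arm's SPEC claim
v1_over_n1_tests = 1.43   # what these benchmarks show

# N1 is a server tweak of A76, so X1-vs-A76 and V1-vs-N1 land in the
# same ~1.4-1.5x band -- nowhere near a "twice as fast" gap.
assert abs(x1_over_a76 - v1_over_n1_claim) < 0.1

# Altra vs 5950X: 91% of an EPYC 7763's single-thread score, with the
# 5950X assumed to clock ~40% higher than the 7763.
altra_vs_7763 = 0.91
clock_ratio_5950x = 1.4   # assumption: 5950X ST clock / 7763 ST clock
print(f"Altra vs 5950X single-thread: ~{altra_vs_7763 / clock_ratio_5950x:.0%}")
```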



  • PerformanceExpert
    replied
    Originally posted by smitty3268 View Post

    I think it just depends on what you are curious about comparing.
    • 1-core vs 1-core of each architecture. In this case, I think an SMT "core" should be included for x86 for anything able to take advantage of multi-threading, but there should also be single-threaded tests where it is useless.
    • Max multi-threading performance of the largest chips with the most cores available for each architecture.
    • 2 chips that each use the same amount of power, at least approximately.
    • 2 chips that cost the same amount of money, at least approximately.
    • 2 chips that have the same amount of performance, at least approximately.
    • The most power-efficient chips of each architecture. Or most cost-effective.
    • The most popularly used chips of each.
    • Probably some others too...
What I don't think makes any sense is taking an 8-core chip with SMT and directly comparing it to a 16-core chip just because they can both run 16 threads. If one of the other things above matches, then sure. But thread count alone is a poor reason for a comparison on its own.
Most of the above are indeed interesting for us geeks. However, the comparison with equal thread counts is valid too, since AWS sells instances this way, at similar price points and with similar performance, matching your comparisons. Generally the hyperscalers prefer lots of cores instead of SMT, because two cores are smaller and faster than one big SMT-2 core (plus there is the security argument and the predictable speedup). The Graviton 3 die area is about 450-500 mm^2, quite small and cost-effective compared to EPYC.



  • mdedetrich
    replied
    Originally posted by coder View Post
As HEL88 already pointed out, plenty of RISC CPUs had it, not least DEC's legendary Alpha EV8.

You seem to be taking a very narrow and revisionist view of SMT. SMT is an architectural technique that can be employed for different reasons. GPUs famously employ it quite heavily, and do so for the purpose of hiding memory latency. CPUs can use it for that, but also to mitigate the impact of low-ILP code (as HEL88 pointed out) and mispredicted branches. Finally, in more recent x86 CPUs, we see that the frontend can be a bottleneck, creating a further opportunity for SMT to help save the day.
It's not revisionist, it's a fact. The number of ARM CPUs that have had SMT in their entire history you could probably count on one hand. The architecture really doesn't need it; people experimented with it and found it isn't necessary.

    Originally posted by coder View Post
    ARM CPUs can certainly benefit from SMT, but the amount of benefit depends greatly on the size of the core. The bigger and more capable the core, the more win there is to be had by adding SMT. So far, ARM cores have been comparatively small. That, in conjunction with the negative press SMT has recently garnered on the security front, are the likely reasons ARM's cloud-oriented cores haven't yet joined the SMT set.
Which is why Apple M1s in laptops/Mac Minis have SMT.... (and yes before you answer, Apple would have totally put SMT into their M1 processor if the performance increase was worth it).
    Last edited by mdedetrich; 29 May 2022, 08:15 AM.



  • coder
    replied
    Originally posted by mdedetrich View Post
Well, ARM CPUs don't tend to have hyperthreading, because hyperthreading is mainly a result of trying to squeeze more performance out of an older CISC-style ISA that has different word sizes for ISA instructions. The ARM ISA doesn't have this issue, so AFAIK there isn't any modern ARM CPU with hyperthreading/SMT.
    As HEL88 already pointed out, plenty of RISC CPUs had it, not least DEC's legendary Alpha EV8.

    You seem to be taking a very narrow and revisionist view of SMT. SMT is an architectural technique that can be employed for different reasons. GPUs famously employ it quite heavily, and do so for the purpose of hiding memory latency. CPUs can use it for that, but also to mitigate the impact of low-ILP code (as HEL88 pointed out) and mispredicted branches. Finally, in more recent x86 CPUs, we see that the frontend can be a bottleneck, creating a further opportunity for SMT to help save the day.

    ARM CPUs can certainly benefit from SMT, but the amount of benefit depends greatly on the size of the core. The bigger and more capable the core, the more win there is to be had by adding SMT. So far, ARM cores have been comparatively small. That, in conjunction with the negative press SMT has recently garnered on the security front, are the likely reasons ARM's cloud-oriented cores haven't yet joined the SMT set.



  • AdrianBc
    replied
    Originally posted by PerformanceExpert View Post

    No, Cortex-A76 is basically the same core as used in Graviton 2 and Ampere Altra (Max). It achieves ~91% of single-threaded SPECINT2017 of EPYC 7763. So it is a pretty quick core despite being 4 years old... The problem with most of these boards is that they use cost optimized phone CPUs with very little cache and a slow memory system.

Sorry, but you replied without carefully reading the text you were replying to, so you have responded to something that was not said.

    I have not compared Cortex-A76 with Graviton 2, but with *Graviton 3*, i.e. with Neoverse V1, which has about the same speed as Cortex-X1 and which is about twice as fast in single-thread as Cortex-A76.

It would have made no sense to compare it with Graviton 2 when we are commenting on the benchmark results for *Graviton 3*.


The Neoverse N1 variant of Cortex-A76 has better caches, which make it somewhat faster. It may have acceptable single-thread speed compared with server CPUs running at low clock frequencies, but Cortex-A76 is much weaker compared with laptop or desktop CPUs at high clock frequencies: it has only about two thirds of the speed of old Skylake or Zen 2 CPUs, while newer CPUs like Zen 3, Tiger Lake or Alder Lake are 2.5 to 3 times faster in single-thread.


As I have said, Cortex-A76 has essentially the same speed as the Intel Jasper Lake Pentium or Celeron processors that are its alternative in cheap computers, but with the RK3588 you might still save about $100 on a complete computer vs. the Intel variant.

Besides being cheaper, for software developers who do low-level programming the RK3588, i.e. Cortex-A76, has the advantage of a much more modern ISA, Armv8.2-A, vs. Intel Jasper Lake, which, even though it launched at the start of 2021, still uses an ancient ISA that was already obsolete a decade ago among Intel products (i.e. it lacks AVX, FMA, BMI, etc.).

    Last edited by AdrianBc; 29 May 2022, 02:25 AM.



  • coder
    replied
    Originally posted by BlueSwordM View Post
Still, if they can only get similar performance on TSMC N5, I fear Sapphire Rapids and especially Genoa will obliterate Graviton3, especially on a cloud provider with fair pricing instead of the vCPU BS.
Michael did not normalize for power (he lacks the relevant information to do so). However, Amazon claims that a 64-core Graviton3 runs at 100 W, whereas we know 64-core EPYC and 40-core Ice Lake Xeons both run well into 200 W territory. Even if Amazon isn't running them at peak clocks, they're absolutely using significantly more than 100 W.

Depending on your concern, the comparison isn't necessarily fair without taking perf/W into account - certainly not if you want the most direct comparison between the respective cores.
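A back-of-the-envelope perf/W sketch in Python: equal throughput is assumed (the benchmarks showed the chips roughly comparable), and the 250 W figure for the x86 parts is my placeholder within "well into 200 W territory", not a measured value.

```python
# Back-of-the-envelope perf/W. Both the equal-performance assumption
# and the 250 W x86 figure are placeholders, not benchmark results.

def perf_per_watt(perf: float, watts: float) -> float:
    return perf / watts

g3 = perf_per_watt(1.0, 100)    # Graviton3 at Amazon's claimed 100 W
x86 = perf_per_watt(1.0, 250)   # 64-core EPYC / 40-core Ice Lake Xeon

print(f"Graviton3 perf/W advantage at equal throughput: ~{g3 / x86:.1f}x")
```

Even under these crude assumptions, the 100 W claim implies a large efficiency edge at equal throughput, which is why power normalization matters for the comparison.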



  • smitty3268
    replied
    Originally posted by PerformanceExpert View Post
    The same number of cores doesn't give a good comparison either, you'd have to look at performance per area (and power) in the same process. Consider for example that the Gravitons have less than half the L2/L3 cache of the EPYC instances.

    But yes, for customers the only thing that ultimately matters is perf/$, and that's exactly why Graviton is getting so popular.
    I think it just depends on what you are curious about comparing.
    • 1-core vs 1-core of each architecture. In this case, I think an SMT "core" should be included for x86 for anything able to take advantage of multi-threading, but there should also be single-threaded tests where it is useless.
    • Max multi-threading performance of the largest chips with the most cores available for each architecture.
    • 2 chips that each use the same amount of power, at least approximately.
    • 2 chips that cost the same amount of money, at least approximately.
    • 2 chips that have the same amount of performance, at least approximately.
    • The most power-efficient chips of each architecture. Or most cost-effective.
    • The most popularly used chips of each.
    • Probably some others too...
What I don't think makes any sense is taking an 8-core chip with SMT and directly comparing it to a 16-core chip just because they can both run 16 threads. If one of the other things above matches, then sure. But thread count alone is a poor reason for a comparison on its own.
    Last edited by smitty3268; 28 May 2022, 07:30 PM.

