**varikonniemi** · 11 December 2023, 03:19 PM

This just shows how much better ARM is compared to x86. Imagine if this Ampere were refreshed and manufactured in the same factory as the EPYCs, what an efficiency benefit it would add to an already superior showing.

**milkylainen** · 11 December 2023, 03:53 PM

Originally posted by varikonniemi

This just shows how much better ARM is compared to x86. Imagine if this Ampere were refreshed and manufactured in the same factory as the EPYCs, what an efficiency benefit it would add to an already superior showing.

It has very little to do with ARMv8/v9 vs x86_64. The CPU's ISA is just one piece of a very large system design.
These direct ISA claims are worth next to nothing unless you can prove instruction set or design efficiency on paper, which, I assure you, you won't.
And even if you could, it's still just one bit of a very big system design.

Let's say you have a more efficient design; you still can't say that it's down to the instruction set.
It's easier to say that the design teams are better and are delivering more efficient designs with their fab capabilities, but claiming that the ISA is the be-all and end-all is just stupid.

**and.elf** · 11 December 2023, 04:15 PM

These aarch64-based platforms really seem ideal for all those websites that aren't compute-heavy. Some requests may take a bit longer, but it doesn't really matter to anyone. A good trade-off indeed.

**coder** · 11 December 2023, 04:22 PM

Originally posted by varikonniemi

This just displays how much better ARM is compared to x86.

Did you bother to look through the results? There are only a handful of cases where it manages to win on efficiency.

Originally posted by varikonniemi

Imagine if this Ampere were refreshed and manufactured in the same factory as the EPYCs,

It's not hard to work out. TSMC publishes the efficiency gains of each process node. According to this, I think it should use about 30% less power at the same performance (i.e. on N5 vs N7).

Source: https://fuse.wikichip.org/news/7048/...-many-flavors/
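As a rough worked example of that estimate, the sketch below applies the ~30% iso-performance power reduction to a package power figure. The 180 W baseline is a made-up placeholder for illustration, not a measured Altra Max number, and the helper names are hypothetical.

```python
# Rough iso-performance scaling sketch. The 30% reduction is the N7 -> N5
# figure quoted in the post; the baseline wattage is a made-up placeholder.

def scaled_power(baseline_watts: float, power_reduction: float) -> float:
    """Estimated power at the same performance after the node change."""
    return baseline_watts * (1.0 - power_reduction)


def perf_per_watt_gain(power_reduction: float) -> float:
    """Relative perf-per-watt gain when performance is held constant."""
    return 1.0 / (1.0 - power_reduction)


baseline_watts = 180.0  # hypothetical N7 package power, illustration only
reduction = 0.30        # ~30% less power at the same performance

print(f"N5 estimate: {scaled_power(baseline_watts, reduction):.0f} W")
print(f"perf/W gain at equal performance: {perf_per_watt_gain(reduction):.2f}x")
```

Under those assumptions the same work would cost roughly 0.7x the power, i.e. about 1.43x the performance per watt, before any design changes.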

**HEL88** · 12 December 2023, 12:28 AM

Originally posted by varikonniemi

This just shows how much better ARM is compared to x86. Imagine if this Ampere were refreshed and manufactured in the same factory as the EPYCs, what an efficiency benefit it would add to an already superior showing.

Performance is no less important than efficiency.

Extensive out-of-order structures, large BTBs, accurate predictors, etc. These all consume power.

That's why in phones the weakest (but most efficient) cores aren't even out-of-order, and architecturally they are closer to an Intel Pentium than a Core Duo.

The Neoverse N1 in the Ampere Altra has all of its structures much smaller and more primitive than Zen 4 or Raptor Cove, hence its higher efficiency. There is no magic here.

That's why Intel, and probably slowly AMD too, will move toward processors built from weaker cores, where multiprocessing proves itself. Single-threaded performance is too expensive in transistors and energy.

E.g. Raptor Cove is 4 times bigger (needs 4 times more transistors) than Gracemont, and has (from what I remember) only ~40% higher IPC.
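To put those two ratios side by side, here is a minimal sketch of the area trade-off. It uses the from-memory figures above as inputs and assumes equal clock speeds and a perfectly parallel workload, neither of which holds exactly in practice.

```python
# Area vs IPC trade-off, using the post's from-memory ratios as inputs.
# Assumes equal clocks and perfect multi-core scaling (both simplifications).

AREA_RATIO = 4.0  # big core area relative to the small core (claimed)
IPC_RATIO = 1.4   # big core IPC relative to the small core (claimed ~40% higher)


def ipc_per_area(ipc: float, area: float) -> float:
    """IPC delivered per unit of die area, in small-core units."""
    return ipc / area


def small_core_throughput_advantage(area_ratio: float, ipc_ratio: float) -> float:
    """Throughput of small cores filling one big core's area, vs the big core."""
    small_cores_that_fit = area_ratio
    return small_cores_that_fit * 1.0 / ipc_ratio


print(f"big core IPC per area:   {ipc_per_area(IPC_RATIO, AREA_RATIO):.2f}")
print(f"small core IPC per area: {ipc_per_area(1.0, 1.0):.2f}")
print(f"throughput advantage of small cores in the same area: "
      f"{small_core_throughput_advantage(AREA_RATIO, IPC_RATIO):.2f}x")
```

With these inputs the big core delivers about 0.35 IPC per unit area against 1.0 for the small core, which is the sense in which single-threaded performance is expensive.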

**linuxgeex** · 12 December 2023, 01:49 AM

An interesting metric would be to throttle the AMD64 products' clock speeds down to the point where they match the Ampere part's geomean performance, and then measure the performance per watt. This is because Intel/AMD server CPUs tend to operate radically more efficiently when you clock them down, i.e. to around 2.4GHz instead of 3.6GHz. The difference, especially for the Intel parts, is huge. And downclocking them by 33% doesn't mean a 33% reduction in performance either, as the memory subsystem keeps operating at full speed, so the relative performance per clock rises.
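A first-order sketch of why downclocking can pay off so well, using the textbook dynamic power relation P ≈ C·V²·f. The assumptions here are mine and are loud ones: voltage is taken to scale linearly with frequency over this range, static and uncore power are ignored, and the 25% performance loss is a hypothetical value chosen only to reflect "less than 33%".

```python
# First-order dynamic power model: P is proportional to C * V^2 * f.
# Assumptions (not measurements): voltage scales linearly with frequency,
# static/uncore power is ignored, and the performance loss is hypothetical.

def dynamic_power_ratio(freq_ratio: float, volt_ratio: float) -> float:
    """Relative dynamic power after scaling frequency and voltage."""
    return (volt_ratio ** 2) * freq_ratio


freq_ratio = 2.4 / 3.6   # clocking down from 3.6GHz to 2.4GHz
volt_ratio = freq_ratio  # assumed: voltage tracks frequency linearly
perf_ratio = 0.75        # hypothetical: 25% slower rather than 33% slower

power_ratio = dynamic_power_ratio(freq_ratio, volt_ratio)
perf_per_watt_gain = perf_ratio / power_ratio

print(f"relative power:       {power_ratio:.3f}")
print(f"relative performance: {perf_ratio:.2f}")
print(f"perf/W gain:          {perf_per_watt_gain:.2f}x")
```

Real parts will differ, since voltage has a floor and the uncore draws power regardless, which is exactly why measuring it would be interesting.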

**trifud** · 12 December 2023, 03:11 AM

We have an Ampere Altra Max M128-30, and my observations are similar to the results of this test. It can't match AMD/Intel in performance, or even Graviton3/3E, but it is very power efficient. It performs well for highly parallel workloads. It would be nice if it had more memory channels. For me personally, this was (and still is, given the pricing of Nvidia Grace) the best system for the role we acquired it for.

**coder** · 12 December 2023, 03:45 AM

Originally posted by HEL88

Performance is no less important than efficiency.

Extensive out-of-order structures, large BTBs, accurate predictors, etc. These all consume power.

That's why in phones the weakest (but most efficient) cores aren't even out-of-order, and architecturally they are closer to an Intel Pentium than a Core Duo.

Apple's E-cores are OoO, and have been for a really long time. They're incredibly efficient, too. There are some energy-saving optimizations you can do if you have all that nice out-of-order machinery. Also, a stalled core is in a higher power state than an idle one. So, if you can reduce stalls to complete the work sooner and get back to idle, it can provide energy savings (though not if you burn too much power in the race to idle).

The thing to keep in mind about ARM's E-cores is they're not only optimizing for energy-efficiency, but also area-efficiency. The two are related, but not synonymous. Apple is willing to build bigger E-cores, in order to make them as efficient as possible (as well as reducing how often you need to wake up the P-cores).

Originally posted by HEL88

The Neoverse N1 in the Ampere Altra has all of its structures much smaller and more primitive than Zen 4 or Raptor Cove, hence its higher efficiency. There is no magic here.

Well... it did come up that Altra's caches can detect an overwrite and avoid the typical write-miss penalty. Best exemplified in their standout Stream Triad performance, in spite of using the exact same speed, number, and type of DIMMs as EPYC.

Source: https://www.anandtech.com/show/16315...altra-review/4

Those cache fetches Altra isn't doing also translate into energy it's not wasting!
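To make the Triad point concrete, here is a simplified traffic model. STREAM's Triad kernel is `a[i] = b[i] + q * c[i]` and the benchmark counts three 8-byte accesses per iteration. With a conventional write-allocate cache, the line holding `a[i]` is also read before being overwritten, which is the extra fetch in question. How Altra actually detects the overwrite is not modelled here; the function name is hypothetical.

```python
# Simplified per-iteration memory traffic for STREAM Triad:
#     a[i] = b[i] + q * c[i]   (double precision)
# Models only the counting argument, not Altra's actual cache mechanism.

BYTES_PER_DOUBLE = 8


def triad_bytes_per_iter(write_allocate: bool) -> int:
    """Bytes moved to/from memory per Triad iteration."""
    reads = 2 * BYTES_PER_DOUBLE   # b[i] and c[i]
    writes = BYTES_PER_DOUBLE      # a[i]
    # A write-allocate cache fetches the destination line before overwriting it.
    read_for_ownership = BYTES_PER_DOUBLE if write_allocate else 0
    return reads + writes + read_for_ownership


with_fetch = triad_bytes_per_iter(write_allocate=True)
without_fetch = triad_bytes_per_iter(write_allocate=False)

print(f"with write-allocate fetch: {with_fetch} bytes/iter")
print(f"fetch avoided:             {without_fetch} bytes/iter")
print(f"extra traffic when not avoided: {with_fetch / without_fetch - 1:.0%}")
```

In this model, skipping the destination fetch cuts traffic from 32 to 24 bytes per iteration, so the same DIMMs can show up to a third more reported Triad bandwidth.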

**HEL88** · 12 December 2023, 04:27 AM

Originally posted by coder

Apple's E-cores are OoO, and have been for a really long time.

And they contain only a fraction of what is in the performance core. After all, this is what I was writing about: the bigger the OoO window, the more accurate the predictor, and the bigger the BTB, the higher the energy consumption.

That's why energy-efficient cores ALWAYS have these structures smaller than performance cores.

Tell me how deep the OoO window is on Apple's E-cores. The performance core's is 640 deep.

Well... it did come up that Altra's caches can detect an overwrite and avoid the typical write-miss penalty.

nice

This does not change the fact that Neoverse N1 is a significantly weaker core, for example:

OoO: 128 vs 320 in Zen 4
Execution ports: 8 vs 14 in Zen 4
etc.

Benchmarked on 4 cores, it performs similarly to the old i5-6600K at 3GHz, which is very weak by today's standards.

[Benchmark charts not preserved: Compilation, File Compression, Render time]

Source: Deep Diving Neoverse N1 – Chips and Cheese

Ampere Altra Max Continues To Deliver Competitive Power Efficiency To AMD EPYC & Intel Xeon
