Ampere Altra Performance Shows It Can Compete With - Or Even Outperform - AMD EPYC & Intel Xeon


  • Originally posted by juanrga View Post
    So what Apple engineers did was to implement a special strong memory model in the hardware, which is activated when running under x86 emulation mode.
    If I understand correctly, mdedetrich was saying that the ported code worked properly when compiled natively to AArch64. It's not clear to me whether it was running natively under macOS, as opposed to in an ARM-mode VM running Linux, in which case maybe the strong ordering still got enabled.

    Originally posted by juanrga View Post
    In the near future, when all the software has been recompiled and the x86-->ARM transition is complete, Apple engineers will just eliminate the Rosetta 2 mode and the hardware will always run in native mode.
    Unless they care about the performance of Windows apps & VMs running on it? How much silicon overhead does the strong ordering add? If it's cheap enough, I'd expect they'd just keep it around.
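
    To make the memory-model point concrete (my own sketch, nothing from Apple's docs), here's the classic message-passing pattern. At the hardware level it happens to hold up on x86, or on Apple silicon with the TSO mode enabled, because stores become visible in program order there; on natively weakly ordered AArch64 the consumer can observe the flag before the data unless release/acquire ordering (or a barrier) is used. Ignoring compiler reordering, which is the other half of the problem, that's the kind of assumption old x86 binaries bake in and why Rosetta 2 leans on the TSO mode.

    // Minimal sketch, assuming a C++17 compiler; not from the article or Apple's docs.
    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;
        // Relaxed store: the C++ model gives no ordering guarantee here.
        // x86 (and Apple's TSO mode) still makes the two stores visible in
        // program order; weakly ordered AArch64 may not.
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) { }
        assert(data == 42);   // can fire on a weakly ordered machine
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join();
        c.join();
    }

    Switching the pair to memory_order_release / memory_order_acquire makes it correct on both architectures, which is the sort of fix (or simply a recompile with proper atomics) that a native AArch64 port needs.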



    • Originally posted by PerformanceExpert View Post
      It's not a contest of who has the most extensions! What matters is the performance difference. Your link shows that on that set of benchmarks the difference on Zen 3 is about 11%. Given no benchmarks are the same between the different runs, we can't conclude anything about how much difference it makes in these tests.
      Yeah, you are a lost cause.



      • Originally posted by PerformanceExpert View Post
        It's not a contest of who has the most extensions! What matters is the performance difference.
        And I listed them so you could get a sense for the level of capabilities that have been added since 2005, and thus appreciate the performance potential.

        Originally posted by PerformanceExpert View Post
        Your link shows that on that set of benchmarks the difference on Zen 3 is about 11%. Given no benchmarks are the same between the different runs, we can't conclude anything about how much difference it makes in these tests.
        Also, the two sets indeed had one overlap. If you apply the same improvement to Coremark, it'd be enough to put Epyc ahead (at 4181009 vs 4079566).

        Originally posted by PerformanceExpert View Post
        It may overturn some of them perhaps, but we don't know. It's possible the difference may only be 1 or 2% on these benchmarks.
        Did you actually look at them? Because the smallest improvement in any of those tests was at least that much!

        Originally posted by PerformanceExpert View Post
        Either way AnandTech has published their results, and as expected, Altra beats the 7742 by 14% single threaded and 7.5% multithreaded on SPECINT (128T vs 64 cores).
        Nice of you to quote the SPEC INT scores and quietly hope nobody noticed that it lost on Float, both ST and MT! That supports what I've been saying about vectorization and really makes it seem like you care more about winning arguments than arriving at the correct answer.

        Originally posted by PerformanceExpert View Post
        So again, like I said, Ampere Altra is now the fastest server in the world
        It depends on what you're trying to do. If your workload roughly matches those SPECINT tests, then yes. But it lost on float, got pwned on Java and NAMD, and lost on LLVM compilation.

        It's nothing like the sort of across-the-board domination you seem to suggest. Heck, compiling code is almost purely integer, yet the Epyc 7742 schooled Altra on the very first of Phoronix's benchmarks -- LLVM compilation!

        A true performance expert follows the data, even when it doesn't match up with their expectations. In fact, in my career, the times I've learned the most are precisely when the data did not align with my expectations. That's when, instead of brushing away the inconsistencies, you dive in to understand them. Sometimes, you'll find a fixable problem, but it's nearly always a good learning opportunity ...if you don't let it pass by.



        • For those of us who've raised questions about pricing, the CPU price list is included in Anandtech's article. They sum it up like this:
          Where Ampere and the Altra definitely is beating AMD in is TCO, or total cost of ownership. Taking the flagship models as comparison points – the Q80-33 costs only $4050 while generally matching the performance of AMD's EPYC 7742, which still comes in at $6950 – essentially 42% cheaper.

          $800 gets you in the door, with a 32-core, 1.7 GHz model. Could possibly tempt a few Pi-heads needing something beefier, if the single-CPU motherboards are cheap enough?
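
          (Sanity-checking AnandTech's figure: 1 - 4050/6950 ≈ 0.42, so the Q80-33 really does list for roughly 42% less than the 7742.)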



          • Originally posted by Weasel View Post
            Yeah, you are a lost cause.
            Happy holidays, Weasel!



            • Originally posted by coder View Post
              And I listed them so you could get a sense for the level of capabilities that have been added since 2005, and thus appreciate the performance potential.
              A list doesn't impress me - I know the performance potential and so I know few applications benefit from AVX512 autovectorization (AMD doesn't even have it and still beats Intel).

              The issue is that you dismiss the fact that the performance potential exists on Arm too. Both could do better if we tweak the settings or tune the code.

              Also, the two sets indeed had one overlap. If you apply the same improvement to Coremark, it'd be enough to put Epyc ahead (at 4181009 vs 4079566).
              Those improvements are for Zen 3 and that is a different microarchitecture. If you hope to get the same gain on this set of benchmarks you will be disappointed. The only option is to run the same set of benchmarks tuned for Zen 2 rather than making extrapolations between unrelated benchmarks and different microarchitectures.

              Nice of you to quote the SPEC INT scores and quietly hope nobody noticed that it lost on Float, both ST and MT! That supports what I've been saying about vectorization and really makes it seem like you care more about winning arguments than arriving at the correct answer.
              Servers are mostly about integer code, especially hyperscalers, so yes, SPECINT is what matters.

              Not that the SPECFP scores are bad at all: SPECFP_rate is just 2% below the 7742, so HPC people will be interested in Altra. Single-threaded FP does not look as good, but per-thread scores are more interesting, as that shows what you get on a loaded server.

              And if you think SPECFP is all about vectorization then why does the Xeon perform so badly? EPYC beats it on single threaded and it doesn't even run at 4GHz!

              It depends on what you're trying to do. If your workload roughly matches those SPECINT tests, then yes. But it lost on float, got pwned on Java and NAMD, and lost on LLVM compilation.
              Check again - NAMD was actually 30.5% faster on Altra. And SPECFP is within 2%, LLVM is within 4%, so only Java looks bad. And that's something that typically gets tuned a lot for each microarchitecture to get the best results.

              It's nothing like the sort of across-the-board domination you seem to suggest. Heck compiling code is almost purely-integer, yet Epyc 7742 schooled Altra on the very first of Phoronix' benchmarks -- LLVM compilation!
              I trust AnandTech's results more - they even went to the trouble to use a RAM disk to avoid differences in IO or SSD performance, and it shows they are very close on compilation.

              A true performance expert follows the data, even when it doesn't match up with their expectations. In fact, in my career, the times I've learned the most are precisely when the data did not align with my expectations. That's when, instead of brushing away the inconsistencies, you dive in to understand them. Sometimes, you'll find a fixable problem, but it's nearly always a good learning opportunity ...if you don't let it pass by.
              I follow the facts and the data, and it's obvious to any expert that Altra is a huge winner here. That's before we consider the lower cost and power usage that is attractive to hyperscalers. Nobody claimed it will win every single benchmark, and it doesn't need to in order to be the best. There are likely major opportunities to tune code for AArch64, and that will only improve it further.



              • Originally posted by coder View Post
                For those of us who've raised questions about pricing, the CPU price list is included in Anandtech's article. They sum it up like this:

                $800 gets you in the door, with a 32-core, 1.7 GHz model. Could possibly tempt a few Pi-heads needing something beefier, if the single-CPU motherboards are cheap enough?

                https://www.anandtech.com/show/16315...altra-review/9
                There will be expensive high-end desktops using Altra like the previous generation. There won't ever be a Pi with this CPU since obviously there is no enthusiast market for $10,000 boards! In a few years the Pi will likely use the 8cx, chips using the Cortex-A78C, or whatever else happens to be cheap and available in small volume.



                • Originally posted by coder View Post
                  Nice of you to quote the SPEC INT scores and quietly hope nobody noticed that it lost on Float, both ST and MT! That supports what I've been saying about vectorization and really makes it seem like you care more about winning arguments than arriving at the correct answer.
                  Ampere implements N1 cores with 128-bit NEON units. AMD Rome implements Zen 2 cores with 256-bit AVX units.

                  However, most HPC code is memory-bound, and Ampere would perform better due to its superior bandwidth-per-FLOP ratio. In fact, in stock settings Rome ties with Ampere; it is only in the mixT settings that Rome gets an extra 13% of performance. That mix means disabling SMT on the Zen 2 cores to reduce cache and memory bottlenecks in memory-bound code.
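
                  As a rough illustration of the BW/FLOP point (my own sketch, not one of the SPEC kernels): a STREAM-triad style loop does about 2 flops per 24 bytes of memory traffic, so once the arrays spill out of cache it is limited by DRAM bandwidth, and 128-bit NEON vs 256-bit AVX barely changes the result.

                  // Minimal sketch: bandwidth-bound triad, roughly 0.08 flop/byte.
                  #include <cstddef>
                  #include <vector>

                  void triad(std::vector<double>& a,
                             const std::vector<double>& b,
                             const std::vector<double>& c,
                             double scalar) {
                      // Each iteration: 2 flops but 3 doubles (24 bytes) of traffic,
                      // so for large n wider SIMD units mostly just wait on DRAM.
                      for (std::size_t i = 0; i < a.size(); ++i)
                          a[i] = b[i] + scalar * c[i];
                  }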
                  Last edited by juanrga; 18 December 2020, 02:15 PM.



                  • Originally posted by PerformanceExpert View Post
                    The issue is that you dismiss the fact that the performance potential exists on Arm too. Both could do better if we tweak the settings or tune the code.
                    I did no such thing. All I did was point out what should be obvious: the fact that there's a substantial disparity between the two.

                    Originally posted by PerformanceExpert View Post
                    Those improvements are for Zen 3 and that is a different microarchitecture. If you hope to get the same gain on this set of benchmarks you will be disappointed.
                    Nonsense. The vast majority of the improvement comes from ISA extensions not present in baseline x86-64. Zen3 added no new instructions, AFAICT. The only other things affected by that option are cost tables that tweak instruction generation, but those sorts of things typically have a very minor effect, as shown by the difference between the zen2 and zen3 optimizations. You need a substantial departure in micro-architecture, like switching all the way to haswell, to reach a point where performance takes a real hit. Even so, haswell is quite often much faster than baseline, likely since it shares virtually all the same ISA extensions.
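
                    To make that concrete (a toy sketch of mine, not one of the OpenBenchmarking tests): the loop below compiled for baseline x86-64 can only use SSE2, while adding something like -march=znver2 or -march=haswell lets the compiler emit 256-bit AVX2 plus FMA for the exact same source, and that ISA difference is where the bulk of the gains in those -march comparisons comes from.

                    // Sketch only: e.g. g++ -O3 scale_add.cpp          -> SSE2, no FMA
                    //              vs  g++ -O3 -march=znver2 ...       -> 256-bit AVX2 + FMA
                    #include <cstddef>

                    void scale_add(float* __restrict out,
                                   const float* __restrict a,
                                   const float* __restrict b,
                                   std::size_t n) {
                        for (std::size_t i = 0; i < n; ++i)
                            out[i] = 2.0f * a[i] + b[i];   // vectorizes; fuses to FMA when the target has it
                    }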

                    Originally posted by PerformanceExpert View Post
                    Servers are mostly about integer code, especially hyperscalers, so yes, SPECINT is what matters.
                    And what is LLVM compilation? You think that's floating point, or what?

                    Originally posted by PerformanceExpert View Post
                    Not that SPECFP scores are bad at all, SPECFP_rate is just 2% below 7742, so HPC people will be interested in Altra.
                    I didn't say it was bad.

                    Originally posted by PerformanceExpert View Post
                    per-thread scores are more interesting as that shows what you get on a loaded server.
                    Per-thread is pretty irrelevant, unless you're paying by the vCPU, like how Amazon charges.

                    Originally posted by PerformanceExpert View Post
                    And if you think SPECFP is all about vectorization then why does the Xeon perform so badly? EPYC beats it on single threaded and it doesn't even run at 4GHz!
                    I didn't say it's only about vectorization, just that it's relevant.

                    Originally posted by PerformanceExpert View Post
                    Check again - NAMD was actually 30.5% faster on Altra.
                    True. I didn't notice that they only had the single-CPU Altra score on there, so what I saw was 2x Epyc vs. 1x Altra. It seems that was due to problems he encountered with the ARM version of the MPI library.

                    Originally posted by PerformanceExpert View Post
                    And SPECFP is within 2%, LLVM is within 4%, so only Java looks bad.
                    But you didn't say it was merely close. You're talking all about how it dominates, and the truth is that it really doesn't. It's competitive and that's already quite an accomplishment. But that's all it is. Sure, it has some strengths, but also its share of weaknesses.

                    Originally posted by PerformanceExpert View Post
                    I trust AnandTech's results more - they even went to the trouble to use a RAM disk to avoid differences in IO or SSD performance,
                    Michael said he used the same SSD for all machines. I definitely checked that. Anyway, it still doesn't win on LLVM compilation, and that strikes me as a decent approximation of a server workload.

                    Originally posted by PerformanceExpert View Post
                    I follow the facts and the data, and it's obvious to any expert that Altra is a huge winner here. That's before we consider the lower cost and power usage that is attractive to hyperscalers. Nobody claimed it will win every single benchmark, and it doesn't need to in order to be the best. There are likely major opportunities to tune code for AArch64, and that will only improve it further.
                    Yeah, okay. You're hopeless. Even Anandtech is not that big on the power savings. You should be getting paid for pumping ARM/Altra so hard.

                    The funny thing is that I'm not even down on ARM. I think we even broadly agree, but it's clear that you have some kind of agenda and are trying to spin a real accomplishment into something that it's not.

                    BTW, another thing Anandtech critiques is their mesh interconnect. I'm expecting to see both the power efficiency and per-core performance drop significantly on the 128-core chip. It'll probably still edge out Milan, but it's also going to cost more than this 80-core version. So, I doubt it'll be a more compelling proposition than the 80-core Altra is today.



                    • Originally posted by juanrga View Post

                      Ampere implements N1 cores with 128-bit NEON units. AMD Rome implements Zen 2 cores with 256-bit AVX units.

                      However, most HPC code is memory-bound, and Ampere would perform better due to its superior bandwidth-per-FLOP ratio. In fact, in stock settings Rome ties with Ampere; it is only in the mixT settings that Rome gets an extra 13% of performance. That mix means disabling SMT on the Zen 2 cores to reduce cache and memory bottlenecks in memory-bound code.
                      Agreed, Altra wins FPrate in 1S and is 2% slower with 2S, plus it has a big 30% win on NAMD. Those wide AVX units don't seem to help very much at all!

                      You can disable SMT for performance/security, and this improves FP code but you'll lose a lot more on integer: Altra is 26% faster than EPYC without SMT on SPECINT_rate.
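
                      (For example, on Linux that's booting with nosmt on the kernel command line, or writing "off" to /sys/devices/system/cpu/smt/control on a running system.)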

