Announcement

**coder** · 18 December 2020, 02:49 PM

Originally posted by PerformanceExpert View Post

There won't ever be a PI with this CPU

***sigh***

I did not say there would be a Pi. I said some Pi-users who are itching for something much more powerful could be tempted (i.e. to buy/bouild a workstation with the 32-core version). Just an off-hand remark to highlight how relatively affordable it was. I mean, that's comparable to a Ryzen 9 5950X, although the mobo will be more expensive and you'll probably need RDIMMs.

Originally posted by PerformanceExpert View Post

In a few years PI will likely use 8cx, chips

Raspberry Pi seems wed to Broadcom and maintaining an entry price of $35. However, I'm looking forward to somebody making a NUC-like mini PC, with that SoC or similar.

**coder** · 18 December 2020, 02:55 PM

Originally posted by PerformanceExpert View Post

You can disable SMT for performance/security, and this improves FP code but you'll lose a lot more on integer: Altra is 26% faster than EPYC without SMT on SPECINT_rate.

There's no need to disable SMT for security -- Google has a patch that avoids scheduling foreign threads on the same physical core:

Google Publishes Latest Linux Core Scheduling Patches So Only Trusted Tasks Share A Core - Phoronix

https://www.phoronix.com/scan.php?page=news_item&px=Google-Core-Scheduling-v9

Phoronix, Linux Hardware Reviews, Linux hardware benchmarks, Linux server benchmarks, Linux benchmarking, Desktop Linux, Linux performance, Open Source graphics, Linux How To, Ubuntu benchmarks, Ubuntu hardware, Phoronix Test Suite

**PerformanceExpert** · 18 December 2020, 03:05 PM

Originally posted by coder View Post

I did no such thing. All I did was point out what should be obvious: the fact that there's a substantial disparity between the two.

Yes you did so, repeatedly, listing those stupid feature lists, various name calling, and even your sentence just above. Claiming there is a substantial disparity is absolutely and utterly ridiculous. These are simply base runs which is how most real-world software is built. Nothing wrong with that. You are so biased that you don't even consider the possibility that all those extensions might not have much effect on this particular set of benchmarks. Or that Arm has a similar disadvantage. You can always tune further.

**coder** · 18 December 2020, 11:20 PM

Originally posted by PerformanceExpert View Post

Yes you did so, repeatedly, listing those stupid feature lists, various name calling, and even your sentence just above.

Sounds like your feelings are hurt. My sympathies... or maybe I was just onto something and you're trying to deflect.

Originally posted by PerformanceExpert View Post

Claiming there is a substantial disparity is absolutely and utterly ridiculous.

Okay, why do you feel they should be equivalent? What's your rationale for assuming they are, or are you simply taking the position that you're going to make the most favorable assumptions until shown evidence to the contrary, while heaping scorn and derision on anyone who questions that stance? I would have to question the expertise of anyone in the field of performance measurement and optimization who prizes assumptions and ignorance above logic, reason, and data.

Originally posted by PerformanceExpert View Post

These are simply base runs which is how most real-world software is built. Nothing wrong with that. You are so biased that you don't even consider the possibility that all those extensions might not have much effect on this particular set of benchmarks.

That's not accurate. I use deep learning in the real world, and no one I know who does anything on CPUs would use a stock x86-64 distro build of a deep learning framework. It's stupid to let that wide AVX/AVX2/AVX-512 unit go to waste and just limp along with SSE/SSE2.

Furthermore, I'm pretty confident the same is true of most HPC users. You don't spend millions on a big supercomputer, and then waste most of your chips' computational resources just because you can't be bothered to rebuild a couple libraries. I'm sure it happens, but anyone who's waiting a substantial amount of time for their code to run is naturally going to do whatever they can to speed it up.

Originally posted by PerformanceExpert View Post

Or that Arm has a similar disadvantage. You can always tune further.

Explain how this could be so. In most cases, you can't just substitute single-precision arithmetic with half-precision. So, where is ARMv8.2-A code going to nearly double its floating point throughput, in the same way as going from 128-bit SSE to 256-bit AVX? And what about the further improvements from FMA? And besides half-precision arithmetic, what other instructions does ARMv8.2-A have that offer potential performance improvements over 8.0? When I look at the list, I just don't see much else. The saturating multiply, atomics, and CRC instructions seem aimed at niche use cases.

Here's what RealWordTech said about some of the x86-64 ISA extensions I mentioned:

With AVX, each Sandy Bridge core can have up to 2X the FLOP/cycle

AVX2 doubles integer SIMD to 256-bits, while FMA doubles the number of operations for FP by chaining a multiply and add. Crucially, AVX2 also encompasses instructions for gathering non-contiguous data from memory, which aids compilers and programmers using the x86 SIMD extensions.

As for some of the more minor extensions:

Bit Manipulation Instructions would seem to primarily benefit high-speed networking and perhaps imaging operations.
F16C enables the same sort of support for half-precision load/store that ARMv8-A already has!
AVX-512 doubles the number and width of vector registers
Deep Learning Boost adds VNNI AVX-512 instructions, which are comparable to ARMv8.2-A's saturating fp multiply-accumulate, except also with integer support. The latter point is important, because inferencing is principally done with integer arithmetic.

Now, as you pointed out, there are the Anandtech benchmarks. Maybe you're trying to go back and focus on hypotheticals, because that's easier ground to hold than trying to defend the actual benchmarks that they reported, which aren't quite as rosy as you'd like.

**mdedetrich** · 19 December 2020, 04:14 AM

Originally posted by coder View Post

If I understand correctly, mdedetrich was saying that the ported code worked properly when compiled natively to AArch64 code. Now it's not clear to me that it was running natively under MacOS, as opposed to a ARM-mode VM running Linux, in which case maybe the strong-ordering still got enabled.

Correct, this is apparently the case. Here is the tweet I was referring to

https://twitter.com/djspiewak/status/1334002663361417217

**baryluk** · 19 December 2020, 11:46 AM

Michael,

presentation of these benchmarks is really confusing.

I.e. LZ4 benchmarks. Are these single core and thread, or multiple threads (how many), or multiple independent processes? Hard to tell.

Similar with some other benchmarks, i.e. the encoding ones. Are these using fixed thread counts, or as much as possible, what is the effective number of cores utilized? Are they working on same video stream, or separate independent ones, etc? A lot of encoders is only moderately threaded, and can't scale forever. As well highly relay on assembly and vectorization to achieve the good performance.

I would love to separate article that focused just on ST performance.

**coder** · 19 December 2020, 11:59 AM

Originally posted by baryluk View Post

Michael,

I doubt you'll get his attention without an @ or quoting one of his earlier posts.

Originally posted by baryluk View Post

I.e. LZ4 benchmarks. Are these single core and thread, or multiple threads (how many), or multiple independent processes? Hard to tell.

Some of that info might be specified in Phoronix Test Suite, itself. If not, that's probably where it belongs.

Originally posted by baryluk View Post

I would love to separate article that focused just on ST performance.

Some scaling analysis for different sorts of workloads could be very interesting. Anyway, you might find Anandtech's review worth a look, if you haven't already read it.

**AmericanLocomotive** · 19 December 2020, 12:26 PM

Originally posted by coder View Post

But you didn't say it was merely close. You're talking all about how it dominates, and the truth is that it really doesn't. It's competitive and that's already quite an accomplishment. But that's all it is. Sure, it has some strengths, but also its share of weaknesses.

...and that's it, right there. This CPU is very competitive. It might be a "game changer" as far as enterprise ARM CPUs go (especially compared to AMD's feeble ARM Opteron back in 2016), however in the general enterprise scheme of things - it's "competitive".

**juanrga** · 19 December 2020, 01:33 PM

Originally posted by mdedetrich View Post

Correct, this is apparently the case. Here is the tweet I was referring to

https://twitter.com/djspiewak/status...02663361417217

He offers an alternative explanation in a following tweet.

The M1 is a full ARM CPU with a special mode to help translation from x86 and for some reason his code activated emulation mode.

**baryluk** · 19 December 2020, 01:49 PM

Originally posted by coder View Post

I doubt you'll get his attention without an @ or quoting one of his earlier posts.

Some of that info might be specified in Phoronix Test Suite, itself. If not, that's probably where it belongs.

Some scaling analysis for different sorts of workloads could be very interesting. Anyway, you might find Anandtech's review worth a look, if you haven't already read it.

Yeah, I just found the Anandtech review. It basically had everything covered, what I was hoping for. They did a really good job.

Announcement

Ampere Altra Performance Shows It Can Compete With - Or Even Outperform - AMD EPYC & Intel Xeon

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment