Apple M1 ARM Performance With A 2020 Mac Mini


  • Originally posted by Jumbotron View Post

    See my above post to evergreen. The M1 is chock-a-block full of cache. More so than any other CPU from Intel, AMD, Nvidia or ARM, with the possible exception of IBM's upcoming POWER10 CPU. And that ain't showing up in any laptop anytime soon.

    As a matter of fact, the sheer amount of cache in each and every M1 is historic. There has never been a consumer chip with as much cache. On top of that, it is the ONLY CPU in consumer tech history to also have low-latency, high-bandwidth memory directly attached to each core with a zero-copy, cache-coherent interconnect. Also, with the exception of the POWER10, the M1 may be the ONLY SoC in the world with this architecture.

    The memory parallelization in the M1 is simply beastly and an engineering marvel. AMD and ARM pioneered Heterogeneous System Architecture, but AMD abandoned HSA and its FUSION line of APUs to pursue ZEN; Apple seems to have picked up that mantle, perfected it, and did it in a package with a 13.8-watt max thermal profile.
    Actually the bandwidth is pretty normal. It's 128 bits wide, and various laptops and desktops have very similar bandwidth. You can buy faster RAM on Newegg for the AM4 CPU of your choice.

    What is unique is that while an AMD desktop has 2 channels of 64 bits, each of which can handle a pending memory request, I believe the M1 has 8 channels that are 16 bits wide, so it can handle 8 outstanding memory requests. This allows the CPU cores (4 fast + 4 slow), GPU cores, and Neural Engine cores all to make outstanding requests and get good throughput. I've written a microbenchmark to explore this if anyone has an M1 around.
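
    Roughly speaking, such a microbenchmark chases K independent pointer chains through a buffer far larger than the caches and watches the cost per load fall as K grows. Here is a minimal C sketch of that idea (my own illustration, not the actual benchmark mentioned above; all sizes and names are placeholders):

    /* Time K independent pointer chases. If the memory subsystem can keep
       K misses in flight, the cost per load should drop as K rises. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ELEMS (32 * 1024 * 1024)  /* 256 MB of pointers: far beyond any cache */
    #define STEPS (1 * 1024 * 1024)
    #define MAX_CHAINS 8

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        size_t *buf = malloc((size_t)ELEMS * sizeof *buf);
        if (!buf) return 1;
        for (size_t i = 0; i < ELEMS; i++) buf[i] = i;
        /* Sattolo's algorithm: one random cycle, so every load depends on
           the previous one and the hardware prefetcher can't guess it.
           rand() is crude, but fine for a sketch. */
        for (size_t i = ELEMS - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
        }
        for (int k = 1; k <= MAX_CHAINS; k++) {
            size_t p[MAX_CHAINS];
            for (int c = 0; c < k; c++)
                p[c] = (size_t)c * (ELEMS / MAX_CHAINS);  /* distinct start points */
            double t0 = now_sec();
            for (long s = 0; s < STEPS; s++)
                for (int c = 0; c < k; c++)
                    p[c] = buf[p[c]];       /* k independent loads in flight */
            double dt = now_sec() - t0;
            size_t sink = 0;                /* keep the chases from being optimized away */
            for (int c = 0; c < k; c++) sink ^= p[c];
            printf("%d chain(s): %6.1f ns/load (sink=%zu)\n",
                   k, dt * 1e9 / ((double)STEPS * k), sink);
        }
        free(buf);
        return 0;
    }

    Compile with something like gcc -O2 mlp.c; if the hardware really can keep eight misses in flight, eight chains should cost little more per step than one.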



    • Originally posted by Volta View Post

      So, what were your points? Ok, doesn't matter. Linux is much younger on desktops than Apple. Wake me up when Apple becomes relevant, because Chrome OS is probably going to destroy it like Android did. Btw, comparing Apple to Apple isn't so meaningful in terms of performance. It's no mystery that macOS is a slow mess.
      You must be less than 30 years old. Only an ignorant young buck could say something so historically stupid.

      Google creates, then shits on and drops support for, a dozen initiatives every year. Google products are the apotheosis of fragmentation, which kills both user support and developer support. Apple was ready to be sold off for scrap before Steve Jobs came back; recently it became not only the FIRST company of ANY sort in human history to reach 1 trillion dollars in value, it went on to reach 2 trillion dollars in value that same year.

      With basically 4 product lines.

      A computer (desktop, laptop, tablet), a phone, a watch and some earbuds.

      Chromebooks.....LOL. Please. Nobody even buys GOOGLE'S OWN OVERPRICED Chromebooks. For that price you can go ahead and get a REAL computer like the M1 Laptop that runs the same software that's on your Watch, your Phone, your Tablet, your Laptop and next year your Desktop.

      That said....Chromebooks, through the use of containerization across the board, have become more interesting. You can now run PWA web apps on Chrome, Android apps in a container, and Linux in a container (though it sucks right now and isn't officially supported), and on SOME industrial models you can run Windows on ARM.

      But please....Google is an advertising company posing as a consumer tech company. Not only posing, but posing without a coherent product plan or long-term support.



      • Originally posted by BillBroadley View Post

        Actually the bandwidth is pretty normal. It's 128 bits wide, and various laptops and desktops have very similar bandwidth. You can buy faster RAM on Newegg for the AM4 CPU of your choice.

        What is unique is that while an AMD desktop has 2 channels of 64 bits, each of which can handle a pending memory request, I believe the M1 has 8 channels that are 16 bits wide, so it can handle 8 outstanding memory requests. This allows the CPU cores (4 fast + 4 slow), GPU cores, and Neural Engine cores all to make outstanding requests and get good throughput. I've written a microbenchmark to explore this if anyone has an M1 around.

        From the excellent AnandTech article on the M1:

        Many, Many Execution Units


        Having high ILP also means that these instructions need to be executed in parallel by the machine, and here we also see Apple’s back-end execution engines feature extremely wide capabilities. On the Integer side, whose in-flight instructions and renaming physical register file capacity we estimate at around 354 entries, we find at least 7 execution ports for actual arithmetic operations. These include 4 simple ALUs capable of ADD instructions, 2 complex units which also feature MUL (multiply) capabilities, and what appears to be a dedicated integer division unit. The core is able to handle 2 branches per cycle, which I think is enabled also by one or two dedicated branch forwarding ports, but I wasn’t able to 100% confirm the layout of the design here.

        The Firestorm core here doesn’t appear to have major changes on the Integer side of the design, as the only noteworthy change was an apparent slight increase (yes, an increase) in the latency of the integer division unit.

        On the floating point and vector execution side of things, the new Firestorm cores are actually more impressive, as they feature a 33% increase in capabilities, enabled by Apple’s addition of a fourth execution pipeline. The FP rename registers here seem to land at 384 entries, which is again comparatively massive. The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors. Floating-point operations throughput here is 1:1 with the pipeline count, meaning Firestorm can do 4 FADDs and 4 FMULs per cycle with respectively 3 and 4 cycles latency. That’s quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen3, while of course still running at a lower frequency. This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).

        Vector abilities of the 4 pipelines seem to be identical, with the only instructions that see lower throughput being FP divisions, reciprocals and square-root operations, which only have a throughput of 1, on one of the four pipes.

        On the load-store front, we’re seeing what appears to be four execution ports: one load/store unit, one dedicated store unit and two dedicated load units. The core can do at most 3 loads per cycle and 2 stores per cycle, but a maximum of only 2 loads and 2 stores concurrently.

        What’s interesting here is again the depth to which Apple can handle outstanding memory transactions. We’re measuring up to around 148-154 outstanding loads and around 106 outstanding stores, which should be the equivalent figures of the load queues and store queues of the memory subsystem. To no surprise, this is also again deeper than any other microarchitecture on the market. Interesting comparisons are AMD’s Zen3 at 44/64 loads & stores, and Intel’s Sunny Cove at 128/72. The Intel design here isn’t far off from Apple, and actually the throughput of these latest microarchitectures is relatively matched – it would be interesting to see where Apple is going to go once they deploy the design to non-mobile memory subsystems and DRAM.

        One large improvement on the part of the Firestorm cores this generation has been on the side of the TLBs. The L1 TLB has been doubled from 128 pages to 256 pages, and the L2 TLB goes up from 2048 pages to 3072 pages. On today’s iPhones this is an absolutely overkill change as the page size is 16KB, which means that the L2 TLB covers 48MB which is well beyond the cache capacity of even the A14. With Apple moving the microarchitecture onto Mac systems, having compatibility with 4KB pages and making sure the design still offers enough performance would be a key part as to why Apple chose to make such a large upgrade this generation.

        On the cache hierarchy side of things, we’ve known for a long time that Apple’s designs are monstrous, and the A14 Firestorm cores continue this trend. Last year we had speculated that the A13 had a 128KB L1 Instruction cache, similar to the 128KB L1 Data cache which we can test for; however, following Darwin kernel source dumps, Apple has confirmed that it’s actually a massive 192KB instruction cache. That’s absolutely enormous and is 3x larger than the competing Arm designs, and 6x larger than current x86 designs, which yet again might explain why Apple does extremely well in very high instruction pressure workloads, such as the popular JavaScript benchmarks.

        The huge caches also appear to be extremely fast – the L1D lands in at a 3-cycle load-use latency. We don’t know if this is clever load-load cascading such as described on Samsung’s cores, but in any case, it’s very impressive for such a large structure. AMD has a 32KB 4-cycle cache, whilst Intel’s latest Sunny Cove saw a regression to 5 cycles when they grew the size to 48KB. Food for thought on the advantages or disadvantages of slow or fast frequency designs.

        On the L2 side of things, Apple has been employing an 8MB structure that’s shared between their two big cores. This is an extremely unusual cache hierarchy and contrasts with everybody else’s use of an intermediary-sized private L2 combined with a larger, slower L3. Apple here disregards the norms and chooses a large and fast L2. Oddly enough, this generation the A14 saw the L2 of the big cores make a regression in terms of access latency, going back from 14 cycles to 16 cycles, reverting the improvements that had been made with the A13. We don’t know for sure why this happened; I do see higher parallel access bandwidth into the cache for scalar workloads, however peak bandwidth still seems to be the same as the previous generation. Another hypothesis is that because Apple shares the L2 amongst cores, this might be an indicator of changes for Apple Silicon SoCs with more than just two cores connected to a single cache, much like the A12X generation.

        Apple has employed a large LLC on their SoCs for many generations now. On the A14 this appears to be again a 16MB cache that serves all the IP blocks on the SoC, most useful of course for the CPU and GPU. Comparatively speaking, this cache hierarchy isn’t nearly as fast as the actual CPU-cluster L3s of other designs out there, and in recent years we’ve seen more mobile SoC vendors employ such an LLC in front of the memory controllers for the sake of power efficiency. What Apple would do in a larger laptop or desktop chip remains unclear, but I do think we’d see similar designs there.
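
        As an aside, those load-use latencies are typically measured with a serial pointer chase over working sets of growing size. Here is a minimal C sketch of the idea, assuming POSIX clock_gettime (my own illustration, not AnandTech's methodology):

        /* Serial pointer chase: latency steps up as the working set
           spills out of L1D, then L2, then the LLC. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        static double now_sec(void) {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec + ts.tv_nsec * 1e-9;
        }

        int main(void) {
            const long steps = 16 * 1024 * 1024;
            for (size_t kb = 16; kb <= 64 * 1024; kb *= 2) {
                size_t n = kb * 1024 / sizeof(size_t);
                size_t *buf = malloc(n * sizeof *buf);
                if (!buf) return 1;
                for (size_t i = 0; i < n; i++) buf[i] = i;
                /* Sattolo's algorithm: one full random cycle, so every load
                   depends on the previous one and defeats linear prefetch. */
                for (size_t i = n - 1; i > 0; i--) {
                    size_t j = (size_t)rand() % i;
                    size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
                }
                size_t p = 0;
                double t0 = now_sec();
                for (long s = 0; s < steps; s++)
                    p = buf[p];              /* each load depends on the last */
                double dt = now_sec() - t0;
                printf("%6zu KB: %5.2f ns/load (p=%zu)\n",
                       kb, dt * 1e9 / (double)steps, p);
                free(buf);
            }
            return 0;
        }

        On an M1 the ns/load figure should step up somewhere past the 128KB L1D, again past the shared L2, and again past the LLC; dividing by the clock period converts it into the cycle counts quoted above.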







        • Originally posted by Jumbotron View Post

          Chromebooks.....LOL. Please. Nobody even buys GOOGLE'S OWN OVERPRICED Chromebooks.
          According to Statista:

          "In 2020, it is forecast that a total of 20 million Chromebooks will be shipped globally, an increase of 17 percent from the 17 million units that were shipped in 2019."

          According to Gartner:

          For Apple specifically, Gartner's data indicates shipments of 5.26 million units, down 3% year-over-year, and a market share of 7.5%.



          • With the ability to run iOS apps, this device is begging to be packaged as a 2-in-1. I am not the target market either way; it is not for me.
            Last edited by creoflux; 20 November 2020, 11:49 PM.



            • Originally posted by blacknova View Post

              Consider this: normally on x86 you have a native compiler -> x86 ISA code -> x86 CPU front-end CISC-to-RISC translator -> internal RISC core execution path.
              On new Macs, instead of the CPU's CISC-to-RISC translator you have Rosetta.

              Both the CPU's front-end and Rosetta get optimized x86 code at their input. What differs is that they operate under different constraints: Rosetta can act as both an AOT and a JIT compiler and provides hot-spot optimization. The native translator runs only in sequential mode and does not provide any additional optimizations.

              So yeah, Rosetta can potentially run code at least as fast as the native processor, and in some cases it can provide better optimization than the native front-end translator.
              In that case shouldn't we also use some dynamic translator, or better yet output the correct binary in the first place? What's the reason for not doing that? (I understand needing to support older/more generic stuff, but is it not possible to support both?)



              • Originally posted by starshipeleven View Post
                For chrissake, all embedded devices and APUs share the same RAM between their processors and accelerators. Smartphones from 5 years ago do that. AMD APUs do that, Intel CPUs with graphics do that.

                HSA is an interoperability standard and software middleware designed to run computing tasks on hardware from different vendors, which is completely different from this: a SINGLE system created by a SINGLE vendor that controls the full stack from the OS down to the hardware design.

                Wow, it's using a feature DirectX 11.1 had in 2018 and that is a core feature in both DirectX 12 and Vulkan, and calling it a new and revolutionary thing! Much excitement! Very Apple! Wow!

                Yay, Apple has made an embedded device that is using all the embedded device technologies everyone else uses! Much innovation! Very Excitement!
                FWIW, the Dreamcast had tile-based deferred rendering from its PowerVR chip back in 1998.

                I still miss that console.



                • Originally posted by Jumbotron View Post

                  See my above post to evergreen. The M1 is chock-a-block full of cache. More so than any other CPU from Intel, AMD, Nvidia or ARM, with the possible exception of IBM's upcoming POWER10 CPU. And that ain't showing up in any laptop anytime soon.

                  As a matter of fact, the sheer amount of cache in each and every M1 is historic. There has never been a consumer chip with as much cache. On top of that, it is the ONLY CPU in consumer tech history to also have low-latency, high-bandwidth memory directly attached to each core with a zero-copy, cache-coherent interconnect. Also, with the exception of the POWER10, the M1 may be the ONLY SoC in the world with this architecture.

                  The memory parallelization in the M1 is simply beastly and an engineering marvel. AMD and ARM pioneered Heterogeneous System Architecture, but AMD abandoned HSA and its FUSION line of APUs to pursue ZEN; Apple seems to have picked up that mantle, perfected it, and did it in a package with a 13.8-watt max thermal profile.
                  I'm not a CPU designer, but I do know that the POWER9 has 64 kB of L1 instruction cache and 64 kB of L1 data cache per core, for a total of 1.5 MB. Perhaps someone will explain the difference between a data cache and an instruction cache.



                  • Originally posted by BillBroadley View Post

                    Not really. It just means that the x86-64 port of the video encoder has some special ASM code to make it efficient, and the ARM version does not. With native M1 code it does quite well. People have benchmarked exporting 4K HDR from the Apple tools (Premiere) and it did quite well.

                    You have to compare apples to apples if you want to be fair.
                    Michael should really have tested x264, which does have more ARM optimizations...



                    • Originally posted by PerformanceExpert View Post

                      It outperforms the fastest x86 desktops including Zen 3 at a fraction of the power. The only conclusion: game over.
                      It outperforms the 4500U sometimes.

                      And the difference is only 1.2 W (13.8 W for the M1, 15 W for the tested Zen 3 chip), if that's what you call a "fraction".
                      Now imagine that same AMD chip were made on the same 5nm process... it would have lower power consumption than the M1. (That should come soon.)
