Apple M1 PCIe Driver Leads The PCI Changes For Linux 5.16


  • #31
    Originally posted by bridgman View Post
    #1 - wider deeper cores can give higher single thread performance if you can afford the R&D and the silicon area

    #2 - Apple can afford the R&D and TSMC 5nm helps a lot with affording the silicon area
    #3 - Apple is vertically integrated and charges premium prices, which gives it bigger budgets for its silicon.

    #4 - Apple doesn't need to scale up to high core-count server CPUs, which I think is a factor that puts additional downward pressure on Intel & AMD's core sizes.

    Originally posted by bridgman View Post
    Nothing to do with X86 vs ARM, it's about how wide and deep the execution part of the core is
    No, we can't say that. As is often pointed out, their front-end is 8-wide, while Zen3 is 4-wide and Golden Cove is 5-wide. Of course an x86 instruction != an AArch64 instruction in terms of the amount of work each represents, but the delta (for most typical instructions) is not enough to offset the difference.

    Also, x86-64 has about half as many GP registers (16, vs. AArch64's 31), which limits the amount of parallelism compilers can exploit before generating wasteful spills.
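    To make that concrete, here's a toy C function (invented purely for illustration) with ~20 simultaneously-live values. The volatile loads can't be reordered past each other or duplicated, and the chain consumes them in reverse order, so everything is live at once; compiled at -O2, an x86-64 build has to spill to the stack while an AArch64 build can usually keep it all in registers:
    Code:
    /* Toy register-pressure demo -- not from any real codebase. */
    long register_pressure(volatile const long *p)
    {
        /* Volatile loads happen exactly once, in order, so all twenty
           values must be fetched before the chain below can start. */
        long a0 = p[0],   a1 = p[1],   a2 = p[2],   a3 = p[3],   a4 = p[4];
        long a5 = p[5],   a6 = p[6],   a7 = p[7],   a8 = p[8],   a9 = p[9];
        long a10 = p[10], a11 = p[11], a12 = p[12], a13 = p[13], a14 = p[14];
        long a15 = p[15], a16 = p[16], a17 = p[17], a18 = p[18], a19 = p[19];

        /* Consume in reverse order of loading, so no value dies early:
           peak pressure is ~21 live longs, over x86-64's 16 GPRs but
           under AArch64's 31. */
        long r = a19;
        r = r * 33 + a18;  r = r * 33 + a17;  r = r * 33 + a16;
        r = r * 33 + a15;  r = r * 33 + a14;  r = r * 33 + a13;
        r = r * 33 + a12;  r = r * 33 + a11;  r = r * 33 + a10;
        r = r * 33 + a9;   r = r * 33 + a8;   r = r * 33 + a7;
        r = r * 33 + a6;   r = r * 33 + a5;   r = r * 33 + a4;
        r = r * 33 + a3;   r = r * 33 + a2;   r = r * 33 + a1;
        return r * 33 + a0;
    }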



    • #32
      Originally posted by coder View Post
      This M1 is a 4+4 design, meaning it has 4 performance and 4 efficiency cores. It's telling to look at how well just the 4 performance cores stack up against the 4800U, still beating it in FP. And then the full 4+4 configuration beats the 4900HS at FP, while using less power.

      So, I think there's a reasonable case to be made that Apple's P-cores are both faster and more efficient than Zen2.
      You are working hard to disprove things I didn't say.

      Yes, the M1 P-core is faster and more efficient than a Zen2 core. The points I'm trying to make are (a) the differences fit within what you would expect from a wider & deeper core on a faster fab process, and (b) the differences have very little to do with ARM vs x86.

      Again, I'm not trying to say that the M1 isn't an interesting and performant chip, just disagreeing with the "ooh it's 10:1 better or 100:1 better because it's ARM" conclusions.

      I haven't seen much in the way of analysis on the M1 efficiency cores but Alder Lake e-cores are supposed to be comparable to Skylake, so it's more "huge + big" than "big + little". The little cores aren't all that little.




      • #33
        Originally posted by bridgman View Post
        it's about how wide and deep the execution part of the core is
        x86 has the tiniest bottleneck in CPU design history right at the front of every core.

        Even with all their fancy uop caching and prediction (which an ARM core can do too, btw), the decode at the front is strangling the execution engines and limiting how many can be added before returns diminish (see the sketch below the footnote). They'll probably be forced to copy DEC VAX and invent a MicroVAX, cutting down the instruction set to make it easier to decode. [1] Otherwise they are never going to reach the kind of decode parallelism Apple has already demonstrated and stands to improve on.

        [1] In addition to all the usual CISC mistakes, the x86 encoding scheme makes it *theoretically* possible to write an *arbitrarily long* instruction (real hardware refuses anything over 15 bytes). Like the VAX, x86 was born in the naive '70s, between Cray's CDC 6600 and Berkeley's RISC, when we forgot that simpler is faster. DEC eventually realized they needed Alpha. Intel's exit was supposed to be Itanium.
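        As a sketch of why (in C, with a toy length function standing in for the real mess of prefixes, opcode maps, ModRM/SIB, displacements and immediates):
        Code:
        #include <stddef.h>
        #include <stdint.h>

        /* Toy stand-in for x86 length decoding -- invented here; the real
           rules are far messier. The point is the loop below, not this. */
        static size_t x86_insn_length(const uint8_t *bytes)
        {
            return 1 + (bytes[0] & 7);   /* pretend: 1..8 bytes */
        }

        /* Variable-length ISA: instruction N+1's start is unknown until
           instruction N has been length-decoded, so this walk is a serial
           dependency chain. Wide hardware must speculate on boundaries
           (or pre-tag them in the I-cache) and discard wrong guesses. */
        size_t find_x86_boundaries(const uint8_t *code, size_t len,
                                   size_t *starts, size_t max)
        {
            size_t n = 0, off = 0;
            while (off < len && n < max) {
                starts[n++] = off;
                off += x86_insn_length(code + off);   /* must finish first */
            }
            return n;
        }

        /* Fixed-length ISA (AArch64): every boundary is known up front, so
           eight decoders can each grab an instruction in the same cycle. */
        size_t find_a64_boundaries(size_t len, size_t *starts, size_t max)
        {
            size_t n = 0;
            for (size_t off = 0; off + 4 <= len && n < max; off += 4)
                starts[n++] = off;
            return n;
        }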



        • #34
          Originally posted by coder View Post
          No, we can't say that. As is often pointed out, their front-end is 8-wide, while Zen3 is 4-wide and Golden Cove is 5-wide. Of course an x86 instruction != an AArch64 instruction in terms of the amount of work each represents, but the delta (for most typical instructions) is not enough to offset the difference.
          OK, you've lost me. I said "Nothing to do with X86 vs ARM, it's about how wide and deep the execution part of the core is", then you responded with a statement about the microarchitecture (decoder width) rather than something to do with ISA. Agree that it's more work to make a wide decoder for a variable-length ISA, but I hope everyone agrees that the "oh, you can't have more than a 4-wide decoder with x86" limit has been debunked now?

          Isn't Golden Cove 6-wide decode? There are a couple of articles floating around that show a picture of Sunny Cove (5-wide) in the middle of a Golden Cove decode discussion, but AFAIK both Gracemont and Golden Cove are 6-wide.



          • #35
            Originally posted by bridgman View Post
            OK, you've lost me. I said "Nothing to do with X86 vs ARM, it's about how wide and deep the execution part of the core is", then you responded with a statement about the microarchitecture (decoder width) rather than something to do with ISA. Agree that it's more work to make a wide decoder for a variable-length ISA, but I hope everyone agrees that the "oh, you can't have more than a 4-wide decoder with x86" limit has been debunked now?
            It's not just more work - the power & complexity of doing it should scale nonlinearly. And the mere fact that Intel pushed beyond 4-wide doesn't negate that.

            Originally posted by bridgman View Post
            Isn't Golden Cove 6-wide decode ?
            I've not seen detailed analysis of it, but even 4-wide decoders tend to have limitations like 1 complex + 3 simple instructions. Or, it might be partially segmented, similar to Gracemont.
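            The kind of constraint I mean, sketched in C (width and names invented for illustration):
            Code:
            #include <stdbool.h>
            #include <stddef.h>

            #define DECODE_WIDTH 4   /* e.g. 1 complex + 3 simple */

            struct insn { bool complex; /* needs the multi-uop path */ };

            /* Only slot 0 has the full complex-capable decoder; a complex
               instruction landing in any later slot ends the group and
               waits to lead the next cycle. This is why a "4-wide" decoder
               can sustain fewer than 4 per cycle on complex-heavy code. */
            size_t form_decode_group(const struct insn *in, size_t n)
            {
                size_t slots = n < DECODE_WIDTH ? n : DECODE_WIDTH;
                for (size_t i = 1; i < slots; i++)
                    if (in[i].complex)
                        return i;   /* group ends before the complex insn */
                return slots;
            }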

            Originally posted by bridgman View Post
            AFAIK both Gracemont and Golden Cove are 6-wide.
            Gracemont supports 3-wide decode from each of 2 instruction streams. Since it doesn't support SMT, that means branch targets. So, when the OoO engine is speculatively fetching instructions from branch targets, they each get allocated to one or the other decoder block.
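            Roughly like this, as I picture it (all names invented, and the real steering logic is surely smarter than round-robin):
            Code:
            #include <stdint.h>
            #include <stdio.h>

            #define CLUSTERS 2   /* two independent 3-wide decode clusters */

            /* A fetch block: a straight-line run of instructions ending at
               a predicted-taken branch, produced by the branch predictor
               ahead of decode. */
            struct fetch_block { uint64_t start_pc; int n_insns; };

            /* Stand-in for one 3-wide cluster working through a block. */
            static void decode_block(int cluster, const struct fetch_block *b)
            {
                printf("cluster %d: %d insns from %#llx\n",
                       cluster, b->n_insns, (unsigned long long)b->start_pc);
            }

            /* While one cluster is still decoding block k, the other can
               already start at block k+1's branch target. Long straight-line
               code produces one big block and feeds a single cluster, which
               is one reason sustained decode can look closer to 3-wide than
               6-wide. */
            void assign_blocks(const struct fetch_block *blocks, int nblocks)
            {
                for (int k = 0; k < nblocks; k++)
                    decode_block(k % CLUSTERS, &blocks[k]);
            }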



            • #36
              Originally posted by coder View Post
              It's not just more work - the power & complexity of doing it should scale nonlinearly. And the mere fact that Intel pushed beyond 4-wide doesn't negate that.
              Agree, but pretty much everything in CPU core design scales non-linearly - that's the main reason core size moves forward in baby steps instead of everyone making humongous cores.

              Originally posted by coder View Post
              I've not seen detailed analysis of it, but even 4-wide decoders tend to have limitations like 1 complex + 3 simple instructions. Or, it might be partially segmented, similar to Gracemont.
              Yep - AFAIK the Golden Cove decoder is 1 complex + 5 simple.

              Originally posted by coder View Post
              Gracemont supports 3-wide decode from each of 2 instruction streams. Since it doesn't support SMT, that means branch targets. So, when the OoO engine is speculatively fetching instructions from branch targets, they each get allocated to one or the other decoder block.
              Interesting - I was wondering how that 2x3 decoder fit into the overall picture on a single-threaded core.

              I'm still wondering if there might be more to those decoders than parallel decode of branch targets - the Gracemont execution pipeline is *really* wide and there's no micro-op cache to provide a cheap source of 8-wide decoded instructions. They are tagging instruction boundaries in the $I cache but AFAIK we have been doing that for almost a decade - it certainly helps with loop performance but it's not a magic bullet.
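              For reference, the boundary tagging amounts to something like this (field names invented; assuming one predecode bit per byte):
              Code:
              #include <stdint.h>

              #define LINE_BYTES 64

              /* Alongside each I-cache line, one bit per byte marks where
                 an instruction starts. The bits are filled in by the first
                 (slow, serial) decode of the line and reused on every later
                 fetch, so hot loops skip the length-finding problem. */
              struct icache_line {
                  uint8_t  bytes[LINE_BYTES];
                  uint64_t start_mask;   /* bit i => insn starts at byte i */
                  uint8_t  predecoded;   /* nonzero once the mask is valid */
              };

              /* With valid predecode bits, finding the next boundary is a
                 bit scan (__builtin_ctzll is a GCC/Clang builtin) -- cheap
                 and parallel, unlike the first-time decode. */
              static inline int next_boundary(const struct icache_line *l,
                                              int from)
              {
                  if (from >= LINE_BYTES)
                      return -1;
                  uint64_t m = l->start_mask >> from;
                  return m ? from + __builtin_ctzll(m) : -1;
              }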

              Skylake had a wider decoder (1 complex + 4 simple, although it might have been either-or between 1 complex and 4 simple) and a micro-op cache, so it's not obvious how Gracemont would get the same performance out of a 3-ish-wide decoder with simultaneous decode of a branch target. On the other hand, Alder Lake performance is pretty good on gcc benchmarks, which IIRC are about as branchy as code gets, so maybe there is something there.
              Last edited by bridgman; 06 November 2021, 11:55 PM.



              • #37
                Originally posted by bridgman View Post
                I'm still wondering if there might be more to those decoders than parallel decode of branch targets - the Gracemont execution pipeline is *really* wide and there's no micro-op cache to provide a cheap source of 8-wide decoded instructions.
                It can issue 5 micro-ops per cycle and retire up to 8. I don't know how many micro-ops are typically generated per x86 instruction, but the reason you can retire more than you issue is that differing instruction latencies mean the retire stage needs to handle bursts, so as to avoid back-pressuring some of the pipelines.
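                The retire side of that, sketched (names and the ROB capacity invented; widths from the numbers above):
                Code:
                #include <stdbool.h>

                #define RETIRE_WIDTH 8    /* vs. 5-wide issue */
                #define ROB_SIZE     256  /* made-up capacity */

                struct rob_entry { bool completed; };

                struct rob {
                    struct rob_entry e[ROB_SIZE];
                    int head, count;
                };

                /* Retirement is in order: each cycle, pop up to RETIRE_WIDTH
                   completed micro-ops from the head, stopping at the first
                   one still in flight. Completed work piles up behind a
                   long-latency op at the head, so when it finishes, retire
                   must drain the backlog faster than the steady-state issue
                   rate or the ROB fills and stalls issue. */
                int retire_cycle(struct rob *r)
                {
                    int retired = 0;
                    while (retired < RETIRE_WIDTH && r->count > 0 &&
                           r->e[r->head].completed) {
                        r->head = (r->head + 1) % ROB_SIZE;
                        r->count--;
                        retired++;
                    }
                    return retired;
                }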

                Originally posted by bridgman View Post
                They are tagging instruction boundaries in the $I cache but AFAIK we have been doing that for almost a decade - it certainly helps with loop performance but it's not a magic bullet.
                Because it's efficiency-optimized, I guess they don't want effectively multiple I-caches. As a matter of fact, I've wondered why you wouldn't just augment your I-cache to hold decoded micro-ops. Perhaps what they're really doing is something somewhere in between.
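                Something like this, purely speculative on my part (every field invented):
                Code:
                #include <stdint.h>

                /* Opaque decoded micro-op; real encodings are private. */
                struct uop { uint32_t bits; };

                /* An I-cache line that carries its own decode results: more
                   than predecode boundary bits, less than a separate uop
                   cache with its own tags. n_decoded == 0 means "fall back
                   to the regular decoders". */
                struct hybrid_icache_line {
                    uint8_t    bytes[64];    /* raw instruction bytes */
                    uint64_t   start_mask;   /* instruction-start bits */
                    struct uop decoded[24];  /* cached decode, if valid */
                    uint8_t    n_decoded;
                };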



                • #38
                  Originally posted by pWe00Iri3e7Z9lHOX2Qx View Post

                  7-year-old MacBook Airs with 8GB RAM and a 256GB SSD still sell for over $250 on eBay. The battery will also be shit by then, so you get to have fun extracting it from all that glue. But I agree with your sentiment. It's funny that Apple reigns supreme for mobile OS updates/longevity but is the worst on the desktop/laptop side. Hopefully in a year we start seeing some good ARM alternative hardware from NUVIA/QC that's more Linux friendly.
                  The M1 went on sale in 2020 and I said 2028, so I'm thinking of an 8-9 year time window. Hopefully that will bring the price below $250 - but, as CommunityMember pointed out, a pair of Nike sneakers can cost that much (not that I would pay that much for sneakers).

                  And laptop battery life doesn't matter to me; mine spend 99% of their lives plugged in. As long as it doesn't break, I'm okay.



                  • #39
                    Originally posted by ihatemichael View Post

                    It doesn't have to be 100% perfect to be useful; it will continue to be improved, like everything else.
                    Anyway, not having day-zero (or nearly day-zero) hardware drivers available makes it impractical...

