Intel Talks Up 2024 Xeon Sierra Forest & Granite Rapids At Hot Chips

  • #11
    Originally posted by coder View Post
    How do you know AMD's chiplet link uses serial connections? Even DRAM DIMMs don't, nor does HBM. So, why would they take such a hit for their inter-chiplet communication?

    BTW, I think people place too much emphasis on core-to-core latency. Data sharing between cores, via L3, is probably rather rare in practice. The bigger issue would be that all memory traffic has to incur the latency penalty of traversing a pair of SERDES.


    It has I/O dies, even if the memory controllers are integrated into the compute dies.


    Not if you allow for more than one hop. Intel likes meshes, which is what Sapphire Rapids and Granite Rapids both use.

    "Intel’s mesh has to connect 56 cores with 56 L3 slices. Because L3 accesses are evenly hashed across all slices, there’s a lot of traffic going across that mesh. SPR’s memory controllers, accelerators, and other IO are accessed via ring stops too, so the mesh is larger than the core count alone would suggest. Did I mention it crosses die boundaries too? Intel is no stranger to large meshes, but the complexity increase in SPR seems remarkable."

    https://chipsandcheese.com/2023/03/1...pphire-rapids/
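    Just to make the "evenly hashed across all slices" part concrete, here is a minimal sketch of how a physical address could be mapped to one of 56 L3 slices. The XOR-fold hash and the slice count are purely illustrative assumptions; Intel's actual slice hash is undocumented. The point is only that a core's accesses land on remote slices almost all of the time, which is what drives mesh traffic:

    Code:
NUM_SLICES = 56        # one L3 slice per core tile (illustrative)
CACHE_LINE_BITS = 6    # 64-byte cache lines

def l3_slice(phys_addr: int) -> int:
    """Map a physical address to an L3 slice (toy XOR-fold hash, not Intel's)."""
    line = phys_addr >> CACHE_LINE_BITS   # drop the byte offset within the line
    folded = 0
    while line:
        folded ^= line & 0xFFFF           # XOR-fold 16 bits at a time
        line >>= 16
    return folded % NUM_SLICES

# Any given core owns only 1 of the 56 slices, so roughly 55/56 of its
# L3 lookups have to cross the mesh to some other tile's slice.
for addr in (0x1000, 0x1040, 0x2375C0, 0xDEADBEEF00):
    print(hex(addr), "-> slice", l3_slice(addr))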



    As for Granite Rapids, it carries on the approach of using mesh interconnects for cross-die communication:


    The fact that Granite Rapids has I/O dies for OFF CHIP COMMUNICATION proves the point I was making and shows that you don't understand what "glue" actually means. Of course Granite Rapids has I/O -- duh -- but those I/O dies are used exclusively for communicating OFF PACKAGE with other components in the system. To say that any chip that has I/O capabilities is "glued together" shows you don't understand the meaning of the term.

    And yes, of course there are ways for different cores to talk to each other in Granite Rapids -- once again, duh -- but the mesh network is NOT re-encoding every piece of data into a quasi-PCIe packet protocol and shoving it through a SERDES to another transceiver just to move some bits between L3 slices. Note that those I/O chiplets are NOT placed between every single compute tile, because they aren't necessary there; they are only used to communicate off-package with your GPUs, disks, networking, etc. etc.

    I will admit that I was incomplete, though not wrong, on one point: there IS glue on Granite Rapids... for communicating with other sockets in a multi-socket system. That's what the UPI links are, and UPI is basically what AMD copied (UPI was around for years prior to Zen) as "Infinity Fabric" for its chiplets. So there is glue for multi-socket systems, and if you actually understand what glue means, you'd know that every Zen system is basically a large multi-socket setup at the logical level even if the dies are physically placed in a single socket.



    • #12
      Originally posted by AdrianBc View Post

      Core 2 cannot be compared with any modern CPUs, either glued or non-glued, because it did not use point-to-point communication links like all modern devices do; it used a shared bus for communication instead.

      About Granite Rapids, I have not seen any information published by Intel about how the tiles communicate, so unless you have access to confidential information you do not know whether the communication links between tiles use SERDES or not.

      The only alternative to using SERDES is to use communication links with many parallel connections and separate clock signals, exactly like the HyperTransport links used by the old AMD CPUs before they switched to using PCIe.

      The HT-like links have the advantage of lower latency, by skipping the SERDES, but they can be used only up to a certain combination of clock frequency and link length. Increasing either the clock frequency or the link length results in excessive skew between the data lines that can no longer be compensated, at which point it becomes necessary to insert SERDES. That increases the communication latency, which is bad, but it also increases the communication throughput, which is good.

      So the choice between parallel links like HT and serial links with SERDES, like PCIe or the inter-socket links of both Intel and AMD, is just a trade-off between latency and throughput, which in most modern systems has been decided in favor of throughput. In any case, this is not a choice that deserves to be called a difference between glued and non-glued devices.
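      A rough sketch of the skew argument above, with made-up numbers (the skew-per-mm and the usable skew fraction are assumptions for illustration, not vendor figures): a parallel, source-synchronous link works as long as the worst-case lane-to-lane skew stays well inside one bit period, and raising either the clock rate or the link length eventually breaks that.

      Code:
def parallel_link_ok(gigatransfers: float, length_mm: float,
                     skew_ps_per_mm: float = 0.5,
                     usable_fraction: float = 0.3) -> bool:
    """True if lane-to-lane skew still fits inside the bit period (toy model)."""
    bit_period_ps = 1000.0 / gigatransfers      # e.g. 4 GT/s -> 250 ps per bit
    worst_skew_ps = skew_ps_per_mm * length_mm  # skew grows with trace length
    return worst_skew_ps < usable_fraction * bit_period_ps

# Short, slow HyperTransport-style links are fine without SERDES;
# long or fast links blow the skew budget and need per-lane SERDES instead.
for rate, length in [(1.0, 50), (4.0, 50), (16.0, 100)]:
    verdict = "parallel link OK" if parallel_link_ok(rate, length) else "needs SERDES"
    print(f"{rate:4.1f} GT/s over {length:3d} mm: {verdict}")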

      Regarding the necessity of a central I/O hub, a switch included in the central hub will accelerate the communication between tiles, unless each tile includes enough separate links to the other tiles for a complete interconnection. This means that with no more than 4 tiles, each tile must have 3 inter-tile links. With more tiles the number of links grows quickly and their cost becomes unacceptable, so the only solution is to have a central hub with a switch.

      Sapphire Rapids has only 4 tiles, so, like Zen 1, it does not need a central hub. Whenever Intel increases the number of tiles, they will have to add a central hub like AMD's.
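      The scaling argument in the quoted post above is easy to check (a sketch, nothing more): a complete point-to-point interconnection needs N-1 links per tile and N*(N-1)/2 links in total, which is manageable at 4 tiles and gets out of hand quickly after that.

      Code:
def full_interconnect_links(n_tiles: int) -> tuple[int, int]:
    """Links needed for a complete tile-to-tile interconnection."""
    per_tile = n_tiles - 1                  # each tile links to every other tile
    total = n_tiles * (n_tiles - 1) // 2    # each link is shared by two tiles
    return per_tile, total

for n in (2, 4, 8, 12, 16):
    per_tile, total = full_interconnect_links(n)
    print(f"{n:2d} tiles: {per_tile:2d} links per tile, {total:3d} links in total")
# 4 tiles -> 3 links per tile; 12 tiles -> 11 per tile and 66 links in total,
# which is the point where a central hub with a switch wins.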


      Core 2 absolutely can be compared to Zen 3 for this conversation, because the fact that Core 2 multiplexed the physical layer (the bus) for other I/O in addition to the glue logic is an incidental implementation detail; the overall logic remains the same, and any engineer who understands layer isolation can see that. You don't say that visiting Phoronix on your laptop over WiFi is "fundamentally different" from visiting it on a PC over Ethernet either.

      Intel is clearly not using glue inside of Granite Rapids, and the I/O fabric chiplets literally prove it: they are only used for off-package communication and are not sandwiched between the compute dies. There is no need to packetize each L3 cache query/transfer into a quasi-PCIe protocol and ship it through transceivers just to move data between L3 caches. That's what UPI does for other sockets, but it's clearly not needed here when EMIB makes moving data between dies as transparent as moving it between cores on the same die. Unlike the difference between shared buses and point-to-point links (e.g. more wires), this is a fundamental difference, because there is no need to marshal bits into packets and use dedicated transceivers to transmit them to the next die for each transaction; instead, EMIB makes the external chiplet interfaces effectively identical to the on-die interfaces. Put another way, AMD can't show that an I/O transaction inside the L3 cache of a single core complex is effectively identical to the I/O transaction between chiplet #2 and chiplet #7 on a Zen package.



      • #13
        Originally posted by chuckula View Post
        The fact that Granite Rapids has I/O dies for OFF CHIP COMMUNICATION proves the point I was making and shows that you don't understand what "glue" actually means. Of course Granite Rapids has I/O -- duh -- but those I/O dies are used exclusively for communicating OFF PACKAGE with other components in the system. To say that any chip that has I/O capabilities is "glued together" shows you don't understand the meaning of the term.
        The lady doth protest too much, methinks.

        You're making such a big deal over whether AMD routes inter-die traffic through a central crossbar (which happens to be housed in the IO Die) that you're missing the fact that Intel's meshes aren't point-to-point, either.

        If you weren't so rabidly defensive, you might learn a thing or two. That ChipsAndCheese article I quoted explained how Intel and AMD arrived at their different approaches:

        "Intel engineers now have an order of magnitude more bandwidth going across EMIB stops. The mesh is even larger, and has to support a pile of accelerators too. L3 capacity per slice has gone up too, from 1.25 MB on Ice Lake SP to 1.875 MB on SPR.

        From that perspective, Intel has done an impressive job. SPR has similar L3 latency to Ampere Altra and Graviton 3, while providing several times as much caching capacity. Intel has done this despite having to power through a pile of engineering challenges. But from another perspective, why solve such a hard problem when you don’t have to?

        In contrast, AMD has opted to avoid the giant interconnect problem entirely. EPYC and Ryzen split cores into clusters, and each cluster gets its own L3. Cross-cluster cache accesses are avoided except when necessary to ensure cache coherency. That means the L3 interconnect only has to link eight cache slices with eight cores. The result is a very high performance L3, enabled by solving a much simpler interconnect problem than Intel."


        They seem to suggest that Intel's use of a monolithic mesh is more of a sledgehammer solution. AMD takes a hit on CCD-to-CCD latency, but their hierarchical interconnect seems decidedly more efficient and scales more than well enough.

        A key detail not addressed in that article is the energy tax of die-to-die communication, and this is perhaps the biggest deficiency of Intel's approach.
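        To put a rough number on that energy tax (all figures below are assumptions for illustration; real pJ/bit values depend on process, PHY, and distance): interconnect power is just bandwidth times energy per bit, so pushing the full mesh/L3 traffic across die boundaries adds up quickly compared to on-die wires.

        Code:
def link_power_watts(bandwidth_gb_per_s: float, pj_per_bit: float) -> float:
    """Interconnect power = bits per second x energy per bit."""
    bits_per_s = bandwidth_gb_per_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

traffic_gb_s = 500  # assumed cross-die mesh/L3 traffic
for name, pj_per_bit in [("on-die wires", 0.1),
                         ("EMIB-class die-to-die", 0.4),
                         ("SERDES-based package link", 1.5)]:
    watts = link_power_watts(traffic_gb_s, pj_per_bit)
    print(f"{name:>26}: ~{watts:.1f} W at {traffic_gb_s} GB/s")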

        Originally posted by chuckula View Post
        the mesh network is NOT re-encoding every piece of data into a quasi-PCIe packet protocol and shoving it through a SERDES to another transceiver just to move some bits between L3 slices.
        I asked for evidence. Instead you just repeat the claim. I can only conclude that you have none. To repeat your claim without evidence is arguing in bad faith.



        • #14
          Originally posted by chuckula View Post
          There is no need for packetizing each L3 cache query/transfer
          Oh, yes there is. Intel's Mesh definitely does use packets!

          A packet follows a simple routing algorithm:
          • Packets are first routed vertically
          • Packets are then routed horizontally

          A packet originates at a tile (e.g. from the CHA) or an I/O peripheral. It enters the fabric at its local Mesh Stop (CMS). The packet is then routed along the vertical half ring, either north or south, always taking the shortest path. Once the packet reaches its destination row, it will be taken off the vertical half ring and placed on the horizontal half ring where it will continue to the destination tile. Once the packet reaches the destination tile, it will interface back with the tile via its Mesh Stop.
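          A minimal sketch of that vertical-then-horizontal routing over a grid of mesh stops (the coordinates, hop counting, and lack of ring wrap-around are simplifying assumptions; the real half rings pick the shorter direction):

          Code:
def mesh_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Mesh stops visited by a packet, routed vertically first, then horizontally."""
    row, col = src
    path = [src]
    while row != dst[0]:                 # 1) ride the vertical ring to the destination row
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    while col != dst[1]:                 # 2) then the horizontal ring to the destination tile
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    return path

# Example: a packet from the tile at (0, 0) to the L3 slice at (3, 5)
hops = mesh_route((0, 0), (3, 5))
print(len(hops) - 1, "hops:", hops)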

