Would the DSA on a Sapphire Rapids function as the primary controller for transferring data among multiple GPU accelerators, for example in the Argonne nodes with the six GPUs? It seems something has to orchestrate that.
Intel Details New Data Streaming Accelerator For Future CPUs - Linux Support Started
Originally posted by jayN View Post:
Would the DSA on a Sapphire Rapids function as the primary controller for transferring data among multiple GPU accelerators, for example in the Argonne nodes with the six GPUs? It seems something has to orchestrate that.
Whatever is orchestrating it probably resides on each GPU, with the interconnectivity (e.g. CXL, PCIe, or whatever) simply acting as a switched network. That would scale best and have the lowest latency. Did Intel hint at having DSAs in their GPUs?
Originally posted by coder View Post:
How are they connected?
All the coherency is handled in the host CPU caches. That's one of the main features of CXL... the accelerators can be relatively simple.
Originally posted by jayN View Post:
All the coherency is handled in the host CPU caches. That's one of the main features of CXL... the accelerators can be relatively simple.
Originally posted by coder View Post:
Yeah, but even with bandwidth equivalent to PCIe 5.0, CXL will still be a bottleneck if it's the only way GPUs are connected. Nvidia and AMD both have multi-link inter-GPU connectivity solutions that don't route through the CPU.
According to the above MI100 specs, each accelerator features 3 IF links, though I'm not 100% certain whether they're each at 92 GB/s or whether that's across all links. It sounds like the former, but one can't be too careful in interpreting manufacturer specs.
The current Infinity Fabric data rate is below PCIe 5. I believe AMD currently runs IF on top of PCIe 4 lanes.
According to the article link, the current IF GPU configuration is limited to four by the number of IF links, and does not include the CPU.
Symmetric coherency becomes a bottleneck: it isn't scalable, and it complicates the GPUs by requiring them to carry AMD's proprietary home agent. CXL GPUs would use whatever home agent the host CPUs provide, so they would be compatible with different hosts. CXL's biased coherency is more scalable in bus bandwidth, since snoops are not issued while the bias is held... in other words, the GPU's local memory pool is treated as private while the bias is held.
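The bias-flip behavior described above can be sketched as a toy model. This is only an illustration of the idea (no snoops while the device holds bias; a flush when the page is handed back); the class and method names here are made up, not CXL spec types:

```python
# Toy model of CXL "bias-based" coherency: while the device holds bias over a
# page, the host issues no snoops for it; flipping bias back to the host puts
# the page under normal host-side coherency again. Names are illustrative.

from enum import Enum

class Bias(Enum):
    HOST = "host"      # coherent via the host CPU's home agent; snoops apply
    DEVICE = "device"  # private to the accelerator; no snoop traffic needed

class Page:
    def __init__(self):
        self.bias = Bias.HOST
        self.snoops = 0  # count of snoop messages the host had to issue

    def device_access(self):
        # In device bias, the GPU hits its local memory directly: no snoop.
        if self.bias is Bias.DEVICE:
            return "local, no snoop"
        # In host bias, the access is resolved through the host, which snoops.
        self.snoops += 1
        return "resolved via host, snooped"

    def flip_to_device(self):
        # Host stops caching the page; device gains exclusive (biased) access.
        self.bias = Bias.DEVICE

    def flip_to_host(self):
        # Device flushes its dirty lines before handing the page back.
        self.bias = Bias.HOST

page = Page()
page.flip_to_device()
for _ in range(1000):     # hot GPU kernel loop: zero snoop overhead
    page.device_access()
page.flip_to_host()       # host reads results under normal coherency
print(page.snoops)        # -> 0: no snoops during the biased phase
```

The point of the model is the asymmetry: the thousand "kernel" accesses generate no coherency traffic at all, and the cost is paid only at the coarse-grained bias flips.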
PCIe 5 connections are private between CPU and GPU. I haven't seen a spec for how many simultaneous DMAs the DSA supports, but Intel has two Sapphire Rapids CPUs on each Aurora node, each with 80 PCIe 5 lanes, so it looks like they could support up to 10 GPUs at the x16 bidirectional PCIe 5 data rate, which is 2x the PCIe 4 data rate.
The Aurora topology has only 6 Xe-HPC GPUs per node, leaving four x16 bidirectional CXL connections for communicating with other nodes.
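The lane arithmetic above can be checked with a quick back-of-envelope script. The per-lane rates are the public PCIe figures; the node configuration (two CPUs, 80 lanes each, six GPUs) is as stated in the post:

```python
# Back-of-envelope check of the Aurora-node lane budget described above.
lanes_per_cpu = 80
cpus_per_node = 2
lanes_per_link = 16           # one x16 bidirectional link per GPU

total_lanes = lanes_per_cpu * cpus_per_node          # 160 lanes per node
max_x16_links = total_lanes // lanes_per_link        # 10 possible x16 links

gpus_per_node = 6                                    # six Xe-HPC GPUs
spare_links = max_x16_links - gpus_per_node          # links left for fabric

# PCIe 5.0 runs 32 GT/s per lane vs 16 GT/s for PCIe 4.0, so per-link
# bandwidth doubles (~63 GB/s vs ~31.5 GB/s per direction for x16).
pcie5_gbps_per_lane = 32 * 128 / 130                 # 128b/130b encoding
pcie5_x16_GBps = pcie5_gbps_per_lane * 16 / 8        # per direction

print(max_x16_links, spare_links, round(pcie5_x16_GBps))  # -> 10 4 63
```

This matches the post: ten x16 slots per node, six consumed by GPUs, four left over for inter-node links.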
Last edited by jayN; 12 April 2021, 09:29 AM.
Originally posted by coder View Post:
jayN, in case you didn't see this yet, here's what Nvidia just announced:
Source: https://www.anandtech.com/show/16611...0am-pt1630-utc
Intel bought Habana for its Gaudi training architecture, which was preferred by FB. Intel has the tile manufacturing capability and silicon photonics technology to help Habana reduce power and increase interconnect bandwidth.
Originally posted by jayN View Post:
It's a pretty marketing picture, but doesn't make any sense as an AI training topology.

Originally posted by jayN View Post:
Intel bought Habana for its Gaudi training architecture,
Originally posted by coder View Post:
Okay, sure. Nvidia, who's been leading the way for basically the whole AI revolution, doesn't know where the bottlenecks are, how to scale up, or what makes sense. Right.
Got any references, so we can compare?
Habana GAUDI Training Whitepaper v1.2.pdf
Originally posted by jayN View Post:
So, is the data movement all push-based? It seems like they'd be at a disadvantage for reads.
BTW, I think NVLink is cache-coherent, which is just unnecessary overhead in some important cases (but nice, in others).
One thing I'll say is definitely working against Nvidia, in the future, is their apparent desire to be all things to all users. AI training doesn't need fp64, so why exactly does the A100 still have it?