Intel Details New Data Streaming Accelerator For Future CPUs - Linux Support Started

  • coder
    replied
    Originally posted by jayN View Post
    NVDA probably gets their fp64 by fusing or pipelining FP16 or FP32 operations.
    I think I recall something about that in the Pascal (P100) whitepaper, but that still doesn't mean it's without substantial cost.

    Originally posted by jayN View Post
    I believe Intel's Xe-HPC has dedicated FP64.
    AMD's next CDNA compute processor (can't really call it a GPU if it has no graphics units, right?) seems to be going for a dedicated 64-bit approach.

    Originally posted by jayN View Post
    The HPC people are using AI now, but they also want their 64 bit operations. I saw a good presentation on a project from CERN.
    I don't doubt that, but I'm sure the market for AI training is far larger than HPC. So, it's more that Nvidia should make a processor optimized for AI training, without the HPC baggage. Then, if HPC people want AI, they can use a mix of processors that are each better at the functions they support.

    Don't get me wrong: for the enthusiast, that would be bad. I have a Radeon VII at home and a Titan V at work. I love that these are all-purpose accelerators, from fp64 to AI and graphics. And it was AI that motivated the purchase of the Titan V, so we also got an awesome graphics card as a bonus. I'm just saying that it seems time for Nvidia to finally do a more targeted AI training processor, to remain competitive in that market.

  • jayN
    replied
    Originally posted by coder View Post
    Oh, that's right. I forgot it was all Ethernet.

    So, is the data movement all push-based? It seems like they'd be at a disadvantage for reads.

    BTW, I think NVLink is cache-coherent, which is just unnecessary overhead in some important cases (but nice, in others).

    One thing I'll say is definitely working against Nvidia, in the future, is their apparent desire to be all things to all users. AI training doesn't need fp64, so why exactly does the A100 still have it?
    The AI operations don't need to know anything about the source of the data. It makes sense for all the data to be push-based...

    I haven't seen the Habana Gaudi training code. I'd guess a controlling thread initiates a bunch of Ethernet DMA transfers using the RoCE feature, while the AI operation threads just wait for inputs to arrive. The weights probably get stored off in HBM blocks.
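
    Something like this minimal sketch is the pattern I have in mind (plain pthreads; rdma_push() is a hypothetical stand-in for the RoCE write, not a real Gaudi or ibverbs call):

    /* Push model sketch: a controller thread "pushes" an input buffer (stand-in
     * for a RoCE/RDMA write landing in device-local memory) and a compute thread
     * only waits for data to arrive before running its op. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define BUF_SZ 32

    static char input_buf[BUF_SZ];
    static int data_ready = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

    /* Hypothetical stand-in for the controller posting an RDMA write. */
    static void rdma_push(const char *src)
    {
        pthread_mutex_lock(&lock);
        strncpy(input_buf, src, BUF_SZ - 1);
        data_ready = 1;
        pthread_cond_signal(&cond);      /* tell the consumer its input arrived */
        pthread_mutex_unlock(&lock);
    }

    /* The "AI op" side: it never initiates a read, it just waits for inputs. */
    static void *compute_thread(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!data_ready)
            pthread_cond_wait(&cond, &lock);
        printf("compute: consuming pushed input \"%s\"\n", input_buf);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, compute_thread, NULL);
        rdma_push("activations");        /* controller pushes data toward the consumer */
        pthread_join(t, NULL);
        return 0;
    }

    The point is just that the consumer side never issues a read; it only waits for data somebody else decided to push.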

    NVDA probably gets their fp64 by fusing or pipelining FP16 or FP32 operations. I believe Intel's Xe-HPC has dedicated FP64.

    The HPC people are using AI now, but they also want their 64 bit operations. I saw a good presentation on a project from CERN. There's a write-up here:

    https://www.intel.com/content/www/us...mer-story.html

  • coder
    replied
    Oh, that's right. I forgot it was all Ethernet.

    So, is the data movement all push-based? It seems like they'd be at a disadvantage for reads.

    BTW, I think NVLink is cache-coherent, which is just unnecessary overhead in some important cases (but nice, in others).

    One thing I'll say is definitely working against Nvidia, in the future, is their apparent desire to be all things to all users. AI training doesn't need fp64, so why exactly does the A100 still have it?

  • jayN
    replied
    Originally posted by coder View Post
    Okay, sure. Nvidia, who's been leading the way for basically the whole AI revolution, doesn't know where the bottlenecks are, how to scale up, or what makes sense. Right.


    Got any references, so we can compare?
    NVDA knows what they're doing. They apparently just let their marketing come up with that graphic... which doesn't give away anything.

    Habana GAUDI Training Whitepaper v1.2.pdf

  • coder
    replied
    Originally posted by jayN View Post
    It's a pretty marketing picture, but doesn't make any sense as an AI training topology.
    Okay, sure. Nvidia, who's been leading the way for basically the whole AI revolution, doesn't know where the bottlenecks are, how to scale up, or what makes sense. Right.

    Originally posted by jayN View Post
    Intel bought Habana for its Gaudi training architecture,
    Got any references, so we can compare?

  • jayN
    replied
    Originally posted by coder View Post
    jayN, in case you didn't see this yet, here's what Nvidia just announced:


    Source: https://www.anandtech.com/show/16611...0am-pt1630-utc
    It's a pretty marketing picture, but doesn't make any sense as an AI training topology.

    Intel bought Habana for its Gaudi training architecture, which was preferred by FB. Intel has the tile manufacturing capability and silicon photonics technology to help Habana reduce power and increase interconnect bandwidth.

  • coder
    replied
    jayN, in case you didn't see this yet, here's what Nvidia just announced:



    Source: https://www.anandtech.com/show/16611...0am-pt1630-utc

  • jayN
    replied
    Originally posted by coder View Post
    Yeah, but even with bandwidth equivalent to PCIe 5.0, CXL will still be a bottleneck if it's the only way GPUs are connected. Nvidia and AMD both have multi-link inter-GPU connectivity solutions that don't route through the CPU.

    According to the above MI100 specs, each accelerator features 3 IF links, though I'm not 100% certain if they're each at 92 GB/s or if that's across all links. It sounds like the former, but one can't be too careful in interpreting mfg specs.
    https://www.nextplatform.com/2020/11...-accelerators/

    The current Infinity Fabric data rate is below PCIe 5. I believe AMD currently uses IF on top of PCIe 4 lanes.

    According to the article link, the current IF GPU config is limited to 4 GPUs by the number of IF links, and does not include the CPU.

    Symmetric coherency becomes a bottleneck... it's not scalable, and it complicates the GPUs, requiring them to have AMD's proprietary home agent. CXL GPUs would use whatever home agent the host CPUs provide... so they would be compatible with different hosts. CXL biased coherency is more scalable for bus bandwidth since snoops are not used while the bias is held... in other words, the GPU-local memory pool is considered private while the bias is held.
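
    Roughly how I picture the bias flow, as a sketch (the names are hypothetical, not a real CXL.cache/CXL.mem API):

    /* Bias-flip sketch: a region in device bias is treated as private GPU-local
     * memory with no host snoops, and is handed back to host bias when the CPU
     * needs coherent access. */
    #include <stdio.h>

    enum bias { HOST_BIAS, DEVICE_BIAS };

    struct region {
        const char *name;
        enum bias bias;
    };

    /* Hypothetical bias flip; in hardware this is negotiated between the
     * device and the host's home agent. */
    static void set_bias(struct region *r, enum bias b)
    {
        r->bias = b;
        printf("%s -> %s bias\n", r->name, b == DEVICE_BIAS ? "device" : "host");
    }

    static void gpu_kernel(struct region *r)
    {
        /* While the region is in device bias, local accesses generate no
         * snoop traffic back to the host. */
        printf("GPU working on %s privately (no snoops)\n", r->name);
    }

    int main(void)
    {
        struct region weights = { "weights", HOST_BIAS };

        set_bias(&weights, DEVICE_BIAS);   /* hand the pool to the GPU */
        gpu_kernel(&weights);
        set_bias(&weights, HOST_BIAS);     /* hand it back for coherent CPU reads */
        return 0;
    }

    While the pool sits in device bias, the host's home agent doesn't need to be snooped; coherent traffic only comes back once the bias is flipped so the CPU can access it.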

    PCIe 5 connections are private between CPU and GPU. I haven't seen a spec of how many DMAs are simultaneously supported by the DSA, but Intel has two Sapphire Rapids CPUs on the Aurora nodes, each with 80 PCIe 5 lanes, so it looks like they could support up to 10 GPUs, each with a 16-lane bidirectional PCIe 5 connection, which is 2x the PCIe 4 data rate.
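
    (Rough lane math, assuming all 80 lanes per socket are available for accelerators: 2 CPUs x 80 lanes = 160 lanes, and 160 / 16 = 10 x16 links.)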

    The Aurora topology only has 6 Xe-HPC GPUs per node, leaving four 16-lane, bidirectional CXL connections for communicating with other nodes.
    Last edited by jayN; 12 April 2021, 09:29 AM.

  • coder
    replied
    Originally posted by jayN View Post
    All the coherency is handled in the host CPU caches. That's one of the main features of CXL... the accelerators can be relatively simple.
    Yeah, but even with bandwidth equivalent to PCIe 5.0, CXL will still be a bottleneck if it's the only way GPUs are connected. Nvidia and AMD both have multi-link inter-GPU connectivity solutions that don't route through the CPU.

    According to the above MI100 specs, each accelerator features 3 IF links, though I'm not 100% certain if they're each at 92 GB/s or if that's across all links. It sounds like the former, but one can't be too careful in interpreting mfg specs.

  • jayN
    replied
    Originally posted by coder View Post
    How are they connected?

    Whatever is orchestrating it probably resides on each GPU, with the interconnectivity (e.g. CXL, PCIe, or whatever) simply acting as a switched network. That would scale best and have the lowest latency. Did Intel hint at having DSAs in their GPUs?
    Slides leaked during the last week show DSA in Sapphire Rapids.

    All the coherency is handled in the host CPU caches. That's one of the main features of CXL... the accelerators can be relatively simple.
