Intel Details New Data Streaming Accelerator For Future CPUs - Linux Support Started

coder replied

12 April 2021, 12:07 AM
Originally posted by jayN View Post

Would the DSA on a Sapphire Rapids function as the primary controller for transferring data among multiple GPU accelerators, for example in the Argonne nodes with the six GPUs? It seems something has to orchestrate that.

How are they connected?

Whatever is orchestrating it probably resides on each GPU, with the interconnect (e.g. CXL, PCIe, or whatever) simply acting as a switched network. That would scale best and have the lowest latency. Did Intel hint at having DSAs in their GPUs?
jayN replied

11 April 2021, 08:32 PM
Would the DSA on a Sapphire Rapids function as the primary controller for transferring data among multiple GPU accelerators, for example in the Argonne nodes with the six GPUs? It seems something has to orchestrate that.
jayN replied

10 April 2021, 12:59 PM
Originally posted by coder View Post

With CPUs having SMT and so many cores, we don't need things like DMA engines, any more.

A couple of interesting features in there ... it handles Optane, has an operation for flushing caches, and can create and apply Delta records.

Intel also added CXL on Sapphire Rapids. It has biased cache coherency, but some DMA transfers between processor cache and accelerator memory may be needed when the bias is flipped. I wonder if they plan to use the DSA to do those transfers.

The operations for creating and applying Delta records are interesting, too. Perhaps they can be used to minimize writes to NVM.
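To make the write-minimizing idea concrete, here is a rough software sketch of what "create delta" and "apply delta" could look like. It only illustrates the concept: the 8-byte comparison chunk, the (offset, bytes) record layout, and the names `create_delta` and `apply_delta` are assumptions for this example and are not taken from the DSA spec.

```python
CHUNK = 8  # comparison granularity in bytes (an assumption for this sketch)


def create_delta(original: bytes, modified: bytes, chunk: int = CHUNK):
    """Return a list of (offset, new_bytes) pairs, one per chunk that differs."""
    if len(original) != len(modified):
        raise ValueError("buffers must have the same length")
    delta = []
    for offset in range(0, len(original), chunk):
        new = modified[offset:offset + chunk]
        if original[offset:offset + chunk] != new:
            delta.append((offset, new))
    return delta


def apply_delta(target: bytearray, delta) -> int:
    """Patch target in place and return the number of bytes written."""
    written = 0
    for offset, new in delta:
        target[offset:offset + len(new)] = new
        written += len(new)
    return written


# A 4 KiB page in which only 4 bytes changed.
old = bytes(4096)
new = bytearray(old)
new[100:104] = b"\xde\xad\xbe\xef"

delta = create_delta(old, bytes(new))
copy = bytearray(old)
written = apply_delta(copy, delta)
print(len(delta), written, copy == new)  # 1 8 True
```

In this toy case, bringing the stale copy up to date touches 8 bytes instead of rewriting all 4096, which is the kind of saving that might matter for NVM endurance.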
coder replied

10 April 2021, 11:46 AM
Originally posted by jayN View Post

Sapphire Rapids does have a DSA, according to recent slide leaks.

There is also an Oct 2020 detailed spec available at this link
https://software.intel.com/content/w...ification.html

Cool. Thanks for sharing!

I have to wonder how much of that can just be handled by a few CPU threads. With CPUs having SMT and so many cores, we don't need things like DMA engines, any more. Sure, it's a little bit of a waste to burn a big CPU core on that stuff, but a win for programmability.

If I had to choose between an Intel CPU with those engines but fewer cores, or an AMD/ARM CPU with more cores for the same or less $$$, my choice wouldn't be the Intel CPU.
jayN replied

10 April 2021, 11:09 AM
Sapphire Rapids does have a DSA, according to recent slide leaks.

There is also an Oct 2020 detailed spec available at this link

https://software.intel.com/content/www/us/en/develop/articles/intel-data-streaming-accelerator-architecture-specification.html
starshipeleven replied

23 November 2019, 07:14 AM
Originally posted by mrugiero View Post

I have no idea if ARM does DMA tho.

DMA is there if the protocol requires it. PCIe and SATA/SAS have DMA, while USB 3.0 and earlier versions do not.

DMA is also very much there in any SoC as all processors in the SoC (CPU, GPU, modems, hardware decoding for media, and more) are sharing the same RAM.

One of the reasons projects like Purism's phone have the modem on a USB bus (electrical USB interface) instead of integrated in the SoC is just that. The modem will have its own RAM and its own stuff and will have no access to the "app processor" (the main CPU running the OS) world.
coder replied

22 November 2019, 10:19 AM
Originally posted by mrugiero View Post

There are use cases, though.

Yeah, like I think one thing they might be targeting is routing traffic between CPUs in a mesh, or something like that. Anyway, there was that reference to clustering, and it made me think of Nvidia's GPU interconnect technology, NVLink.

Originally posted by mrugiero View Post

For example, deep packet processing at line rate on high speed interfaces requires saturating all cores, and not everyone has many cores either,

Good points. I think datacenter networking is starting to embrace 400 Gbps(!). Also, toward the lower end of core counts, there may be some embedded use cases, where power-efficiency could benefit from using simpler, lower-clocked cores for data movement.
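For a sense of scale, here is a back-of-the-envelope calculation of what line rate means at 400 Gbps. It assumes worst-case minimum-size (64-byte) Ethernet frames plus the standard 20 bytes of per-frame wire overhead, and it assumes packets spread evenly across cores; the function names are made up for this sketch.

```python
# Per-frame overhead on the wire: preamble (7) + start-of-frame delimiter (1)
# + inter-frame gap (12).
ETH_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int = 64) -> float:
    """Maximum frames per second the link can carry at the given frame size."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def per_packet_budget_ns(link_bps: float, cores: int, frame_bytes: int = 64) -> float:
    """Time each core may spend per packet if load is spread evenly over cores."""
    return cores * 1e9 / packets_per_second(link_bps, frame_bytes)


pps = packets_per_second(400e9)
print(f"{pps / 1e6:.0f} Mpps")                             # 595 Mpps
print(f"{per_packet_budget_ns(400e9, 1):.2f} ns/packet")   # 1.68 ns/packet
print(f"{per_packet_budget_ns(400e9, 16):.2f} ns/packet")  # 26.88 ns/packet
```

Real traffic rarely consists only of minimum-size frames, so this is an upper bound on the packet rate, but it shows why even 16 cores leave only a few tens of nanoseconds per packet.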
mrugiero replied

22 November 2019, 10:07 AM
Originally posted by coder View Post

Oh, quite simply. If you only have one CPU with one hardware thread, then the idea of tying it up with data movement is very unpalatable. However, if your CPU has 8 cores with 16 hardware threads, and one of them is tied up doing data movement across PCIe to a slow device, then you almost don't notice or much care - especially since that thread might be paired with a compute-heavy thread that keeps most of the core's functional units busy, anyhow.

So, the value proposition of a dedicated DMA engine is much lower. Not to speak of a 28-core CPU with 56 threads, or a 64-core CPU with 128 threads.

Oh, that makes sense. I thought you meant something like DMA not working properly or being slowed down by hyper threading, so I was confused.
Further, big.LITTLE in the ARM world can be seen that way, maybe even more than HT.
There are use cases, though. For example, deep packet processing at line rate on high speed interfaces requires saturating all cores, and not everyone has many cores either, especially in the developing world.
For example, I live in Argentina, and 2-4 threads are still common, even in retail computers, and that's also still very common in cellphones AFAIK.
But yeah, I see your point.
I have no idea if ARM does DMA tho.
coder replied

22 November 2019, 09:47 AM
Originally posted by mrugiero View Post

What do you mean here? How are those related?

Oh, quite simply. If you only have one CPU with one hardware thread, then the idea of tying it up with data movement is very unpalatable. However, if your CPU has 8 cores with 16 hardware threads, and one of them is tied up doing data movement across PCIe to a slow device, then you almost don't notice or much care - especially since that thread might be paired with a compute-heavy thread that keeps most of the core's functional units busy, anyhow.

So, the value proposition of a dedicated DMA engine is much lower. Not to speak of a 28-core CPU with 56 threads, or a 64-core CPU with 128 threads.
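The argument above is essentially arithmetic, so here it is spelled out for the thread counts mentioned. This is a crude model: it treats every hardware thread as an equal share of the machine and ignores memory bandwidth and cache effects. The helper name `share_tied_up` is just for illustration.

```python
def share_tied_up(total_threads: int, busy_threads: int = 1) -> float:
    """Fraction of hardware threads occupied by data movement."""
    if not 0 <= busy_threads <= total_threads:
        raise ValueError("busy_threads must be between 0 and total_threads")
    return busy_threads / total_threads


# Thread counts taken from the post: 1, 8c/16t, 28c/56t, 64c/128t.
for threads in (1, 16, 56, 128):
    print(f"{threads:>3} threads: {share_tied_up(threads):.2%} tied up")
```

If the copying thread shares a core with a compute-heavy SMT sibling, the real cost is arguably lower still, since the sibling keeps using that core's functional units.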
mrugiero replied

22 November 2019, 09:33 AM
Originally posted by coder View Post

Wow, I figured Hyper Threading killed DMA

What do you mean here? How are those related?