**fitzie** · 11 July 2023, 01:47 PM

Isn't NVMe over TCP already doing this for storage? I don't understand that use case, or why it needs a facility in DRI for it either.

- Distributed raw block storage applications transfer large amounts of data with
remote SSDs, much of this data does not require host processing.

But it's obvious that so much of ML is ripe for optimization; there's probably at least an order of magnitude of performance gains coming just from driving the transistors optimally.

**aviallon** · 11 July 2023, 02:50 PM

Originally posted by fitzie

Isn't NVMe over TCP already doing this for storage? I don't understand that use case, or why it needs a facility in DRI for it either.

But it's obvious that so much of ML is ripe for optimization; there's probably at least an order of magnitude of performance gains coming just from driving the transistors optimally.

Ah, I can tell you we have a use for that in very high bandwidth signal processing.
We are already doing this, but having it standardized at least in part will reduce our development burden.

**coder** · 11 July 2023, 11:30 PM

Originally posted by fitzie

Isn't NVMe over TCP already doing this for storage? I don't understand that use case, or why it needs a facility in DRI for it either.

I can't comment on the first part, other than to say that maybe they're not using NVMe over TCP, for some reason.

Regarding the second part, what they're saying is they want the NIC to write incoming data directly into the GPU's memory, rather than first having it go to host memory and then having to take a second trip from host to GPU memory. Makes sense to me.
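To put rough numbers on that, here is a toy model of the two receive paths. It is only a sketch under a simplifying assumption: the payload crosses PCIe twice when staged through host memory (NIC to host, then host to GPU) and once when the NIC writes straight into device memory. The function name `pcie_bytes` is made up for illustration and is not part of any kernel or driver API.

```python
def pcie_bytes(payload_bytes: int, direct_to_device: bool) -> int:
    """Bytes moved over PCIe to land one payload in device memory (toy model)."""
    # Assumption: staged path = NIC -> host RAM, then host RAM -> GPU (2 crossings);
    # direct path = NIC -> GPU memory (1 crossing).
    crossings = 1 if direct_to_device else 2
    return payload_bytes * crossings


payload = 1 << 30  # 1 GiB of received data
staged = pcie_bytes(payload, direct_to_device=False)
direct = pcie_bytes(payload, direct_to_device=True)

print(f"staged via host:  {staged / 2**30:.0f} GiB over PCIe")
print(f"direct to device: {direct / 2**30:.0f} GiB over PCIe")
```

The model ignores headers, descriptors, and PCIe topology; the staged path also costs a write and a read of host DRAM, which the direct path avoids.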

**fwyzard** · 12 July 2023, 03:27 AM

Originally posted by coder

Regarding the second part, what they're saying is they want the NIC to write incoming data directly into the GPU's memory, rather than first having it go to host memory and then having to take a second trip from host to GPU memory. Makes sense to me.

It does make sense, but isn't that something that InfiniBand and Ethernet with RoCE already do via their DMA/RDMA capabilities?

/puzzled

**pegasus** · 12 July 2023, 03:27 AM

We already have RDMA ... why bother with the additional complexity of TCP? To keep CPU cores busy?

**coder** · 12 July 2023, 03:53 AM

Originally posted by pegasus

We already have RDMA ... why bother with the additional complexity of TCP?

First, I think RDMA only handles directly writing to userspace memory - not device memory.

I don't know why they're using TCP.

Originally posted by pegasus

To keep CPU cores busy?

For this to make any sense, the NIC would have to implement 100% TCP offload. Otherwise, you couldn't avoid a pass through the host CPU, which is the entire point.

**fwyzard** · 12 July 2023, 04:06 AM

Originally posted by coder

First, I think RDMA only handles directly writing to userspace memory - not device memory.

RDMA definitely supports reading and writing to device memory: for example, you can write data from the NIC directly into a GPU memory buffer.

**dragorth** · 12 July 2023, 04:12 AM

Isn't the point of this enterprise CXL? They want to be able to outfit a compute node with access to memory located somewhere on the network rather than being limited to the 2-4 TB on board, so they make a PCI memory board with more memory and have multiple machines able to use it.

**coder** · 12 July 2023, 04:13 AM

Originally posted by fwyzard

RDMA definitely supports reading and writing to device memory: for example, you can write data from the NIC directly into a GPU memory buffer.

Okay, then I guess we can conclude that they were not able (or allowed) to use RDMA, for whatever reason.

Google Posts Experimental Linux Code For "Device Memory TCP" - Network To/From Accelerator RAM
