Facebook Still Pursuing "NetGPU" - Working On AMD GPU Support In Addition To NVIDIA
Facebook's recent patches implementing NetGPU, one of which was NVIDIA-focused, sparked the recent controversy around "GPL condoms" in the kernel and ultimately led to new protections in Linux 5.9. Facebook is still working on the NetGPU code with hopes of upstreaming it, but in addition to the NVIDIA driver support they are now also working on AMD GPU support with the open-source driver.
As a reminder, NetGPU is Facebook's work-in-progress code for supporting zero-copy DMA transfers between the network adapter and the graphics processor. This RDMA alternative still leaves protocol processing on the host CPU but would allow for much faster data processing on the GPU thanks to the zero-copy direct memory access between the NIC and GPU. Facebook is looking to use NetGPU in their machine learning clusters, with plans for 200 Gbps NICs and GPUs attached to a PCI Express switch. The CPU alone can't handle the dataset traffic for their intense machine learning workloads, but NetGPU should make their design feasible.
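To put the 200 Gbps figure in perspective, here is a rough back-of-envelope sketch in Python. The copy model and both function names are illustrative assumptions, not anything taken from Facebook's patches: it simply assumes that without zero-copy, the payload is written to host memory once by the NIC and read once on its way to the GPU, with each extra CPU copy adding another read and write.

```python
# Back-of-envelope estimate only. The copy model is an assumption made for
# illustration; it is not derived from the NetGPU patches themselves.


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert a link rate in gigabits/s to gigabytes/s (decimal units)."""
    return gbps / 8.0


def host_payload_traffic(nic_gbps: float, cpu_copies: int, zero_copy: bool) -> float:
    """Estimate payload traffic through host memory in GB/s.

    Assumed model (hypothetical):
      - zero_copy: payload goes NIC -> GPU directly, so no payload bytes
        touch host memory (header traffic for protocol processing is ignored).
      - otherwise: the NIC writes the payload to host RAM and it is read back
        out for the GPU (2 passes), plus a read and a write per CPU copy.
    """
    if zero_copy:
        return 0.0
    rate = gbps_to_gbytes_per_s(nic_gbps)
    return rate * (2 + 2 * cpu_copies)


if __name__ == "__main__":
    print("Line rate:", gbps_to_gbytes_per_s(200), "GB/s")
    print("Bounce buffer, one CPU copy:", host_payload_traffic(200, 1, False), "GB/s")
    print("Zero-copy NIC to GPU:", host_payload_traffic(200, 0, True), "GB/s")
```

Under these assumptions a single 200 Gbps NIC at line rate carries 25 GB/s of payload, and bouncing that through host memory with one CPU copy would mean roughly 100 GB/s of memory traffic per NIC, which is the kind of load the zero-copy path is meant to avoid.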
NetGPU itself is quite interesting and will hopefully make it into the mainline Linux kernel. The controversy stemmed from the previously proposed patches, and their driver shim, depending on the NVIDIA proprietary driver for GPU usage.
The good news is that AMD GPU support for NetGPU is a work in progress. Unfortunately, though, the Radeon Open eCosystem (ROCm) stack in its current form isn't sufficient. Some changes to the ROCm code are being looked at because DMA-BUF support is currently not exposed by their thunk driver.
Having the AMD GPU support working off an open-source compute stack will also clear an obstacle for NetGPU getting review and approval from other upstream kernel developers rather than being contingent upon the NVIDIA proprietary driver.
More details on NetGPU via this slide deck by Facebook engineer Jonathan Lemon.