Habana Labs' Gaudi NIC Support Being Worked On For Linux Kernel
Intel-owned AI startup Habana Labs is working on expanding its "Gaudi" Linux support to now cover the network interface (NIC) found on this AI training accelerator hardware.
Back with Linux 5.8, Gaudi support was added to the Habana Labs accelerator driver. Previously the Habana Labs open-source Linux driver only supported their Goya AI inference accelerator, but with the latest stable Linux kernel release there is now support for the Gaudi AI training accelerator.
One important piece missing from the current driver, though, has been NIC support for scaling out across multiple accelerators. Gaudi NIC support is now available in patch form and could be mainlined for Linux 5.10.
The 15 patches add NIC support for handling the scale-out interconnect used in distributed deep learning training. As many as "tens of thousands" of Gaudi accelerators can be connected using RDMA over Converged Ethernet (RoCE) for this distributed training.
Upstream driver maintainer Oded Gabbay of Habana Labs explained, "Each GAUDI exposes 10x100GbE ports that are designed to scale-out the inter-GAUDI communication by integrating a complete communication engine on-die. This native integration allows users to use the same scaling technology, both inside the server and rack (termed as scale-up), as well as for scaling across racks (scale-out). The racks can be connected directly between GAUDI processors, or through any number of standard Ethernet switches. The driver exposes the NIC ports to the user as standard Ethernet ports by registering each port to the networking subsystem. This allows the user to manage the ports with standard tools such as ifconfig, ethtool, etc. It also enables us to connect to the Linux networking stack and thus support standard networking protocols, such as IPv4, IPv6, TCP, etc. In addition, we can also leverage protocols such as DCB for dynamically configuring priorities to avoid congestion."
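To illustrate the "standard Ethernet ports" point in the quote above, here is a minimal Python sketch that enumerates registered network devices through the usual Linux kernel interfaces. Since the Gaudi NIC ports are registered with the networking subsystem like any other netdev, they should show up here alongside regular Ethernet interfaces once these patches are merged; the actual interface naming is up to the driver and udev, so nothing below assumes a particular name.

```python
#!/usr/bin/env python3
# Minimal sketch: enumerate network interfaces via the standard Linux
# netdev interfaces. Because the Gaudi driver registers its NIC ports
# with the kernel networking subsystem, they should appear here like
# any other Ethernet device once the patch series is merged.

import socket
from pathlib import Path

def list_interfaces():
    """Return (index, name, speed_mbps) for every registered netdev."""
    results = []
    for index, name in socket.if_nameindex():
        speed_path = Path("/sys/class/net") / name / "speed"
        try:
            speed = int(speed_path.read_text().strip())
        except (OSError, ValueError):
            speed = None  # link down or a virtual device without a speed
        results.append((index, name, speed))
    return results

if __name__ == "__main__":
    for index, name, speed in list_interfaces():
        label = f"{speed} Mb/s" if speed and speed > 0 else "unknown speed"
        print(f"{index}: {name} ({label})")
```

The same ports would of course also be manageable with the standard tools the maintainer mentions, such as ethtool and ip/ifconfig.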
Great seeing all of the Habana Labs open-source support work continue.