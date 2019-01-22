Habana Labs is one of the companies working on an "AI" processor for accelerating deep learning inference and training workloads. Their initial product is the Goya processor, which is already production-qualified. Today they published initial open-source Linux kernel driver patches for review, with the aim of getting them into the mainline kernel.

The Habana Labs start-up has published quite compelling AI benchmarks that, for popular inference workloads, put Goya's performance ahead of the likes of the NVIDIA Tesla T4, Intel Cascade Lake, Xilinx Alveo, and other competing platforms. They claim this AI processor can achieve 15,000 images per second on ResNet-50. The Goya HL-1000 is geared primarily toward inference workloads, while for training they will also be releasing the Gaudi HL-2000, which is expected to begin sampling next quarter.

The driver currently exposes a total of five IOCTLs. One IOCTL allows the application to submit workloads to the device, and another to wait on completion of submitted workloads. The other three IOCTLs are used for memory management, command buffer creation and information/status retrieval.
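For readers unfamiliar with how such IOCTLs are identified, the sketch below reproduces the request-number encoding that the kernel's `_IOWR()` macro uses on x86, Arm, and most other architectures (a few, such as PowerPC and MIPS, use a different bit layout). The five entries only mirror the roles described above: the names, the magic letter, the command numbers, and the argument sizes are placeholders chosen for illustration and are not taken from the patches.

```python
# Bit layout used by asm-generic/ioctl.h (x86, Arm, and most other architectures).
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS = 8, 8, 14
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = _IOC_NRSHIFT + _IOC_NRBITS      # 8
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS  # 16
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS   # 30
_IOC_WRITE, _IOC_READ = 1, 2


def iowr(magic, nr, size):
    """Equivalent of _IOWR(magic, nr, struct) for an argument struct of `size` bytes."""
    if not (0 <= nr < 1 << _IOC_NRBITS and 0 <= size < 1 << _IOC_SIZEBITS):
        raise ValueError("nr or size out of range")
    direction = _IOC_READ | _IOC_WRITE  # data flows both to and from the kernel
    return (
        (direction << _IOC_DIRSHIFT)
        | (size << _IOC_SIZESHIFT)
        | (ord(magic) << _IOC_TYPESHIFT)
        | (nr << _IOC_NRSHIFT)
    )


# Placeholder table: five commands matching the roles described in the text.
# Magic letter, command numbers, and struct sizes are NOT from the patches.
TOY_IOCTLS = {
    name: iowr("H", nr, size)
    for nr, (name, size) in enumerate(
        [("info", 24), ("command_buffer", 24), ("submit", 48),
         ("wait", 24), ("memory", 40)],
        start=1,
    )
}

for name, request in TOY_IOCTLS.items():
    print(f"{name:15s} 0x{request:08X}")
```

An application would pass such a request number, together with a pointer to the matching argument structure, to `ioctl()` on the open device file descriptor.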



In addition, the driver exposes several sensors through the hwmon subsystem and provides various system-level information in sysfs for system administrators.
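As a rough illustration of how an administrator might read such sensors, the sketch below walks the generic hwmon sysfs layout (`/sys/class/hwmon/hwmon*/`) and collects every `temp*_input` value, which hwmon reports in millidegrees Celsius. The helper name `read_hwmon_temps` is made up for this example. The article does not say which chip name or sensor files the Habana driver registers, so the code simply reports whatever is present.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip_name": {sensor_file: degrees_celsius}}.

    Hypothetical helper: it relies only on the generic hwmon sysfs layout,
    not on anything specific to the Habana driver.
    """
    readings = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return readings  # no hwmon directory on this system
    for chip in sorted(root_path.glob("hwmon*")):
        try:
            name = (chip / "name").read_text().strip()
        except OSError:
            name = chip.name  # fall back to the directory name
        temps = {}
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                # hwmon temperatures are exposed in millidegrees Celsius
                temps[sensor.name] = int(sensor.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # unreadable or non-numeric sensor; skip it
        readings[f"{chip.name}:{name}"] = temps
    return readings
```

Calling `read_hwmon_temps()` with no arguments scans the running system; passing a different `root` makes the function easy to try against a copied or mocked directory tree.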



The first step for an application process is to open the hlX device it wants to work with. Each call to open creates a new "context" for that application in the driver's internal structures, and a unique ASID is assigned to that context. The context object lives until the process releases the file descriptor AND its command submissions have finished executing on the device.
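The lifetime rule can be modeled with plain reference counting. What follows is a toy user-space model, not driver code: `ToyContext`, its methods, and the use of a reference count are assumptions made for illustration. The article only states that the context outlives the file descriptor while submissions are still executing.

```python
class ToyContext:
    """Toy model of a per-process context (hypothetical; not the driver's code)."""

    def __init__(self, asid):
        self.asid = asid        # address-space ID assigned at open time
        self.refs = 1           # one reference held by the open file descriptor
        self.fd_open = True
        self.destroyed = False

    def _put(self):
        self.refs -= 1
        if self.refs == 0:
            self.destroyed = True  # last reference dropped: tear the context down

    def submit(self):
        """A command submission pins the context until it completes."""
        if not self.fd_open:
            raise RuntimeError("cannot submit after the descriptor was released")
        self.refs += 1

    def complete(self):
        """Called when one in-flight submission finishes on the device."""
        in_flight = self.refs - (1 if self.fd_open else 0)
        if in_flight <= 0:
            raise RuntimeError("no submission in flight")
        self._put()

    def release_fd(self):
        """The process closes its file descriptor."""
        if not self.fd_open:
            raise RuntimeError("descriptor already released")
        self.fd_open = False
        self._put()


ctx = ToyContext(asid=1)
ctx.submit()
ctx.release_fd()
print(ctx.destroyed)  # False: one submission is still executing
ctx.complete()
print(ctx.destroyed)  # True: descriptor released and all work finished
```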



The next step is for the application to request information about the device, such as the amount of DDR4 memory. The application can then create command buffers for its command submissions, and allocate and map device or host memory (host memory can only be mapped) to the device's internal MMU subsystem.



At this point the application can load various deep learning topologies to the device DDR memory. After that, it can start to submit inference workloads using those topologies. For each workload, the application receives a sequence number that represents the workload. The application can then query the driver regarding the status of the workload using that sequence number.
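The submit-then-query pattern can be sketched with a small in-memory model. Everything here is hypothetical: the class and method names are invented, the status strings are placeholders, and the assumption that workloads complete in submission order is made only to keep the example short.

```python
import itertools


class ToySubmissionQueue:
    """Toy model of submit / query-by-sequence-number (hypothetical names)."""

    def __init__(self):
        self._next_seq = itertools.count(1)  # monotonically increasing IDs
        self._pending = []                   # submitted but unfinished, oldest first
        self._last_issued = 0

    def submit(self, workload):
        """Queue a workload and return the sequence number that identifies it."""
        seq = next(self._next_seq)
        self._pending.append((seq, workload))
        self._last_issued = seq
        return seq

    def device_finishes_one(self):
        """Simulate the device completing the oldest pending workload."""
        if self._pending:
            self._pending.pop(0)

    def status(self, seq):
        """Return 'busy' or 'done' for a previously returned sequence number."""
        if not 1 <= seq <= self._last_issued:
            raise ValueError(f"unknown sequence number {seq}")
        busy = any(s == seq for s, _ in self._pending)
        return "busy" if busy else "done"


q = ToySubmissionQueue()
a = q.submit("resnet50-batch-0")
b = q.submit("resnet50-batch-1")
q.device_finishes_one()
print(a, q.status(a))  # 1 done
print(b, q.status(b))  # 2 busy
```

Because the sequence number is the only handle the application keeps, it can submit several workloads back to back and check on each one later, in any order.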