The Growing Number Of AI Accelerator Drivers Reignites Linux Kernel Driver Debate
The latest example comes from Intel's ongoing effort to mainline a Linux driver for its Gaussian and Neural Accelerator ("GNA"), a feature already found in current mobile SoCs. The GNA driver is one of several AI-related accelerator drivers the company is currently working on for Linux: Intel is still working on its Nervana NNPI driver, and there is also all of the Habana Labs AI training/inference accelerator code already in the kernel, with Intel having acquired that firm. Even among Intel's own AI kernel drivers there is no uniform API or concerted effort to support them all at a low level; each has its own separate kernel driver components. Granted, up the stack Intel is pushing oneAPI and software efforts like oneDNN for application programmers, but at the kernel level its multiple drivers in this area remain fragmented, not to mention the vastly different interfaces of drivers from other hardware vendors.
On the multiple Intel AI-related drivers all being vastly different, Linux's second-in-command Greg Kroah-Hartman commented, "Please work with those developers to share code and userspace api and tools. Having the community review two totally different apis and drivers for the same type of functionality from the same company is totally wasteful of our time and energy." Other veteran kernel maintainers like Arnd Bergmann agree with that statement and are even calling for a cross-vendor subsystem for machine learning and neural networks that could share more interfaces and code.
In the past there has been talk of introducing a hardware accelerator subsystem for the growing number of AI training/inference accelerators and similar devices. So far, though, no such subsystem has been formalized, so these drivers tend to be volleyed into "char/misc" or, in some cases, treated more like drivers for the DRM/GPU area.
Even between the char/misc and Direct Rendering Manager areas of the kernel, the standards for which drivers are accepted differ. As Intel's Daniel Vetter, who is also a DRM co-maintainer, brought up, the GPU/DRM driver area and char/misc have different requirements. On the graphics driver side, a fully open-source user-space implementation is required before code is accepted into mainline. But for the char/misc area overseen by Greg Kroah-Hartman, "Greg doesn't, he's happy if all he has is the runtime library with some tests." Basically, the DRM/GPU area requires a fully working open-source user-space implementation, while char/misc applies the lower standard of open-source user-space code that exercises the kernel interfaces but is not necessarily a complete working implementation for users.
Thus in situations like these, differing views quickly get expressed. As Vetter replied to Kroah-Hartman, "I'm unfortunately not the CEO of [Intel, as for why there are these completely different drivers]. Also you're the one who keeps accepting drivers that the accel folks (aka dri-devel community) said shouldn't be merged, so my internal bargaining power is zero to force something reasonable here. So please don't blame me for this mess, this is yours entirely...Like I said, if the 'g' [of the drivers/gpu area] really annoys you that much, feel free to send in a patch to rename drivers/gpu to drivers/xpu."
Greg Kroah-Hartman also ended up realizing that the Habana Labs AI driver he pulled into char/misc a long while ago doesn't even have a fully open-source compiler, something that wasn't widely noted previously.
Ultimately there doesn't seem to be any easy, straightforward solution for supporting the growing number of diverse AI/accelerator devices in the mainline/upstream Linux kernel. Many of the accelerators coming to market are from start-ups that may not be familiar with the kernel's upstreaming processes and the ways of the open-source community. Vendors in this hyper-competitive space are also reluctant to open up all of their code, and their diverse architectures lead to quite different driver designs. We'll see how the situation plays out over the months and years ahead. Ultimately we are likely to see a more focused XPU/accelerator subsystem and, ideally, clearer and more uniform rules for upstreaming drivers, rather than even more code being lodged under char/misc in a hodgepodge manner and adding to this growing dumpster fire.