AMD GPU Operator Announced For Automated Driver Installation & Kubernetes Support

These new software tools from AMD are designed to ease the setup and ongoing maintenance for server administrators managing clusters of AMD GPU/accelerator-enabled servers in the data center.
AMD GPU Operator allows for automated driver installation and management of the AMD driver / ROCm compute stack, easy deployment of the AMD GPU device plugin, simplified GPU resource allocation for containers, automatic worker node labeling, and support for upstream/vanilla Kubernetes.
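To give a rough idea of what that simplified GPU resource allocation looks like in practice, here is a minimal sketch using the Kubernetes Python client to launch a pod that requests one AMD GPU. It assumes the device plugin advertises GPUs under the amd.com/gpu extended resource name and that a kubeconfig is available; the container image and pod name are just illustrative:

```python
# Minimal sketch: request one AMD GPU for a pod via the Kubernetes Python client.
# Assumes the AMD device plugin advertises GPUs as the "amd.com/gpu" extended
# resource (check the operator docs) and that ~/.kube/config points at the cluster.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rocm-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rocm",
                image="rocm/rocm-terminal",          # example image, substitute your own
                command=["rocm-smi"],                # print GPU status and exit
                resources=client.V1ResourceRequirements(
                    limits={"amd.com/gpu": "1"}      # one GPU from the device plugin
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
print("Pod created; check `kubectl logs rocm-smoke-test` for rocm-smi output.")
```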
The AMD Device Metrics Exporter provides Prometheus-formatted metrics collection for AMD GPUs within HPC and AI environments, covering a wide range of GPU telemetry data along with Kubernetes integration and more. Among the metrics collected by the AMD Device Metrics Exporter are operating temperatures, performance/utilization data, clock speeds, power consumption, device memory statistics, and PCI Express metrics.
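As a rough illustration of consuming those Prometheus-formatted metrics, the sketch below scrapes a /metrics endpoint and prints the temperature-, power-, and clock-related series. The endpoint address and metric-name keywords are assumptions rather than the exporter's documented defaults, so adjust them to match your deployment:

```python
# Minimal sketch: scrape a Prometheus-formatted /metrics endpoint and print
# temperature/power/clock related series. The endpoint URL and metric-name
# keywords are assumptions; adjust them to the exporter's actual configuration.
import requests
from prometheus_client.parser import text_string_to_metric_families

EXPORTER_URL = "http://localhost:5000/metrics"  # hypothetical exporter address
KEYWORDS = ("temperature", "power", "clock")    # rough filter on metric names

resp = requests.get(EXPORTER_URL, timeout=5)
resp.raise_for_status()

for family in text_string_to_metric_families(resp.text):
    if any(key in family.name for key in KEYWORDS):
        for sample in family.samples:
            # sample.labels typically identifies the GPU the reading came from
            print(f"{sample.name}{sample.labels} = {sample.value}")
```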
AMD GPU Operator aims to deliver a "zero-touch GPU setup" with its automatic ROCm driver management, paired with enterprise-minded features to make initial deployment and ongoing maintenance much easier for AMD hardware across AI and HPC deployments of varying sizes.
So far these new software tools from AMD only support Instinct MI300X / MI250 / MI210 hardware. The Kubernetes support covers Ubuntu 22.04 LTS and Ubuntu 24.04 LTS, while Red Hat OpenShift is supported on Red Hat CoreOS.
More details on these new AMD GPU enterprise software tools via rocm.blogs.amd.com. These new tools arrive one day after the release of ROCm 6.3.2. The new tools are open-source with the code available via device-metrics-exporter and gpu-operator on GitHub.
AMD GPU Operator quietly saw its v1.0 release this past November and the AMD Device Metrics Exporter celebrated its v1.0 release in December, but the software was only "announced" today via the ROCm blog.