Available with limitations in: CSE Lite (1.73), CSE Pro (1.73)
Available without limitations in: EE
The module lifecycle stage: General Availability
The module has requirements for installation
The module brings NVIDIA GPU management to Deckhouse Kubernetes Platform. It supports two stacks (modes):
- Device Plugin mode (default): NFD/GFD, NVIDIA device plugin (Exclusive/TimeSlicing/MIG), MIG manager, DCGM/Exporter with Grafana dashboards. Deployed into d8-nvidia-gpu.
- DRA mode (dra.enabled: true): Dynamic Resource Allocation with gpu-controller, gpu-node-agent, nvidia-adapter, DCGM/Exporter. Deployed into d8-nvidia-gpu. Requires Kubernetes ≥ 1.34.
DRA mode is experimental. Do not enable it in production clusters.
Main features
- Automatic containerd configuration and NVIDIA runtime setup on GPU nodes via NodeGroupConfiguration — no manual containerd tuning required.
- Three GPU sharing modes: Exclusive (one Pod per GPU), TimeSlicing (multiple Pods share one GPU), and MIG (hardware partitioning for A100/H100).
- DRA mode (Dynamic Resource Allocation) for fine-grained GPU scheduling using the Kubernetes ResourceClaim/ResourceSlice API (requires Kubernetes ≥ 1.34).
- Automatic MIG profile reconfiguration with controlled node drain and reboot when required.
- DCGM Exporter with pre-built Grafana dashboards for GPU health, utilisation, and MIG state monitoring.
- PhysicalGPU custom resource (DRA mode) — inventory of physical GPUs per node with PCI details, driver state, and partition capabilities.
- Seamless migration from Device Plugin mode to DRA without manual namespace cleanup.
Prerequisites
- NVIDIA driver and NVIDIA Container Toolkit are installed on target nodes (containerd/runtime is configured by this module via NodeGroupConfiguration).
- spec.gpu is set in the target NodeGroup (sharing: Exclusive, TimeSlicing, or MIG).
- For DRA mode: Kubernetes ≥ 1.34 and dra.enabled: true in ModuleConfig.
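The spec.gpu prerequisite could look roughly like the NodeGroup below. This is a minimal sketch, not a verified manifest: the apiVersion, the nodeType value, and the name gpu-nodes are assumptions for illustration; only spec.gpu.sharing and its three values come from the text above.

```yaml
# Hypothetical NodeGroup for GPU nodes (name and nodeType are assumptions).
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-nodes          # illustrative name
spec:
  nodeType: Static         # assumption: adjust to how the nodes are provisioned
  gpu:
    sharing: Exclusive     # one of: Exclusive, TimeSlicing, MIG
```

TimeSlicing and MIG are likely to need additional mode-specific settings under spec.gpu; check the NodeGroup reference for the exact fields.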
Enable DRA mode
DRA mode is experimental. Do not enable it in production clusters.
To switch to DRA, set dra.enabled: true in ModuleConfig gpu and configure the NodeGroup:
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: gpu
spec:
  enabled: true
  version: 1
  settings:
    dra:
      enabled: true

The module will automatically remove the Device Plugin stack from d8-nvidia-gpu and deploy the DRA stack into the same namespace. No manual cleanup is required.
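For orientation, a workload in DRA mode requests a GPU through the standard Kubernetes ResourceClaim API rather than through a resource limit. The sketch below assumes the resource.k8s.io/v1 API available in Kubernetes 1.34. The DeviceClass name gpu.example.com, the object names, and the image are hypothetical placeholders, because this page does not state which DeviceClass the module publishes.

```yaml
# Template from which a ResourceClaim is created for each Pod.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                         # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.example.com   # placeholder: use the module's DeviceClass
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-test                       # hypothetical name
spec:
  restartPolicy: Never
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: app
      image: registry.example.com/cuda-app:latest   # placeholder image
      resources:
        claims:
          - name: gpu                      # binds the container to the claim above
```

If the Pod stays Pending, inspecting the generated ResourceClaim and the ResourceSlice objects published for the node is a reasonable first step.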
What the module deploys
Device Plugin mode (namespace d8-nvidia-gpu):
- Node Feature Discovery (NFD) and GPU Feature Discovery (GFD) for GPU labeling.
- NVIDIA device plugin supporting Exclusive/TimeSlicing/MIG modes.
- MIG manager and configs for MIG-capable GPUs.
- DCGM Exporter and ready-made Grafana dashboards for GPU health.
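With the Device Plugin stack in place, a Pod typically requests a GPU as an extended resource. The sketch below assumes the module exposes GPUs under nvidia.com/gpu, the resource name used by the upstream NVIDIA device plugin. The Pod name and image are illustrative only.

```yaml
# Smoke-test Pod that requests one GPU and prints the device list.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                     # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example image; any CUDA base image should do
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                # assumed resource name (upstream device plugin default)
```

In Exclusive mode such a Pod would hold the whole GPU. With TimeSlicing or MIG, the same kind of request may map to a shared slice or a MIG partition, depending on how the NodeGroup is configured.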
DRA mode (namespace d8-nvidia-gpu):
- gpu-controller: central DRA controller managing the ResourceClaim/ResourceSlice API.
- gpu-node-agent: per-node GPU discovery and ResourceSlice publishing.
- nvidia-adapter: per-node CDI device injection into containers.
- DCGM Exporter and Grafana dashboards.
See module configuration and architecture for a full component breakdown.
Monitoring
DCGM Exporter publishes metrics; Grafana dashboards show GPU load and MIG state.
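Beyond the dashboards, the exported metrics can also drive alerts. The following is a sketch under stated assumptions: it uses the CustomPrometheusRules resource from the Deckhouse Prometheus module and the DCGM_FI_DEV_GPU_TEMP metric from the default DCGM Exporter metric set. The rule name, threshold, and severity label value are illustrative and not taken from this module.

```yaml
# Hypothetical alert on GPU temperature reported by DCGM Exporter.
apiVersion: deckhouse.io/v1
kind: CustomPrometheusRules
metadata:
  name: gpu-temperature                    # hypothetical name
spec:
  groups:
    - name: gpu.temperature
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85  # threshold in °C, chosen for illustration
          for: 5m
          labels:
            severity_level: "4"            # assumed label convention
          annotations:
            summary: GPU temperature has been above 85 °C for 5 minutes.
```

Verify which metrics the exporter actually exposes in the cluster before relying on a rule like this, since the enabled metric set can differ between deployments.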