Available with limitations in: CSE Lite (1.73), CSE Pro (1.73)

Available without limitations in: EE

The module lifecycle stage: General Availability
The module has requirements for installation

The module brings NVIDIA GPU management to Deckhouse Kubernetes Platform. It supports two stacks (modes):

  • Device Plugin mode (default): NFD/GFD, NVIDIA device plugin (Exclusive/TimeSlicing/MIG), MIG manager, and DCGM Exporter with Grafana dashboards. Deployed into d8-nvidia-gpu.
  • DRA mode (dra.enabled: true): Dynamic Resource Allocation with gpu-controller, gpu-node-agent, nvidia-adapter, and DCGM Exporter. Deployed into d8-nvidia-gpu. Requires Kubernetes ≥ 1.34.

DRA mode is experimental. Do not enable it in production clusters.

Main features

  • Automatic containerd configuration and NVIDIA runtime setup on GPU nodes via NodeGroupConfiguration — no manual containerd tuning required.
  • Three GPU sharing modes: Exclusive (one Pod per GPU), TimeSlicing (multiple Pods share one GPU), and MIG (hardware partitioning for A100/H100).
  • DRA mode (Dynamic Resource Allocation) for fine-grained GPU scheduling using the Kubernetes ResourceClaim/ResourceSlice API (requires Kubernetes ≥ 1.34).
  • Automatic MIG profile reconfiguration with controlled node drain and reboot when required.
  • DCGM Exporter with pre-built Grafana dashboards for GPU health, utilisation, and MIG state monitoring.
  • PhysicalGPU custom resource (DRA mode) — inventory of physical GPUs per node with PCI details, driver state, and partition capabilities.
  • Seamless migration from Device Plugin mode to DRA without manual namespace cleanup.

Prerequisites

  • NVIDIA driver and NVIDIA Container Toolkit are installed on target nodes (containerd/runtime is configured by this module via NodeGroupConfiguration).
  • spec.gpu is set in the target NodeGroup (sharing: Exclusive, TimeSlicing, or MIG).
  • For DRA mode: Kubernetes ≥ 1.34 and dra.enabled: true in ModuleConfig.
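
As an illustration of the spec.gpu prerequisite, a NodeGroup might look roughly like the sketch below. Only spec.gpu.sharing and its three values come from this page; the NodeGroup name gpu-workers and the nodeType value are assumptions made for the example, so check the NodeGroup reference for the exact schema and any mode-specific settings.

```yaml
# Minimal sketch, not a complete NodeGroup definition.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-workers        # hypothetical name
spec:
  nodeType: Static         # assumption: use the node type that matches your cluster
  gpu:
    sharing: Exclusive     # one of: Exclusive, TimeSlicing, MIG
```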

Enable DRA mode

DRA mode is experimental. Do not enable it in production clusters.

To switch to DRA, set dra.enabled: true in the gpu ModuleConfig. The target NodeGroup must already have spec.gpu configured, as described in Prerequisites:

apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: gpu
spec:
  enabled: true
  version: 1
  settings:
    dra:
      enabled: true

The module will automatically remove the Device Plugin stack from d8-nvidia-gpu and deploy the DRA stack into the same namespace. No manual cleanup is required.

What the module deploys

Device Plugin mode (namespace d8-nvidia-gpu):

  • Node Feature Discovery (NFD) and GPU Feature Discovery (GFD) for GPU labeling.
  • NVIDIA device plugin supporting Exclusive/TimeSlicing/MIG modes.
  • MIG manager and configs for MIG-capable GPUs.
  • DCGM Exporter and pre-built Grafana dashboards for GPU health.
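
To show how a workload might consume a GPU in Device Plugin mode, here is a minimal Pod sketch. It assumes the device plugin advertises the upstream default resource name nvidia.com/gpu; the Pod name and container image are placeholders, and MIG setups may expose differently named resources depending on configuration.

```yaml
# Minimal sketch: a Pod requesting one GPU through the device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                       # assumption: upstream default resource name
```

Depending on how the GPU nodes are labeled and tainted, the Pod may also need a nodeSelector or tolerations.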

DRA mode (namespace d8-nvidia-gpu):

  • gpu-controller: the central DRA controller that works with the ResourceClaim/ResourceSlice API.
  • gpu-node-agent: per-node GPU discovery and ResourceSlice publishing.
  • nvidia-adapter: per-node CDI device injection into containers.
  • DCGM Exporter and Grafana dashboards.
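
In DRA mode, workloads request GPUs through the Kubernetes ResourceClaim API rather than through extended resources. The sketch below uses the upstream resource.k8s.io/v1 API available in Kubernetes 1.34. The DeviceClass name example-gpu-class is hypothetical, since this page does not name the DeviceClass the module publishes; the object names and container image are placeholders too.

```yaml
# Minimal sketch: a claim template plus a Pod that consumes it.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                                # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: example-gpu-class    # hypothetical; use the DeviceClass present in your cluster
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-example                           # hypothetical name
spec:
  restartPolicy: Never
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder image
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu                             # refers to spec.resourceClaims[].name
```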

See module configuration and architecture for a full component breakdown.

Monitoring

DCGM Exporter publishes GPU metrics in Prometheus format; the bundled Grafana dashboards show GPU health, utilisation, and MIG state.
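
The exported metrics can also drive alerts. The fragment below is a generic Prometheus rule group, given only as a sketch. It assumes the exporter exposes the DCGM_FI_DEV_GPU_TEMP counter, which depends on the exporter's metric configuration. The group name, alert name, and the 85 °C threshold are illustrative choices, not values taken from this module.

```yaml
# Minimal sketch of a Prometheus alerting rule on a DCGM metric.
groups:
  - name: gpu-example                             # hypothetical group name
    rules:
      - alert: GPUHighTemperature                 # hypothetical alert name
        expr: DCGM_FI_DEV_GPU_TEMP > 85           # illustrative threshold in degrees Celsius
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature has been above 85 °C for 5 minutes"
```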