Available in: EE
The module lifecycle stage: General Availability
The module has requirements for installation
GPU module
The module brings the NVIDIA stack to Deckhouse Kubernetes Platform for GPU workloads: NFD/GFD, device plugin (Exclusive/TimeSlicing/MIG), MIG manager, and DCGM/Exporter with Grafana dashboards.
Prerequisites
- The NVIDIA driver and NVIDIA Container Toolkit are installed on the target nodes (the containerd runtime is configured by this module via NodeGroupConfiguration).
- `spec.gpu` is set in the target NodeGroup (`sharing`: `Exclusive`, `TimeSlicing`, or `MIG`).
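Driver installation itself can also be automated with a NodeGroupConfiguration that runs before the module's own scripts. A minimal sketch, assuming Ubuntu nodes; the script weight, bundle, NodeGroup name, and package names are illustrative and must be adapted to your environment:

```yaml
apiVersion: deckhouse.io/v1alpha1
kind: NodeGroupConfiguration
metadata:
  name: install-nvidia-driver.sh
spec:
  weight: 30                # illustrative; pick a weight that runs before the module's gpu-* scripts
  bundles: ["ubuntu-lts"]   # adjust to your OS bundle
  nodeGroups: ["gpu"]       # illustrative; match your GPU NodeGroup name
  content: |
    # Assumption: Ubuntu nodes; driver package names differ per distribution.
    if ! command -v nvidia-smi >/dev/null; then
      apt-get update
      apt-get install -y nvidia-driver-535 nvidia-container-toolkit
    fi
```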
What the module deploys
- Node Feature Discovery (NFD) and GPU Feature Discovery (GFD) for GPU labeling.
- NVIDIA device plugin supporting Exclusive/TimeSlicing/MIG modes.
- MIG manager and configs for MIG-capable GPUs.
- DCGM Exporter and ready Grafana dashboards for GPU health.
How to enable
Set GPU parameters in the required NodeGroup; the module will label the nodes and deploy the components in `d8-nvidia-gpu`:

```yaml
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu
spec:
  cri: Containerd
  gpu:
    sharing: MIG # or Exclusive / TimeSlicing
    mig:
      partedConfig: all-1g.5gb
```

How it works (step-by-step)
Below is the full sequence and the key labels/taints involved.
Key labels (who sets them and why)
- `node.deckhouse.io/gpu=""`: set by the module hook on GPU NodeGroup nodes; together with `node.deckhouse.io/gpu-setup-complete=""` it participates in scheduling the "global" stack DaemonSets (NFD worker, DCGM/Exporter, etc.).
- `node.deckhouse.io/device-gpu.config=<Exclusive|TimeSlicing|MIG>`: set by the hook; used by the config-manager in GFD and the device plugin to select the right config.
- `node.deckhouse.io/gpu-setup-complete=""`: set by `gpu-sysctl.sh` (NodeGroupConfiguration) after local checks and configuration; until it appears, the stack (NFD worker, GFD, device plugin, DCGM/Exporter) is not scheduled onto the node.
- `feature.node.kubernetes.io/pci-*.present=true`: published by NFD; used (together with the NodeGroup selector) to schedule GFD onto nodes with NVIDIA PCI devices.
- `nvidia.com/*` (for example `nvidia.com/gpu.count`, `nvidia.com/mig.capable=true|false`): published by NFD from feature files generated by GFD (the NFD master is allowed to write into `nvidia.com`). The `nvidia.com/gpu.count>0` condition participates in scheduling the device plugin.
- `nvidia.com/mig.config=<profile|all-disabled>`: set by the hook; the desired MIG profile (e.g., `all-1g.5gb`) or `all-disabled` for MIG rollback/disable.
- `nvidia.com/mig.config.state=<pending|rebooting|success|failed>`: set by `nvidia-mig-manager` during reconfiguration.
- Taint `mig-reconfigure=true:NoSchedule`: set and removed by `nvidia-mig-manager` while the operation is running.
- Annotations `update.node.deckhouse.io/disruption-approved`, `update.node.deckhouse.io/draining`, `update.node.deckhouse.io/drained`: used for a controlled node drain during MIG reconfiguration.
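Taken together, a GPU node that has completed setup might carry a label set like the following. This is an illustrative excerpt only: the exact values depend on the hardware and the selected mode, and the PCI label format depends on the NFD configuration:

```yaml
# Illustrative excerpt of node labels after setup (MIG mode, hypothetical values)
metadata:
  labels:
    node.deckhouse.io/gpu: ""
    node.deckhouse.io/gpu-setup-complete: ""
    node.deckhouse.io/device-gpu.config: "MIG"
    feature.node.kubernetes.io/pci-0302_10de.present: "true"  # 0302: 3D controller class, 10de: NVIDIA vendor ID
    nvidia.com/gpu.count: "1"
    nvidia.com/mig.capable: "true"
    nvidia.com/mig.config: "all-1g.5gb"
    nvidia.com/mig.config.state: "success"
```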
Common flow (all modes)
- You set or change `spec.gpu` in the target NodeGroup.
- Helm deploys (or removes) components in `d8-nvidia-gpu`:
  - GFD and the device plugin as per-NodeGroup DaemonSets;
  - NFD master/gc on master nodes; NFD worker and DCGM/Exporter on GPU nodes.
- The module hook updates node labels:
  - adds or updates `node.deckhouse.io/gpu=""` and `node.deckhouse.io/device-gpu.config=...`;
  - for MIG, adds or updates `nvidia.com/mig.config=...`;
  - on GPU disable, removes `node.deckhouse.io/gpu` and `node.deckhouse.io/device-gpu.config`; if the node had `nvidia.com/mig.config`, sets it to `all-disabled` to trigger MIG rollback.
- NodeGroupConfiguration scripts run on the node (in weight order): `gpu-check.sh` → `gpu-runtime.sh` → `gpu-sysctl.sh`:
  - on success, the node gets `node.deckhouse.io/gpu-setup-complete=""`;
  - on cleanup or errors, the label is cleared, the runtime drop-in is removed, and the sysctl settings are restored.
- After that, the stack converges on the node:
  - NFD publishes `feature.node.kubernetes.io/pci-*` and serves custom feature files;
  - GFD writes GPU feature files into `/etc/kubernetes/node-feature-discovery/features.d`, and NFD publishes them as `nvidia.com/*` labels;
  - the device plugin exposes resources (depending on the mode), and DCGM/Exporter start exposing metrics.
Exclusive mode
Enable
- Set `spec.gpu.sharing: Exclusive`.
- The module sets `node.deckhouse.io/device-gpu.config=Exclusive`; after `node.deckhouse.io/gpu-setup-complete` is present, it brings up GFD and the device plugin.
- The node exposes `nvidia.com/gpu`; each Pod gets a full GPU.
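A workload then requests a whole GPU through the extended resource. A minimal sketch; the Pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test  # illustrative name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # one whole GPU in Exclusive mode
```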
Disable
- Remove `spec.gpu` from the NodeGroup (or move the nodes to a non-GPU NodeGroup).
- The module removes `node.deckhouse.io/gpu` and `node.deckhouse.io/device-gpu.config` from the node; NodeGroupConfiguration clears `node.deckhouse.io/gpu-setup-complete` and rolls back the runtime/sysctl changes.
- GPU stack DaemonSets stop being scheduled onto the node; NFD GC eventually cleans up the `nvidia.com/*` labels.
TimeSlicing mode
Enable
- Set `spec.gpu.sharing: TimeSlicing` and, optionally, `spec.gpu.timeSlicing.partitionCount` (default is 4).
- The module sets `node.deckhouse.io/device-gpu.config=TimeSlicing`; the device plugin applies the time-slicing config.
- The node still exposes `nvidia.com/gpu`, but with more "virtual" slots (per `partitionCount`).
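For example, a NodeGroup that splits each physical GPU into 8 time-sliced slots could look like this (a sketch built from the fields named above; the NodeGroup name is illustrative):

```yaml
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-shared  # illustrative name
spec:
  cri: Containerd
  gpu:
    sharing: TimeSlicing
    timeSlicing:
      partitionCount: 8  # each physical GPU is exposed as 8 nvidia.com/gpu slots
```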
Disable
- Switch to `Exclusive` or remove `spec.gpu` from the NodeGroup.
- On a mode switch, `node.deckhouse.io/device-gpu.config` changes and GFD/device plugin reload their configs; for a full GPU disable, follow Exclusive → Disable.
MIG mode
Enable
- Set:
  - `spec.gpu.sharing: MIG`
  - `spec.gpu.mig.partedConfig: <profile name>` (e.g., `all-1g.5gb`)
- The module sets `node.deckhouse.io/device-gpu.config=MIG` and `nvidia.com/mig.config=<profile>`.
- Once GFD/NFD publishes `nvidia.com/mig.capable=true` (the GPU supports MIG), `nvidia-mig-manager` is scheduled onto the node (it runs as a DaemonSet and reacts to changes of the `nvidia.com/mig.config` label).
- When it needs to reconfigure MIG, it:
  - sets `nvidia.com/mig.config.state=pending`;
  - "pauses" GPU clients by setting the deploy labels to `paused-for-mig-change`:
    - `nvidia.com/gpu.deploy.device-plugin`
    - `nvidia.com/gpu.deploy.gpu-feature-discovery`
    - `nvidia.com/gpu.deploy.dcgm-exporter`
    - `nvidia.com/gpu.deploy.dcgm`
    - `nvidia.com/gpu.deploy.nvsm`
  - sets the taint `mig-reconfigure=true:NoSchedule`;
  - waits for `update.node.deckhouse.io/disruption-approved` (or an already started drain via `update.node.deckhouse.io/draining`/`update.node.deckhouse.io/drained`), then sets `update.node.deckhouse.io/draining=bashible` and waits for `update.node.deckhouse.io/drained`;
  - deletes (and waits for the shutdown of) the GPU client Pods on the node: device plugin, GFD, DCGM Exporter, DCGM, plus the validators (cuda/plugin);
  - applies the selected MIG profile (it may set `nvidia.com/mig.config.state=rebooting` and reboot the node if needed);
  - finishes with `nvidia.com/mig.config.state=success` (or `failed`), removes the `mig-reconfigure` taint, runs `kubectl uncordon`, removes `update.node.deckhouse.io/drained`/`update.node.deckhouse.io/disruption-approved`, restores the `nvidia.com/gpu.deploy.*` labels to `true`, and returns the node to service.
- After the profile is applied, the cluster exposes resources like `nvidia.com/mig-<profile>` (e.g., `nvidia.com/mig-1g.5gb`).
- If the `nvidia-mig-manager` Pod is restarted or removed, its `preStop` hook waits for the active operation to finish (the `/processing` file), then runs `kubectl uncordon`, removes the `mig-reconfigure=true:NoSchedule` taint, and removes `update.node.deckhouse.io/drained`/`update.node.deckhouse.io/disruption-approved` (best effort; it may not complete on forced termination).
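A workload then requests a specific MIG slice instead of a whole GPU. A minimal sketch for the `all-1g.5gb` profile; the Pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-test  # illustrative name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # one 1g.5gb MIG slice
```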
Disable (MIG → non-MIG, or full GPU disable)
- When switching away from MIG (to `Exclusive`/`TimeSlicing`) or removing `spec.gpu`, the module sets `nvidia.com/mig.config=all-disabled` to roll back MIG on the node.
- `nvidia-mig-manager` applies `all-disabled` in the same way (taint + drain + operation). If `nvidia.com/mig.capable` is already missing (e.g., after the GPU/NFD labels were removed), the manager is still scheduled while `nvidia.com/mig.config=all-disabled` exists, so it can finish the rollback and remove the taint.
- After a successful MIG disable (`nvidia.com/mig.config.state=success`), the script removes `nvidia.com/mig.config` and `nvidia.com/mig.config.state` if the node is no longer in MIG mode or GPU is disabled, so the manager does not hang on the node waiting for a label change.
- The manager does not run on GPUs without MIG support (`nvidia.com/mig.capable=false`); use `Exclusive` or `TimeSlicing` for such nodes.
Monitoring
DCGM Exporter publishes metrics; Grafana dashboards show GPU load and MIG state.
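The exported metrics can also drive alerting. A hedged sketch of a PrometheusRule built on `DCGM_FI_DEV_GPU_UTIL` (a standard DCGM Exporter metric); the rule name, threshold, and the `Hostname` label are assumptions that depend on your Prometheus and scrape configuration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts  # illustrative name
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUHighUtilization  # illustrative alert
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) > 90
      for: 15m
      annotations:
        summary: "GPU utilization on {{ $labels.Hostname }} has been above 90% for 15 minutes"
```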