The module lifecycle stage: General Availability
The module has requirements for installation

The module operates in two modes depending on the dra.enabled configuration parameter: Device Plugin mode (default) and DRA mode (experimental).
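
As a rough illustration, the mode switch could be expressed in a ModuleConfig like the one below. This is a sketch only: the module name nvidia-gpu is inferred from the d8-nvidia-gpu namespace and the spec.version value is a placeholder, so both are assumptions to check against the module's settings schema. Only the dra.enabled parameter comes from the text above.

```yaml
# Hypothetical ModuleConfig; the name "nvidia-gpu" and version "1" are assumptions.
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: nvidia-gpu   # assumed module name
spec:
  enabled: true
  version: 1         # assumed settings schema version
  settings:
    dra:
      enabled: false # default: Device Plugin mode; true selects the experimental DRA mode
```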

Device Plugin mode

In this mode dra.enabled is set to false (default).

Deployed into namespace d8-nvidia-gpu:

| Component | Kind | What it does |
|---|---|---|
| node-feature-discovery-master | Deployment | Collects hardware feature labels from workers and publishes them as node labels (feature.node.kubernetes.io/*, nvidia.com/*). |
| node-feature-discovery-gc | Deployment | Removes stale NFD labels from deleted nodes. |
| node-feature-discovery-worker | DaemonSet | Runs on every GPU node, detects PCI/USB devices and writes feature files for NFD master. |
| gpu-feature-discovery-<ng> | DaemonSet | Per-NodeGroup GFD: queries the GPU driver via NVML, writes detailed GPU feature files consumed by NFD master (nvidia.com/gpu.count, nvidia.com/mig.capable, etc.). |
| nvidia-device-plugin-<ng> | DaemonSet | Per-NodeGroup device plugin: exposes nvidia.com/gpu (Exclusive/TimeSlicing) or nvidia.com/mig-<profile> (MIG) resources to the Kubernetes scheduler. |
| nvidia-mig-manager | DaemonSet | Manages MIG reconfiguration on A100/H100 nodes: drains the node, applies the MIG profile, reboots if required, then returns the node to service. |
| nvidia-dcgm | DaemonSet | NVIDIA DCGM daemon: collects raw GPU telemetry (health, ECC, power, utilisation). |
| nvidia-dcgm-exporter | DaemonSet | Converts DCGM metrics into Prometheus format; scraped by the in-cluster Prometheus. |

DRA mode

In this mode dra.enabled is set to true.

DRA mode is experimental. Do not enable it in production clusters.

Deployed into namespace d8-nvidia-gpu:

| Component | Kind | What it does |
|---|---|---|
| gpu-controller | Deployment | Central DRA controller: implements the ResourceClaim / ResourceSlice Kubernetes API, manages GPU allocation across the cluster, runs admission webhooks for DRA objects. Supports HA (up to 3 replicas). |
| gpu-node-agent | DaemonSet | Runs on every managed node with supported GPUs detected. |
| nvidia-adapter | DaemonSet | Runs on every managed node with supported GPUs detected. |
| gpu-dcgm | DaemonSet | DCGM daemon (same role as in Device Plugin mode, integrated into the DRA stack). |
| gpu-dcgm-exporter | DaemonSet | Prometheus exporter for DCGM metrics (same role as in Device Plugin mode). |

Shared infrastructure (both modes)

| Component | Namespace | What it does |
|---|---|---|
| NodeGroupConfiguration scripts (Device Plugin mode) | node | gpu-check.sh validates the driver; gpu-runtime.sh configures containerd with default_runtime=nvidia; gpu-sysctl.sh applies kernel parameters and sets node.deckhouse.io/gpu-setup-complete when ready. |
| NodeGroupConfiguration scripts (DRA) | node | gpu-setup.sh verifies driver and VFIO readiness, sets node.deckhouse.io/gpu-setup-complete; gpu-dra-feature-gates.sh patches control-plane static pod manifests to enable the DynamicResourceAllocation feature gate; gpu-vfio.sh loads VFIO kernel modules and sets node.deckhouse.io/gpu-vfio-ready; gpu-vfio-iommu.yaml configures IOMMU kernel parameters. |
| physicalgpus.gpu.deckhouse.io CRD | cluster | DRA mode: represents a physical GPU discovered by gpu-node-agent; used by gpu-controller for allocation decisions. |

How it works

Below is the full sequence and the key labels/taints involved.

Key labels (who sets them and why)

  • node.deckhouse.io/gpu="" — set by the module hook on GPU NodeGroup nodes; together with node.deckhouse.io/gpu-setup-complete="" it participates in scheduling “global” stack DaemonSets (NFD worker, DCGM/Exporter, etc.).
  • node.deckhouse.io/device-gpu.config=<Exclusive|TimeSlicing|MIG> — set by the hook; used by config-manager in GFD and the device plugin to select the right config.
  • node.deckhouse.io/gpu-setup-complete="" — set by gpu-sysctl.sh (Device Plugin mode) or gpu-setup.sh (DRA) after local checks/config; until it appears, the stack is not scheduled onto the node (NFD worker, GFD, device plugin, DCGM/Exporter).
  • feature.node.kubernetes.io/pci-*.present=true — published by NFD; used (together with the NodeGroup selector) to schedule GFD onto nodes with NVIDIA PCI devices.
  • nvidia.com/* (for example nvidia.com/gpu.count, nvidia.com/mig.capable=true|false) — published by NFD from feature files generated by GFD (NFD master is allowed to write into nvidia.com). The nvidia.com/gpu.count>0 label participates in scheduling the device plugin.
  • nvidia.com/mig.config=<profile|all-disabled> — set by the hook: desired MIG profile (e.g., all-1g.5gb) or all-disabled for MIG rollback/disable.
  • nvidia.com/mig.config.state=<pending|rebooting|success|failed> — set by nvidia-mig-manager during reconfiguration.
  • node.deckhouse.io/gpu-vfio-ready="" — set by gpu-vfio.sh (DRA mode) when VFIO kernel modules are loaded and IOMMU groups are available; used as a readiness gate for VFIO-based GPU passthrough workloads.
  • taint mig-reconfigure=true:NoSchedule — set/removed by nvidia-mig-manager while the operation is running.
  • annotations update.node.deckhouse.io/disruption-approved, update.node.deckhouse.io/draining, update.node.deckhouse.io/drained — used for controlled node drain during MIG.
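
To make the list above more concrete, here is a sketch of how these labels might look on a node that has finished setup in Device Plugin mode with Exclusive sharing. The node name and the label values (GPU count, MIG capability) are illustrative assumptions; the feature.node.kubernetes.io/pci-* labels are left out because their exact names depend on the NFD configuration.

```yaml
# Illustrative fragment of a Node object; name and values are examples only.
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-0                                # hypothetical node name
  labels:
    node.deckhouse.io/gpu: ""                       # set by the module hook
    node.deckhouse.io/device-gpu.config: Exclusive  # set by the module hook
    node.deckhouse.io/gpu-setup-complete: ""        # set by gpu-sysctl.sh (Device Plugin mode)
    nvidia.com/gpu.count: "1"                       # published by NFD from GFD feature files
    nvidia.com/mig.capable: "false"                 # published by NFD from GFD feature files
```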

Common flow (for nvidia-operator mode)

  1. You set/change spec.gpu in the target NodeGroup.
  2. Helm deploys (or removes) components in d8-nvidia-gpu:
    • GFD and device plugin as per-NodeGroup DaemonSets;
    • NFD master/gc on master nodes; NFD worker/DCGM/Exporter on GPU nodes.
  3. The module hook updates node labels:
    • adds/updates node.deckhouse.io/gpu="" and node.deckhouse.io/device-gpu.config=...;
    • for MIG it adds/updates nvidia.com/mig.config=...;
    • on GPU disable it removes node.deckhouse.io/gpu and node.deckhouse.io/device-gpu.config; if the node had nvidia.com/mig.config it sets it to all-disabled to trigger MIG rollback.
  4. NodeGroupConfiguration scripts run on the node (by weight): gpu-check.sh → gpu-runtime.sh → gpu-sysctl.sh.
    • on success the node gets node.deckhouse.io/gpu-setup-complete="";
    • on cleanup/errors the label is cleared, the runtime drop-in is removed, and sysctl is restored.
  5. After that, the stack “converges” on the node:
    • NFD publishes feature.node.kubernetes.io/pci-* and serves custom feature files;
    • GFD writes GPU feature files into /etc/kubernetes/node-feature-discovery/features.d, and NFD publishes them as nvidia.com/*;
    • the device plugin exposes resources (depending on the mode), DCGM/Exporter start exposing metrics.

Exclusive mode (for nvidia-operator mode)

Enable

  1. Set spec.gpu.sharing: Exclusive.
  2. The module sets node.deckhouse.io/device-gpu.config=Exclusive; after node.deckhouse.io/gpu-setup-complete is present, it brings up GFD and the device plugin.
  3. The node exposes nvidia.com/gpu; each Pod gets a full GPU.
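
The enable steps above could look roughly like this in manifests. This is a minimal sketch: the NodeGroup name, the Pod name, and the image are hypothetical placeholders, and the other fields a NodeGroup requires are omitted.

```yaml
# NodeGroup fragment: only the GPU-related field from the text is shown.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu                  # hypothetical NodeGroup name
spec:
  gpu:
    sharing: Exclusive
---
# Pod requesting one full GPU through the resource exposed by the device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test            # hypothetical Pod name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: registry.example.com/cuda-sample:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1  # extended resources are requested in whole units
```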

Disable

  1. Remove spec.gpu from the NodeGroup (or move nodes to a non-GPU NodeGroup).
  2. The module removes node.deckhouse.io/gpu and node.deckhouse.io/device-gpu.config from the node; NodeGroupConfiguration clears node.deckhouse.io/gpu-setup-complete and rolls back runtime/sysctl changes.
  3. GPU stack DaemonSets stop being scheduled onto the node; NFD GC eventually cleans nvidia.com/* labels.

TimeSlicing mode (for nvidia-operator mode)

Enable

  1. Set spec.gpu.sharing: TimeSlicing and optionally spec.gpu.timeSlicing.partitionCount (default is 4).
  2. The module sets node.deckhouse.io/device-gpu.config=TimeSlicing; the device plugin applies the time-slicing config.
  3. The node still exposes nvidia.com/gpu, but with more “virtual” slots (by partitionCount).
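
A possible NodeGroup fragment for this mode is sketched below. The NodeGroup name is hypothetical and the other required fields are omitted; the partitionCount value simply repeats the default stated above.

```yaml
# NodeGroup fragment for time-slicing; only fields named in the text are shown.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu                  # hypothetical NodeGroup name
spec:
  gpu:
    sharing: TimeSlicing
    timeSlicing:
      partitionCount: 4      # optional; 4 is the documented default
```

Assuming the slot count scales linearly with partitionCount, a node with a single physical GPU would be expected to advertise four nvidia.com/gpu slots with this setting. Pods request them exactly as in Exclusive mode.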

Disable

  1. Switch to Exclusive or remove spec.gpu from the NodeGroup.
  2. On mode switch, node.deckhouse.io/device-gpu.config changes and GFD/device plugin reload configs; on full GPU disable follow Exclusive → Disable.

MIG mode (for nvidia-operator mode)

Enable

  1. Set:
    • spec.gpu.sharing: MIG
    • spec.gpu.mig.partedConfig: <profile name> (e.g., all-1g.5gb)
  2. The module sets node.deckhouse.io/device-gpu.config=MIG and nvidia.com/mig.config=<profile>.
  3. Once GFD/NFD publishes nvidia.com/mig.capable=true (GPU supports MIG), nvidia-mig-manager is scheduled to the node (it runs as a DaemonSet and reacts to changes of the nvidia.com/mig.config label).
  4. When it needs to reconfigure MIG, it:
    • sets nvidia.com/mig.config.state=pending;
    • “pauses” GPU clients by setting deployment labels to paused-for-mig-change:
      • nvidia.com/gpu.deploy.device-plugin
      • nvidia.com/gpu.deploy.gpu-feature-discovery
      • nvidia.com/gpu.deploy.dcgm-exporter
      • nvidia.com/gpu.deploy.dcgm
      • nvidia.com/gpu.deploy.nvsm
    • sets taint mig-reconfigure=true:NoSchedule;
    • waits for update.node.deckhouse.io/disruption-approved (or for a drain that has already started, indicated by update.node.deckhouse.io/draining or update.node.deckhouse.io/drained), then sets update.node.deckhouse.io/draining=bashible and waits for update.node.deckhouse.io/drained;
    • deletes (and waits for shutdown of) GPU client pods on the node: device plugin, GFD, DCGM Exporter, DCGM, plus validators (cuda/plugin);
    • applies the selected MIG profile (may set nvidia.com/mig.config.state=rebooting and reboot if needed);
    • finishes with nvidia.com/mig.config.state=success (or failed), removes the mig-reconfigure taint, runs uncordon, removes the update.node.deckhouse.io/drained and update.node.deckhouse.io/disruption-approved annotations, restores nvidia.com/gpu.deploy.* to true, and returns the node to service.
  5. After applying the profile, the cluster exposes resources like nvidia.com/mig-<profile> (e.g., nvidia.com/mig-1g.5gb).
  6. If the nvidia-mig-manager Pod is restarted or removed, the preStop hook waits for an active operation to finish (/processing file), then runs uncordon, removes the mig-reconfigure=true:NoSchedule taint, and removes the update.node.deckhouse.io/drained and update.node.deckhouse.io/disruption-approved annotations (best-effort; this may not finish on forced termination).
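
As a sketch of steps 1 and 5 above, the NodeGroup settings and a workload that consumes a MIG slice might look as follows. The NodeGroup name, the Pod name, and the image are hypothetical, and other required NodeGroup fields are omitted; the profile all-1g.5gb and the resource nvidia.com/mig-1g.5gb are the examples used in the text.

```yaml
# NodeGroup fragment enabling MIG with the example profile from the text.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-mig              # hypothetical NodeGroup name
spec:
  gpu:
    sharing: MIG
    mig:
      partedConfig: all-1g.5gb
---
# Pod requesting one MIG slice once the profile has been applied.
apiVersion: v1
kind: Pod
metadata:
  name: mig-test             # hypothetical Pod name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: registry.example.com/cuda-sample:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```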

Disable (MIG → non-MIG, or full GPU disable)

  1. When switching away from MIG (Exclusive/TimeSlicing) or removing spec.gpu, the module sets nvidia.com/mig.config=all-disabled to roll back MIG on the node.
  2. nvidia-mig-manager applies all-disabled in the same way (taint + drain + operation). If nvidia.com/mig.capable is already missing (e.g., after GPU/NFD labels were removed), the manager is still scheduled while nvidia.com/mig.config=all-disabled exists to finish the rollback and remove the taint.
  3. After successful MIG disable (nvidia.com/mig.config.state=success), the script removes nvidia.com/mig.config and nvidia.com/mig.config.state if the node is no longer in MIG mode or GPU is disabled, so the manager does not “hang” on the node waiting for a label change.
  4. The manager does not run on GPUs without MIG support (nvidia.com/mig.capable=false); use Exclusive or TimeSlicing for such nodes.
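
For orientation, a node in the middle of a MIG rollback might carry metadata along these lines. This is an illustrative snapshot assembled from the labels, annotation, and taint described above, not captured output; the node name is hypothetical and only one of the nvidia.com/gpu.deploy.* labels is shown.

```yaml
# Illustrative Node fragment while nvidia-mig-manager applies all-disabled.
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-0                                            # hypothetical node name
  labels:
    nvidia.com/mig.config: all-disabled                         # set by the module hook
    nvidia.com/mig.config.state: pending                        # set by nvidia-mig-manager
    nvidia.com/gpu.deploy.device-plugin: paused-for-mig-change  # GPU client paused
  annotations:
    update.node.deckhouse.io/draining: bashible                 # drain requested
spec:
  taints:
    - key: mig-reconfigure
      value: "true"
      effect: NoSchedule
```

Once the operation succeeds, the taint and annotations are expected to disappear and the nvidia.com/mig.config labels to be cleaned up as described in step 3.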

DRA mode: control-plane feature gates

When dra.enabled is set to true, the NodeGroupConfiguration script gpu-dra-feature-gates.sh (weight 72) runs on all nodes and patches static pod manifests of the kube-scheduler and kube-apiserver to add --feature-gates=DynamicResourceAllocation=true. The patch is idempotent and version-aware (the script reads kubernetesVersion from cluster configuration and applies only the gates required for that version).

The script modifies /etc/kubernetes/manifests/kube-scheduler.yaml and related control-plane manifests directly on the node. kube-scheduler and kube-apiserver are automatically restarted by kubelet after the manifest changes. This happens once when DRA mode is first enabled.
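
After patching, the relevant part of the scheduler manifest would be expected to look roughly like the fragment below. This is a sketch: the real manifest contains many more flags and fields, and how the script handles an existing --feature-gates flag is an assumption not covered by the text.

```yaml
# Fragment of /etc/kubernetes/manifests/kube-scheduler.yaml (other flags and fields omitted).
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      command:
        - kube-scheduler
        - --feature-gates=DynamicResourceAllocation=true   # added by gpu-dra-feature-gates.sh
```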

DRA mode: VFIO support

In DRA mode two additional NodeGroupConfiguration scripts prepare nodes for VFIO-based GPU passthrough:

  • gpu-vfio.sh (weight 40) — loads vfio and vfio_iommu_type1 kernel modules and sets the label node.deckhouse.io/gpu-vfio-ready="" when IOMMU groups are present.
  • gpu-vfio-iommu.yaml — configures IOMMU kernel parameters.

Full GPU passthrough requires IOMMU to be enabled in BIOS and in the kernel command line (for example intel_iommu=on or amd_iommu=on). This change requires a node reboot. Without IOMMU, the VFIO modules are loaded but the gpu-vfio-ready label is not set and passthrough workloads are not admitted.

DRA mode: pre-delete hook

When the module is disabled or uninstalled while dra.enabled: true, a Helm pre-delete hook Job runs in namespace d8-nvidia-gpu. It deletes all physicalgpus.gpu.deckhouse.io custom resources cluster-wide and waits up to 600 s for their removal. This ensures gpu-controller finalizers are resolved before the CRD is deleted.

If the Job times out (for example, a finalizer is stuck), inspect the remaining PhysicalGPU objects:

d8 k get physicalgpus -A