The module lifecycle stage: General Availability
The module has requirements for installation
The module operates in two modes depending on the dra.enabled configuration parameter: Device Plugin mode (default) and DRA mode (experimental).
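As a minimal sketch, the mode could be selected through the module settings as shown below. The ModuleConfig name `nvidia-gpu` is an assumption inferred from the `d8-nvidia-gpu` namespace, and the settings version is also assumed; only the `dra.enabled` parameter comes from the description above.

```yaml
# Sketch only: the ModuleConfig name and settings version are assumptions.
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: nvidia-gpu        # assumed module name, inferred from the d8-nvidia-gpu namespace
spec:
  enabled: true
  version: 1              # assumed settings schema version
  settings:
    dra:
      enabled: false      # default: Device Plugin mode; true selects the experimental DRA mode
```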
Device Plugin mode
In this mode dra.enabled is set to false (default).
Deployed into namespace d8-nvidia-gpu:
| Component | Kind | What it does |
|---|---|---|
| `node-feature-discovery-master` | Deployment | Collects hardware feature labels from workers and publishes them as node labels (`feature.node.kubernetes.io/*`, `nvidia.com/*`). |
| `node-feature-discovery-gc` | Deployment | Removes stale NFD labels from deleted nodes. |
| `node-feature-discovery-worker` | DaemonSet | Runs on every GPU node, detects PCI/USB devices and writes feature files for NFD master. |
| `gpu-feature-discovery-<ng>` | DaemonSet | Per-NodeGroup GFD: queries the GPU driver via NVML, writes detailed GPU feature files consumed by NFD master (`nvidia.com/gpu.count`, `nvidia.com/mig.capable`, etc.). |
| `nvidia-device-plugin-<ng>` | DaemonSet | Per-NodeGroup device plugin: exposes `nvidia.com/gpu` (Exclusive/TimeSlicing) or `nvidia.com/mig-<profile>` (MIG) resources to the Kubernetes scheduler. |
| `nvidia-mig-manager` | DaemonSet | Manages MIG reconfiguration on A100/H100 nodes: drains the node, applies the MIG profile, reboots if required, then returns the node to service. |
| `nvidia-dcgm` | DaemonSet | NVIDIA DCGM daemon: collects raw GPU telemetry (health, ECC, power, utilisation). |
| `nvidia-dcgm-exporter` | DaemonSet | Converts DCGM metrics into Prometheus format; scraped by the in-cluster Prometheus. |
DRA mode
In this mode dra.enabled is set to true.
DRA mode is experimental. Do not enable it in production clusters.
Deployed into namespace d8-nvidia-gpu:
| Component | Kind | What it does |
|---|---|---|
| `gpu-controller` | Deployment | Central DRA controller: implements the ResourceClaim / ResourceSlice Kubernetes API, manages GPU allocation across the cluster, runs admission webhooks for DRA objects. Supports HA (up to 3 replicas). |
| `gpu-node-agent` | DaemonSet | Runs on every managed node where a supported GPU is detected. |
| `nvidia-adapter` | DaemonSet | Runs on every managed node where a supported GPU is detected. |
| `gpu-dcgm` | DaemonSet | DCGM daemon (same role as in Device Plugin mode, integrated into the DRA stack). |
| `gpu-dcgm-exporter` | DaemonSet | Prometheus exporter for DCGM metrics (same role as in Device Plugin mode). |
Shared infrastructure (both modes)
| Component | Namespace | What it does |
|---|---|---|
| NodeGroupConfiguration scripts (Device Plugin mode) | node | `gpu-check.sh` validates the driver; `gpu-runtime.sh` configures containerd with `default_runtime=nvidia`; `gpu-sysctl.sh` applies kernel parameters and sets `node.deckhouse.io/gpu-setup-complete` when ready. |
| NodeGroupConfiguration scripts (DRA mode) | node | `gpu-setup.sh` verifies driver and VFIO readiness, sets `node.deckhouse.io/gpu-setup-complete`; `gpu-dra-feature-gates.sh` patches control-plane static pod manifests to enable the `DynamicResourceAllocation` feature gate; `gpu-vfio.sh` loads VFIO kernel modules and sets `node.deckhouse.io/gpu-vfio-ready`; `gpu-vfio-iommu.yaml` configures IOMMU kernel parameters. |
| `physicalgpus.gpu.deckhouse.io` CRD | cluster | DRA mode: represents a physical GPU discovered by `gpu-node-agent`; used by `gpu-controller` for allocation decisions. |
How it works
Below is the full sequence and the key labels/taints involved.
Key labels (who sets them and why)
- `node.deckhouse.io/gpu=""`: set by the module hook on GPU NodeGroup nodes; together with `node.deckhouse.io/gpu-setup-complete=""` it participates in scheduling “global” stack DaemonSets (NFD worker, DCGM/Exporter, etc.).
- `node.deckhouse.io/device-gpu.config=<Exclusive|TimeSlicing|MIG>`: set by the hook; used by config-manager in GFD and the device plugin to select the right config.
- `node.deckhouse.io/gpu-setup-complete=""`: set by `gpu-sysctl.sh` (Device Plugin mode) or `gpu-setup.sh` (DRA mode) after local checks/config; until it appears, the stack (NFD worker, GFD, device plugin, DCGM/Exporter) is not scheduled onto the node.
- `feature.node.kubernetes.io/pci-*.present=true`: published by NFD; used (together with the NodeGroup selector) to schedule GFD onto nodes with NVIDIA PCI devices.
- `nvidia.com/*` (for example `nvidia.com/gpu.count`, `nvidia.com/mig.capable=true|false`): published by NFD from feature files generated by GFD (NFD master is allowed to write into `nvidia.com`). The condition `nvidia.com/gpu.count>0` participates in scheduling the device plugin.
- `nvidia.com/mig.config=<profile|all-disabled>`: set by the hook; holds the desired MIG profile (e.g., `all-1g.5gb`) or `all-disabled` for MIG rollback/disable.
- `nvidia.com/mig.config.state=<pending|rebooting|success|failed>`: set by `nvidia-mig-manager` during reconfiguration.
- `node.deckhouse.io/gpu-vfio-ready=""`: set by `gpu-vfio.sh` (DRA mode) when VFIO kernel modules are loaded and IOMMU groups are available; used as a readiness gate for VFIO-based GPU passthrough workloads.
- taint `mig-reconfigure=true:NoSchedule`: set/removed by `nvidia-mig-manager` while the operation is running.
- annotations `update.node.deckhouse.io/disruption-approved`, `update.node.deckhouse.io/draining`, `update.node.deckhouse.io/drained`: used for controlled node drain during MIG reconfiguration.
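To make the label set concrete, here is an illustrative sketch of how the metadata of a GPU node might look once a MIG profile has been applied successfully. The node name and the `nvidia.com/gpu.count` value are hypothetical, and a real node carries many more labels (including the `feature.node.kubernetes.io/pci-*` ones, omitted here).

```yaml
# Illustrative only: hypothetical node, most labels omitted.
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-0                          # hypothetical node name
  labels:
    node.deckhouse.io/gpu: ""                 # set by the module hook
    node.deckhouse.io/device-gpu.config: MIG  # set by the module hook
    node.deckhouse.io/gpu-setup-complete: ""  # set by gpu-sysctl.sh (Device Plugin mode)
    nvidia.com/gpu.count: "1"                 # example value, published by NFD from GFD feature files
    nvidia.com/mig.capable: "true"            # published by NFD from GFD feature files
    nvidia.com/mig.config: all-1g.5gb         # desired MIG profile, set by the module hook
    nvidia.com/mig.config.state: success      # set by nvidia-mig-manager
```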
Common flow (for nvidia-operator mode)
- You set/change `spec.gpu` in the target NodeGroup.
- Helm deploys (or removes) components in `d8-nvidia-gpu`:
  - GFD and device plugin as per-NodeGroup DaemonSets;
  - NFD master/gc on master nodes; NFD worker/DCGM/Exporter on GPU nodes.
- The module hook updates node labels:
  - adds/updates `node.deckhouse.io/gpu=""` and `node.deckhouse.io/device-gpu.config=...`;
  - for MIG it adds/updates `nvidia.com/mig.config=...`;
  - on GPU disable it removes `node.deckhouse.io/gpu` and `node.deckhouse.io/device-gpu.config`; if the node had `nvidia.com/mig.config` it sets it to `all-disabled` to trigger MIG rollback.
- NodeGroupConfiguration scripts run on the node (by weight): `gpu-check.sh` → `gpu-runtime.sh` → `gpu-sysctl.sh`.
  - on success the node gets `node.deckhouse.io/gpu-setup-complete=""`;
  - on cleanup/errors the label is cleared, the runtime drop-in is removed, and sysctl is restored.
- After that, the stack “converges” on the node:
  - NFD publishes `feature.node.kubernetes.io/pci-*` and serves custom feature files;
  - GFD writes GPU feature files into `/etc/kubernetes/node-feature-discovery/features.d`, and NFD publishes them as `nvidia.com/*`;
  - the device plugin exposes resources (depending on the mode), DCGM/Exporter start exposing metrics.
Exclusive mode (for nvidia-operator mode)
Enable
- Set `spec.gpu.sharing: Exclusive`.
- The module sets `node.deckhouse.io/device-gpu.config=Exclusive`; after `node.deckhouse.io/gpu-setup-complete` is present, it brings up GFD and the device plugin.
- The node exposes `nvidia.com/gpu`; each Pod gets a full GPU.
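The enable steps might look like the sketch below: a NodeGroup with Exclusive sharing and a Pod that requests one full GPU. The resource names, `nodeType`, and the container image are hypothetical placeholders; only `spec.gpu.sharing` and the `nvidia.com/gpu` resource name come from the text above.

```yaml
# Sketch: names, nodeType, and the image are hypothetical.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-exclusive                # hypothetical NodeGroup name
spec:
  nodeType: Static                   # assumption; use the type that matches your nodes
  gpu:
    sharing: Exclusive
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-job                     # hypothetical
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: registry.example.com/cuda-app:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1          # one whole GPU per Pod in Exclusive mode
```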
Disable
- Remove `spec.gpu` from the NodeGroup (or move nodes to a non-GPU NodeGroup).
- The module removes `node.deckhouse.io/gpu` and `node.deckhouse.io/device-gpu.config` from the node; NodeGroupConfiguration clears `node.deckhouse.io/gpu-setup-complete` and rolls back runtime/sysctl changes.
- GPU stack DaemonSets stop being scheduled onto the node; NFD GC eventually cleans up `nvidia.com/*` labels.
TimeSlicing mode (for nvidia-operator mode)
Enable
- Set `spec.gpu.sharing: TimeSlicing` and optionally `spec.gpu.timeSlicing.partitionCount` (default is 4).
- The module sets `node.deckhouse.io/device-gpu.config=TimeSlicing`; the device plugin applies the time-slicing config.
- The node still exposes `nvidia.com/gpu`, but with more “virtual” slots (by `partitionCount`).
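A minimal sketch of the corresponding NodeGroup follows; the NodeGroup name and `nodeType` are hypothetical. Going by the description above, Pods keep requesting `nvidia.com/gpu`, and a node would be expected to advertise more of these slots according to `partitionCount`.

```yaml
# Sketch: the NodeGroup name and nodeType are hypothetical.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-timeslicing              # hypothetical NodeGroup name
spec:
  nodeType: Static                   # assumption; use the type that matches your nodes
  gpu:
    sharing: TimeSlicing
    timeSlicing:
      partitionCount: 4              # optional; 4 is the documented default
```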
Disable
- Switch to `Exclusive` or remove `spec.gpu` from the NodeGroup.
- On mode switch, `node.deckhouse.io/device-gpu.config` changes and GFD/device plugin reload their configs; for a full GPU disable, follow the steps in Exclusive mode → Disable.
MIG mode (for nvidia-operator mode)
Enable
- Set:
  - `spec.gpu.sharing: MIG`
  - `spec.gpu.mig.partedConfig: <profile name>` (e.g., `all-1g.5gb`)
- The module sets `node.deckhouse.io/device-gpu.config=MIG` and `nvidia.com/mig.config=<profile>`.
- Once GFD/NFD publishes `nvidia.com/mig.capable=true` (GPU supports MIG), `nvidia-mig-manager` is scheduled to the node (it runs as a DaemonSet and reacts to changes of the `nvidia.com/mig.config` label).
- When it needs to reconfigure MIG, it:
  - sets `nvidia.com/mig.config.state=pending`;
  - “pauses” GPU clients by setting deployment labels to `paused-for-mig-change`:
    - `nvidia.com/gpu.deploy.device-plugin`
    - `nvidia.com/gpu.deploy.gpu-feature-discovery`
    - `nvidia.com/gpu.deploy.dcgm-exporter`
    - `nvidia.com/gpu.deploy.dcgm`
    - `nvidia.com/gpu.deploy.nvsm`
  - sets taint `mig-reconfigure=true:NoSchedule`;
  - waits for `update.node.deckhouse.io/disruption-approved` (or an already started drain via `update.node.deckhouse.io/draining` / `update.node.deckhouse.io/drained`), then sets `update.node.deckhouse.io/draining=bashible` and waits for `update.node.deckhouse.io/drained`;
  - deletes (and waits for shutdown of) GPU client pods on the node: device plugin, GFD, DCGM Exporter, DCGM, plus validators (cuda/plugin);
  - applies the selected MIG profile (may set `nvidia.com/mig.config.state=rebooting` and reboot if needed);
  - finishes with `nvidia.com/mig.config.state=success` (or `failed`), removes the `mig-reconfigure` taint, runs `uncordon`, removes `update.node.deckhouse.io/drained` / `update.node.deckhouse.io/disruption-approved`, restores `nvidia.com/gpu.deploy.*` to `true`, and returns the node to service.
- After applying the profile, the node exposes resources like `nvidia.com/mig-<profile>` (e.g., `nvidia.com/mig-1g.5gb`).
- If the `nvidia-mig-manager` Pod is restarted/removed, the `preStop` hook waits for an active operation to finish (`/processing` file), then runs `uncordon`, removes the `mig-reconfigure=true:NoSchedule` taint, and removes `update.node.deckhouse.io/drained` / `update.node.deckhouse.io/disruption-approved` (best-effort, may not finish on forced termination).
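As a sketch of the enable steps, a NodeGroup using the `all-1g.5gb` profile and a Pod that requests one of the resulting MIG slices could look like this. The object names, `nodeType`, and the container image are hypothetical; the `spec.gpu` fields and the resource name follow the examples in the list above.

```yaml
# Sketch: names, nodeType, and the image are hypothetical.
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-mig                      # hypothetical NodeGroup name
spec:
  nodeType: Static                   # assumption; use the type that matches your nodes
  gpu:
    sharing: MIG
    mig:
      partedConfig: all-1g.5gb       # MIG profile name from the example above
---
apiVersion: v1
kind: Pod
metadata:
  name: mig-job                      # hypothetical
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: registry.example.com/cuda-app:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one MIG slice, exposed after the profile is applied
```

A Pod like this stays Pending until the node advertises the `nvidia.com/mig-1g.5gb` resource, which happens only after the reconfiguration finishes.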
Disable (MIG → non-MIG, or full GPU disable)
- When switching away from MIG (to `Exclusive`/`TimeSlicing`) or removing `spec.gpu`, the module sets `nvidia.com/mig.config=all-disabled` to roll back MIG on the node.
- `nvidia-mig-manager` applies `all-disabled` in the same way (taint + drain + operation). If `nvidia.com/mig.capable` is already missing (e.g., after GPU/NFD labels were removed), the manager is still scheduled while `nvidia.com/mig.config=all-disabled` exists to finish the rollback and remove the taint.
- After successful MIG disable (`nvidia.com/mig.config.state=success`), the script removes `nvidia.com/mig.config` and `nvidia.com/mig.config.state` if the node is no longer in MIG mode or GPU is disabled, so the manager does not “hang” on the node waiting for a label change.
- The manager does not run on GPUs without MIG support (`nvidia.com/mig.capable=false`); use `Exclusive` or `TimeSlicing` for such nodes.
DRA mode: control-plane feature gates
When dra.enabled is set to true, the NodeGroupConfiguration script gpu-dra-feature-gates.sh (weight 72) runs on all nodes and patches static pod manifests of the kube-scheduler and kube-apiserver to add --feature-gates=DynamicResourceAllocation=true. The patch is idempotent and version-aware (the script reads kubernetesVersion from cluster configuration and applies only the gates required for that version).
The script modifies /etc/kubernetes/manifests/kube-scheduler.yaml and related control-plane manifests directly on the node. kube-scheduler and kube-apiserver are automatically restarted by kubelet after the manifest changes. This happens once when DRA mode is first enabled.
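For orientation, the relevant part of a patched manifest might look like the fragment below. This is a sketch assuming a conventional static pod layout; all other fields and flags are omitted, and the exact structure on a real node may differ.

```yaml
# Fragment of /etc/kubernetes/manifests/kube-scheduler.yaml after the patch (sketch).
# Other fields and flags are omitted; the layout on a real node may differ.
spec:
  containers:
    - name: kube-scheduler
      command:
        - kube-scheduler
        # ... existing flags stay unchanged ...
        - --feature-gates=DynamicResourceAllocation=true
```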
DRA mode: VFIO support
In DRA mode two additional NodeGroupConfiguration scripts prepare nodes for VFIO-based GPU passthrough:
- `gpu-vfio.sh` (weight 40): loads the `vfio` and `vfio_iommu_type1` kernel modules and sets the label `node.deckhouse.io/gpu-vfio-ready=""` when IOMMU groups are present.
- `gpu-vfio-iommu.yaml`: configures IOMMU kernel parameters.
Full GPU passthrough requires IOMMU to be enabled in BIOS and in the kernel command line (for example intel_iommu=on or amd_iommu=on). This change requires a node reboot. Without IOMMU, the VFIO modules are loaded but the gpu-vfio-ready label is not set and passthrough workloads are not admitted.
DRA mode: pre-delete hook
When the module is disabled or uninstalled while dra.enabled: true, a Helm pre-delete hook Job runs in namespace d8-nvidia-gpu. It deletes all physicalgpus.gpu.deckhouse.io custom resources cluster-wide and waits up to 600 s for their removal. This ensures gpu-controller finalizers are resolved before the CRD is deleted.
If the Job times out (for example, a finalizer is stuck), inspect the remaining PhysicalGPU objects:
d8 k get physicalgpus -A