Architecture

The module lifecycle stage: General Availability
The module has requirements for installation

The module operates in two modes depending on the dra.enabled configuration parameter: Device Plugin mode (default) and DRA mode (experimental).

Device Plugin mode

In this mode dra.enabled is set to false (default).

Deployed into namespace d8-nvidia-gpu:

Component	Kind	What it does
`node-feature-discovery-master`	Deployment	Collects hardware feature labels from workers and publishes them as node labels (`feature.node.kubernetes.io/`, `nvidia.com/`).
`node-feature-discovery-gc`	Deployment	Removes stale NFD labels from deleted nodes.
`node-feature-discovery-worker`	DaemonSet	Runs on every GPU node, detects PCI/USB devices and writes feature files for NFD master.
`gpu-feature-discovery-<ng>`	DaemonSet	Per-NodeGroup GFD: queries the GPU driver via NVML, writes detailed GPU feature files consumed by NFD master (`nvidia.com/gpu.count`, `nvidia.com/mig.capable`, etc.).
`nvidia-device-plugin-<ng>`	DaemonSet	Per-NodeGroup device plugin: exposes `nvidia.com/gpu` (Exclusive/TimeSlicing) or `nvidia.com/mig-<profile>` (MIG) resources to the Kubernetes scheduler.
`nvidia-mig-manager`	DaemonSet	Manages MIG reconfiguration on A100/H100 nodes: drains the node, applies the MIG profile, reboots if required, then returns the node to service.
`nvidia-dcgm`	DaemonSet	NVIDIA DCGM daemon: collects raw GPU telemetry (health, ECC, power, utilisation).
`nvidia-dcgm-exporter`	DaemonSet	Converts DCGM metrics into Prometheus format; scraped by the in-cluster Prometheus.

DRA mode

In this mode dra.enabled is set to true.

DRA mode is experimental. Do not enable it in production clusters.

Deployed into namespace d8-nvidia-gpu:

Component	Kind	What it does
`gpu-controller`	Deployment	Central DRA controller: implements the `ResourceClaim` / `ResourceSlice` Kubernetes API, manages GPU allocation across the cluster, and runs admission webhooks for DRA objects. Supports HA (up to 3 replicas).
`gpu-node-agent`	DaemonSet	Runs on every managed node with supported GPUs detected.
`nvidia-adapter`	DaemonSet	Runs on every managed node with supported GPUs detected.
`gpu-dcgm`	DaemonSet	DCGM daemon (same role as in Device Plugin mode, integrated into the DRA stack).
`gpu-dcgm-exporter`	DaemonSet	Prometheus exporter for DCGM metrics (same role as in Device Plugin mode).
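
The two component tables above can be condensed into a lookup keyed by dra.enabled. This is only an illustrative summary, not module code: the function name `components_for` is hypothetical, and the per-NodeGroup DaemonSets keep the `<ng>` placeholder from the tables instead of real names.

```python
from typing import List

# Component names copied from the tables above; "<ng>" is the NodeGroup placeholder.
DEVICE_PLUGIN_COMPONENTS = [
    "node-feature-discovery-master",
    "node-feature-discovery-gc",
    "node-feature-discovery-worker",
    "gpu-feature-discovery-<ng>",
    "nvidia-device-plugin-<ng>",
    "nvidia-mig-manager",
    "nvidia-dcgm",
    "nvidia-dcgm-exporter",
]

DRA_COMPONENTS = [
    "gpu-controller",
    "gpu-node-agent",
    "nvidia-adapter",
    "gpu-dcgm",
    "gpu-dcgm-exporter",
]


def components_for(dra_enabled: bool = False) -> List[str]:
    """Return the components deployed into d8-nvidia-gpu for the given mode.

    Hypothetical helper: dra_enabled mirrors the dra.enabled parameter,
    which defaults to false (Device Plugin mode).
    """
    return list(DRA_COMPONENTS if dra_enabled else DEVICE_PLUGIN_COMPONENTS)
```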

Shared infrastructure (both modes)

Component	Scope	What it does
NodeGroupConfiguration scripts (Device Plugin mode)	node	`gpu-check.sh` validates the driver; `gpu-runtime.sh` configures containerd with `default_runtime=nvidia`; `gpu-sysctl.sh` applies kernel parameters and sets `node.deckhouse.io/gpu-setup-complete` when ready.
NodeGroupConfiguration scripts (DRA)	node	`gpu-setup.sh` verifies driver and VFIO readiness, sets `node.deckhouse.io/gpu-setup-complete`; `gpu-dra-feature-gates.sh` patches control-plane static pod manifests to enable the `DynamicResourceAllocation` feature gate; `gpu-vfio.sh` loads VFIO kernel modules and sets `node.deckhouse.io/gpu-vfio-ready`; `gpu-vfio-iommu.yaml` configures IOMMU kernel parameters.
`physicalgpus.gpu.deckhouse.io` CRD	cluster	DRA mode: represents a physical GPU discovered by `gpu-node-agent`; used by `gpu-controller` for allocation decisions.

How it works

Below is the full sequence and the key labels/taints involved.

Key labels (who sets them and why)

node.deckhouse.io/gpu="" — set by the module hook on GPU NodeGroup nodes; together with node.deckhouse.io/gpu-setup-complete="" it participates in scheduling “global” stack DaemonSets (NFD worker, DCGM/Exporter, etc.).
node.deckhouse.io/device-gpu.config=<Exclusive|TimeSlicing|MIG> — set by the hook; used by config-manager in GFD and the device plugin to select the right config.
node.deckhouse.io/gpu-setup-complete="" — set by gpu-sysctl.sh (Device Plugin mode) or gpu-setup.sh (DRA) after local checks/config; until it appears, the stack is not scheduled onto the node (NFD worker, GFD, device plugin, DCGM/Exporter).
feature.node.kubernetes.io/pci-*.present=true — published by NFD; used (together with the NodeGroup selector) to schedule GFD onto nodes with NVIDIA PCI devices.
nvidia.com/* (for example nvidia.com/gpu.count, nvidia.com/mig.capable=true|false) — published by NFD from feature files generated by GFD (NFD master is allowed to write labels in the nvidia.com namespace). A nvidia.com/gpu.count value greater than 0 participates in scheduling the device plugin.
nvidia.com/mig.config=<profile|all-disabled> — set by the hook: desired MIG profile (e.g., all-1g.5gb) or all-disabled for MIG rollback/disable.
nvidia.com/mig.config.state=<pending|rebooting|success|failed> — set by nvidia-mig-manager during reconfiguration.
node.deckhouse.io/gpu-vfio-ready="" — set by gpu-vfio.sh (DRA mode) when VFIO kernel modules are loaded and IOMMU groups are available; used as a readiness gate for VFIO-based GPU passthrough workloads.
taint mig-reconfigure=true:NoSchedule — set/removed by nvidia-mig-manager while the operation is running.
annotations update.node.deckhouse.io/disruption-approved, update.node.deckhouse.io/draining, update.node.deckhouse.io/drained — used for controlled node drain during MIG reconfiguration.
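
To make the scheduling gates above concrete, the following sketch models them as predicates over a node's label map. It is a simplification under stated assumptions: the real nodeSelector and affinity expressions are not shown in this document, the function names are hypothetical, and the NodeGroup selector that also applies to per-NodeGroup DaemonSets is left out.

```python
from typing import Dict

GPU_LABEL = "node.deckhouse.io/gpu"
SETUP_COMPLETE_LABEL = "node.deckhouse.io/gpu-setup-complete"
GPU_COUNT_LABEL = "nvidia.com/gpu.count"


def global_stack_schedulable(labels: Dict[str, str]) -> bool:
    """Gate for the "global" DaemonSets (NFD worker, DCGM/Exporter).

    Both labels must be present; their values are empty strings.
    """
    return GPU_LABEL in labels and SETUP_COMPLETE_LABEL in labels


def device_plugin_schedulable(labels: Dict[str, str]) -> bool:
    """Gate for the device plugin: setup complete and gpu.count > 0.

    Assumption: a missing or non-numeric gpu.count is treated as 0.
    """
    try:
        count = int(labels.get(GPU_COUNT_LABEL, "0"))
    except ValueError:
        count = 0
    return SETUP_COMPLETE_LABEL in labels and count > 0
```

For example, a node that has only node.deckhouse.io/gpu="" passes neither check until the NodeGroupConfiguration scripts set node.deckhouse.io/gpu-setup-complete.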

Common flow (for Device Plugin mode)

You set/change spec.gpu in the target NodeGroup.
Helm deploys (or removes) components in d8-nvidia-gpu:
- GFD and device plugin as per-NodeGroup DaemonSets;
- NFD master/gc on master nodes; NFD worker/DCGM/Exporter on GPU nodes.
The module hook updates node labels:
- adds/updates node.deckhouse.io/gpu="" and node.deckhouse.io/device-gpu.config=...;
- for MIG it adds/updates nvidia.com/mig.config=...;
- on GPU disable it removes node.deckhouse.io/gpu and node.deckhouse.io/device-gpu.config; if the node had nvidia.com/mig.config it sets it to all-disabled to trigger MIG rollback.
NodeGroupConfiguration scripts run on the node (by weight): gpu-check.sh → gpu-runtime.sh → gpu-sysctl.sh.
- on success the node gets node.deckhouse.io/gpu-setup-complete="";
- on cleanup/errors the label is cleared, the runtime drop-in is removed, and sysctl is restored.
After that, the stack “converges” on the node:
- NFD publishes feature.node.kubernetes.io/pci-* and serves custom feature files;
- GFD writes GPU feature files into /etc/kubernetes/node-feature-discovery/features.d, and NFD publishes them as nvidia.com/*;
- the device plugin exposes resources (depending on the mode), DCGM/Exporter start exposing metrics.
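
The label updates performed by the module hook can be modelled as a pure function from the current node labels and spec.gpu to the desired labels. This is a sketch, not the hook's actual code: the function name `desired_labels` and the dictionary shape used for spec.gpu are assumptions made for illustration. The rule that switching away from MIG sets all-disabled is taken from the MIG section of this page.

```python
from typing import Dict, Optional

GPU = "node.deckhouse.io/gpu"
CONFIG = "node.deckhouse.io/device-gpu.config"
MIG_CONFIG = "nvidia.com/mig.config"


def desired_labels(labels: Dict[str, str], gpu: Optional[dict]) -> Dict[str, str]:
    """Return the labels the hook would aim for.

    gpu is a dict shaped like spec.gpu (hypothetical representation),
    or None when spec.gpu is removed from the NodeGroup.
    """
    result = dict(labels)
    if gpu is None:
        # GPU disabled: drop the module labels, trigger MIG rollback if needed.
        result.pop(GPU, None)
        result.pop(CONFIG, None)
        if MIG_CONFIG in result:
            result[MIG_CONFIG] = "all-disabled"
        return result

    result[GPU] = ""
    result[CONFIG] = gpu["sharing"]
    if gpu["sharing"] == "MIG":
        result[MIG_CONFIG] = gpu["mig"]["partedConfig"]
    elif MIG_CONFIG in result:
        # Switching from MIG to Exclusive/TimeSlicing rolls MIG back.
        result[MIG_CONFIG] = "all-disabled"
    return result
```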

Exclusive mode (for Device Plugin mode)

Enable

Set spec.gpu.sharing: Exclusive.
The module sets node.deckhouse.io/device-gpu.config=Exclusive; after node.deckhouse.io/gpu-setup-complete is present, it brings up GFD and the device plugin.
The node exposes nvidia.com/gpu; each Pod gets a full GPU.

Disable

Remove spec.gpu from the NodeGroup (or move nodes to a non-GPU NodeGroup).
The module removes node.deckhouse.io/gpu and node.deckhouse.io/device-gpu.config from the node; NodeGroupConfiguration clears node.deckhouse.io/gpu-setup-complete and rolls back runtime/sysctl changes.
GPU stack DaemonSets stop being scheduled onto the node; NFD GC eventually cleans nvidia.com/* labels.

TimeSlicing mode (for Device Plugin mode)

Enable

Set spec.gpu.sharing: TimeSlicing and optionally spec.gpu.timeSlicing.partitionCount (default is 4).
The module sets node.deckhouse.io/device-gpu.config=TimeSlicing; the device plugin applies the time-slicing config.
The node still exposes nvidia.com/gpu, but with more “virtual” slots (by partitionCount).
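
As a rough capacity estimate, the number of nvidia.com/gpu slots can be computed from the physical GPU count and partitionCount. The sketch assumes each physical GPU is advertised partitionCount times, which is how time-slicing replicas generally behave in the NVIDIA device plugin; the function name is hypothetical. Time-slicing shares compute time only and does not isolate memory between Pods.

```python
def advertised_gpu_slots(physical_gpus: int, partition_count: int = 4) -> int:
    """Estimate how many nvidia.com/gpu resources a node advertises.

    Assumption: every physical GPU is replicated partition_count times.
    The default of 4 matches the documented partitionCount default.
    """
    if physical_gpus < 0:
        raise ValueError("physical_gpus must be non-negative")
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    return physical_gpus * partition_count
```

Under this assumption a node with 2 GPUs and the default partitionCount would advertise 8 slots.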

Disable

Switch to Exclusive or remove spec.gpu from the NodeGroup.
On a mode switch, node.deckhouse.io/device-gpu.config changes and GFD and the device plugin reload their configs. To disable GPU entirely, follow the Disable steps for Exclusive mode.

MIG mode (for Device Plugin mode)

Enable

Set:
- spec.gpu.sharing: MIG
- spec.gpu.mig.partedConfig: <profile name> (e.g., all-1g.5gb)
The module sets node.deckhouse.io/device-gpu.config=MIG and nvidia.com/mig.config=<profile>.
Once GFD and NFD publish nvidia.com/mig.capable=true (the GPU supports MIG), nvidia-mig-manager is scheduled onto the node (it runs as a DaemonSet and reacts to changes of the nvidia.com/mig.config label).
When it needs to reconfigure MIG, it:
- sets nvidia.com/mig.config.state=pending;
- “pauses” GPU clients by setting deployment labels to paused-for-mig-change:
  - nvidia.com/gpu.deploy.device-plugin
  - nvidia.com/gpu.deploy.gpu-feature-discovery
  - nvidia.com/gpu.deploy.dcgm-exporter
  - nvidia.com/gpu.deploy.dcgm
  - nvidia.com/gpu.deploy.nvsm
- sets taint mig-reconfigure=true:NoSchedule;
- waits for update.node.deckhouse.io/disruption-approved (or an already started drain via update.node.deckhouse.io/draining/update.node.deckhouse.io/drained), then sets update.node.deckhouse.io/draining=bashible and waits for update.node.deckhouse.io/drained;
- deletes (and waits for shutdown of) GPU client pods on the node: device plugin, GFD, DCGM Exporter, DCGM, plus validators (cuda/plugin);
- applies the selected MIG profile (may set nvidia.com/mig.config.state=rebooting and reboot if needed);
- finishes with nvidia.com/mig.config.state=success (or failed), removes the mig-reconfigure taint, runs uncordon, removes update.node.deckhouse.io/drained/update.node.deckhouse.io/disruption-approved, restores nvidia.com/gpu.deploy.* to true, and returns the node to service.
After applying the profile, the cluster exposes resources like nvidia.com/mig-<profile> (e.g., nvidia.com/mig-1g.5gb).
If the nvidia-mig-manager Pod is restarted/removed, the preStop hook waits for an active operation to finish (/processing file), then runs uncordon, removes the mig-reconfigure=true:NoSchedule taint, and removes update.node.deckhouse.io/drained/update.node.deckhouse.io/disruption-approved (best-effort, may not finish on forced termination).
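
The nvidia.com/mig.config.state values described above can be read as a small state machine. The transition table below is an assumption inferred from the step list (in particular, the re-entry from success or failed into pending on the next label change is not stated explicitly), and both helper names are hypothetical. The second helper only builds the resource name pattern nvidia.com/mig-<profile> mentioned above.

```python
# Assumed transitions between nvidia.com/mig.config.state values.
ALLOWED_TRANSITIONS = {
    "pending": {"rebooting", "success", "failed"},
    "rebooting": {"success", "failed"},
    "success": {"pending"},  # a new nvidia.com/mig.config value restarts the cycle
    "failed": {"pending"},
}


def is_valid_transition(current: str, new: str) -> bool:
    """Check whether a state change matches the assumed table."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def mig_resource_name(profile: str) -> str:
    """Build the extended resource name for a MIG profile such as 1g.5gb."""
    return "nvidia.com/mig-" + profile
```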

Disable (MIG → non-MIG, or full GPU disable)

When switching away from MIG (Exclusive/TimeSlicing) or removing spec.gpu, the module sets nvidia.com/mig.config=all-disabled to roll back MIG on the node.
nvidia-mig-manager applies all-disabled in the same way (taint + drain + operation). If nvidia.com/mig.capable is already missing (e.g., after GPU/NFD labels were removed), the manager is still scheduled while nvidia.com/mig.config=all-disabled exists to finish the rollback and remove the taint.
After successful MIG disable (nvidia.com/mig.config.state=success), the script removes nvidia.com/mig.config and nvidia.com/mig.config.state if the node is no longer in MIG mode or GPU is disabled, so the manager does not “hang” on the node waiting for a label change.
The manager does not run on GPUs without MIG support (nvidia.com/mig.capable=false); use Exclusive or TimeSlicing for such nodes.
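
The scheduling and cleanup rules for MIG rollback can be summarised as two predicates over node labels. This is a sketch of the rules as described here, not the module's implementation; the function names are hypothetical, and "no longer in MIG mode" is approximated by checking node.deckhouse.io/device-gpu.config.

```python
from typing import Dict

MIG_CAPABLE = "nvidia.com/mig.capable"
MIG_CONFIG = "nvidia.com/mig.config"
MIG_STATE = "nvidia.com/mig.config.state"
DEVICE_CONFIG = "node.deckhouse.io/device-gpu.config"


def mig_manager_scheduled(labels: Dict[str, str]) -> bool:
    """The manager runs on MIG-capable nodes, or while a rollback is pending."""
    return (
        labels.get(MIG_CAPABLE) == "true"
        or labels.get(MIG_CONFIG) == "all-disabled"
    )


def should_remove_mig_labels(labels: Dict[str, str]) -> bool:
    """True once the rollback succeeded and the node is not in MIG mode.

    A node with GPU disabled has no device-gpu.config label, so it also
    counts as "not in MIG mode" here.
    """
    return (
        labels.get(MIG_CONFIG) == "all-disabled"
        and labels.get(MIG_STATE) == "success"
        and labels.get(DEVICE_CONFIG) != "MIG"
    )
```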

DRA mode: control-plane feature gates

When dra.enabled is set to true, the NodeGroupConfiguration script gpu-dra-feature-gates.sh (weight 72) runs on all nodes and patches static pod manifests of the kube-scheduler and kube-apiserver to add --feature-gates=DynamicResourceAllocation=true. The patch is idempotent and version-aware (the script reads kubernetesVersion from cluster configuration and applies only the gates required for that version).

The script modifies /etc/kubernetes/manifests/kube-scheduler.yaml and related control-plane manifests directly on the node. kube-scheduler and kube-apiserver are automatically restarted by kubelet after the manifest changes. This happens once when DRA mode is first enabled.
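
The idempotent behaviour of the patch can be illustrated on a plain list of container arguments. The real gpu-dra-feature-gates.sh is a shell script whose contents are not shown here, so the Python function below is only a model of the idea; its name and signature are hypothetical, and it does not handle the version-aware gate selection.

```python
from typing import List

PREFIX = "--feature-gates="


def ensure_feature_gate(args: List[str], gate: str, enabled: bool = True) -> List[str]:
    """Return args with gate set in --feature-gates, leaving other gates intact.

    Applying the function to its own output returns the same list.
    """
    wanted = "{}={}".format(gate, "true" if enabled else "false")
    result = []
    patched = False
    for arg in args:
        if arg.startswith(PREFIX) and not patched:
            gates = [g for g in arg[len(PREFIX):].split(",") if g]
            names = [g.split("=", 1)[0] for g in gates]
            if gate in names:
                # Replace the existing entry in place to keep the order stable.
                gates[names.index(gate)] = wanted
            else:
                gates.append(wanted)
            result.append(PREFIX + ",".join(gates))
            patched = True
        else:
            result.append(arg)
    if not patched:
        result.append(PREFIX + wanted)
    return result
```

Keeping the existing gate order means a second run produces no diff, so kubelet has no reason to restart the static pod again.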

DRA mode: VFIO support

In DRA mode two additional NodeGroupConfiguration scripts prepare nodes for VFIO-based GPU passthrough:

gpu-vfio.sh (weight 40) — loads vfio and vfio_iommu_type1 kernel modules and sets the label node.deckhouse.io/gpu-vfio-ready="" when IOMMU groups are present.
gpu-vfio-iommu.yaml — configures IOMMU kernel parameters.

Full GPU passthrough requires IOMMU to be enabled in BIOS and in the kernel command line (for example intel_iommu=on or amd_iommu=on). This change requires a node reboot. Without IOMMU, the VFIO modules are loaded but the gpu-vfio-ready label is not set and passthrough workloads are not admitted.
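
When the gpu-vfio-ready label is missing, the two preconditions can be checked by hand. The sketch below assumes the standard Linux sysfs location /sys/kernel/iommu_groups and looks only for the two example kernel parameters named above; how gpu-vfio.sh itself performs the check is not shown in this document, and the function names are hypothetical.

```python
import os


def iommu_enabled_in_cmdline(cmdline: str) -> bool:
    """Check a kernel command line for intel_iommu=on or amd_iommu=on."""
    params = cmdline.split()
    return "intel_iommu=on" in params or "amd_iommu=on" in params


def iommu_groups_present(root: str = "/sys/kernel/iommu_groups") -> bool:
    """Return True if the IOMMU groups directory has at least one entry."""
    try:
        with os.scandir(root) as entries:
            return any(True for _ in entries)
    except OSError:
        # Directory is missing or unreadable: treat as "no IOMMU groups".
        return False
```

On a node, the command line can be read from /proc/cmdline and passed to `iommu_enabled_in_cmdline`.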

DRA mode: pre-delete hook

When the module is disabled or uninstalled while dra.enabled: true, a Helm pre-delete hook Job runs in namespace d8-nvidia-gpu. It deletes all physicalgpus.gpu.deckhouse.io custom resources cluster-wide and waits up to 600 s for their removal. This ensures gpu-controller finalizers are resolved before the CRD is deleted.

If the Job times out (for example, a finalizer is stuck), inspect the remaining PhysicalGPU objects:

d8 k get physicalgpus -A
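
The wait-with-timeout behaviour of the pre-delete hook can be sketched as a polling loop. The Job's real implementation is not shown here: the function name, the 5-second polling interval, and the `list_remaining` callback (which would wrap a call such as the command above) are all assumptions for illustration. Only the 600 s limit comes from this page.

```python
import time
from typing import Callable, List


def wait_until_gone(
    list_remaining: Callable[[], List[str]],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll until no objects remain or the timeout expires.

    Returns an empty list on success, or the names still present
    (for example, objects with a stuck finalizer) on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        remaining = list_remaining()
        if not remaining or clock() >= deadline:
            return remaining
        sleep(interval_s)
```

A non-empty return value corresponds to the timeout case described above, where the remaining PhysicalGPU objects need manual inspection.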
