Available in: EE
The module lifecycle stage: General Availability
The module has requirements for installation
GPU module
The module brings the NVIDIA stack to Deckhouse Kubernetes Platform for GPU workloads: NFD/GFD, device plugin (Exclusive/TimeSlicing/MIG), MIG manager, and DCGM/Exporter with Grafana dashboards.
Prerequisites
- The NVIDIA driver and NVIDIA Container Toolkit are installed on the target nodes (the containerd runtime is configured by this module via NodeGroupConfiguration).
- `spec.gpu` is set in the target NodeGroup (`sharing`: `Exclusive`, `TimeSlicing`, or `MIG`).
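Driver installation itself can also be automated with a NodeGroupConfiguration that runs before the module's own scripts. A minimal sketch, assuming Ubuntu nodes; the script weight, bundle, NodeGroup name, and package names are illustrative and must be adapted to your environment:

```yaml
apiVersion: deckhouse.io/v1alpha1
kind: NodeGroupConfiguration
metadata:
  name: install-nvidia-driver.sh
spec:
  weight: 30                # illustrative; pick a weight that runs before the module's gpu-* scripts
  bundles: ["ubuntu-lts"]   # adjust to your OS bundle
  nodeGroups: ["gpu"]       # illustrative; match your GPU NodeGroup name
  content: |
    # Assumption: Ubuntu nodes; driver package names differ per distribution.
    if ! command -v nvidia-smi >/dev/null; then
      apt-get update
      apt-get install -y nvidia-driver-535 nvidia-container-toolkit
    fi
```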
What the module deploys
- Node Feature Discovery (NFD) and GPU Feature Discovery (GFD) for GPU labeling.
- NVIDIA device plugin supporting Exclusive/TimeSlicing/MIG modes.
- MIG manager and configs for MIG-capable GPUs.
- DCGM Exporter and ready Grafana dashboards for GPU health.
How to enable
Set GPU parameters in the required NodeGroup; the module will label the nodes and deploy the components in `d8-nvidia-gpu`:

```yaml
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu
spec:
  cri: Containerd
  gpu:
    sharing: MIG # or Exclusive / TimeSlicing
    mig:
      partedConfig: all-1g.5gb
```

How it works (step-by-step)
Below is the full sequence and the key labels/taints involved.
Key labels (who sets them and why)
- `node.deckhouse.io/gpu=""`: set by the module hook on GPU NodeGroup nodes; together with `node.deckhouse.io/gpu-setup-complete=""` it participates in scheduling the "global" stack DaemonSets (NFD worker, DCGM/Exporter, etc.).
- `node.deckhouse.io/device-gpu.config=<Exclusive|TimeSlicing|MIG>`: set by the hook; used by the config-manager in GFD and the device plugin to select the right config.
- `node.deckhouse.io/gpu-setup-complete=""`: set by `gpu-sysctl.sh` (NodeGroupConfiguration) after local checks and configuration; until it appears, the stack (NFD worker, GFD, device plugin, DCGM/Exporter) is not scheduled onto the node.
- `feature.node.kubernetes.io/pci-*.present=true`: published by NFD; used (together with the NodeGroup selector) to schedule GFD onto nodes with NVIDIA PCI devices.
- `nvidia.com/*` (for example `nvidia.com/gpu.count`, `nvidia.com/mig.capable=true|false`): published by NFD from feature files generated by GFD (the NFD master is allowed to write into `nvidia.com`). The `nvidia.com/gpu.count>0` condition participates in scheduling the device plugin.
- `nvidia.com/mig.config=<profile|all-disabled>`: set by the hook; the desired MIG profile (e.g., `all-1g.5gb`) or `all-disabled` for MIG rollback/disable.
- `nvidia.com/mig.config.state=<pending|rebooting|success|failed>`: set by `nvidia-mig-manager` during reconfiguration.
- Taint `mig-reconfigure=true:NoSchedule`: set and removed by `nvidia-mig-manager` while the operation is running.
- Annotations `update.node.deckhouse.io/disruption-approved`, `update.node.deckhouse.io/draining`, `update.node.deckhouse.io/drained`: used for a controlled node drain during MIG reconfiguration.
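Taken together, a GPU node that has completed setup might carry a label set like the following. This is an illustrative excerpt only: the exact values depend on the hardware and the selected mode, and the PCI label format depends on the NFD configuration:

```yaml
# Illustrative excerpt of node labels after setup (MIG mode, hypothetical values)
metadata:
  labels:
    node.deckhouse.io/gpu: ""
    node.deckhouse.io/gpu-setup-complete: ""
    node.deckhouse.io/device-gpu.config: "MIG"
    feature.node.kubernetes.io/pci-0302_10de.present: "true"  # 0302: 3D controller class, 10de: NVIDIA vendor ID
    nvidia.com/gpu.count: "1"
    nvidia.com/mig.capable: "true"
    nvidia.com/mig.config: "all-1g.5gb"
    nvidia.com/mig.config.state: "success"
```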
Common flow (all modes)
- You set or change `spec.gpu` in the target NodeGroup.
- Helm deploys (or removes) components in `d8-nvidia-gpu`:
  - GFD and the device plugin as per-NodeGroup DaemonSets;
  - NFD master/gc on master nodes; NFD worker and DCGM/Exporter on GPU nodes.
- The module hook updates node labels:
  - adds or updates `node.deckhouse.io/gpu=""` and `node.deckhouse.io/device-gpu.config=...`;
  - for MIG, adds or updates `nvidia.com/mig.config=...`;
  - on GPU disable, removes `node.deckhouse.io/gpu` and `node.deckhouse.io/device-gpu.config`; if the node had `nvidia.com/mig.config`, sets it to `all-disabled` to trigger MIG rollback.
- NodeGroupConfiguration scripts run on the node (in weight order): `gpu-check.sh` → `gpu-runtime.sh` → `gpu-sysctl.sh`:
  - on success, the node gets `node.deckhouse.io/gpu-setup-complete=""`;
  - on cleanup or errors, the label is cleared, the runtime drop-in is removed, and the sysctl settings are restored.
- After that, the stack converges on the node:
  - NFD publishes `feature.node.kubernetes.io/pci-*` and serves custom feature files;
  - GFD writes GPU feature files into `/etc/kubernetes/node-feature-discovery/features.d`, and NFD publishes them as `nvidia.com/*` labels;
  - the device plugin exposes resources (depending on the mode), and DCGM/Exporter start exposing metrics.
Exclusive mode
Enable
- Set `spec.gpu.sharing: Exclusive`.
- The module sets `node.deckhouse.io/device-gpu.config=Exclusive`; after `node.deckhouse.io/gpu-setup-complete` is present, it brings up GFD and the device plugin.
- The node exposes `nvidia.com/gpu`; each Pod gets a full GPU.
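A workload then requests a whole GPU through the extended resource. A minimal sketch; the Pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test  # illustrative name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # one whole GPU in Exclusive mode
```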
Disable
- Remove `spec.gpu` from the NodeGroup (or move the nodes to a non-GPU NodeGroup).
- The module removes `node.deckhouse.io/gpu` and `node.deckhouse.io/device-gpu.config` from the node; NodeGroupConfiguration clears `node.deckhouse.io/gpu-setup-complete` and rolls back the runtime/sysctl changes.
- GPU stack DaemonSets stop being scheduled onto the node; NFD GC eventually cleans up the `nvidia.com/*` labels.
TimeSlicing mode
Enable
- Set `spec.gpu.sharing: TimeSlicing` and, optionally, `spec.gpu.timeSlicing.partitionCount` (default is 4).
- The module sets `node.deckhouse.io/device-gpu.config=TimeSlicing`; the device plugin applies the time-slicing config.
- The node still exposes `nvidia.com/gpu`, but with more "virtual" slots (per `partitionCount`).
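For example, a NodeGroup that splits each physical GPU into 8 time-sliced slots could look like this (a sketch built from the fields named above; the NodeGroup name is illustrative):

```yaml
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-shared  # illustrative name
spec:
  cri: Containerd
  gpu:
    sharing: TimeSlicing
    timeSlicing:
      partitionCount: 8  # each physical GPU is exposed as 8 nvidia.com/gpu slots
```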
Disable
- Switch to `Exclusive` or remove `spec.gpu` from the NodeGroup.
- On a mode switch, `node.deckhouse.io/device-gpu.config` changes and GFD/device plugin reload their configs; for a full GPU disable, follow Exclusive → Disable.
MIG mode
Enable
- Set:
  - `spec.gpu.sharing: MIG`
  - `spec.gpu.mig.partedConfig: <profile name>` (e.g., `all-1g.5gb`)
- The module sets `node.deckhouse.io/device-gpu.config=MIG` and `nvidia.com/mig.config=<profile>`.
- Once GFD/NFD publishes `nvidia.com/mig.capable=true` (the GPU supports MIG), `nvidia-mig-manager` is scheduled onto the node (it runs as a DaemonSet and reacts to changes of the `nvidia.com/mig.config` label).
- When it needs to reconfigure MIG, it:
  - sets `nvidia.com/mig.config.state=pending`;
  - "pauses" GPU clients by setting the deploy labels to `paused-for-mig-change`:
    - `nvidia.com/gpu.deploy.device-plugin`
    - `nvidia.com/gpu.deploy.gpu-feature-discovery`
    - `nvidia.com/gpu.deploy.dcgm-exporter`
    - `nvidia.com/gpu.deploy.dcgm`
    - `nvidia.com/gpu.deploy.nvsm`
  - sets the taint `mig-reconfigure=true:NoSchedule`;
  - waits for `update.node.deckhouse.io/disruption-approved` (or an already started drain via `update.node.deckhouse.io/draining`/`update.node.deckhouse.io/drained`), then sets `update.node.deckhouse.io/draining=bashible` and waits for `update.node.deckhouse.io/drained`;
  - deletes (and waits for the shutdown of) the GPU client Pods on the node: device plugin, GFD, DCGM Exporter, DCGM, plus the validators (cuda/plugin);
  - applies the selected MIG profile (it may set `nvidia.com/mig.config.state=rebooting` and reboot the node if needed);
  - finishes with `nvidia.com/mig.config.state=success` (or `failed`), removes the `mig-reconfigure` taint, runs `kubectl uncordon`, removes `update.node.deckhouse.io/drained`/`update.node.deckhouse.io/disruption-approved`, restores the `nvidia.com/gpu.deploy.*` labels to `true`, and returns the node to service.
- After the profile is applied, the cluster exposes resources like `nvidia.com/mig-<profile>` (e.g., `nvidia.com/mig-1g.5gb`).
- If the `nvidia-mig-manager` Pod is restarted or removed, its `preStop` hook waits for the active operation to finish (the `/processing` file), then runs `kubectl uncordon`, removes the `mig-reconfigure=true:NoSchedule` taint, and removes `update.node.deckhouse.io/drained`/`update.node.deckhouse.io/disruption-approved` (best effort; it may not complete on forced termination).
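A workload then requests a specific MIG slice instead of a whole GPU. A minimal sketch for the `all-1g.5gb` profile; the Pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-test  # illustrative name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # one 1g.5gb MIG slice
```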
Disable (MIG → non-MIG, or full GPU disable)
- When switching away from MIG (to `Exclusive`/`TimeSlicing`) or removing `spec.gpu`, the module sets `nvidia.com/mig.config=all-disabled` to roll back MIG on the node.
- `nvidia-mig-manager` applies `all-disabled` in the same way (taint + drain + operation). If `nvidia.com/mig.capable` is already missing (e.g., after the GPU/NFD labels were removed), the manager is still scheduled while `nvidia.com/mig.config=all-disabled` exists, so it can finish the rollback and remove the taint.
- After a successful MIG disable (`nvidia.com/mig.config.state=success`), the script removes `nvidia.com/mig.config` and `nvidia.com/mig.config.state` if the node is no longer in MIG mode or GPU is disabled, so the manager does not hang on the node waiting for a label change.
- The manager does not run on GPUs without MIG support (`nvidia.com/mig.capable=false`); use `Exclusive` or `TimeSlicing` for such nodes.
Monitoring
DCGM Exporter publishes metrics; Grafana dashboards show GPU load and MIG state.
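The exported metrics can also drive alerting. A hedged sketch of a PrometheusRule built on `DCGM_FI_DEV_GPU_UTIL` (a standard DCGM Exporter metric); the rule name, threshold, and the `Hostname` label are assumptions that depend on your Prometheus and scrape configuration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-alerts  # illustrative name
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUHighUtilization  # illustrative alert
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) > 90
      for: 15m
      annotations:
        summary: "GPU utilization on {{ $labels.Hostname }} has been above 90% for 15 minutes"
```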