Available in: EE

The module lifecycle stage: General Availability.
The module has requirements for installation.

GPU module

The module brings the NVIDIA stack to Deckhouse Kubernetes Platform for GPU workloads: NFD/GFD, device plugin (Exclusive/TimeSlicing/MIG), MIG manager, and DCGM/Exporter with Grafana dashboards.

Prerequisites

  • NVIDIA driver and NVIDIA Container Toolkit are installed on target nodes (the containerd runtime is configured by this module via NodeGroupConfiguration).
  • spec.gpu is set in the target NodeGroup (sharing: Exclusive, TimeSlicing, or MIG).

What the module deploys

  • Node Feature Discovery (NFD) and GPU Feature Discovery (GFD) for GPU labeling.
  • NVIDIA device plugin supporting Exclusive/TimeSlicing/MIG modes.
  • MIG manager and configs for MIG-capable GPUs.
  • DCGM Exporter and ready Grafana dashboards for GPU health.

How to enable

Set GPU parameters in the target NodeGroup — the module will label its nodes and deploy the components in the d8-nvidia-gpu namespace:

```yaml
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu
spec:
  cri: Containerd
  gpu:
    sharing: MIG # or Exclusive / TimeSlicing
    mig:
      partedConfig: all-1g.5gb
```

How it works (step-by-step)

Below is the full sequence and the key labels/taints involved.

Key labels (who sets them and why)

  • node.deckhouse.io/gpu="" — set by the module hook on GPU NodeGroup nodes; together with node.deckhouse.io/gpu-setup-complete="" it participates in scheduling “global” stack DaemonSets (NFD worker, DCGM/Exporter, etc.).
  • node.deckhouse.io/device-gpu.config=<Exclusive|TimeSlicing|MIG> — set by the hook; used by config-manager in GFD and the device plugin to select the right config.
  • node.deckhouse.io/gpu-setup-complete="" — set by gpu-sysctl.sh (NodeGroupConfiguration) after local checks/config; until it appears, the stack is not scheduled onto the node (NFD worker, GFD, device plugin, DCGM/Exporter).
  • feature.node.kubernetes.io/pci-*.present=true — published by NFD; used (together with the NodeGroup selector) to schedule GFD onto nodes with NVIDIA PCI devices.
  • nvidia.com/* (for example nvidia.com/gpu.count, nvidia.com/mig.capable=true|false) — published by NFD from feature files generated by GFD (NFD master is allowed to write into nvidia.com). The nvidia.com/gpu.count>0 label participates in scheduling the device plugin.
  • nvidia.com/mig.config=<profile|all-disabled> — set by the hook: desired MIG profile (e.g., all-1g.5gb) or all-disabled for MIG rollback/disable.
  • nvidia.com/mig.config.state=<pending|rebooting|success|failed> — set by nvidia-mig-manager during reconfiguration.
  • taint mig-reconfigure=true:NoSchedule — set/removed by nvidia-mig-manager while the operation is running.
  • annotations update.node.deckhouse.io/disruption-approved, update.node.deckhouse.io/draining, update.node.deckhouse.io/drained — used for controlled node drain during MIG.
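Putting the labels above together, a fully configured MIG-capable node typically carries a set like the following. This is an illustrative fragment only: the node name is hypothetical, and the exact values depend on the hardware and the chosen profile.

```yaml
# Illustrative label set on a configured MIG node (values are examples)
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-1                          # hypothetical node name
  labels:
    node.deckhouse.io/gpu: ""                 # set by the module hook
    node.deckhouse.io/device-gpu.config: "MIG"
    node.deckhouse.io/gpu-setup-complete: ""  # set by gpu-sysctl.sh after local checks
    nvidia.com/mig.capable: "true"            # published by NFD from GFD feature files
    nvidia.com/mig.config: "all-1g.5gb"       # desired MIG profile, set by the hook
    nvidia.com/mig.config.state: "success"    # set by nvidia-mig-manager
```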

Common flow (all modes)

  1. You set/change spec.gpu in the target NodeGroup.
  2. Helm deploys (or removes) components in d8-nvidia-gpu:
    • GFD and device plugin as per-NodeGroup DaemonSets;
    • NFD master/gc on master nodes; NFD worker/DCGM/Exporter on GPU nodes.
  3. The module hook updates node labels:
    • adds/updates node.deckhouse.io/gpu="" and node.deckhouse.io/device-gpu.config=...;
    • for MIG it adds/updates nvidia.com/mig.config=...;
    • on GPU disable it removes node.deckhouse.io/gpu and node.deckhouse.io/device-gpu.config; if the node had nvidia.com/mig.config it sets it to all-disabled to trigger MIG rollback.
  4. NodeGroupConfiguration scripts run on the node (by weight): gpu-check.sh → gpu-runtime.sh → gpu-sysctl.sh.
    • on success the node gets node.deckhouse.io/gpu-setup-complete="";
    • on cleanup/errors the label is cleared, the runtime drop-in is removed, and sysctl is restored.
  5. After that, the stack “converges” on the node:
    • NFD publishes feature.node.kubernetes.io/pci-* and serves custom feature files;
    • GFD writes GPU feature files into /etc/kubernetes/node-feature-discovery/features.d, and NFD publishes them as nvidia.com/*;
    • the device plugin exposes resources (depending on the mode), DCGM/Exporter start exposing metrics.

Exclusive mode

Enable

  1. Set spec.gpu.sharing: Exclusive.
  2. The module sets node.deckhouse.io/device-gpu.config=Exclusive; after node.deckhouse.io/gpu-setup-complete is present, it brings up GFD and the device plugin.
  3. The node exposes nvidia.com/gpu; each Pod gets a full GPU.
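A workload then requests a whole GPU through the extended resource. A minimal sketch; the Pod name and image are placeholders, not part of the module:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-job            # hypothetical name
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example image
      resources:
        limits:
          nvidia.com/gpu: 1 # one full GPU in Exclusive mode
```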

Disable

  1. Remove spec.gpu from the NodeGroup (or move nodes to a non-GPU NodeGroup).
  2. The module removes node.deckhouse.io/gpu and node.deckhouse.io/device-gpu.config from the node; NodeGroupConfiguration clears node.deckhouse.io/gpu-setup-complete and rolls back runtime/sysctl changes.
  3. GPU stack DaemonSets stop being scheduled onto the node; NFD GC eventually cleans nvidia.com/* labels.

TimeSlicing mode

Enable

  1. Set spec.gpu.sharing: TimeSlicing and optionally spec.gpu.timeSlicing.partitionCount (default is 4).
  2. The module sets node.deckhouse.io/device-gpu.config=TimeSlicing; the device plugin applies the time-slicing config.
  3. The node still exposes nvidia.com/gpu, but with more “virtual” slots (the number of slots per GPU is set by partitionCount).
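A TimeSlicing NodeGroup with an explicit partition count might look like this (a sketch following the example in “How to enable”; the NodeGroup name is a placeholder):

```yaml
apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu-shared          # hypothetical name
spec:
  cri: Containerd
  gpu:
    sharing: TimeSlicing
    timeSlicing:
      partitionCount: 8     # default is 4; each physical GPU is exposed as 8 slots
```

Note that time-sliced “slots” share the GPU without memory or fault isolation, so this mode suits trusted, interruption-tolerant workloads.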

Disable

  1. Switch to Exclusive or remove spec.gpu from the NodeGroup.
  2. On mode switch, node.deckhouse.io/device-gpu.config changes and GFD/device plugin reload configs; on full GPU disable follow Exclusive → Disable.

MIG mode

Enable

  1. Set:
    • spec.gpu.sharing: MIG
    • spec.gpu.mig.partedConfig: <profile name> (e.g., all-1g.5gb)
  2. The module sets node.deckhouse.io/device-gpu.config=MIG and nvidia.com/mig.config=<profile>.
  3. Once GFD/NFD publishes nvidia.com/mig.capable=true (GPU supports MIG), nvidia-mig-manager is scheduled to the node (it runs as a DaemonSet and reacts to changes of the nvidia.com/mig.config label).
  4. When it needs to reconfigure MIG, it:
    • sets nvidia.com/mig.config.state=pending;
    • “pauses” GPU clients by setting the following node labels to paused-for-mig-change:
      • nvidia.com/gpu.deploy.device-plugin
      • nvidia.com/gpu.deploy.gpu-feature-discovery
      • nvidia.com/gpu.deploy.dcgm-exporter
      • nvidia.com/gpu.deploy.dcgm
      • nvidia.com/gpu.deploy.nvsm
    • sets taint mig-reconfigure=true:NoSchedule;
    • waits for update.node.deckhouse.io/disruption-approved (or an already started drain via update.node.deckhouse.io/draining/update.node.deckhouse.io/drained), then sets update.node.deckhouse.io/draining=bashible and waits for update.node.deckhouse.io/drained;
    • deletes (and waits for shutdown of) GPU client pods on the node: device plugin, GFD, DCGM Exporter, DCGM, plus validators (cuda/plugin);
    • applies the selected MIG profile (may set nvidia.com/mig.config.state=rebooting and reboot if needed);
    • finishes with nvidia.com/mig.config.state=success (or failed), removes the mig-reconfigure taint, runs kubectl uncordon, removes update.node.deckhouse.io/drained/update.node.deckhouse.io/disruption-approved, restores nvidia.com/gpu.deploy.* to true, and returns the node back to service.
  5. After applying the profile, the cluster exposes resources like nvidia.com/mig-<profile> (e.g., nvidia.com/mig-1g.5gb).
  6. If the nvidia-mig-manager Pod is restarted/removed, the preStop hook waits for an active operation to finish (/processing file), then runs kubectl uncordon, removes the mig-reconfigure=true:NoSchedule taint, and removes update.node.deckhouse.io/drained/update.node.deckhouse.io/disruption-approved (best-effort, may not finish on forced termination).
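Once the profile is applied (step 5), a Pod requests a MIG slice through the corresponding extended resource. A sketch assuming the all-1g.5gb profile from the example above; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-job                    # hypothetical name
spec:
  containers:
    - name: worker
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1 # one 1g.5gb MIG instance
```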

Disable (MIG → non-MIG, or full GPU disable)

  1. When switching away from MIG (Exclusive/TimeSlicing) or removing spec.gpu, the module sets nvidia.com/mig.config=all-disabled to roll back MIG on the node.
  2. nvidia-mig-manager applies all-disabled in the same way (taint + drain + operation). If nvidia.com/mig.capable is already missing (e.g., after GPU/NFD labels were removed), the manager is still scheduled while nvidia.com/mig.config=all-disabled exists to finish the rollback and remove the taint.
  3. After successful MIG disable (nvidia.com/mig.config.state=success), the script removes nvidia.com/mig.config and nvidia.com/mig.config.state if the node is no longer in MIG mode or GPU is disabled, so the manager does not “hang” on the node waiting for a label change.
  4. The manager does not run on GPUs without MIG support (nvidia.com/mig.capable=false); use Exclusive or TimeSlicing for such nodes.

Monitoring

DCGM Exporter publishes GPU metrics; the bundled Grafana dashboards show GPU utilization, health, and MIG state.