The module lifecycle stage: General Availability
The module has requirements for installation
How do I work with GPU nodes?
Step-by-step procedure for adding a GPU node to the cluster
Starting with Deckhouse 1.75, if a NodeGroup contains the spec.gpu section, the gpu module automatically:
- configures containerd with `default_runtime = "nvidia"` (via NodeGroupConfiguration);
- applies the required system settings (including fixes for the NVIDIA Container Toolkit);
- deploys system components: NFD, GFD, NVIDIA Device Plugin, DCGM Exporter, and, if needed, MIG Manager.
Always specify the desired mode in `spec.gpu.sharing` (`Exclusive`, `TimeSlicing`, or `MIG`). Manual containerd configuration (via NodeGroupConfiguration, TOML patches, and so on) is not required and must not be combined with the automatic setup.
For the list of supported NVIDIA Container Toolkit platforms, see the official documentation.
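Besides the TimeSlicing example used in the steps below, the simplest configuration is Exclusive mode; a minimal sketch of the `spec.gpu` fragment, using only the fields named above:

```yaml
# NodeGroup spec fragment: each Pod receives a whole GPU.
gpu:
  sharing: Exclusive
```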
To add a GPU node to the cluster, perform the following steps:
- Create a NodeGroup for GPU nodes.

  An example with TimeSlicing enabled (`partitionCount: 4`) and a typical taint/label:

  ```yaml
  apiVersion: deckhouse.io/v1
  kind: NodeGroup
  metadata:
    name: gpu
  spec:
    nodeType: CloudStatic # Or Static/CloudEphemeral, depending on your infrastructure.
    gpu:
      sharing: TimeSlicing
      timeSlicing:
        partitionCount: 4
    nodeTemplate:
      labels:
        node-role/gpu: ""
      taints:
        - key: node-role
          value: gpu
          effect: NoSchedule
  ```

  If you use custom taint keys, make sure they are allowed in the `spec.settings.modules.placement.customTolerationKeys` array of the `global` ModuleConfig, so that workloads can add the corresponding `tolerations`.

  For the full field schema, see the NodeGroup CR documentation.
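As an illustration, a sketch of the `global` ModuleConfig change that allows the `node-role` taint key from the example above (the settings version may differ in your installation; check the version your cluster currently uses):

```yaml
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: global
spec:
  version: 2 # Use the settings version your installation reports.
  settings:
    modules:
      placement:
        customTolerationKeys:
          - node-role
```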
- Install the NVIDIA driver and NVIDIA Container Toolkit.

  Install them on the nodes either manually or via a NodeGroupConfiguration. Below are NodeGroupConfiguration examples for the `gpu` NodeGroup.

  Ubuntu:

  ```yaml
  apiVersion: deckhouse.io/v1alpha1
  kind: NodeGroupConfiguration
  metadata:
    name: install-cuda.sh
  spec:
    bundles:
      - ubuntu-lts
    content: |
      #!/bin/bash
      if [ ! -f "/etc/apt/sources.list.d/nvidia-container-toolkit.list" ]; then
        distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
        curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
        curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      fi
      bb-apt-install nvidia-container-toolkit nvidia-driver-535-server
      nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
    nodeGroups:
      - gpu
    weight: 30
  ```

  CentOS:

  ```yaml
  apiVersion: deckhouse.io/v1alpha1
  kind: NodeGroupConfiguration
  metadata:
    name: install-cuda.sh
  spec:
    bundles:
      - centos
    content: |
      #!/bin/bash
      if [ ! -f "/etc/yum.repos.d/nvidia-container-toolkit.repo" ]; then
        distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
        curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
      fi
      bb-dnf-install nvidia-container-toolkit nvidia-driver
      nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
    nodeGroups:
      - gpu
    weight: 30
  ```

  After these configurations are applied, bootstrap and reboot the nodes so that the settings take effect and the drivers are installed.
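Both scripts build the repository URL from `/etc/os-release`. To preview the value a particular node will substitute (a quick sanity check, not required for the installation):

```shell
# Prints the distribution string used in the NVIDIA repository URL,
# e.g. "ubuntu22.04" or "centos9".
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
echo "$distribution"
```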
- Verify the installation on the node using the command:

  ```shell
  nvidia-smi
  ```

  Expected healthy output (example):

  ```console
  +---------------------------------------------------------------------------------------+
  | NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |
  |-----------------------------------------+----------------------+----------------------+
  | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
  |                                         |                      |               MIG M. |
  |=========================================+======================+======================|
  |   0  Tesla V100-PCIE-32GB           Off | 00000000:65:00.0 Off |                    0 |
  | N/A   32C    P0              35W / 250W |      0MiB / 32768MiB |      0%      Default |
  |                                         |                      |                  N/A |
  +-----------------------------------------+----------------------+----------------------+
  ```
- Verify infrastructure components in the cluster.

  NVIDIA Pods in `d8-nvidia-gpu`:

  ```shell
  d8 k -n d8-nvidia-gpu get pod
  ```

  Expected healthy output (example):

  ```console
  NAME                                  READY   STATUS    RESTARTS   AGE
  gpu-feature-discovery-80ceb7d-r842q   2/2     Running   0          2m53s
  nvidia-dcgm-exporter-w9v9h            1/1     Running   0          2m53s
  nvidia-dcgm-njqqb                     1/1     Running   0          2m53s
  nvidia-device-plugin-80ceb7d-8xt8g    2/2     Running   0          2m53s
  ```

  NFD Pods in `d8-nvidia-gpu`:

  ```shell
  d8 k -n d8-nvidia-gpu get pods | egrep '^(NAME|node-feature-discovery)'
  ```

  Expected healthy output (example):

  ```console
  NAME                                             READY   STATUS    RESTARTS   AGE
  node-feature-discovery-gc-6d845765df-45vpj       1/1     Running   0          3m6s
  node-feature-discovery-master-74696fd9d5-wkjk4   1/1     Running   0          3m6s
  node-feature-discovery-worker-5f4kv              1/1     Running   0          3m8s
  ```

  Resource exposure on the node:

  ```shell
  d8 k describe node <node-name>
  ```

  Output snippet (example):

  ```console
  Capacity:
    cpu:             40
    memory:          263566308Ki
    nvidia.com/gpu:  4
  Allocatable:
    cpu:             39930m
    memory:          262648294441
    nvidia.com/gpu:  4
  ```
Run functional tests.
Option A. Invoke
nvidia-smifrom inside a container:apiVersion: batch/v1 kind: Job metadata: name: nvidia-cuda-test namespace: default spec: completions: 1 template: spec: restartPolicy: Never nodeSelector: node.deckhouse.io/group: gpu containers: - name: nvidia-cuda-test image: nvidia/cuda:11.6.2-base-ubuntu20.04 imagePullPolicy: "IfNotPresent" command: - nvidia-smiCheck the logs:
d8 k logs job/nvidia-cuda-testOption B. CUDA sample (vectoradd):
apiVersion: batch/v1 kind: Job metadata: name: gpu-operator-test namespace: default spec: completions: 1 template: spec: restartPolicy: Never nodeSelector: node.deckhouse.io/group: gpu containers: - name: gpu-operator-test image: nvidia/samples:vectoradd-cuda10.2 imagePullPolicy: "IfNotPresent"
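Regular workloads consume GPUs the same way, via the extended resource. A sketch of a Pod for the `gpu` NodeGroup from step 1 (the resource name applies to the Exclusive and TimeSlicing modes; the toleration assumes the example `node-role` taint and that the key is allowed via `customTolerationKeys`):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload
spec:
  nodeSelector:
    node.deckhouse.io/group: gpu
  tolerations:
    - key: node-role
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: app
      image: nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1 # One (possibly time-sliced) GPU for this Pod.
```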
`Incompatible strategy detected auto` in nvidia-device-plugin/gpu-feature-discovery logs

Errors like the following:

```console
Incompatible strategy detected auto
failed to create resource manager: unsupported strategy auto
invalid device discovery strategy
```

mean the component cannot detect the NVML platform inside the container (typically, `libnvidia-ml.so.*` is not available or the NVIDIA Container Toolkit runtime is not in use).

What to check:

- `nvidia-smi` works on the node.
- The NVIDIA Container Toolkit is installed (`/usr/bin/nvidia-container-runtime` exists).
- containerd is configured to use the `nvidia` runtime on GPU nodes (the `gpu` module does this after the driver/toolkit installation and a containerd restart/node reboot).
- After fixing the issue, recreate the `nvidia-device-plugin-*` and `gpu-feature-discovery-*` Pods in the `d8-nvidia-gpu` namespace.
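The file checks above can be sketched as a small script to run on the GPU node (the paths are the usual install locations; adjust them for your distribution):

```shell
# Reports whether the NVIDIA Container Toolkit binary and the containerd
# config are present on this node; "MISSING" lines point at the gap.
for f in /usr/bin/nvidia-container-runtime /etc/containerd/config.toml; do
  if [ -e "$f" ]; then
    echo "OK: $f"
  else
    echo "MISSING: $f"
  fi
done
```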
How to monitor GPUs?
Deckhouse Kubernetes Platform automatically deploys DCGM Exporter; GPU metrics are scraped by Prometheus and available in Grafana.
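As an illustration, DCGM Exporter publishes per-GPU metrics under its standard names; a sketch of a Prometheus query for average utilization, assuming the exporter's default `DCGM_FI_DEV_GPU_UTIL` metric and `Hostname` label:

```promql
# Average GPU utilization per node, as reported by DCGM Exporter.
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
```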
Which GPU modes are supported?
- Exclusive — the node exposes the `nvidia.com/gpu` resource; each Pod receives an entire GPU.
- TimeSlicing — time-sharing of a single GPU among multiple Pods (default `partitionCount: 4`); Pods still request `nvidia.com/gpu`.
- MIG (Multi-Instance GPU) — hardware partitioning of supported GPUs into independent instances; with the `all-1g.5gb` profile, the cluster exposes resources such as `nvidia.com/mig-1g.5gb`.
See examples in Managing nodes: examples.
How to view available MIG profiles in the cluster?
Pre-defined profiles are stored in the `mig-parted-config` ConfigMap in the `d8-nvidia-gpu` namespace and can be viewed with:

```shell
d8 k -n d8-nvidia-gpu get cm mig-parted-config -o json | jq -r '.data["config.yaml"]'
```

The `mig-configs` section lists GPU models (by PCI ID) and the MIG profiles each card supports (for example, `all-1g.5gb`, `all-2g.10gb`, `all-balanced`). Select the profile that matches your accelerator and set its name in `spec.gpu.mig.partedConfig` of the NodeGroup.
How to define a custom MIG profile per GPU on a node?
Use `partedConfig: custom` and describe the MIG partitioning per GPU index:

```yaml
gpu:
  sharing: MIG
  mig:
    partedConfig: custom
    customConfigs:
      - index: 0
        slices:
          - profile: "1g.10gb"
            count: 7
      - index: 1
        slices:
          - profile: "2g.20gb"
            count: 3
```

What the module does:

- Generates a unique MIG config name for the NodeGroup and sets it in the `nvidia.com/mig.config` label.
- For GPUs listed in `customConfigs`, renders `mig-enabled: true` with the declared `slices`.
- For all unspecified indexes (all remaining GPUs on the node), renders `mig-enabled: false`, so those cards remain in full mode and do not override explicitly configured GPUs.
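With the custom configuration above, once reconfiguration succeeds the node is expected to advertise the declared slices as extended resources (a sketch; the counts follow directly from the `customConfigs` values, and activation still requires GPUs with enough capacity for the chosen profiles):

```console
Capacity:
  nvidia.com/mig-1g.10gb:  7
  nvidia.com/mig-2g.20gb:  3
```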
MIG profile does not activate — what to check?
- GPU model: MIG is supported on H100/A100/A30; it is not supported on V100/T4. See the profile tables in the NVIDIA MIG guide.

- NodeGroup configuration:

  ```yaml
  gpu:
    sharing: MIG
    mig:
      partedConfig: all-1g.5gb
  ```

- Wait until `nvidia-mig-manager` completes the drain of the node and reconfigures the GPU. This can take several minutes. While it is running, the node carries the `mig-reconfigure` taint; the taint is removed when the operation succeeds.

- Track the progress via the `nvidia.com/mig.config.state` label on the node: `pending`, `rebooting`, `success` (or `failed` if something goes wrong).

- If `nvidia.com/mig-*` resources are still missing, check:

  ```shell
  d8 k -n d8-nvidia-gpu logs daemonset/nvidia-mig-manager
  nvidia-smi -L
  ```
Are AMD or Intel GPUs supported?
At this time, Deckhouse Kubernetes Platform automatically configures NVIDIA GPUs only. Support for AMD (ROCm) and Intel GPUs is being worked on and is planned for future releases.