FAQ | gpu | Deckhouse

The module lifecycle stage: General Availability
The module has requirements for installation

How do I work with GPU nodes?

Step-by-step procedure for adding a GPU node to the cluster

Starting with Deckhouse 1.75, if a NodeGroup contains the spec.gpu section, the gpu module automatically:

configures containerd with default_runtime = "nvidia" (via NodeGroupConfiguration);
applies the required system settings (including fixes for the NVIDIA Container Toolkit);
deploys system components: NFD, GFD, NVIDIA Device Plugin, DCGM Exporter, and, if needed, MIG Manager.

Always specify the desired mode in spec.gpu.sharing (Exclusive, TimeSlicing, or MIG). Manual containerd configuration (via NodeGroupConfiguration, TOML, etc.) is not required and must not be combined with the automatic setup. For the list of supported NVIDIA Container Toolkit platforms, see the official documentation.

To add a GPU node to the cluster, perform the following steps:

Create a NodeGroup for GPU nodes.

An example with TimeSlicing enabled (partitionCount: 4) and typical taint/label:

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: gpu
spec:
  nodeType: CloudStatic # or Static/CloudEphemeral — depending on your infrastructure.
  gpu:
    sharing: TimeSlicing
    timeSlicing:
      partitionCount: 4
  nodeTemplate:
    labels:
      node-role/gpu: ""
    taints:
      - key: node-role
        value: gpu
        effect: NoSchedule

If you use custom taint keys, make sure they are listed in the .spec.settings.modules.placement.customTolerationKeys array of the global ModuleConfig so that workloads can add the corresponding tolerations.

Full field schema: see NodeGroup CR documentation.

Install the NVIDIA driver and NVIDIA Container Toolkit (nvidia-container-toolkit).

Install the NVIDIA driver and NVIDIA Container Toolkit on the nodes—either manually or via a NodeGroupConfiguration. Below are NodeGroupConfiguration examples for the gpu NodeGroup.

Ubuntu

apiVersion: deckhouse.io/v1alpha1
kind: NodeGroupConfiguration
metadata:
  name: install-cuda.sh
spec:
  bundles:
    - ubuntu-lts
  content: |
    #!/bin/bash
    if [ ! -f "/etc/apt/sources.list.d/nvidia-container-toolkit.list" ]; then
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
      curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    fi
    bb-apt-install nvidia-container-toolkit nvidia-driver-535-server
    nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
  nodeGroups:
    - gpu
  weight: 30

CentOS

apiVersion: deckhouse.io/v1alpha1
kind: NodeGroupConfiguration
metadata:
  name: install-cuda.sh
spec:
  bundles:
    - centos
  content: |
    #!/bin/bash
    if [ ! -f "/etc/yum.repos.d/nvidia-container-toolkit.repo" ]; then
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    fi
    bb-dnf-install nvidia-container-toolkit nvidia-driver
    nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
  nodeGroups:
    - gpu
  weight: 30

After these configurations are applied, bootstrap the nodes and reboot them so that the settings take effect and the drivers are installed.

Verify installation on the node using the command:

nvidia-smi

Make sure the GPU is not used by third-party processes: before running user workloads, the Processes section of the nvidia-smi output must not contain any processes using the GPU.

On nodes with a graphical environment, the GPU can be used, for example, by graphical session processes or a display manager: Xorg, gnome-shell, gdm, sddm, lightdm, and others. Such processes can occupy GPU memory and interfere with workloads, as well as with applying the MIG configuration.

Expected healthy output (example):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-32GB           Off | 00000000:65:00.0 Off |                    0 |
| N/A   32C    P0              35W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
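The requirement that the Processes section be empty can be checked without reading the table by eye. The sketch below is an illustration only: `gpu_is_idle` is a hypothetical helper name, and it relies on the `No running processes found` line that `nvidia-smi` prints when the Processes table is empty.

```shell
# Reads `nvidia-smi` output on stdin and succeeds only when the
# Processes table reports that nothing is running on any GPU.
gpu_is_idle() {
  grep -q 'No running processes found'
}

# Usage on the node:
#   nvidia-smi | gpu_is_idle && echo "GPU is free" || nvidia-smi
```

If the check fails, the full `nvidia-smi` output shows which processes (for example, Xorg or a display manager) still hold the GPU.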

Verify infrastructure components in the cluster.

Device Plugin mode — NVIDIA Pods in d8-nvidia-gpu:

d8 k -n d8-nvidia-gpu get pod

Expected healthy output (example):

NAME                                  READY   STATUS    RESTARTS   AGE
gpu-feature-discovery-80ceb7d-r842q   2/2     Running   0          2m53s
nvidia-dcgm-exporter-w9v9h            1/1     Running   0          2m53s
nvidia-dcgm-njqqb                     1/1     Running   0          2m53s
nvidia-device-plugin-80ceb7d-8xt8g    2/2     Running   0          2m53s

NFD Pods in d8-nvidia-gpu:

d8 k -n d8-nvidia-gpu get pods | grep -E '^(NAME|node-feature-discovery)'

Expected healthy output (example):

NAME                                             READY   STATUS      RESTARTS       AGE
node-feature-discovery-gc-6d845765df-45vpj       1/1     Running     0              3m6s
node-feature-discovery-master-74696fd9d5-wkjk4   1/1     Running     0              3m6s
node-feature-discovery-worker-5f4kv              1/1     Running     0              3m8s

DRA mode — DRA Pods in d8-nvidia-gpu:

d8 k -n d8-nvidia-gpu get pod

Expected healthy output (example):

NAME                              READY   STATUS    RESTARTS   AGE
gpu-controller-7d9f8b6c4-xk2lp   2/2     Running   0          5m
gpu-node-agent-q8tnz              1/1     Running   0          5m

Resource exposure on the node:

d8 k describe node <node-name>

Output snippet (example):

Capacity:
  cpu:                40
  memory:             263566308Ki
  nvidia.com/gpu:     4
Allocatable:
  cpu:                39930m
  memory:             262648294441
  nvidia.com/gpu:     4
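To read the allocatable GPU count in a script rather than from the full description, the output can be filtered. This is a minimal sketch: `allocatable_gpus` is a hypothetical helper, and it assumes the `describe node` layout shown in the snippet above.

```shell
# Reads `describe node` output on stdin and prints the nvidia.com/gpu
# value from the Allocatable section only (Capacity is ignored).
allocatable_gpus() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }   # entering the section
    /^[^ \t]/       { in_alloc = 0 }         # any other top-level key ends it
    in_alloc && $1 == "nvidia.com/gpu:" { print $2 }
  '
}

# Usage:
#   d8 k describe node <node-name> | allocatable_gpus
```

With TimeSlicing enabled, the printed value is expected to reflect the number of time-sliced partitions rather than the number of physical GPUs.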

Run functional tests.

Option A. Invoke nvidia-smi from inside a container:

apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-cuda-test
  namespace: default
spec:
  completions: 1
  template:
    spec:
      restartPolicy: Never
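      nodeSelector:
        node.deckhouse.io/group: gpu
      # Tolerate the taint from the NodeGroup example above; adjust to your own taints.
      tolerations:
        - key: node-role
          operator: Equal
          value: gpu
          effect: NoSchedule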
      containers:
        - name: nvidia-cuda-test
          image: nvidia/cuda:11.6.2-base-ubuntu20.04
          imagePullPolicy: "IfNotPresent"
          command:
            - nvidia-smi

Check the logs:

d8 k logs job/nvidia-cuda-test

Option B. CUDA sample (vectoradd):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-test
  namespace: default
spec:
  completions: 1
  template:
    spec:
      restartPolicy: Never
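      nodeSelector:
        node.deckhouse.io/group: gpu
      # Tolerate the taint from the NodeGroup example above; adjust to your own taints.
      tolerations:
        - key: node-role
          operator: Equal
          value: gpu
          effect: NoSchedule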
      containers:
        - name: gpu-operator-test
          image: nvidia/samples:vectoradd-cuda10.2
          imagePullPolicy: "IfNotPresent"
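The result of the vectoradd Job can be checked from its logs. The sketch below assumes the sample prints a `Test PASSED` line on success, as the CUDA vectorAdd sample does; `vectoradd_passed` is a hypothetical helper name and the timeout value is arbitrary.

```shell
# Reads the Job logs on stdin and succeeds if the CUDA sample
# reported a successful run.
vectoradd_passed() {
  grep -q 'Test PASSED'
}

# Usage:
#   d8 k wait --for=condition=complete --timeout=300s job/gpu-operator-test
#   d8 k logs job/gpu-operator-test | vectoradd_passed && echo "GPU compute works"
```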

How to switch to DRA mode?

Set dra.enabled: true in ModuleConfig:

apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: gpu
spec:
  enabled: true
  version: 1
  settings:
    dra:
      enabled: true

Requirements: Kubernetes ≥ 1.34. The module automatically removes the Device Plugin stack from d8-nvidia-gpu and deploys the DRA stack into the same namespace. No manual cleanup is required.

To verify the DRA stack is healthy:

d8 k get module gpu -o jsonpath='{.status.phase}'
d8 k -n d8-nvidia-gpu get deploy,ds
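Before enabling DRA, it may help to confirm that the cluster meets the Kubernetes 1.34 requirement stated above. The following is a sketch, not part of the module: `k8s_meets_minimum` is a hypothetical helper that compares a version string such as `v1.34.2` against that minimum.

```shell
# Succeeds if the given Kubernetes version (e.g. "v1.34.2") is 1.34 or newer.
k8s_meets_minimum() {
  version=${1#v}             # strip the leading "v"
  major=${version%%.*}       # text before the first dot
  rest=${version#*.}
  minor=${rest%%.*}          # text between the first and second dot
  minor=${minor%%[!0-9]*}    # drop suffixes such as "34+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 34 ]; }
}

# Usage:
#   k8s_meets_minimum "$(d8 k version -o json | jq -r '.serverVersion.gitVersion')" \
#     && echo "DRA mode can be enabled"
```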

`Incompatible strategy detected auto` in `nvidia-device-plugin`/`gpu-feature-discovery` logs

Errors like:

Incompatible strategy detected auto
failed to create resource manager: unsupported strategy auto
invalid device discovery strategy

mean the component cannot detect the NVML platform inside the container (typically libnvidia-ml.so.* is not available / NVIDIA Container Toolkit runtime is not in use).

What to check:

nvidia-smi works on the node.
NVIDIA Container Toolkit is installed (/usr/bin/nvidia-container-runtime exists).
containerd is configured to use the nvidia runtime on GPU nodes (the gpu module does this after the driver/toolkit installation and a containerd restart/node reboot).
After fixing, recreate nvidia-device-plugin-* and gpu-feature-discovery-* Pods in the d8-nvidia-gpu namespace.
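The last step, recreating the affected Pods, can be scripted by filtering Pod names by prefix. This is a minimal sketch that assumes the Pod naming shown earlier in this FAQ; `select_gpu_plugin_pods` is a hypothetical helper name.

```shell
# Reads `get pod -o name` output on stdin and keeps only the
# nvidia-device-plugin-* and gpu-feature-discovery-* Pods.
select_gpu_plugin_pods() {
  grep -E '^pod/(nvidia-device-plugin|gpu-feature-discovery)-'
}

# Usage (the Pods are recreated by their controllers after deletion):
#   d8 k -n d8-nvidia-gpu get pod -o name \
#     | select_gpu_plugin_pods \
#     | xargs -r d8 k -n d8-nvidia-gpu delete
```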

How to monitor GPUs?

Deckhouse Kubernetes Platform automatically deploys DCGM Exporter; GPU metrics are scraped by Prometheus and available in Grafana.

Which GPU modes are supported?

Exclusive — the node exposes the nvidia.com/gpu resource; each Pod receives an entire GPU.
TimeSlicing — time-sharing a single GPU among multiple Pods (default partitionCount: 4); Pods still request nvidia.com/gpu.
MIG (Multi-Instance GPU) — hardware partitioning of supported GPUs into independent instances; with the all-1g.5gb profile the cluster exposes resources like nvidia.com/mig-1g.5gb.

See GPU module examples.

How to view available MIG profiles in the cluster?

Pre-defined profiles are stored in the mig-parted-config ConfigMap inside the d8-nvidia-gpu namespace and can be viewed with:

d8 k -n d8-nvidia-gpu get cm mig-parted-config -o json | jq -r '.data["config.yaml"]'

The mig-configs: section lists the GPU models (by PCI ID) and the MIG profiles each card supports (e.g., all-1g.5gb, all-2g.10gb, all-balanced). Select the profile that matches your accelerator and set its name in spec.gpu.mig.partedConfig of the NodeGroup.
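To list only the profile names instead of the whole file, the YAML can be filtered with awk. The sketch below assumes the layout used by upstream mig-parted, where each profile is a key indented by two spaces directly under `mig-configs:`; the layout in your cluster may differ, and `list_mig_profiles` is a hypothetical helper name.

```shell
# Reads config.yaml on stdin and prints the keys found directly
# under the top-level "mig-configs:" section.
list_mig_profiles() {
  awk '
    /^mig-configs:/ { in_cfg = 1; next }   # entering the section
    /^[^ \t#]/      { in_cfg = 0 }         # another top-level key ends it
    in_cfg && /^  [^ \t#-][^:]*:[ \t]*$/ {
      name = $1
      sub(/:$/, "", name)
      print name
    }
  '
}

# Usage:
#   d8 k -n d8-nvidia-gpu get cm mig-parted-config -o json \
#     | jq -r '.data["config.yaml"]' | list_mig_profiles
```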

How to define a custom MIG profile per GPU on a node?

Use partedConfig: custom and describe MIG partitioning per GPU index:

gpu:
  sharing: MIG
  mig:
    partedConfig: custom
    customConfigs:
      - index: 0
        slices:
          - profile: "1g.10gb"
            count: 7
      - index: 1
        slices:
          - profile: "2g.20gb"
            count: 3

What the module does:

Generates a unique MIG config name for the NodeGroup and sets it in the nvidia.com/mig.config label.
For GPUs listed in customConfigs, renders mig-enabled: true with the declared slices.
For all unspecified indexes (all remaining GPUs on the node), renders mig-enabled: false, so those cards remain in full mode and do not override explicitly configured GPUs.
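A quick arithmetic check can catch over-committed slice lists before they are applied. The sketch below sums the compute slices requested for one GPU, taking the number before `g` in each profile name as the slice count. It assumes the budget of seven compute slices that NVIDIA documents for A100/H100-class GPUs; staying within the budget is necessary but not sufficient, because MIG placement rules also apply. `mig_compute_slices` is a hypothetical helper name.

```shell
# Reads "<profile> <count>" pairs on stdin (one per line) and prints
# the total number of compute slices they request.
mig_compute_slices() {
  awk '{ split($1, parts, "g"); total += parts[1] * $2 } END { print total + 0 }'
}

# Usage, with the values from the example above:
#   printf '1g.10gb 7\n' | mig_compute_slices   # GPU index 0
#   printf '2g.20gb 3\n' | mig_compute_slices   # GPU index 1
```

Under the seven-slice assumption, a total above 7 for a single GPU indicates that the declared slices cannot fit.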

MIG profile does not activate — what to check?

Check the GPU model. MIG is supported on H100, A100, and A30 GPUs and is not supported on V100 or T4. To confirm support for a specific model, refer to the profile tables in the NVIDIA MIG guide.
Ensure the GPU is not being used by OS processes or user applications. If a graphical environment, display manager, or other GPU-consuming processes are running on the node, applying the MIG configuration may fail or may not take effect until the GPU is released. Check this with the following command:
```
nvidia-smi
```

Check the NodeGroup configuration:

gpu:
  sharing: MIG
  mig:
    partedConfig: all-1g.5gb

Wait until nvidia-mig-manager completes the drain of the node and reconfigures the GPU. This process can take several minutes. While it is running, the node is tainted with mig-reconfigure. When the operation succeeds, that taint is removed.
Track the progress via the nvidia.com/mig.config.state label on the node: pending, rebooting, success (or failed if something goes wrong).

If nvidia.com/mig-* resources are still missing, check:

d8 k -n d8-nvidia-gpu logs daemonset/nvidia-mig-manager
nvidia-smi -L
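The state label mentioned above can also be polled from a script until reconfiguration finishes. This is a sketch under the assumption that `success` and `failed` are the only final values of `nvidia.com/mig.config.state`; `mig_state_is_final` is a hypothetical helper and the 10-second interval is arbitrary.

```shell
# Succeeds when the given label value is a final state.
mig_state_is_final() {
  case "$1" in
    success|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (replace <node-name> with the actual node):
#   node=<node-name>
#   until mig_state_is_final "$(d8 k get node "$node" \
#       -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"; do
#     sleep 10
#   done
```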

Are AMD or Intel GPUs supported?

At this time, Deckhouse Kubernetes Platform automatically configures NVIDIA GPUs only. Support for AMD (ROCm) and Intel GPUs is being worked on and is planned for future releases.
