The module lifecycle stage: General Availability.
The module has installation requirements.

How do I work with GPU nodes?

Step-by-step procedure for adding a GPU node to the cluster

Starting with Deckhouse 1.75, if a NodeGroup contains the spec.gpu section, the gpu module automatically:

  • configures containerd with default_runtime = "nvidia" (via NodeGroupConfiguration);
  • applies the required system settings (including fixes for the NVIDIA Container Toolkit);
  • deploys system components: NFD, GFD, NVIDIA Device Plugin, DCGM Exporter, and, if needed, MIG Manager.

Always specify the desired mode in spec.gpu.sharing (Exclusive, TimeSlicing, or MIG). Manual containerd configuration (via NodeGroupConfiguration, TOML, etc.) is not required and must not be combined with the automatic setup. For the list of supported NVIDIA Container Toolkit platforms, see the official documentation.

To add a GPU node to the cluster, perform the following steps:

  1. Create a NodeGroup for GPU nodes.

    An example with TimeSlicing enabled (partitionCount: 4) and a typical taint and label:

    apiVersion: deckhouse.io/v1
    kind: NodeGroup
    metadata:
      name: gpu
    spec:
      nodeType: CloudStatic # or Static/CloudEphemeral — depending on your infrastructure.
      gpu:
        sharing: TimeSlicing
        timeSlicing:
          partitionCount: 4
      nodeTemplate:
        labels:
          node-role/gpu: ""
        taints:
          - key: node-role
            value: gpu
            effect: NoSchedule

    If you use custom taint keys, make sure they are listed in the .spec.settings.modules.placement.customTolerationKeys array of the global ModuleConfig so that workloads can add the corresponding tolerations.

    Full field schema: see NodeGroup CR documentation.
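
    For reference, a minimal sketch of allowing the node-role taint key from the example above in the global ModuleConfig (the settings version is an assumption and may differ in your cluster; merge this with your existing global settings rather than replacing them):

    ```yaml
    apiVersion: deckhouse.io/v1alpha1
    kind: ModuleConfig
    metadata:
      name: global
    spec:
      version: 2            # settings schema version; may differ in your cluster
      settings:
        modules:
          placement:
            customTolerationKeys:
              - node-role   # the taint key used in the NodeGroup example above
    ```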

  2. Install the NVIDIA driver and nvidia-container-toolkit.

    Install the NVIDIA driver and the NVIDIA Container Toolkit on the nodes, either manually or via NodeGroupConfiguration. Below are NodeGroupConfiguration examples for the gpu NodeGroup.

    Ubuntu

    apiVersion: deckhouse.io/v1alpha1
    kind: NodeGroupConfiguration
    metadata:
      name: install-cuda.sh
    spec:
      bundles:
        - ubuntu-lts
      content: |
        #!/bin/bash
        if [ ! -f "/etc/apt/sources.list.d/nvidia-container-toolkit.list" ]; then
          distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
          curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
          curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
        fi
        bb-apt-install nvidia-container-toolkit nvidia-driver-535-server
        nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
      nodeGroups:
        - gpu
      weight: 30

    CentOS

    apiVersion: deckhouse.io/v1alpha1
    kind: NodeGroupConfiguration
    metadata:
      name: install-cuda.sh
    spec:
      bundles:
        - centos
      content: |
        #!/bin/bash
        if [ ! -f "/etc/yum.repos.d/nvidia-container-toolkit.repo" ]; then
          distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
          curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
        fi
        bb-dnf-install nvidia-container-toolkit nvidia-driver
        nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
      nodeGroups:
        - gpu
      weight: 30

    After these configurations are applied, bootstrap and reboot the nodes so that the settings take effect and the drivers are installed.

  3. Verify installation on the node using the command:

    nvidia-smi

    Expected healthy output (example):

    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla V100-PCIE-32GB           Off | 00000000:65:00.0 Off |                    0 |
    | N/A   32C    P0              35W / 250W |      0MiB / 32768MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
  4. Verify infrastructure components in the cluster.

    NVIDIA Pods in d8-nvidia-gpu:

    d8 k -n d8-nvidia-gpu get pod

    Expected healthy output (example):

    NAME                                  READY   STATUS    RESTARTS   AGE
    gpu-feature-discovery-80ceb7d-r842q   2/2     Running   0          2m53s
    nvidia-dcgm-exporter-w9v9h            1/1     Running   0          2m53s
    nvidia-dcgm-njqqb                     1/1     Running   0          2m53s
    nvidia-device-plugin-80ceb7d-8xt8g    2/2     Running   0          2m53s
    

    NFD Pods in d8-nvidia-gpu:

    d8 k -n d8-nvidia-gpu get pods | egrep '^(NAME|node-feature-discovery)'

    Expected healthy output (example):

    NAME                                             READY   STATUS      RESTARTS       AGE
    node-feature-discovery-gc-6d845765df-45vpj       1/1     Running     0              3m6s
    node-feature-discovery-master-74696fd9d5-wkjk4   1/1     Running     0              3m6s
    node-feature-discovery-worker-5f4kv              1/1     Running     0              3m8s
    

    Resource exposure on the node:

    d8 k describe node <node-name>

    Output snippet (example):

    Capacity:
      cpu:                40
      memory:             263566308Ki
      nvidia.com/gpu:     4
    Allocatable:
      cpu:                39930m
      memory:             262648294441
      nvidia.com/gpu:     4
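
    As a quick cross-check, the exposed GPU count can be read directly from the node status (a sketch; <node-name> is a placeholder):

    ```shell
    d8 k get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
    ```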
    
  5. Run functional tests.

    Option A. Invoke nvidia-smi from inside a container:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nvidia-cuda-test
      namespace: default
    spec:
      completions: 1
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            node.deckhouse.io/group: gpu
          containers:
            - name: nvidia-cuda-test
              image: nvidia/cuda:11.6.2-base-ubuntu20.04
              imagePullPolicy: "IfNotPresent"
              command:
                - nvidia-smi

    Check the logs:

    d8 k logs job/nvidia-cuda-test

    Option B. CUDA sample (vectoradd):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: gpu-operator-test
      namespace: default
    spec:
      completions: 1
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            node.deckhouse.io/group: gpu
          containers:
            - name: gpu-operator-test
              image: nvidia/samples:vectoradd-cuda10.2
              imagePullPolicy: "IfNotPresent"

The "Incompatible strategy detected auto" error in nvidia-device-plugin or gpu-feature-discovery logs

Errors like:

  • Incompatible strategy detected auto
  • failed to create resource manager: unsupported strategy auto
  • invalid device discovery strategy

mean the component cannot detect the NVML platform inside the container (typically libnvidia-ml.so.* is not available / NVIDIA Container Toolkit runtime is not in use).

What to check:

  1. nvidia-smi works on the node.
  2. NVIDIA Container Toolkit is installed (/usr/bin/nvidia-container-runtime exists).
  3. containerd is configured to use the nvidia runtime on GPU nodes (the gpu module does this after the driver/toolkit installation and a containerd restart/node reboot).
  4. After fixing, recreate nvidia-device-plugin-* and gpu-feature-discovery-* Pods in the d8-nvidia-gpu namespace.
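
The checks above can be sketched as a short command sequence; the containerd config location and the Pod label selectors below are assumptions, so adjust them to your environment:

```shell
# On the GPU node: driver and toolkit are present.
nvidia-smi
command -v nvidia-container-runtime

# Confirm containerd references the nvidia runtime
# (config location is an assumption; it may live elsewhere on your nodes).
grep -R 'nvidia' /etc/containerd/

# In the cluster: recreate the affected Pods after fixing
# (label selectors are assumptions; check the actual Pod labels first).
d8 k -n d8-nvidia-gpu delete pod -l app=nvidia-device-plugin
d8 k -n d8-nvidia-gpu delete pod -l app=gpu-feature-discovery
```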

How to monitor GPUs?

Deckhouse Kubernetes Platform automatically deploys DCGM Exporter; GPU metrics are scraped by Prometheus and available in Grafana.
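
For example, current GPU utilization can be queried with a standard DCGM Exporter metric (a sketch; label names depend on the exporter configuration):

```
avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL)
```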

Which GPU modes are supported?

  • Exclusive — the node exposes the nvidia.com/gpu resource; each Pod receives an entire GPU.
  • TimeSlicing — time-sharing a single GPU among multiple Pods (default partitionCount: 4); Pods still request nvidia.com/gpu.
  • MIG (Multi-Instance GPU) — hardware partitioning of supported GPUs into independent instances; with the all-1g.5gb profile the cluster exposes resources like nvidia.com/mig-1g.5gb.
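
In every mode, a workload consumes a GPU by requesting the corresponding extended resource. A minimal Pod sketch for the Exclusive/TimeSlicing case (image and taint key follow the examples earlier in this page; for MIG, request a resource such as nvidia.com/mig-1g.5gb instead):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    node.deckhouse.io/group: gpu
  tolerations:
    - key: node-role        # must be allowed via customTolerationKeys
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: app
      image: nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # for MIG: e.g. nvidia.com/mig-1g.5gb: 1
```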

See examples in Managing nodes: examples.

How to view available MIG profiles in the cluster?

Pre-defined profiles are stored in the mig-parted-config ConfigMap inside the d8-nvidia-gpu namespace and can be viewed with:

d8 k -n d8-nvidia-gpu get cm mig-parted-config -o json | jq -r '.data["config.yaml"]'

The mig-configs section lists the GPU models (by PCI ID) and the MIG profiles each card supports (e.g., all-1g.5gb, all-2g.10gb, all-balanced). Select the profile that matches your accelerator and set its name in spec.gpu.mig.partedConfig of the NodeGroup.

How to define a custom MIG profile per GPU on a node?

Use partedConfig: custom and describe MIG partitioning per GPU index:

gpu:
  sharing: MIG
  mig:
    partedConfig: custom
    customConfigs:
      - index: 0
        slices:
          - profile: "1g.10gb"
            count: 7
      - index: 1
        slices:
          - profile: "2g.20gb"
            count: 3

What the module does:

  1. Generates a unique MIG config name for the NodeGroup and sets it in the nvidia.com/mig.config label.
  2. For GPUs listed in customConfigs, renders mig-enabled: true with the declared slices.
  3. For all unspecified indexes (all remaining GPUs on the node), renders mig-enabled: false, so those cards remain in full mode and do not override explicitly configured GPUs.
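
For the customConfigs example above, the rendered result would follow the upstream mig-parted configuration format roughly like this (a sketch; the generated config name and exact layout are produced by the module and may differ):

```yaml
mig-configs:
  <generated-name>:          # unique per NodeGroup, set in nvidia.com/mig.config
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
    - devices: [1]
      mig-enabled: true
      mig-devices:
        "2g.20gb": 3
    - devices: all           # all remaining GPUs stay in full mode
      mig-enabled: false
```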

MIG profile does not activate — what to check?

  1. GPU model: MIG is supported on H100/A100/A30; it is not supported on V100/T4. See the profile tables in the NVIDIA MIG guide.

  2. NodeGroup configuration:

    gpu:
      sharing: MIG
      mig:
        partedConfig: all-1g.5gb
  3. Wait until nvidia-mig-manager completes the drain of the node and reconfigures the GPU. This process can take several minutes. While it is running, the node is tainted with mig-reconfigure. When the operation succeeds, that taint is removed.

  4. Track the progress via the nvidia.com/mig.config.state label on the node: pending, rebooting, success (or failed if something goes wrong).

  5. If nvidia.com/mig-* resources are still missing, check:

    d8 k -n d8-nvidia-gpu logs daemonset/nvidia-mig-manager
    nvidia-smi -L
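
    The label state from step 4 can also be read directly (a sketch; <node-name> is a placeholder):

    ```shell
    d8 k get node <node-name> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
    ```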

Are AMD or Intel GPUs supported?

At this time, Deckhouse Kubernetes Platform automatically configures NVIDIA GPUs only. Support for AMD (ROCm) and Intel GPUs is being worked on and is planned for future releases.