
Restoring a master node when kubelet can’t load control plane components

This situation may occur if the control plane component images were deleted in a cluster with a single master node (for example, the /var/lib/containerd directory was deleted). In this case, kubelet can’t pull the control plane component images during a restart, since the master node lacks authorization parameters required for accessing registry.deckhouse.io.


To restore the master node, follow these steps:

  1. Run the following command in any cluster managed by Deckhouse:

    d8 k -n d8-system get secrets deckhouse-registry -o json |
    jq -r '.data.".dockerconfigjson"' | base64 -d |
    jq -r '.auths."registry.deckhouse.io".auth'
    
  2. Copy the command output and assign it to the AUTH variable on the corrupted master node.
  3. Pull the control plane component images to the corrupted master node:

    for image in $(grep "image:" /etc/kubernetes/manifests/* | awk '{print $3}'); do
      crictl pull --auth "$AUTH" "$image"
    done
    
  4. Once the images are pulled, restart kubelet.
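
For example, on a node where kubelet runs as a systemd unit (the typical setup; adjust to your environment), restarting and verifying it might look like this:

systemctl restart kubelet
# Once kubelet is back up, the static control plane pods should be running again:
crictl ps | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'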

Changing CRI for a NodeGroup

Container Runtime Interface (CRI) can only be switched from Containerd to NotManaged and back.

To change the CRI for a NodeGroup, set the cri.type parameter to either Containerd or NotManaged.

Example of a NodeGroup YAML manifest:

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: worker
spec:
  nodeType: Static
  cri:
    type: Containerd

Alternatively, you can change the CRI using the following patch:

  • For Containerd:

    d8 k patch nodegroup <NodeGroup name> --type merge -p '{"spec":{"cri":{"type":"Containerd"}}}'
    
  • For NotManaged:

    d8 k patch nodegroup <NodeGroup name> --type merge -p '{"spec":{"cri":{"type":"NotManaged"}}}'
    

When changing cri.type for NodeGroup objects created using dhctl, change it both in dhctl config edit provider-cluster-configuration and in the NodeGroup object configuration.

After a new CRI is set for a NodeGroup, the node-manager module drains the nodes one by one and installs the new CRI on them. Updating a node involves a disruption. Depending on the disruption settings of the NodeGroup, the node-manager module either applies node updates automatically or requires manual confirmation.
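
If manual confirmation is required, the disruption is approved with the same annotation that is used for master nodes later in this document. A minimal sketch, assuming a NodeGroup named worker with disruptions.approvalMode set to Manual:

# List the nodes of the group to see which one is waiting for the update.
d8 k get nodes -l node.deckhouse.io/group=worker
# Confirm the disruption for the node that is being updated.
d8 k annotate node <node name> update.node.deckhouse.io/disruption-approved=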

Changing CRI for the whole cluster

CRI can only be switched from Containerd to NotManaged and back.

To change the CRI for the whole cluster, use the dhctl utility and edit the defaultCRI parameter in cluster-configuration.
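
For reference, an illustrative fragment of the ClusterConfiguration document (other parameters omitted); the defaultCRI field accepts Containerd or NotManaged:

apiVersion: deckhouse.io/v1
kind: ClusterConfiguration
# ... other parameters ...
defaultCRI: Containerd   # or NotManaged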

Alternatively, you can change the CRI for the cluster using the following patch:

  • For Containerd:

    data="$(d8 k -n kube-system get secret d8-cluster-configuration -o json | jq -r '.data."cluster-configuration.yaml"' | base64 -d | sed "s/NotManaged/Containerd/" | base64 -w0)"
    d8 k -n kube-system patch secret d8-cluster-configuration -p '{"data":{"cluster-configuration.yaml":"'${data}'"}}'
    
  • For NotManaged:

    data="$(d8 k -n kube-system get secret d8-cluster-configuration -o json | jq -r '.data."cluster-configuration.yaml"' | base64 -d | sed "s/Containerd/NotManaged/" | base64 -w0)"
    d8 k -n kube-system patch secret d8-cluster-configuration -p '{"data":{"cluster-configuration.yaml":"'${data}'"}}'
    

If you need to keep a NodeGroup on a different CRI, before changing defaultCRI, set the CRI for this NodeGroup following the corresponding procedure.

Changing defaultCRI changes the CRI on all nodes, including master nodes. If there is only one master node, this operation is dangerous and can lead to a complete failure of the cluster. The preferred option is to switch to a multi-master setup and then change the CRI type.
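
To check how many master nodes the cluster currently has, you can use the same label selector as in the steps below:

d8 k get nodes -l node-role.kubernetes.io/control-plane=""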

When you change the CRI in the cluster, complete several additional steps for the master nodes:

  1. Deckhouse updates nodes in the master NodeGroup one by one, so to find out which node is currently being updated, run the following command:

    d8 k get nodes -l node-role.kubernetes.io/control-plane="" -o json | jq '.items[] | select(.metadata.annotations."update.node.deckhouse.io/approved"=="") | .metadata.name' -r
    
  2. Confirm the disruption of the master node you found in the previous step:

    d8 k annotate node <master node name> update.node.deckhouse.io/disruption-approved=
    
  3. Wait for the updated master node to switch to the Ready state. Repeat the steps for the next master node.
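
To wait for a master node to become Ready again before moving on, you can use kubectl's wait (a sketch; adjust the timeout to your environment):

d8 k wait --for=condition=Ready node/<master node name> --timeout=10m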

How to use containerd with Nvidia GPU support?

  1. Create a separate NodeGroup for GPU nodes:

    apiVersion: deckhouse.io/v1
    kind: NodeGroup
    metadata:
      name: gpu
    spec:
      chaos:
        mode: Disabled
      disruptions:
        approvalMode: Automatic
      nodeType: CloudStatic
    
  2. Create NodeGroupConfiguration for the gpu NodeGroup for containerd configuration:

    apiVersion: deckhouse.io/v1alpha1
    kind: NodeGroupConfiguration
    metadata:
      name: containerd-additional-config.sh
    spec:
      bundles:
      - '*'
      content: |
        # Copyright 2023 Flant JSC
        #
        # Licensed under the Apache License, Version 2.0 (the "License");
        # you may not use this file except in compliance with the License.
        # You may obtain a copy of the License at
        #
        #     http://www.apache.org/licenses/LICENSE-2.0
        #
        # Unless required by applicable law or agreed to in writing, software
        # distributed under the License is distributed on an "AS IS" BASIS,
        # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        # See the License for the specific language governing permissions and
        # limitations under the License.
    
        mkdir -p /etc/containerd/conf.d
        bb-sync-file /etc/containerd/conf.d/nvidia_gpu.toml - << "EOF"
        [plugins]
          [plugins."io.containerd.grpc.v1.cri"]
            [plugins."io.containerd.grpc.v1.cri".containerd]
              default_runtime_name = "nvidia"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                    privileged_without_host_devices = false
                    runtime_engine = ""
                    runtime_root = ""
                    runtime_type = "io.containerd.runc.v1"
                    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                      BinaryName = "/usr/bin/nvidia-container-runtime"
                      SystemdCgroup = false
        EOF
      nodeGroups:
      - gpu
      weight: 31
    
  3. Add a NodeGroupConfiguration to install Nvidia drivers on the gpu NodeGroup, using the example below that matches the operating system of the node (Ubuntu or CentOS).
  4. Bootstrap and reboot the node.

Ubuntu

apiVersion: deckhouse.io/v1alpha1
kind: NodeGroupConfiguration
metadata:
  name: install-cuda.sh
spec:
  bundles:
  - ubuntu-lts
  content: |
    # Copyright 2023 Flant JSC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    if [ ! -f "/etc/apt/sources.list.d/nvidia-container-toolkit.list" ]; then
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
      curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    fi
    bb-apt-install nvidia-container-toolkit nvidia-driver-535-server
    nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
  nodeGroups:
  - gpu
  weight: 30

CentOS

apiVersion: deckhouse.io/v1alpha1
kind: NodeGroupConfiguration
metadata:
  name: install-cuda.sh
spec:
  bundles:
  - centos
  content: |
    # Copyright 2023 Flant JSC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    if [ ! -f "/etc/yum.repos.d/nvidia-container-toolkit.repo" ]; then
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    fi
    bb-yum-install nvidia-container-toolkit nvidia-driver
    nvidia-ctk config --set nvidia-container-runtime.log-level=error --in-place
  nodeGroups:
  - gpu
  weight: 30
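
After the node has been bootstrapped and rebooted, you can verify directly on it that the driver and the containerd drop-in are in place (a quick sanity check before running the in-cluster tests described in the next section):

nvidia-smi                                  # the driver should detect the GPU
ls /etc/containerd/conf.d/nvidia_gpu.toml   # the drop-in created by the NodeGroupConfiguration above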

How to check if it was successful?

Create the Job named nvidia-cuda-test in the cluster:

apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-cuda-test
  namespace: default
spec:
  completions: 1
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        node.deckhouse.io/group: gpu
      containers:
        - name: nvidia-cuda-test
          image: nvidia/cuda:11.6.2-base-ubuntu20.04
          imagePullPolicy: "IfNotPresent"
          command:
            - nvidia-smi

Check the logs by running the following command:

d8 k logs job/nvidia-cuda-test

Example of the output:

Tue Jan 24 11:36:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   45C    P0    25W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Create the Job named gpu-operator-test in the cluster:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-test
  namespace: default
spec:
  completions: 1
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        node.deckhouse.io/group: gpu
      containers:
        - name: gpu-operator-test
          image: nvidia/samples:vectoradd-cuda10.2
          imagePullPolicy: "IfNotPresent"

Check the logs by running the following command:

d8 k logs job/gpu-operator-test

Example of the output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

How to manually add several static nodes to the cluster?

Use the existing NodeGroup custom resource or create a new one.

Example of a NodeGroup resource named worker:

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: worker
spec:
  nodeType: Static

You can automate the process of adding the nodes by using any configuration automation tool. The following is an example using Ansible.

  1. Find out the Kubernetes API server IP address. Note that this IP address must be accessible from the nodes you will be adding to the cluster. Run the following command:

    d8 k -n default get ep kubernetes -o json | jq '.subsets[0].addresses[0].ip + ":" + (.subsets[0].ports[0].port | tostring)' -r
    

    Check the Kubernetes version. If the version is 1.25 or newer, create a node-group token:

    d8 k create token node-group --namespace d8-cloud-instance-manager --duration 1h
    

    Copy the token you receive and add it to the token field in the Ansible playbook in the following steps.

  2. If the Kubernetes version is older than 1.25, get the Kubernetes API token for the ServiceAccount managed by Deckhouse:

    d8 k -n d8-cloud-instance-manager get $(d8 k -n d8-cloud-instance-manager get secret -o name | grep node-group-token) \
      -o json | jq '.data.token' -r | base64 -d && echo ""
    
  3. Create an Ansible playbook and replace the vars values with the data you received in the previous steps.

    - hosts: all
      become: yes
      gather_facts: no
      vars:
        kube_apiserver: <KUBE_APISERVER>
        token: <TOKEN>
      tasks:
        - name: Check if the node is already bootstrapped
          stat:
            path: /var/lib/bashible
          register: bootstrapped
        - name: Get the bootstrap secret
          uri:
            url: "https://{{ kube_apiserver }}/api/v1/namespaces/d8-cloud-instance-manager/secrets/manual-bootstrap-for-{{ node_group }}"
            return_content: yes
            method: GET
            status_code: 200
            body_format: json
            headers:
              Authorization: "Bearer {{ token }}"
            validate_certs: no
          register: bootstrap_secret
          when: bootstrapped.stat.exists == False
        - name: Run bootstrap.sh
          shell: "{{ bootstrap_secret.json.data['bootstrap.sh'] | b64decode }}"
          args:
            executable: /bin/bash
          ignore_errors: yes
          when: bootstrapped.stat.exists == False
        - name: wait
          wait_for_connection:
            delay: 30
          when: bootstrapped.stat.exists == False
    
  4. Define the additional node_group variable. The variable value must match the name of a NodeGroup that the node will be assigned to. There are several ways to pass the variable, for example, using the inventory file:

    [system]
    system-0
    system-1
    
    [system:vars]
    node_group=system
    
    [worker]
    worker-0
    worker-1
    
    [worker:vars]
    node_group=worker
    
  5. Run the playbook using the inventory file.
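
A hedged example invocation, assuming the playbook is saved as bootstrap-node.yml and the inventory as hosts.ini (both names are arbitrary):

ansible-playbook -i hosts.ini bootstrap-node.yml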

How do I make werf ignore the Ready state in a node group?

werf monitors the Ready status of resources; if a resource reports this status, werf waits until its value becomes True.

Creating (or updating) a NodeGroup resource in a cluster can take a significant amount of time, since the required number of nodes has to be deployed. When deploying such a resource with werf (for example, as part of a CI/CD process), the deployment may fail when the resource readiness timeout is exceeded.

To make werf ignore the NodeGroup status, add the following NodeGroup annotations:

metadata:
  annotations:
    werf.io/fail-mode: IgnoreAndContinueDeployProcess
    werf.io/track-termination-mode: NonBlocking
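
For example, combined with the worker NodeGroup manifest shown earlier, the complete resource deployed by werf might look like this (a sketch; adjust the name and nodeType to your setup):

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: worker
  annotations:
    werf.io/fail-mode: IgnoreAndContinueDeployProcess
    werf.io/track-termination-mode: NonBlocking
spec:
  nodeType: Static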