The page contains frequently asked questions about configuring and using the Deckhouse Kubernetes Platform.

Cluster and Infrastructure

What to do if the resources generated by an object are not created after adding the object?

If the required resources do not appear after creating an object in the system (for example, a DexAuthenticator), follow these steps:

  1. Check if there are any critical alerts in the cluster that may block the creation of the required objects. To do this, use the command:

    d8 k get clusteralerts.deckhouse.io
    

    Example output:

    NAME               ALERT                                           SEVERITY   AGE   LAST RECEIVED   STATUS
    012f602592aa7a91   K8SSchedulerTargetDown                          3          16h   54s             firing
    0836dc893d5ecc65   KubernetesDeploymentReplicasUnavailable         5          15h   62s             firing
    08742f87d62d0063   NTPDaemonOnNodeDoesNotSynchronizeTime           5          16h   46s             firing
    172cfd38d2f7fd19   D8DeckhouseQueueIsHung                          7          12h   66s             firing
    1c5705daf731f5cf   D8StrongholdNoActiveNodes                       3          16h   55s             firing
    1d2c2f7d69f69f4b   D8DeckhouseIsNotOnReleaseChannel                9          12h   53s             firing
    205a551243d795f3   D8LogShipperAgentNotScheduledInCluster          7          15h   63s             firing
    2e34039aa7a3018e   D8NodeIsNotUpdating                             9          12h   47s             firing
    31baf9a70d657275   D8StrongholdClusterNotHealthy                   7          16h   55s             firing
    

    For more information about alerts, see The list of alerts section.

  2. Check the Deckhouse task queue:

    d8 s queue list
    

    Example output (queues are empty):

    Summary:
    - 'main' queue: empty.
    - 88 other queues (0 active, 88 empty): 0 tasks.
    - no tasks to handle.
    

    If there are many unprocessed or long-running tasks in the queue, this may indicate problems.

  3. Analyze DKP logs and events:

    • To view logs in real time, use the command:

      d8 k -n d8-system logs -f -l app=deckhouse
      

      Example output:

      {"level":"info","logger":"addon-operator","msg":"ConvergeModules task for OperatorStartup in phase '', trigger is Operator-Startup","binding":"ConvergeModules","event.type":"OperatorStartup","queue":"main","task.flow":"start","task.id":"fde0eb3b-5c3e-4da6-a0d8-a52f8ae03428","time":"2025-11-26T08:29:33Z"}
      {"level":"warn","logger":"addon-operator.converge-modules","msg":"ConvergeModules: functional scheduler not finished","binding":"ConvergeModules","event.type":"OperatorStartup","queue":"main","task.id":"fde0eb3b-5c3e-4da6-a0d8-a52f8ae03428","time":"2025-11-26T08:29:33Z"}
      

      When analyzing logs, pay special attention to warnings (WARNING) and error messages (ERROR).

    • To view events, use the command:

      d8 k -n d8-system get events
      

      Example output:

      LAST SEEN   TYPE      REASON              OBJECT                                          MESSAGE
      11m         Warning   Unhealthy           pod/deckhouse-5886c9bd77-vgdbw                  Readiness probe failed: HTTP probe failed with statuscode: 500
      7m22s       Normal    SuccessfulDelete    replicaset/deckhouse-5886c9bd77                 Deleted pod: deckhouse-5886c9bd77-vgdbw
      7m20s       Normal    Scheduled           pod/deckhouse-6bc5c4494-fwx6z                   Successfully assigned d8-system/deckhouse-6bc5c4494-fwx6z to sandbox1-master-0
      7m20s       Normal    Pulling             pod/deckhouse-6bc5c4494-fwx6z                   Pulling image "dev-registry.deckhouse.io/sys/deckhouse-oss@sha256:17ac07634e17422df52720264cddec3916ed6985a77782dc8a24fe5352290e6e"
      

    When analyzing events, pay special attention to those with the Warning type.
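
If the event list is long, it can be narrowed down to warnings only. The field selector below is a standard kubectl option (wrapped by d8 k), not a DKP-specific one:

d8 k -n d8-system get events --field-selector type=Warning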

Deckhouse

How to check the job queue in Deckhouse?

How to check the status of all Deckhouse task queues?

To view the status of all Deckhouse job queues, run the following command:

d8 s queue list

Example output (queues are empty):

Summary:
- 'main' queue: empty.
- 88 other queues (0 active, 88 empty): 0 tasks.
- no tasks to handle.

How to view the status of the main task queue?

To view the status of the Deckhouse main task queue, run the following command:

d8 s queue main

Example output (38 tasks in the main queue):

Queue 'main': length 38, status: 'run first task'

Example output (the main queue is empty):

Queue 'main': length 0, status: 'waiting for task 0s'
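
If the main queue is not draining, you can re-run the command periodically to see whether its length changes. A minimal sketch using the standard watch utility (not part of d8):

watch -n 10 'd8 s queue main'

If the length stays the same for a long time, the queue is probably hung (see the D8DeckhouseQueueIsHung alert).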

What to do if there are problems updating DKP?

Deckhouse Kubernetes Platform update fails, one or more Deckhouse pods are in an unworkable state

If the Deckhouse Kubernetes Platform update fails and one or more Deckhouse pods in the d8-system namespace are in an unworkable state, perform the following steps:

  1. Check the Deckhouse logs using the command:

    d8 k -n d8-system logs -f -l app=deckhouse | jq -Rr 'fromjson? | .msg'
    

    If there are any problems, information about them will be included in the output. When analyzing logs, pay special attention to warnings (WARNING) and error messages (ERROR); a sketch for filtering out error messages only is shown after this list.

  2. Check Deckhouse events using the command:

    d8 k -n d8-system describe po -l app=deckhouse | awk '
    /^Name:/ { 
        pod = $2; 
        print "=== " pod " ==="; 
        in_events = 0 
    }
    /Events:/ { 
        in_events = 1; 
        next 
    }
    in_events && /^$/ { 
        in_events = 0; 
        print "---" 
    }
    in_events && !/^Events:/ { 
        print $0 
    }
    ' | sed '/^---$/N;/^\n$/D'
    

    Pod events contain key information about problems (e.g., planning errors, image loading errors, etc.). When analyzing events, pay special attention to those with the Warning type.

    Example output:

    Type     Reason     Age                      From     Message
    ----     ------     ----                     ----     -------
    Warning  Unhealthy  4m44s (x1918 over 154m)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
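
To focus on errors only in the Deckhouse logs from step 1, the level field of the JSON log entries can be used in a jq filter. A sketch, assuming error entries use the level value "error" (by analogy with the info and warn entries shown earlier):

d8 k -n d8-system logs -l app=deckhouse --tail=2000 | jq -Rr 'fromjson? | select(.level=="error") | .msg'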
    

DKP update is stuck in the Release is suspended status

The Release is suspended status indicates that the release has been postponed and is currently not recommended for installation. In this case, it is recommended to remain on the latest available release or on the one currently installed (it will have the Deployed status).

To view the list of releases, use the command:

d8 k get deckhousereleases.deckhouse.io

Example output:

NAME       PHASE        TRANSITIONTIME   MESSAGE
v1.69.13   Skipped      3h46m
v1.69.14   Skipped      3h46m
v1.69.15   Skipped      3h46m
v1.69.16   Superseded   160m
v1.70.12   Suspended    49d              Release is suspended
v1.70.13   Skipped      36d
v1.70.14   Skipped      34d
v1.70.15   Skipped      28d
v1.70.16   Skipped      19d
v1.70.17   Deployed     160m
v1.71.3    Suspended    14d              Release is suspended
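
To quickly find the currently installed release in a long list, you can filter the output by phase (a simple grep over the human-readable output shown above):

d8 k get deckhousereleases.deckhouse.io | grep Deployed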

IAM

What to do if there are problems applying DexProvider settings?

If you have changed the DexProvider settings in the user-authn module and one of the following problems occurs:

  • No changes are visible (settings are not applied).
  • When attempting to log in to the platform’s web interface with any type of authorization, a 500 Internal Server Error occurs without a detailed description.

Perform the following steps:

  1. Check the status of the dex deployment:

    d8 k -n d8-user-authn get pod
    

    Example output:

    NAME                                    READY   STATUS    RESTARTS   AGE
    dex-5ddb779b7d-6pbhs                    2/2     Running   0          20h
    kubeconfig-generator-7c46977b9f-5kdmc   1/1     Running   0          20h
    

    If the module is functioning properly and the correct configuration is specified in DexProvider, all pods will have the status Running. If there is a problem, one or more pods will have a status other than Running, and it will be impossible to log into the platform’s web interface with any type of authorization.

  2. Check the logs for the problematic pod:

    d8 k -n d8-user-authn logs dex-<pod-name>
    

    Based on the information from the logs, correct the configuration in the DexProvider resource and wait for the dex pods to restart. Within a few minutes, the pods will restart automatically, and the platform’s web interface (located at console.<CLUSTER_NAME_TEMPLATE>) will become available and will reflect the changes made to the DexProvider resource.
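
Instead of looking up the exact pod name, logs can also be streamed from all dex pods at once by label. The example below assumes the dex pods carry the app=dex label, following the usual labeling convention of Deckhouse components:

d8 k -n d8-user-authn logs -f -l app=dex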

Kubernetes and Scheduling

What to do when the API server is overloaded?

The following signs may indicate problems with API server load and memory consumption:

  • kubectl (d8 k) commands are executed slowly or do not complete at all.
  • Pods are recreated in the cluster for no apparent reason.

If these signs are present, perform the following actions:

  1. Check the resource consumption of API server pods. To do this, use the command:

    d8 k -n kube-system top po -l component=kube-apiserver
    

    Pay attention to CPU and MEMORY consumption (see the sorting example after this list).

    Example output:

    NAME                               CPU(cores)   MEMORY(bytes)
    kube-apiserver-sandbox1-master-0   251m         1476Mi
    
  2. Check the metrics in Grafana.

    To view the metrics, open the dashboard “Home” → “Dashboards” → “Kubernetes Cluster” → “Control Plane Status”. Review the graphs related to the API server (“Kube-apiserver CPU Usage”, “Kube-apiserver Memory Usage”, “Kube-apiserver latency”, etc.).

  3. Review the API server audit logs to identify the source of the load. A common cause of high memory consumption is a large number of requests to the API server.
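
In clusters with several API server replicas, sorting the output from step 1 helps to spot the busiest instance. The --sort-by flag below is a standard kubectl top option, not a DKP-specific one:

d8 k -n kube-system top po -l component=kube-apiserver --sort-by=memory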

How to check the Kubernetes version in use?

To check the Kubernetes version in use, run the command:

d8 k get nodes

Example output:

NAME                   STATUS   ROLES                  AGE    VERSION
frontend-0             Ready    frontend               118d   v1.31.9
master-0               Ready    control-plane,master   118d   v1.31.9
master-1               Ready    control-plane,master   118d   v1.31.9
master-2               Ready    control-plane,master   118d   v1.31.9
system-0               Ready    system                 118d   v1.31.9
system-1               Ready    system                 118d   v1.31.9
worker-0               Ready    worker                 37d    v1.31.9
worker-1               Ready    worker                 19d    v1.31.9
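
The client and API server versions can also be queried directly (a standard kubectl subcommand wrapped by d8 k):

d8 k version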

What to do if you encounter problems updating Kubernetes components on cluster nodes, synchronizing nodes, or applying NodeGroup Configuration?

If Kubernetes components are not updated on a cluster node, the NodeGroup configuration is not applied, or not all nodes in a NodeGroup are synchronized (have the UPTODATE status), perform the following steps:

  1. Check the bashible logs on the node where the problems are occurring. The bashible mechanism keeps cluster nodes up to date: it runs as a service on the cluster nodes and is started at regular intervals by the bashible.timer timer. Each run synchronizes the node configuration scripts and executes them if necessary.

    To check bashible logs, use the command:

    journalctl -u bashible
    

    If the response contains the message Configuration is in sync, nothing to do, the node is synchronized and there are no problems. The absence of this message or the presence of errors indicates a problem.

  2. Check the synchronization status of cluster nodes using the command:

    d8 k get ng
    

    The number of nodes in the UPTODATE state must match the total number of nodes in each group.

    Example output:

    NAME       TYPE     READY   NODES   UPTODATE   INSTANCES   DESIRED   MIN   MAX   STANDBY   STATUS   AGE    SYNCED
    frontend   Static   1       1       1                                                               118d   True
    master     Static   3       3       3                                                               118d   True
    system     Static   2       2       2                                                               118d   True
    worker     Static   2       2       2                                                               118d   True
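
If a group is not fully synchronized, details about the cause are usually reported in the NodeGroup status. You can inspect it with describe; the group name below (worker) is only an example:

d8 k describe ng worker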
    

What to do if there are problems adding a node to the cluster via Cluster API Provider Static?

If, when adding a node to the cluster via Cluster API Provider Static (CAPS), it remains in Pending or Bootstrapping status, perform the following steps:

  1. Verify that the access keys specified in the SSHCredentials resource are correct. Ensure that the username and SSH key specified in SSHCredentials are correct.

  2. On the node where the problem occurred, check that the public key corresponding to the private key from SSHCredentials is present in authorized_keys. Example command for checking:

    cat ~/.ssh/authorized_keys
    
  3. Check the NodeGroup that should include the node being added and make sure that its maximum number of nodes is not exceeded.

  4. Check the status of the bashible.service on the node that caused the problem:

    systemctl status bashible.service
    

    The service must have the status active (running). If it has the status inactive or failed, it has not started, which indicates a problem with the node configuration process.

  5. If the steps above did not resolve the issue, remove the problematic node and its StaticInstance resource from the cluster so that the system will attempt to recreate them. To do this:

    • Get a list of nodes and locate the problematic one:

      d8 k get nodes
      
    • Find the corresponding StaticInstance resource:

      kubectl get staticinstances -n <namespace-name>
      
    • Remove the problematic node:

      kubectl delete node <node-name>
      
    • Remove the corresponding StaticInstance resource:

      kubectl delete staticinstances -n <namespace-name> <static-instance-name>
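
After the node and its StaticInstance resource have been removed, you can watch the bootstrap being retried. The -w flag is a standard kubectl watch option; the namespace placeholder is kept as in the commands above:

kubectl get staticinstances -n <namespace-name> -w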
      

How to change the instance type for nodes with the CloudPermanent type?

To change the instance type for nodes with the CloudPermanent type, follow these steps:

  1. Make a backup of etcd and the /etc/kubernetes directory.
  2. Transfer the archive to a server outside the cluster (e.g., on a local machine).
  3. Ensure there are no alerts in the cluster that can prevent the update of the master nodes.
  4. Make sure that Deckhouse queue is empty. To view the status of all Deckhouse job queues, run the following command:

    d8 s queue list
    

    Example output (queues are empty):

    Summary:
    - 'main' queue: empty.
    - 88 other queues (0 active, 88 empty): 0 tasks.
    - no tasks to handle.
    
  5. Run the appropriate edition and version of the Deckhouse installer container on the local machine (change the container registry address if necessary):

    DH_VERSION=$(d8 k -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/version}') 
    DH_EDITION=$(d8 k -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/edition}' | tr '[:upper:]' '[:lower:]' ) 
    docker run --pull=always -it -v "$HOME/.ssh/:/tmp/.ssh/" \
      registry.deckhouse.ru/deckhouse/${DH_EDITION}/install:${DH_VERSION} bash
    
  6. In the installer container, run the following command to check the state before working:

    dhctl terraform check --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> \
      --ssh-host <MASTER-NODE-0-HOST> --ssh-host <MASTER-NODE-1-HOST> --ssh-host <MASTER-NODE-2-HOST>
    

    The command output should indicate that Terraform found no inconsistencies and no changes are required.

  7. In the installer container, run the command to edit the cluster configuration (specify the addresses of all master nodes in the --ssh-host parameter):

    dhctl config edit provider-cluster-configuration --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> \
      --ssh-host <MASTER-NODE-0-HOST> --ssh-host <MASTER-NODE-1-HOST> --ssh-host <MASTER-NODE-2-HOST>
    
  8. Edit the instanceClass parameter of the desired node group by changing the instance type and save the changes. Example settings for the masterNodeGroup of the Yandex Cloud provider:

    masterNodeGroup:
      replicas: 3  # required number of master nodes
      instanceClass:
        cores: 4      # change the number of CPUs
        memory: 8192  # change the memory size (in MB)
        # other instance parameters...
        externalIPAddresses:
        - "Auto"      # for each master node
        - "Auto"
        - "Auto"
    
  9. In the installer container, run the following command to upgrade the nodes:

    Read carefully what converge is going to do when it asks for approval.

    When the command is executed, the nodes will be replaced with new ones one by one, in reverse order (2, 1, 0), with confirmation requested for each node.

    dhctl converge --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> \
      --ssh-host <MASTER-NODE-0-HOST> --ssh-host <MASTER-NODE-1-HOST> --ssh-host <MASTER-NODE-2-HOST>
    

    Repeat the steps below (steps 10-12) for each master node one by one, starting with the node with the highest number (suffix 2) and ending with the node with the lowest number (suffix 0).

  10. On the newly created node, check the systemd-unit log for the bashible.service. Wait until the node configuration is complete (you will see the message Configuration is in sync, nothing to do in the log):

    journalctl -fu bashible.service
    
  11. Make sure the node is listed as an etcd cluster member:

    for pod in $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name); do
      d8 k -n kube-system exec "$pod" -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
      --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
      --endpoints https://127.0.0.1:2379/ member list -w table
      if [ $? -eq 0 ]; then
        break
      fi
    done
    
  12. Make sure control-plane-manager is running on the node:

    d8 k -n kube-system wait pod --timeout=10m --for=condition=ContainersReady \
      -l app=d8-control-plane-manager --field-selector spec.nodeName=<MASTER-NODE-N-NAME>
    
  13. Proceed to update the next node (repeat the steps above).
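
After the last master node has been replaced, it is worth verifying that all control-plane nodes are Ready and report the expected Kubernetes version. The example below assumes the master nodes carry the standard node-role.kubernetes.io/control-plane label, as indicated by the ROLES column earlier:

d8 k get nodes -l node-role.kubernetes.io/control-plane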