This page contains frequently asked questions about configuring and using the Deckhouse Kubernetes Platform.
Cluster and Infrastructure
What to do if the resources generated by an object are not created after adding the object?
If the required resources do not appear after creating an object in the cluster (for example, a DexAuthenticator), follow these steps:

1. Check whether there are any critical alerts in the cluster that may block the creation of the required objects. To do this, use the command:

   d8 k get clusteralerts.deckhouse.io

   Example output:

   NAME               ALERT                                     SEVERITY   AGE   LAST RECEIVED   STATUS
   012f602592aa7a91   K8SSchedulerTargetDown                    3          16h   54s             firing
   0836dc893d5ecc65   KubernetesDeploymentReplicasUnavailable   5          15h   62s             firing
   08742f87d62d0063   NTPDaemonOnNodeDoesNotSynchronizeTime     5          16h   46s             firing
   172cfd38d2f7fd19   D8DeckhouseQueueIsHung                    7          12h   66s             firing
   1c5705daf731f5cf   D8StrongholdNoActiveNodes                 3          16h   55s             firing
   1d2c2f7d69f69f4b   D8DeckhouseIsNotOnReleaseChannel          9          12h   53s             firing
   205a551243d795f3   D8LogShipperAgentNotScheduledInCluster    7          15h   63s             firing
   2e34039aa7a3018e   D8NodeIsNotUpdating                       9          12h   47s             firing
   31baf9a70d657275   D8StrongholdClusterNotHealthy             7          16h   55s             firing

   For more information about alerts, see the "The list of alerts" section.

2. Check the Deckhouse task queue:

   d8 s queue list

   Example output (queues are empty):

   Summary:
   - 'main' queue: empty.
   - 88 other queues (0 active, 88 empty): 0 tasks.
   - no tasks to handle.

   If there are many unprocessed or long-running tasks in the queue, this may indicate problems.

3. Analyze DKP logs and events (a filtering sketch is shown after this list):

   - To view logs in real time, use the command:

     d8 k -n d8-system logs -f -l app=deckhouse

     Example output:

     {"level":"info","logger":"addon-operator","msg":"ConvergeModules task for OperatorStartup in phase '', trigger is Operator-Startup","binding":"ConvergeModules","event.type":"OperatorStartup","queue":"main","task.flow":"start","task.id":"fde0eb3b-5c3e-4da6-a0d8-a52f8ae03428","time":"2025-11-26T08:29:33Z"}
     {"level":"warn","logger":"addon-operator.converge-modules","msg":"ConvergeModules: functional scheduler not finished","binding":"ConvergeModules","event.type":"OperatorStartup","queue":"main","task.id":"fde0eb3b-5c3e-4da6-a0d8-a52f8ae03428","time":"2025-11-26T08:29:33Z"}

     When analyzing logs, pay special attention to warnings (WARNING) and error messages (ERROR).

   - To view events, use the command:

     d8 k -n d8-system get events

     Example output:

     LAST SEEN   TYPE      REASON             OBJECT                            MESSAGE
     11m         Warning   Unhealthy          pod/deckhouse-5886c9bd77-vgdbw    Readiness probe failed: HTTP probe failed with statuscode: 500
     7m22s       Normal    SuccessfulDelete   replicaset/deckhouse-5886c9bd77   Deleted pod: deckhouse-5886c9bd77-vgdbw
     7m20s       Normal    Scheduled          pod/deckhouse-6bc5c4494-fwx6z     Successfully assigned d8-system/deckhouse-6bc5c4494-fwx6z to sandbox1-master-0
     7m20s       Normal    Pulling            pod/deckhouse-6bc5c4494-fwx6z     Pulling image "dev-registry.deckhouse.io/sys/deckhouse-oss@sha256:17ac07634e17422df52720264cddec3916ed6985a77782dc8a24fe5352290e6e"

     When analyzing events, pay special attention to those with the Warning type.
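If the amount of log and event output is large, it can help to narrow it down. The commands below are a minimal sketch, assuming the Deckhouse log lines are JSON with a level field (as in the example output above) and that jq is available where you run d8; the exact level values ("warn", "error") are an assumption based on that output:

   # Show only warning and error messages from the Deckhouse logs
   d8 k -n d8-system logs -l app=deckhouse --tail=1000 | jq -Rr 'fromjson? | select(.level == "warn" or .level == "error") | .msg'

   # Show only events of the Warning type in the d8-system namespace
   d8 k -n d8-system get events --field-selector type=Warning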
Deckhouse
How to check the job queue in Deckhouse?
How to check the status of all Deckhouse task queues?
To view the status of all Deckhouse job queues, run the following command:
d8 s queue list
Example output (queues are empty):
Summary:
- 'main' queue: empty.
- 88 other queues (0 active, 88 empty): 0 tasks.
- no tasks to handle.
How to view the status of the main task queue?
To view the status of the Deckhouse main task queue, run the following command:
d8 s queue main
Example output (38 tasks in the main queue):
Queue 'main': length 38, status: 'run first task'
Example output (the main queue is empty):
Queue 'main': length 0, status: 'waiting for task 0s'
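If you are waiting for a long queue to drain (for example, after a configuration change), it can be convenient to re-run the check periodically. A minimal sketch, assuming the watch utility is available on the machine where you run d8:

   # Re-check the main queue every 10 seconds
   watch -n 10 d8 s queue main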
What to do if there are problems updating DKP?
The Deckhouse Kubernetes Platform update fails and one or more Deckhouse pods are in an unworkable state
If the Deckhouse Kubernetes Platform update fails and one or more Deckhouse pods in the d8-system namespace are in an unworkable state, perform the following steps:
1. Check the Deckhouse logs using the command:

   d8 k -n d8-system logs -f -l app=deckhouse | jq -Rr 'fromjson? | .msg'

   If there are any problems, information about them will be included in the output. When analyzing logs, pay special attention to warnings (WARNING) and error messages (ERROR).

2. Check Deckhouse events using the command:

   d8 k -n d8-system describe po -l app=deckhouse | awk '
     /^Name:/ { pod = $2; print "=== " pod " ==="; in_events = 0 }
     /Events:/ { in_events = 1; next }
     in_events && /^$/ { in_events = 0; print "---" }
     in_events && !/^Events:/ { print $0 }
   ' | sed '/^---$/N;/^\n$/D'

   Pod events contain key information about problems (for example, scheduling errors, image pull errors, and so on). When analyzing events, pay special attention to those with the Warning type.

   Example output:

   Type     Reason     Age                       From      Message
   ----     ------     ----                      ----      -------
   Warning  Unhealthy  4m44s (x1918 over 154m)   kubelet   Readiness probe failed: HTTP probe failed with statuscode: 500
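In addition to logs and events, it is often useful to check whether the Deckhouse deployment itself has finished rolling out. A minimal check, using the same deployment name (deckhouse) and label (app=deckhouse) that appear in the commands above:

   # Check the rollout progress of the Deckhouse deployment
   d8 k -n d8-system rollout status deployment/deckhouse

   # List the Deckhouse pods and their current state
   d8 k -n d8-system get po -l app=deckhouse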
DKP update is stuck in the Release is suspended status
The Release is suspended status indicates that the release has been postponed and is currently unavailable (not recommended) for installation. In this case, it is recommended to remain on the latest available release or on the one currently installed (it will have the Deployed status).
To view the list of releases, use the command:
d8 k get deckhousereleases.deckhouse.io
Example output:
NAME PHASE TRANSITIONTIME MESSAGE
v1.69.13 Skipped 3h46m
v1.69.14 Skipped 3h46m
v1.69.15 Skipped 3h46m
v1.69.16 Superseded 160m
v1.70.12 Suspended 49d Release is suspended
v1.70.13 Skipped 36d
v1.70.14 Skipped 34d
v1.70.15 Skipped 28d
v1.70.16 Skipped 19d
v1.70.17 Deployed 160m
v1.71.3 Suspended 14d Release is suspended
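To quickly find the currently deployed release without scanning the whole list, the output can be filtered. This is a minimal sketch that assumes the PHASE column shown above is exposed as .status.phase in the resource; verify the field path with d8 k get deckhousereleases.deckhouse.io -o yaml first:

   # Print the name of the release in the Deployed phase
   d8 k get deckhousereleases.deckhouse.io -o json | jq -r '.items[] | select(.status.phase == "Deployed") | .metadata.name'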
IAM
What to do if there are problems applying DexProvider settings?
If you have changed the DexProvider settings in the user-authn module and one of the following problems occurs:
- No changes are visible (the settings are not applied).
- When attempting to log in to the platform's web interface with any type of authorization, a 500 Internal Server Error occurs without a detailed description.

Perform the following steps:

1. Check the status of the dex deployment:

   d8 k -n d8-user-authn get pod

   Example output:

   NAME                                    READY   STATUS    RESTARTS   AGE
   dex-5ddb779b7d-6pbhs                    2/2     Running   0          20h
   kubeconfig-generator-7c46977b9f-5kdmc   1/1     Running   0          20h

   If the module is functioning properly and the correct configuration is specified in DexProvider, all pods will have the Running status. If there is a problem, one or more pods will have a status other than Running, and it will be impossible to log in to the platform's web interface with any type of authorization.

2. Check the logs of the problematic pod:

   d8 k -n d8-user-authn logs dex-<pod-name>

   Based on the information from the logs, correct the configuration in the DexProvider resource and wait for the dex pods to restart. Within a few minutes, the pods will restart automatically, and the platform's web interface (located at console.<CLUSTER_NAME_TEMPLATE>) will become available and will reflect the changes made to the DexProvider resource.
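To confirm that a configuration change has actually been picked up, you can also watch the dex deployment roll out after editing DexProvider. This is a minimal sketch; the deployment name dex is inferred from the pod names above, and the app=dex label is an assumption, so check the actual pod labels first:

   # Wait for the dex deployment to finish rolling out
   d8 k -n d8-user-authn rollout status deployment/dex

   # Follow the logs of the dex pods (the app=dex label is an assumption)
   d8 k -n d8-user-authn logs -f -l app=dex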
Kubernetes and Scheduling
What to do when the API server is overloaded?
The following signs may indicate problems with API server load and memory consumption:
- kubectl (d8) responds slowly or does not respond at all (commands are executed slowly or not at all).
- Pods are recreated in the cluster for no apparent reason.

If these signs are present, perform the following actions:

1. Check the resource consumption of the API server pods. To do this, use the command:

   d8 k -n kube-system top po -l component=kube-apiserver

   Pay attention to the CPU and MEMORY consumption.

   Example output:

   NAME                               CPU(cores)   MEMORY(bytes)
   kube-apiserver-sandbox1-master-0   251m         1476Mi

2. Check the metrics in Grafana.

   To view the metrics, open the dashboard "Home" → "Dashboards" → "Kubernetes Cluster" → "Control Plane Status". Review the graphs related to the API server ("Kube-apiserver CPU Usage", "Kube-apiserver Memory Usage", "Kube-apiserver latency", etc.).

3. Review the API server audit logs to identify the source of high memory consumption. One common cause of high memory consumption is a large number of requests; a sketch for aggregating audit events is shown after this list.
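One quick way to see who is generating the request volume is to aggregate the audit events by user or user agent. This is a minimal sketch, assuming the audit log is written in JSON format; the file path below is only an example, so check your audit policy and API server configuration for the actual location:

   # Count audit events per user (the path is hypothetical; adjust it to your setup)
   jq -r '.user.username' /var/log/kube-audit/audit.log | sort | uniq -c | sort -rn | head

   # The same, grouped by user agent, to spot a misbehaving client
   jq -r '.userAgent' /var/log/kube-audit/audit.log | sort | uniq -c | sort -rn | head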
How to check the Kubernetes version in use?
To check the Kubernetes version in use, run the command:
d8 k get nodes
Example output:
NAME STATUS ROLES AGE VERSION
frontend-0 Ready frontend 118d v1.31.9
master-0 Ready control-plane,master 118d v1.31.9
master-1 Ready control-plane,master 118d v1.31.9
master-2 Ready control-plane,master 118d v1.31.9
system-0 Ready system 118d v1.31.9
system-1 Ready system 118d v1.31.9
worker-0 Ready worker 37d v1.31.9
worker-1 Ready worker 19d v1.31.9
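The VERSION column shows the kubelet version on each node. To also see the version reported by the API server, you can use the standard version subcommand (d8 k is a wrapper around kubectl, so the usual kubectl subcommands apply):

   # Show the client and Kubernetes API server versions
   d8 k version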
What to do if you encounter problems updating Kubernetes components on cluster nodes, synchronizing nodes, or applying NodeGroupConfiguration?
If Kubernetes components are not updated on a cluster node, the NodeGroupConfiguration is not applied, or not all nodes of a NodeGroup are synchronized (have the UPTODATE status), perform the following steps:
1. Check the bashible logs on the node where the problems are occurring. The bashible mechanism is used to keep cluster nodes up to date. It is started by the bashible.timer timer at regular intervals as a service on the cluster nodes; each run restarts the service, synchronizes the configuration scripts, and executes them (if necessary).

   To check the bashible logs, use the command:

   journalctl -u bashible

   If the output contains the message "Configuration is in sync, nothing to do", the node is synchronized and there are no problems. The absence of this message or the presence of errors indicates a problem.

2. Check the synchronization status of the cluster nodes using the command:

   d8 k get ng

   The number of nodes in the UPTODATE state must match the total number of nodes in each group.

   Example output:

   NAME       TYPE     READY   NODES   UPTODATE   INSTANCES   DESIRED   MIN   MAX   STANDBY   STATUS   AGE    SYNCED
   frontend   Static   1       1       1                                                               118d   True
   master     Static   3       3       3                                                               118d   True
   system     Static   2       2       2                                                               118d   True
   worker     Static   2       2       2                                                               118d   True
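If a group is not fully synchronized, the reason is usually visible in the NodeGroup status and events. A minimal sketch (ng is the short name used above; substitute your own group name for worker):

   # Show the detailed status, conditions, and events of a node group
   d8 k describe ng worker

   # Inspect the raw status fields (the exact status structure may differ between versions)
   d8 k get ng worker -o yaml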
What to do if there are problems adding a node to the cluster via Cluster API Provider Static?
If a node added to the cluster via Cluster API Provider Static (CAPS) remains in the Pending or Bootstrapping status, perform the following steps:
1. Verify the access credentials specified in the SSHCredentials resource: ensure that the username and SSH key are correct.

2. On the node where the problem occurred, check that the public key corresponding to the private key from SSHCredentials is present in authorized_keys. Example command for checking:

   cat ~/.ssh/authorized_keys

3. Check the number of nodes specified in the NodeGroup that should include the node being added. Make sure that the maximum number of nodes is not exceeded.

4. Check the status of bashible.service on the node that caused the problem:

   systemctl status bashible.service

   It must have the active (running) status. If the service has the inactive or failed status, it has not started, which indicates a problem with the configuration process.

5. If the steps above did not resolve the issue, remove the problematic node and its StaticInstance resource from the cluster so that the system will attempt to recreate them (a sketch for checking the recreation progress is shown after this list). To do this:

   1. Get a list of nodes and locate the problematic one:

      d8 k get nodes

   2. Find the corresponding StaticInstance resource:

      kubectl get staticinstances -n <namespace-name>

   3. Remove the problematic node:

      kubectl delete node <node-name>

   4. Remove the corresponding StaticInstance resource:

      kubectl delete staticinstances -n <namespace-name> <static-instance-name>
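After the node and its StaticInstance have been removed, CAPS should pick the machine up again and bootstrap it from scratch. A minimal sketch for watching the progress; the exact phase names printed for StaticInstance (typically Pending, then Bootstrapping, then Running) may differ between versions:

   # Watch the StaticInstance objects change state
   kubectl get staticinstances -n <namespace-name> -w

   # Watch the node appear in the cluster and become Ready
   d8 k get nodes -w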
How to change the instance type for nodes with the CloudPermanent type?
To change the instance type for nodes with the CloudPermanent type, follow these steps:
1. Make a backup of etcd and the /etc/kubernetes directory (a backup sketch is shown after this list).

2. Transfer the archive to a server outside the cluster (for example, to a local machine).

3. Ensure there are no alerts in the cluster that can prevent the update of the master nodes.

4. Make sure that the Deckhouse queue is empty. To view the status of all Deckhouse job queues, run the following command:

   d8 s queue list

   Example output (queues are empty):

   Summary:
   - 'main' queue: empty.
   - 88 other queues (0 active, 88 empty): 0 tasks.
   - no tasks to handle.

5. Run the appropriate edition and version of the Deckhouse installer container on the local machine (change the container registry address if necessary):

   DH_VERSION=$(d8 k -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/version}')
   DH_EDITION=$(d8 k -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/edition}' | tr '[:upper:]' '[:lower:]')
   docker run --pull=always -it -v "$HOME/.ssh/:/tmp/.ssh/" \
     registry.deckhouse.ru/deckhouse/${DH_EDITION}/install:${DH_VERSION} bash

6. In the installer container, run the following command to check the state before making changes:

   dhctl terraform check --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> \
     --ssh-host <MASTER-NODE-0-HOST> --ssh-host <MASTER-NODE-1-HOST> --ssh-host <MASTER-NODE-2-HOST>

   The command output should indicate that Terraform found no inconsistencies and no changes are required.

7. In the installer container, run the command to edit the cluster configuration (specify the addresses of all master nodes in the --ssh-host parameters):

   dhctl config edit provider-cluster-configuration --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> \
     --ssh-host <MASTER-NODE-0-HOST> --ssh-host <MASTER-NODE-1-HOST> --ssh-host <MASTER-NODE-2-HOST>

8. Edit the instanceClass parameter of the desired node group by changing the instance type, and save the changes. Example settings for the masterNodeGroup of the Yandex Cloud provider:

   masterNodeGroup:
     replicas: 3 # required number of master nodes
     instanceClass:
       cores: 4     # change the number of CPUs
       memory: 8192 # change the memory size (in MB)
       # other instance parameters...
       externalIPAddresses:
       - "Auto" # for each master node
       - "Auto"
       - "Auto"

9. In the installer container, run the following command to perform the node upgrade:

   dhctl converge --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> \
     --ssh-host <MASTER-NODE-0-HOST> --ssh-host <MASTER-NODE-1-HOST> --ssh-host <MASTER-NODE-2-HOST>

   Read carefully what converge is going to do when it asks for approval. When the command is executed, the nodes will be replaced by new ones, with confirmation requested for each node. The replacement is performed one node at a time, in reverse order (2, 1, 0).

   Repeat the verification steps below (checking bashible, etcd membership, and control-plane-manager) for each master node one by one, starting with the node with the highest number (suffix 2) and ending with the node with the lowest number (suffix 0).

10. On the newly created node, check the systemd unit log of bashible.service. Wait until the node configuration is complete (the message "nothing to do" appears in the log):

    journalctl -fu bashible.service

11. Make sure the node is listed as an etcd cluster member:

    for pod in $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name); do
      d8 k -n kube-system exec "$pod" -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
        --endpoints https://127.0.0.1:2379/ member list -w table
      if [ $? -eq 0 ]; then
        break
      fi
    done

12. Make sure control-plane-manager is running on the node:

    d8 k -n kube-system wait pod --timeout=10m --for=condition=ContainersReady \
      -l app=d8-control-plane-manager --field-selector spec.nodeName=<MASTER-NODE-N-NAME>

13. Proceed to update the next node (repeat the steps above).