Recovery from failures
During operation, DKP automatically creates backups of configuration and data that may be useful if problems occur. These backups are saved in the /etc/kubernetes/deckhouse/backup directory. If any issues or unexpected situations occur, you can use these backups to restore the system to a previously healthy state.
Restoring etcd cluster functionality
If the etcd cluster is not functioning and cannot be restored from a backup, you can attempt to recover it from scratch by following the steps below.
- On all nodes that are part of your etcd cluster, except one, delete the etcd.yaml manifest located in /etc/kubernetes/manifests/. This will leave only one active node, from which the multi-master cluster state will be restored.
- On the remaining node, open the etcd.yaml manifest and add the --force-new-cluster flag under spec.containers.command.
- After the cluster is successfully restored, remove the --force-new-cluster flag.
This operation is destructive: it completely wipes the existing data and initializes a new cluster based on the state preserved on the remaining node. All pending records will be lost.
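The manifest edit on the remaining node can be scripted. The following is a minimal sketch, not a DKP tool: it assumes GNU sed and a kubeadm-style manifest in which the etcd command list begins with a line of the form "- etcd". The function names add_force_new_cluster and remove_force_new_cluster are hypothetical.

```shell
# Sketch only: assumes GNU sed and a manifest whose command list starts
# with a "- etcd" line. Function names are illustrative.

# Insert "- --force-new-cluster" right after the "- etcd" entry, keeping
# the same indentation. Does nothing if the flag is already present.
add_force_new_cluster() {
  local manifest=$1
  grep -q -- '--force-new-cluster' "$manifest" && return 0
  sed -i -E 's/^([[:space:]]*)- etcd$/&\n\1- --force-new-cluster/' "$manifest"
}

# Remove the flag again once the cluster is healthy.
remove_force_new_cluster() {
  local manifest=$1
  sed -i '/^[[:space:]]*- --force-new-cluster$/d' "$manifest"
}
```

Because sed -i writes a temporary file next to the target, it may be safer to run these functions on a copy of etcd.yaml outside /etc/kubernetes/manifests/ and then move the result into place.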
Restoring a master node when kubelet fails to load control plane components
Such a situation may occur if images of the control plane components on the master were deleted in a cluster that has a single master node (e.g., the directory /var/lib/containerd was deleted). In this case, kubelet cannot pull images of the control plane components when restarted since the master node lacks authorization parameters required for accessing registry.deckhouse.io.
Follow the steps below to restore the master node.
containerd
- Execute the following command in any cluster running under DKP to obtain the registry authorization string:

  d8 k -n d8-system get secrets deckhouse-registry -o json | jq -r '.data.".dockerconfigjson"' | base64 -d | jq -r '.auths."registry.deckhouse.io".auth'

- Copy the command’s output and use it to set the AUTH variable on the corrupted master.
- Next, pull the images of the control plane components to the corrupted master:

  for image in $(grep "image:" /etc/kubernetes/manifests/* | awk '{print $3}'); do
    crictl pull --auth "$AUTH" "$image"
  done

- Restart kubelet after pulling the images.
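The image list used in the pull loop can also be built in a way that does not depend on how many manifests the directory contains. A minimal sketch, assuming each image is the last field on its "image:" line; the function name list_manifest_images is hypothetical:

```shell
# Sketch: print each distinct image referenced by the static pod manifests.
# "grep -h" drops the file-name prefix and awk takes the last field, so the
# result is the same whether the directory holds one manifest or several.
list_manifest_images() {
  local dir=${1:-/etc/kubernetes/manifests}
  grep -h 'image:' "$dir"/* | awk '{print $NF}' | sort -u
}
```

The output can then feed the same loop, for example: for image in $(list_manifest_images); do crictl pull --auth "$AUTH" "$image"; done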
etcd restore
Viewing etcd cluster members
Option 1
Use the etcdctl member list command.
Example:
for pod in $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name); do
d8 k -n kube-system exec "$pod" -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ member list -w table
if [ $? -eq 0 ]; then
break
fi
done
Warning. The last column in the output table (IS LEARNER) shows whether an etcd member is in the learner state; it does not show whether the member is the leader.
Option 2
To obtain information about etcd cluster nodes in tabular form, use the etcdctl endpoint status command. For the leader, the IS LEADER column will show true.
Example:
for pod in $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name); do
d8 k -n kube-system exec "$pod" -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ endpoint status --cluster -w table
if [ $? -eq 0 ]; then
break
fi
done
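When the table is wide, it can help to pull out a single column such as IS LEADER or IS LEARNER. The following is a rough sketch that parses the "-w table" output with awk; the function name etcd_table_column is hypothetical, and the column is looked up by its header text rather than by position, since the set of columns differs between etcd versions.

```shell
# Sketch: read an etcdctl "-w table" output on stdin and print
# "<first column> <requested column>" for every data row.
etcd_table_column() {
  awk -F'|' -v want="$1" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    NF < 3 { next }                      # skip +-----+ separator rows
    !col {                               # first "|" row is the header
      for (i = 2; i < NF; i++) if (trim($i) == want) col = i
      next
    }
    { print trim($2), trim($col) }
  '
}
```

For example, piping the output of the endpoint status command above into etcd_table_column "IS LEADER" prints one line per endpoint with true or false next to it. If the requested header is not found, nothing is printed.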
Restoring the etcd cluster in case of complete unavailability
- Stop all etcd nodes except one by deleting the etcd.yaml manifest on the others.
- On the remaining node, add the --force-new-cluster option to the etcd startup command.
- After the cluster is restored, remove this option.
Be careful: these actions completely erase the previous data and form a new etcd cluster.
Recovering etcd after panic: unexpected removal of unknown remote peer error
In some cases, manual restoration via etcdutl snapshot restore can help:
- Save a local snapshot from /var/lib/etcd/member/snap/db.
- Use etcdutl with the --force-new-cluster option to restore.
- Completely wipe the /var/lib/etcd directory and place the restored snapshot there.
- Remove any “stuck” etcd/kube-apiserver containers and restart the node.
Actions to take when etcd database exceeds quota-backend-bytes limit
When the etcd database size reaches the limit set by the quota-backend-bytes parameter, etcd switches to “read-only” mode: it stops accepting new entries but remains available for reading. To check whether this has happened, run the following command:
d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name | sed -n 1p) -- \
etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ endpoint status -w table --cluster
If you see a message like alarm:NOSPACE in the ERRORS field, you need to take the following steps:
- Edit /etc/kubernetes/manifests/etcd.yaml: find the line with --quota-backend-bytes and double the specified value. If there is no such line, add one, for example: - --quota-backend-bytes=8589934592 (this sets the limit to 8 GB).
- Disarm the active alarm that occurred due to reaching the limit. To do this, execute the command:

  d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name | sed -n 1p) -- \
  etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
  --endpoints https://127.0.0.1:2379/ alarm disarm

- Change the maxDbSize parameter in the control-plane-manager settings to match the value specified in the manifest.
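The doubled value from the first step can be computed rather than typed by hand. A minimal sketch, assuming the flag appears in the manifest in the form --quota-backend-bytes=<number>; the function name next_quota_bytes is hypothetical:

```shell
# Sketch: print the value to use for --quota-backend-bytes after doubling.
# Falls back to 8589934592 (8 * 1024^3) when the flag is absent, mirroring
# the example value given in the steps above.
next_quota_bytes() {
  local manifest=$1 current
  current=$(sed -n 's/.*--quota-backend-bytes=\([0-9][0-9]*\).*/\1/p' "$manifest" | head -n 1)
  if [ -n "$current" ]; then
    echo $((current * 2))
  else
    echo 8589934592
  fi
}
```

The printed number is the one to write into the manifest and to mirror in maxDbSize.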
etcd defragmentation
Before defragmenting, back up etcd.
To view the size of the etcd database on a specific node before and after defragmentation, use the command (where NODE_NAME is the name of the master node):
d8 k -n kube-system exec -it etcd-NODE_NAME -- /usr/bin/etcdctl \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
endpoint status --cluster -w table
Output example (the size of the etcd database on the node is specified in the DB SIZE column):
+-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
| ENDPOINT | ID | VERSION | STORAGE VERSION | DB SIZE | IN USE | PERCENTAGE NOT IN USE | QUOTA | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | DOWNGRADE TARGET VERSION | DOWNGRADE ENABLED |
+-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
| https://192.168.199.80:2379 | 489a8af1e7acd7a0 | 3.6.1 | 3.6.0 | 76 MB | 62 MB | 20% | 2.1 GB | true | false | 56 | 258054684 | 258054684 | | | false |
+-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
| https://192.168.199.81:2379 | 589a8ad1e7ccd7b0 | 3.6.1 | 3.6.0 | 76 MB | 62 MB | 20% | 2.1 GB | false | false | 56 | 258054685 | 258054685 | | | false |
+-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
| https://192.168.199.82:2379 | 229a8cd1e7bcd7a0 | 3.6.1 | 3.6.0 | 76 MB | 62 MB | 20% | 2.1 GB | false | false | 56 | 258054685 | 258054685 | | | false |
+-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
How to defragment an etcd node in a single-master cluster
Defragmenting etcd is a resource-intensive operation that temporarily blocks etcd from running on that node. Keep this in mind when choosing a time to perform the operation in a cluster with a single master node.
To defragment etcd in a cluster with a single master node, use the following command (where NODE_NAME is the name of the master node):
d8 k -n kube-system exec -ti etcd-NODE_NAME -- /usr/bin/etcdctl \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s
Example output when the operation is successful:
Finished defragmenting etcd member[https://localhost:2379]. took 848.948927ms
If a timeout error occurs, increase the value of the --command-timeout parameter from the command above until defragmentation is successful.
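Raising the timeout step by step can be wrapped in a small helper. The sketch below is only an illustration: the doubling strategy and the function name retry_with_timeout are assumptions, not part of DKP or etcdctl. It appends --command-timeout=<N>s to whatever command it is given and doubles N after each failure, up to a maximum.

```shell
# Sketch: run "<command> --command-timeout=<N>s", doubling N after each
# failure until the command succeeds or N exceeds the given maximum.
# Usage: retry_with_timeout <start-seconds> <max-seconds> <command...>
retry_with_timeout() {
  local timeout=$1 max=$2
  shift 2
  while [ "$timeout" -le "$max" ]; do
    if "$@" "--command-timeout=${timeout}s"; then
      return 0
    fi
    timeout=$((timeout * 2))
  done
  return 1
}
```

To use it, pass the defragmentation command from above without its trailing --command-timeout=30s argument, for example: retry_with_timeout 30 240 d8 k -n kube-system exec etcd-NODE_NAME -- /usr/bin/etcdctl ... defrag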
How to defragment etcd in a cluster with multiple master nodes
To defragment etcd in a cluster with multiple master nodes:
- Get a list of etcd pods. To do this, use the following command:

  d8 k -n kube-system get pod -l component=etcd -o wide

  Example output:

  NAME            READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
  etcd-master-0   1/1     Running   0          3d21h   192.168.199.80   master-0   <none>           <none>
  etcd-master-1   1/1     Running   0          3d21h   192.168.199.81   master-1   <none>           <none>
  etcd-master-2   1/1     Running   0          3d21h   192.168.199.82   master-2   <none>           <none>

- Identify the leader master node. To do this, contact any etcd pod and get a list of nodes participating in the etcd cluster using the command (where NODE_NAME is the name of the master node):

  d8 k -n kube-system exec -it etcd-NODE_NAME -- /usr/bin/etcdctl \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint status --cluster -w table

  Output example (the leader in the IS LEADER column will have the value true):

  +-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
  | ENDPOINT | ID | VERSION | STORAGE VERSION | DB SIZE | IN USE | PERCENTAGE NOT IN USE | QUOTA | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | DOWNGRADE TARGET VERSION | DOWNGRADE ENABLED |
  +-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
  | https://192.168.199.80:2379 | 489a8af1e7acd7a0 | 3.6.1 | 3.6.0 | 76 MB | 62 MB | 20% | 2.1 GB | true | false | 56 | 258054684 | 258054684 | | | false |
  +-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
  | https://192.168.199.81:2379 | 589a8ad1e7ccd7b0 | 3.6.1 | 3.6.0 | 76 MB | 62 MB | 20% | 2.1 GB | false | false | 56 | 258054685 | 258054685 | | | false |
  +-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+
  | https://192.168.199.82:2379 | 229a8cd1e7bcd7a0 | 3.6.1 | 3.6.0 | 76 MB | 62 MB | 20% | 2.1 GB | false | false | 56 | 258054685 | 258054685 | | | false |
  +-----------------------------+------------------+---------+-----------------+---------+--------+-----------------------+--------+------------+------------+-----------+------------+--------------------+--------+--------------------------+-------------------+

- Defragment the etcd nodes that are members of the etcd cluster one by one. Use the following command to defragment (where NODE_NAME is the name of the master node):

  Important: Defragment the leader last.

  Restoring etcd on a node after defragmentation may take some time. It is recommended to wait at least a minute before proceeding to defragment the next etcd node.

  d8 k -n kube-system exec -ti etcd-NODE_NAME -- /usr/bin/etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/ca.crt \
  --key /etc/kubernetes/pki/etcd/ca.key \
  --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s

  Example output when the operation is successful:

  Finished defragmenting etcd member[https://localhost:2379]. took 848.948927ms

  If a timeout error occurs, increase the value of the --command-timeout parameter from the command above until defragmentation is successful.
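The "leader last" ordering can be made explicit before starting the loop over the pods. A minimal sketch, assuming you have already identified the leader's pod name by hand; the function name defrag_order is hypothetical:

```shell
# Sketch: print the given etcd pod names with the leader moved to the end.
# Usage: defrag_order <leader-pod> <pod>...
defrag_order() {
  local leader=$1 pod
  shift
  for pod in "$@"; do
    [ "$pod" = "$leader" ] || echo "$pod"
  done
  echo "$leader"
}

# Possible use (commented out; substitute the defrag command from above):
# for pod in $(defrag_order etcd-master-0 etcd-master-0 etcd-master-1 etcd-master-2); do
#   d8 k -n kube-system exec -ti "$pod" -- /usr/bin/etcdctl ... defrag --command-timeout=30s
#   sleep 60   # give the member time to recover before moving on
# done
```

The 60-second pause reflects the recommendation above to wait at least a minute between nodes.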
High availability
If any component of the control plane becomes unavailable, the cluster temporarily maintains its current state but cannot process new events. For example:
- If kube-controller-manager fails, Deployment scaling will stop working.
- If kube-apiserver is unavailable, no requests can be made to the Kubernetes API, although existing applications will continue to function.
However, prolonged unavailability of control plane components disrupts the processing of new objects, handling of node failures, and other operations. Over time, this can lead to cluster degradation and impact user applications.
To mitigate these risks, the control plane should be scaled to a high-availability configuration with a minimum of three nodes. This is especially critical for etcd, which requires a quorum to elect a leader. The quorum is a majority of the total number of nodes N: N/2 rounded down, plus one.
Example:
| Cluster size | Quorum (majority) | Max fault tolerance |
|---|---|---|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
| 9 | 5 | 4 |
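An even number of nodes does not improve fault tolerance but increases replication overhead. For example, four nodes require a quorum of three and therefore tolerate only one failure, the same as three nodes.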
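The figures in the table above follow directly from the majority rule and can be reproduced with integer arithmetic. A small sketch; the function names quorum and fault_tolerance are illustrative:

```shell
# Quorum is N/2 rounded down, plus one; shell integer division already
# rounds down for positive numbers.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while a quorum still remains.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }
```

Running fault_tolerance for 3 and for 4 gives the same result, which is why adding a fourth node does not help.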
In most cases, three etcd nodes are sufficient. Use five if high availability is critical. More than seven is rarely necessary and not recommended due to high resource consumption.
After new control plane nodes are added:
- The label node-role.kubernetes.io/control-plane="" is applied.
- A DaemonSet launches control plane pods on the new nodes.
- DKP creates or updates files in /etc/kubernetes: manifests, configuration files, certificates, etc.
- All DKP modules that support high availability will enable it automatically, unless the global setting highAvailability is manually overridden.
Control plane node removal is performed in reverse:
- Labels node-role.kubernetes.io/control-plane, node-role.kubernetes.io/master, and node.deckhouse.io/group are removed.
- DKP removes its pods from these nodes.
- etcd members on the nodes are automatically deleted.
- If the number of nodes drops from two to one, etcd may enter readonly mode. In this case, you must start etcd with the --force-new-cluster flag, which should be removed after a successful startup.