Recovery from failures
During its operation, DVP automatically creates backups of configuration and data that may be useful in case of problems. These backups are saved in the `/etc/kubernetes/deckhouse/backup` directory. If any issues or unexpected situations occur during operation, you can use these backups to restore the system to a previously healthy state.
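For example, you can inspect the available backups directly on a master node. A minimal sketch (the exact file names depend on your cluster):

```shell
# List the backups DVP keeps on the master node.
ls -lh /etc/kubernetes/deckhouse/backup
```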
Restoring etcd cluster functionality
If the etcd cluster is not functioning and cannot be restored from a backup, you can attempt to recover it from scratch by following the steps below.
- On all nodes that are part of your etcd cluster, except one, delete the `etcd.yaml` manifest located in `/etc/kubernetes/manifests/`. This will leave only one active node, from which the multi-master cluster state will be restored.
- On the remaining node, open the `etcd.yaml` manifest and add the `--force-new-cluster` flag under `spec.containers.command`.
- After the cluster is successfully restored, remove the `--force-new-cluster` flag.
This operation is destructive: it completely wipes the existing data and initializes a new cluster based on the state preserved on the remaining node. All pending records will be lost.
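A minimal sketch of these steps on the nodes' filesystems is shown below; the backup path is only an example, and the flag has to be added to the etcd container command in the manifest by hand:

```shell
# On every etcd node EXCEPT the one you keep: remove the static pod manifest
# (kubelet stops the etcd pod once the manifest is gone; backup path is illustrative).
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak

# On the remaining node: edit the manifest and add the flag to the etcd command,
# e.g. under spec.containers[0].command:
#   - --force-new-cluster
vim /etc/kubernetes/manifests/etcd.yaml

# After the cluster is healthy again, remove --force-new-cluster from the manifest.
```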
Restoring a master node when kubelet fails to load control plane components
This situation may occur in a cluster with a single master node if the images of the control plane components were deleted on that master (for example, if the `/var/lib/containerd` directory was removed). In this case, kubelet cannot pull the control plane component images when restarted, since the master node lacks the authorization parameters required to access `registry.deckhouse.io`.
Below are instructions on how to restore the master node.
containerd
- To restore the master node, execute the following command in any cluster running under DVP:

  d8 k -n d8-system get secrets deckhouse-registry -o json | jq -r '.data.".dockerconfigjson"' | base64 -d | jq -r '.auths."registry.deckhouse.io".auth'

- Copy the command's output and use it for setting the `AUTH` variable on the corrupted master.
- Next, pull the images of the control plane components to the corrupted master:

  for image in $(grep "image:" /etc/kubernetes/manifests/* | awk '{print $3}'); do
    crictl pull --auth $AUTH $image
  done

- Restart kubelet after pulling the images (see the example below).
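For reference, a minimal sketch of finishing these steps, assuming the auth string from the first command has been copied into the shell (the placeholder value below is illustrative):

```shell
# Set AUTH to the string printed by the first command (placeholder shown here).
AUTH="<paste the output of the first command here>"

# ...pull the images as shown in the loop above, then restart kubelet so it
# recreates the control plane static pods from the manifests.
systemctl restart kubelet
```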
etcd restore
Viewing etcd cluster members
Option 1
Use the `etcdctl member list` command.
Example:
for pod in $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name); do
d8 k -n kube-system exec "$pod" -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ member list -w table
if [ $? -eq 0 ]; then
break
fi
done
Warning. The last column of the output table (`IS LEARNER`) shows whether an etcd member is in the learner state; it does not indicate which member is the leader.
Option 2
To obtain information about etcd cluster nodes in tabular form, use the `etcdctl endpoint status` command. For the leader, the `IS LEADER` column will show `true`.
Example:
for pod in $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name); do
d8 k -n kube-system exec "$pod" -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ endpoint status --cluster -w table
if [ $? -eq 0 ]; then
break
fi
done
Restoring the etcd cluster in case of complete unavailability
- Stop all etcd nodes except one by deleting the `etcd.yaml` manifest on the others.
- On the remaining node, add the `--force-new-cluster` option to the etcd startup command.
- After the cluster is restored, remove this option.
Be careful: these actions completely erase the previous data and form a new etcd cluster.
Recovering etcd after the `panic: unexpected removal of unknown remote peer` error
In some cases, manual restoration via `etcdutl snapshot restore` can help (a sketch follows the list below):
- Save a local snapshot from `/var/lib/etcd/member/snap/db`.
- Use `etcdutl` with the `--force-new-cluster` option to restore.
- Completely wipe the `/var/lib/etcd` directory and place the restored snapshot there.
- Remove any "stuck" etcd/kube-apiserver containers and restart the node.
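As an illustration only, the restore might look roughly like the sketch below. The snapshot copy path, the temporary data directory, and the use of `--skip-hash-check` (assumed to be needed because the copied `db` file is not a formal etcdctl snapshot) are assumptions; adapt them to your node and etcd version.

```shell
# Save a copy of the local snapshot before wiping anything (path is illustrative).
cp /var/lib/etcd/member/snap/db /root/etcd-backup.db

# Restore into a temporary data directory; --skip-hash-check is assumed to be
# required because the copied db file lacks a snapshot integrity hash.
etcdutl snapshot restore /root/etcd-backup.db \
  --data-dir /var/lib/etcd-restored \
  --skip-hash-check

# Completely wipe the old data directory and move the restored data into place.
rm -rf /var/lib/etcd/*
mv /var/lib/etcd-restored/member /var/lib/etcd/

# Finally, remove any stuck etcd / kube-apiserver containers (e.g. with crictl)
# and restart the node so the control plane static pods come up cleanly.
```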
Actions to take when etcd database exceeds quota-backend-bytes limit
When the etcd database volume reaches the limit set by the `quota-backend-bytes` parameter, etcd switches to read-only mode: the database stops accepting new entries but remains available for reading. You can check whether you are facing this situation by running the following command:
d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name | sed -n 1p) -- \
etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ endpoint status -w table --cluster
This command uses the substitution `$(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name | sed -n 1p)`, which automatically inserts the name of the first Pod matching the specified labels.
If you see a message like `alarm:NOSPACE` in the `ERRORS` field, take the following steps:
- Edit `/etc/kubernetes/manifests/etcd.yaml`: find the line with `--quota-backend-bytes` and double the specified value. If there is no such line, add one, for example `- --quota-backend-bytes=8589934592`, which sets the limit to 8 GB.
- Disarm the active alarm raised when the limit was reached. To do this, execute the command:

  d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name | sed -n 1p) -- \
  etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
  --endpoints https://127.0.0.1:2379/ alarm disarm

- Change the `maxDbSize` parameter in the `control-plane-manager` settings to match the value specified in the manifest (a sketch follows this list).
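As a hedged illustration, aligning the module setting with the manifest value might look roughly as follows. The settings path shown (`settings.etcd.maxDbSize`) is an assumption; check the `control-plane-manager` module documentation for your DVP version.

```shell
# Open the control-plane-manager module configuration for editing.
d8 k edit moduleconfig control-plane-manager

# Then set maxDbSize to the same number of bytes as --quota-backend-bytes,
# for example (assumed structure):
#   spec:
#     settings:
#       etcd:
#         maxDbSize: 8589934592
```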
High availability
If any component of the control plane becomes unavailable, the cluster temporarily maintains its current state but cannot process new events. For example:
- If `kube-controller-manager` fails, Deployment scaling will stop working.
- If `kube-apiserver` is unavailable, no requests can be made to the Kubernetes API, although existing applications will continue to function.
However, prolonged unavailability of control plane components disrupts the processing of new objects, handling of node failures, and other operations. Over time, this can lead to cluster degradation and impact user applications.
To mitigate these risks, the control plane should be scaled to a high-availability configuration of at least three nodes. This is especially critical for etcd, which requires a quorum to elect a leader. Quorum is a majority (N/2 + 1) of the total number of nodes.
Example:
| Cluster size | Quorum (majority) | Max fault tolerance |
|---|---|---|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
| 9 | 5 | 4 |
An even number of nodes does not improve fault tolerance but increases replication overhead.
In most cases, three etcd nodes are sufficient. Use five if high availability is critical. More than seven is rarely necessary and not recommended due to high resource consumption.
After new control plane nodes are added:
- The label `node-role.kubernetes.io/control-plane=""` is applied.
- A DaemonSet launches control plane pods on the new nodes.
- DVP creates or updates files in `/etc/kubernetes`: manifests, configuration files, certificates, etc.
- All DVP modules that support high availability will enable it automatically, unless the global `highAvailability` setting is manually overridden.
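To verify the result, you can, for example, list the control plane nodes and pods. A minimal sketch (output columns depend on your cluster):

```shell
# List nodes carrying the control-plane role label.
d8 k get nodes -l node-role.kubernetes.io/control-plane -o wide

# Check that etcd and the other control plane pods are running on the new nodes.
d8 k -n kube-system get pods -l tier=control-plane -o wide
```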
Control plane node removal is performed in reverse:
- The labels `node-role.kubernetes.io/control-plane`, `node-role.kubernetes.io/master`, and `node.deckhouse.io/group` are removed.
- DVP removes its pods from these nodes.
- etcd members on the nodes are automatically deleted.
- If the number of nodes drops from two to one, etcd may enter read-only mode. In this case, you must start etcd with the `--force-new-cluster` flag, which should be removed after a successful startup.