etcd backup
Automatic backup
Deckhouse creates a CronJob kube-system/d8-etcd-backup-*, which is triggered at 00:00 UTC+0. The etcd data backup is saved to the archive /var/lib/etcd/etcd-backup.tar.gz on all master nodes.
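To confirm that the automatic backup is in place and has been running, you can list the CronJob and its recent jobs (a hedged example; the exact CronJob name is generated by Deckhouse, so it is filtered by prefix here):
d8 k -n kube-system get cronjobs | grep d8-etcd-backup
d8 k -n kube-system get jobs | grep d8-etcd-backup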
Manual backup using Deckhouse CLI
In Deckhouse v1.65 and higher, an etcd data backup can be created with a single d8 backup etcd command:
d8 backup etcd --kubeconfig $KUBECONFIG ./etcd.db
Manual backup with etcdctl
Not recommended for use in Deckhouse 1.65 and higher.
On Deckhouse v1.64 and earlier, run the following script on any master node as root:
#!/usr/bin/env bash
set -e
pod=etcd-`hostname`
d8 k -n kube-system exec "$pod" -- /usr/bin/etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key --endpoints https://127.0.0.1:2379/ snapshot save /var/lib/etcd/${pod##*/}.snapshot && \
mv /var/lib/etcd/"${pod##*/}.snapshot" etcd-backup.snapshot && \
cp -r /etc/kubernetes/ ./ && \
tar -cvzf kube-backup.tar.gz ./etcd-backup.snapshot ./kubernetes/
rm -r ./kubernetes ./etcd-backup.snapshot
The kube-backup.tar.gz file will be created in the current directory with a snapshot of the etcd database of one of the etcd cluster nodes.
The resulting snapshot can be used to restore the state of the etcd cluster.
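Optionally, you can check the integrity of the saved snapshot before shipping the archive elsewhere. A minimal sketch, assuming an etcdctl binary is available on the master node:
# Extract the snapshot from the archive and print its hash, revision and size.
tar -xzf kube-backup.tar.gz ./etcd-backup.snapshot
etcdctl snapshot status ./etcd-backup.snapshot -w table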
It is also recommended to back up the directory /etc/kubernetes, which contains:
- manifests and configuration of control-plane components;
- PKI of the Kubernetes cluster.
We recommend storing backup copies of the etcd cluster snapshots, as well as a backup of the directory /etc/kubernetes/, in encrypted form outside the Deckhouse cluster.
For this, you can use third-party file backup tools, such as Restic, Borg, Duplicity, etc.
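For example, a minimal sketch of shipping the archive off the node with restic (the repository URL and password file are hypothetical placeholders; adapt them to your storage):
# Hypothetical repository and credentials; replace with your own.
export RESTIC_REPOSITORY=s3:s3.example.com/etcd-backups
export RESTIC_PASSWORD_FILE=/root/.restic-password
restic init    # only needed once, when the repository is created
restic backup kube-backup.tar.gz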
Full cluster state recovery from etcd backup
The following are the steps for restoring a cluster to a previous state from a backup in case of complete data loss.
Recovering a cluster with one master node
To correctly recover a cluster with one master node, follow these steps:
- Download the etcdctl utility to the server (it is desirable that its version is the same as the etcd version in the cluster).
wget "https://github.com/etcd-io/etcd/releases/download/v3.5.4/etcd-v3.5.4-linux-amd64.tar.gz"
tar -xzvf etcd-v3.5.4-linux-amd64.tar.gz && mv etcd-v3.5.4-linux-amd64/etcdctl /usr/local/bin/etcdctl
You can check the etcd version in your cluster by running the following command:
d8 k -n kube-system exec -ti etcd-$(hostname) -- etcdctl version
- Stop etcd.
Etcd runs as a static pod, so it’s enough to move the manifest file:
mv /etc/kubernetes/manifests/etcd.yaml ~/etcd.yaml
- Back up the current etcd data.
cp -r /var/lib/etcd/member/ /var/lib/deckhouse-etcd-backup
- Clean up the etcd directory.
rm -rf /var/lib/etcd/member/
- Place the etcd backup in ~/etcd-backup.snapshot.
- Restore the etcd database:
ETCDCTL_API=3 etcdctl snapshot restore ~/etcd-backup.snapshot --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt \
  --key /etc/kubernetes/pki/etcd/ca.key --endpoints https://127.0.0.1:2379/ --data-dir=/var/lib/etcd
- Start etcd.
mv ~/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
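After the manifest is back in place, kubelet recreates the etcd static pod. You can verify that etcd and the API server are healthy again, for example:
crictl ps --name etcd
d8 k get nodes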
Recovering a multi-master cluster
To properly recover a multi-master cluster, follow these steps:
- Explicitly enable High Availability (HA) mode using the global parameter highAvailability (see the example after this list). This is necessary, for example, to avoid losing the single Prometheus replica and its PVC, since HA mode is disabled by default in single-master clusters.
- Switch the cluster to single-master mode according to the instructions for cloud clusters, or manually remove the static master nodes from the cluster.
- On the remaining single master node, follow the steps to restore etcd from a backup as described in the guide for a single-master cluster.
- Once etcd is up and running again, remove the information about the previously deleted master nodes from the cluster using the following command (specify the node name):
d8 k delete node <MASTER_NODE_I>
- Restart all cluster nodes.
- Wait for the tasks from the Deckhouse queue to complete:
d8 k -n d8-system exec svc/deckhouse-leader -c deckhouse -- deckhouse-controller queue main
- Switch the cluster back to multi-master mode in accordance with the instructions for cloud clusters or for static or hybrid clusters.
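One possible way to enable the highAvailability global parameter mentioned in the first step is shown below. This is a hedged sketch: it assumes the global settings are managed through the ModuleConfig resource named global and that patching it directly is acceptable in your environment:
d8 k patch moduleconfig global --type merge -p '{"spec":{"settings":{"highAvailability":true}}}'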
Restoring a Kubernetes object from an etcd backup
A short scenario for restoring individual objects from an etcd backup:
- Get a backup of your data.
- Start a temporary etcd instance.
- Fill it with data from the backup.
- Get descriptions of the required objects using the etcdhelper utility.
Steps for restoring objects from an etcd backup
In the example:
- etcd-snapshot.bin is a file with a backup of etcd data (snapshot);
- infra-production is the namespace in which you want to restore the objects.
- Start a pod with a temporary etcd instance.
It is desirable that the version of the etcd instance you start matches the version of etcd from which the backup was created. For simplicity, the instance is launched in the cluster rather than locally, since the cluster already has an etcd image.
- Prepare the etcd.pod.yaml file with the pod manifest:
cat <<EOF >etcd.pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcdrestore
  namespace: default
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - "sleep 96h"
    image: IMAGE
    imagePullPolicy: IfNotPresent
    name: etcd
    volumeMounts:
    - name: etcddir
      mountPath: /default.etcd
  volumes:
  - name: etcddir
    emptyDir: {}
EOF
- Set the current name of the etcd image:
IMG=`kubectl -n kube-system get pod -l component=etcd -o jsonpath="{.items[0].spec.containers[*].image}"`
sed -i -e "s#IMAGE#$IMG#" etcd.pod.yaml
- Create a pod:
kubectl create -f etcd.pod.yaml
- Copy etcdhelper and the etcd snapshot to the pod container.
etcdhelper can be built from source or copied from a pre-built image (e.g., the etcdhelper image on Docker Hub).
Example:
kubectl cp etcd-snapshot.bin default/etcdrestore:/tmp/etcd-snapshot.bin
kubectl cp etcdhelper default/etcdrestore:/usr/bin/etcdhelper
- In the container, set permissions to run etcdhelper, restore the data from the backup, and start etcd.
Example:
~ # kubectl -n default exec -it etcdrestore -- sh
/ # chmod +x /usr/bin/etcdhelper
/ # etcdctl snapshot restore /tmp/etcd-snapshot.bin
/ # etcd &
- Get the descriptions of the desired cluster objects by filtering them with grep.
Example:
~ # kubectl -n default exec -it etcdrestore -- sh
/ # mkdir /tmp/restored_yaml
/ # cd /tmp/restored_yaml
/tmp/restored_yaml # for o in `etcdhelper -endpoint 127.0.0.1:2379 ls /registry/ | grep infra-production` ; do etcdhelper -endpoint 127.0.0.1:2379 get $o > `echo $o | sed -e "s#/registry/##g;s#/#_#g"`.yaml ; done
The sed replacement in the example saves object descriptions to files named after the etcd registry structure. For example, /registry/deployments/infra-production/supercronic → deployments_infra-production_supercronic.yaml.
- Copy the received object descriptions from the pod to the master node using the command:
d8 k cp default/etcdrestore:/tmp/restored_yaml restored_yaml
- Remove information about the creation time, UID, status, and other operational data from the received object descriptions (see the cleanup sketch after this list), then restore the objects using the command:
d8 k create -f restored_yaml/deployments_infra-production_supercronic.yaml
- The pod with the temporary etcd instance can be deleted using the command:
d8 k delete -f etcd.pod.yaml
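For the field-cleanup step above, one possible approach is to strip the operational fields with yq (a hedged sketch, assuming yq v4 is installed on the master node; you can equally well edit the files by hand):
# Remove status and server-populated metadata from every restored manifest.
for f in restored_yaml/*.yaml ; do
  yq eval 'del(.status) | del(.metadata.uid) | del(.metadata.resourceVersion) | del(.metadata.creationTimestamp) | del(.metadata.managedFields)' -i "$f"
done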
How to get a list of etcd cluster nodes (option 1)
Use the etcdctl member list command.
Example:
d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name | head -n1) -- \
etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ member list -w table
Warning. The last column in the output table shows whether the etcd cluster node is a learner, not whether it is the leader.
How to get a list of etcd cluster nodes (option 2)
Use the etcdctl endpoint status command. For this command, the address of each control-plane node must be passed after the --endpoints flag.
The true value in the fifth column of the output indicates the leader.
An example of a script that automatically passes all control-plane node addresses to the command:
MASTER_NODE_IPS=($(d8 k get nodes -l \
node-role.kubernetes.io/control-plane="" \
-o 'custom-columns=IP:.status.addresses[?(@.type=="InternalIP")].address' \
--no-headers))
unset ENDPOINTS_STRING
for master_node_ip in ${MASTER_NODE_IPS[@]}
do ENDPOINTS_STRING+="--endpoints https://${master_node_ip}:2379 "
done
d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod \
-l component=etcd,tier=control-plane -o name | head -n1) \
-- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/ca.key \
$(echo -n $ENDPOINTS_STRING) endpoint status -w table
Rebuilding etcd cluster
A rebuild may be required if the etcd cluster has collapsed, or when migrating from a multi-master cluster to a single-master cluster.
- Select the node from which to start restoring the etcd cluster. In case of migrating to a single-master cluster, this is the node where etcd should remain.
- Stop etcd on all other nodes. To do this, delete the file /etc/kubernetes/manifests/etcd.yaml.
- On the remaining node, in the manifest /etc/kubernetes/manifests/etcd.yaml, add the --force-new-cluster argument to the spec.containers.command field.
- After the cluster has been successfully started, remove the --force-new-cluster parameter.
This operation is destructive: it completely destroys consensus and starts the etcd cluster from the state that was saved on the selected node. Any pending entries will be lost.
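After removing --force-new-cluster, you can check that the rebuilt single-node etcd cluster is healthy. A hedged example, using the same etcdctl invocation style as elsewhere in this document (assumes the standard etcd-<hostname> pod name):
d8 k -n kube-system exec -ti etcd-$(hostname) -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
  --endpoints https://127.0.0.1:2379/ endpoint health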
Eliminating infinite restart
This option may be needed if starting etcd with the --force-new-cluster argument does not restore it. This can happen during an unsuccessful converge of master nodes, when the new master node was created with the old etcd disk and a changed address on the local network, and there are no other master nodes. It is worth using this method if the etcd container is stuck in an endless restart loop and its log contains the error panic: unexpected removal of unknown remote peer.
- Install the etcdutl utility (see the download example after this list).
- From the current local snapshot of the etcd database (/var/lib/etcd/member/snap/db), create a new etcd data directory:
./etcdutl snapshot restore /var/lib/etcd/member/snap/db --name <HOSTNAME> \
  --initial-cluster=<HOSTNAME>=https://<ADDRESS>:2380 --initial-advertise-peer-urls=https://<ADDRESS>:2380 \
  --skip-hash-check=true --data-dir /var/lib/etcdtest
where:
- <HOSTNAME> is the name of the master node;
- <ADDRESS> is the address of the master node.
- Run commands to use the new data directory:
cp -r /var/lib/etcd /tmp/etcd-backup
rm -rf /var/lib/etcd
mv /var/lib/etcdtest /var/lib/etcd
- Find the etcd and kube-apiserver containers:
crictl ps -a --name "^etcd|^kube-apiserver"
- Remove the found etcd and kube-apiserver containers:
crictl rm <CONTAINER-ID>
- Restart the master node.
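For the etcdutl installation in the first step: etcdutl ships in the same release archive as etcdctl, so it can be installed the same way as shown earlier for etcdctl (a sketch; pick the release that matches the etcd version in your cluster, v3.5.4 here is only an example):
wget "https://github.com/etcd-io/etcd/releases/download/v3.5.4/etcd-v3.5.4-linux-amd64.tar.gz"
tar -xzvf etcd-v3.5.4-linux-amd64.tar.gz && mv etcd-v3.5.4-linux-amd64/etcdutl /usr/local/bin/etcdutl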