
etcd backup

Automatic backup

Deckhouse creates the CronJob kube-system/d8-etcd-backup-*, which runs daily at 00:00 UTC. The etcd data backup is saved as the archive /var/lib/etcd/etcd-backup.tar.gz on every master node.
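
To check that the backup CronJob exists, you can list the CronJobs in the kube-system namespace (a quick check; the grep pattern relies only on the name prefix mentioned above):

d8 k -n kube-system get cronjob | grep d8-etcd-backup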

Manual backup using Deckhouse CLI

In Deckhouse v1.65 and higher clusters, etcd data backup can be created with a single d8 backup etcd command:

d8 backup etcd --kubeconfig $KUBECONFIG ./etcd.db

Manual backup with etcdctl

Not recommended for use in Deckhouse 1.65 and higher.

On Deckhouse v1.64 and earlier, run the following script on any master node as root:

#!/usr/bin/env bash
set -e

# Name of the etcd static pod running on this master node.
pod=etcd-`hostname`
# Take an etcd snapshot inside the pod (written to /var/lib/etcd, which is a hostPath volume),
# then pack it together with a copy of /etc/kubernetes into kube-backup.tar.gz.
d8 k -n kube-system exec "$pod" -- /usr/bin/etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key --endpoints https://127.0.0.1:2379/ snapshot save /var/lib/etcd/${pod##*/}.snapshot && \
mv /var/lib/etcd/"${pod##*/}.snapshot" etcd-backup.snapshot && \
cp -r /etc/kubernetes/ ./ && \
tar -cvzf kube-backup.tar.gz ./etcd-backup.snapshot ./kubernetes/
# Remove the intermediate files.
rm -r ./kubernetes ./etcd-backup.snapshot

The kube-backup.tar.gz file will be created in the current directory, containing a snapshot of the etcd database taken from one of the etcd cluster nodes. The resulting snapshot can be used to restore the state of the etcd cluster.
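
Before relying on a snapshot, you can sanity-check it with etcdctl (a minimal example; it assumes etcd-backup.snapshot has already been extracted from the archive and that etcdctl is available on the node):

ETCDCTL_API=3 etcdctl snapshot status ./etcd-backup.snapshot -w table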

It is also recommended to back up the /etc/kubernetes directory, which contains the static pod manifests of the control plane components, the cluster PKI certificates and keys, and the control plane kubeconfig files. The backup script above already includes this directory in the resulting archive.

We recommend storing backup copies of the etcd snapshots, as well as a backup of the /etc/kubernetes/ directory, in encrypted form outside the Deckhouse cluster. You can use third-party file backup tools for this, such as Restic, Borg, or Duplicity.
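
For example, an encrypted off-cluster copy could be made with Restic. The sketch below is illustrative only: the repository address and password file are placeholders, not something provided by Deckhouse.

export RESTIC_REPOSITORY=sftp:backup@backup-host.example.com:/backups/etcd
export RESTIC_PASSWORD_FILE=/root/.restic-password
restic init    # run once to create the repository
restic backup kube-backup.tar.gz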

Full cluster state recovery from etcd backup

The following are the steps for restoring a cluster to a previous state from a backup in case of complete data loss.

Recovering a cluster with one master node

To correctly recover a cluster with one master node, follow these steps:

  1. Download the etcdctl utility to the server (it is desirable that its version is the same as the etcd version in the cluster).

     wget "https://github.com/etcd-io/etcd/releases/download/v3.5.4/etcd-v3.5.4-linux-amd64.tar.gz"
     tar -xzvf etcd-v3.5.4-linux-amd64.tar.gz && mv etcd-v3.5.4-linux-amd64/etcdctl /usr/local/bin/etcdctl
    

    You can check the etcd version in your cluster by running the following command:

     d8 k -n kube-system exec -ti etcd-$(hostname) -- etcdctl version
    
  2. Stop etcd.

    Etcd runs as a static pod, so it’s enough to move the manifest file:

     mv /etc/kubernetes/manifests/etcd.yaml ~/etcd.yaml
    
  3. Back up the current etcd data.

     cp -r /var/lib/etcd/member/ /var/lib/deckhouse-etcd-backup
    
  4. Clean up the etcd directory.

     rm -rf /var/lib/etcd/member/
    
  5. Place the etcd backup in ~/etcd-backup.snapshot (if the backup was made with the etcdctl script above, see the extraction sketch after this list).

  6. Restore the etcd database.

      ETCDCTL_API=3 etcdctl snapshot restore ~/etcd-backup.snapshot --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt \
      --key /etc/kubernetes/pki/etcd/ca.key --endpoints https://127.0.0.1:2379/ --data-dir=/var/lib/etcd
    
  7. Start etcd.

     mv ~/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
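
A sketch for step 5, assuming the backup was created with the etcdctl script from the "Manual backup with etcdctl" section (file names follow that script):

tar -xzvf kube-backup.tar.gz ./etcd-backup.snapshot
mv ./etcd-backup.snapshot ~/etcd-backup.snapshot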
    

Recovering a multi-master cluster

To properly recover a multi-master cluster, follow these steps:

  1. Explicitly enable High Availability (HA) mode using the global parameter highAvailability (a configuration sketch is given after this list). This is necessary, for example, to avoid losing one Prometheus replica and its PVC, since HA is disabled by default in single-master mode.

  2. Switch the cluster to single-master mode following the instructions for cloud clusters, or manually remove the static master nodes from the cluster.

  3. On the remaining single master node, follow the steps to restore etcd from backup as described in the guide for a single-master cluster.

  4. Once etcd is up and running again, delete the Node objects of the master nodes that were removed earlier, using the following command (specify the node name):

     d8 k delete node <MASTER_NODE_I>
    
  5. Restart all cluster nodes.

  6. Wait for the tasks from the Deckhouse queue to complete:

     d8 k -n d8-system exec svc/deckhouse-leader -c deckhouse -- deckhouse-controller queue main
    
  7. Switch the cluster back to multi-master mode following the instructions for cloud clusters or the instructions for static or hybrid clusters.
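
A sketch for step 1, assuming the highAvailability parameter is managed through the global ModuleConfig (verify the configuration method used in your Deckhouse version):

d8 k edit moduleconfig global

# Make sure the spec contains:
spec:
  settings:
    highAvailability: true
  version: 1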

Restoring a Kubernetes object from an etcd backup

A short scenario for restoring individual objects from an etcd backup:

  1. Get a backup of your data.

  2. Start a temporary etcd instance.

  3. Fill it with data from the backup.

  4. Get descriptions of the required objects using the etcdhelper utility.

Steps for restoring objects from an etcd backup

In the example:

  • etcd-snapshot.bin is a file with a backup of etcd data (snapshot);
  • infra-production is the namespace in which you want to restore the objects.
  1. Start a pod with a temporary etcd instance.

    It is desirable that the version of the etcd instance you are starting matches the version of etcd from which the backup was created. For simplicity, the instance is launched not locally, but in the cluster, since the cluster already has an etcd image.

    • Prepare the etcd.pod.yaml file with the pod manifest:

      cat <<EOF >etcd.pod.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: etcdrestore
        namespace: default
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - "sleep 96h"
          image: IMAGE
          imagePullPolicy: IfNotPresent
          name: etcd
          volumeMounts:
          - name: etcddir
            mountPath: /default.etcd
        volumes:
        - name: etcddir
          emptyDir: {}
      EOF
      
    • Set the current name of the etcd image:

      IMG=`kubectl -n kube-system get pod -l component=etcd -o jsonpath="{.items[0].spec.containers[*].image}"`
      sed -i -e "s#IMAGE#$IMG#" etcd.pod.yaml
      
    • Create a pod:

      kubectl create -f etcd.pod.yaml
      
    • Copy etcdhelper and the etcd snapshot to the pod container.

      etcdhelper can be built from source or copied from a pre-built image (e.g. the etcdhelper image on Docker Hub).

      Example:

      kubectl cp etcd-snapshot.bin default/etcdrestore:/tmp/etcd-snapshot.bin
      kubectl cp etcdhelper default/etcdrestore:/usr/bin/etcdhelper
      
    • In the container, set permissions to run etcdhelper, restore the data from the backup, and start etcd.

      Example:

      ~ # kubectl -n default exec -it etcdrestore -- sh
      / # chmod +x /usr/bin/etcdhelper
      / # etcdctl snapshot restore /tmp/etcd-snapshot.bin
      / # etcd &
      
    • Get the descriptions of the desired cluster objects by filtering them with grep.

      Example:

      ~ # kubectl -n default exec -it etcdrestore -- sh
      / # mkdir /tmp/restored_yaml
      / # cd /tmp/restored_yaml
      /tmp/restored_yaml # for o in `etcdhelper -endpoint 127.0.0.1:2379 ls /registry/ | grep infra-production` ; do etcdhelper -endpoint 127.0.0.1:2379 get $o > `echo $o | sed -e "s#/registry/##g;s#/#_#g"`.yaml ; done
      

      The sed replacement in the example saves object descriptions to files named after their etcd registry keys. For example, the key /registry/deployments/infra-production/supercronic is saved as deployments_infra-production_supercronic.yaml.

  2. Copy the received object descriptions from the pod to the master node using the command:

     d8 k cp default/etcdrestore:/tmp/restored_yaml restored_yaml
    
  3. Remove information about the creation time, UID, status and other operational data from the received object descriptions (a cleanup sketch is given after this list), then restore the objects using the command:

     d8 k create -f restored_yaml/deployments_infra-production_supercronic.yaml
    
  4. A pod with a temporary etcd instance can be deleted using the command:

     d8 k delete -f etcd.pod.yaml
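
One possible way to perform the cleanup in step 3, assuming mikefarah's yq v4 is installed on the node (the set of fields removed here is illustrative):

yq -i 'del(.metadata.uid) | del(.metadata.creationTimestamp) | del(.metadata.resourceVersion) | del(.metadata.managedFields) | del(.status)' \
  restored_yaml/deployments_infra-production_supercronic.yaml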
    

How to get a list of etcd cluster nodes (option 1)

Use the etcdctl member list command.

Example:

d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod -l component=etcd,tier=control-plane -o name | head -n1) -- \
etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key \
--endpoints https://127.0.0.1:2379/ member list -w table

Warning. The last column of the output table (IS LEARNER) shows whether the etcd cluster node is in the learner state; it does not indicate the leader.

How to get a list of etcd cluster nodes (option 2)

Use the etcdctl endpoint status command. The address of each control-plane node must be passed after the --endpoints flag.

The true value in the fifth column of the output indicates the leader.

An example of a script that automatically passes the addresses of all control-plane nodes to the command:

MASTER_NODE_IPS=($(d8 k get nodes -l \
node-role.kubernetes.io/control-plane="" \
-o 'custom-columns=IP:.status.addresses[?(@.type=="InternalIP")].address' \
--no-headers))
unset ENDPOINTS_STRING
for master_node_ip in ${MASTER_NODE_IPS[@]}
do ENDPOINTS_STRING+="--endpoints https://${master_node_ip}:2379 "
done
d8 k -n kube-system exec -ti $(d8 k -n kube-system get pod \
-l component=etcd,tier=control-plane -o name | head -n1) \
-- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/ca.key \
$(echo -n $ENDPOINTS_STRING) endpoint status -w table

Rebuilding the etcd cluster

A rebuild may be required if the etcd cluster has collapsed, or when migrating from a multi-master cluster to a single-master cluster.

  1. Select the node from which to start restoring the etcd cluster. In case of migrating to a single-master cluster, this is the node where etcd should remain.
  2. Stop etcd on all other nodes. To do this, delete the file /etc/kubernetes/manifests/etcd.yaml.
  3. On the remaining node, in the manifest /etc/kubernetes/manifests/etcd.yaml, add the --force-new-cluster argument to the spec.containers.command field.

  4. After the cluster has been successfully started, remove the --force-new-cluster parameter.

This operation is destructive: it completely destroys consensus and starts the etcd cluster from the state saved on the selected node. Any pending entries will be lost.
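
For reference, the change from step 3 might look like the fragment below (illustrative only; keep all existing flags of your manifest as they are):

# /etc/kubernetes/manifests/etcd.yaml (fragment)
spec:
  containers:
  - command:
    - etcd
    # ...existing flags stay unchanged...
    - --force-new-cluster   # add this flag; remove it again after the cluster starts (step 4)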

Eliminating infinite restart

This option may be needed if starting etcd with the --force-new-cluster argument does not restore it. This can happen during an unsuccessful converge of master nodes, when the new master node was created with the old etcd disk, changed its address on the local network, and there are no other master nodes. This method is worth using if the etcd container is stuck in an endless restart loop and its log contains the error: panic: unexpected removal of unknown remote peer.

  1. Install the etcdutl utility (a download sketch is given at the end of this section).
  2. From the current local snapshot of the etcd database (/var/lib/etcd/member/snap/db), create a new snapshot:

     ./etcdutl snapshot restore /var/lib/etcd/member/snap/db --name <HOSTNAME> \
     --initial-cluster=<HOSTNAME>=https://<ADDRESS>:2380 --initial-advertise-peer-urls=https://<ADDRESS>:2380 \
     --skip-hash-check=true --data-dir /var/lib/etcdtest
    

    where:

    • <HOSTNAME> is the name of the master node;
    • <ADDRESS> is the address of the master node.
  3. Run commands to use the new snapshot:

     cp -r /var/lib/etcd /tmp/etcd-backup
     rm -rf /var/lib/etcd
     mv /var/lib/etcdtest /var/lib/etcd
    
  4. Find the etcd and kube-apiserver containers:

     crictl ps -a --name "^etcd|^kube-apiserver"
    
  5. Remove the found etcd and kube-apiserver containers:

     crictl rm <CONTAINER-ID>
    
  6. Restart the master node.
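
A download sketch for step 1: etcdutl ships in the same etcd release archive as etcdctl, so pick the version matching the etcd version in your cluster (v3.5.4 below is illustrative):

wget "https://github.com/etcd-io/etcd/releases/download/v3.5.4/etcd-v3.5.4-linux-amd64.tar.gz"
tar -xzvf etcd-v3.5.4-linux-amd64.tar.gz && cp etcd-v3.5.4-linux-amd64/etcdutl ./etcdutl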