Data Backup
The Deckhouse Observability Platform supports data backup by copying the contents of its S3 buckets. In addition to the metrics and logs stored in S3 during normal operation, the module also automatically creates a dump of the PostgreSQL database and saves it to S3.
Thus, a full backup of the platform data amounts to copying the contents of the following three S3 buckets:
- `mimir`: metrics
- `loki`: logs
- `backup`: PostgreSQL database dumps
PostgreSQL backup settings are configured in the ModuleConfig of the `observability-platform` module. See the module parameters documentation for details.
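To review the settings currently applied to the module (including the PostgreSQL backup parameters), you can inspect the ModuleConfig resource directly; this is a convenience command, not a required step:

```shell
# Show the current settings of the observability-platform module
kubectl get mc observability-platform -o json | jq '.spec.settings'
```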
Metrics Backup
Metrics are stored in an S3 bucket as TSDB blocks. To back them up, it is sufficient to copy the contents of the bucket, for example with the `rclone` utility (described below).
⚠️ Note: Backup does not guarantee the preservation of all metrics. The last two hours of metrics may reside only in the `ingester` memory and not yet be flushed to S3. Take this into account when estimating acceptable data loss.
Block Structure Details
- TSDB blocks are immutable and have unique names. This allows safe use of simple copying tools without interrupting ingestion.
- A block is considered complete only after the `meta.json` file is written (it is created last).
- Two types of blocks may be present in the backup:
  - Complete blocks (with `meta.json`) are ready for recovery.
  - Incomplete blocks (without `meta.json`, also known as partial) are not used in queries, are automatically deleted over time by the system, and may be present in the backup.
- If a block contains `meta.json` but lacks the `chunks` or `index` data, the block is considered corrupted. Such blocks appear only due to copying failures and are resolved by re-running the backup process (a simple check for them is sketched below).
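If you want to check a local copy of the metrics bucket for such blocks, a minimal sketch is shown below. It assumes the local layout `/backup/prod/mimir/<tenant>/<block>/` used in the rclone example later on this page, and that markers and bucket indexes were excluded during backup:

```shell
# Sketch: report partial blocks (no meta.json) and corrupted blocks
# (meta.json present, but the chunks directory or index file is missing).
# The /backup/prod/mimir/<tenant>/<block>/ layout is an assumption.
for block in /backup/prod/mimir/*/*/; do
  if [ ! -f "${block}meta.json" ]; then
    echo "partial block: ${block}"
  elif [ ! -d "${block}chunks" ] || [ ! -f "${block}index" ]; then
    echo "corrupted block: ${block}"
  fi
done
```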
S3 Bucket Backup Methods
S3 buckets can be backed up using the following approaches:
- Use utilities such as `rclone` or `minio-client` to synchronize S3 bucket contents with another local or S3-compatible storage.
- Mount the S3 buckets to the local file system with tools like `s3fs` or `geesefs`, and then use your company’s standard backup system to copy the data.
Using geesefs
- Install `geesefs` on the server where the buckets will be mounted:

  ```shell
  curl -L https://github.com/yandex-cloud/geesefs/releases/download/v0.43.0/geesefs-linux-amd64 -o /usr/local/bin/geesefs
  chmod +x /usr/local/bin/geesefs
  ```
- Create mount points for the S3 buckets:

  ```shell
  mkdir -p /mnt/dop-backup/mimir /mnt/dop-backup/loki /mnt/dop-backup/backup
  ```
- Obtain a credentials file for accessing the S3 buckets. On the Kubernetes control-plane node, run:

  ```shell
  kubectl -n d8-observability-platform get secrets backup-s3 loki-s3 mimir-s3 -o json | jq -r '.items[] | reduce . as $elt ({}; .[$elt.metadata.name|sub("-s3$"; "")] += [($elt.data | map_values(@base64d) | with_entries(.key |= ascii_downcase) | to_entries[] | "\(.key) = \(.value)")]) | to_entries[] | "[\(.key)]\n\(.value|join("\n"))"'
  ```

  Save the output as `/etc/dop-s3-credentials` on your server.
- Generate `/etc/fstab` entries. On the Kubernetes control-plane node, run:

  ```shell
  kubectl -n d8-observability-platform get cm backup-s3 loki-s3 mimir-s3 -o json | jq --arg endpoint $(kubectl get mc observability-platform -o json | jq -r '"https://s3." + .spec.settings.general.baseDomain') -r '.items[] | (.metadata.name|sub("-s3$"; "")) as $name | "\(.data.BUCKET_NAME) /mnt/dop-backup/\($name) fuse.geesefs _netdev,allow_other,--file-mode=0644,--dir-mode=0755,--shared-config=/etc/dop-s3-credentials,--profile=\($name),--endpoint=\($endpoint) 0 0"'
  ```

  Save the output to `/etc/fstab`.
- Mount the buckets:

  ```shell
  mount -a
  ```
- Verify that the S3 buckets are mounted correctly:

  ```shell
  ls -l /mnt/dop-backup/backup/postgres-backup
  ```
- Perform backups using your organization’s standard backup tools at the required frequency (an illustrative example is given below).
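As an illustration only (actual tooling and retention policy depend on your organization), a single synchronization of the mounted buckets to a local backup directory could look like this; the destination path `/srv/backups/dop` is an assumed example:

```shell
# Hypothetical example: copy all mounted DOP buckets to a local backup directory.
# /srv/backups/dop is an assumed destination; replace with your backup target.
rsync -a --delete /mnt/dop-backup/ /srv/backups/dop/
```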
Using rclone
- Install `rclone`:

  ```shell
  curl -L https://github.com/rclone/rclone/releases/download/v1.69.1/rclone-v1.69.1-linux-amd64.zip -o rclone.zip
  unzip -p rclone.zip rclone-*-linux-amd64/rclone | sudo tee /usr/local/bin/rclone > /dev/null
  sudo chmod +x /usr/local/bin/rclone
  rm rclone.zip
  ```
- Generate the `rclone.conf` configuration file. On the Kubernetes control-plane node, run:

  ```shell
  kubectl -n d8-observability-platform get secrets backup-s3 loki-s3 mimir-s3 -o json | jq -r \
    --arg endpoint $(kubectl get mc observability-platform -o json | jq -r '"https://s3." + .spec.settings.general.baseDomain') \
    --argjson buckets $(kubectl -n d8-observability-platform get cm backup-s3 loki-s3 mimir-s3 -o json | jq -cM 'reduce .items[] as $elt ({}; .[$elt.metadata.name] = $elt.data.BUCKET_NAME)') \
    '.items[] | reduce . as $elt ({}; .[$elt.metadata.name] += [($elt.data | map_values(@base64d) | with_entries(.key |= ascii_downcase) | with_entries(.key |= sub("^aws_"; "")) | . += {type: "s3", provider: "Ceph", endpoint: $endpoint} | to_entries[] | "\(.key) = \(.value)")] | .[($elt.metadata.name|sub("-s3$"; ""))] = ["type = alias", "remote = " + ($elt.metadata.name + ":" + $buckets[$elt.metadata.name])]) | to_entries[] | "[\(.key)]\n\(.value|join("\n"))\n"'
  ```
- Save the output as `rclone.conf` on the backup server.
- Verify access to the bucket:

  ```shell
  rclone --config rclone.conf ls backup:
  ```
- Use the `rclone sync` or `rclone copy` commands to perform backups.
Example: Metrics Backup with rclone
```shell
rclone --config rclone.conf sync -v --delete-before --exclude-if-present deletion-mark.json --exclude '*/markers/*' --exclude '*/bucket-index.json.gz' mimir: /backup/prod/mimir/
```
This command synchronizes the bucket with a local backup directory, excluding marker files, bucket indexes, and blocks marked for deletion. Because of `--delete-before`, files that exist locally but are no longer present in the bucket are removed before the transfer starts.
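The same approach can be used for the other two buckets required for a full backup. Assuming the remote names `loki` and `backup` produced by the `rclone.conf` generated above, and local destination directories chosen by analogy with the metrics example, the commands might look like this:

```shell
# Assumed local destination directories; adjust to your layout.
rclone --config rclone.conf sync -v loki: /backup/prod/loki/
rclone --config rclone.conf sync -v backup: /backup/prod/backup/
```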
Data Recovery
Restoring metrics and logs involves loading the backup contents back into the appropriate S3 buckets. You can use mounted buckets or `rclone` in the same way as during backup.
First, all block files except `meta.json` are uploaded. Then, the `meta.json` files are uploaded in a separate step. This ensures that only complete blocks are seen and processed by the system.
Example:

```shell
rclone --config rclone.conf sync /backup/prod/mimir/ mimir: --exclude '*/meta.json'
rclone --config rclone.conf sync /backup/prod/mimir/ mimir: --include '*/meta.json'
```
PostgreSQL Database Recovery
This procedure applies when the `observability-platform` module is deployed with an internal PostgreSQL database. If you are using an external database (`.spec.settings.ui.postgres.mode: External`), follow your DB provider’s recovery instructions.
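If you are unsure which mode is configured, you can check the parameter mentioned above directly in the ModuleConfig (the command follows the same pattern used elsewhere on this page):

```shell
# Prints the configured PostgreSQL mode; "External" means an external database is used.
kubectl get mc observability-platform -o json | jq -r '.spec.settings.ui.postgres.mode'
```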
- Stop the `backend` and `alertgate` components. Remove any cronjobs that interact with the database:

  ```shell
  kubectl -n d8-observability-platform scale deploy backend alertgate-receiver alertgate-sender alertgate-api --replicas=0
  kubectl -n d8-observability-platform delete cronjob backend-clean-silences host-load postgres-backup
  ```
- Ensure that you have a database dump of the expected size. Copy it to the control-plane node.
- Drop the existing PostgreSQL database:

  ```shell
  kubectl -n d8-observability-platform exec -it $(kubectl -n d8-observability-platform get po -l spilo-role=master -o name) -- psql -U dop -c "DROP DATABASE dop;" postgres
  ```
- Create a new `dop` database:

  ```shell
  kubectl -n d8-observability-platform exec -it $(kubectl -n d8-observability-platform get po -l spilo-role=master -o name) -- psql -U dop -c "CREATE DATABASE dop;" postgres
  ```
- Restore the database from the dump:

  ```shell
  zcat dop-202504211200.dump.gz | kubectl -n d8-observability-platform exec -i $(kubectl -n d8-observability-platform get po -l spilo-role=master -o name) -- psql -U dop dop
  ```
- Restart the `backend` and `alertgate` components:

  ```shell
  kubectl -n d8-observability-platform scale deploy backend alertgate-receiver alertgate-sender alertgate-api --replicas=2
  ```
- Verify that all components are running:

  ```shell
  kubectl -n d8-observability-platform get po -l 'app in (backend,alertgate-receiver,alertgate-sender,alertgate-api)'
  ```
- Check that the web interface is accessible.
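As an optional extra check after the restore, you can list the tables in the restored `dop` database using the same `psql` invocation as in the steps above and confirm that the expected data is present:

```shell
# List tables in the restored "dop" database (optional sanity check).
kubectl -n d8-observability-platform exec -it $(kubectl -n d8-observability-platform get po -l spilo-role=master -o name) -- psql -U dop -c '\dt' dop
```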