Data Backup

The Deckhouse Observability Platform supports data backup by copying the contents of its S3 buckets. Besides the metrics and logs stored in S3 during normal operation, the module also automatically backs up the PostgreSQL database and saves its dump to S3.

Thus, to perform a full backup of the platform data, you need to copy the contents of the following three S3 buckets:

  • mimir — for metrics
  • loki — for logs
  • backup — for PostgreSQL database dumps

PostgreSQL backup settings are configured in the ModuleConfig of the observability-platform module. See the module parameters documentation for details.
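For example, to inspect the current module settings:

kubectl get mc observability-platform -o yaml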

Metrics Backup

Metrics are stored in an S3 bucket in the form of TSDB blocks. To back them up, it is sufficient to copy the contents of the bucket — for example, using the rclone utility (described below).

⚠️ Note: Backup does not guarantee the preservation of all metrics. The last two hours of metrics may reside only in the ingester memory and not yet be flushed to S3. This should be considered when estimating acceptable data loss.

Block Structure Details

  • TSDB blocks are immutable and have unique names. This allows safe use of simple copying tools without interrupting ingestion.
  • A block is considered complete only after the meta.json file is written (created last).
  • Two types of blocks may be present in the backup:
    • Complete blocks (with meta.json) are ready for recovery.
    • Incomplete blocks (without meta.json, also known as partial blocks) are not used in queries, are automatically deleted by the system over time, and may still be present in the backup.
  • If a block contains meta.json but lacks the chunks directory or the index file, the block is considered corrupted. Such blocks appear only as a result of copying failures and are fixed by re-running the backup (see the check sketch after this list).
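To check a local backup copy for such blocks, a minimal sketch (it assumes the backup was copied to /backup/prod/mimir/ and uses the usual <tenant>/<block ULID>/ layout; the path is illustrative):

for block in /backup/prod/mimir/*/*/; do
  if [ ! -f "${block}meta.json" ]; then
    echo "partial block (no meta.json): ${block}"
  elif [ ! -d "${block}chunks" ] || [ ! -f "${block}index" ]; then
    echo "corrupted block (meta.json present, chunks/index missing): ${block}"
  fi
done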

S3 Bucket Backup Methods

S3 buckets can be backed up using the following approaches:

  • Use utilities such as rclone or minio-client to synchronize S3 bucket contents with another local or S3-compatible storage.
  • Mount S3 buckets to the local file system with tools like s3fs or geesefs, then use your company’s standard backup system to copy data.

Using geesefs

  1. Install geesefs on the server where the buckets will be mounted:

    curl -L https://github.com/yandex-cloud/geesefs/releases/download/v0.43.0/geesefs-linux-amd64 -o /usr/local/bin/geesefs
    chmod +x /usr/local/bin/geesefs
    
  2. Create mount points for the S3 buckets:

    mkdir -p /mnt/dop-backup/mimir /mnt/dop-backup/loki /mnt/dop-backup/backup
    
  3. Obtain a credentials file for accessing S3 buckets. On the Kubernetes control-plane node, run:

    kubectl -n d8-observability-platform get secrets backup-s3 loki-s3 mimir-s3 -o json | jq -r '.items[] | reduce . as $elt ({}; .[$elt.metadata.name|sub("-s3$"; "")] += [($elt.data | map_values(@base64d) | with_entries(.key |= ascii_downcase) | to_entries[] | "\(.key) = \(.value)")]) | to_entries[] | "[\(.key)]\n\(.value|join("\n"))"'
    

    Save the output as /etc/dop-s3-credentials on your server.

  4. Generate /etc/fstab entries. On the Kubernetes control-plane node, run:

    kubectl -n d8-observability-platform get cm backup-s3 loki-s3 mimir-s3 -o json | jq --arg endpoint $(kubectl get mc observability-platform -o json | jq -r '"https://s3." + .spec.settings.general.baseDomain') -r '.items[] | (.metadata.name|sub("-s3$"; "")) as $name | "\(.data.BUCKET_NAME) /mnt/dop-backup/\($name) fuse.geesefs _netdev,allow_other,--file-mode=0644,--dir-mode=0755,--shared-config=/etc/dop-s3-credentials,--profile=\($name),--endpoint=\($endpoint) 0 0"'
    

    Save the output to /etc/fstab.

  5. Mount the buckets:

    mount -a
    
  6. Verify that the S3 buckets are mounted correctly:

    ls -l /mnt/dop-backup/backup/postgres-backup
    
  7. Perform backups using your organization’s standard backup tools at the required frequency.
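    If no dedicated backup system is in place, even a plain tar archive of the mounted directories is enough; a minimal sketch (the target path and naming scheme are illustrative):

    for bucket in mimir loki backup; do
      tar -czf "/backup/dop-${bucket}-$(date +%Y%m%d).tar.gz" -C /mnt/dop-backup "${bucket}"
    done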

Using rclone

  1. Install rclone:

    curl -L https://github.com/rclone/rclone/releases/download/v1.69.1/rclone-v1.69.1-linux-amd64.zip -o rclone.zip
    unzip -p rclone.zip rclone-*-linux-amd64/rclone | sudo tee /usr/local/bin/rclone > /dev/null
    sudo chmod +x /usr/local/bin/rclone
    rm rclone.zip
    
  2. Generate the rclone.conf configuration file. On the Kubernetes control-plane node, run:

    kubectl -n d8-observability-platform get secrets backup-s3 loki-s3 mimir-s3 -o json | jq -r \
      --arg endpoint $(kubectl get mc observability-platform -o json | jq -r '"https://s3." + .spec.settings.general.baseDomain') \
      --argjson buckets $(kubectl -n d8-observability-platform get cm backup-s3 loki-s3 mimir-s3 -o json | jq -cM 'reduce .items[] as $elt ({}; .[$elt.metadata.name] = $elt.data.BUCKET_NAME)') \
      '.items[] | reduce . as $elt ({}; .[$elt.metadata.name] += [($elt.data | map_values(@base64d) | with_entries(.key |= ascii_downcase) | with_entries(.key |= sub("^aws_"; "")) | . += {type: "s3", provider: "Ceph", endpoint: $endpoint} | to_entries[] | "\(.key) = \(.value)")] | .[($elt.metadata.name|sub("-s3$"; ""))] = ["type = alias", "remote = " + ($elt.metadata.name + ":" + $buckets[$elt.metadata.name])]) | to_entries[] | "[\(.key)]\n\(.value|join("\n"))\n"'
    
  3. Save the output as rclone.conf on the backup server.

  4. Verify access to the bucket:

    rclone --config rclone.conf ls backup:
    
  5. Use rclone sync or rclone copy commands to perform backups.

Example: Metrics Backup with rclone

rclone --config rclone.conf sync -v --delete-before --exclude-if-present deletion-mark.json --exclude '*/markers/*' --exclude '*/bucket-index.json.gz' mimir: /backup/prod/mimir/

This command synchronizes the bucket with a local backup directory, excluding marker files, bucket indexes, and blocks marked for deletion. Because sync is run with --delete-before, files that no longer exist in the source bucket (for example, blocks removed by compaction) are deleted from the local copy before new data is transferred.
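The loki and backup buckets can be synchronized the same way; a minimal sketch (the destination paths are illustrative):

rclone --config rclone.conf sync -v --delete-before loki: /backup/prod/loki/
rclone --config rclone.conf sync -v --delete-before backup: /backup/prod/backup/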

Data Recovery

Restoring metrics and logs involves loading the backup contents back into the appropriate S3 buckets. You can use mounted buckets or rclone in the same way as during backup.

First, all blocks except meta.json are uploaded. Then, meta.json files are uploaded in a separate step. This ensures that only complete blocks are seen and processed by the system.

Example:

rclone --config rclone.conf sync /backup/prod/mimir/ mimir: --exclude '*/meta.json'
rclone --config rclone.conf sync /backup/prod/mimir/ mimir: --include '*/meta.json'
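After the second pass you can spot-check that the meta.json files are present in the bucket (a sketch using the remotes generated earlier):

rclone --config rclone.conf lsf -R mimir: --include '*/meta.json' | head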

PostgreSQL Database Recovery

This procedure applies when the observability-platform module is deployed with an internal PostgreSQL database. If you are using an external database (.spec.settings.ui.postgres.mode: External), follow your DB provider’s recovery instructions.

  1. Stop the backend and alertgate components and delete the CronJobs that interact with the database:

    kubectl -n d8-observability-platform scale deploy backend alertgate-receiver alertgate-sender alertgate-api --replicas=0
    kubectl -n d8-observability-platform delete cronjob backend-clean-silences host-load postgres-backup
    
  2. Make sure you have a database dump and that its size is plausible (a truncated dump can restore without errors but lose data). Copy the dump to the control-plane node.
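    For example (the dump name is illustrative; the dump is plain-format SQL, since it is restored with psql in step 5):

    gzip -t dop-202504211200.dump.gz
    zcat dop-202504211200.dump.gz | tail -n 5   # a complete dump ends with "-- PostgreSQL database dump complete"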

  3. Drop the existing PostgreSQL database:

    kubectl -n d8-observability-platform exec -it $(kubectl -n d8-observability-platform get po -l spilo-role=master -o name) -- psql -U dop -c "DROP DATABASE dop;" postgres
    
  4. Create a new dop database:

    kubectl -n d8-observability-platform exec -it $(kubectl -n d8-observability-platform get po -l spilo-role=master -o name) -- psql -U dop -c "CREATE DATABASE dop;" postgres
    
  5. Restore the database from the dump:

    zcat dop-202504211200.dump.gz | kubectl -n d8-observability-platform exec -i $(kubectl -n d8-observability-platform get po -l spilo-role=master -o name) -- psql -U dop dop
    
  6. Scale the backend and alertgate components back up (adjust --replicas if your installation runs a different number of replicas):

    kubectl -n d8-observability-platform scale deploy backend alertgate-receiver alertgate-sender alertgate-api --replicas=2
    
  7. Verify that all components are running:

    kubectl -n d8-observability-platform get po -l 'app in (backend,alertgate-receiver,alertgate-sender,alertgate-api)'
    
  8. Check that the web interface is accessible.
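    For example (the hostname is illustrative; substitute your installation's UI address):

    curl -sk -o /dev/null -w '%{http_code}\n' https://dop.example.com/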