Module lifecycle stage: General Availability
The module has requirements for installation

Deckhouse Commander internals

Deckhouse Commander components

Diagram

Deckhouse Commander has one external dependency: a PostgreSQL database.

The API server is the central component. Data is stored in PostgreSQL. Options for installing Deckhouse Commander with a DBMS are listed in the Installation section.

The API server provides both external APIs (for the web application and for external integrations) and internal APIs for working with clusters.

Web Application uses the API to manage clusters and other Commander entities. It also provides seamless access to the application cluster’s admin web UI (DKP UI, the console module): browsing Kubernetes resources and the web terminal. Requests to the application cluster’s Kubernetes API are routed through Commander under the same account the user signed in with.

Asynchronous operations — tasks — are used to manage clusters. The cluster manager is a service that monitors tasks and executes them. Tasks can be cluster installation, cluster deletion, or cluster state reconciliation with the specified configuration.

When a cluster is created, an installation task is created. Then a free cluster manager instance takes the task to work on. The same happens for cluster update, delete, or reconciliation operations.

After acquiring a task, the cluster manager determines which Deckhouse Kubernetes Platform (DKP) version the task requires, starts or reuses a running dhctl server of that version, and runs the required operation in it. One dhctl server performs only one operation at a time. The cluster manager automatically starts and stops dhctl servers and scales the number of their replicas based on the number of incoming tasks. This lets the system adapt to the current load.
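
The reuse-or-start behavior described above can be modeled in a few lines. This is a minimal Python sketch, not Commander's actual implementation: the DhctlServer and DhctlPool classes and the version string are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DhctlServer:
    """Hypothetical stand-in for one dhctl server replica."""
    version: str
    busy: bool = False


@dataclass
class DhctlPool:
    """Toy model of how a cluster manager might pick a dhctl server."""
    servers: list = field(default_factory=list)

    def acquire(self, version: str) -> DhctlServer:
        # Reuse an idle server of the required DKP version if one exists.
        for server in self.servers:
            if server.version == version and not server.busy:
                server.busy = True
                return server
        # Otherwise start a new one: a server runs only one operation at a time.
        server = DhctlServer(version=version, busy=True)
        self.servers.append(server)
        return server

    def release(self, server: DhctlServer) -> None:
        # Mark the server idle so a later task of the same version can reuse it.
        server.busy = False


pool = DhctlPool()
first = pool.acquire("v1.70")    # no server yet, so one is started
second = pool.acquire("v1.70")   # the first one is busy, so another is started
pool.release(first)
third = pool.acquire("v1.70")    # the idle server is reused
```

In this model the pool only grows; stopping idle servers, which the cluster manager also does, is left out to keep the sketch short.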

Every application cluster runs the commander-agent module. Deckhouse Commander enables it automatically. After the initial cluster installation, the agent opens a reverse TLS tunnel to the management cluster and keeps it open. With the tunnel in place, Commander does not need inbound network reachability to the application cluster’s Kubernetes API. The tunnel carries traffic to the application cluster’s Kubernetes API: it is used during Check (check) and Change (converge) phases to reconcile the application cluster’s infrastructure configuration with the desired configuration (creating, deleting, and upgrading nodes, upgrading DKP components), and by the cluster admin web UI to retrieve objects from the application cluster’s Kubernetes API.

Cluster telemetry is sent through the same Commander HTTPS API endpoint that the agent uses to establish the tunnel. Telemetry includes basic metrics (total number of CPUs, total memory, number of nodes, and total storage), DKP and Kubernetes versions, and DKP component availability.

The resource conversion mechanism also uses this same Commander HTTPS API: the agent requests the desired configuration, applies it in the application cluster, and sends back the status and the applied configuration.

Interaction with the infrastructure provider API is separate and can originate from either the management cluster or the application cluster, without using the commander-agent reverse TLS tunnel.

Direct SSH to a master node (22/TCP) is used during the initial cluster installation and deletion, during attach and detach operations, and as a fallback when commander-agent cannot establish or hold the reverse tunnel. The agent itself runs on the application cluster’s master nodes by default.

When the billing feature flag is on, Commander also collects resource-consumption metrics from every application cluster to calculate cost. These metrics are delivered to a dedicated long-term Prometheus in the management cluster through Prometheus remote write. Remote write uses the same Commander HTTPS API endpoint where the agent opens the tunnel, but it is a separate push mechanism from the application cluster, not traffic over the reverse TLS tunnel. Commander renders billing dashboards and exports CSV cost reports from the stored data.

On the application-cluster side, commander-agent handles billing integration:

  • it configures the remote-write stream of resource-consumption metrics;
  • it reports node group, cloud instance-class, and StorageClass descriptions to Commander — these are used in the billing UI to bind compute classes and storage classes to real cluster objects;
  • it labels every node with billing.commander.deckhouse.io/name, so the node can be matched to its compute class and priced according to the active tariff.

Commander builds the binding contract (which nodes should carry which label) and delivers it to the agent. The agent directly patches Node objects, adding the label to every node that belongs to the bound node group.

Component placement and networking

Deckhouse Commander can be enabled in any DKP cluster. For the application clusters it manages, that cluster becomes the management cluster. Other Deckhouse system modules can run in the same cluster alongside Commander. In a shared setup, user workloads may also run in this cluster — on separate nodes.

By default, Commander runs on nodes with the commander role (label node-role.deckhouse.io/commander); if there are none, it falls back to system nodes (node-role.deckhouse.io/system). To pin Commander to specific nodes, set the nodeSelector parameter. If those nodes carry a taint, also set tolerations. Billing components inherit the same settings.

commander-agent runs in every application cluster and, by default, on master nodes. It keeps a persistent reverse TLS tunnel to the management cluster, so during normal operation Commander does not open inbound connections to the application cluster’s Kubernetes API. Direct SSH to a master node (22/TCP) is used during the initial cluster installation, deletion, attach and detach operations, and as a fallback when the reverse tunnel is unavailable.

Network connectivity requirements

The following network connectivity is required between the management and application clusters:

  • 22/TCP from dhctl servers to all master nodes of application clusters — for the initial cluster installation, deletion, attach and detach operations, and as a fallback when the reverse tunnel is unavailable.
  • Access from dhctl servers to cloud provider APIs — for managing infrastructure resources of application clusters.
  • 443/TCP from the application cluster to the management cluster — through ingress on the management cluster frontend nodes. The actual port depends on the ingress controller settings. Traffic goes to two domains:
    • Commander domain (commander.<publicDomainTemplate>) — commander-agent uses it to establish the reverse TLS tunnel, send telemetry, and run the configuration conversion mechanism. Prometheus in the application cluster also sends billing remote write data to the /prometheus/api/v1/write path on this same domain. Through the reverse tunnel, Commander accesses the application cluster’s Kubernetes API for Check and Change phases and the cluster admin web UI.
    • Dex domain (dex.<publicDomainTemplate>) — Dex in the application cluster connects to it for OIDC discovery and token validation. This is required for DexProvider/commander, which lets the application cluster trust accounts from the management cluster.
  • Access to the container registry from which the dhctl image is pulled — from the nodes running the cluster manager, because it accesses the registry directly, and from the nodes running dhctl servers, because those nodes pull the image of the required dhctl version.

The “agent → management cluster” channel only works over HTTPS. Running Deckhouse Commander without HTTPS is not supported.
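
Before installing or attaching a cluster, it can help to confirm that the required ports are reachable. The sketch below is one possible pre-flight check in Python, run from a host in the application cluster network. The host names in CHECKS are placeholders, not real endpoints, and a successful TCP connection says nothing about TLS or HTTP behavior behind the port.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


# Hypothetical endpoints; substitute the real Commander and Dex domains.
CHECKS = [
    ("commander.example.com", 443),
    ("dex.example.com", 443),
]


def report(checks):
    """Map "host:port" to a reachability flag for every endpoint."""
    return {f"{host}:{port}": tcp_reachable(host, port) for host, port in checks}
```

Calling report(CHECKS) returns a dictionary such as {"commander.example.com:443": True, ...}. The same function can be pointed at 22/TCP on master nodes when run from the management cluster side.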

Network quality requirements and timeouts

Deckhouse Commander is designed to work over the public internet and tolerates unstable network links. There are no strict latency requirements — all interactions use retry logic and persistent connections with automatic reconnection.

Key timeouts that affect operation:

Component | Parameter | Value
SSH (bootstrap, destroy, attach, detach) | Connect timeout | 10 sec
SSH | KeepAlive interval | 15 sec
Agent → Commander API (all requests, including resource conversion) | HTTP timeout | 30 sec
Agent → Commander API | Retry on error | up to 3 attempts
Agent: resource sync interval | Converge interval | 30 sec
Reverse tunnel (AMPG) | TCP KeepAlive | enabled (OS default)
Reverse tunnel (AMPG): backend connection wait | Acquire timeout | 60 sec
Dex: ID token lifetime | idTokenTTL | 10 min (configurable)
Dex: auth request lifetime | authRequests | 10 min
Prometheus remote write | Send timeout | 30 sec (Prometheus default)

When connectivity between the application and management clusters is lost:

  • The reverse TLS tunnel (yamux) automatically reconnects on disconnect.
  • The agent continues trying to reach the Commander API every 30 seconds.
  • Prometheus buffers remote write metrics and will deliver them once connectivity is restored.
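
The agent-side values listed above (a 30-second HTTP timeout and up to 3 attempts) follow a common retry pattern. The Python sketch below shows that pattern in general form and is not the agent's code: call_with_retry is a hypothetical helper, and the pause between attempts is an assumption because the source does not document the actual backoff.

```python
import time

HTTP_TIMEOUT_SEC = 30   # per-request HTTP timeout listed above
MAX_ATTEMPTS = 3        # "up to 3 attempts"


def call_with_retry(request, attempts=MAX_ATTEMPTS, delay_sec=1.0, sleep=time.sleep):
    """Call request(timeout=...) until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return request(timeout=HTTP_TIMEOUT_SEC)
        except OSError as error:   # covers connection errors and timeouts
            last_error = error
            if attempt < attempts:
                sleep(delay_sec)   # assumed pause; the real backoff is not documented
    raise last_error
```

Here request is any callable that accepts a timeout keyword, for example a thin wrapper around urllib.request.urlopen. Once the attempts are exhausted the last error is re-raised, and a caller that runs on a 30-second interval simply tries again on its next cycle.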

Application cluster configuration and resource synchronization

Each application cluster is synchronized through two independent channels — infrastructure configuration and Kubernetes resources have their own executors and sources of truth. These are two separate processes, even though both are shown on the cluster page.

Infrastructure configuration: Change and Check operations

Infrastructure configuration covers the tabs in the Infrastructure group on the cluster page: Kubernetes, Placement, Post-bootstrap script, and SSH Parameters. Deckhouse Commander applies and verifies this configuration directly from the management cluster:

  • Change brings the cluster infrastructure — nodes, control plane, initial DKP configuration, provider resources — to the desired state. It runs on cluster creation and when edits to the infrastructure tabs are saved in the web UI. In the Auto change application mode (see below) it runs automatically; in Manual mode it goes through a change request that requires approval.
  • Check compares the desired and the actual infrastructure state without making changes. It runs automatically at the configured frequency (the Reconciliation Interval in the workspace or per-cluster parameters).

If an operation fails, retry it manually with the Retry button on the cluster page.

The change application mode and the reconciliation interval are configured on the workspace Parameters → Settings tab and may be overridden per cluster.

These settings do not control Kubernetes resource group synchronization. When the change application mode is switched from Auto to Manual, commander-agent continues to synchronize resource groups inside the application cluster according to their control modes.

Kubernetes resource groups: agent-driven synchronization

The “Kubernetes” group on the cluster page contains numbered tabs — Kubernetes resource groups. Each group is a set of YAML manifests that commander-agent applies inside the application cluster. These groups are unrelated to the Change and Check operations: a separate agent loop reconciles them, independently of any infrastructure operations that may be running at the same time.

Each group has one of the following synchronization modes, set in the cluster template and switchable on the cluster page:

  • Force Creation — the agent applies the group’s manifests on every reconcile cycle and restores any modified or deleted resources;
  • Create on Install — the group’s resources are created only during cluster installation. The agent does not control them during the cluster lifecycle; subsequent in-cluster edits are not reverted;
  • Ignored — group control is disabled in Commander: previously applied resources stay in the application cluster, but the agent stops synchronizing them. After that the resources become ordinary Kubernetes objects — they can be edited or deleted manually.
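
The three modes reduce to a small decision rule. The Python sketch below is an illustrative model only: the SyncMode enum and the should_apply function are hypothetical names, and the installing flag stands for "this cycle is part of cluster installation".

```python
from enum import Enum


class SyncMode(Enum):
    FORCE_CREATION = "Force Creation"
    CREATE_ON_INSTALL = "Create on Install"
    IGNORED = "Ignored"


def should_apply(mode: SyncMode, installing: bool) -> bool:
    """Decide whether a group's manifests are applied in this cycle."""
    if mode is SyncMode.FORCE_CREATION:
        return True            # applied on every reconcile cycle
    if mode is SyncMode.CREATE_ON_INSTALL:
        return installing      # created only during cluster installation
    return False               # Ignored: the agent leaves the resources alone
```

For example, should_apply(SyncMode.CREATE_ON_INSTALL, installing=False) is False, which matches the rule that later in-cluster edits to such resources are not reverted.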

Issues with commander-agent (loss of connectivity, errors when applying manifests) surface on the cluster page: in the overall cluster status, the synchronization badge next to the cluster name, and the Kubernetes tab (per-group and per-resource details). See Cluster status in the user guide for the full list of cluster statuses.

Protection of controlled resources

To keep out-of-band changes from causing drift from the desired configuration, resources under active synchronization are protected from external modifications. The protection covers resources from groups in “Force Creation” mode and Commander’s own service resources (DexProvider, billing, RBAC, projects, agent secrets). An attempt to edit or delete such a resource via kubectl is rejected by the application cluster’s API server. To exempt a group from this protection, switch it to “Ignored”; after that its resources become ordinary Kubernetes objects.

A subset of Commander’s service resources (RBAC, projects, Dex and billing configurations) is protected more strictly: when the corresponding group leaves Commander control or its source becomes temporarily unavailable, such resources are deleted rather than left in the cluster without Commander supervision.

Enabling and disabling Deckhouse Commander capabilities

Certain Deckhouse Commander capabilities are enabled and disabled via the commander ModuleConfig, field spec.settings.featureFlags. Apply changes in the management cluster where Deckhouse Commander runs.
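
When feature flags are toggled from scripts, the merge-patch body can be generated rather than typed by hand. This is a minimal Python sketch: feature_flag_patch is a hypothetical helper, and it relies only on the spec.settings.featureFlags path named above.

```python
import json


def feature_flag_patch(**flags: bool) -> str:
    """Build a JSON merge patch for spec.settings.featureFlags."""
    patch = {"spec": {"settings": {"featureFlags": flags}}}
    # Compact separators give the single-line form that is convenient
    # to pass to `d8 k patch mc commander --type merge --patch '...'`.
    return json.dumps(patch, separators=(",", ":"))


print(feature_flag_patch(billingEnabled=True))
# prints {"spec":{"settings":{"featureFlags":{"billingEnabled":true}}}}
```

Because a merge patch changes only the keys it names, a patch for one flag leaves the other flags untouched.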

Billing and cost management — billingEnabled

Parameter | Description
Default | Off (false) until billingEnabled: true is set explicitly
When enabled | Billing components are deployed; the Billing section appears in the UI
When disabled | Set billingEnabled: false

Enable:

d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"billingEnabled":true}}}}'

After billing is enabled, the Billing section becomes available in the top navigation bar on the workspace list screen or at URL {COMMANDER_ADDRESS}/billing/.

In addition, the commander-agent component in every application cluster starts:

  • configuring the local Prometheus to send resource consumption metrics (CPU, memory, storage) to Prometheus in the management cluster — a PrometheusRemoteWrite resource is created in the application cluster for this purpose. See How Commander calculates cost for details;
  • sending Commander the descriptions of node groups, cloud-provider instance classes, and Kubernetes storage classes — these are used in the billing UI to bind compute classes and storage classes to real cluster objects;
  • adding a service label billing.commander.deckhouse.io/name to every node, with the name of the compute class the node belongs to. Commander uses this label to match the node with its compute class and apply the price defined in the current tariff to the node’s consumption.

Commander builds the binding contract (node group → label value) and delivers it to the agent. The agent directly patches Node objects, adding the label to every node of the bound node group.
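
The mapping from a binding contract to node labels can be illustrated as follows. This Python sketch is not the agent's code: plan_label_patches is a hypothetical function, the compute class name is made up, and reading the node group from the node.deckhouse.io/group label is an assumption made for the example.

```python
BILLING_LABEL = "billing.commander.deckhouse.io/name"
# Assumption: the node group of a node is read from this label.
NODE_GROUP_LABEL = "node.deckhouse.io/group"


def plan_label_patches(contract, nodes):
    """Return {node name: label patch} for nodes whose label must change.

    contract: {node group name: compute class name}
    nodes:    {node name: current labels}
    """
    patches = {}
    for name, labels in nodes.items():
        wanted = contract.get(labels.get(NODE_GROUP_LABEL))
        # Skip nodes outside bound groups and nodes that are already labeled.
        if wanted is not None and labels.get(BILLING_LABEL) != wanted:
            patches[name] = {"metadata": {"labels": {BILLING_LABEL: wanted}}}
    return patches


contract = {"worker": "standard-4x8"}   # hypothetical compute class name
nodes = {
    "worker-0": {NODE_GROUP_LABEL: "worker"},
    "worker-1": {NODE_GROUP_LABEL: "worker", BILLING_LABEL: "standard-4x8"},
    "master-0": {NODE_GROUP_LABEL: "master"},
}
patches = plan_label_patches(contract, nodes)
```

With this input only worker-0 receives a patch: worker-1 already carries the label, and master-0 belongs to a node group that is not bound.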

Consumption metrics are stored in a dedicated billing Prometheus (billing-prometheus) in the management cluster. It stores only the data required for cost calculation: container CPU and memory consumption, pod and PVC resource requests, volume usage, pod phases, controller-to-pod relationships, namespace, pod, and node labels, and PVC information.

Prometheus in the application cluster sends this data to the management cluster through Prometheus remote write over the Commander HTTPS API. This is a separate Prometheus HTTP(S) request; commander-agent does not send metrics itself. The agent configures the PrometheusRemoteWrite resource and passes connection parameters to Prometheus.

Billing metadata is stored in the Commander database. This includes tariffs, compute classes, storage classes, resource bindings, and report schedules. Generated reports use separate billing-reports storage.

See the billing documentation for how this data is used to compute cost.

If the section does not appear, check access management in the billing documentation.

Disable:

d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"billingEnabled":false}}}}'

After disabling, delete the remaining resources manually if needed:

d8 k -n d8-commander delete pvc -l app=billing-reports
d8 k -n d8-commander delete pvc -l app=billing-prometheus

User documentation: Billing and cost management.

Projects — projectsEnabled

Parameter | Description
Default | Off (false) until projectsEnabled: true is set explicitly
When enabled | The Projects section is available in the UI
When disabled | Set projectsEnabled: false

Enable:

d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"projectsEnabled":true}}}}'

Disable:

d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"projectsEnabled":false}}}}'

User documentation: User guide — Projects.

Authentication in application clusters via DexProvider

Every cluster attached to Deckhouse Commander automatically trusts users authenticated by the management cluster’s Dex. Once a user signs into Commander, they can open the application cluster’s admin web UI (DKP UI, the console module) without signing in again.

This does not require a separate account on the application cluster.

Commander reconciles the trust link automatically through a pair of resources:

  • In the application cluster — a DexProvider named commander, type OIDC. Its issuer points at the management cluster’s Dex URL. Its trust bundle carries the management cluster’s root CA, so TLS to the Dex endpoint validates without extra system-wide trust anchors. The provider requests the openid, profile, email, and groups scopes. This single DexProvider is enough for the application cluster to rely on management-cluster identities; clusters may still have additional DexProvider resources for local identities.
  • In the management cluster — a paired DexClient named commander-agent-<id> and a Secret with its OIDC client secret. The DexClient carries a redirect URI pointing to the application cluster’s Dex, so the management-cluster Dex correctly accepts login callbacks.

Both resources are created when a cluster is installed or attached, and removed on detach. No manual steps are required. If a resource is edited or removed by hand, the next reconciliation restores it.

How sign-in works

When a user opens the application cluster’s admin web UI from Commander, the browser reaches the application cluster’s Dex. The application cluster’s Dex uses the commander provider and redirects the browser to the management cluster’s Dex.

The management cluster’s Dex authenticates the user and redirects the browser back to the application cluster’s Dex using a redirect URI that belongs to that application cluster. The application cluster’s Dex then completes sign-in for the cluster admin web UI. If the user already has an active session in the management cluster’s Dex after signing in to Commander, they usually do not need to enter their login and password again.

The allowedUserGroups parameter of the commander module only controls who can sign in to the Commander web UI. It does not grant any permissions inside Deckhouse Commander itself; roles are configured separately. See Access control for details.

Do not edit, disable, or delete the auto-managed DexProvider/commander in an application cluster or the paired DexClient and its Secret in the management cluster. Removing them breaks all Commander-mediated logins into the application cluster. The next reconciliation restores them, so manual tuning is lost in any case. If you need additional authentication settings for an application cluster, add extra DexProvider resources alongside the managed one — the Commander-managed provider is the minimum required to keep the cluster reachable from Commander and is not exclusive.

Data encryption

Deckhouse Commander encrypts sensitive data stored in the database using keys that are automatically generated when the module is enabled and stored in the commander-envs secret.

Save these keys to a secure location so that the database can be restored if something goes wrong. Without the keys, the data cannot be restored!

$ d8 k -n d8-commander get secret commander-envs -oyaml
apiVersion: v1
data:
  ACTIVE_RECORD_ENCRYPTION_DETERMINISTIC_KEY: YVBBNVh5QUxoZjc1Tk5uTXphc3BXN2FrVGZacDBsUFk=
  ACTIVE_RECORD_ENCRYPTION_KEY_DERIVATION_SALT: eEVZMGR0NlRaY0FNZzUySzdPODR3WXpranZiQTYySHo=
  ACTIVE_RECORD_ENCRYPTION_PRIMARY_KEY: RUdZOFdodWxVT1hpeHlib2Q3Wld3TUlMNjhSOW81a0M=
kind: Secret
metadata:
...
  name: commander-envs
  namespace: d8-commander
type: Opaque
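
The values under data are base64-encoded. When the keys are copied into a password manager or vault, they can be decoded with a short script. This is a minimal Python sketch that assumes the Secret was exported in JSON form (for example with -o json instead of -oyaml); decode_secret_data is a hypothetical helper, and the sample reuses one of the example values shown above.

```python
import base64
import json


def decode_secret_data(secret_json: str) -> dict:
    """Decode the base64 values under .data of a Secret given as JSON."""
    data = json.loads(secret_json).get("data", {})
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Sample input shaped like the JSON form of the commander-envs Secret.
sample = json.dumps({
    "kind": "Secret",
    "data": {
        "ACTIVE_RECORD_ENCRYPTION_PRIMARY_KEY": "RUdZOFdodWxVT1hpeHlib2Q3Wld3TUlMNjhSOW81a0M=",
    },
})
keys = decode_secret_data(sample)
```

The decoded values are the encryption keys themselves, so treat the script output with the same care as the Secret.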

Collecting change history logs

In Deckhouse Commander version 1.9 and later, change history events are written to standard output and tagged with ["audit"]. These logs can be collected and shipped with the log-shipper module.

Log example:

{"level":"INFO","time":"2025-06-18 14:22:15 +0300","request_id":"ea09d409dc3c95dcf658fc2c2838084b","pid":19,"tags":["audit"],"auditable_type":"ClusterSettings","auditable_id":"8a0041ef-6c30-48bc-b3ca-e9db3e22be47","action":"create","user_type":"User","remote_address":"82.150.57.81","request_uuid":"ea09d409dc3c95dcf658fc2c2838084b","workspace_slug":"xcjtd","user_name":"admin@company.my","audited_changes":{"cluster_manager":{"sync":{"mode":"auto"},"check_interval":1}}}
{"level":"INFO","time":"2025-06-18 14:22:15 +0300","request_id":"ea09d409dc3c95dcf658fc2c2838084b","pid":19,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"056f7fe5-7d22-4a76-b5e2-f225c0a99613","action":"create","user_type":"User","remote_address":"82.150.57.81","request_uuid":"ea09d409dc3c95dcf658fc2c2838084b","workspace_slug":"xcjtd","user_name":"admin@company.my","audited_changes":{"name":"mycluster","archived_at":null}}
{"level":"INFO","time":"2025-06-18 14:23:57 +0300","request_id":"a1eaf50bbc87a8cca4cd17d8be8fffdb","pid":12,"tags":["audit"],"auditable_type":"ClusterSettings","auditable_id":"707c46b1-b2c8-4fab-9392-8216a2058219","action":"create","user_type":"AuthToken","remote_address":"238.106.231.86","request_uuid":"a1eaf50bbc87a8cca4cd17d8be8fffdb","workspace_slug":"bfqcc","user_name":"api-user","audited_changes":{"cluster_manager":{"sync":{"mode":"auto"},"check_interval":1}}}
{"level":"INFO","time":"2025-06-18 14:23:57 +0300","request_id":"a1eaf50bbc87a8cca4cd17d8be8fffdb","pid":12,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"42d432aa-8250-4ef0-b260-51639e1445d0","action":"create","user_type":"AuthToken","remote_address":"238.106.231.86","request_uuid":"a1eaf50bbc87a8cca4cd17d8be8fffdb","workspace_slug":"bfqcc","user_name":"api-user","audited_changes":{"name":"15731486914-1-con-1-30","archived_at":null}}
{"level":"INFO","time":"2025-06-18 14:28:56 +0300","request_id":"069566a46c004e53b686189587d484a9","pid":19,"tags":["audit"],"auditable_type":"ClusterSettings","auditable_id":"402a4d4d-5c14-4466-a1f3-3d990d7cf35a","action":"create","user_type":"User","remote_address":"30.231.184.26","request_uuid":"069566a46c004e53b686189587d484a9","workspace_slug":"xcjtd","user_name":"user@company.my","audited_changes":{"cluster_manager":{"sync":{"mode":"auto"},"check_interval":1}}}
{"level":"INFO","time":"2025-06-18 14:28:56 +0300","request_id":"069566a46c004e53b686189587d484a9","pid":19,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"9ee687d4-18fe-423c-bbaa-e8e46ea47e67","action":"create","user_type":"User","remote_address":"30.231.184.26","request_uuid":"069566a46c004e53b686189587d484a9","workspace_slug":"xcjtd","user_name":"user@company.my","audited_changes":{"name":"mycluster2","archived_at":null}}
{"level":"INFO","time":"2025-06-18 14:29:06 +0300","request_id":"d29b248fbce414db8b71f821a3b1886e","pid":12,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"e0f3c3de-2129-4b75-b927-72a8eb26902b","action":"update","user_type":"User","remote_address":"30.231.184.26","request_uuid":"d29b248fbce414db8b71f821a3b1886e","workspace_slug":"xcjtd","user_name":"user@company.my","audited_changes":{"archived_at":[null,"2025-06-18T14:29:05.943+03:00"]}}
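
Each record is a single-line JSON object, so audit events can also be filtered outside of log-shipper, for example when inspecting pod logs by hand. This is a minimal Python sketch; audit_events is a hypothetical helper, and the sample lines are shortened versions of the records above.

```python
import json


def audit_events(lines):
    """Yield parsed log records that carry the "audit" tag."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(record, dict) and "audit" in (record.get("tags") or []):
            yield record


sample = [
    '{"level":"INFO","tags":["audit"],"auditable_type":"Cluster","action":"create","user_name":"admin@company.my"}',
    '{"level":"INFO","message":"not an audit record"}',
    'plain text line',
]
for event in audit_events(sample):
    print(event["user_name"], event["action"], event["auditable_type"])
# prints: admin@company.my create Cluster
```

The function accepts any iterable of lines, so it can read from an open file or from standard input.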

Configuration example:

apiVersion: deckhouse.io/v1alpha2
kind: ClusterLoggingConfig
metadata:
  name: commander-audit-logs
spec:
  destinationRefs:
  - loki-example
  kubernetesPods:
    labelSelector:
      matchLabels:
        app: backend
    namespaceSelector:
      labelSelector:
        matchLabels:
          kubernetes.io/metadata.name: d8-commander
  labelFilter:
  - field: message
    operator: Regex
    values:
    - .*\[\"audit\"\].*
  type: KubernetesPods
---
apiVersion: deckhouse.io/v1alpha1
kind: ClusterLogDestination
metadata:
  name: loki-example
spec:
  type: Loki
  loki:
    endpoint: http://loki-example.loki.svc:3100

For more detailed configuration information, see the documentation of the log-shipper module.

Changing the storage class

Option 1 (preferred)

  1. Perform a backup of the database instance

    d8 k -n d8-commander exec -t commander-postgres-0 -- su - postgres -c "pg_dump -Fc -b -v -d commander" > commander.dump
  2. Change storageClass in the module settings, replacing <NEW_STORAGECLASS_NAME> with the name of the necessary storage class

    The list of available storage classes can be found using the command d8 k get storageclasses

    d8 k patch moduleconfig commander --type=merge -p '{"spec":{"settings":{"postgres":{"internal":{"storageClass":"<NEW_STORAGECLASS_NAME>"}}}}}'
    
    moduleconfig.deckhouse.io/commander patched

    Wait until the Deckhouse queue is empty

    d8 system queue main
    
    Queue 'main': length 0, status: 'waiting for task 5s'

    Check the logs of the postgres operator

    d8 k -n d8-operator-postgres logs deployments/operator-postgres
    
    {"cluster-name":"d8-commander/commander-postgres","level":"info","msg":"cluster has been updated","pkg":"controller","time":"2024-05-19T20:36:22Z","worker":0}
  3. Increase the number of replicas of the PostgreSQL database (optional)

    Skip this step if HighAvailability mode is active and PostgreSQL has 2 replicas

    d8 k -n d8-commander patch postgresqls.acid.zalan.do commander-postgres --type=merge -p '{"spec":{"numberOfInstances":2}}'
    
    postgresql.acid.zalan.do/commander-postgres patched

    Check the logs of the operator and the postgres instance

    d8 k -n d8-operator-postgres logs deployments/operator-postgres
    
    {"cluster-name":"d8-commander/commander-postgres","level":"info","msg":"cluster has been updated","pkg":"controller","time":"2024-05-19T20:36:22Z","worker":0}
    d8 k -n d8-commander logs commander-postgres-1
    
    2024-05-19 20:38:15,648 INFO: no action. I am (commander-postgres-1), a secondary, and following a leader (commander-postgres-0)
  4. Perform the master switch

    d8 k -n d8-commander exec -it commander-postgres-0 -- patronictl failover
    
    Current cluster topology
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.167 | Leader  | running |  5 |           |
    | commander-postgres-1 | 10.111.2.239 | Replica | running |  5 |         0 |
    +----------------------+--------------+---------+---------+----+-----------+
    Candidate ['commander-postgres-1'] []: commander-postgres-1
    Are you sure you want to failover cluster commander-postgres, demoting current leader commander-postgres-0? [y/N]: y
    2024-05-19 20:40:52.63041 Successfully failed over to "commander-postgres-1"
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.167 | Replica | stopped |    |   unknown |
    | commander-postgres-1 | 10.111.2.239 | Leader  | running |  5 |           |
    +----------------------+--------------+---------+---------+----+-----------+

    Make sure that both DB instances are in the running state

    d8 k -n d8-commander exec -t commander-postgres-0 -- patronictl list
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.167 | Replica | running |  6 |         0 |
    | commander-postgres-1 | 10.111.2.239 | Leader  | running |  6 |           |
    +----------------------+--------------+---------+---------+----+-----------+

    Check that the disk of the new DB replica was created with the necessary storageClass

    d8 k -n d8-commander get pvc --selector application=spilo
    NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pgdata-commander-postgres-0   Bound    pvc-fd80fde4-d0e2-4b5f-9e3a-eac998191f11   2Gi        RWO            network-hdd    36h
    pgdata-commander-postgres-1   Bound    pvc-7af2f442-3097-4fe3-a795-5ad18bb11351   2Gi        RWO            network-ssd    2m54s
  5. Delete the disk and pod of the first postgres instance

    d8 k -n d8-commander delete pvc pgdata-commander-postgres-0 --wait=false
    d8 k -n d8-commander delete po commander-postgres-0

    Check logs

    d8 k -n d8-commander logs commander-postgres-0
    
    2024-05-19 20:43:33,293 INFO: Lock owner: commander-postgres-1; I am commander-postgres-0
    2024-05-19 20:43:33,293 INFO: establishing a new patroni connection to the postgres cluster
    2024-05-19 20:43:33,357 INFO: no action. I am (commander-postgres-0), a secondary, and following a leader (commander-postgres-1)

    Check that the disk was created with the correct storageClass

    d8 k -n d8-commander get pvc
    NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pgdata-commander-postgres-0   Bound    pvc-fd80fde4-d0e2-4b5f-9e3a-eac998191f11   2Gi        RWO            network-ssd    2m6s
    pgdata-commander-postgres-1   Bound    pvc-7af2f442-3097-4fe3-a795-5ad18bb11351   2Gi        RWO            network-ssd    7m11s
  6. Perform the master switch one more time

    d8 k -n d8-commander exec -it commander-postgres-0  -- patronictl failover
    
    Current cluster topology
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Replica | running |  6 |         0 |
    | commander-postgres-1 | 10.111.2.239 | Leader  | running |  6 |           |
    +----------------------+--------------+---------+---------+----+-----------+
    Candidate ['commander-postgres-0'] []: commander-postgres-0
    Are you sure you want to failover cluster commander-postgres, demoting current leader commander-postgres-1? [y/N]: y
    2024-05-19 20:46:11.69855 Successfully failed over to "commander-postgres-0"
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Leader  | running |  6 |           |
    | commander-postgres-1 | 10.111.2.239 | Replica | stopped |    |   unknown |
    +----------------------+--------------+---------+---------+----+-----------+

    Make sure that both DB instances are in the running state

    d8 k -n d8-commander exec -t commander-postgres-0 -- patronictl list
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Leader  | running |  6 |           |
    | commander-postgres-1 | 10.111.2.239 | Replica | running |  6 |         0 |
    +----------------------+--------------+---------+---------+----+-----------+
  7. Reduce the number of PostgreSQL replicas (optional)

    Skip this step if HighAvailability mode is active and PostgreSQL has 2 replicas

    d8 k -n d8-commander patch postgresqls.acid.zalan.do commander-postgres --type=merge -p '{"spec":{"numberOfInstances":1}}'
    
    postgresql.acid.zalan.do/commander-postgres patched

    Check the operator logs

    d8 k -n d8-operator-postgres logs deployments/operator-postgres
    
    {"cluster-name":"d8-commander/commander-postgres","level":"info","msg":"cluster has been updated","pkg":"controller","time":"2024-05-19T20:50:22Z","worker":0}
  8. Delete the disk and pod of the first instance (if HighAvailability mode is active and PostgreSQL has 2 replicas)

    Skip this step if HighAvailability mode is not active

    d8 k -n d8-commander delete pvc pgdata-commander-postgres-1 --wait=false
    d8 k -n d8-commander delete po commander-postgres-1

    Check logs

    d8 k -n d8-commander logs commander-postgres-1
    
    2024-05-19 20:53:33,293 INFO: Lock owner: commander-postgres-0; I am commander-postgres-1
    2024-05-19 20:53:33,293 INFO: establishing a new patroni connection to the postgres cluster
    2024-05-19 20:53:33,357 INFO: no action. I am (commander-postgres-1), a secondary, and following a leader (commander-postgres-0)

    Check that the disk was created with the necessary storageClass

    d8 k -n d8-commander get pvc
    NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pgdata-commander-postgres-0   Bound    pvc-fd80fde4-d0e2-4b5f-9e3a-eac998191f11   2Gi        RWO            network-ssd    7m6s
    pgdata-commander-postgres-1   Bound    pvc-7af2f442-3097-4fe3-a795-5ad18bb11351   2Gi        RWO            network-ssd    1m11s

    Make sure that both DB instances are in the running state

    d8 k -n d8-commander exec -t commander-postgres-0 -- patronictl list
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Leader  | running |  6 |           |
    | commander-postgres-1 | 10.111.2.239 | Replica | running |  6 |         0 |
    +----------------------+--------------+---------+---------+----+-----------+
  9. Delete the unused disk of the temporary database replica (if HighAvailability mode is not active)

    Skip this step if HighAvailability mode is active and PostgreSQL has 2 replicas

    d8 k -n d8-commander delete pvc pgdata-commander-postgres-1
    
    persistentvolumeclaim "pgdata-commander-postgres-1" deleted
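
The manual checks in the steps above (all Patroni members running, every disk on the expected storage class) can also be scripted. The following is a minimal sketch, not part of Deckhouse Commander or Patroni: the function names patroni_all_running and pvc_wrong_storage_class are hypothetical, and the parsing assumes the patronictl list table layout and the "running" state shown in the sample output above. Adjust the accepted state if your Patroni version reports a different healthy state.

```shell
# Hypothetical helper: reads `patronictl list` output on stdin and succeeds
# only if at least one member row exists and every member is "running".
patroni_all_running() {
  awk -F'|' '
    /^[ \t]*[|]/ {                      # table rows start with "|"
      member = $2; state = $5           # columns: Member, Host, Role, State
      gsub(/[ \t]/, "", member)
      gsub(/[ \t]/, "", state)
      if (member == "Member") next      # skip the header row
      rows++
      if (state != "running") bad++
    }
    END { exit !(rows > 0 && bad == 0) }
  '
}

# Hypothetical helper: reads "NAME STORAGECLASS" pairs on stdin, prints the
# name of every PVC whose storage class differs from the expected one ($1),
# and fails if any such PVC was found.
pvc_wrong_storage_class() {
  awk -v want="$1" '
    NF >= 2 && $2 != want { print $1; bad = 1 }
    END { exit bad + 0 }
  '
}
```

Possible usage, assuming network-ssd is the new storage class:

    d8 k -n d8-commander exec -t commander-postgres-0 -- patronictl list | patroni_all_running && echo "all members running"
    d8 k -n d8-commander get pvc --no-headers -o custom-columns=NAME:.metadata.name,SC:.spec.storageClassName | pvc_wrong_storage_class network-ssd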

Option 2

  1. Back up the database

    d8 k -n d8-commander exec -t commander-postgres-0 -- su - postgres -c "pg_dump -Fc -b -v -d commander" > commander.dump
  2. Disable the commander module

    d8 k patch moduleconfig commander --type=merge -p '{"spec":{"enabled":false}}'
    
    moduleconfig.deckhouse.io/commander patched

    Wait until the Deckhouse queue is empty

    d8 system queue main
    
    Queue 'main': length 0, status: 'waiting for task 5s'

    Check that the d8-commander namespace has been deleted

    d8 k get namespace d8-commander
    Error from server (NotFound): namespaces "d8-commander" not found
  3. Set the required storage class and enable the commander module

    d8 k patch moduleconfig commander --type=merge -p '{"spec":{"enabled":true,"settings":{"postgres":{"internal":{"storageClass":"<NEW_STORAGECLASS_NAME>"}}}}}'
    
    moduleconfig.deckhouse.io/commander patched

    Wait until the Deckhouse queue is empty

    d8 system queue main
    
    Queue 'main': length 0, status: 'waiting for task 5s'

    Check that the DB instance has the Running status

    d8 k -n d8-commander get po commander-postgres-0
    
    NAME                   READY   STATUS    RESTARTS   AGE
    commander-postgres-0   1/1     Running   0          2m4s
  4. Restore the previously saved database backup

    d8 k -n d8-commander exec -i commander-postgres-0 -- su - postgres -c "pg_restore -v -c --if-exists -Fc -d commander" < commander.dump
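
Steps 2 and 3 of Option 2 both wait for the Deckhouse queue to drain before continuing. That wait could be automated with a small polling loop. The sketch below is an illustration only: queue_is_empty and wait_queue_empty are hypothetical helper names, and the check assumes the queue command prints "length 0" exactly as in the sample output above.

```shell
# Hypothetical helper: succeeds if the text on stdin reports an empty queue,
# i.e. contains "length 0" as in the sample output of the queue command.
queue_is_empty() {
  grep -Eq 'length 0([^0-9]|$)'
}

# Hypothetical helper: re-runs a command until queue_is_empty accepts its
# output, giving up after a fixed number of attempts.
# Usage: wait_queue_empty <attempts> <delay-seconds> <command> [args...]
wait_queue_empty() {
  _wq_attempts=$1
  _wq_delay=$2
  shift 2
  _wq_i=1
  while [ "$_wq_i" -le "$_wq_attempts" ]; do
    # Run the given command and inspect what it printed.
    if "$@" 2>/dev/null | queue_is_empty; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$_wq_i" -lt "$_wq_attempts" ]; then
      sleep "$_wq_delay"
    fi
    _wq_i=$((_wq_i + 1))
  done
  return 1
}
```

Possible usage, polling every 5 seconds for up to 5 minutes:

    wait_queue_empty 60 5 d8 system queue main && echo "queue is empty"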