The module lifecycle stage: General Availability
The module has requirements for installation
Deckhouse Commander internals
Deckhouse Commander components
Deckhouse Commander has an external dependency: a PostgreSQL database.
The API server is the central component. Data is stored in PostgreSQL. Options for installing Deckhouse Commander with a DBMS are listed in the Installation section.
The API server provides external APIs, used by the web application and by external integrations, and internal APIs for working with clusters.
The web application uses the API to manage clusters and other Commander entities. It also provides
seamless access to the application cluster’s admin web UI (DKP UI, the console module):
browsing Kubernetes resources and the web terminal. Requests to the application cluster’s
Kubernetes API are routed through Commander under the same account the user signed in with.
Asynchronous operations — tasks — are used to manage clusters. The cluster manager is a service that monitors tasks and executes them. Tasks can be cluster installation, cluster deletion, or cluster state reconciliation with the specified configuration.
When a cluster is created, an installation task is created, and a free cluster manager instance picks it up. The same happens for cluster update, delete, and reconciliation operations.
After acquiring a task, the cluster manager determines which Deckhouse Kubernetes Platform (DKP) version the task requires, starts or reuses a running dhctl server of that version, and runs the required operation in it. One dhctl server performs only one operation at a time. The cluster manager automatically starts and stops dhctl servers and scales the number of their replicas based on the number of incoming tasks. This lets the system adapt to the current load.
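The reuse-or-start behavior described above can be pictured with a small model. This is only a sketch and not Commander's implementation: the `DhctlServer` and `DhctlPool` classes, their methods, and the version strings in the usage example are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DhctlServer:
    """Hypothetical stand-in for one dhctl server replica."""
    version: str
    busy: bool = False


@dataclass
class DhctlPool:
    """Toy model: each server runs one operation at a time."""
    servers: List[DhctlServer] = field(default_factory=list)

    def acquire(self, version: str) -> DhctlServer:
        # Reuse an idle server of the required DKP version if one exists.
        for server in self.servers:
            if server.version == version and not server.busy:
                server.busy = True
                return server
        # Otherwise start a new server of that version (scale up).
        server = DhctlServer(version=version, busy=True)
        self.servers.append(server)
        return server

    def release(self, server: DhctlServer) -> None:
        # The operation has finished; the server may take another task.
        server.busy = False

    def scale_down(self) -> None:
        # Stop every server that is not running an operation.
        self.servers = [s for s in self.servers if s.busy]


pool = DhctlPool()
first = pool.acquire("v1.0")   # no server yet, one is started
second = pool.acquire("v1.0")  # first is busy, a second one is started
pool.release(first)
reused = pool.acquire("v1.0")  # the idle server is reused
```

In this model two concurrent tasks for the same version always get different servers, which mirrors the rule that one dhctl server performs only one operation at a time.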
Every application cluster runs the commander-agent module. Deckhouse Commander enables it
automatically. After the initial cluster installation, the agent opens a reverse TLS tunnel to
the management cluster and keeps it open. With the tunnel in place, Commander does not need
inbound network reachability to the application cluster’s Kubernetes API. The tunnel carries
traffic to the application cluster’s Kubernetes API: it is used during Check (check) and
Change (converge) phases to reconcile the application cluster’s infrastructure
configuration with the desired configuration (creating, deleting, and upgrading nodes, upgrading
DKP components), and by the cluster admin web UI to retrieve objects from the application
cluster’s Kubernetes API.
Cluster telemetry is sent through the same Commander HTTPS API endpoint that the agent uses to establish the tunnel. Telemetry includes basic metrics (total number of CPUs, total memory, number of nodes, and total storage), DKP and Kubernetes versions, and DKP component availability.
The resource conversion mechanism also uses this same Commander HTTPS API: the agent requests the desired configuration, applies it in the application cluster, and sends back the status and the applied configuration.
Interaction with the infrastructure provider API is separate and can originate from either the
management cluster or the application cluster, without using the commander-agent reverse TLS
tunnel.
Direct SSH to a master node (22/TCP) is used during the initial cluster installation and
deletion, during attach and detach operations, and as a fallback when commander-agent cannot
establish or hold the reverse tunnel. The agent itself runs on the application cluster’s master
nodes by default.
When the billing feature flag is on, Commander also collects resource-consumption metrics from every application cluster to calculate cost. These metrics are delivered to a dedicated long-term Prometheus in the management cluster through Prometheus remote write. Remote write uses the same Commander HTTPS API endpoint where the agent opens the tunnel, but it is a separate push mechanism from the application cluster, not traffic over the reverse TLS tunnel. Commander renders billing dashboards and exports CSV cost reports from the stored data.
On the application-cluster side, commander-agent handles billing integration:
- it configures the remote-write stream of resource-consumption metrics;
- it reports node group, cloud instance-class, and StorageClass descriptions to Commander; these are used in the billing UI to bind compute classes and storage classes to real cluster objects;
- it labels every node with billing.commander.deckhouse.io/name, so the node can be matched to its compute class and priced according to the active tariff.
Commander builds the binding contract (which nodes should carry which label) and delivers it to
the agent. The agent directly patches Node objects, adding the label to every node that belongs
to the bound node group.
Component placement and networking
Deckhouse Commander can be enabled in any DKP cluster. For the application clusters it manages, that cluster becomes the management cluster. Other Deckhouse system modules can run in the same cluster alongside Commander. In a shared setup, user workloads may also run in this cluster — on separate nodes.
By default, Commander runs on nodes with the commander role (label
node-role.deckhouse.io/commander); if there are none, it falls back to system nodes
(node-role.deckhouse.io/system). To pin Commander to specific nodes, set the
nodeSelector parameter. If those nodes carry a
taint, also set tolerations. Billing components
inherit the same settings.
commander-agent runs in every application cluster and, by default, on master nodes. It keeps a
persistent reverse TLS tunnel to the management cluster, so during normal operation Commander
does not open inbound connections to the application cluster’s Kubernetes API. Direct SSH to a
master node (22/TCP) is used during the initial cluster installation, deletion, attach and detach
operations, and as a fallback when the reverse tunnel is unavailable.
Network connectivity requirements
The following network connectivity is required between the management and application clusters:
- 22/TCP from dhctl servers to all master nodes of application clusters — for the initial cluster installation, deletion, attach and detach operations, and as a fallback when the reverse tunnel is unavailable.
- Access from dhctl servers to cloud provider APIs — for managing infrastructure resources of application clusters.
- 443/TCP from the application cluster to the management cluster, through ingress on the management cluster frontend nodes. The actual port depends on the ingress controller settings. Traffic goes to two domains:
  - Commander domain (commander.<publicDomainTemplate>): commander-agent uses it to establish the reverse TLS tunnel, send telemetry, and run the configuration conversion mechanism. Prometheus in the application cluster also sends billing remote write data to the /prometheus/api/v1/write path on this same domain. Through the reverse tunnel, Commander accesses the application cluster’s Kubernetes API for Check and Change phases and the cluster admin web UI.
  - Dex domain (dex.<publicDomainTemplate>): Dex in the application cluster connects to it for OIDC discovery and token validation. This is required for DexProvider/commander, which lets the application cluster trust accounts from the management cluster.
- Access to the container registry from which the dhctl image is pulled — from the nodes running the cluster manager, because it accesses the registry directly, and from the nodes running dhctl servers, because those nodes pull the image of the required dhctl version.
The “agent → management cluster” channel only works over HTTPS. Running Deckhouse Commander without HTTPS is not supported.
Network quality requirements and timeouts
Deckhouse Commander is designed to work over the public internet and tolerates unstable network links. There are no strict latency requirements — all interactions use retry logic and persistent connections with automatic reconnection.
Key timeouts that affect operation:
| Component | Parameter | Value |
|---|---|---|
| SSH (bootstrap, destroy, attach, detach) | Connect timeout | 10 sec |
| SSH | KeepAlive interval | 15 sec |
| Agent → Commander API (all requests, including resource conversion) | HTTP timeout | 30 sec |
| Agent → Commander API | Retry on error | up to 3 attempts |
| Agent: resource sync interval | Converge interval | 30 sec |
| Reverse tunnel (AMPG) | TCP KeepAlive | enabled (OS default) |
| Reverse tunnel (AMPG): backend connection wait | Acquire timeout | 60 sec |
| Dex: ID token lifetime | idTokenTTL | 10 min (configurable) |
| Dex: auth request lifetime | authRequests | 10 min |
| Prometheus remote write | Send timeout | 30 sec (Prometheus default) |
When connectivity between the application and management clusters is lost:
- The reverse TLS tunnel (yamux) automatically reconnects on disconnect.
- The agent continues trying to reach the Commander API every 30 seconds.
- Prometheus buffers remote write metrics and will deliver them once connectivity is restored.
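The retry behavior can be sketched as follows. The constants come from the timeout table above; everything else (the function names, the use of `OSError` as the failure signal, the fixed number of cycles) is an assumption made for illustration and does not describe the agent's actual code.

```python
import time
from typing import Callable, List, Optional

HTTP_TIMEOUT_SEC = 30   # table: agent to Commander API, HTTP timeout
MAX_ATTEMPTS = 3        # table: retry on error, up to 3 attempts
SYNC_INTERVAL_SEC = 30  # table: agent resource sync interval


def call_with_retries(request: Callable[[float], str],
                      attempts: int = MAX_ATTEMPTS,
                      timeout: float = HTTP_TIMEOUT_SEC) -> Optional[str]:
    """Call request(timeout) up to `attempts` times; None if all fail."""
    for _ in range(attempts):
        try:
            return request(timeout)
        except OSError:
            # ConnectionError and TimeoutError are both OSError subclasses.
            continue
    return None


def sync_loop(request: Callable[[float], str], cycles: int,
              sleep: Callable[[float], None] = time.sleep) -> List[Optional[str]]:
    """Run `cycles` sync attempts, pausing SYNC_INTERVAL_SEC between them."""
    results = []
    for i in range(cycles):
        results.append(call_with_retries(request))
        if i < cycles - 1:
            sleep(SYNC_INTERVAL_SEC)
    return results
```

A failed cycle simply returns `None` and the next cycle starts after the interval, so a temporary outage needs no manual intervention once connectivity returns.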
Application cluster configuration and resource synchronization
Each application cluster is synchronized through two independent channels — infrastructure configuration and Kubernetes resources have their own executors and sources of truth. These are two separate processes, even though both are shown on the cluster page.
Infrastructure configuration: Change and Check operations
Infrastructure configuration covers the tabs in the Infrastructure group on the cluster page: Kubernetes, Placement, Post-bootstrap script, and SSH Parameters. Deckhouse Commander applies and verifies this configuration directly from the management cluster:
- Change brings the cluster infrastructure — nodes, control plane, initial DKP configuration, provider resources — to the desired state. It runs on cluster creation and when edits to the infrastructure tabs are saved in the web UI. In the Auto change application mode (see below) it runs automatically; in Manual mode it goes through a change request that requires approval.
- Check compares the desired and the actual infrastructure state without making changes. It runs automatically at the configured frequency (the Reconciliation Interval in the workspace or per-cluster parameters).
If an operation fails, retry it manually with the Retry button on the cluster page.
The change application mode and the reconciliation interval are configured on the workspace Parameters → Settings tab and may be overridden per cluster.
These settings do not control Kubernetes resource group synchronization. When the change
application mode is switched from Auto to Manual, commander-agent continues to
synchronize resource groups inside the application cluster according to their control modes.
Kubernetes resource groups: agent-driven synchronization
The “Kubernetes” group on the cluster page contains numbered tabs — Kubernetes resource groups.
Each group is a set of YAML manifests that commander-agent applies inside the application
cluster. These groups are unrelated to the Change and Check operations: a separate
agent loop reconciles them, independently of any infrastructure operations that may be
running at the same time.
Each group has one of the following synchronization modes, set in the cluster template and switchable on the cluster page:
- Force Creation — the agent applies the group’s manifests on every reconcile cycle and restores any modified or deleted resources;
- Create on Install — the group’s resources are created only during cluster installation. The agent does not control them during the cluster lifecycle; subsequent in-cluster edits are not reverted;
- Ignored — group control is disabled in Commander: previously applied resources stay in the application cluster, but the agent stops synchronizing them. After that the resources become ordinary Kubernetes objects — they can be edited or deleted manually.
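The three modes can be summarized as a small decision function. This is a sketch of the rules as described in the list above, not the agent's code; the `SyncMode` enum and both function names are hypothetical, and treating an Ignored group as never applied, even during installation, is an assumption.

```python
from enum import Enum


class SyncMode(Enum):
    FORCE_CREATION = "Force Creation"
    CREATE_ON_INSTALL = "Create on Install"
    IGNORED = "Ignored"


def agent_applies(mode: SyncMode, during_install: bool) -> bool:
    """Return True if the group's manifests are applied in this cycle."""
    if mode is SyncMode.FORCE_CREATION:
        return True  # applied on every reconcile cycle
    if mode is SyncMode.CREATE_ON_INSTALL:
        return during_install  # created once, at installation time
    return False  # Ignored: the agent does not synchronize the group


def reverts_manual_edits(mode: SyncMode) -> bool:
    """Only Force Creation restores modified or deleted resources."""
    return mode is SyncMode.FORCE_CREATION
```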
Issues with commander-agent (loss of connectivity, errors when applying manifests) surface on
the cluster page: in the overall cluster status, the synchronization badge next to the cluster
name, and the Kubernetes tab (per-group and per-resource details). See Cluster
status in the user guide for the full list of cluster statuses.
Protection of controlled resources
To prevent out-of-band changes from causing drift from the desired configuration, resources under
active synchronization are protected from external modifications. The protection covers resources
from groups in “Force Creation” mode and Commander’s own service resources (DexProvider, billing,
RBAC, projects, agent secrets). An attempt to edit or delete such a resource via kubectl is
rejected by the application cluster’s API server. To exempt a group from this protection, switch
it to “Ignored” — after that its resources become ordinary Kubernetes objects.
A subset of Commander’s service resources (RBAC, projects, Dex and billing configurations) is protected more strictly: when the corresponding group leaves Commander control or its source becomes temporarily unavailable, such resources are deleted rather than left in the cluster without Commander supervision.
Enabling and disabling Deckhouse Commander capabilities
Certain Deckhouse Commander capabilities are enabled and disabled via the commander ModuleConfig, field spec.settings.featureFlags. Apply changes in the management cluster where Deckhouse Commander runs.
Billing and cost management — billingEnabled
| Parameter | Description |
|---|---|
| Default | Off (false) until billingEnabled: true is set explicitly |
| When enabled | Billing components are deployed; the Billing section appears in the UI |
| When disabled | Set billingEnabled: false |
Enable:
d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"billingEnabled":true}}}}'
After billing is enabled, the Billing section becomes available in the top navigation bar on the workspace list screen or at the URL {COMMANDER_ADDRESS}/billing/.
In addition, the commander-agent component in every application cluster starts:
- configuring the local Prometheus to send resource consumption metrics (CPU, memory, storage) to Prometheus in the management cluster — a PrometheusRemoteWrite resource is created in the application cluster for this purpose. See How Commander calculates cost for details;
- sending Commander the descriptions of node groups, cloud-provider instance classes, and Kubernetes storage classes — these are used in the billing UI to bind compute classes and storage classes to real cluster objects;
- adding a service label billing.commander.deckhouse.io/name to every node, with the name of the compute class the node belongs to. Commander uses this label to match the node with its compute class and apply the price defined in the current tariff to the node’s consumption.
Commander builds the binding contract (node group → label value) and delivers it to the agent.
The agent directly patches Node objects, adding the label to every node of the bound node
group.
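As an illustration of the binding contract, the sketch below turns a mapping of node group to label value into one patch per node. The data shapes are assumptions: the real contract format and the way the agent discovers a node's group are not described here, and `plan_label_patches` is a hypothetical helper.

```python
from typing import Dict, List

BILLING_LABEL = "billing.commander.deckhouse.io/name"


def plan_label_patches(contract: Dict[str, str],
                       node_groups: Dict[str, str]) -> List[dict]:
    """Build one merge patch per node that belongs to a bound node group.

    contract:    node group name -> compute class name (the label value)
    node_groups: node name -> node group name (assumed to be known)
    """
    patches = []
    for node, group in sorted(node_groups.items()):
        value = contract.get(group)
        if value is None:
            continue  # the node group is not bound to a compute class
        patches.append({
            "node": node,
            "patch": {"metadata": {"labels": {BILLING_LABEL: value}}},
        })
    return patches


# Hypothetical example: only the "worker" group is bound.
example = plan_label_patches(
    {"worker": "standard"},
    {"w-1": "worker", "m-0": "master", "w-0": "worker"},
)
```

Nodes from groups that are absent from the contract are skipped, so they receive no billing label in this model.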
Consumption metrics are stored in a dedicated billing Prometheus (billing-prometheus) in the
management cluster. It stores only the data required for cost calculation: container CPU and
memory consumption, pod and PVC resource requests, volume usage, pod phases, controller-to-pod
relationships, namespace, pod, and node labels, and PVC information.
Prometheus in the application cluster sends these data to the management cluster through
Prometheus remote write over the Commander HTTPS API. This is a separate Prometheus HTTP(S)
request; commander-agent does not send metrics itself. The agent configures the
PrometheusRemoteWrite resource and passes connection parameters to Prometheus.
Billing metadata is stored in the Commander database. This includes tariffs, compute classes,
storage classes, resource bindings, and report schedules. Generated reports use separate
billing-reports storage.
See the billing documentation for how this data is used to compute cost.
If the section does not appear, check access management in the billing documentation.
Disable:
d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"billingEnabled":false}}}}'
After disabling, if needed, manually delete remaining resources:
d8 k -n d8-commander delete pvc -l app=billing-reports
d8 k -n d8-commander delete pvc -l app=billing-prometheus
User documentation: Billing and cost management.
Projects — projectsEnabled
| Parameter | Description |
|---|---|
| Default | Off (false) until projectsEnabled: true is set explicitly |
| When enabled | The Projects section is available in the UI |
| When disabled | Set projectsEnabled: false |
Enable:
d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"projectsEnabled":true}}}}'
Disable:
d8 k patch mc commander --type merge --patch '{"spec":{"settings":{"featureFlags":{"projectsEnabled":false}}}}'
User documentation: Projects section of the user guide.
Authentication in application clusters via DexProvider
Every cluster attached to Deckhouse Commander automatically trusts users authenticated by the
management cluster’s Dex. Once a user signs into Commander, they can open the application
cluster’s admin web UI (DKP UI, the console module) without signing in again.
This does not require a separate account on the application cluster.
Commander reconciles the trust link automatically through a pair of resources:
- In the application cluster: a DexProvider named commander, type OIDC. Its issuer points at the management cluster’s Dex URL. Its trust bundle carries the management cluster’s root CA, so TLS to the Dex endpoint validates without extra system-wide trust anchors. The provider requests the openid, profile, email, and groups scopes. This single DexProvider is enough for the application cluster to rely on management-cluster identities; clusters may still have additional DexProvider resources for local identities.
- In the management cluster: a paired DexClient named commander-agent-<id> and a Secret with its OIDC client secret. The DexClient carries a redirect URI pointing to the application cluster’s Dex, so the management-cluster Dex correctly accepts login callbacks.
Both resources are created when a cluster is installed or attached, and removed on detach. No manual steps are required. If a resource is edited or removed by hand, the next reconciliation restores it.
How sign-in works
When a user opens the application cluster’s admin web UI from Commander, the browser reaches the
application cluster’s Dex. The application cluster’s Dex uses the commander provider and
redirects the browser to the management cluster’s Dex.
The management cluster’s Dex authenticates the user and redirects the browser back to the application cluster’s Dex using a redirect URI that belongs to that application cluster. The application cluster’s Dex then completes sign-in for the cluster admin web UI. If the user already has an active session in the management cluster’s Dex after signing in to Commander, they usually do not need to enter their login and password again.
The allowedUserGroups parameter of the commander module only controls who can sign in to
the Commander web UI. It does not grant any permissions inside Deckhouse Commander itself; roles
are configured separately. See Access control for details.
Do not edit, disable, or delete the auto-managed DexProvider/commander in an application
cluster or the paired DexClient and its Secret in the management cluster. Removing them
breaks all Commander-mediated logins into the application cluster. The next reconciliation
restores them, so manual tuning is lost in any case. If you need additional authentication
settings for an application cluster, add extra DexProvider resources alongside the managed
one — the Commander-managed provider is the minimum required to keep the cluster reachable from
Commander and is not exclusive.
Data encryption
Deckhouse Commander encrypts sensitive data stored in the database using keys that are automatically generated when the module is enabled and stored in the commander-envs secret.
Save the keys to a secure location so that the database can be restored if problems occur. Without the keys, the data cannot be restored.
$ d8 k -n d8-commander get secret commander-envs -oyaml
apiVersion: v1
data:
  ACTIVE_RECORD_ENCRYPTION_DETERMINISTIC_KEY: YVBBNVh5QUxoZjc1Tk5uTXphc3BXN2FrVGZacDBsUFk=
  ACTIVE_RECORD_ENCRYPTION_KEY_DERIVATION_SALT: eEVZMGR0NlRaY0FNZzUySzdPODR3WXpranZiQTYySHo=
  ACTIVE_RECORD_ENCRYPTION_PRIMARY_KEY: RUdZOFdodWxVT1hpeHlib2Q3Wld3TUlMNjhSOW81a0M=
kind: Secret
metadata:
  ...
  name: commander-envs
  namespace: d8-commander
type: Opaque
Collecting logs of the history of changes
In Deckhouse Commander version 1.9 and later, events related to the history of changes are printed to the standard output and are tagged with the ["audit"] label. These logs can be collected and sent using the log-shipper module.
Logs example:
{"level":"INFO","time":"2025-06-18 14:22:15 +0300","request_id":"ea09d409dc3c95dcf658fc2c2838084b","pid":19,"tags":["audit"],"auditable_type":"ClusterSettings","auditable_id":"8a0041ef-6c30-48bc-b3ca-e9db3e22be47","action":"create","user_type":"User","remote_address":"82.150.57.81","request_uuid":"ea09d409dc3c95dcf658fc2c2838084b","workspace_slug":"xcjtd","user_name":"admin@company.my","audited_changes":{"cluster_manager":{"sync":{"mode":"auto"},"check_interval":1}}}
{"level":"INFO","time":"2025-06-18 14:22:15 +0300","request_id":"ea09d409dc3c95dcf658fc2c2838084b","pid":19,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"056f7fe5-7d22-4a76-b5e2-f225c0a99613","action":"create","user_type":"User","remote_address":"82.150.57.81","request_uuid":"ea09d409dc3c95dcf658fc2c2838084b","workspace_slug":"xcjtd","user_name":"admin@company.my","audited_changes":{"name":"mycluster","archived_at":null}}
{"level":"INFO","time":"2025-06-18 14:23:57 +0300","request_id":"a1eaf50bbc87a8cca4cd17d8be8fffdb","pid":12,"tags":["audit"],"auditable_type":"ClusterSettings","auditable_id":"707c46b1-b2c8-4fab-9392-8216a2058219","action":"create","user_type":"AuthToken","remote_address":"238.106.231.86","request_uuid":"a1eaf50bbc87a8cca4cd17d8be8fffdb","workspace_slug":"bfqcc","user_name":"api-user","audited_changes":{"cluster_manager":{"sync":{"mode":"auto"},"check_interval":1}}}
{"level":"INFO","time":"2025-06-18 14:23:57 +0300","request_id":"a1eaf50bbc87a8cca4cd17d8be8fffdb","pid":12,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"42d432aa-8250-4ef0-b260-51639e1445d0","action":"create","user_type":"AuthToken","remote_address":"238.106.231.86","request_uuid":"a1eaf50bbc87a8cca4cd17d8be8fffdb","workspace_slug":"bfqcc","user_name":"api-user","audited_changes":{"name":"15731486914-1-con-1-30","archived_at":null}}
{"level":"INFO","time":"2025-06-18 14:28:56 +0300","request_id":"069566a46c004e53b686189587d484a9","pid":19,"tags":["audit"],"auditable_type":"ClusterSettings","auditable_id":"402a4d4d-5c14-4466-a1f3-3d990d7cf35a","action":"create","user_type":"User","remote_address":"30.231.184.26","request_uuid":"069566a46c004e53b686189587d484a9","workspace_slug":"xcjtd","user_name":"user@company.my","audited_changes":{"cluster_manager":{"sync":{"mode":"auto"},"check_interval":1}}}
{"level":"INFO","time":"2025-06-18 14:28:56 +0300","request_id":"069566a46c004e53b686189587d484a9","pid":19,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"9ee687d4-18fe-423c-bbaa-e8e46ea47e67","action":"create","user_type":"User","remote_address":"30.231.184.26","request_uuid":"069566a46c004e53b686189587d484a9","workspace_slug":"xcjtd","user_name":"user@company.my","audited_changes":{"name":"mycluster2","archived_at":null}}
{"level":"INFO","time":"2025-06-18 14:29:06 +0300","request_id":"d29b248fbce414db8b71f821a3b1886e","pid":12,"tags":["audit"],"auditable_type":"Cluster","auditable_id":"e0f3c3de-2129-4b75-b927-72a8eb26902b","action":"update","user_type":"User","remote_address":"30.231.184.26","request_uuid":"d29b248fbce414db8b71f821a3b1886e","workspace_slug":"xcjtd","user_name":"user@company.my","audited_changes":{"archived_at":[null,"2025-06-18T14:29:05.943+03:00"]}}
Configuration example:
apiVersion: deckhouse.io/v1alpha2
kind: ClusterLoggingConfig
metadata:
  name: commander-audit-logs
spec:
  destinationRefs:
  - loki-example
  kubernetesPods:
    labelSelector:
      matchLabels:
        app: backend
    namespaceSelector:
      labelSelector:
        matchLabels:
          kubernetes.io/metadata.name: d8-commander
  labelFilter:
  - field: message
    operator: Regex
    values:
    - .*\[\"audit\"\].*
  type: KubernetesPods
---
apiVersion: deckhouse.io/v1alpha1
kind: ClusterLogDestination
metadata:
  name: loki-example
spec:
  type: Loki
  loki:
    endpoint: http://loki-example.loki.svc:3100
For more detailed configuration information, see the documentation of the log-shipper module.
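To see which lines the labelFilter regular expression above would select, the same pattern can be tried locally. The sketch below is a rough check only: it uses Python's `re` module, whose regex dialect may differ in details from the one log-shipper uses, the first sample line is an abridged version of the log example above, and the second sample line is invented to show a non-matching record.

```python
import json
import re

# The pattern from the labelFilter in the configuration example.
AUDIT_PATTERN = re.compile(r'.*\[\"audit\"\].*')


def audit_events(lines):
    """Yield parsed JSON records for lines that match the audit filter."""
    for line in lines:
        if AUDIT_PATTERN.match(line):
            yield json.loads(line)


sample = [
    # Abridged audit record (a subset of the fields from the log example).
    '{"level":"INFO","tags":["audit"],"auditable_type":"Cluster",'
    '"action":"create","user_name":"admin@company.my"}',
    # Hypothetical non-audit line, not taken from real Commander output.
    '{"level":"INFO","message":"health check"}',
]

for event in audit_events(sample):
    print(event["user_name"], event["action"], event["auditable_type"])
```

Only the first sample line passes the filter, because the second one carries no `["audit"]` tag.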
Changing the storage class
Option 1 (preferred)
- Perform a backup of the database instance:
    d8 k -n d8-commander exec -t commander-postgres-0 -- su - postgres -c "pg_dump -Fc -b -v -d commander" > commander.dump
- Change storageClass in the module settings, replacing <NEW_STORAGECLASS_NAME> with the name of the required storage class (the list of available storage classes can be found using the command d8 k get storageclasses):
    d8 k patch moduleconfig commander --type=merge -p '{"spec":{"settings":{"postgres":{"internal":{"storageClass":"<NEW_STORAGECLASS_NAME>"}}}}}'
    moduleconfig.deckhouse.io/commander patched
  Wait until the Deckhouse queue is empty:
    d8 system queue main
    Queue 'main': length 0, status: 'waiting for task 5s'
  Check the logs of the postgres operator:
    d8 k -n d8-operator-postgres logs deployments/operator-postgres
    {"cluster-name":"d8-commander/commander-postgres","level":"info","msg":"cluster has been updated","pkg":"controller","time":"2024-05-19T20:36:22Z","worker":0}
- Increase the number of replicas of the PostgreSQL database (optional).
  Skip this step if HighAvailability mode is active and PostgreSQL has 2 replicas.
    d8 k -n d8-commander patch postgresqls.acid.zalan.do commander-postgres --type=merge -p '{"spec":{"numberOfInstances":2}}'
    postgresql.acid.zalan.do/commander-postgres patched
  Check the logs of the operator and the postgres instance:
    d8 k -n d8-operator-postgres logs deployments/operator-postgres
    {"cluster-name":"d8-commander/commander-postgres","level":"info","msg":"cluster has been updated","pkg":"controller","time":"2024-05-19T20:36:22Z","worker":0}
    d8 k -n d8-commander logs commander-postgres-1
    2024-05-19 20:38:15,648 INFO: no action. I am (commander-postgres-1), a secondary, and following a leader (commander-postgres-0)
- Perform the master switch:
    d8 k -n d8-commander exec -it commander-postgres-0 -- patronictl failover
    Current cluster topology
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.167 | Leader  | running |  5 |           |
    | commander-postgres-1 | 10.111.2.239 | Replica | running |  5 |         0 |
    +----------------------+--------------+---------+---------+----+-----------+
    Candidate ['commander-postgres-1'] []: commander-postgres-1
    Are you sure you want to failover cluster commander-postgres, demoting current leader commander-postgres-0? [y/N]: y
    2024-05-19 20:40:52.63041 Successfully failed over to "commander-postgres-1"
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.167 | Replica | stopped |    |   unknown |
    | commander-postgres-1 | 10.111.2.239 | Leader  | running |  5 |           |
    +----------------------+--------------+---------+---------+----+-----------+
  Make sure that both DB instances are in the running state:
    d8 k -n d8-commander exec -t commander-postgres-0 -- patronictl list
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.167 | Replica | running |  6 |         0 |
    | commander-postgres-1 | 10.111.2.239 | Leader  | running |  6 |           |
    +----------------------+--------------+---------+---------+----+-----------+
  Check that the disk of the new DB replica was created with the required storageClass:
    d8 k -n d8-commander get pvc --selector application=spilo
    NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pgdata-commander-postgres-0   Bound    pvc-fd80fde4-d0e2-4b5f-9e3a-eac998191f11   2Gi        RWO            network-hdd    36h
    pgdata-commander-postgres-1   Bound    pvc-7af2f442-3097-4fe3-a795-5ad18bb11351   2Gi        RWO            network-ssd    2m54s
- Delete the disk and pod of the first postgres instance:
    d8 k -n d8-commander delete pvc pgdata-commander-postgres-0 --wait=false
    d8 k -n d8-commander delete po commander-postgres-0
  Check the logs:
    d8 k -n d8-commander logs commander-postgres-0
    2024-05-19 20:43:33,293 INFO: Lock owner: commander-postgres-1; I am commander-postgres-0
    2024-05-19 20:43:33,293 INFO: establishing a new patroni connection to the postgres cluster
    2024-05-19 20:43:33,357 INFO: no action. I am (commander-postgres-0), a secondary, and following a leader (commander-postgres-1)
  Check that the disk was created with the correct storageClass:
    d8 k -n d8-commander get pvc
    NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pgdata-commander-postgres-0   Bound    pvc-fd80fde4-d0e2-4b5f-9e3a-eac998191f11   2Gi        RWO            network-ssd    2m6s
    pgdata-commander-postgres-1   Bound    pvc-7af2f442-3097-4fe3-a795-5ad18bb11351   2Gi        RWO            network-ssd    7m11s
- Perform the master switch one more time:
    d8 k -n d8-commander exec -it commander-postgres-0 -- patronictl failover
    Current cluster topology
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Replica | running |  6 |         0 |
    | commander-postgres-1 | 10.111.2.239 | Leader  | running |  6 |           |
    +----------------------+--------------+---------+---------+----+-----------+
    Candidate ['commander-postgres-0'] []: commander-postgres-0
    Are you sure you want to failover cluster commander-postgres, demoting current leader commander-postgres-1? [y/N]: y
    2024-05-19 20:46:11.69855 Successfully failed over to "commander-postgres-0"
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Leader  | running |  6 |           |
    | commander-postgres-1 | 10.111.2.239 | Replica | stopped |    |   unknown |
    +----------------------+--------------+---------+---------+----+-----------+
  Make sure that both DB instances are in the running state:
    d8 k -n d8-commander exec -t commander-postgres-0 -- patronictl list
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Leader  | running |  6 |         0 |
    | commander-postgres-1 | 10.111.2.239 | Replica | running |  6 |           |
    +----------------------+--------------+---------+---------+----+-----------+
- Reduce the number of replicas of the PostgreSQL database (optional).
  Skip this step if HighAvailability mode is active and PostgreSQL has 2 replicas.
    d8 k -n d8-commander patch postgresqls.acid.zalan.do commander-postgres --type=merge -p '{"spec":{"numberOfInstances":1}}'
    postgresql.acid.zalan.do/commander-postgres patched
  Check the operator logs:
    d8 k -n d8-operator-postgres logs deployments/operator-postgres
    {"cluster-name":"d8-commander/commander-postgres","level":"info","msg":"cluster has been updated","pkg":"controller","time":"2024-05-19T20:50:22Z","worker":0}
- Delete the disk and pod of the second instance (if HighAvailability mode is active and PostgreSQL has 2 replicas).
  Skip this step if HighAvailability mode is not active.
    d8 k -n d8-commander delete pvc pgdata-commander-postgres-1 --wait=false
    d8 k -n d8-commander delete po commander-postgres-1
  Check the logs:
    d8 k -n d8-commander logs commander-postgres-1
    2024-05-19 20:53:33,293 INFO: Lock owner: commander-postgres-0; I am commander-postgres-1
    2024-05-19 20:53:33,293 INFO: establishing a new patroni connection to the postgres cluster
    2024-05-19 20:53:33,357 INFO: no action. I am (commander-postgres-1), a secondary, and following a leader (commander-postgres-0)
  Check that the disk was created with the required storageClass:
    d8 k -n d8-commander get pvc
    NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pgdata-commander-postgres-0   Bound    pvc-fd80fde4-d0e2-4b5f-9e3a-eac998191f11   2Gi        RWO            network-ssd    7m6s
    pgdata-commander-postgres-1   Bound    pvc-7af2f442-3097-4fe3-a795-5ad18bb11351   2Gi        RWO            network-ssd    1m11s
  Make sure that both DB instances are in the running state:
    d8 k -n d8-commander exec -t commander-postgres-0 -- patronictl list
    + Cluster: commander-postgres --------+---------+---------+----+-----------+
    | Member               | Host         | Role    | State   | TL | Lag in MB |
    +----------------------+--------------+---------+---------+----+-----------+
    | commander-postgres-0 | 10.111.3.189 | Leader  | running |  6 |         0 |
    | commander-postgres-1 | 10.111.2.239 | Replica | running |  6 |           |
    +----------------------+--------------+---------+---------+----+-----------+
- Delete the unused disk of the temporary database replica (if HighAvailability mode is not active).
  Skip this step if HighAvailability mode is active and PostgreSQL has 2 replicas.
    d8 k -n d8-commander delete pvc pgdata-commander-postgres-1
    persistentvolumeclaim "pgdata-commander-postgres-1" deleted
Option 2
- Perform a backup of the database instance:
    d8 k -n d8-commander exec -t commander-postgres-0 -- su - postgres -c "pg_dump -Fc -b -v -d commander" > commander.dump
- Turn off the commander module:
    d8 k patch moduleconfig commander --type=merge -p '{"spec":{"enabled":false}}'
    moduleconfig.deckhouse.io/commander patched
  Wait until the Deckhouse queue is empty:
    d8 system queue main
    Queue 'main': length 0, status: 'waiting for task 5s'
  Check that the d8-commander namespace has been deleted:
    d8 k get namespace d8-commander
    Error from server (NotFound): namespaces "d8-commander" not found
- Set the required storage class and enable the commander module:
    d8 k patch moduleconfig commander --type=merge -p '{"spec":{"enabled":true,"settings":{"postgres":{"internal":{"storageClass":"<NEW_STORAGECLASS_NAME>"}}}}}'
    moduleconfig.deckhouse.io/commander patched
  Wait until the Deckhouse queue is empty:
    d8 system queue main
    Queue 'main': length 0, status: 'waiting for task 5s'
  Check that the DB instance has the Running status:
    d8 k -n d8-commander get po commander-postgres-0
    NAME                   READY   STATUS    RESTARTS   AGE
    commander-postgres-0   1/1     Running   0          2m4s
- Restore the previously saved backup of the database:
    d8 k -n d8-commander exec -it commander-postgres-0 -- su - postgres -c "pg_restore -v -c --if-exists -Fc -d commander" < commander.dump