The module lifecycle stage: Experimental
The module has requirements for installation

Enabling Required Modules

The sds-elastic module is in Experimental stage. Experimental modules are not enabled by default. Set allowExperimentalModules: true in the deckhouse ModuleConfig before enabling the module.
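As a minimal sketch, the flag can be set with a merge patch. The field path spec.settings.allowExperimentalModules and the helper name are assumptions that the text above does not state, and the patch assumes the deckhouse ModuleConfig already exists.

```shell
# Hypothetical helper: allow Experimental modules cluster-wide.
# Assumption: the flag lives under spec.settings of the "deckhouse" ModuleConfig.
allow_experimental_modules() {
  d8 k patch moduleconfig deckhouse --type=merge \
    -p '{"spec":{"settings":{"allowExperimentalModules":true}}}'
}
```

Call allow_experimental_modules once, before applying the ModuleConfig objects for sds-elastic.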

Enable sds-elastic together with its companion modules:

  • sds-node-configurator: owns the BlockDevice and LVMVolumeGroup CRDs that ElasticCluster selects from.
  • csi-ceph: owns the CephClusterConnection and CephStorageClass CRDs the controller writes into.
  • snapshot-controller: optional; needed only for VolumeSnapshot support.

d8 k apply -f - <<EOF
apiVersion: v1
kind: List
items:
  - apiVersion: deckhouse.io/v1alpha1
    kind: ModuleConfig
    metadata:
      name: sds-node-configurator
    spec:
      enabled: true
      version: 1
  - apiVersion: deckhouse.io/v1alpha1
    kind: ModuleConfig
    metadata:
      name: snapshot-controller
    spec:
      enabled: true
      version: 1
  - apiVersion: deckhouse.io/v1alpha1
    kind: ModuleConfig
    metadata:
      name: csi-ceph
    spec:
      enabled: true
      version: 1
  - apiVersion: deckhouse.io/v1alpha1
    kind: ModuleConfig
    metadata:
      name: sds-elastic
    spec:
      enabled: true
      version: 1
EOF

Wait until every module reaches the Ready state:

d8 k get module sds-node-configurator snapshot-controller csi-ceph sds-elastic -w
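For scripted installs, a blocking wait may be more convenient than watching the output. The sketch below assumes that the Module resource reports readiness in .status.phase; both that field and the helper name are assumptions, so check them against your Deckhouse version.

```shell
# Hypothetical helper: block until every listed module is Ready.
# Assumption: the Module resource exposes readiness as .status.phase == Ready.
wait_modules_ready() {
  d8 k wait module "$@" --for=jsonpath='{.status.phase}'=Ready --timeout=15m
}
```

Example: wait_modules_ready sds-node-configurator snapshot-controller csi-ceph sds-elastic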

Selecting Data Nodes

settings.dataNodes.nodeSelector declares which Kubernetes Nodes are eligible to host sds-elastic data. The controller places the label storage.deckhouse.io/sds-elastic-node="" on every matching Node and removes it from Nodes that no longer match.

Downstream consumers — the sds-node-configurator agent (it picks up BlockDevice discovery on data nodes) and your ElasticCluster.spec.storage.nodeSelector — use this label as a nodeAffinity term.

apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: sds-elastic
spec:
  enabled: true
  version: 1
  settings:
    dataNodes:
      nodeSelector:
        node-role.deckhouse.io/storage: ""

If the field is omitted, the empty selector matches every Node — every Node in the cluster gets storage.deckhouse.io/sds-elastic-node="".

Narrowing dataNodes.nodeSelector does not redistribute data. If a Node that already hosts OSDs falls outside the new selector, its storage.deckhouse.io/sds-elastic-node label is removed and data on that Node becomes unreachable until the Node is brought back under the selector.
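Before narrowing the selector, it can help to list the Nodes that would lose the label. The following is a sketch only: nodes_losing_label is a hypothetical helper, and its argument must be the new selector written in kubectl -l syntax.

```shell
# Hypothetical pre-flight: list Nodes that carry the data-node label today
# but would not match a narrowed selector (given in kubectl -l syntax).
nodes_losing_label() {
  new_selector="$1"
  current=$(d8 k get nodes -l storage.deckhouse.io/sds-elastic-node -o name)
  wanted=$(d8 k get nodes -l "$new_selector" -o name)
  # Nothing is labelled yet, so nothing can be lost.
  [ -n "$current" ] || return 0
  # Print the lines of $current that are not exact lines of $wanted.
  printf '%s\n' "$current" | grep -vxF -e "$wanted" || true
}
```

Example: nodes_losing_label 'node-role.deckhouse.io/storage'. Any Node printed here should be checked for OSDs before the ModuleConfig is changed.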

Preparing Storage Nodes

ElasticCluster consumes BlockDevice CRs (managed by sds-node-configurator) selected by labels and provisions one OSD per matched device.

  1. Pick the nodes that will host Ceph daemons and label them. The example uses node-role.deckhouse.io/storage:

    d8 k label node <node-name> node-role.deckhouse.io/storage=

  2. Make sure each storage node has at least one unused raw block device (no partitions, filesystem, or LVM signatures). sds-node-configurator discovers them and creates a corresponding BlockDevice CR. Verify:

    d8 k get blockdevices.storage.deckhouse.io -o wide

  3. Add a label that the ElasticCluster will use to select OSD-eligible devices. The example uses app=elastic-osd:

    d8 k label blockdevice <bd-name> app=elastic-osd
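When a node has many devices, step 3 can be scripted. The sketch below labels every consumable BlockDevice on one node; the helper name is hypothetical, and it relies on the status.nodeName and status.consumable fields mentioned elsewhere in this guide.

```shell
# Hypothetical helper: label every consumable BlockDevice on one node.
label_node_blockdevices() {
  node="$1"
  # One line per BlockDevice: <name> <node> <consumable>.
  d8 k get blockdevices.storage.deckhouse.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeName}{" "}{.status.consumable}{"\n"}{end}' |
  while read -r name bd_node consumable; do
    # Skip devices on other nodes and devices that are already in use.
    [ "$bd_node" = "$node" ] && [ "$consumable" = "true" ] || continue
    d8 k label blockdevice "$name" app=elastic-osd </dev/null
  done
}
```

Example: label_node_blockdevices <node-name>. Review the list from step 2 first, because every labelled device becomes an OSD candidate.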

Deploying an ElasticCluster

The example below bootstraps a Ceph cluster on every node carrying the node-role.deckhouse.io/storage label, consuming every BlockDevice labelled app=elastic-osd.

d8 k apply -f - <<EOF
apiVersion: storage.deckhouse.io/v1alpha1
kind: ElasticCluster
metadata:
  name: ceph-prod
spec:
  storage:
    nodeSelector:
      matchExpressions:
        - { key: node-role.deckhouse.io/storage, operator: Exists }
    blockDeviceSelector:
      matchLabels:
        app: elastic-osd
  network:
    public: 10.12.0.0/16
    cluster: 10.12.0.0/16
EOF

Wait until the ElasticCluster reports Ready:

d8 k get elasticcluster ceph-prod -w

The Phase column is expected to switch from Pending to InProgress and finally to Ready. The full per-stage progression is exposed through conditions: StorageReady → CephClusterReady → CredentialsReady → CsiCephReady → aggregate Ready.
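In a script, the watch can be replaced by a blocking wait that prints the per-stage conditions on timeout. This is a sketch: wait_cluster_ready is a hypothetical helper, and it assumes the conditions carry the usual type, status, and reason fields.

```shell
# Hypothetical helper: wait for the aggregate Ready condition and, on
# timeout, print each condition as <type>=<status> <reason>.
wait_cluster_ready() {
  cluster="$1"
  if d8 k wait elasticcluster "$cluster" --for=condition=Ready --timeout="${2:-30m}"; then
    return 0
  fi
  d8 k get elasticcluster "$cluster" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" "}{.reason}{"\n"}{end}'
  return 1
}
```

Example: wait_cluster_ready ceph-prod 45m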

Verify the underlying objects:

d8 k get lvmvolumegroup -l sds-elastic.deckhouse.io/cluster=ceph-prod
d8 k get lvmlogicalvolume -l sds-elastic.deckhouse.io/cluster=ceph-prod
d8 k get pv -l sds-elastic.deckhouse.io/cluster=ceph-prod
d8 k -n d8-sds-elastic get pod -owide

The controller also creates an internal ElasticClusterCredential that mirrors the fields of the rook-ceph-mon Secret:

d8 k get elasticclustercredential ceph-prod -o yaml

BlockDevice Adoption and Ownership

Once an ElasticCluster selects a BlockDevice for the first time, the controller patches it with the sds-elastic.deckhouse.io/cluster=<cluster-name> label. The label is the durable record of which cluster owns the device and drives several behaviors:

  • Single owner per BlockDevice. If a BlockDevice matches the blockDeviceSelector of two ElasticCluster resources, the second one cannot adopt it. The controller refuses to overwrite the existing label and surfaces StorageReady=False with Reason=OwnershipConflict and a message listing each contested BD and its current owner. No LVMVolumeGroup, LVMLogicalVolume, or local PersistentVolume is created until every conflict is resolved — even free BDs in the selector remain unadopted while a conflict is pending.

    To resolve a conflict, decide which cluster should own the BD and clear the label on the other side:

    d8 k label blockdevice <bd-name> sds-elastic.deckhouse.io/cluster-

    Or remove the conflicting ElasticCluster entirely. The next reconcile picks the BD up.

  • Sticky adoption — adopted BlockDevices stay with the cluster. Once a BD has been labelled by the controller, it remains part of the cluster’s working set even if it later drifts out of blockDeviceSelector or nodeSelector (for example, the operator narrows the selector, the device’s labels change, or its node is relabelled). This is intentional: the OSD on top of it is already provisioned, the local PV is bound to a specific node, and dropping it from the working set would shrink CephCluster.spec.storageClassDeviceSets[0].count and risk data unavailability. The cluster’s OSD count is therefore monotonic for the lifetime of an ElasticCluster — it can grow when new BDs match the selector but never shrinks on its own.

    As a side effect, sds-node-configurator flips BlockDevice.status.consumable to false once a VG appears on the device. Sticky adoption prevents this from kicking the BD out of the working set on the very next reconcile.

  • Releasing a BlockDevice. There is no automatic disown path at this experimental stage (planned as part of B20: OwnerReferences and finalizer-driven teardown). Deleting the ElasticCluster does NOT cascade to the per-device objects: the controller only removes the Rook CephCluster and the csi-ceph CephClusterConnection, leaving the LVMVolumeGroup / LVMLogicalVolume / local PersistentVolume and the BD label for you to clean up by label (see Deleting Resources below). To retire a single BD from a live cluster, manually delete the corresponding LVMLogicalVolume and LVMVolumeGroup, and only then clear the label:

    d8 k delete lvmlogicalvolume <name>
    d8 k delete lvmvolumegroup <name>
    d8 k label blockdevice <bd-name> sds-elastic.deckhouse.io/cluster-

    Doing this while pools still hold useful data risks losing replicas.

  • Editing the selectors after creation. ElasticCluster.spec.storage.nodeSelector and spec.storage.blockDeviceSelector are editable after creation — kubectl edit elasticcluster <name> and adjust the matchers. The validating webhook on UPDATE enforces two safety rails:

    • Orphan-guard. If an edit would leave an already-adopted BD outside the new selector pair (its labels no longer match blockDeviceSelector, or its status.nodeName is no longer in the set produced by nodeSelector), the webhook rejects the request and lists the offending BDs. Adopted BDs cannot be released automatically — follow the manual procedure above first.
    • Pre-flight conflict detection. If a widening edit would pull in a BD already labelled by another ElasticCluster, the webhook rejects the request and reports the contested BDs along with their current owners. Resolve the conflict (clear the label, or delete the other EC) before retrying.

    spec.network remains immutable on UPDATE: changing the public/cluster CIDRs on a live cluster invalidates mon endpoints and host-network bindings, and there is no safe automatic remediation. To change the network configuration, delete and re-create the ElasticCluster.
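When diagnosing an ownership conflict, two read-only queries are usually enough: the StorageReady condition of the cluster that failed to adopt, and the owner label on each BlockDevice. The helper below is a sketch with a hypothetical name; it assumes the condition exposes reason and message fields.

```shell
# Hypothetical helper: show why StorageReady is not True and who owns each BD.
storage_conflicts() {
  cluster="$1"
  # Reason and message of the StorageReady condition.
  d8 k get elasticcluster "$cluster" \
    -o jsonpath='{.status.conditions[?(@.type=="StorageReady")].reason}{": "}{.status.conditions[?(@.type=="StorageReady")].message}{"\n"}'
  # BlockDevices with an extra column holding the owning cluster label.
  d8 k get blockdevice -L sds-elastic.deckhouse.io/cluster
}
```

Example: storage_conflicts ceph-prod. The helper changes nothing; clearing a label remains a manual decision.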

Declaring StorageClasses

Pools and the matching csi-ceph StorageClasses are declared per ElasticStorageClass. One ESC produces one Ceph pool + one CephStorageClass named after the ESC.

RBD pool with default replication (3 replicas)

d8 k apply -f - <<EOF
apiVersion: storage.deckhouse.io/v1alpha1
kind: ElasticStorageClass
metadata:
  name: ceph-prod-rbd
spec:
  clusterRef: ceph-prod
  type: RBD
  replication: ConsistencyAndAvailability
EOF

CephFS pool with default replication (3 replicas)

d8 k apply -f - <<EOF
apiVersion: storage.deckhouse.io/v1alpha1
kind: ElasticStorageClass
metadata:
  name: ceph-prod-cephfs
spec:
  clusterRef: ceph-prod
  type: CephFS
  replication: ConsistencyAndAvailability
EOF

The ErasureCodedCompact replication mode is temporarily disabled and cannot be selected.

Pool that survives two simultaneous host failures (HighRedundancy)

d8 k apply -f - <<EOF
apiVersion: storage.deckhouse.io/v1alpha1
kind: ElasticStorageClass
metadata:
  name: ceph-prod-rbd-hr
spec:
  clusterRef: ceph-prod
  type: RBD
  replication: HighRedundancy
EOF

HighRedundancy produces a 4-replica pool (size=4, min_size=2, requireSafeReplicaSize=true):

  • two simultaneous host failures keep I/O continuous (2 replicas equal min_size);
  • a third simultaneous failure pauses I/O but does not lose data — Ceph backfills the surviving copy onto free cluster space and resumes;
  • data is lost only at the fourth simultaneous failure.

The mode requires at least 5 storage nodes (4 for the pool’s CRUSH placement at failureDomain=host and 5 to host a 5-mon quorum). The first time you create a HighRedundancy ESC against an ElasticCluster, the controller automatically promotes the underlying CephCluster to mon.count=5, mgr.count=3 (the standard topology is 3, 2). The promotion is sticky: deleting the last HighRedundancy ESC does NOT roll the counts back, because silently weakening a live cluster’s fault-tolerance guarantee is unsafe.

A validating webhook gates ESC creation on the same thresholds so the sticky promotion cannot fire on an undersized cluster. CREATE of an ESC with replication: HighRedundancy is rejected when:

  • the parent ElasticCluster referenced by spec.clusterRef does not exist;
  • fewer than 5 nodes match ElasticCluster.spec.storage.nodeSelector (the 5-mon quorum floor);
  • adopted BlockDevice resources of the parent EC live on fewer than 4 distinct nodes (the 4-replica CRUSH placement floor).

So the bootstrap order is fixed: apply the ElasticCluster first, wait until at least four storage nodes have adopted BDs (check via kubectl get bd -l sds-elastic.deckhouse.io/cluster=<ec> or EC.status.phase=Ready), and only then apply the HighRedundancy ESC. Trying to ship the EC and the HR ESC in the same kubectl apply is rejected by admission — the EC arrives first, but its adopted-BD set is still empty when the ESC admission runs.
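The two numeric thresholds can be checked before applying the ESC. The following sketch mirrors them from the client side; hr_preflight is a hypothetical helper, and its second argument must express spec.storage.nodeSelector in kubectl -l syntax. The webhook remains the authoritative check.

```shell
# Hypothetical pre-flight for a HighRedundancy ESC.
# $1: ElasticCluster name; $2: node selector in kubectl -l syntax.
hr_preflight() {
  cluster="$1"; selector="$2"
  # Nodes eligible for the 5-mon quorum.
  nodes=$(d8 k get nodes -l "$selector" -o name | grep -c . || true)
  # Distinct nodes hosting BlockDevices adopted by this cluster.
  bd_nodes=$(d8 k get blockdevice -l "sds-elastic.deckhouse.io/cluster=$cluster" \
    -o jsonpath='{range .items[*]}{.status.nodeName}{"\n"}{end}' | sort -u | grep -c . || true)
  echo "matching nodes: $nodes (need 5), nodes with adopted BDs: $bd_nodes (need 4)"
  [ "$nodes" -ge 5 ] && [ "$bd_nodes" -ge 4 ]
}
```

Example: hr_preflight ceph-prod node-role.deckhouse.io/storage && d8 k apply -f <hr-esc-manifest>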

The audit trail lives on ElasticCluster.status.cephTopology:

d8 k get elasticcluster ceph-prod -o jsonpath='{.status.cephTopology}'
# {"monCount":5,"mgrCount":3,"reason":"HighRedundancyESCPresent","lastPromotedAt":"2026-…"}

Possible reason values: Standard, HighRedundancyESCPresent, StickyHighWaterMark. To force a recompute (for example, after deliberately scaling down to a smaller cluster), clear the field via the status subresource and trigger a reconcile:

d8 k patch elasticcluster ceph-prod \
  --type=merge --subresource=status \
  -p '{"status":{"cephTopology":null}}'

Wait until each ESC reports Ready:

d8 k get elasticstorageclass -w

The conditions transition is PoolReady → CsiStorageClassReady → aggregate Ready.

Verify the resulting csi-ceph objects and Kubernetes StorageClasses:

d8 k get cephclusterconnection
d8 k get cephstorageclass
d8 k get sc

A CephClusterConnection named after the parent ElasticCluster (ceph-prod) and one CephStorageClass per ElasticStorageClass (ceph-prod-rbd, ceph-prod-cephfs) are expected. Each csi-ceph CephStorageClass produces a Kubernetes StorageClass with the same name, ready to be consumed by PersistentVolumeClaim resources.
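To illustrate consumption, the sketch below renders a PersistentVolumeClaim for one of these StorageClasses. The helper name, the claim name, the size, and the ReadWriteOnce access mode are illustrative choices, not values prescribed by the module.

```shell
# Hypothetical helper: print a PVC manifest for a given StorageClass.
# $1: claim name; $2: StorageClass name; $3: requested size (default 1Gi).
render_pvc() {
  name="$1"; storage_class="$2"; size="${3:-1Gi}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: $storage_class
  resources:
    requests:
      storage: $size
EOF
}
```

Example: render_pvc demo-data ceph-prod-rbd 5Gi | d8 k apply -f -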

The internal helm-managed StorageClass sds-elastic-osd (provisioner kubernetes.io/no-provisioner, volumeBindingMode: WaitForFirstConsumer) backs OSD-local PersistentVolumes and is intentionally not user-facing — ElasticStorageClass resources cannot reuse this name (the validating webhook rejects them).

Deleting Resources

Deleting an ElasticCluster

Deleting an ElasticCluster is reversible as long as no ElasticStorageClass still references it: the controller removes only the resources you cannot delete by hand — the Rook CephCluster and the csi-ceph CephClusterConnection, both protected by the vendor-cr-validation webhook. The OSD disks and the mon store are left intact, so the cluster can be re-created from the same devices.

Order of operations:

  1. Delete every dependent ElasticStorageClass first (see below). The controller refuses to start the cluster teardown while any ESC references it.

  2. Delete the ElasticCluster:

    d8 k delete elasticcluster ceph-prod

    A finalizer holds the CR until the controller has deleted the CephCluster and CephClusterConnection; the CR is then released.

  3. Clean up the remaining controller-labelled objects by hand — they are intentionally preserved (no automatic cascade):

    # inspect what is still labelled with the cluster name
    d8 k get pv,lvmlogicalvolume,lvmvolumegroup -l sds-elastic.deckhouse.io/cluster=ceph-prod
    
    d8 k delete pv -l sds-elastic.deckhouse.io/cluster=ceph-prod
    d8 k delete lvmlogicalvolume -l sds-elastic.deckhouse.io/cluster=ceph-prod
    d8 k delete lvmvolumegroup -l sds-elastic.deckhouse.io/cluster=ceph-prod
    # finally clear the cluster label from the BlockDevices
    d8 k label blockdevice -l sds-elastic.deckhouse.io/cluster=ceph-prod sds-elastic.deckhouse.io/cluster-

    Keep the ElasticClusterCredential if you plan to re-create the cluster with the same identity.

While the teardown is in progress the ElasticCluster Ready condition explains what is blocking it:

Reason | Meaning | Action
StorageClassesExist | One or more ElasticStorageClass still reference this cluster. | Delete the listed ElasticStorageClass objects first.
VolumesExist | The storage backend still has bound PersistentVolumes. | Delete the remaining PersistentVolumes; teardown then continues automatically.
Terminating | Backend resources are being removed. | Wait for completion.

Deleting an ElasticStorageClass

Delete an ElasticStorageClass to remove the corresponding pool and CephStorageClass:

d8 k delete elasticstorageclass ceph-prod-rbd

Deleting an ElasticStorageClass is destructive: it tears down the underlying storage pool / filesystem and the data stored in it. Make sure no application still needs the data first.

A finalizer holds the CR while the controller runs an ordered teardown:

  1. It refuses to delete anything while any PersistentVolume provisioned from this StorageClass is still Bound. Delete the consuming PersistentVolumeClaims first — this guard cannot be overridden.
  2. Once nothing is bound, it removes the CephStorageClass and tears down the backing pool / filesystem.
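To see what the first guard is waiting on, the Bound PersistentVolumes of a StorageClass can be listed together with their claims. This is a sketch with a hypothetical helper name; it uses only standard PersistentVolume fields.

```shell
# Hypothetical helper: list Bound PVs of a StorageClass as "<pv> <namespace>/<claim>".
bound_pvs_for_sc() {
  sc="$1"
  d8 k get pv \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.storageClassName}{" "}{.status.phase}{" "}{.spec.claimRef.namespace}{"/"}{.spec.claimRef.name}{"\n"}{end}' |
  awk -v sc="$sc" '$2 == sc && $3 == "Bound" { print $1, $4 }'
}
```

Example: bound_pvs_for_sc ceph-prod-rbd. Each printed claim must be deleted before the teardown can proceed.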

For block (RBD) classes, a pool that still holds data is preserved by default. To permanently delete it (the data in the pool is lost), authorise the destructive purge with the force-deletion annotation:

d8 k annotate elasticstorageclass ceph-prod-rbd sds-elastic.deckhouse.io/force-deletion=true

For shared-filesystem (CephFS) classes there is no force override: the filesystem is removed automatically once it is empty, which you achieve by deleting the remaining PersistentVolumes for the StorageClass.

While the teardown is in progress the ElasticStorageClass Ready condition explains what is blocking it:

Reason | Meaning | Action
BoundVolumesExist | PersistentVolumes provisioned from this StorageClass are still bound. | Delete the consuming PersistentVolumeClaims. The force annotation does not override this.
DataPresentInPool | The block pool still holds data (RBD only). | Set sds-elastic.deckhouse.io/force-deletion=true to permanently delete the pool and its data.
FilesystemNotEmpty | The filesystem still has volumes (CephFS only). | Delete the remaining PersistentVolumes for this StorageClass.
Terminating | Backend resources are being removed. | Wait for completion.
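The reasons in the table can be read programmatically. The sketch below fetches the reason of the Ready condition and prints the matching action; esc_teardown_hint is a hypothetical helper and assumes the condition exposes a reason field.

```shell
# Hypothetical helper: map the Ready condition reason of an
# ElasticStorageClass under deletion to the action from the table above.
esc_teardown_hint() {
  reason=$(d8 k get elasticstorageclass "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}')
  case "$reason" in
    BoundVolumesExist)  echo "delete the consuming PersistentVolumeClaims" ;;
    DataPresentInPool)  echo "annotate with sds-elastic.deckhouse.io/force-deletion=true to purge the pool" ;;
    FilesystemNotEmpty) echo "delete the remaining PersistentVolumes of this StorageClass" ;;
    Terminating)        echo "wait for completion" ;;
    *)                  echo "no teardown blocker reported (reason: ${reason:-none})" ;;
  esac
}
```

Example: esc_teardown_hint ceph-prod-rbd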

PV / LVM / BlockDevice cleanup after deleting an ElasticCluster is manual (see above); end-to-end OwnerReferences-driven GC is tracked as backlog item B20.

Disabling the Module

Disabling the module stops the controller and the Rook operator. Data stored in Ceph clusters managed by this module may become unavailable or be lost. Always delete every ElasticCluster, ElasticStorageClass and ElasticClusterCredential object before disabling the module.

A validating webhook on the sds-elastic ModuleConfig rejects setting spec.enabled: false while any ElasticCluster still exists. This prevents accidentally tearing down the controller and the Rook operator while a live Ceph cluster (OSD data on host disks) is still under management. Follow the ordered teardown below; the disable is accepted only once the last ElasticCluster is gone.

  1. Delete every ElasticStorageClass and wait until the controller has removed the pools and csi-ceph StorageClasses:

    d8 k get elasticstorageclasses.storage.deckhouse.io

    Wait until the command returns No resources found.

  2. Delete every ElasticCluster and wait for cluster teardown:

    d8 k get elasticclusters.storage.deckhouse.io

    Wait until the command returns No resources found.

  3. Optionally remove the ElasticClusterCredential. It is a cluster-scoped identity backup and does not gate the disable (only a live ElasticCluster blocks it). Delete it unless you plan to re-create the cluster with the same identity:

    d8 k get elasticclustercredentials.storage.deckhouse.io
    d8 k delete elasticclustercredential <name>

  4. Disable the module. Disabling requires the modules.deckhouse.io/allow-disabling: "true" annotation on the ModuleConfig:

    d8 k annotate moduleconfig sds-elastic modules.deckhouse.io/allow-disabling=true --overwrite
    d8 k patch moduleconfig sds-elastic --type=merge -p '{"spec":{"enabled":false}}'
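The ordered teardown can be guarded by a small wrapper that refuses to disable the module while any ElasticStorageClass or ElasticCluster remains. This is a sketch with a hypothetical helper name; the webhook described above still enforces the ElasticCluster check on the server side.

```shell
# Hypothetical wrapper: disable sds-elastic only when no ESC or EC remains.
disable_sds_elastic() {
  for kind in elasticstorageclasses.storage.deckhouse.io elasticclusters.storage.deckhouse.io; do
    left=$(d8 k get "$kind" -o name)
    if [ -n "$left" ]; then
      echo "refusing to disable: $kind still present:" >&2
      printf '%s\n' "$left" >&2
      return 1
    fi
  done
  d8 k annotate moduleconfig sds-elastic modules.deckhouse.io/allow-disabling=true --overwrite &&
  d8 k patch moduleconfig sds-elastic --type=merge -p '{"spec":{"enabled":false}}'
}
```

Example: disable_sds_elastic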

Forcing the Module Off While ElasticClusters Remain

This bypasses the safety guard. Use it only for disaster recovery, when you deliberately want to keep the ElasticCluster CRs and their on-disk data but stop the module from managing them. The Ceph cluster will be left orphaned (no operator), and the controller finalizers on the leftover CRs are stripped by the module-delete hook so the API server can garbage-collect them. OSD data on host disks and dataDirHostPath are not erased, but they are no longer managed and may become unrecoverable through normal means.

If you must disable the module without deleting the ElasticClusters first, set the sds-elastic.deckhouse.io/force-disable: "true" annotation on the ModuleConfig. With this annotation present, the webhook allows spec.enabled: false regardless of how many ElasticClusters exist:

d8 k annotate moduleconfig sds-elastic sds-elastic.deckhouse.io/force-disable=true --overwrite
d8 k annotate moduleconfig sds-elastic modules.deckhouse.io/allow-disabling=true --overwrite
d8 k patch moduleconfig sds-elastic --type=merge -p '{"spec":{"enabled":false}}'

Checking Cluster Health

The controller exposes coarse-grained progress on each CR through conditions. For an ElasticCluster:

d8 k describe elasticcluster <cluster-name>

Useful conditions: StorageReady, CephClusterReady, CredentialsReady, CsiCephReady, UpgradeReady, UpgradeInProgress, and the aggregate Ready.

The UPGRADING printcolumn (and the underlying UpgradeInProgress condition) tracks the per-daemon convergence picture Rook publishes under CephCluster.status.ceph.versions.overall. While the map carries more than one key the cluster is mid-rollout and UPGRADING stays True for the whole window — including the mon → mgr → osd → mds rolling phases when CephCluster.status.phase=Progressing and the FSM gates downstream stages. UpgradeInProgress flips back to False only once versions.overall has a single key matching the desired version. Note that EC.status.cephVersion.running (the Ceph printcolumn) reports the lagging version present in versions.overall while daemons disagree, so it shows what callers will still hit on the slowest-rolling daemon (typically OSDs), not Rook’s already-bumped target marker.
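A script that must not proceed during a rollout can block on the UpgradeInProgress condition. The sketch below uses a hypothetical helper name and assumes the condition is always present on the ElasticCluster; if it is absent, the wait simply times out.

```shell
# Hypothetical helper: block until UpgradeInProgress reports False.
# $1: ElasticCluster name; $2: timeout (default 2h).
wait_upgrade_done() {
  d8 k wait elasticcluster "$1" \
    --for=condition=UpgradeInProgress=False --timeout="${2:-2h}"
}
```

Example: wait_upgrade_done ceph-prod 3h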

For an ElasticStorageClass:

d8 k describe elasticstorageclass <esc-name>

Useful conditions: PoolReady, CsiStorageClassReady, and the aggregate Ready.

For a deeper Ceph-level inspection, exec into a Rook toolbox pod:

d8 k -n d8-sds-elastic exec -it deploy/rook-ceph-tools -- ceph status
d8 k -n d8-sds-elastic exec -it deploy/rook-ceph-tools -- ceph osd tree