The documentation is under development and may contain incomplete information.

Cluster node management

You can manage cluster nodes using the node-manager module. node-manager provides the following main features:

  1. Management of multiple nodes as a linked group (NodeGroup):
    • Specifying metadata to be applied to all nodes in a group.
    • Monitoring a group of nodes as an entire object: with segmentation of nodes in graphs, aggregation of alerts about node unavailability, and notifications when a certain number or percentage of nodes in a group is unavailable.
  2. Installation, update, and configuration of the node software (containerd, kubelet, etc.), and connection of a node to a cluster:
    • Installation of an operating system (refer to the list of supported operating systems) regardless of the infrastructure type, whether in a cloud environment or on physical hardware.
    • Basic operating system (OS) configuration: disabling auto-update, installing required packages, configuring logging settings, etc.
    • Configuration of nginx to balance requests from nodes between API servers (kubelet), including setting up automatic updates of the upstream server list.
    • Installing and configuring the containerd and Kubernetes CRI, adding a node to a cluster.
    • Managing node updates and disruptions:
      • Automatic selection of an acceptable minor version of Kubernetes for a group of nodes based on its configuration (kubernetesVersion), the default version of the entire cluster, and the current version of the control plane. The node update isn’t allowed if it comes ahead of the control plane update.
      • Only one node in a group is updated at a time, and only when all nodes in the group are available.
      • Two node update options are available:
        • Normal, which always occur automatically.
        • Updates requiring certain disruptions, such as the kernel update, containerd version change, major kubelet version change, and so on. When automatic disruptive updates are enabled, the update is preceded by a node drain process. This setting can be disabled.
    • Monitoring of update status and progress.
  3. Cluster scaling.
  4. Managing Linux users on nodes.

Node types

DVP is intended to run on bare-metal servers, therefore, the following sections cover the management of Static nodes.

To learn about other node types and cloud provider options, refer to the Deckhouse Kubernetes Platform (DKP) documentation.

Node group

To manage nodes, DVP uses node groups that are described in the NodeGroup resources. Each node group performs specific tasks, for example:

  • A group for control-plane components of Kubernetes.
  • A group for monitoring components.
  • A group for control-plane components of DVP.
  • A group of nodes with virtual machines (vm-worker nodes).
  • A group of nodes with container applications (worker nodes) and so on.

Grouping nodes and distributing components between node groups depend on cluster tasks. For DVP cluster configuration examples, refer to Platform installation.

Nodes in a group share common metadata and parameters. This lets you automatically configure them according to the group configuration. Also, DVP keeps track of the number of nodes in a group and updates the installed software.

The following monitoring features are available for such node groups:

  • Grouping of node parameters on group graphs.
  • Grouping of node unavailability alerts.
  • Alerts about unavailability of a certain number or percentage of nodes in a group.

Deployment, configuration, and update of Kubernetes nodes

Deployment of Kubernetes nodes

Deckhouse automatically performs the following immutable operations to deploy cluster nodes:

  1. Configuration and optimization of the operating system to work with containerd and Kubernetes:
    • Installation of the necessary packages from the repositories of a corresponding distribution.
    • Configuration of the kernel operation parameters, logging parameters, log rotation, and other system parameters.
  2. Installation of the required versions of containerd and kubelet. Adding a node to the Kubernetes cluster.
  3. Configuration of nginx and updating the upstream list to balance requests from a node to the Kubernetes API.

Keeping nodes up to date

Two types of updates can be applied to keep cluster nodes up to date:

  • Common updates: These updates are always applied automatically and don’t cause the node to stop or reboot.
  • Disruptive updates: These updates may include a kernel or containerd version update, a major kubelet version change, and so on. For this type of updates, you can select manual or automatic mode using the disruptions parameter section. In the automatic mode, the update can only be applied after a node is properly suspended (drained).

Only one node in the group is updated at any given time, and this is only possible when all nodes in the group are in the available state.

The node-manager module has a set of built-in monitoring metrics. These metrics let you track update progress and receive notifications about encountered issues or when it’s necessary to manually grant a certain permission to update.

Working with static nodes

Limitations

When working with static nodes, there are certain limitations to the node-manager module’s features:

  • No node provisioning: Resources such as bare-metal servers, virtual machines, and linked resources are provisioned manually. The subsequent configuration of resources, such as connecting a node to the cluster, setting up monitoring and so on, is made either fully or partly automatically.
  • No node autoscaling: Using the staticInstances.count parameter in CAPS lets you maintain the specified number of nodes in a group. DVP will attempt to maintain the specified number of nodes in the group, cleaning up unnecessary nodes and setting up new ones as needed by picking them from the StaticInstance resources that are in Pending state.

Managing a static node manually

You can use pre-made scripts to configure or clear up a node and connect or disconnect it from a cluster.

To configure a server (VM) and add a node to the cluster, download and run a special bootstrap script. This script is generated for each static node group (each NodeGroup resource) and located in the d8-cloud-instance-manager/manual-bootstrap-for-<NODEGROUP-NAME> secret. For an example of adding a static node to a cluster, refer to the corresponding section.

To disconnect a cluster node and clean up the server (VM), run the /var/lib/bashible/cleanup_static_node.sh script, which comes with each static node. For an example of disconnecting a cluster node and cleaning up the server, refer to Removing a node from a cluster.

Managing a static node automatically

The automatic static node management is performed via CAPS.

CAPS connects to the server (VM) using the StaticInstance and SSHCredentials resources, configures the node, and adds it to the cluster.

If necessary (for example, if the StaticInstance resource associated with the server is deleted or the number of group nodes is reduced), CAPS connects to the cluster node, clears it, and disconnects it from the cluster.

Managing an existing node automatically

This feature is supported in Deckhouse 1.63 or higher.

To transfer an existing cluster node under CAPS management, prepare the StaticInstance and SSHCredentials resources for this node, as you would do for automatic management following the guidelines above. When preparing the StaticInstance resource, annotate it as static.node.deckhouse.io/skip-bootstrap-phase: "".

Configuring a node via CAPS

CAPS is an implementation of a declarative management provider for static nodes (bare-metal servers or virtual machines) for the Kubernetes Cluster API. CAPS is an additional layer of abstraction above the existing Deckhouse functionality that provides automatic static node configuration and cleanup using scripts generated for each node group.

CAPS provides the following features:

  • Configuring a bare-metal server or virtual machine to connect to a Kubernetes cluster.
  • Adding a node to a Kubernetes cluster.
  • Removing a node from a Kubernetes cluster.
  • Cleaning a bare-metal server or virtual machine after a node is disconnected from a Kubernetes cluster.

CAPS uses the following custom resources:

  • StaticInstance: Each StaticInstance resource details a specific host (server or VM) that is managed using CAPS.
  • SSHCredentials: Contains the SSH credentials required to connect to the host. SSHCredentials is specified in the credentialsRef parameter of the StaticInstance resource.
  • NodeGroup: The staticInstances parameter section defines the required number of nodes in a group and the filter for the StaticInstance resources that can be used in the group.

CAPS is enabled automatically if you fill in the staticInstances parameter section in the NodeGroup. If the staticInstances parameter section in the NodeGroup is empty, then you have to manually configure and clean up nodes in that group, instead of using CAPS. For details on adding and cleaning up a node in a cluster, refer to Managing a static node manually.

The general procedure for working with static nodes when using CAPS is as follows:

  1. Preparing resources.

    Before bringing a bare-metal server or virtual machine under the CAPS management, the following preliminary steps may be necessary:

    • Preparing the storage system, adding mount points, and so on.
    • Installing OS-specific packages. For example, installing the ceph-common package if the server uses CEPH volumes.
    • Configuring the network connectivity. For example, between the server and cluster nodes.
    • Configuring the SSH access to the server, creating a user with the root-level access via sudo. A good practice is to create a separate user and unique keys for each server.
  2. Creating a SSHCredentials resource.

    The SSHCredentials resource contains parameters required by CAPS to connect to a server via SSH. A single SSHCredentials resource can be used to connect to multiple servers, but it’s a good practice to create unique users and access keys to connect to each server. In this case, the SSHCredentials resource will also be unique for each server.

  3. Creating a StaticInstance resource.

    For every server (VM) in the cluster, an individual StaticInstance resource is created. It contains the IP address for connecting and a link to the SSHCredentials resource with the data required for connecting.

    The following is a list of possible states of StaticInstance, the associated servers (VMs), and cluster nodes:

    • Pending: The server isn’t configured and there’s no associated node in the cluster.
    • Bootstrapping: The procedure for configuring the server (VM) and connecting the node to the cluster is in progress.
    • Running: The server is configured, and the associated node is added to the cluster.
    • Cleaning: The procedure of cleaning up the server and disconnecting a node from the cluster is in progress.

    You can transfer the existing manually-bootstrapped cluster node under CAPS management by annotating its StaticInstance with static.node.deckhouse.io/skip-bootstrap-phase: "".

  4. Creating a NodeGroup resource.

    When using CAPS, focus on the nodeType parameter (it must be Static) and the staticInstances parameter section of the NodeGroup resource.

    The staticInstances.labelSelector parameter section defines a filter that CAPS applies to select the StaticInstance resources to be used for a group. The filter allows only certain StaticInstance to be used for specific node groups and also allows a single StaticInstance to be used in different node groups. You can choose not to define a filter to use any available StaticInstance for a node group.

    The staticInstances.count parameter specifies the desired number of nodes in the group. If the parameter is changed, CAPS starts adding or removing the desired number of nodes (this process runs in parallel).

    Using the data in the staticInstances parameter section, CAPS attempts to maintain the specified number of nodes in the group (the count parameter). If a node needs to be added to the group, CAPS selects the StaticInstance resource that matches the filter and is in the Pending state, configures the server (VM), and adds the node to the cluster. If a node needs to be removed from the group, CAPS selects the StaticInstance resource that is in the Running state, cleans up the server (VM), and disconnects the node from the cluster. The corresponding StaticInstance resource then goes to the Pending state and can be reused.

    For an example of adding a node, refer to Adding a static node using CAPS.

How to interpret node group states?

Ready: A node group contains the minimum required number of scheduled nodes with the the Ready status for all zones.

Example 1. A node group in the Ready state:

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: ng1
spec:
  nodeType: CloudEphemeral
  cloudInstances:
    maxPerZone: 5
    minPerZone: 1
status:
  conditions:
  - status: "True"
    type: Ready
---
apiVersion: v1
kind: Node
metadata:
  name: node1
  labels:
    node.deckhouse.io/group: ng1
status:
  conditions:
  - status: "True"
    type: Ready

Example 2. A node group in the Not Ready state:

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: ng1
spec:
  nodeType: CloudEphemeral
  cloudInstances:
    maxPerZone: 5
    minPerZone: 2
status:
  conditions:
  - status: "False"
    type: Ready
---
apiVersion: v1
kind: Node
metadata:
  name: node1
  labels:
    node.deckhouse.io/group: ng1
status:
  conditions:
  - status: "True"
    type: Ready

Updating: A node group contains at least one node with an annotation prefixed with update.node.deckhouse.io (for example, update.node.deckhouse.io/waiting-for-approval).

WaitingForDisruptiveApproval: A node group contains at least one node, which has the annotation update.node.deckhouse.io/disruption-required but doesn’t have the annotation update.node.deckhouse.io/disruption-approved.

Scaling: Calculated only for node groups of the CloudEphemeral type. Can be in True state in two cases:

  1. When the number of nodes is less than the desired number of nodes in the group. That is, when it’s necessary to increase the number of nodes in the group.
  2. When a node is marked for deletion or when the number of nodes is greater than the desired number of nodes. That is, when it’s necessary to reduce the number of nodes in the group.

The desired number of nodes is the sum of all replicas in a node group.

Example. The desired number of nodes is 2:

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: ng1
spec:
  nodeType: CloudEphemeral
  cloudInstances:
    maxPerZone: 5
    minPerZone: 2
status:
...
  desired: 2
...

Error: Contains the latest error that occurred when creating a node in a node group.

Impact of NodeGroup parameters

NodeGroup parameter Disruption update Node provisioning kubelet restart
chaos - - -
cloudInstances.classReference - + -
cloudInstances.maxSurgePerZone - - -
cri.containerd.maxConcurrentDownloads - - +
cri.type - (NotManaged) / + (other) - -
disruptions - - -
kubelet.maxPods - - +
kubelet.rootDir - - +
kubernetesVersion - - +
nodeTemplate - - -
static - - +
update.maxConcurrent - - -

For details about all parameters, refer to the NodeGroup custom resource description.

Changing the InstanceClass or instancePrefix parameter in the Deckhouse configuration won’t result in a RollingUpdate. Deckhouse will create new MachineDeployment objects and delete the old ones. The number of MachineDeployment objects ordered at the same time is determined by the cloudInstances.maxSurgePerZone parameter.

During the disruption update, pods are evicted from the node. If a pod couldn’t be evicted, the eviction attempt is retried every 20 seconds until a global timeout of 5 minutes is reached. After that, the pods that couldn’t be evicted are removed.

How do I allocate nodes to specific loads?

You can’t use the deckhouse.io domain in labels and taints keys of the NodeGroup resource. deckhouse.io is reserved for Deckhouse components. Use the dedicated or dedicated.client.com keys.

There are two ways to accomplish this task:

  1. You can set labels to spec.nodeTemplate.labels in NodeGroup to use them in the spec.nodeSelector or spec.affinity.nodeAffinity parameters. In this case, you select nodes that the scheduler will use for running the target application.
  2. You cat set taints to spec.nodeTemplate.taints in NodeGroup and then remove them using the spec.tolerations parameter. In this case, you disallow running applications on these nodes unless those applications are explicitly allowed.

Deckhouse tolerates the dedicated taints by default, so we recommend using the dedicated key with any value for taints on your dedicated nodes.️

To use custom keys for taints (for example, dedicated.client.com), add the key’s value to the modules.placement.customTolerationKeys parameter. This way, Deckhouse can deploy system components (for example, cni-flannel) to these dedicated nodes.

System components

Deckhouse components use labels and taints for node allocation. System components can be allocated to separate nodes using the following NodeGroup:

nodeTemplate:
  labels:
    node-role.deckhouse.io/system: ""
  taints:
    - effect: NoExecute
      key: dedicated.deckhouse.io
      value: system

How do I allocate nodes to virtual machines?

For virtual machines to run on nodes in a specific group, after creating the group, create a VirtualMachineClass resource with nodeSelector.

Example for the vm-worker group:

apiVersion: deckhouse.io/v1
kind: NodeGroup
metadata:
  name: vm-worker
spec:
  nodeType: Static

A VirtualMachineClass resource with nodeSelector for the vm-worker group:

apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualMachineClass
metadata:
  name: vm-worker
spec:
  nodeSelector:
    matchExpressions:
    - key: node.deckhouse.io/group
      operator: In
      values:
        - vm-worker

An excerpt from a manifest for a virtual machine that will be running on the vm-worker group’s nodes:

apiVersion: virtualization.deckhouse.io/v1alpha2
kind: VirtualMachine
metadata:
  name: vm-name
spec:
  virtualMachineClassName: vm-workers
  # More VM fields ...