Deckhouse Kubernetes Platform (DKP) provides a Kubernetes monitoring solution based on Prometheus and Grafana. DKP automatically configures metrics collection in the cluster from nodes, pods, and key cluster components (etcd, kube-apiserver, CoreDNS), which enables preset dashboards for analyzing CPU, memory, disk, and network usage.
Cluster monitoring is enabled by default in the Default and Managed module bundles.
All components, including Prometheus and Alertmanager, operate in a fault-tolerant mode and can be used in cloud environments and on bare-metal servers.
The principles of Prometheus operation are covered in Configuring a system for collecting and storing metrics.
Several types of monitoring are implemented in DKP:
- Hardware resource monitoring
- Kubernetes monitoring
- Ingress monitoring
- Control plane monitoring
- Network interaction monitoring
- Extended monitoring
- Cluster SLA monitoring
DKP includes an alerting system that supports sending event notifications, including to external systems.
Hardware resource monitoring
Tracking of cluster hardware resource capacity is provided with graphs showing utilization of:
- CPU
- Memory
- Disk
- Network
Graphs are available with aggregation by:
- Pods
- Controllers
- Namespaces
- Nodes
Kubernetes monitoring
The module monitoring-kubernetes is designed for basic cluster node monitoring.
It provides secure metrics collection and offers a basic set of rules for monitoring:
- Current container runtime version (docker, containerd) on the node and its compliance with versions allowed for use.
- Overall cluster monitoring subsystem health (Dead man’s switch).
- Available file descriptors, sockets, free space, and inodes.
- Operation of kube-state-metrics, node-exporter, and kube-dns.
- Cluster node state (NotReady, drain, cordon).
- Time synchronization state on nodes.
- Prolonged periods of excessive CPU steal.
- Conntrack table state on nodes.
- Pods with incorrect state (as a possible consequence of kubelet issues) and more.
Ingress monitoring
Statistics collection for ingress-nginx in Prometheus is implemented with detailed metrics (response time, codes, geography, etc.), available in different dimensions (namespace, vhost, ingress). Data is visualized in Grafana with interactive dashboards.
A detailed description is available in the section about Ingress monitoring.
The module is enabled by default in the Default and Managed module bundles.
Disabling collection of detailed statistics from Ingress resources
By default, DKP collects detailed statistics from all Ingress resources in the cluster, which generates a high load on the monitoring system.
To disable statistics collection, add the label ingress.deckhouse.io/discard-metrics: "true" to the corresponding namespace or Ingress resource.
- Example of disabling statistics (metrics) collection for all Ingress resources in the review-1 namespace:

  d8 k label ns review-1 ingress.deckhouse.io/discard-metrics=true

- Example of disabling statistics (metrics) collection for the test-site Ingress resource in the development namespace:

  d8 k label ingress test-site -n development ingress.deckhouse.io/discard-metrics=true
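The labeling commands above can be wrapped in small helpers that also cover reverting the change. This is a minimal sketch, not part of DKP: the function names are hypothetical, and it assumes that d8 k passes its arguments through to kubectl, so that --overwrite, the trailing-dash label removal, and the -l selector behave as they do in kubectl. It also assumes that removing the label resumes statistics collection.

```shell
# Label documented above for disabling detailed Ingress statistics.
LABEL="ingress.deckhouse.io/discard-metrics"

# Hypothetical helper: stop collecting detailed statistics for a namespace.
# --overwrite lets the command succeed if the label is already set.
discard_ns_metrics() { d8 k label ns "$1" "${LABEL}=true" --overwrite; }

# Hypothetical helper: remove the label (a trailing "-" deletes a label).
# Assumption: collection resumes once the label is gone.
restore_ns_metrics() { d8 k label ns "$1" "${LABEL}-"; }

# Hypothetical helper: list namespaces that currently carry the label.
list_discarded_ns() { d8 k get ns -l "${LABEL}=true"; }
```

For example, discard_ns_metrics review-1 applies the label to the review-1 namespace, and restore_ns_metrics review-1 removes it again.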
Control plane monitoring
Control plane monitoring is performed using the monitoring-kubernetes-control-plane module, which organizes secure metrics collection and provides a basic set of monitoring rules for the following cluster components:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- kube-etcd
Cluster monitoring
DKP securely collects monitoring metrics and configures rules.
DKP monitoring capabilities:
- Monitoring current container runtime version (containerd) on the node and its compliance with versions allowed for use in DKP.
- Monitoring cluster monitoring subsystem health (“Dead man’s switch”).
- Monitoring available file descriptors, sockets, free space, and inodes.
- Monitoring cluster node state (NotReady, drain, cordon).
- Monitoring operation of kube-state-metrics, node-exporter, and kube-dns.
- Monitoring time synchronization state on nodes.
- Monitoring prolonged periods of excessive CPU steal.
- Monitoring Conntrack table state on nodes.
- Monitoring pods with incorrect state (as a possible consequence of kubelet issues).
- Monitoring control plane components (implemented by the monitoring-kubernetes-control-plane module).
Extended monitoring mode
DKP supports an extended monitoring mode via the extended-monitoring module, allowing you to configure:
- Monitoring secrets in the cluster (Secret objects) and TLS certificate expiration in them.
- Collecting Kubernetes cluster events as metrics.
- Monitoring container image availability in the registry used by controllers (Deployments, StatefulSets, DaemonSets, CronJobs).
- Monitoring objects in namespaces that have the extended-monitoring.deckhouse.io/enabled="" label.
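Opting a namespace into extended monitoring comes down to setting the label mentioned in the last item. The sketch below is illustrative only: the function names and the my-app namespace are hypothetical, and it assumes that d8 k forwards its arguments to kubectl, where key= sets a label with an empty value and -l key selects objects that carry the label regardless of its value.

```shell
# Label from the list above; its value is an empty string.
EM_LABEL="extended-monitoring.deckhouse.io/enabled"

# Hypothetical helper: mark a namespace for extended monitoring.
# "key=" sets the label with an empty value.
enable_extended_monitoring() { d8 k label ns "$1" "${EM_LABEL}=" --overwrite; }

# Hypothetical helper: list namespaces that carry the label (any value).
list_extended_monitoring_ns() { d8 k get ns -l "${EM_LABEL}"; }
```

For example, enable_extended_monitoring my-app labels a namespace named my-app, and list_extended_monitoring_ns shows every namespace that has the label.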
The module can send alerts based on the following metrics:
- Free space and inodes on node disks
- Node utilization
- Pod and container image availability
- Certificate expiration
- Other cluster events
Alerts
Monitoring in DKP includes event notifications. The standard delivery includes a basic set of alerts covering the state of the cluster and its components. You can also add custom alerts.
For the list of all available alerts in the DKP monitoring system, refer to the corresponding documentation page.
Sending alerts to external systems
DKP supports sending alerts using Alertmanager:
- Via SMTP protocol
- To PagerDuty
- To Slack
- To Telegram
- Via Webhook
- Through any other channels supported in Alertmanager
Examples of DKP monitoring integration with external systems are available in Configuring integrations.
Cluster SLA monitoring
Availability assessment in DKP is performed by the upmeter module.
Composition of the upmeter module:
- agent: Runs on master nodes, performs availability probes, and sends the results to the server.
- upmeter: Collects results and maintains an API server for their retrieval.
- front:
  - status: Shows availability level for the last 10 minutes (requires authorization, but it can be disabled).
  - webui: Shows a dashboard with statistics on probes and availability groups (requires authorization).
- smoke-mini: Maintains continuous smoke testing using StatefulSet.
The module sends about 100 metric readings every 5 minutes. This value depends on the number of enabled Deckhouse Kubernetes Platform modules.
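To put this rate in perspective, the arithmetic can be sketched as follows. The figure of 100 readings per 5 minutes is the approximate value quoted above, so the result is only a rough estimate that varies with the set of enabled modules.

```shell
# Approximate values taken from the text above.
READINGS_PER_INTERVAL=100
INTERVAL_MINUTES=5

# Number of 5-minute intervals in a day: 24 h * 60 min / 5 min = 288.
INTERVALS_PER_DAY=$(( 24 * 60 / INTERVAL_MINUTES ))

# Rough daily volume: 100 * 288 = 28800 readings.
READINGS_PER_DAY=$(( READINGS_PER_INTERVAL * INTERVALS_PER_DAY ))
echo "$READINGS_PER_DAY"
```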