Available in editions: CE, BE, SE, SE+, EE

The module lifecycle stage: General Availability

The monitoring-deckhouse module provides monitoring and alerting for the Deckhouse Kubernetes Platform itself. It tracks the health, performance, and correct operation of Deckhouse core components so that problems with the platform are detected early.

The module works together with the prometheus module to provide insight into Deckhouse’s operational state.

The module deploys monitoring resources that:

  • Collect Deckhouse metrics — Scrapes metrics from the Deckhouse pod using PodMonitor resources, including:
    • Deckhouse's own metrics on port 4222 via the /metrics endpoint
    • Custom hook-generated metrics via the /metrics/hooks endpoint
    • Module execution metrics, hook performance, and system health indicators
  • Define alerting rules — Provides comprehensive Prometheus alerting rules organized into several categories:
    • Deckhouse availability — Monitors pod health, readiness, and uptime
    • Deckhouse malfunctioning — Detects excessive restarts, registry access issues, hung processes
    • Release management — Tracks release channel subscriptions, pending updates, and manual approvals
    • Module management — Monitors module state, validation errors, and deprecated configurations
    • CNI checks — Detects multiple CNI configurations and misconfigurations
    • OS requirements — Identifies nodes running deprecated operating system versions
  • Provide Grafana dashboards — Includes pre-built Grafana dashboards for visualizing:
    • Deckhouse performance metrics
    • Module execution statistics
    • Hook run times and resource usage
    • Queue processing and convergence status
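
To make the collection step above more concrete, here is a minimal sketch of what a PodMonitor with two scrape endpoints could look like, built as a Python dictionary and printed as JSON. It is not the manifest the module actually ships. The resource name, the d8-system namespace, the app: deckhouse pod label, and the port name "self" are all assumptions chosen for illustration; only the two paths and the honorLabels setting come from the description above.

```python
import json


def build_pod_monitor(namespace="d8-system", port_name="self", pod_labels=None):
    """Return an illustrative PodMonitor manifest as a plain dict.

    The namespace, port name, and pod labels are hypothetical defaults,
    not values confirmed by the module documentation.
    """
    pod_labels = pod_labels or {"app": "deckhouse"}  # assumed label
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": "deckhouse", "namespace": namespace},
        "spec": {
            "namespaceSelector": {"matchNames": [namespace]},
            "selector": {"matchLabels": pod_labels},
            "podMetricsEndpoints": [
                # Deckhouse's own metrics
                {"port": port_name, "path": "/metrics"},
                # Hook-generated metrics; honorLabels keeps the labels
                # set by the hooks instead of overwriting them
                {"port": port_name, "path": "/metrics/hooks", "honorLabels": True},
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pod_monitor(), indent=2))
```

In a PodMonitor, the port field refers to a named container port, so a real manifest would need the name under which the Deckhouse pod exposes port 4222.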

Metrics collection

The module configures a PodMonitor that scrapes two endpoints from the Deckhouse pod:

  1. Deckhouse metrics (/metrics) — Core Deckhouse operational metrics:
    • deckhouse_live_ticks — Health indicator incrementing every 10 seconds
    • deckhouse_registry_errors — Registry connectivity issues
    • deckhouse_module_hook_run_seconds — Module hook execution duration
    • deckhouse_tasks_queue_action_duration_seconds — Task queue processing times
    • And many more operational metrics
  2. Hook metrics (/metrics/hooks) — Custom metrics generated by Deckhouse hooks with honorLabels: true to preserve hook-specific labels
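
As an illustration of how deckhouse_live_ticks can serve as a health indicator, the sketch below reads the metric from two scrapes in the Prometheus text exposition format and checks that it advanced roughly as expected for a 10-second tick. The function names, the 50% tolerance, and the sample payloads are assumptions made up for this example; they are not the module's actual alerting logic.

```python
import re

_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def read_sample(exposition, metric):
    """Return the first sample value of `metric`, or None if it is absent."""
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME_RE.match(line)
        if match is None or match.group(0) != metric:
            continue
        # The value follows the label set if there is one, otherwise the name
        tail = line[line.rfind("}") + 1:] if "{" in line else line[match.end():]
        return float(tail.split()[0])
    return None


def ticks_look_healthy(before, after, elapsed_seconds,
                       tick_interval=10.0, tolerance=0.5):
    """True if the counter grew by at least `tolerance` of the expected ticks."""
    expected = elapsed_seconds / tick_interval
    return (after - before) >= tolerance * expected


if __name__ == "__main__":
    # Made-up payloads standing in for two scrapes taken 60 seconds apart
    first = "# TYPE deckhouse_live_ticks counter\ndeckhouse_live_ticks 120\n"
    second = "# TYPE deckhouse_live_ticks counter\ndeckhouse_live_ticks 126\n"
    a = read_sample(first, "deckhouse_live_ticks")
    b = read_sample(second, "deckhouse_live_ticks")
    print(ticks_look_healthy(a, b, elapsed_seconds=60))
```

In practice this kind of check is expressed as a Prometheus alerting rule rather than client-side code; the sketch only shows the underlying comparison.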

Integration with the observability module

When the observability module is enabled, this module automatically creates:

  • ClusterObservabilityMetricsRulesGroup resources for Prometheus rules
  • ClusterObservabilityDashboard resources for Grafana dashboards

This enables centralized management and multi-tenancy support for monitoring resources.

Requirements

  • The prometheus module must be enabled (automatic dependency)
  • The operator-prometheus module should be enabled for PodMonitor support