Available in editions:  CE, BE, SE, SE+, EE

The module lifecycle stage: General Availability

The monitoring-deckhouse module provides comprehensive monitoring, alerting, and observability for the Deckhouse Kubernetes Platform itself. It monitors the health, performance, and proper operation of DKP core components to ensure platform stability and reliability.

This module works in conjunction with the prometheus module to expose the operational state of DKP.

The module deploys monitoring resources that:

  • Collect Deckhouse metrics: The module scrapes metrics from the Deckhouse pod using PodMonitor resources, including:
    • Self metrics on port 4222 via the /metrics endpoint.
    • Custom hook-generated metrics via the /metrics/hooks endpoint.
    • Module execution metrics, hook performance, and system health indicators.
  • Define alerting rules: The module provides Prometheus alerting rules organized into the following categories:
    • DKP availability: Monitors pod health, readiness, and uptime.
    • DKP malfunctioning: Detects excessive restarts, registry access issues, and hung processes.
    • Release management: Tracks release channel subscriptions, pending updates, and manual approvals.
    • Module management: Monitors module state, validation errors, and deprecated configurations.
    • CNI checks: Detects multiple CNI configurations and misconfigurations.
    • OS requirements: Identifies nodes running deprecated operating system versions.
  • Provide Grafana dashboards: The module includes pre-built Grafana dashboards for visualizing:
    • DKP performance metrics.
    • Module execution statistics.
    • Hook run times and resource usage.
    • Queue processing and convergence status.
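As a rough illustration of what scraping an endpoint such as /metrics yields, and of how an availability check could be built on top of it, the sketch below parses Prometheus text-format samples and tests whether a counter stopped increasing between two scrapes. The sample values, the helper names parse_samples and counter_stalled, and the stall logic are assumptions made for illustration; they are not the module's actual alerting rules.

```python
import re

# Matches sample lines such as `name{label="v"} 12.5` or `name 12.5`.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_samples(text):
    """Return {(metric_name, raw_label_string): value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        samples[(name, labels)] = float(value)
    return samples


def counter_stalled(previous, current, name):
    """True if the summed value of counter `name` is unchanged between scrapes.

    Simplification: a counter reset (for example after a pod restart)
    changes the value and is therefore not reported as a stall.
    """
    prev_total = sum(v for (n, _), v in previous.items() if n == name)
    curr_total = sum(v for (n, _), v in current.items() if n == name)
    return curr_total == prev_total


# Hypothetical scrape results, invented for this example.
first_scrape = parse_samples("# TYPE deckhouse_live_ticks counter\ndeckhouse_live_ticks 120\n")
second_scrape = parse_samples("# TYPE deckhouse_live_ticks counter\ndeckhouse_live_ticks 123\n")

print(counter_stalled(first_scrape, second_scrape, "deckhouse_live_ticks"))
print(counter_stalled(second_scrape, second_scrape, "deckhouse_live_ticks"))
```

In a real cluster this kind of check is expressed as a Prometheus alerting rule rather than in application code; the Python version only shows the underlying comparison.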

Metrics collection

The module configures a PodMonitor that scrapes two endpoints from the Deckhouse pod:

  1. DKP metrics (/metrics): Core DKP operational metrics, including:
    • deckhouse_live_ticks: Health indicator that increments every 10 seconds.
    • deckhouse_registry_errors: Registry connectivity issues.
    • deckhouse_module_hook_run_seconds: Module hook execution duration.
    • deckhouse_tasks_queue_action_duration_seconds: Task queue processing times.
    • And many more operational metrics.
  2. Hook metrics (/metrics/hooks): Custom metrics generated by DKP hooks. This endpoint is scraped with honorLabels: true to preserve hook-specific labels.
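The two-endpoint scrape configuration described above could look roughly like the PodMonitor built in the following sketch. It uses the Prometheus Operator PodMonitor fields podMetricsEndpoints, path, and honorLabels. The resource name, namespace, pod labels, and the port name "self" (assumed to refer to container port 4222) are hypothetical placeholders, not values taken from the module.

```python
import json


def build_pod_monitor(name, namespace, port_name, pod_labels):
    """Build a PodMonitor manifest (as a dict) with two scrape endpoints."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Select the pods to scrape by label, within one namespace.
            "selector": {"matchLabels": pod_labels},
            "namespaceSelector": {"matchNames": [namespace]},
            "podMetricsEndpoints": [
                # Core operational metrics.
                {"port": port_name, "path": "/metrics"},
                # Hook metrics: keep the labels set by the hooks themselves.
                {"port": port_name, "path": "/metrics/hooks", "honorLabels": True},
            ],
        },
    }


# All argument values below are illustrative assumptions.
manifest = build_pod_monitor(
    name="deckhouse",
    namespace="d8-system",
    port_name="self",
    pod_labels={"app": "deckhouse"},
)
print(json.dumps(manifest, indent=2))
```

Setting honorLabels only on the hooks endpoint means that labels exposed by hook metrics win over the labels Prometheus would attach at scrape time when the two conflict, while the core endpoint keeps the default behavior.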

Integration with Observability module

When the observability module is enabled, this module automatically creates:

  • ClusterObservabilityMetricsRulesGroup resources for Prometheus rules.
  • ClusterObservabilityDashboard resources for Grafana dashboards.

This enables centralized management and multi-tenancy support for monitoring resources.

Requirements

  • The prometheus module must be enabled (automatic dependency).
  • The operator-prometheus module should be enabled for PodMonitor support.