The module is available only in Deckhouse Enterprise Edition.

Architecture of Deckhouse Observability Platform

Deckhouse Observability Platform has a microservice architecture where the entire system is divided into many components, each performing a specific task. These components can be grouped by their purpose.

Below is a schematic diagram with the component groups:

image

Each group is described in detail below.

UI + Control-plane

The UI + Control-plane component group is responsible for managing the platform through the user interface and coordinating all internal processes.

This group includes the following main components:

image

  • The Backend is a stateless component that implements the functionality of the user dashboard. In the backend, users can manage the storage, view dashboards, graphs, alerts, and triggers.
  • The Config Validator is a stateless component that validates different configurations and PromQL expressions in triggers.
  • The Usage Collector is a stateless component that collects metrics from various system components and writes them to Metric Storage for further statistics display.
  • The Reconciler is a stateless component that configures all internal components of the Deckhouse Kubernetes Platform.
  • Grafana is a data visualization tool integrated into the system’s backend, used for displaying dashboards.
  • Auth is a stateless component that authenticates all API requests based on API tokens. Upon successful authentication, requests are routed to the corresponding components of the Deckhouse Kubernetes Platform.
  • PostgreSQL is the database used for storing configuration data and other metadata.
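
The token-based flow described for Auth can be sketched as follows. This is only an illustration under stated assumptions, not the platform's implementation: the token store, the route table, the path prefixes, and the function names `authenticate` and `route` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical token store: SHA-256 digest of an API token -> project name.
TOKENS = {hashlib.sha256(b"secret-token").hexdigest(): "project-a"}

# Hypothetical routing table: request path prefix -> internal component.
ROUTES = {"/api/metrics": "metrics-storage", "/api/logs": "logs-storage"}


def authenticate(token: str):
    """Return the project that owns the token, or None if the token is unknown."""
    digest = hashlib.sha256(token.encode()).hexdigest()
    for known_digest, project in TOKENS.items():
        # Constant-time comparison avoids leaking digest prefixes via timing.
        if hmac.compare_digest(known_digest, digest):
            return project
    return None


def route(token: str, path: str):
    """Authenticate the request, then pick the component that should serve it."""
    project = authenticate(token)
    if project is None:
        return (401, None, None)
    for prefix, component in ROUTES.items():
        if path.startswith(prefix):
            return (200, project, component)
    return (404, project, None)


ok = route("secret-token", "/api/metrics/write")
denied = route("wrong-token", "/api/metrics/write")
```

The request is rejected before any routing decision is made, so unauthenticated traffic never reaches the internal components.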

Notification Components

The Notification Components group is responsible for executing rules, processing alerts, and sending notifications.

This group includes the following main components:

image

  • The Ruler is a stateless component that executes PromQL queries and saves the result as a new metric. Additionally, the Ruler checks triggers by executing PromQL queries, and if the firing condition is met, it sends an alert to Alertgate.
  • Alertgate is a component responsible for receiving information about triggered alerts. It saves the current state of triggers in the database (PostgreSQL) and sends them to Alertmanager. For some triggers, Alertgate also performs additional verification of resolve conditions: metric disappearance or successful resolution. If the trigger is not resolved, Alertgate continues to send the alert.
  • Alertmanager is a stateful component that receives alerts from Alertgate, groups them, deduplicates them, and sends notifications through the configured delivery channels, such as email, Telegram, webhook, and others.
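
The grouping and deduplication step mentioned for Alertmanager can be illustrated with a minimal sketch. It assumes that an alert is identified by its full label set and that groups are keyed by a fixed list of labels; the label names (`alertname`, `project`, `pod`) and the function `dedupe_and_group` are hypothetical and are not taken from the platform.

```python
from collections import defaultdict


def dedupe_and_group(alerts, group_by=("alertname", "project")):
    """Drop alerts with identical label sets, then group the rest by selected labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        # The full label set acts as the alert's fingerprint.
        fingerprint = frozenset(alert["labels"].items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighCPU", "project": "shop", "pod": "api-1"}},
    {"labels": {"alertname": "HighCPU", "project": "shop", "pod": "api-1"}},  # duplicate
    {"labels": {"alertname": "HighCPU", "project": "shop", "pod": "api-2"}},
    {"labels": {"alertname": "DiskFull", "project": "shop", "pod": "db-0"}},
]
grouped = dedupe_and_group(alerts)
```

Each resulting group would then be delivered as a single notification, so one incident that affects many pods does not produce one message per pod.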

Metrics Storage

The Metrics Storage group performs the following key functions: receiving metrics, converting them into blocks, sending them to long-term storage (Object Storage), and reading both current and long-term stored data.

This group includes the following main components:

image

  • The Metrics Collector receives metrics from opAgent, converts the requests, and sends them to the Distributor. It also receives historical data from opAgent, that is, data that does not fall into the current active block:
    • Metrics Reception – receives metrics from opAgent, converts requests into the required format, and passes the data to the Distributor.
    • Historical Data Processing – receives and processes historical data.
  • Nginx is a stateless component used for routing and load balancing:
    • Request Routing – directs requests to the appropriate components.
    • Load Balancing – distributes the load among components to ensure high performance and fault tolerance.
  • The Distributor is a stateless component that accepts metrics and performs their verification:
    • Metrics Verification – verifies the correctness and compliance of metrics with the established restrictions for the given tenant.
    • Metrics Sharding – divides metrics into packages based on sharding rules.
    • Rate Limiting – limits the rate of incoming data based on project limits.
    • Replication – sends copies of packages to multiple ingesters in parallel, according to the replication factor.
  • The Ingester is a stateful component responsible for temporarily storing and preparing metrics before long-term storage:
    • Write-Ahead Logging (WAL) – writes all incoming metrics to the disk.
    • Block Formation – forms two-hour metric blocks in memory.
    • Block Storage – writes blocks to local disk and then sends them to long-term storage (removes blocks from the local disk after 6 hours).
  • Query-frontend is a stateless component that speeds up read queries for metrics:
    • Query Splitting – splits range queries into sub-queries for parallel execution.
    • Caching – caches query results to speed up subsequent queries.
    • Query Scheduling – adds queries to the queue and passes them to Querier for execution.
  • The Querier is a stateless component responsible for executing read queries for metrics:
    • Data Extraction – extracts metrics from Block Storage and Ingester.
    • Query Execution – executes PromQL expressions to obtain the necessary data.
  • Store Gateway is a stateful component providing access to data in long-term storage:
    • Block Indexing – stores in-memory indexes of all blocks, determining which data is needed for each query.
    • Data Extraction – allows extracting only the necessary data from Block Storage to reduce resource consumption.
  • The Compactor is a stateless component that optimizes metric block storage:
    • Block Compression – merges several blocks into one optimized larger block.
    • Data Deduplication – eliminates duplicate data, reducing storage costs.
    • Obsolete Block Removal – deletes blocks according to the retention policy.
  • Etcd is a stateful component used for storing data required for replication and sharding:
    • Metadata Storage – stores settings and metadata necessary for system operation.
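
The sharding and replication performed by the Distributor can be sketched with a hash ring. This is a simplified illustration and not the platform's code: the class `HashRing`, the token count, the ingester names, and the series key format are assumptions made for the example.

```python
import hashlib
from bisect import bisect_right


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, ingesters, tokens_per_ingester=64):
        # Every ingester owns several tokens so that load spreads evenly.
        self._ring = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in self._ring]
        self._size = len(set(ingesters))

    def replicas(self, series_key: str, replication_factor: int = 3):
        """Return the distinct ingesters that should receive this series."""
        wanted = min(replication_factor, self._size)
        start = bisect_right(self._tokens, _hash(series_key))
        chosen = []
        # Walk the ring clockwise until enough distinct ingesters are found.
        for offset in range(len(self._ring)):
            _, name = self._ring[(start + offset) % len(self._ring)]
            if name not in chosen:
                chosen.append(name)
                if len(chosen) == wanted:
                    break
        return chosen


ring = HashRing(["ingester-0", "ingester-1", "ingester-2", "ingester-3"])
targets = ring.replicas('project-a/http_requests_total{pod="api-1"}')
```

Because the target set depends only on the series key and the ring contents, every Distributor instance that sees the same ring picks the same ingesters. In this sketch the ring is built in memory; in the architecture above, the data required for replication and sharding is kept in Etcd.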

Logs Storage

The Logs Storage group performs the following key functions: receiving logs, converting them into blocks, sending them to long-term storage (Object Storage), and reading both current and long-term stored data.

This group includes the following main components:

image

  • Gateway is a stateless component, implemented based on NGINX, that routes all incoming requests:
    • Routing – directs incoming requests to the appropriate system components for further processing.
  • The Distributor is a stateless component that processes incoming write requests from clients. It is the first step in the log data writing pathway:
    • Verification – checks each stream for correctness and compliance with the project’s established restrictions.
    • Pre-processing – normalizes labels for caching and hashing.
    • Rate Limiting – limits the rate of incoming data based on project limits.
    • Forwarding – sends verified data to several ingesters in parallel, using the replication factor.
  • The Ingester is a stateful component responsible for storing logs and forwarding them to long-term storage:
    • Write-Ahead Logging (WAL) – writes all incoming logs to the disk.
    • Block Formation – forms two-hour log blocks in memory.
    • Block Storage – writes blocks to local disk and then sends them to long-term storage.
  • Query-frontend is an optional service that speeds up read query execution for logs:
    • Query Splitting – splits queries into sub-queries for parallel execution.
    • Caching – caches query results to speed up subsequent operations.
    • Query Scheduling – uses an internal queue to handle requests.
  • The Querier is a stateless component that executes read queries for logs:
    • Data Extraction – extracts logs from Ingester and long-term storage.
    • Query Execution – processes queries in the LogQL language.
  • Index Gateway is a component responsible for processing and servicing metadata queries:
    • Query Processing – processes queries for data search in the index.
    • Caching – caches query results to enhance performance.
  • Compactor is a component that optimizes storage and deletes obsolete data:
    • Data Compression – merges several index files into one optimized file.
    • Data Deletion – cleans up old and unnecessary files according to the storage policy.
  • Etcd is a stateful component used for storing metadata required for replication and sharding:
    • Metadata Storage – stores settings and metadata necessary for system operation.
  • Index-curator is a stateless component that analyzes the space usage in long-term storage and publishes statistics by projects in the form of metrics.
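
The Query Splitting step that Query-frontend performs, for both metrics and logs, can be sketched as cutting a time range into interval-aligned sub-ranges that are then executed in parallel. The function name `split_query` and the one-hour interval are assumptions chosen for the example; the actual split interval is not stated in this document.

```python
def split_query(start: int, end: int, interval: int):
    """Split [start, end) into sub-ranges aligned to multiples of interval (seconds)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    parts = []
    cursor = start
    while cursor < end:
        # Next aligned boundary strictly after the cursor.
        boundary = (cursor // interval + 1) * interval
        parts.append((cursor, min(boundary, end)))
        cursor = boundary
    return parts


parts = split_query(start=1_000, end=9_000, interval=3_600)
# parts == [(1000, 3600), (3600, 7200), (7200, 9000)]
```

Aligning the boundaries to fixed multiples of the interval means that two overlapping queries produce identical sub-ranges for the shared period, which is what makes cached sub-query results reusable.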

Agent Distribution

The Agent Distribution group is responsible for updating and distributing opAgent and for collecting its logs.

This group includes the following main components:

  • agent-updater is a stateless component responsible for the opAgent update strategy. It controls the current version of opAgent and supports canary updates of agents.
  • docker-registry is a stateful component that distributes the opAgent chart and Docker images with opAgent.
  • logs-collector is a stateless component that receives logs from all agents, enriches them with additional information, and outputs them to stdout.
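
One common way to implement canary updates such as those supported by agent-updater is to assign each agent to a stable percentage bucket. The sketch below illustrates that general technique only; the function names, the hashing scheme, and the version strings are hypothetical, and the actual strategy used by agent-updater is not described here.

```python
import hashlib


def in_canary(agent_id: str, release: str, canary_percent: int) -> bool:
    """Deterministically place an agent in one of 100 buckets for a given release."""
    digest = hashlib.sha256(f"{release}:{agent_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


def target_version(agent_id: str, stable: str, candidate: str, canary_percent: int) -> str:
    """Return the version this agent should run during a canary rollout."""
    if in_canary(agent_id, candidate, canary_percent):
        return candidate
    return stable


version = target_version("agent-42", stable="1.0.0", candidate="1.1.0", canary_percent=10)
```

Raising `canary_percent` step by step only ever adds agents to the canary set, because each agent keeps the same bucket for a given release.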

Note: Detailed descriptions of each subsystem will be provided in the relevant sections of the documentation.

Block Storage (S3)

Block Storage (S3) is a stateful component that serves as S3-compatible data storage for long-term metrics storage. This component is implemented based on Ceph S3 and deployed using ceph-operator, which is installed as a separate module in the Deckhouse Kubernetes Platform.

The component includes the following elements:

  • The Manager is responsible for managing the Ceph cluster, including its configuration and monitoring.
  • The Monitor collects and displays real-time Ceph cluster status metrics.
  • OSD stores user data and performs tasks related to replication, recovery, and data balancing within the cluster.
  • RADOS Gateway provides an S3-compatible interface for accessing Ceph object storage, offering a RESTful API compatible with Amazon S3.
  • Ceph-tools includes a set of utilities for diagnosing and managing the Ceph cluster, including tools for administration and monitoring.
  • Rook-webhook implements a mutating webhook for all Ceph components, ensuring their operation in restricted mode.