Application Performance Monitoring

The module lifecycle stage: General Availability
The module has requirements for installation

Overview

When an application is made up of several services, finding the cause of a slowdown or error can be challenging. A user’s request passes through a chain of calls, and it is unclear exactly where the problem occurs.

Application Performance Monitoring solves this. The platform collects data about each request, automatically analyzes response times and errors across services, and shows you exactly where the problem lies.

How this differs from viewing traces and logs separately

The platform lets you view traces and logs without the APM section — through the Data Overview section, where you write queries manually and receive raw data. However, this requires knowing in advance what to look for.

Performance monitoring works differently: the platform analyzes all incoming traces on its own, calculates metrics for each service, and presents the overall picture. Instead of sifting through individual traces, you immediately see which services are running normally and which are experiencing problems. Both approaches complement each other: performance monitoring shows the big picture and leads you to a problematic request, while manual data exploration lets you investigate a specific situation in detail.

What you get

- Service overview: which services are running normally, which have increased response times, and where errors are occurring.
- Dependency map: which services interact with each other and where in the chain the problem arises.
- Request breakdown: the path of a specific request through all services with the duration of each step.
- Log correlation: from a problematic request straight to the logs of the specific service at the time of the error.
- Configurable thresholds: you define what counts as an acceptable response time for your service.

This feature is in alpha.

Quick Start

1. Make sure Application Performance Monitoring has been enabled by the administrator.
2. Open the project in the platform and go to the Instructions section.
3. Choose a connection method:
   - If your application does not send trace data yet, select “Application Instrumentation” and follow the step-by-step guide for your programming language. The platform address and token are filled in automatically.
   - If you already have trace collection set up (for example, via OpenTelemetry Collector), select “Trace Streaming Setup” and specify the platform address as the destination.
4. Start your application and make a few requests.
5. Open the APM section in the project’s side menu. Within one to two minutes, your service will appear in the list.

If the service does not appear, see Troubleshooting.

Core Concepts

Trace

Imagine a user clicks the “Pay” button. The request goes through an API gateway, the order service, the payment service, and the database. A trace is a record of that entire journey: which services were involved, how long each step took, and where an error occurred.

In the APM section, you can find a trace by service name, time, status, or other parameters, and view it as a timeline diagram — a sequence of steps with their durations.

Span

A single step within a trace. For example: handling an HTTP request, calling another service, or running a SQL query against a database. Each span has a name, duration, status, and additional attributes.

Apdex

A score that reflects how satisfied users are with a service’s response speed. Expressed as a number from 0 to 1: a value of 0.95 means the vast majority of requests were handled within an acceptable time.

You define what response time is considered acceptable. The default is 512 milliseconds. Requests within this threshold are considered fast; requests exceeding four times the threshold are considered unacceptably slow.
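
The scoring rule can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the text above only names the fast band (up to the threshold) and the unacceptably slow band (above four times the threshold), so counting the requests in between at half weight follows the standard Apdex definition and is an assumption here. The function name and sample durations are hypothetical.

```python
def apdex(durations_ms, threshold_ms=512):
    """Return an Apdex score in [0, 1] for request durations in milliseconds."""
    if not durations_ms:
        return None  # no traffic, so no score can be computed

    # Fast requests: at or below the acceptable response time.
    satisfied = sum(1 for d in durations_ms if d <= threshold_ms)
    # In-between requests: slower than the threshold, but not more than 4x.
    # Counting them at half weight is the standard Apdex rule (assumed here).
    tolerating = sum(
        1 for d in durations_ms if threshold_ms < d <= 4 * threshold_ms
    )
    # Requests above 4x the threshold contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(durations_ms)


sample = [120, 300, 480, 900, 2500]  # hypothetical durations, ms
print(apdex(sample))
```

With the default 512 ms threshold, three of the five sample requests are fast, one falls in the in-between band, and one exceeds 2048 ms.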

In the service overview, Apdex is displayed in a dedicated column. You can change the threshold for each service — for example, a background report-processing service may have a higher acceptable response time than a user-facing API.

Service health

The platform automatically determines each service’s health based on its metrics:

Status	When it applies
Critical	Apdex below 0.70, or error rate above 5%, or response time (95th percentile) exceeds 3 seconds
Warning	Apdex below 0.85, or error rate above 1%, or response time (95th percentile) exceeds 1 second
Healthy	All indicators are within acceptable limits
No data	No incoming traffic

Thresholds can be changed for each service individually (see Setting acceptable response time for a service).
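
As a rough sketch, the status table above can be read as a cascade of checks from most to least severe. The function and parameter names are hypothetical, the default thresholds are the ones listed in the table, and the platform's actual evaluation logic may differ.

```python
def service_health(request_count, apdex, error_rate, p95_ms):
    """Classify a service using the default thresholds from the table.

    error_rate is a fraction (0.05 means 5%), p95_ms is in milliseconds.
    """
    if request_count == 0:
        return "No data"  # no incoming traffic
    # Any single Critical condition is enough.
    if apdex < 0.70 or error_rate > 0.05 or p95_ms > 3000:
        return "Critical"
    # Otherwise, any single Warning condition is enough.
    if apdex < 0.85 or error_rate > 0.01 or p95_ms > 1000:
        return "Warning"
    return "Healthy"


print(service_health(1200, apdex=0.92, error_rate=0.002, p95_ms=640))
print(service_health(1200, apdex=0.92, error_rate=0.03, p95_ms=640))
print(service_health(1200, apdex=0.65, error_rate=0.002, p95_ms=640))
```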

Virtual dependencies

Not all system components send traces. Databases, caches, and message queues are typically not instrumented, but the platform recognizes them automatically from the attributes of outgoing calls from your services. On the dependency map, these components appear as separate nodes with the appropriate type: database, cache, queue, or external service.
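
To illustrate the idea, the sketch below derives a node type from the attributes of an outgoing call. It is an assumption-heavy illustration rather than the platform's detection logic: the attribute names come from the recommended attributes listed later on this page, but the rule that treats `redis` and `memcached` as caches, and the order of the checks, are guesses made for the example.

```python
# Assumption: these db.system values are shown as caches rather than databases.
CACHE_SYSTEMS = {"redis", "memcached"}


def dependency_type(attributes):
    """Guess the virtual dependency type from outgoing-call span attributes."""
    db_system = attributes.get("db.system")
    if db_system is not None:
        return "cache" if db_system in CACHE_SYSTEMS else "database"
    if "messaging.system" in attributes:
        return "queue"
    if "server.address" in attributes:
        return "external service"
    return None  # not enough information to place the call on the map


print(dependency_type({"db.system": "postgresql"}))
print(dependency_type({"messaging.system": "kafka"}))
print(dependency_type({"server.address": "api.partner.example"}))
```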

Scope

If multiple Kubernetes clusters or Kubernetes namespaces send data to the platform, you can focus the overview on a specific cluster and Kubernetes namespace using the switcher at the top of the page. Services running outside Kubernetes appear in a separate group.

Connecting Applications

New application

If your application does not send trace data yet, use the built-in instructions:

1. Open the project in the platform and go to the Instructions section.
2. Select “Application Instrumentation”.
3. Choose your programming language. Supported: Java, Python, Node.js, Go, .NET, Ruby, PHP, C++, Elixir, Rust.
4. Follow the step-by-step guide. The platform address and authorization token are filled in automatically. Installation options are available for virtual machines, Kubernetes, Docker, and Windows.

The instructions include a verification step and a troubleshooting section for each language.

Existing trace collection

If you already have trace collection set up (OpenTelemetry Collector, Jaeger, Zipkin, or Grafana Alloy), simply route the data to the platform. When using tools other than OpenTelemetry, make sure your traces contain the required attributes — see What data is needed for full functionality for details.

Address: `https://api.<your installation domain>`
Authorization: send the token in the `Authorization: Bearer <token>` header. The token must have the “Write traces” permission. You can create a token in the project settings under “API Tokens”.

Supported protocols:

Protocol	Path
OTLP over HTTP (recommended)	`/otlp/v1/traces` or `/v1/traces`
OTLP over gRPC	gRPC connection to `api.<domain>`
Jaeger	`/jaeger/api/traces`
Zipkin	`/zipkin/spans`

For detailed configuration instructions for OpenTelemetry Collector, Jaeger, and Alloy, open the Instructions section in the project and select “Trace Streaming Setup” or “OpenTelemetry Collector”.
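
The address, token header, and paths above combine as in the following sketch. It only builds the request object and does not send anything. The domain `example.com`, the token value, and the function name are placeholders, and the use of POST reflects how OTLP over HTTP, Jaeger, and Zipkin collectors are normally fed rather than anything stated on this page. Encoding the span payload is out of scope here.

```python
import urllib.request

# Paths from the "Supported protocols" table.
TRACE_PATHS = {
    "otlp-http": "/otlp/v1/traces",
    "jaeger": "/jaeger/api/traces",
    "zipkin": "/zipkin/spans",
}


def build_trace_request(domain, token, protocol="otlp-http", body=b""):
    """Build (but do not send) an HTTP request for shipping traces."""
    url = f"https://api.{domain}{TRACE_PATHS[protocol]}"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",  # assumption: trace ingestion endpoints accept POST
        headers={"Authorization": f"Bearer {token}"},
    )


request = build_trace_request("example.com", "my-token")  # placeholder values
print(request.full_url)
print(request.get_header("Authorization"))
```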

What data is needed for full functionality

APM analyzes span attributes within traces. Each feature requires a specific set of attributes.

Required attributes:

Attribute	Purpose	OpenTelemetry auto-instrumentation
`resource.service.name`	Service name. Without it, the service will not be identified in the overview	Must be set manually via the `OTEL_SERVICE_NAME` environment variable. Without it, a default value is used
`span.kind` = `SERVER` or `CONSUMER`	Handling incoming requests (HTTP, gRPC) or queue messages. Only spans with these types appear in the service overview	Set automatically for HTTP servers, gRPC servers, and queue consumers
`span.kind` = `CLIENT`	Outgoing calls to other services. Required for building the dependency map (the map is built from CLIENT — SERVER pairs)	Set automatically for HTTP clients, gRPC clients, and database drivers

Recommended attributes:

Attribute	Purpose	OpenTelemetry auto-instrumentation
`db.system`, `messaging.system`	Dependency type (database, queue). Used to display virtual dependencies on the map	Set automatically when the corresponding library instrumentation is active
`server.address`	Address of the called service. Used to display external dependencies on the map	Set automatically for HTTP clients
`k8s.cluster.name`, `k8s.namespace.name`	Filtering by Kubernetes cluster and namespace in the scope switcher. Without them, the service appears in the “Outside Kubernetes” group	Not set automatically. Requires a Kubernetes resource detector, the `k8sattributes` processor in OpenTelemetry Collector, or manual configuration via `OTEL_RESOURCE_ATTRIBUTES`
`resource.service.namespace`	Grouping services by namespace	Not set automatically. Set via `OTEL_RESOURCE_ATTRIBUTES`

If the required attributes are missing, traces will be available in search (the “Traces” section), but the service overview and dependency map will be empty.

When using tools other than OpenTelemetry (Jaeger SDK, Zipkin), make sure your traces contain the listed attributes. Jaeger and Zipkin support span.kind and service name, but Kubernetes attributes and dependency types may be absent.
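
For attributes that are not set automatically, the OpenTelemetry environment variables can be assembled as in this sketch. `OTEL_RESOURCE_ATTRIBUTES` takes comma-separated `key=value` pairs. The function name and sample values are hypothetical, and the sketch assumes that the `resource.` prefix in the tables above is only a query scope, so the variable carries `service.namespace` rather than `resource.service.namespace`.

```python
def otel_environment(service_name, cluster=None, namespace=None,
                     service_namespace=None):
    """Return environment variables for an OpenTelemetry-instrumented app."""
    resource = {}
    if cluster:
        resource["k8s.cluster.name"] = cluster
    if namespace:
        resource["k8s.namespace.name"] = namespace
    if service_namespace:
        # Assumption: no "resource." prefix inside OTEL_RESOURCE_ATTRIBUTES.
        resource["service.namespace"] = service_namespace

    env = {"OTEL_SERVICE_NAME": service_name}
    if resource:
        env["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(
            f"{key}={value}" for key, value in resource.items()
        )
    return env


for name, value in otel_environment(
    "checkout-api", cluster="prod", namespace="shop"
).items():
    print(f"{name}={value}")
```

In a Kubernetes deployment, the cluster and namespace values would more typically come from a resource detector or the `k8sattributes` processor than from hand-written variables.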

Connecting logs

If you send logs to the platform in addition to traces, you can link between them: from a request’s details you can jump to the service’s logs at the time of that request, and from a log entry you can jump to the corresponding trace.

For this linking to work, logs must contain a trace identifier (trace_id). When using OpenTelemetry SDK, the identifier is added automatically.
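
If logs are produced without the OpenTelemetry SDK, the identifier has to be added by the application. The sketch below shows one way to do that with the Python standard library. The `current_trace_id` context variable, the logger name, and the log format are hypothetical, and how the application learns the active trace ID depends on the tracing library in use.

```python
import contextvars
import io
import logging

# Hypothetical holder for the trace ID of the request being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop records, only enrich them


def make_logger(stream):
    logger = logging.getLogger("checkout-api")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # sample trace ID
log.error("payment declined")
print(buffer.getvalue(), end="")
```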

To set up log shipping, go to the Instructions section and choose the appropriate method (Vector, Promtail, or OpenTelemetry Collector).

Working with APM

Finding a service with increased response time

1. Open the APM section in the project’s side menu. The Overview page displays a table of all services with key indicators: requests per second, error rate, response time (median and percentiles), and Apdex.
2. If needed, select the desired cluster and namespace in the scope switcher.
3. Enable the “Unhealthy only” filter. The table will show only services with a Warning or Critical status.
4. The table is sorted by default by impact on users: services with high traffic and low Apdex appear first.
5. Click a service name to go to its detailed view.

Determining the cause of a specific request’s slowdown

1. In the service detail view, find the “Key Operations” section. Operations with high response times or low Apdex point to problematic areas.
2. Click “Open traces”. The trace search will open filtered to the selected service.
3. Select a trace with a long duration and open it.
4. The timeline diagram shows all request operations as a tree. Each bar represents a single operation with its duration. The “Latency Analysis” section shows the self time of each operation (excluding child calls) and highlights the most expensive one.
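
Self time can be illustrated with a small calculation: take an operation's duration and subtract the time covered by its direct children. The sketch below is one plausible way to compute it, not a description of the platform's algorithm. The span layout (`id`, `parent_id`, `start`, `end` in milliseconds) is hypothetical, and merging overlapping children so that parallel calls are not subtracted twice is an assumption.

```python
def self_times(spans):
    """Return {span id: self time}, i.e. duration minus time spent in children."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    result = {}
    for span in spans:
        covered = 0
        cursor = span["start"]  # end of the interval already counted
        direct = sorted(children.get(span["id"], []), key=lambda s: s["start"])
        for child in direct:
            start = max(child["start"], cursor)
            end = min(child["end"], span["end"])
            if end > start:  # count only the part not covered so far
                covered += end - start
                cursor = end
        result[span["id"]] = (span["end"] - span["start"]) - covered
    return result


trace = [
    {"id": "a", "parent_id": None, "start": 0, "end": 100},  # HTTP handler
    {"id": "b", "parent_id": "a", "start": 10, "end": 60},   # service call
    {"id": "c", "parent_id": "a", "start": 40, "end": 80},   # SQL query
    {"id": "d", "parent_id": "b", "start": 20, "end": 50},   # nested call
]
times = self_times(trace)
print(times)
print(max(times, key=times.get))  # operation with the largest self time
```

Here the SQL query `c` has the largest self time (40 ms) even though the root operation `a` has the longest total duration.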

Finding the root cause of errors

1. In the service overview, find a service with a high error rate.
2. Open the service detail view and find the operation with errors.
3. Click “Open traces” and filter by error status.
4. Open a trace with an error. The “Attributes” tab will show error details; the “Events” tab will show the stack trace (if the instrumentation recorded one).
5. If logs are connected, click “Logs” to see the service’s log entries at the time of that request.

Viewing service dependencies

Open APM → Map. It shows all discovered services and connections between them.
Nodes represent services. Edges represent calls between them. Databases, caches, queues, and external services appear as separate nodes with the appropriate labels.
Click a node to see service metrics. Click an edge to see connection metrics: calls per second, error rate, and response time.
In the service detail view, the “Show on map” button focuses the map on the selected service and its surroundings.

Setting acceptable response time for a service

By default, all services use the same thresholds. To override them for a specific service (for example, a background report-processing service may tolerate a higher response time than a user-facing API):

Open the service detail view.
Click “Service thresholds”.
Adjust the values you need:

Parameter	Default	Description
Acceptable response time	512 ms	Requests completed within this time are considered fast
Apdex (warning)	0.85	Below this value — Warning status
Apdex (critical)	0.70	Below this value — Critical status
Error rate (warning)	1%	Above this value — Warning
Error rate (critical)	5%	Above this value — Critical
Response time, 95th percentile (warning)	1000 ms	Above — Warning
Response time, 95th percentile (critical)	3000 ms	Above — Critical
Percentage of requests exceeding threshold	10%	Above — Warning

Click “Save”. The service health will be recalculated with the new thresholds. To revert to defaults, click “Reset”.

The acceptable response time is chosen from a fixed set of values: 32, 64, 128, 256, 512, 1024, 2048, or 4096 milliseconds.

Finding a specific trace

The APM → Traces section provides two search methods:

- Filter builder: visual parameter selection (service name, operation name, status, duration, arbitrary attributes).
- TraceQL query language: for complex queries with combinations of conditions.

TraceQL examples:

{ resource.service.name = "checkout-api" && status = error }
{ duration > 1s }
{ span.http.status_code >= 500 }
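
When such queries are generated by scripts, the conditions can be combined programmatically. The helper below is a hypothetical convenience, not part of the platform; it only produces strings in the same shape as the examples above.

```python
def traceql(service=None, status=None, min_duration=None, min_http_status=None):
    """Combine optional filters into a single TraceQL span selector."""
    conditions = []
    if service is not None:
        conditions.append(f'resource.service.name = "{service}"')
    if status is not None:
        conditions.append(f"status = {status}")
    if min_duration is not None:
        conditions.append(f"duration > {min_duration}")
    if min_http_status is not None:
        conditions.append(f"span.http.status_code >= {min_http_status}")
    if not conditions:
        return "{ }"  # empty selector: no filtering
    return "{ " + " && ".join(conditions) + " }"


print(traceql(service="checkout-api", status="error"))
print(traceql(service="checkout-api", min_duration="1s", min_http_status=500))
```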

A quick jump to a trace by its ID is also available.

In the trace detail view:

- Timeline diagram: a tree of all request operations with their durations.
- Flame graph: a visualization of nesting and proportions.
- Latency Analysis: self time of each operation and highlighting of the most expensive one.
- Attributes, Events, and Links tabs: additional information about the operation.

Viewing logs for a problematic request

This section is available when logs are sent to the platform.

The APM → Logs section offers two search methods:

- Filter builder: parameter selection (service, severity level, namespace, message text).
- LogQL query language: for complex filtering conditions.

To navigate from a trace to logs: in the trace detail view, click “Logs” — the log entries filtered by that trace’s ID will open.

To navigate from a log to a trace: in a log entry, click “View trace” — the detail view for the corresponding request will open.

Troubleshooting

Service does not appear in the overview

If your application’s traces are available in search (the “Traces” section) but the service does not appear in the overview, the possible causes are:

- No server spans. The overview only shows services that handle incoming requests (HTTP, gRPC, messages from queues). If an application only calls other services but does not accept incoming calls, it will not appear in the overview. Make sure the instrumentation creates server spans for incoming request handlers.
- A different scope is selected. The scope switcher at the top of the page filters services by cluster and namespace. Make sure the scope matches your application.
- Not enough time has passed. Data appears in the overview within one to two minutes after the first trace.

“Data not being received” message

This message means the platform is not receiving trace data for the current project. Check:

- The application is running and instrumentation is configured. Go to the Instructions section to verify the configuration.
- The authorization token has the “Write traces” permission.
- The platform address and protocol are correct.

No dependencies on the map

The dependency map is built from caller–callee pairs. If the map is empty:

Make sure outgoing calls (HTTP clients, database drivers, queue clients) are instrumented. When using OpenTelemetry auto-instrumentation, this happens automatically.

If the map shows nodes with IP addresses instead of meaningful names, the instrumentation is not providing dependency type attributes. Add instrumentation for the relevant libraries.

Service appears with two different names

Different instances of the same service are sending different names in their trace data. Make sure the OTEL_SERVICE_NAME environment variable has the same value across all instances and deployment environments.

Some spans in a trace are missing

In the trace detail view, a “Missing spans” warning badge appears when some operations have a parent span that is not present in the trace. This means that part of the trace data did not reach the platform or was dropped along the way.
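
The condition behind the badge can be expressed as a simple check: a span is orphaned when it names a parent that is not among the spans of the trace. The sketch below assumes a hypothetical span layout with `span_id` and `parent_id` fields and is only an illustration of the idea.

```python
def orphan_spans(spans):
    """Return IDs of spans whose parent is missing from the trace."""
    known_ids = {span["span_id"] for span in spans}
    return [
        span["span_id"]
        for span in spans
        # Root spans have no parent and are not orphans.
        if span["parent_id"] is not None and span["parent_id"] not in known_ids
    ]


trace = [
    {"span_id": "root", "parent_id": None},
    {"span_id": "orders", "parent_id": "root"},
    # The parent "payments" never reached the platform:
    {"span_id": "sql-query", "parent_id": "payments"},
]
print(orphan_spans(trace))
```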

Possible causes:

- Sampling. The SDK or OpenTelemetry Collector is configured to drop a fraction of spans (head sampling, tail sampling, probabilistic). If different services use different sampling policies, only part of the request path will end up in the trace.
- Collector failures or restarts. Spans can be lost if the collector did not flush a buffered batch before a restart or if it hit a memory limit.
- Retention expiration. If the trace is older than the retention period, some of its spans have already been deleted on the platform.

What to check:

- Sampling consistency: use identical rules across all instrumented services, or sample at only one level (either in the SDK or in the collector, but not both).
- OpenTelemetry Collector stability: no frequent restarts, sufficient memory headroom, exporter queues are not overflowing.
- Trace age: if the trace is older than the retention period, missing spans are expected behavior.
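
The effect of inconsistent sampling can be demonstrated with a simplified, deterministic sampler that derives its decision from the trace ID. This is a sketch of the general trace-ID-ratio idea, not the exact algorithm of any SDK or collector; the function name, the sample trace ID, and the ratios are hypothetical.

```python
def keep_trace(trace_id_hex, ratio):
    """Decide whether to keep a trace, based only on its ID and the ratio."""
    # Interpret the last 16 hex digits (64 bits) as a number in [0, 2**64).
    value = int(trace_id_hex[-16:], 16)
    return value < int(ratio * 2**64)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Same ratio everywhere: every service reaches the same decision.
print(keep_trace(trace_id, 0.75), keep_trace(trace_id, 0.75))

# Different ratios: the gateway keeps the trace, the payment service drops it,
# so the stored trace contains only part of the request path.
print(keep_trace(trace_id, 0.75), keep_trace(trace_id, 0.5))
```

Because the decision depends only on the trace ID and the ratio, services configured with the same ratio either all keep a trace or all drop it.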

How soon will data appear after connecting

Data appears in the service overview within one to two minutes after the platform receives the first trace from your application.

Errors when sending data

Messages like rate_limited, trace_too_large, or live_traces_exceeded indicate that project limits have been exceeded. To increase the limits, contact the administrator or adjust the values in the project settings (see Trace limits).

gRPC connection not working

On some network configurations, gRPC connections may not work. In this case, use the OTLP over HTTP protocol: set the environment variable OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf.

Logs not displaying

If the “Logs” section in APM is unavailable or empty, check:

- Log storage is enabled at the platform level.
- The application is sending logs to the platform (see Instructions → log streaming setup).
- Logs contain a trace ID for linking with requests.
