The module lifecycle stage: General Availability
The module has requirements for installation
Overview
When an application is made up of several services, finding the cause of a slowdown or error can be challenging. A user’s request passes through a chain of calls, and it is unclear exactly where the problem occurs.
Application Performance Monitoring solves this. The platform collects data about each request, automatically analyzes response times and errors across services, and shows you exactly where the problem lies.
How this differs from viewing traces and logs separately
The platform lets you view traces and logs without the APM section — through the Data Overview section, where you write queries manually and receive raw data. However, this requires knowing in advance what to look for.
Performance monitoring works differently: the platform analyzes all incoming traces on its own, calculates metrics for each service, and presents the overall picture. Instead of sifting through individual traces, you immediately see which services are running normally and which are experiencing problems. Both approaches complement each other: performance monitoring shows the big picture and leads you to a problematic request, while manual data exploration lets you investigate a specific situation in detail.
What you get
- Service overview — which services are running normally, which have increased response times, and where errors are occurring.
- Dependency map — which services interact with each other and where in the chain the problem arises.
- Request breakdown — the path of a specific request through all services with the duration of each step.
- Log correlation — from a problematic request straight to the logs of the specific service at the time of the error.
- Configurable thresholds — you define what counts as an acceptable response time for your service.
This feature is in alpha.
Quick Start
- Make sure Application Performance Monitoring has been enabled by the administrator.
- Open the project in the platform and go to the Instructions section.
- Choose a connection method:
  - If your application does not send trace data yet, select “Application Instrumentation” and follow the step-by-step guide for your programming language. The platform address and token are filled in automatically.
  - If you already have trace collection set up (for example, via OpenTelemetry Collector), select “Trace Streaming Setup” and specify the platform address as the destination.
- Start your application and make a few requests.
- Open the APM section in the project’s side menu. Within one to two minutes, your service will appear in the list.
If the service does not appear, see Troubleshooting.
Core Concepts
Trace
Imagine a user clicks the “Pay” button. The request goes through an API gateway, the order service, the payment service, and the database. A trace is a record of that entire journey: which services were involved, how long each step took, and where an error occurred.
In the APM section, you can find a trace by service name, time, status, or other parameters, and view it as a timeline diagram — a sequence of steps with their durations.
Span
A single step within a trace. For example: handling an HTTP request, calling another service, or running a SQL query against a database. Each span has a name, duration, status, and additional attributes.
Apdex
A score that reflects how satisfied users are with a service’s response speed. Expressed as a number from 0 to 1: a value of 0.95 means the vast majority of requests were handled within an acceptable time.
You define what response time is considered acceptable. The default is 512 milliseconds. Requests within this threshold are considered fast; requests exceeding four times the threshold are considered unacceptably slow.
In the service overview, Apdex is displayed in a dedicated column. You can change the threshold for each service — for example, a background report-processing service may have a higher acceptable response time than a user-facing API.
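As an illustration of how such a score can be computed, here is a minimal sketch based on the standard Apdex formula. The text above names only the fast band and the unacceptably slow band; counting requests between the threshold and four times the threshold at half weight comes from the standard Apdex definition and is an assumption about how the platform scores them. The function name and sample durations are hypothetical.

```python
def apdex(durations_ms, threshold_ms=512):
    """Standard Apdex: (satisfied + tolerating / 2) / total.

    Assumption: the platform follows the standard formula, where requests
    between the threshold and four times the threshold count at half weight.
    """
    if not durations_ms:
        return None  # no traffic, so there is no score to report
    satisfied = sum(1 for d in durations_ms if d <= threshold_ms)
    tolerating = sum(
        1 for d in durations_ms if threshold_ms < d <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(durations_ms)


# Hypothetical request durations in milliseconds.
samples = [120, 300, 480, 900, 1500, 2500]
print(round(apdex(samples), 3))
```

With the default 512 ms threshold, three of the six sample requests are fast, two fall in the middle band, and one exceeds 2048 ms.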
Service health
The platform automatically determines each service’s health based on its metrics:
| Status | When it applies |
|---|---|
| Critical | Apdex below 0.70, or error rate above 5%, or response time (95th percentile) exceeds 3 seconds |
| Warning | Apdex below 0.85, or error rate above 1%, or response time (95th percentile) exceeds 1 second |
| Healthy | All indicators are within acceptable limits |
| No data | No incoming traffic |
Thresholds can be changed for each service individually (see Setting acceptable response time for a service).
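The rules in the table can be read as a simple cascade: check the Critical conditions first, then the Warning conditions. The sketch below encodes that reading with the default values from the table. The class, function, and parameter names are hypothetical, and the platform's actual evaluation order is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults taken from the status table above; names are hypothetical.
    apdex_warning: float = 0.85
    apdex_critical: float = 0.70
    error_rate_warning: float = 0.01   # 1%
    error_rate_critical: float = 0.05  # 5%
    p95_warning_ms: float = 1000
    p95_critical_ms: float = 3000


def service_health(requests, apdex, error_rate, p95_ms, t=Thresholds()):
    """Classify a service from its metrics, most severe status first."""
    if requests == 0:
        return "No data"
    if (apdex < t.apdex_critical
            or error_rate > t.error_rate_critical
            or p95_ms > t.p95_critical_ms):
        return "Critical"
    if (apdex < t.apdex_warning
            or error_rate > t.error_rate_warning
            or p95_ms > t.p95_warning_ms):
        return "Warning"
    return "Healthy"


print(service_health(requests=100, apdex=0.80, error_rate=0.0, p95_ms=400))
```

Passing a different `Thresholds` instance models per-service overrides, for example a relaxed 95th percentile limit for a background job.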
Virtual dependencies
Not all system components send traces. Databases, caches, and message queues are typically not instrumented, but the platform recognizes them automatically from the attributes of outgoing calls from your services. On the dependency map, these components appear as separate nodes with the appropriate type: database, cache, queue, or external service.
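To make the idea concrete, the sketch below derives a node type from the attributes of an outgoing call. It is an illustration only: the attribute names come from the recommended attributes listed later in this document, while the function name, the order of checks, and the rule that treats certain `db.system` values as caches are assumptions, not the platform's documented logic.

```python
# Assumption: these db.system values are shown as caches rather than databases.
CACHE_SYSTEMS = {"redis", "memcached"}


def dependency_node(attrs):
    """Return (node_type, node_name) for an outgoing call, or None."""
    db = attrs.get("db.system")
    if db:
        node_type = "cache" if db in CACHE_SYSTEMS else "database"
        return node_type, db
    queue = attrs.get("messaging.system")
    if queue:
        return "queue", queue
    address = attrs.get("server.address")
    if address:
        return "external service", address
    return None  # not enough attributes to place the call on the map


print(dependency_node({"db.system": "postgresql"}))
print(dependency_node({"server.address": "api.partner.example"}))
```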
Scope
If multiple Kubernetes clusters or Kubernetes namespaces send data to the platform, you can focus the overview on a specific cluster and Kubernetes namespace using the switcher at the top of the page. Services running outside Kubernetes appear in a separate group.
Connecting Applications
New application
If your application does not send trace data yet, use the built-in instructions:
- Open the project in the platform and go to the Instructions section.
- Select “Application Instrumentation”.
- Choose your programming language. Supported: Java, Python, Node.js, Go, .NET, Ruby, PHP, C++, Elixir, Rust.
- Follow the step-by-step guide. The platform address and authorization token are filled in automatically. Installation options are available for virtual machines, Kubernetes, Docker, and Windows.
The instructions include a verification step and a troubleshooting section for each language.
Existing trace collection
If you already have trace collection set up (OpenTelemetry Collector, Jaeger, Zipkin, or Grafana Alloy), simply route the data to the platform. When using tools other than OpenTelemetry, make sure your traces contain the required attributes — see What data is needed for full functionality for details.
- Address: https://api.<your installation domain>
- Authorization: the Authorization: Bearer <token> header. The token must have the “Write traces” permission. You can create a token in the project settings under “API Tokens”.
Supported protocols:
| Protocol | Path |
|---|---|
| OTLP over HTTP (recommended) | /otlp/v1/traces or /v1/traces |
| OTLP over gRPC | gRPC connection to api.<domain> |
| Jaeger | /jaeger/api/traces |
| Zipkin | /zipkin/spans |
For detailed configuration instructions for OpenTelemetry Collector, Jaeger, and Alloy, open the Instructions section in the project and select “Trace Streaming Setup” or “OpenTelemetry Collector”.
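For an application instrumented with an OpenTelemetry SDK, the address, path, and header above are commonly supplied through the standard exporter environment variables. The sketch below assembles them for OTLP over HTTP. The helper name, the domain `example.com`, and the token `abc123` are placeholders. Percent-encoding the space in the header value is an assumption that suits SDKs which decode header values; check the guide for your language in the Instructions section.

```python
from urllib.parse import quote


def otlp_http_env(domain, token, service_name):
    """Build OpenTelemetry exporter settings for OTLP over HTTP."""
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        # Signal-specific endpoint: the full URL including the traces path.
        "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": f"https://api.{domain}/v1/traces",
        # key=value header list; the space after "Bearer" is percent-encoded.
        "OTEL_EXPORTER_OTLP_HEADERS": "Authorization="
        + quote(f"Bearer {token}", safe=""),
    }


# Placeholder values; substitute your installation domain and token.
for name, value in otlp_http_env("example.com", "abc123", "checkout-api").items():
    print(f"export {name}={value}")
```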
What data is needed for full functionality
APM analyzes span attributes within traces. Each feature requires a specific set of attributes.
Required attributes:
| Attribute | Purpose | OpenTelemetry auto-instrumentation |
|---|---|---|
| resource.service.name | Service name. Without it, the service will not be identified in the overview | Must be set manually via the OTEL_SERVICE_NAME environment variable. Without it, a default value is used |
| span.kind = SERVER or CONSUMER | Handling incoming requests (HTTP, gRPC) or queue messages. Only spans with these types appear in the service overview | Set automatically for HTTP servers, gRPC servers, and queue consumers |
| span.kind = CLIENT | Outgoing calls to other services. Required for building the dependency map (the map is built from CLIENT/SERVER pairs) | Set automatically for HTTP clients, gRPC clients, and database drivers |
Recommended attributes:
| Attribute | Purpose | OpenTelemetry auto-instrumentation |
|---|---|---|
| db.system, messaging.system | Dependency type (database, queue). Used to display virtual dependencies on the map | Set automatically when the corresponding library instrumentation is active |
| server.address | Address of the called service. Used to display external dependencies on the map | Set automatically for HTTP clients |
| k8s.cluster.name, k8s.namespace.name | Filtering by Kubernetes cluster and namespace in the scope switcher. Without them, the service appears in the “Outside Kubernetes” group | Not set automatically. Requires a Kubernetes resource detector, the k8sattributes processor in OpenTelemetry Collector, or manual configuration via OTEL_RESOURCE_ATTRIBUTES |
| resource.service.namespace | Grouping services by namespace | Not set automatically. Set via OTEL_RESOURCE_ATTRIBUTES |
If the required attributes are missing, traces will be available in search (the “Traces” section), but the service overview and dependency map will be empty.
When using tools other than OpenTelemetry (Jaeger SDK, Zipkin), make sure your traces contain the listed attributes. Jaeger and Zipkin support span.kind and service name, but Kubernetes attributes and dependency types may be absent.
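When routing traces from a non-OpenTelemetry tool, it can help to check a sample span against the tables above before sending traffic. The sketch below reports which APM features a single span could feed. The span shape (a dict with `kind`, `resource`, and `attributes` keys) and the function name are hypothetical. Requiring a service name for the dependency map is an assumption.

```python
def apm_coverage(span):
    """Report which APM features one span can contribute to."""
    resource = span.get("resource", {})
    attrs = span.get("attributes", {})
    kind = span.get("kind")
    has_name = bool(resource.get("service.name"))
    return {
        # Only SERVER and CONSUMER spans appear in the service overview.
        "service_overview": has_name and kind in {"SERVER", "CONSUMER"},
        # CLIENT spans are one half of each caller/callee pair on the map.
        "dependency_map": has_name and kind == "CLIENT",
        # Virtual dependencies need a type or address attribute on the call.
        "virtual_dependency": kind == "CLIENT" and any(
            key in attrs
            for key in ("db.system", "messaging.system", "server.address")
        ),
        # Without both attributes the service lands in "Outside Kubernetes".
        "kubernetes_scope": all(
            key in resource
            for key in ("k8s.cluster.name", "k8s.namespace.name")
        ),
    }


sample_span = {
    "kind": "CLIENT",
    "resource": {"service.name": "checkout-api"},
    "attributes": {"db.system": "postgresql"},
}
print(apm_coverage(sample_span))
```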
Connecting logs
If you send logs to the platform in addition to traces, you can link between them: from a request’s details you can jump to the service’s logs at the time of that request, and from a log entry you can jump to the corresponding trace.
For this linking to work, logs must contain a trace identifier (trace_id). When using the OpenTelemetry SDK, the identifier is added automatically.
To set up log shipping, go to the Instructions section and choose the appropriate method (Vector, Promtail, or OpenTelemetry Collector).
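If your application writes logs without an OpenTelemetry integration, the trace identifier has to be attached by hand. The sketch below uses only the Python standard library to add a trace_id field to every record and emit JSON lines. The class names, the logger name, and the JSON layout are hypothetical; only the trace_id field name comes from the requirement above. How you obtain the current trace ID depends on your tracing library, so it is passed in as a callable.

```python
import io
import json
import logging


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record):
        record.trace_id = self._get_trace_id()
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render records as JSON lines that include trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def make_logger(get_trace_id, stream):
    logger = logging.getLogger("checkout-api")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter(get_trace_id))
    logger.addHandler(handler)
    return logger


# Example with a fixed, made-up trace ID written to an in-memory stream.
buffer = io.StringIO()
log = make_logger(lambda: "4bf92f3577b34da6a3ce929d0e0e4736", buffer)
log.info("payment failed")
print(buffer.getvalue().strip())
```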
Working with APM
Finding a service with increased response time
- Open the APM section in the project’s side menu. The Overview page displays a table of all services with key indicators: requests per second, error rate, response time (median and percentiles), and Apdex.
- If needed, select the desired cluster and namespace in the scope switcher.
- Enable the “Unhealthy only” filter — the table will show only services with a Warning or Critical status.
- The table is sorted by default by impact on users: services with high traffic and low Apdex appear first.
- Click a service name to go to its detailed view.
Determining the cause of a specific request’s slowdown
- In the service detail view, find the “Key Operations” section. Operations with high response times or low Apdex point to problematic areas.
- Click “Open traces” — the trace search will open filtered to the selected service.
- Select a trace with a long duration and open it.
- The timeline diagram shows all request operations as a tree. Each bar represents a single operation with its duration. The “Latency Analysis” section shows the self time of each operation (excluding child calls) and highlights the most expensive one.
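The self time mentioned in the last step can be sketched as the span's duration minus the total duration of its direct children. This is a simplified illustration, assuming child operations run one after another without overlapping; parallel children would require merging time intervals instead. The span records and field names are hypothetical.

```python
def self_times(spans):
    """Map span_id to self time: own duration minus direct children."""
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        span["span_id"]: max(
            0, span["duration_ms"] - child_total.get(span["span_id"], 0)
        )
        for span in spans
    }


# Hypothetical trace for a payment request.
trace = [
    {"span_id": "a", "parent_id": None, "name": "POST /pay", "duration_ms": 900},
    {"span_id": "b", "parent_id": "a", "name": "orders.Create", "duration_ms": 700},
    {"span_id": "c", "parent_id": "b", "name": "SELECT orders", "duration_ms": 550},
    {"span_id": "d", "parent_id": "a", "name": "payments.Charge", "duration_ms": 120},
]

times = self_times(trace)
most_expensive = max(times, key=times.get)
print(times, most_expensive)
```

In this example the root span lasts 900 ms but spends only 80 ms in its own code; the SQL query accounts for most of the latency.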
Finding the root cause of errors
- In the service overview, find a service with a high error rate.
- Open the service detail view and find the operation with errors.
- Click “Open traces” and filter by error status.
- Open a trace with an error. The “Attributes” tab will show error details; the “Events” tab will show the stack trace (if the instrumentation recorded one).
- If logs are connected, click “Logs” to see the service’s log entries at the time of that request.
Viewing service dependencies
- Open APM → Map. It shows all discovered services and connections between them.
- Nodes represent services. Edges represent calls between them. Databases, caches, queues, and external services appear as separate nodes with the appropriate labels.
- Click a node to see service metrics. Click an edge to see connection metrics: calls per second, error rate, and response time.
- In the service detail view, the “Show on map” button focuses the map on the selected service and its surroundings.
Setting acceptable response time for a service
By default, all services use the same thresholds. If the defaults do not suit a specific service (for example, a background report-processing service may have a higher acceptable response time than a user-facing API), change them as follows:
- Open the service detail view.
- Click “Service thresholds”.
- Adjust the values you need:
| Parameter | Default | Description |
|---|---|---|
| Acceptable response time | 512 ms | Requests completed within this time are considered fast |
| Apdex (warning) | 0.85 | Below this value — Warning status |
| Apdex (critical) | 0.70 | Below this value — Critical status |
| Error rate (warning) | 1% | Above this value — Warning |
| Error rate (critical) | 5% | Above this value — Critical |
| Response time, 95th percentile (warning) | 1000 ms | Above — Warning |
| Response time, 95th percentile (critical) | 3000 ms | Above — Critical |
| Percentage of requests exceeding threshold | 10% | Above — Warning |
- Click “Save”. The service health will be recalculated with the new thresholds. To revert to defaults, click “Reset”.
The acceptable response time is chosen from a fixed set of values: 32, 64, 128, 256, 512, 1024, 2048, or 4096 milliseconds.
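Because only these values can be selected, a latency target such as 800 ms has to be mapped to one of them. The sketch below picks the smallest allowed value that is not below the target. Rounding up rather than down is a design choice made for this example, not platform behavior, and the function name is hypothetical.

```python
ALLOWED_THRESHOLDS_MS = (32, 64, 128, 256, 512, 1024, 2048, 4096)


def nearest_allowed_threshold(target_ms):
    """Smallest allowed value >= target, capped at the largest option."""
    for value in ALLOWED_THRESHOLDS_MS:
        if value >= target_ms:
            return value
    return ALLOWED_THRESHOLDS_MS[-1]


print(nearest_allowed_threshold(800))
```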
Finding a specific trace
The APM → Traces section provides two search methods:
- Filter builder — visual parameter selection: service name, operation name, status, duration, arbitrary attributes.
- TraceQL query language — for complex queries with combinations of conditions.
TraceQL examples:
{ resource.service.name = "checkout-api" && status = error }
{ duration > 1s }
{ span.http.status_code >= 500 }
A quick jump to a trace by its ID is also available.
In the trace detail view:
- Timeline diagram — a tree of all request operations with their durations.
- Flame graph — a visualization of nesting and proportions.
- Latency Analysis — self time of each operation and highlighting of the most expensive one.
- Attributes, Events, and Links tabs — additional information about the operation.
Viewing logs for a problematic request
This section is available when logs are sent to the platform.
The APM → Logs section offers two search methods:
- Filter builder — parameter selection: service, severity level, namespace, message text.
- LogQL query language — for complex filtering conditions.
To navigate from a trace to logs: in the trace detail view, click “Logs” — the log entries filtered by that trace’s ID will open.
To navigate from a log to a trace: in a log entry, click “View trace” — the detail view for the corresponding request will open.
Troubleshooting
Service does not appear in the overview
If your application’s traces are available in search (the “Traces” section) but the service does not appear in the overview, the possible causes are:
- No server spans. The overview only shows services that handle incoming requests (HTTP, gRPC, messages from queues). If an application only calls other services but does not accept incoming calls, it will not appear in the overview. Make sure the instrumentation creates server spans for incoming request handlers.
- A different scope is selected. The scope switcher at the top of the page filters services by cluster and namespace. Make sure the scope matches your application.
- Not enough time has passed. Data appears in the overview within one to two minutes after the first trace.
“Data not being received” message
This message means the platform is not receiving trace data for the current project. Check:
- The application is running and instrumentation is configured. Go to the Instructions section to verify the configuration.
- The authorization token has the “Write traces” permission.
- The platform address and protocol are correct.
No dependencies on the map
The dependency map is built from caller–callee pairs. If the map is empty:
- Make sure outgoing calls (HTTP clients, database drivers, queue clients) are instrumented. When using OpenTelemetry auto-instrumentation, this happens automatically.
If the map shows nodes with IP addresses instead of meaningful names, the instrumentation is not providing dependency type attributes. Add instrumentation for the relevant libraries.
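To see why missing client instrumentation leaves the map empty, consider how caller and callee pairs can be derived from spans. The sketch below counts an edge whenever a SERVER span has a CLIENT span as its parent. The span records and field names are hypothetical, and the platform's actual pairing logic may differ.

```python
def dependency_edges(spans):
    """Count (caller, callee) pairs from CLIENT parent / SERVER child spans."""
    by_id = {span["span_id"]: span for span in spans}
    edges = {}
    for span in spans:
        if span["kind"] != "SERVER":
            continue
        parent = by_id.get(span["parent_id"])
        # No CLIENT parent in the trace means no edge can be drawn.
        if parent is not None and parent["kind"] == "CLIENT":
            key = (parent["service"], span["service"])
            edges[key] = edges.get(key, 0) + 1
    return edges


# Hypothetical spans: the gateway calls orders, and orders calls payments.
spans = [
    {"span_id": "1", "parent_id": None, "kind": "SERVER", "service": "gateway"},
    {"span_id": "2", "parent_id": "1", "kind": "CLIENT", "service": "gateway"},
    {"span_id": "3", "parent_id": "2", "kind": "SERVER", "service": "orders"},
    {"span_id": "4", "parent_id": "3", "kind": "CLIENT", "service": "orders"},
    {"span_id": "5", "parent_id": "4", "kind": "SERVER", "service": "payments"},
]
print(dependency_edges(spans))
```

Removing the two CLIENT spans from this example leaves every SERVER span without a CLIENT parent, so no edges are produced.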
Service appears with two different names
Different instances of the same service are sending different names in their trace data. Make sure the OTEL_SERVICE_NAME environment variable has the same value across all instances and deployment environments.
Some spans in a trace are missing
In the trace detail view, a “Missing spans” warning badge appears when some operations have a parent span that is not present in the trace. This means that part of the trace data did not reach the platform or was dropped along the way.
Possible causes:
- Sampling. The SDK or OpenTelemetry Collector is configured to drop a fraction of spans (head sampling, tail sampling, or probabilistic sampling). If different services use different sampling policies, only part of the request path will end up in the trace.
- Collector failures or restarts. Spans can be lost if the collector did not flush a buffered batch before a restart or if it hit a memory limit.
- Retention expiration. If the trace is older than the retention period, some of its spans have already been deleted on the platform.
What to check:
- Sampling consistency: either all instrumented services use identical rules, or sampling is applied at a single level only (in the SDK or in the collector, but not both).
- OpenTelemetry Collector stability: no frequent restarts, sufficient memory headroom, exporter queues are not overflowing.
- Trace age: if it is older than the retention period, the missing spans are expected behavior.
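The condition behind the “Missing spans” badge can be reproduced on exported trace data: a span is orphaned when it names a parent that is not present in the same trace. The sketch below is a minimal check under that reading; the span records and field names are hypothetical.

```python
def orphan_spans(spans):
    """Return IDs of spans whose parent is referenced but absent."""
    present = {span["span_id"] for span in spans}
    return [
        span["span_id"]
        for span in spans
        if span["parent_id"] is not None and span["parent_id"] not in present
    ]


# Hypothetical trace in which span "b" was dropped before reaching storage.
received = [
    {"span_id": "a", "parent_id": None},
    {"span_id": "c", "parent_id": "b"},
    {"span_id": "d", "parent_id": "a"},
]
print(orphan_spans(received))
```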
How soon will data appear after connecting
Data appears in the service overview within one to two minutes after the platform receives the first trace from your application.
Errors when sending data
Messages like rate_limited, trace_too_large, or live_traces_exceeded indicate that project limits have been exceeded. To increase the limits, contact the administrator or adjust the values in the project settings (see Trace limits).
gRPC connection not working
In some network configurations, gRPC connections may not work. In that case, switch to OTLP over HTTP by setting the environment variable OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf.
Logs not displaying
If the “Logs” section in APM is unavailable or empty, check:
- Log storage is enabled at the platform level.
- The application is sending logs to the platform (see Instructions → log streaming setup).
- Logs contain a trace ID for linking with requests.