How do I collect metrics from applications running outside of the cluster?

  1. Configure a Service similar to the one that collects metrics from your application (but do not set the spec.selector parameter).
  2. Create Endpoints for this Service and explicitly specify the IP:PORT pairs that your applications use to expose metrics.

    Note that port names in Endpoints must match those in the Service.

An example:

Application metrics are freely available (no TLS involved) at

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels: my-app
spec:
  ports:
  - name: http-metrics
    port: 9114
---
apiVersion: v1
kind: Endpoints
metadata:
  name: my-app
  namespace: my-namespace
subsets:
  - addresses:
    - ip:
    ports:
    - name: http-metrics
      port: 9114

How do I create custom Grafana dashboards?

Custom Grafana dashboards can be added to the project using the infrastructure-as-code approach. To add a dashboard to Grafana, create a dedicated GrafanaDashboardDefinition custom resource in the cluster.

An example:

kind: GrafanaDashboardDefinition
metadata:
  name: my-dashboard
spec:
  folder: My folder # The folder where the custom dashboard will be located
  definition: |
    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": "-- Grafana --",
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "limit": 100,
...

Caution! System dashboards and dashboards added using GrafanaDashboardDefinition cannot be modified via the Grafana interface.

How do I add alerts and/or recording rules?

The CustomPrometheusRules resource allows you to add alerts and recording rules.

groups is the only parameter; it is where you define the alert groups. The structure of the groups is similar to that of prometheus-operator.

An example:

kind: CustomPrometheusRules
metadata:
  name: my-rules
spec:
  groups:
  - name: cluster-state-alert.rules
    rules:
    - alert: CephClusterErrorState
      annotations:
        description: Storage cluster is in error state for more than 10m.
        summary: Storage cluster is in error state
        plk_markup_format: markdown
      expr: |
        ceph_health_status{job="rook-ceph-mgr"} > 1
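
Since the groups follow the prometheus-operator structure, a recording rule would presumably use the record key in place of alert. The following is a minimal sketch under that assumption. The apiVersion value, the resource name, the group name, and the recorded series name are illustrative and do not come from this page.

```yaml
apiVersion: deckhouse.io/v1              # assumption: not shown on this page
kind: CustomPrometheusRules
metadata:
  name: my-recording-rules               # hypothetical name
spec:
  groups:
  - name: cluster-state-recording.rules  # hypothetical group name
    rules:
    # Precompute the worst Ceph health status per job so that
    # dashboards and alerts can query a single cheap series.
    - record: job:ceph_health_status:max
      expr: max by (job) (ceph_health_status)
```

The level:metric:operation naming of the recorded series follows the usual Prometheus convention for recording rules.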

How do I provision additional Grafana Datasources?

The GrafanaAdditionalDatasource resource allows you to provision additional Grafana data sources.

A detailed description of the resource parameters is available in the Grafana documentation.

An example:

kind: GrafanaAdditionalDatasource
metadata:
  name: another-prometheus
spec:
  type: prometheus
  access: Proxy
  basicAuth: true
  basicAuthUser: foo
  jsonData:
    timeInterval: 30s
    httpMethod: POST
  secureJsonData:
    basicAuthPassword: bar

How do I enable secure access to metrics?

To enable secure access to metrics, we strongly recommend using kube-rbac-proxy.
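
As a rough illustration of that recommendation, kube-rbac-proxy is commonly run as a sidecar container: the application serves metrics on localhost only, and the proxy exposes them over TLS after checking the caller's Kubernetes credentials. The sketch below assumes a hypothetical application image that serves metrics on 127.0.0.1:8081. The names, the port numbers, and the proxy image tag are illustrative. The RBAC objects that allow the ServiceAccount to create TokenReviews and SubjectAccessReviews are not shown.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Hypothetical ServiceAccount; it must be allowed to create
      # tokenreviews and subjectaccessreviews for the proxy to work.
      serviceAccountName: my-app-rbac-proxy
      containers:
      - name: my-app
        image: registry.example.com/my-app:latest  # hypothetical image
        # Assumed to serve plain HTTP metrics on 127.0.0.1:8081 only.
      - name: kube-rbac-proxy
        image: quay.io/brancz/kube-rbac-proxy:v0.14.0  # tag is illustrative
        args:
        - --secure-listen-address=0.0.0.0:8443
        - --upstream=http://127.0.0.1:8081/
        ports:
        - name: https-metrics
          containerPort: 8443
```

With this layout, the metrics endpoint is reachable from outside the Pod only through the proxy on port 8443.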

How do I add an additional alertmanager?

Create a Service that points to your Alertmanager and carries the label with the value main (see the example below).

Optional annotations:

  • — the prefix to add to HTTP requests;
    • It is set to “/” by default.

Caution! Currently, only the plain HTTP scheme is supported.

An example:

apiVersion: v1
kind: Service
metadata:
  name: my-alertmanager
  namespace: my-monitoring
  labels: main
  annotations: /myprefix/
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: my-alertmanager

Caution! If you create Endpoints for a Service manually (e.g., to use an external Alertmanager), you must specify the port name in both the Service and the Endpoints.
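
For the external case, a manually created Endpoints object might look like the sketch below. It assumes the Service is named my-alertmanager, has no selector, and declares a port named http. The IP address and the port number are placeholders, not values from this page.

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  # Must match the name and namespace of the Service it backs.
  name: my-alertmanager
  namespace: my-monitoring
subsets:
- addresses:
  - ip: 192.0.2.10    # hypothetical address of the external Alertmanager
  ports:
  - name: http        # same port name as in the Service
    port: 9093        # hypothetical port the external Alertmanager listens on
```

Kubernetes associates an Endpoints object with a Service by name, so the two metadata.name values have to be identical.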

How do I ignore unnecessary alerts in alertmanager?

The solution comes down to configuring alert routing in the Alertmanager.

You will need to:

  1. Create a parameterless receiver.
  2. Route unwanted alerts to this receiver.

Below is a sample alertmanager.yaml for this situation:

receivers:
- name: blackhole
  # the parameterless receiver is similar to "/dev/null".
- name: some-other-receiver
  # ...
route:
  routes:
  - match:
      alertname: DeadMansSwitch
    receiver: blackhole
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: blackhole
  - receiver: some-other-receiver

A detailed description of all parameters can be found in the official documentation.
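
Newer Alertmanager releases (0.22 and later) deprecate match and match_re in favor of a single matchers list. If your Alertmanager is recent enough, the same blackhole routing could be sketched as follows. The receiver names are the illustrative ones used on this page, and the default receiver on the top-level route is an assumption.

```yaml
route:
  receiver: some-other-receiver   # assumed default for alerts not matched below
  routes:
  - matchers:
    - alertname = "DeadMansSwitch"
    receiver: blackhole
  - matchers:
    # Regex matchers are anchored, so no ^ and $ are needed.
    - service =~ "foo1|foo2|baz"
    receiver: blackhole
receivers:
- name: blackhole                 # no integrations configured: alerts are dropped
- name: some-other-receiver
  # ...
```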

Why can’t different scrape intervals be set for individual targets?

Prometheus developer Brian Brazil gives what is probably the most comprehensive answer to this question. In short, different scrapeInterval values are likely to cause the following complications:

  • Increased configuration complexity;
  • Problems with writing queries and creating graphs;
  • Short intervals are closer to profiling an app, and Prometheus is not the best tool for that in most cases.

The most appropriate value for scrapeInterval is in the range of 10-60s.

How do I limit Prometheus resource consumption?

To prevent VPA from requesting more resources for Prometheus or Longterm Prometheus than are available on the corresponding node, you can explicitly cap VPA using the following module parameters:

  • vpa.longtermMaxCPU
  • vpa.longtermMaxMemory
  • vpa.maxCPU
  • vpa.maxMemory
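
As a sketch of how these parameters might be set, the fragment below assumes the module is configured through a ModuleConfig resource named prometheus. The apiVersion, the settings version, and all resource values are assumptions chosen for illustration and should be checked against your installation.

```yaml
apiVersion: deckhouse.io/v1alpha1   # assumption: not shown on this page
kind: ModuleConfig
metadata:
  name: prometheus
spec:
  version: 2                        # assumed settings schema version
  enabled: true
  settings:
    vpa:
      maxCPU: "3"                   # illustrative cap for the main Prometheus
      maxMemory: 3Gi
      longtermMaxCPU: "1"           # illustrative cap for Longterm Prometheus
      longtermMaxMemory: 1500Mi
```

Choose caps below the allocatable resources of the node that runs Prometheus so that the Pod stays schedulable.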