How do I collect metrics from applications running outside of the cluster?

  1. Configure a Service similar to the one that collects metrics from your application (but do not set the spec.selector parameter).
  2. Create Endpoints for this Service and explicitly specify the IP:PORT pairs that your applications use to expose metrics.

    Note that port names in Endpoints must match those in the Service.

An example:

Application metrics are freely available (no TLS involved) at http://10.182.10.5:9114/metrics.

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    prometheus.deckhouse.io/custom-target: my-app
spec:
  ports:
  - name: http-metrics
    port: 9114
---
apiVersion: v1
kind: Endpoints
metadata:
  name: my-app
  namespace: my-namespace
subsets:
  - addresses:
    - ip: 10.182.10.5
    ports:
    - name: http-metrics
      port: 9114

How do I create custom Grafana dashboards?

Custom Grafana dashboards can be added to the project using the infrastructure-as-code approach. To add your dashboard to Grafana, create the dedicated GrafanaDashboardDefinition custom resource in the cluster.

An example:

apiVersion: deckhouse.io/v1
kind: GrafanaDashboardDefinition
metadata:
  name: my-dashboard
spec:
  folder: My folder # The folder where the custom dashboard will be located
  definition: |
    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": "-- Grafana --",
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "limit": 100,
...

Caution! System dashboards and dashboards added using GrafanaDashboardDefinition cannot be modified via the Grafana interface.

How do I add alerts and/or recording rules?

The CustomPrometheusRules resource allows you to add alerts and recording rules.

Parameters:

  • groups — the only parameter; it defines the alert and recording rule groups. The structure of the groups is similar to that of prometheus-operator.

An example:

apiVersion: deckhouse.io/v1
kind: CustomPrometheusRules
metadata:
  name: my-rules
spec:
  groups:
  - name: cluster-state-alert.rules
    rules:
    - alert: CephClusterErrorState
      annotations:
        description: Storage cluster is in error state for more than 10m.
        summary: Storage cluster is in error state
        plk_markup_format: markdown
      expr: |
        ceph_health_status{job="rook-ceph-mgr"} > 1
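
Recording rules are added the same way, in the rules list of a group. Below is a hedged sketch; the resource name, group name, record name, and expression are illustrative, not part of the module:

apiVersion: deckhouse.io/v1
kind: CustomPrometheusRules
metadata:
  name: my-recording-rules
spec:
  groups:
  - name: my-recording.rules
    rules:
    # Precompute the per-namespace rate of container CPU usage.
    - record: namespace:container_cpu_usage_seconds_total:rate5m
      expr: |
        sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)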

How do I provision additional Grafana data sources?

The GrafanaAdditionalDatasource resource allows you to provision additional Grafana data sources.

A detailed description of the resource parameters is available in the Grafana documentation.

For type-specific parameters, see the documentation for the corresponding datasource type.

An example:

apiVersion: deckhouse.io/v1
kind: GrafanaAdditionalDatasource
metadata:
  name: another-prometheus
spec:
  type: prometheus
  access: Proxy
  url: https://another-prometheus.example.com/prometheus
  basicAuth: true
  basicAuthUser: foo
  jsonData:
    timeInterval: 30s
    httpMethod: POST
  secureJsonData:
    basicAuthPassword: bar

How do I enable secure access to metrics?

To enable secure access to metrics, we strongly recommend using kube-rbac-proxy.
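
Below is a minimal sketch of the sidecar approach, not a definitive configuration: the container names, image tag, and port numbers are assumptions. kube-rbac-proxy terminates TLS and authorizes requests via the Kubernetes API (TokenReview/SubjectAccessReview) before forwarding them to the application's metrics endpoint on localhost.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-app  # must be allowed to create TokenReview/SubjectAccessReview
      containers:
      - name: my-app
        image: my-app:latest      # assumed to expose plain-HTTP metrics on 127.0.0.1:9114
      - name: kube-rbac-proxy
        image: quay.io/brancz/kube-rbac-proxy:v0.14.0  # illustrative tag
        args:
        - --secure-listen-address=0.0.0.0:8443
        - --upstream=http://127.0.0.1:9114/
        ports:
        - containerPort: 8443
          name: https-metrics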

How do I add Alertmanager?

Create a custom resource CustomAlertmanager with type Internal.

Example:

apiVersion: deckhouse.io/v1alpha1
kind: CustomAlertmanager
metadata:
  name: webhook
spec:
  type: Internal
  internal:
    route:
      groupBy: ['job']
      groupWait: 30s
      groupInterval: 5m
      repeatInterval: 12h
      receiver: 'webhook'
    receivers:
    - name: 'webhook'
      webhookConfigs:
      - url: 'http://webhookserver:8080/'

Refer to the description of the CustomAlertmanager custom resource for more information about the parameters.

How do I add an additional Alertmanager?

Create a CustomAlertmanager custom resource with the type External; it can point to an Alertmanager through an FQDN or a Kubernetes service.

FQDN Alertmanager example:

apiVersion: deckhouse.io/v1alpha1
kind: CustomAlertmanager
metadata:
  name: my-fqdn-alertmanager
spec:
  external:
    address: https://alertmanager.mycompany.com/myprefix
  type: External

Alertmanager with a Kubernetes service:

apiVersion: deckhouse.io/v1alpha1
kind: CustomAlertmanager
metadata:
  name: my-service-alertmanager
spec:
  external:
    service: 
      namespace: myns
      name: my-alertmanager
      path: /myprefix/
  type: External

Refer to the description of the CustomAlertmanager Custom Resource for more information about the parameters.

How do I ignore unnecessary alerts in Alertmanager?

The solution comes down to configuring alert routing in the Alertmanager.

You will need to:

  1. Create a parameterless receiver.
  2. Route unwanted alerts to this receiver.

Below is a sample alertmanager.yaml for such a situation:

receivers:
- name: blackhole
  # the parameterless receiver is similar to "/dev/null".
- name: some-other-receiver
  # ...
route:
  routes:
  - match:
      alertname: DeadMansSwitch
    receiver: blackhole
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: blackhole
  - receiver: some-other-receiver

A detailed description of all parameters can be found in the official Alertmanager documentation.

Why can’t different scrape intervals be set for individual targets?

Prometheus developer Brian Brazil provides probably the most comprehensive answer to this question. In short, different scrape intervals are likely to cause the following complications:

  • Increasing configuration complexity;
  • Problems with writing queries and creating graphs;
  • Short intervals are more like profiling an app, and Prometheus isn’t the best tool to do this in most cases.

The most appropriate value for scrapeInterval is in the range of 10-60s.

How do I limit Prometheus resource consumption?

To prevent VPA from requesting more resources for Prometheus or Longterm Prometheus than are available on the corresponding node, you can explicitly limit VPA using the following module parameters (see the example after the list):

  • vpa.longtermMaxCPU
  • vpa.longtermMaxMemory
  • vpa.maxCPU
  • vpa.maxMemory
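
For example, here is a hedged sketch of such limits, assuming the module is configured via a ModuleConfig resource named prometheus; the resource values and the settings schema version are illustrative and depend on your cluster and Deckhouse release:

apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: prometheus
spec:
  version: 2             # settings schema version; may differ in your release
  settings:
    vpa:
      maxCPU: "4"         # upper bound for the main Prometheus CPU requests
      maxMemory: 16Gi     # upper bound for the main Prometheus memory requests
      longtermMaxCPU: "2"
      longtermMaxMemory: 8Gi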

How do I get access to Prometheus metrics from Lens?

Caution! Using this configuration creates a service in which Prometheus metrics are available without authorization.

To provide Lens with access to Prometheus metrics, you need to create several resources in the cluster.

Resource templates to be created:

---
apiVersion: v1
kind: Namespace
metadata:
  name: lens-proxy
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-lens-proxy
  namespace: lens-proxy
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-lens-proxy:prometheus-access
rules:
- apiGroups: ["monitoring.coreos.com"]
  resources: ["prometheuses/http"]
  resourceNames: ["main", "longterm"]
  verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-lens-proxy:prometheus-access
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-lens-proxy:prometheus-access
subjects:
- kind: ServiceAccount
  name: prometheus-lens-proxy
  namespace: lens-proxy
---
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-lens-proxy-sa
  namespace: lens-proxy
  annotations:
    kubernetes.io/service-account.name: prometheus-lens-proxy
type: kubernetes.io/service-account-token
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-lens-proxy-conf
  namespace: lens-proxy
data:
  "39-log-format.sh": |
    cat > /etc/nginx/conf.d/log-format.conf <<"EOF"
    log_format  body  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"'
                      ' req body: $request_body';
    EOF
  "40-prometheus-proxy-conf.sh": |
    #!/bin/sh
    prometheus_service="$(getent hosts prometheus.d8-monitoring | awk '{print $2}')"
    nameserver="$(awk '/nameserver/{print $2}' < /etc/resolv.conf)"
    cat > /etc/nginx/conf.d/prometheus.conf <<EOF
    server {
      listen 80 default_server;
      resolver ${nameserver} valid=30s;
      set \$upstream ${prometheus_service};
      location / {
        proxy_http_version 1.1;
        proxy_set_header Authorization "Bearer ${BEARER_TOKEN}";
        proxy_pass https://\$upstream:9090$request_uri;
      }
      access_log /dev/stdout body;
    }
    EOF
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-lens-proxy
  namespace: lens-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-lens-proxy
  template:
    metadata:
      labels:
        app: prometheus-lens-proxy
    spec:
      containers:
      - name: nginx
        image: nginx:1.21.4-alpine
        env:
        - name: BEARER_TOKEN
          valueFrom:
            secretKeyRef:
              name: prometheus-lens-proxy-sa
              key: token
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /docker-entrypoint.d/40-prometheus-proxy-conf.sh
          subPath: "40-prometheus-proxy-conf.sh"
          name: prometheus-lens-proxy-conf
        - mountPath: /docker-entrypoint.d/39-log-format.sh
          name: prometheus-lens-proxy-conf
          subPath: 39-log-format.sh
      serviceAccountName: prometheus-lens-proxy
      volumes:
      - name: prometheus-lens-proxy-conf
        configMap:
          name: prometheus-lens-proxy-conf
          defaultMode: 0755
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-lens-proxy
  namespace: lens-proxy
spec:
  selector:
    app: prometheus-lens-proxy
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80

After deploying the resources, Prometheus metrics will be available at the address lens-proxy/prometheus-lens-proxy:8080. In Lens, set the Prometheus type to Prometheus Operator.

Starting from version 5.2.7, Lens requires the pod and namespace labels to be present on node-exporter metrics. Otherwise, node resource consumption will not appear on Lens charts.

To fix this, apply the following resource:

A resource that fixes the display of metrics:

apiVersion: deckhouse.io/v1
kind: CustomPrometheusRules
metadata:
  name: lens-hack
spec:
  groups:
  - name: lens-hack
    rules:
    - expr: node_cpu_seconds_total{mode=~"user|system", pod!~".+", namespace!~".+"}
        * on(node) group_left(namespace, pod) kube_pod_info{namespace="d8-monitoring",
        created_by_name="node-exporter"}
      record: node_cpu_seconds_total
    - expr: node_filesystem_size_bytes{mountpoint="/", pod!~".+", namespace!~".+"}
        * on(node) group_left(namespace, pod) kube_pod_info{namespace="d8-monitoring",
        created_by_name="node-exporter"}
      record: node_filesystem_size_bytes
    - expr: node_filesystem_avail_bytes{mountpoint="/", pod!~".+", namespace!~".+"}
        * on(node) group_left(namespace, pod) kube_pod_info{namespace="d8-monitoring",
        created_by_name="node-exporter"}
      record: node_filesystem_avail_bytes
    - expr: node_memory_MemTotal_bytes{pod!~".+", namespace!~".+"} * on(node) group_left(namespace,
        pod) kube_pod_info{namespace="d8-monitoring", created_by_name="node-exporter"}
      record: node_memory_MemTotal_bytes
    - expr: node_memory_MemFree_bytes{pod!~".+", namespace!~".+"} * on(node) group_left(namespace,
        pod) kube_pod_info{namespace="d8-monitoring", created_by_name="node-exporter"}
      record: node_memory_MemFree_bytes
    - expr: node_memory_Buffers_bytes{pod!~".+", namespace!~".+"} * on(node) group_left(namespace,
        pod) kube_pod_info{namespace="d8-monitoring", created_by_name="node-exporter"}
      record: node_memory_Buffers_bytes
    - expr: node_memory_Cached_bytes{pod!~".+", namespace!~".+"} * on(node) group_left(namespace,
        pod) kube_pod_info{namespace="d8-monitoring", created_by_name="node-exporter"}
      record: node_memory_Cached_bytes

How do I set up a ServiceMonitor or PodMonitor to work with Prometheus?

Add the prometheus: main label to the PodMonitor or ServiceMonitor. Add the label prometheus.deckhouse.io/monitor-watcher-enabled: "true" to the namespace where the PodMonitor or ServiceMonitor was created.

Example:

---
apiVersion: v1
kind: Namespace
metadata:
  name: frontend
  labels:
    prometheus.deckhouse.io/monitor-watcher-enabled: "true"
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: frontend
  labels:
    prometheus: main
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: web

How to expand disk size

  1. To request a larger volume for a PVC, edit the PVC object and specify a larger size in the spec.resources.requests.storage field (see the example after this list).
    • You can only expand a PVC if its storage class’s allowVolumeExpansion field is set to true.
  2. If the storage doesn’t support online resize, the following message will appear in the PersistentVolumeClaim status: Waiting for user to (re-)start a pod to finish file system resize of volume on node.
  3. Restart the Pod to complete the file system resizing.
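
Below is a minimal sketch of such an edit; the PVC name, namespace, storage class, and sizes are illustrative, and only the spec.resources.requests.storage value is changed:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
  namespace: my-namespace
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: my-storage-class  # its allowVolumeExpansion must be set to true
  resources:
    requests:
      storage: 50Gi  # increased from the previous value (e.g., 30Gi) to request expansion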