How do I collect metrics from applications running outside of the cluster?
- Configure a Service similar to the one that collects metrics from your application (but do not set the spec.selector parameter).
- Create Endpoints for this Service and explicitly specify the IP:PORT pairs that your applications use to expose metrics.

Note that port names in the Endpoints must match those in the Service.

An example:

Application metrics are freely available (no TLS involved) at http://10.182.10.5:9114/metrics.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    prometheus.deckhouse.io/custom-target: my-app
spec:
  ports:
  - name: http-metrics
    port: 9114
---
apiVersion: v1
kind: Endpoints
metadata:
  name: my-app
  namespace: my-namespace
subsets:
- addresses:
  - ip: 10.182.10.5
  ports:
  - name: http-metrics
    port: 9114
How do I create custom Grafana dashboards?
Custom Grafana dashboards can be added to the project using the infrastructure as code approach.
To add your dashboard to Grafana, create the dedicated GrafanaDashboardDefinition custom resource in the cluster.

An example:
apiVersion: deckhouse.io/v1
kind: GrafanaDashboardDefinition
metadata:
  name: my-dashboard
spec:
  folder: My folder # The folder where the custom dashboard will be located
  definition: |
    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": "-- Grafana --",
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "limit": 100,
    ...
Caution! System dashboards and dashboards added using GrafanaDashboardDefinition cannot be modified via the Grafana interface.
How do I add alerts and/or recording rules?
The CustomPrometheusRules resource allows you to add alerts and recording rules.

Parameters:
- groups is the only parameter; it defines the rule groups. The structure of the groups is similar to that of the prometheus-operator.

An example:
apiVersion: deckhouse.io/v1
kind: CustomPrometheusRules
metadata:
  name: my-rules
spec:
  groups:
  - name: cluster-state-alert.rules
    rules:
    - alert: CephClusterErrorState
      annotations:
        description: Storage cluster is in error state for more than 10m.
        summary: Storage cluster is in error state
        plk_markup_format: markdown
      expr: |
        ceph_health_status{job="rook-ceph-mgr"} > 1
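Recording rules go into the same groups structure. Below is a minimal sketch, assuming the rules follow the prometheus-operator format (which allows record alongside alert); the group name, rule name, and expression are illustrative and not taken from the original documentation:

apiVersion: deckhouse.io/v1
kind: CustomPrometheusRules
metadata:
  name: my-recording-rules
spec:
  groups:
  - name: my-app.recording.rules
    rules:
    # Precompute the per-namespace request rate so dashboards can query it cheaply.
    # The metric name http_requests_total is a placeholder for your application's metric.
    - record: namespace:http_requests:rate5m
      expr: |
        sum(rate(http_requests_total[5m])) by (namespace)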
How do I provision additional Grafana data sources?
The GrafanaAdditionalDatasource custom resource allows you to provision additional Grafana data sources.

A detailed description of the resource parameters is available in the Grafana documentation; see the section for the specific datasource type.

An example:
apiVersion: deckhouse.io/v1
kind: GrafanaAdditionalDatasource
metadata:
  name: another-prometheus
spec:
  type: prometheus
  access: Proxy
  url: https://another-prometheus.example.com/prometheus
  basicAuth: true
  basicAuthUser: foo
  jsonData:
    timeInterval: 30s
    httpMethod: POST
  secureJsonData:
    basicAuthPassword: bar
How do I enable secure access to metrics?
To enable secure access to metrics, we strongly recommend using kube-rbac-proxy.
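The idea is to put an authenticating and authorizing proxy in front of the application's metrics port. Below is a minimal sketch, assuming the application exposes metrics on 127.0.0.1:9114 inside the Pod; the names, image tag, and RBAC details are illustrative and not the module's canonical example (the proxy's ServiceAccount must be allowed to create TokenReview and SubjectAccessReview objects):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # This ServiceAccount needs RBAC permissions to create
      # TokenReview and SubjectAccessReview objects.
      serviceAccountName: my-app
      containers:
      - name: my-app
        image: registry.example.com/my-app:latest   # hypothetical application image
        # The application itself listens only on 127.0.0.1:9114.
      - name: kube-rbac-proxy
        image: quay.io/brancz/kube-rbac-proxy:v0.14.0   # version tag is illustrative
        args:
        - "--secure-listen-address=0.0.0.0:8443"
        - "--upstream=http://127.0.0.1:9114/"
        - "--logtostderr=true"
        ports:
        - containerPort: 8443
          name: https-metrics

Prometheus then scrapes port 8443 over HTTPS, authenticating with its ServiceAccount token, while unauthenticated clients cannot reach the metrics.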
How do I add Alertmanager?
Create a CustomAlertmanager custom resource with the type Internal.

Example:
apiVersion: deckhouse.io/v1alpha1
kind: CustomAlertmanager
metadata:
  name: webhook
spec:
  type: Internal
  internal:
    route:
      groupBy: ['job']
      groupWait: 30s
      groupInterval: 5m
      repeatInterval: 12h
      receiver: 'webhook'
    receivers:
    - name: 'webhook'
      webhookConfigs:
      - url: 'http://webhookserver:8080/'
Refer to the description of the CustomAlertmanager custom resource for more information about the parameters.
How do I add an additional Alertmanager?
Create a CustomAlertmanager custom resource with the type External; it can point to an Alertmanager via an FQDN or a Kubernetes Service.

FQDN Alertmanager example:
apiVersion: deckhouse.io/v1alpha1
kind: CustomAlertmanager
metadata:
  name: my-fqdn-alertmanager
spec:
  external:
    address: https://alertmanager.mycompany.com/myprefix
  type: External
Alertmanager with a Kubernetes service:
apiVersion: deckhouse.io/v1alpha1
kind: CustomAlertmanager
metadata:
  name: my-service-alertmanager
spec:
  external:
    service:
      namespace: myns
      name: my-alertmanager
      path: /myprefix/
  type: External
Refer to the description of the CustomAlertmanager Custom Resource for more information about the parameters.
How do I ignore unnecessary alerts in Alertmanager?
The solution comes down to configuring alert routing in the Alertmanager.

You will need to:
- Create a parameterless receiver.
- Route unwanted alerts to this receiver.

Below is a sample alertmanager.yaml for this kind of situation:
receivers:
- name: blackhole
  # The parameterless receiver is similar to "/dev/null".
- name: some-other-receiver
  # ...
route:
  routes:
  - match:
      alertname: DeadMansSwitch
    receiver: blackhole
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: blackhole
  - receiver: some-other-receiver
A detailed description of all parameters can be found in the official documentation.
Why can’t different scrape intervals be set for individual targets?
The Prometheus developer Brian Brazil provides probably the most comprehensive answer to this question. In short, different scrape intervals are likely to cause the following complications:
- Increasing configuration complexity;
- Problems with writing queries and creating graphs;
- Short intervals are more like profiling an app, and Prometheus isn’t the best tool to do this in most cases.
The most appropriate value for scrapeInterval is in the range of 10-60s.
How do I limit Prometheus resource consumption?
To avoid situations when VPA requests more resources for Prometheus or Longterm Prometheus than those available on the corresponding node, you can explicitly limit VPA using the module parameters:
- vpa.longtermMaxCPU
- vpa.longtermMaxMemory
- vpa.maxCPU
- vpa.maxMemory
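A minimal sketch of setting these limits through the prometheus module configuration; the ModuleConfig settings version and the values shown are illustrative, so check the module settings reference for your Deckhouse release:

apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: prometheus
spec:
  version: 2            # illustrative settings version
  settings:
    vpa:
      maxCPU: "4"       # illustrative values, size them to your nodes
      maxMemory: 16Gi
      longtermMaxCPU: "2"
      longtermMaxMemory: 8Gi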
How do I get access to Prometheus metrics from Lens?
⛔ Caution! Using this configuration creates a Service in which Prometheus metrics are available without authorization.

To provide Lens with access to Prometheus metrics, you need to create a number of resources in the cluster.
After the resources are deployed, Prometheus metrics will be available at the address lens-proxy/prometheus-lens-proxy:8080.
The Prometheus type in Lens is Prometheus Operator.

Starting from version 5.2.7, Lens requires the pod and namespace labels to be present on node-exporter metrics.
Otherwise, node resource consumption will not appear on Lens charts.
To fix this, apply a resource that adds these labels to the node-exporter metrics.
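A minimal sketch of such a resource, assuming node-exporter is scraped via a ServiceMonitor in the d8-monitoring namespace; the resource name, the selector label, and the port name are assumptions about the node-exporter Service, not the exact resource from the original documentation:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter-lens-labels
  namespace: d8-monitoring
  labels:
    prometheus: main
spec:
  selector:
    matchLabels:
      app: node-exporter        # assumption: the label on the node-exporter Service
  namespaceSelector:
    matchNames:
    - d8-monitoring
  endpoints:
  - port: https-metrics         # assumption: the metrics port name of the node-exporter Service
    relabelings:
    # Copy the Kubernetes SD meta labels into the labels Lens expects.
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace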
How do I set up a ServiceMonitor or PodMonitor to work with Prometheus?
- Add the prometheus: main label to the PodMonitor or ServiceMonitor.
- Add the label prometheus.deckhouse.io/monitor-watcher-enabled: "true" to the namespace where the PodMonitor or ServiceMonitor was created.
Example:
---
apiVersion: v1
kind: Namespace
metadata:
  name: frontend
  labels:
    prometheus.deckhouse.io/monitor-watcher-enabled: "true"
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: frontend
  labels:
    prometheus: main
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: web
How do I expand the disk size?
- To request a larger volume for a PVC, edit the PVC object and specify a larger size in the spec.resources.requests.storage field.
- You can only expand a PVC if its storage class's allowVolumeExpansion field is set to true.
- If the storage does not support online resize, the message "Waiting for user to (re-)start a pod to finish file system resize of volume on node." will appear in the PersistentVolumeClaim status. Restart the Pod to complete the file system resizing.
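For illustration, a PVC with the storage request increased; the PVC name, namespace, StorageClass, and sizes below are hypothetical:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data            # hypothetical PVC name
  namespace: my-namespace
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ceph-ssd   # this StorageClass must have allowVolumeExpansion: true
  resources:
    requests:
      storage: 60Gi            # increased from the previous value, e.g. 30Gi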