This page lists all monitoring alerts in the Deckhouse Kubernetes Platform.
Alerts are grouped by module. To the right of each alert name, icons indicate the minimum DKP edition in which the alert is available and the alert severity.
For each alert, a summary is provided, and if available, the detailed alert description can be viewed by expanding it.
Module admission-policy-engine
-
D8AdmissionPolicyEngineNotBootstrapped
CE
S7
Admission-policy-engine module hasn't been bootstrapped for 10 minutes.
The admission-policy-engine module couldn’t bootstrap.
Steps to troubleshoot:
-
Verify that the module’s components are up and running:
kubectl get pods -n d8-admission-policy-engine
-
Check logs for issues, such as missing constraint templates or incomplete CRD creation:
kubectl logs -n d8-system -lapp=deckhouse --tail=1000 | grep admission-policy-engine
-
-
OperationPolicyViolation
CE
S7
At least one object violates the configured cluster operation policies.
You have configured operation policies for the cluster, and one or more existing objects are violating these policies.
To identify violating objects:
-
Run the following Prometheus query:
count by (violating_namespace, violating_kind, violating_name, violation_msg) ( d8_gatekeeper_exporter_constraint_violations{ violation_enforcement="deny", source_type="OperationPolicy" } )
-
Alternatively, check the admission-policy-engine Grafana dashboard.
-
-
PodSecurityStandardsViolation
CE
S7
At least one pod violates the configured cluster pod security standards.
You have configured Pod Security Standards, and one or more running pods are violating these standards.
To identify violating pods:
-
Run the following Prometheus query:
count by (violating_namespace, violating_name, violation_msg) ( d8_gatekeeper_exporter_constraint_violations{ violation_enforcement="deny", violating_namespace=~".*", violating_kind="Pod", source_type="PSS" } )
-
Alternatively, check the admission-policy-engine Grafana dashboard.
-
-
SecurityPolicyViolation
CE
S7
At least one object violates the configured cluster security policies.
You have configured security policies for the cluster, and one or more existing objects are violating these policies.
To identify violating objects:
-
Run the following Prometheus query:
count by (violating_namespace, violating_kind, violating_name, violation_msg) ( d8_gatekeeper_exporter_constraint_violations{ violation_enforcement="deny", source_type="SecurityPolicy" } )
-
Alternatively, check the admission-policy-engine Grafana dashboard.
-
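The violation queries in this module can also be run from a shell through the Prometheus HTTP API. The sketch below is only an illustration and is not part of Deckhouse: the `violations_query` helper and the `PROM_URL` variable are hypothetical names, and it assumes a Prometheus endpoint that you can already reach and authenticate against. It covers the OperationPolicy and SecurityPolicy queries, which differ only in `source_type`.

```shell
#!/bin/sh
# Hypothetical helper: prints the violation query for the given source_type
# (for example, OperationPolicy or SecurityPolicy).
violations_query() {
  printf '%s' "count by (violating_namespace, violating_kind, violating_name, violation_msg) (d8_gatekeeper_exporter_constraint_violations{violation_enforcement=\"deny\", source_type=\"$1\"})"
}

# Print the query for security policies.
violations_query SecurityPolicy
echo

# Assumption: PROM_URL points at a Prometheus endpoint you can reach.
# curl -sG "${PROM_URL}/api/v1/query" \
#   --data-urlencode "query=$(violations_query OperationPolicy)"
```

Building the query in one place keeps the label matchers identical across policy types, so results stay comparable.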
Module cert-manager
-
CertmanagerCertificateExpired
CE
S4
Certificate {{$labels.exported_namespace}}/{{$labels.name}} is not provided.
Certificate is not provided.
To check the certificate details, run the following command:
kubectl -n {{$labels.exported_namespace}} describe certificate {{$labels.name}}
-
CertmanagerCertificateExpiredSoon
CE
S4
Certificate will expire soon.
The certificate {{$labels.exported_namespace}}/{{$labels.name}} will expire in less than two weeks.
To check the certificate details, run the following command:
kubectl -n <NAMESPACE> describe certificate <CERTIFICATE-NAME>
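The "less than two weeks" threshold can be checked by hand with simple epoch arithmetic. The following is a minimal sketch, not the alert's actual rule: `expires_soon` is a hypothetical helper, and reading the expiry time from the Certificate's `status.notAfter` field with GNU `date` is an assumption about your environment.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when fewer than 14 days remain before expiry
# (already expired certificates also count).
# Usage: expires_soon NOT_AFTER_EPOCH NOW_EPOCH
expires_soon() {
  two_weeks=$((14 * 24 * 60 * 60))
  [ $(( $1 - $2 )) -lt "$two_weeks" ]
}

# Example with fixed timestamps: 10 days left, so the check succeeds.
now=1700000000
not_after=$((now + 10 * 24 * 60 * 60))
if expires_soon "$not_after" "$now"; then
  echo "certificate expires in less than two weeks"
fi

# Assumption: on a live cluster the expiry time could be read from the
# Certificate status and converted with GNU date, for example:
# not_after=$(date -d "$(kubectl -n <NAMESPACE> get certificate <CERTIFICATE-NAME> -o jsonpath='{.status.notAfter}')" +%s)
```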
-
CertmanagerCertificateOrderErrors
CE
S5
Cert-manager couldn't order a certificate.
Cert-manager received responses with the status code {{ $labels.status }} when requesting {{ $labels.scheme }}://{{ $labels.host }}{{ $labels.path }}.
This can affect certificate ordering and renewal in the future. For details, check the cert-manager logs using the following command:
kubectl -n d8-cert-manager logs -l app=cert-manager -c cert-manager
Module chrony
-
NodeTimeOutOfSync
CE
S5
Clock on the node {{$labels.node}} is drifting.
Time on the node {{$labels.node}} is out of sync and drifts apart from the NTP server clock by {{ $value }} seconds.
To resolve the time synchronization issues:
- Fix network errors:
- Ensure the upstream time synchronization servers defined in the chrony configuration are available.
- Eliminate large packet loss and excessive latency to upstream time synchronization servers.
- Modify the NTP servers list defined in the chrony configuration.
-
NTPDaemonOnNodeDoesNotSynchronizeTime
CE
S5
NTP daemon on the node {{$labels.node}} hasn't synchronized time for too long.
Steps to troubleshoot:
-
Check if the chrony pod is running on the node by executing the following command:
kubectl -n d8-chrony --field-selector spec.nodeName="{{$labels.node}}" get pods
-
Verify the chrony daemon’s status by executing the following command:
kubectl -n d8-chrony exec <POD_NAME> -- /opt/chrony-static/bin/chronyc sources
-
Resolve the time synchronization issues:
- Fix network errors:
- Ensure the upstream time synchronization servers defined in the chrony configuration are available.
- Eliminate large packet loss and excessive latency to upstream time synchronization servers.
- Modify the NTP servers list defined in the chrony configuration.
-
Module cloud-provider-yandex
-
D8YandexNatInstanceConnectionsQuotaUtilization
CE
S4
Yandex nat-instance connections quota utilization is above 85% over the last 5 minutes.
Nat-instance connections quota should be increased by Yandex technical support.
-
NATInstanceWithDeprecatedAvailabilityZone
CE
S9
NAT Instance {{ $labels.name }} is in a deprecated availability zone.
Availability zone ru-central1-c is deprecated by Yandex.Cloud. You should migrate your NAT Instance to the ru-central1-a or ru-central1-b zone.
You can use the following instructions to migrate.
IMPORTANT The following actions are destructive changes and cause downtime (typically several tens of minutes, depending on the response time of Yandex Cloud).
-
Migrate NAT Instance.
Get providerClusterConfiguration.withNATInstance:
kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerClusterConfiguration.withNATInstance'
-
If you specified withNATInstance.natInstanceInternalAddress and/or withNATInstance.internalSubnetID in providerClusterConfiguration, you need to remove them with the following command:
kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller edit provider-cluster-configuration
-
If you specified withNATInstance.externalSubnetID and/or withNATInstance.natInstanceExternalAddress in providerClusterConfiguration, you need to change these to the appropriate values.
You can get the address and subnetID from the Yandex.Cloud console or with the CLI.
Change withNATInstance.externalSubnetID and withNATInstance.natInstanceExternalAddress with the following command:
kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller edit provider-cluster-configuration
-
-
Run the appropriate edition and version of the Deckhouse installer container on the local machine (change the container registry address if necessary) and do converge.
-
Get the Deckhouse edition and version:
DH_VERSION=$(kubectl -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/version}')
DH_EDITION=$(kubectl -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/edition}' | tr '[:upper:]' '[:lower:]')
echo "DH_VERSION=$DH_VERSION DH_EDITION=$DH_EDITION"
-
Run the installer:
docker run --pull=always -it -v "$HOME/.ssh/:/tmp/.ssh/" registry.deckhouse.io/deckhouse/${DH_EDITION}/install:${DH_VERSION} bash
-
Do converge:
dhctl converge --ssh-agent-private-keys=/tmp/.ssh/<SSH_KEY_FILENAME> --ssh-user=<USERNAME> --ssh-host <MASTER-NODE-0-HOST>
-
-
Update route table
-
Get route table name
kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.global.clusterConfiguration.cloud.prefix'
-
Get NAT Instance name:
kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerDiscoveryData.natInstanceName'
-
Get NAT Instance internal IP
yc compute instance list | grep -e "INTERNAL IP" -e <NAT_INSTANCE_NAME_FROM_PREVIOUS_STEP>
-
Update route
yc vpc route-table update --name <ROUTE_TABLE_NAME_FROM_PREVIOUS_STEP> --route "destination=0.0.0.0/0,next-hop=<NAT_INSTANCE_INTERNAL_IP_FROM_PREVIOUS_STEP>"
-
-
-
NodeGroupNodeWithDeprecatedAvailabilityZone
CE
S9
NodeGroup {{ $labels.node_group }} contains Nodes in a deprecated availability zone.
Availability zone ru-central1-c is deprecated by Yandex.Cloud. You should migrate your Nodes, Disks and LoadBalancers to ru-central1-a, ru-central1-b or ru-central1-d (introduced in v1.56). To check which Nodes should be migrated, use the following command:
kubectl get node -l "topology.kubernetes.io/zone=ru-central1-c"
You can use the Yandex Migration Guide (mostly applicable to the ru-central1-d zone only).
IMPORTANT You cannot migrate public IP addresses between zones. Check out the Yandex Migration Guide for details.
Module cni-cilium
-
CiliumAgentEndpointsNotReady
CE
S4
More than half of all known Endpoints are not ready in agent {{ $labels.namespace }}/{{ $labels.pod }}.
Check what’s going on:
kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}
-
CiliumAgentMapPressureCritical
CE
S4
eBPF map {{ $labels.map_name }} is more than 90% full in agent {{ $labels.namespace }}/{{ $labels.pod }}.
The resource limit of eBPF maps has been reached. Consult the vendor for possible remediation steps.
-
CiliumAgentMetricNotFound
CE
S4
Some of the metrics are not coming from the agent {{ $labels.namespace }}/{{ $labels.pod }}.
Use the following commands to check what’s going on:
kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}
kubectl -n {{ $labels.namespace }} exec -ti {{ $labels.pod }} cilium-health status
We need to cross-check the metrics with the neighboring agent. Also the absence of metrics is an indirect sign that new pods cannot be created on the node because of the inability to connect to the agent. It is important to get a more specific way of determining the above situation and create a more accurate alert for the inability to connect new pods to the agent.
-
CiliumAgentPolicyImportErrors
CE
S4
Agent {{ $labels.namespace }}/{{ $labels.pod }} fails to import policies.
Check what’s going on:
kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}
-
CiliumAgentUnreachableHealthEndpoints
CE
S4
Some node's health endpoints are not reachable by agent {{ $labels.namespace }}/{{ $labels.pod }}.
Check what’s going on:
kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}
-
CniCiliumNonStandardVXLANPortFound
CE
S4
There is a non-standard VXLAN port in the Cilium config
There is a non-standard VXLAN port in the Cilium config: {{$labels.port}} does not fit the recommended range (4298 if the virtualization module is enabled or 4299 for a regular Deckhouse setup).
Consider configuring the tunnel-port parameter in the cilium-configmap ConfigMap (d8-cni-cilium namespace) according to the recommended range. If you know why you need the non-standard port, just ignore the alert.
-
CniCiliumOrphanEgressGatewayPolicyFound
SE-PLUS
S4
Found orphan EgressGatewayPolicy with irrelevant EgressGateway name
There is an orphan EgressGatewayPolicy in the cluster with the name {{$labels.name}}, which has an irrelevant EgressGateway name.
It is recommended to check the EgressGateway name in the EgressGatewayPolicy resource: {{$labels.egressgateway}}
-
D8CNIMisconfigured
CE
S3
The settings from the secret d8-cni-configuration and the ModuleConfig contradict each other.
It is necessary to correct the settings in the CNI {{ $labels.cni }} ModuleConfig. You can find the desired settings in the d8-system/desired-cni-moduleconfig ConfigMap. To do this, run the following command:
kubectl -n d8-system get configmap desired-cni-moduleconfig -o yaml
Module cni-flannel
-
D8CNIMisconfigured
CE
S3
The settings from the secret d8-cni-configuration and the ModuleConfig contradict each other.
It is necessary to correct the settings in the CNI {{ $labels.cni }} ModuleConfig. You can find the desired settings in the d8-system/desired-cni-moduleconfig ConfigMap. To do this, run the following command:
kubectl -n d8-system get configmap desired-cni-moduleconfig -o yaml
Module cni-simple-bridge
-
D8CNIMisconfigured
CE
S3
The settings from the secret d8-cni-configuration and the ModuleConfig contradict each other.
It is necessary to correct the settings in the CNI {{ $labels.cni }} ModuleConfig. You can find the desired settings in the d8-system/desired-cni-moduleconfig ConfigMap. To do this, run the following command:
kubectl -n d8-system get configmap desired-cni-moduleconfig -o yaml
Module control-plane-manager
-
D8ControlPlaneManagerPodNotRunning
CE
S6
Controller Pod not running on Node {{ $labels.node }}
Pod d8-control-plane-manager fails or is not scheduled on Node {{ $labels.node }}.
Consider checking the state of the kube-system/d8-control-plane-manager DaemonSet and its Pods:
kubectl -n kube-system get daemonset,pod --selector=app=d8-control-plane-manager
-
D8EtcdDatabaseHighFragmentationRatio
CE
S7
etcd database size in use is less than 50% of the actual allocated storage, indicating potential fragmentation, and the total storage size exceeds 75% of the configured quota.
The etcd database size in use on instance {{ $labels.instance }} is less than 50% of the actual allocated disk space, indicating potential fragmentation.
Possible solutions:
- You can do defragmentation. Use the following command:
kubectl -n kube-system exec -ti etcd-{{ $labels.node }} -- /usr/bin/etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s
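The alert condition combines two thresholds: the size in use is below 50% of the allocated size, and the allocated size is above 75% of the quota. A minimal sketch of that check is shown below. The `etcd_fragmented` helper is hypothetical, and the sample byte counts are made up for illustration; on a real cluster the inputs might come from etcd's endpoint status or metrics.

```shell
#!/bin/sh
# Hypothetical helper mirroring the alert condition described above.
# Usage: etcd_fragmented IN_USE_BYTES ALLOCATED_BYTES QUOTA_BYTES
etcd_fragmented() {
  # in use < 50% of allocated, and allocated > 75% of quota
  [ $(( $1 * 2 )) -lt "$2" ] && [ $(( $2 * 4 )) -gt $(( $3 * 3 )) ]
}

# Illustrative values only: 0.7 GB in use, 1.8 GB allocated, 2 GiB quota.
if etcd_fragmented 700000000 1800000000 2147483648; then
  echo "defragmentation is worth considering"
fi
```

Integer multiplication is used instead of division so that the comparison stays exact in plain shell arithmetic.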
-
D8EtcdExcessiveDatabaseGrowth
CE
S4
etcd cluster database growing very fast.
The etcd database on instance {{ $labels.instance }} is predicted to run out of disk space within the next day, based on the 6h growth rate.
Please check and take action, as this might be disruptive.
-
D8KubeEtcdDatabaseSizeCloseToTheLimit
CE
S3
etcd db size is close to the limit
The size of the etcd database on {{ $labels.node }} has almost reached the limit. Possibly there are a lot of events (e.g. Pod evictions), or a high number of other resources have been created in the cluster recently.
Possible solutions:
- You can do defragmentation. Use the following command:
kubectl -n kube-system exec -ti etcd-{{ $labels.node }} -- /usr/bin/etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s
- Increase node memory. Starting from 24 GB, quota-backend-bytes is increased by 1 GB for every extra 8 GB of node memory. For example:
Node Memory  quota-backend-bytes
16GB         2147483648 (2GB)
24GB         3221225472 (3GB)
32GB         4294967296 (4GB)
40GB         5368709120 (5GB)
48GB         6442450944 (6GB)
56GB         7516192768 (7GB)
64GB         8589934592 (8GB)
72GB         8589934592 (8GB)
…
-
D8KubernetesVersionIsDeprecated
CE
S7
Kubernetes version "{{ $labels.k8s_version }}" is deprecated
The current Kubernetes version “{{ $labels.k8s_version }}” is deprecated, and its support will be removed within 6 months.
Please migrate to the next Kubernetes version (at least 1.28).
Check how to update the Kubernetes version in the cluster here - https://deckhouse.io/documentation/deckhouse-faq.html#how-do-i-upgrade-the-kubernetes-version-in-a-cluster
-
D8NeedDecreaseEtcdQuotaBackendBytes
CE
S6
Deckhouse considers that quota-backend-bytes should be reduced.
Deckhouse can only increase quota-backend-bytes. This alert fires when the memory of control-plane nodes has been reduced. If that is the case, set quota-backend-bytes manually with the controlPlaneManager.etcd.maxDbSize configuration parameter. Before setting a new value, check the current DB usage on every control-plane node:
for pod in $(kubectl get pod -n kube-system -l component=etcd,tier=control-plane -o name); do kubectl -n kube-system exec -ti "$pod" -- /usr/bin/etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/ca.key endpoint status -w json | jq --arg a "$pod" -r '.[0].Status.dbSize / 1024 / 1024 | tostring | $a + ": " + . + " MB"'; done
Recommendations:
- The maximum value of controlPlaneManager.etcd.maxDbSize is 8 GB.
- If control-plane nodes have less than 24 GB of memory, use 2 GB for controlPlaneManager.etcd.maxDbSize.
- For 24 GB or more, increase the value by 1 GB for every extra 8 GB of memory.
Node Memory  quota-backend-bytes
16GB         2147483648 (2GB)
24GB         3221225472 (3GB)
32GB         4294967296 (4GB)
40GB         5368709120 (5GB)
48GB         6442450944 (6GB)
56GB         7516192768 (7GB)
64GB         8589934592 (8GB)
72GB         8589934592 (8GB)
…
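The sizing recommendations above can be expressed as a small calculation. The sketch below is one reading of those rules (2 GB below 24 GB of memory, then 1 GB more per extra 8 GB, capped at 8 GB), not Deckhouse's own implementation. `recommended_quota_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: prints the recommended quota-backend-bytes value
# for a control-plane node with the given amount of memory in GB.
recommended_quota_bytes() {
  mem_gb=$1
  if [ "$mem_gb" -lt 24 ]; then
    gb=2
  else
    # 3 GB at 24 GB of memory, plus 1 GB for every extra 8 GB.
    gb=$((3 + (mem_gb - 24) / 8))
    # The maximum value is 8 GB.
    if [ "$gb" -gt 8 ]; then
      gb=8
    fi
  fi
  echo $((gb * 1024 * 1024 * 1024))
}

# Reproduce the rows of the table for a few memory sizes.
for mem in 16 24 32 40 48 56 64 72; do
  printf '%sGB %s\n' "$mem" "$(recommended_quota_bytes "$mem")"
done
```

The output can be compared with the actual DB usage reported by the command above before choosing a value for controlPlaneManager.etcd.maxDbSize.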
-
KubernetesVersionEndOfLife
CE
S4
Kubernetes version "{{ $labels.k8s_version }}" has reached End Of Life.
Support for the current Kubernetes version “{{ $labels.k8s_version }}” will be removed in the next Deckhouse release (1.58).
Please migrate to the next Kubernetes version (at least 1.24) as soon as possible.
Check how to update the Kubernetes version in the cluster here - https://deckhouse.io/documentation/deckhouse-faq.html#how-do-i-upgrade-the-kubernetes-version-in-a-cluster
Module documentation
-
ModuleConfigDeprecated
CE
S9
Deprecated ModuleConfig was found.
The module deckhouse-web was renamed to documentation.
The new ModuleConfig documentation was generated automatically. Please remove the deprecated ModuleConfig deckhouse-web from the CI deploy process and delete it:
kubectl delete mc deckhouse-web
Module extended-monitoring
-
CertificateSecretExpired
CE
S8
Certificate expired
Certificate in secret {{$labels.namespace}}/{{$labels.name}} expired.
- If the certificate is manually managed, upload a newer one.
- If the certificate is managed by cert-manager, inspect the Certificate resource. The recommended course of action:
- Retrieve certificate name from the secret:
cert=$(kubectl get secret -n {{$labels.namespace}} {{$labels.name}} -o 'jsonpath={.metadata.annotations.cert-manager\.io/certificate-name}')
- View the status of the Certificate and try to figure out why it is not updated:
kubectl describe cert -n {{$labels.namespace}} "$cert"
-
CertificateSecretExpiredSoon
CE
S8
Certificate will expire soon.
Certificate in secret {{$labels.namespace}}/{{$labels.name}} will expire in less than 2 weeks
- If the certificate is manually managed, upload a newer one.
- If the certificate is managed by cert-manager, inspect the Certificate resource. The recommended course of action:
- Retrieve certificate name from the secret:
cert=$(kubectl get secret -n {{$labels.namespace}} {{$labels.name}} -o 'jsonpath={.metadata.annotations.cert-manager\.io/certificate-name}')
- View the status of the Certificate and try to figure out why it is not updated:
kubectl describe cert -n {{$labels.namespace}} "$cert"
-
CronJobAuthenticationFailure
CE
S7
Unable to login to the container registry using imagePullSecrets for the {{ $labels.image }} image.
Unable to login to the container registry using imagePullSecrets for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace; in the CronJob {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified.
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified in the {{ $labels.namespace }} Namespace; in the CronJob {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
CronJobBadImageFormat
CE
S7
The {{ $labels.image }} image has incorrect name.
You should check whether the {{ $labels.image }} image name is spelled correctly: in the {{ $labels.namespace }} Namespace; in the CronJob {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
CronJobFailed
CE
S5
Job {{$labels.namespace}}/{{$labels.job_name}} failed in CronJob {{$labels.namespace}}/{{$labels.owner_name}}.
Print Job details:
kubectl -n {{$labels.namespace}} describe job {{$labels.job_name}}
Check the Job status:
kubectl -n {{$labels.namespace}} get job {{$labels.job_name}}
Check the status of pods created by the Job:
kubectl -n {{$labels.namespace}} get pods -l job-name={{$labels.job_name}}
-
CronJobImageAbsent
CE
S7
The {{ $labels.image }} image is missing from the registry.
You should check whether the {{ $labels.image }} image is available: in the {{ $labels.namespace }} Namespace; in the CronJob {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
CronJobPodsNotCreated
CE
S5
CronJob {{$labels.namespace}}/{{$labels.job_name}} pods still not created.
Print Job details:
kubectl -n {{$labels.namespace}} describe job {{$labels.job_name}}
Check the Job status:
kubectl -n {{$labels.namespace}} get job {{$labels.job_name}}
Check the status of pods created by the Job:
kubectl -n {{$labels.namespace}} get pods -l job-name={{$labels.job_name}}
-
The container registry is not available for the {{ $labels.image }} image.
The container registry is not available for the {{ $labels.image }} image: in the {{ $labels.namespace }} Namespace; in the CronJob {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
CronJobSchedulingError
CE
S6
CronJob {{$labels.namespace}}/{{$labels.cronjob}} failed to schedule on time.
CronJob {{$labels.namespace}}/{{$labels.cronjob}} failed to schedule on time.
Schedule: "{{ printf "kube_cronjob_info{namespace="%s", cronjob="%s"}" $labels.namespace $labels.cronjob | query | first | label "schedule" }}"
Last schedule time: {{ printf "kube_cronjob_status_last_schedule_time{namespace="%s", cronjob="%s"}" $labels.namespace $labels.cronjob | query | first | value | humanizeTimestamp }}%
Projected next schedule time: {{ printf "kube_cronjob_next_schedule_time{namespace="%s", cronjob="%s"}" $labels.namespace $labels.cronjob | query | first | value | humanizeTimestamp }}%
-
CronJobUnknownError
CE
S7
An unknown error occurred for the {{ $labels.image }} image.
An unknown error occurred for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace; in the CronJob {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
Refer to the exporter logs:
kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
-
D8CertExporterPodIsNotReady
CE
S8
The x509-certificate-exporter Pod is NOT Ready.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-monitoring describe deploy x509-certificate-exporter
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-monitoring describe pod -l app=x509-certificate-exporter
-
D8CertExporterPodIsNotRunning
CE
S8
The x509-certificate-exporter Pod is NOT Running.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-monitoring describe deploy x509-certificate-exporter
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-monitoring describe pod -l app=x509-certificate-exporter
-
D8CertExporterTargetAbsent
CE
S8
There is no x509-certificate-exporter target in Prometheus.
Check the Pod status:
kubectl -n d8-monitoring get pod -l app=x509-certificate-exporter
Or check the Pod logs:
kubectl -n d8-monitoring logs -l app=x509-certificate-exporter -c x509-certificate-exporter
-
D8CertExporterTargetDown
CE
S8
Prometheus cannot scrape the x509-certificate-exporter metrics.
Check the Pod status:
kubectl -n d8-monitoring get pod -l app=x509-certificate-exporter
Or check the Pod logs:
kubectl -n d8-monitoring logs -l app=x509-certificate-exporter -c x509-certificate-exporter
-
D8ImageAvailabilityExporterMalfunctioning
CE
S8
image-availability-exporter has crashed.
The image-availability-exporter failed to perform any checks for the availability of images in the registry for over 20 minutes.
You need to analyze its logs:
kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
-
D8ImageAvailabilityExporterPodIsNotReady
CE
S8
The image-availability-exporter Pod is NOT Ready.
The images listed in the image field are not checked for availability in the container registry.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-monitoring describe deploy image-availability-exporter
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-monitoring describe pod -l app=image-availability-exporter
-
D8ImageAvailabilityExporterPodIsNotRunning
CE
S8
The image-availability-exporter Pod is NOT Running.
The images listed in the image field are not checked for availability in the container registry.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-monitoring describe deploy image-availability-exporter
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-monitoring describe pod -l app=image-availability-exporter
-
D8ImageAvailabilityExporterTargetAbsent
CE
S8
There is no image-availability-exporter target in Prometheus.
Check the Pod status:
kubectl -n d8-monitoring get pod -l app=image-availability-exporter
Or check the Pod logs:
kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
-
D8ImageAvailabilityExporterTargetDown
CE
S8
Prometheus cannot scrape the image-availability-exporter metrics.
Check the Pod status:
kubectl -n d8-monitoring get pod -l app=image-availability-exporter
Or check the Pod logs:
kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
-
DaemonSetAuthenticationFailure
CE
S7
Unable to login to the container registry using imagePullSecrets for the {{ $labels.image }} image.
Unable to login to the container registry using imagePullSecrets for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace; in the DaemonSet {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified.
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified in the {{ $labels.namespace }} Namespace; in the DaemonSet {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
DaemonSetBadImageFormat
CE
S7
The {{ $labels.image }} image has incorrect name.
You should check whether the {{ $labels.image }} image name is spelled correctly: in the {{ $labels.namespace }} Namespace; in the DaemonSet {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
DaemonSetImageAbsent
CE
S7
The {{ $labels.image }} image is missing from the registry.
You should check whether the {{ $labels.image }} image is available: in the {{ $labels.namespace }} Namespace; in the DaemonSet {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
The container registry is not available for the {{ $labels.image }} image.
The container registry is not available for the {{ $labels.image }} image: in the {{ $labels.namespace }} Namespace; in the DaemonSet {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
DaemonSetUnknownError
CE
S7
An unknown error occurred for the {{ $labels.image }} image.
An unknown error occurred for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace; in the DaemonSet {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
Refer to the exporter logs:
kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
-
DeploymentAuthenticationFailure
CE
S7
Unable to login to the container registry using imagePullSecrets for the {{ $labels.image }} image.
Unable to login to the container registry using imagePullSecrets for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace; in the Deployment {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified.
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified in the {{ $labels.namespace }} Namespace; in the Deployment {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
DeploymentBadImageFormat
CE
S7
The {{ $labels.image }} image has incorrect name.
You should check whether the {{ $labels.image }} image name is spelled correctly: in the {{ $labels.namespace }} Namespace; in the Deployment {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
DeploymentImageAbsent
CE
S7
The {{ $labels.image }} image is missing from the registry.
You should check whether the {{ $labels.image }} image is available: in the {{ $labels.namespace }} Namespace; in the Deployment {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
The container registry is not available for the {{ $labels.image }} image.
The container registry is not available for the {{ $labels.image }} image: in the {{ $labels.namespace }} Namespace; in the Deployment {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
-
DeploymentUnknownError
CE
S7
An unknown error occurred for the {{ $labels.image }} image.
An unknown error occurred for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace; in the Deployment {{ $labels.name }}; in the {{ $labels.container }} container in the registry.
Refer to the exporter logs:
kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
-
ExtendedMonitoringDeprecatatedAnnotation
CE
S4
Deprecated extended-monitoring.flant.com/enabled annotations are used in cluster. Migrate to extended-monitoring.deckhouse.io/enabled label ASAP. Check d8_deprecated_legacy_annotation metric in Prometheus to get list of all usages.
-
ExtendedMonitoringTargetDown
CE
S5
Extended-monitoring is down
Pod with extended-monitoring exporter is unavailable.
The following alerts will not fire:
- About lack of the space and inodes on volumes
- CPU overloads and throttling of containers
- 500 errors on ingress
- Replicas quantity of controllers (alerts about the insufficient amount of replicas of Deployment, StatefulSet, DaemonSet)
- And others
To debug, execute the following commands:
kubectl -n d8-monitoring describe deploy extended-monitoring-exporter
kubectl -n d8-monitoring describe pod -l app=extended-monitoring-exporter
-
IngressResponses5xx
CE
S4
URL {{$labels.vhost}}{{$labels.location}} on Ingress {{$labels.ingress}} has more than {{ printf "extended_monitoring_ingress_threshold{threshold="5xx-critical", namespace="%s", ingress="%s"}" $labels.namespace $labels.ingress | query | first | value }}% 5xx responses from backend.
URL {{$labels.vhost}}{{$labels.location}} on Ingress {{$labels.ingress}} with Service name “{{$labels.service}}” and port “{{$labels.service_port}}” has more than {{ printf "extended_monitoring_ingress_threshold{threshold="5xx-critical", namespace="%s", ingress="%s"}" $labels.namespace $labels.ingress | query | first | value }}% 5xx responses from backend. Currently at: {{ .Value }}%
-
IngressResponses5xx
CE
S5
URL {{$labels.vhost}}{{$labels.location}} on Ingress {{$labels.ingress}} has more than {{ printf "extended_monitoring_ingress_threshold{threshold="5xx-warning", namespace="%s", ingress="%s"}" $labels.namespace $labels.ingress | query | first | value }}% 5xx responses from backend.
URL {{$labels.vhost}}{{$labels.location}} on Ingress {{$labels.ingress}} with Service name “{{$labels.service}}” and port “{{$labels.service_port}}” has more than {{ printf "extended_monitoring_ingress_threshold{threshold="5xx-warning", namespace="%s", ingress="%s"}" $labels.namespace $labels.ingress | query | first | value }}% 5xx responses from backend. Currently at: {{ .Value }}%
-
KubernetesDaemonSetNotUpToDate
CE
S9
There are {{ .Value }} outdated Pods in the {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet for the last 15 minutes.
There are {{ .Value }} outdated Pods in the {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet for the last 15 minutes.
The recommended course of action:
- Check the DaemonSet’s status:
kubectl -n {{ $labels.namespace }} get ds {{ $labels.daemonset }}
- Analyze the DaemonSet’s description:
kubectl -n {{ $labels.namespace }} describe ds {{ $labels.daemonset }}
- If the Number of Nodes Scheduled with Up-to-date Pods parameter does not match Current Number of Nodes Scheduled, check the DaemonSet’s updateStrategy:
kubectl -n {{ $labels.namespace }} get ds {{ $labels.daemonset }} -o json | jq '.spec.updateStrategy'
- Note that if the OnDelete updateStrategy is set, the DaemonSet is updated only when Pods are deleted.
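The comparison from the steps above can also be done with the DaemonSet's status counters instead of reading the describe output. The sketch below assumes the standard status.desiredNumberScheduled and status.updatedNumberScheduled fields; the function names ds_outdated_count and ds_outdated are hypothetical.

```shell
# Hypothetical helper: number of outdated Pods, given
# status.desiredNumberScheduled ($1) and status.updatedNumberScheduled ($2).
# An empty or missing counter is treated as 0.
ds_outdated_count() {
  echo $(( ${1:-0} - ${2:-0} ))
}

# Hypothetical wrapper: reads both counters from the cluster.
# Usage: ds_outdated <namespace> <daemonset>
ds_outdated() {
  set -- $(kubectl -n "$1" get ds "$2" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.updatedNumberScheduled}')
  ds_outdated_count "$1" "$2"
}
```

A non-zero result from `ds_outdated <namespace> <daemonset>` means that some Pods still run the previous revision.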
-
Count of available replicas in DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is at zero.
Count of available replicas in DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is at zero.
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="DaemonSet", controller_name="%s"}" $labels.namespace $labels.namespace $labels.daemonset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
The following command may help identify problematic nodes, provided you know where the DaemonSet should be scheduled (using a label selector for pods may also help):
kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
-
Count of unavailable replicas in DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is above threshold.
Count of unavailable replicas in DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is above threshold. Currently at: {{ .Value }} unavailable replica(s) Threshold at: {{ printf "extended_monitoring_daemonset_threshold{threshold="replicas-not-ready", namespace="%s", daemonset="%s"}" $labels.namespace $labels.daemonset | query | first | value }} unavailable replica(s)
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="DaemonSet", controller_name="%s"}" $labels.namespace $labels.namespace $labels.daemonset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
The following command may help identify problematic nodes, provided you know where the DaemonSet should be scheduled (using a label selector for pods may also help):
kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
-
Count of available replicas in Deployment {{$labels.namespace}}/{{$labels.deployment}} is at zero.
Count of available replicas in Deployment {{$labels.namespace}}/{{$labels.deployment}} is at zero.
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="Deployment", controller_name="%s"}" $labels.namespace $labels.namespace $labels.deployment | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
Count of unavailable replicas in Deployment {{$labels.namespace}}/{{$labels.deployment}} is violating "spec.strategy.rollingupdate.maxunavailable".
Count of unavailable replicas in Deployment {{$labels.namespace}}/{{$labels.deployment}} is violating “spec.strategy.rollingupdate.maxunavailable”.
Currently at: {{ .Value }} unavailable replica(s) Threshold at: {{ printf "extended_monitoring_deployment_threshold{threshold="replicas-not-ready", namespace="%s", deployment="%s"}" $labels.namespace $labels.deployment | query | first | value }} unavailable replica(s)
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="Deployment", controller_name="%s"}" $labels.namespace $labels.namespace $labels.deployment | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
Count of ready replicas in StatefulSet {{$labels.namespace}}/{{$labels.statefulset}} is at zero.
Count of ready replicas in StatefulSet {{$labels.namespace}}/{{$labels.statefulset}} is at zero.
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="StatefulSet", controller_name="%s"}" $labels.namespace $labels.namespace $labels.statefulset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
Count of unavailable replicas in StatefulSet {{$labels.namespace}}/{{$labels.statefulset}} is above threshold.
Count of unavailable replicas in StatefulSet {{$labels.namespace}}/{{$labels.statefulset}} is above threshold.
Currently at: {{ .Value }} unavailable replica(s) Threshold at: {{ printf "extended_monitoring_statefulset_threshold{threshold="replicas-not-ready", namespace="%s", statefulset="%s"}" $labels.namespace $labels.statefulset | query | first | value }} unavailable replica(s)
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="StatefulSet", controller_name="%s"}" $labels.namespace $labels.namespace $labels.statefulset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
LoadAverageHigh
CE
S4
The load average on the {{ $labels.node }} Node is too high.
For the last 5 minutes, the load average on the {{ $labels.node }} Node has been higher than {{ printf "extended_monitoring_node_threshold{threshold="load-average-per-core-critical", node="%s"}" $labels.node | query | first | value }} per core. There are more processes in the queue than the CPU can handle; probably, some process has created too many threads or child processes, or the CPU is overloaded.
-
LoadAverageHigh
CE
S5
The load average on the {{ $labels.node }} Node is too high.
For the last 30 minutes, the load average on the {{ $labels.node }} Node has been higher than or equal to {{ printf "extended_monitoring_node_threshold{threshold="load-average-per-core-warning", node="%s"}" $labels.node | query | first | value }} per core. There are more processes in the queue than the CPU can handle; probably, some process has created too many threads or child processes, or the CPU is overloaded.
-
NodeDiskBytesUsage
CE
S5
Node disk "{{$labels.device}}" on mountpoint "{{$labels.mountpoint}}" is using more than {{ printf "extended_monitoring_node_threshold{threshold="disk-bytes-critical", node="%s"}" $labels.node | query | first | value }}% of storage capacity. Currently at: {{ .Value }}%
-
NodeDiskBytesUsage
CE
S6
Node disk "{{$labels.device}}" on mountpoint "{{$labels.mountpoint}}" is using more than {{ printf "extended_monitoring_node_threshold{threshold="disk-bytes-warning", node="%s"}" $labels.node | query | first | value }}% of the storage capacity. Currently at: {{ .Value }}%
Node disk “{{$labels.device}}” on mountpoint “{{$labels.mountpoint}}” is using more than {{ printf "extended_monitoring_node_threshold{threshold="disk-bytes-warning", node="%s"}" $labels.node | query | first | value }}% of the storage capacity. Currently at: {{ .Value }}%
Retrieve the disk usage info on the node:
ncdu -x {{$labels.mountpoint}}
If the output shows high disk usage in the /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ directory, use the following command to show the pods with the highest usage:
crictl stats -o json | jq '.stats[] | select((.writableLayer.usedBytes.value | tonumber) > 1073741824) | { meta: .attributes.labels, diskUsage: ((.writableLayer.usedBytes.value | tonumber) / 1073741824 * 100 | round / 100 | tostring + " GiB")}'
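Before drilling down with ncdu, it can help to see which mountpoints on the node are above a given usage level. The sketch below parses the POSIX `df -P` output; the function name df_over_threshold is hypothetical, and the threshold you pass is your own choice rather than the value configured for the alert.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints
# "<mountpoint> <used %>" for every filesystem whose usage is
# strictly greater than the threshold passed as $1.
df_over_threshold() {
  awk -v th="$1" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)
    if (pct + 0 > th + 0) print $6, pct
  }'
}
```

On the node, run it as `df -P | df_over_threshold 80`. Mountpoints that contain spaces are not handled by this sketch.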
-
NodeDiskInodesUsage
CE
S5
Node disk "{{$labels.device}}" on mountpoint "{{$labels.mountpoint}}" is using more than {{ printf "extended_monitoring_node_threshold{threshold="disk-inodes-critical", node="%s"}" $labels.node | query | first | value }}% of inode capacity. Currently at: {{ .Value }}%
-
NodeDiskInodesUsage
CE
S6
Node disk "{{$labels.device}}" on mountpoint "{{$labels.mountpoint}}" is using more than {{ printf "extended_monitoring_node_threshold{threshold="disk-inodes-warning", node="%s"}" $labels.node | query | first | value }}% of inode capacity. Currently at: {{ .Value }}%
-
PersistentVolumeClaimBytesUsage
CE
S4
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-bytes-critical", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume storage capacity.
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-bytes-critical", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume storage capacity. Currently at: {{ .Value }}%
PersistentVolumeClaim is used by the following pods: {{range $index, $result := (print "kube_pod_spec_volumes_persistentvolumeclaims_info{namespace='" $labels.namespace "', persistentvolumeclaim='" $labels.persistentvolumeclaim "'}" | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
PersistentVolumeClaimBytesUsage
CE
S5
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-bytes-warning", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume storage capacity.
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-bytes-warning", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume storage capacity. Currently at: {{ .Value }}%
PersistentVolumeClaim is used by the following pods: {{range $index, $result := (print "kube_pod_spec_volumes_persistentvolumeclaims_info{namespace='" $labels.namespace "', persistentvolumeclaim='" $labels.persistentvolumeclaim "'}" | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
PersistentVolumeClaimInodesUsed
CE
S4
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-inodes-critical", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume inode capacity.
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-inodes-critical", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume inode capacity. Currently at: {{ .Value }}%
PersistentVolumeClaim is used by the following pods: {{range $index, $result := (print "kube_pod_spec_volumes_persistentvolumeclaims_info{namespace='" $labels.namespace "', persistentvolumeclaim='" $labels.persistentvolumeclaim "'}" | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
PersistentVolumeClaimInodesUsed
CE
S5
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-inodes-warning", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume inode capacity.
PersistentVolumeClaim {{$labels.namespace}}/{{$labels.persistentvolumeclaim}} is using more than {{ printf "extended_monitoring_pod_threshold{threshold="disk-inodes-warning", namespace="%s", pod="%s"}" $labels.namespace $labels.pod | query | first | value }}% of volume inode capacity. Currently at: {{ .Value }}%
PersistentVolumeClaim is used by the following pods: {{range $index, $result := (print "kube_pod_spec_volumes_persistentvolumeclaims_info{namespace='" $labels.namespace "', persistentvolumeclaim='" $labels.persistentvolumeclaim "'}" | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
-
StatefulSetAuthenticationFailure
CE
S7
Unable to log in to the container registry using imagePullSecrets for the {{ $labels.image }} image.
Unable to log in to the container registry using imagePullSecrets for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace, in the {{ $labels.name }} StatefulSet, in the {{ $labels.container }} container.
-
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified.
Insufficient privileges to pull the {{ $labels.image }} image using the imagePullSecrets specified in the {{ $labels.namespace }} Namespace, in the {{ $labels.name }} StatefulSet, in the {{ $labels.container }} container.
-
StatefulSetBadImageFormat
CE
S7
The {{ $labels.image }} image has an incorrect name.
Check whether the {{ $labels.image }} image name is spelled correctly in the {{ $labels.namespace }} Namespace, in the {{ $labels.name }} StatefulSet, in the {{ $labels.container }} container.
-
StatefulSetImageAbsent
CE
S7
The {{ $labels.image }} image is missing from the registry.
Check whether the {{ $labels.image }} image is available in the registry. The image is used in the {{ $labels.namespace }} Namespace, in the {{ $labels.name }} StatefulSet, in the {{ $labels.container }} container.
-
The container registry is not available for the {{ $labels.image }} image.
The container registry is not available for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace, in the {{ $labels.name }} StatefulSet, in the {{ $labels.container }} container.
-
StatefulSetUnknownError
CE
S7
An unknown error occurred for the {{ $labels.image }} image.
An unknown error occurred for the {{ $labels.image }} image in the {{ $labels.namespace }} Namespace, in the {{ $labels.name }} StatefulSet, in the {{ $labels.container }} container.
Refer to the exporter logs:
kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
Module flow-schema
-
KubernetesAPFRejectRequests
CE
S9
APF flow schema d8-serviceaccounts has rejected API requests.
To show APF schema queue requests, use the following expression:
apiserver_flowcontrol_current_inqueue_requests{flow_schema="d8-serviceaccounts"}
Attention: This is an experimental alert!
Module ingress-nginx
-
D8NginxIngressKruiseControllerPodIsRestartingTooOften
CE
S8
Too many kruise controller restarts have been detected in d8-ingress-nginx namespace.
The number of restarts in the last hour: {{ $value }}. Excessive kruise controller restarts indicate that something is wrong. Normally, it should be up and running all the time.
The recommended course of action:
- Check any events regarding kruise-controller-manager in the d8-ingress-nginx namespace in case there were issues related to the nodes the manager runs on or a memory shortage (OOM):
kubectl -n d8-ingress-nginx get events | grep kruise-controller-manager
- Analyze the descriptions of the controller’s pods to check which containers were restarted and what the possible reasons were (exit codes, etc.):
kubectl -n d8-ingress-nginx describe pod -lapp=kruise,control-plane=controller-manager
- If the kruise container was restarted, list its logs to check for meaningful errors:
kubectl -n d8-ingress-nginx logs -lapp=kruise,control-plane=controller-manager -c kruise
-
DeprecatedGeoIPVersion
CE
S9
Deprecated GeoIP version 1 is being used in the cluster.
There is an IngressNginxController and/or an Ingress object that uses variables of the Nginx GeoIPv1 module. The module is deprecated, and its support is discontinued in Ingress Nginx Controller version 1.10 and higher. It is recommended to update your configuration to use the GeoIPv2 module. Use the following command to get the list of the IngressNginxControllers that contain GeoIPv1 variables:
kubectl get ingressnginxcontrollers.deckhouse.io -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | .metadata.name'
Use the following command to get the list of the Ingress objects that contain GeoIPv1 variables:
kubectl get ingress -A -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | "\(.metadata.namespace)/\(.metadata.name)"' | sort | uniq
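To check a single configuration snippet locally before applying it, the same pattern can be used with grep. This is a sketch that reuses the regular expression from the commands above; the function name uses_geoip_v1 is hypothetical.

```shell
# Hypothetical helper: succeeds if the text passed as $1 references
# a GeoIPv1 variable (same pattern as in the jq filters above).
uses_geoip_v1() {
  printf '%s\n' "$1" | grep -Eq '\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+'
}
```

For example, `uses_geoip_v1 'if ($geoip_country_code = RU) { return 403; }' && echo "GeoIPv1 variable found"` reports a match, while a snippet that only uses GeoIPv2 variable names does not.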
-
NginxIngressConfigTestFailed
CE
S4
Config test failed on NGINX Ingress {{ $labels.controller }} in the {{ $labels.controller_namespace }} Namespace.
The configuration testing (nginx -t) of the {{ $labels.controller }} Ingress controller in the {{ $labels.controller_namespace }} Namespace has failed.
The recommended course of action:
- Check the controller’s logs:
kubectl -n {{ $labels.controller_namespace }} logs {{ $labels.controller_pod }} -c controller
- Find the newest Ingress in the cluster:
kubectl get ingress --all-namespaces --sort-by="metadata.creationTimestamp"
- Probably, there is an error in configuration-snippet or server-snippet.
-
NginxIngressDaemonSetNotUpToDate
CE
S9
There are {{ .Value }} outdated Pods in the {{ $labels.namespace }}/{{ $labels.daemonset }} Ingress Nginx DaemonSet for the last 20 minutes.
There are {{ .Value }} outdated Pods in the {{ $labels.namespace }}/{{ $labels.daemonset }} Ingress Nginx DaemonSet for the last 20 minutes.
The recommended course of action:
- Check the DaemonSet’s status:
kubectl -n {{ $labels.namespace }} get ads {{ $labels.daemonset }}
- Analyze the DaemonSet’s description:
kubectl -n {{ $labels.namespace }} describe ads {{ $labels.daemonset }}
- If the Number of Nodes Scheduled with Up-to-date Pods parameter does not match Current Number of Nodes Scheduled, check the nodeSelector and toleration settings of the pertinent Ingress Nginx Controller and compare them to the labels and taints of the relevant nodes.
-
Count of available replicas in NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is at zero.
Count of available replicas in NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} is at zero.
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="DaemonSet", controller_name="%s"}" $labels.namespace $labels.namespace $labels.daemonset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
The following command may help identify problematic nodes, provided you know where the DaemonSet should be scheduled (using a label selector for pods may also help):
kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
-
Some replicas of NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} are unavailable.
Some replicas of NGINX Ingress DaemonSet {{$labels.namespace}}/{{$labels.daemonset}} are unavailable. Currently at: {{ .Value }} unavailable replica(s)
List of unavailable Pod(s): {{range $index, $result := (printf "(max by (namespace, pod) (kube_pod_status_ready{namespace="%s", condition!="true"} == 1)) * on (namespace, pod) kube_controller_pod{namespace="%s", controller_type="DaemonSet", controller_name="%s"}" $labels.namespace $labels.namespace $labels.daemonset | query)}}{{if not (eq $index 0)}}, {{ end }}{{ $result.Labels.pod }}{{ end }}
The following command may help identify problematic nodes, provided you know where the DaemonSet should be scheduled (using a label selector for pods may also help):
kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
-
NginxIngressPodIsRestartingTooOften
CE
S4
Too many NGINX Ingress restarts have been detected.
The number of restarts in the last hour: {{ $value }}. Excessive NGINX Ingress restarts indicate that something is wrong. Normally, it should be up and running all the time.
-
NginxIngressProtobufExporterHasErrors
CE
S8
The Ingress Nginx sidecar container with protobuf_exporter has {{ $labels.type }} errors.
The Ingress Nginx sidecar container with protobuf_exporter has {{ $labels.type }} errors.
Check the Ingress controller’s logs:
kubectl -n d8-ingress-nginx logs $(kubectl -n d8-ingress-nginx get pods -l app=controller,name={{ $labels.controller }} -o wide | grep {{ $labels.node }} | awk '{print $1}') -c protobuf-exporter
-
NginxIngressSslExpired
CE
S4
Certificate has expired.
SSL certificate for {{ $labels.host }} in {{ $labels.namespace }} has expired. You can verify the certificate with the following command:
kubectl -n {{ $labels.namespace }} get secret {{ $labels.secret_name }} -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates
The https://{{ $labels.host }} version of the site doesn’t work!
-
NginxIngressSslWillExpire
CE
S5
Certificate expires soon.
SSL certificate for {{ $labels.host }} in {{ $labels.namespace }} will expire in less than 2 weeks. You can verify the certificate with the following command:
kubectl -n {{ $labels.namespace }} get secret {{ $labels.secret_name }} -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates
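The two-week window can also be checked directly with the -checkend option of openssl x509 instead of reading the dates by eye. The sketch below assumes the decoded certificate is piped to stdin, as in the command above; the function names days_to_seconds and cert_expires_within are hypothetical.

```shell
# Hypothetical helper: converts a number of days to seconds.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# Hypothetical helper: succeeds if the PEM certificate on stdin expires
# within the given number of days (default: 14).
# Note: it also succeeds when openssl cannot parse the input,
# so check the certificate output separately if in doubt.
cert_expires_within() {
  ! openssl x509 -noout -checkend "$(days_to_seconds "${1:-14}")" >/dev/null
}
```

Append `| cert_expires_within 14 && echo "expires within 2 weeks"` to the kubectl and jq part of the command above, in place of the openssl call.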
Module istio
-
D8IstioActualDataPlaneVersionNotEqualDesired
EE
S8
There are Pods with istio data-plane version {{$labels.version}}, but desired version is {{$labels.desired_version}}
There are Pods in Namespace {{$labels.namespace}} with istio data-plane version {{$labels.version}}, but the desired one is {{$labels.desired_version}}. Impact: the istio version will change after the Pod restarts. Cheat sheet:
### namespace-wide configuration
# istio.io/rev=vXYZ: use specific revision
# istio-injection=enabled: use global revision
kubectl get ns {{$labels.namespace}} --show-labels
### pod-wide configuration
kubectl -n {{$labels.namespace}} get pods -l istio.io/rev={{$labels.desired_revision}}
-
D8IstioActualVersionIsNotInstalled
EE
S4
control-plane version for Pod with already injected sidecar isn't installed
There are pods with an injected sidecar of version {{$labels.version}} (revision {{$labels.revision}}) in namespace {{$labels.namespace}}, but the control-plane version isn’t installed. Consider installing it or changing the Namespace or Pod configuration. Impact: Pods have lost their sync with the k8s state. Getting orphaned pods:
kubectl -n {{ $labels.namespace }} get pods -l 'service.istio.io/canonical-name' -o json | jq --arg revision {{ $labels.revision }} '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
-
D8IstioAdditionalControlplaneDoesntWork
CE
S4
Additional controlplane doesn't work.
Additional istio controlplane {{$labels.label_istio_io_rev}} doesn’t work. Impact: sidecar injection for Pods with the {{$labels.label_istio_io_rev}} revision doesn’t work.
kubectl get pods -n d8-istio -l istio.io/rev={{$labels.label_istio_io_rev}}
-
D8IstioDataPlaneVersionMismatch
EE
S8
There are Pods with data-plane version different from control-plane one.
There are Pods in the {{$labels.namespace}} namespace with istio data-plane version {{$labels.full_version}}, which differs from the control-plane one, {{$labels.desired_full_version}}. Consider restarting the affected Pods. Use the following PromQL query to get the list:
max by (namespace, dataplane_pod) (d8_istio_dataplane_metadata{full_version="{{$labels.full_version}}"})
Also consider using the automatic istio data-plane update described in the documentation: https://deckhouse.io/products/kubernetes-platform/documentation/v1/modules/istio/examples.html#upgrading-istio
-
D8IstioDataPlaneWithoutIstioInjectionConfigured
EE
S4
There are Pods with istio sidecars, but without istio-injection configured
There are Pods in the {{$labels.namespace}} Namespace with istio sidecars, but istio-injection isn’t configured. Impact: Pods will lose their istio sidecars after re-creation. Getting affected Pods:
kubectl -n {{$labels.namespace}} get pods -o json | jq -r --arg revision {{$labels.revision}} '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
-
D8IstioDeprecatedIstioVersionInstalled
CE
There is deprecated istio version installed
A deprecated istio version {{$labels.version}} is installed. Impact: support for this version will be removed in future deckhouse releases. The higher the alert severity, the higher the probability that support will be dropped. Read documentation on upgrading Istio.
-
D8IstioDesiredVersionIsNotInstalled
EE
S6
Desired control-plane version isn't installed
There is a desired istio control-plane version {{$labels.desired_version}} (revision {{$labels.revision}}) configured for pods in namespace {{$labels.namespace}}, but the version isn’t installed. Consider installing it or changing the Namespace or Pod configuration. Impact: Pods won’t be able to re-create in the {{$labels.namespace}} Namespace. Cheat sheet:
### namespace-wide configuration
# istio.io/rev=vXYZ: use specific revision
# istio-injection=enabled: use global revision
kubectl get ns {{$labels.namespace}} --show-labels
### pod-wide configuration
kubectl -n {{$labels.namespace}} get pods -l istio.io/rev={{$labels.revision}}
-
D8IstioFederationMetadataEndpointDoesntWork
EE
S6
Federation metadata endpoint failed
Metadata endpoint {{$labels.endpoint}} for IstioFederation {{$labels.federation_name}} could not be fetched by the d8 hook. Reproducing the request to the public endpoint:
curl {{$labels.endpoint}}
Reproducing the request to private endpoints (run from the deckhouse pod):
KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
REMOTE_CLUSTER_UUID="$(kubectl get istiofederation {{$labels.federation_name}} -o json | jq -r .status.metadataCache.public.clusterUUID)"
TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-federation --ttl 1h)"
curl -H "Authorization: Bearer $TOKEN" {{$labels.endpoint}}
-
D8IstioGlobalControlplaneDoesntWork
CE
S4
Global controlplane doesn't work.
Global istio controlplane {{$labels.label_istio_io_rev}} doesn’t work. Impact: sidecar injection for Pods with the global revision doesn’t work, and the validating webhook for istio resources is absent.
kubectl get pods -n d8-istio -l istio.io/rev={{$labels.label_istio_io_rev}}
-
D8IstioMulticlusterMetadataEndpointDoesntWork
EE
S6
Multicluster metadata endpoint failed
Metadata endpoint {{$labels.endpoint}} for IstioMulticluster {{$labels.multicluster_name}} could not be fetched by the d8 hook. Reproducing the request to the public endpoint:
curl {{$labels.endpoint}}
Reproducing the request to private endpoints (run from the deckhouse pod):
KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
REMOTE_CLUSTER_UUID="$(kubectl get istiomulticluster {{$labels.multicluster_name}} -o json | jq -r .status.metadataCache.public.clusterUUID)"
TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-multicluster --ttl 1h)"
curl -H "Authorization: Bearer $TOKEN" {{$labels.endpoint}}
-
D8IstioMulticlusterRemoteAPIHostDoesntWork
EE
S6
Multicluster remote api host failed
Remote api host {{$labels.api_host}} for IstioMulticluster {{$labels.multicluster_name}} has failed the healthcheck by the d8 monitoring hook.
Reproducing (run from the deckhouse pod):
TOKEN="$(deckhouse-controller module values istio -o json | jq -r --arg ah {{$labels.api_host}} '.internal.multiclusters[]| select(.apiHost == $ah)| .apiJWT ')"
curl -H "Authorization: Bearer $TOKEN" https://{{$labels.api_host}}/version
-
D8IstioOperatorReconcileError
CE
S5
istio-operator is unable to reconcile istio control-plane setup.
There is an error in the istio-operator reconciliation loop. Check the logs:
kubectl -n d8-istio logs -l app=operator,revision={{$labels.revision}}
-
D8IstioPodsWithoutIstioSidecar
EE
S4
There are Pods without istio sidecars, but with istio-injection configured
There is a Pod {{$labels.dataplane_pod}} in the {{$labels.namespace}} Namespace without istio sidecars, but istio-injection is configured. Getting affected Pods:
kubectl -n {{$labels.namespace}} get pods -l '!service.istio.io/canonical-name' -o json | jq -r '.items[] | select(.metadata.annotations."sidecar.istio.io/inject" != "false") | .metadata.name'
-
D8IstioVersionIsIncompatibleWithK8sVersion
CE
S3
The installed istio version is incompatible with the k8s version
The current istio version {{$labels.istio_version}} may not work properly with the current k8s version {{$labels.k8s_version}}, because it is not officially supported. Please upgrade istio as soon as possible. Read documentation on upgrading Istio.
-
IstioIrrelevantExternalServiceFound
CE
S5
Found external service with irrelevant ports spec
There is a service in the {{$labels.namespace}} namespace with the name {{$labels.name}} that has an irrelevant ports spec. The .spec.ports[] field does not make sense for services of type ExternalName, but for ExternalName services with ports istio renders a “0.0.0.0:port” listener that catches all traffic to the port. This is a problem for services outside the istio registry.
It is recommended to remove the ports section (.spec.ports). It is safe.
Module kube-dns
-
KubernetesCoreDNSHasCriticalErrors
CE
S5
CoreDNS has critical errors.
CoreDNS pod {{$labels.pod}} has at least one critical error. To debug the problem, look into container logs:
kubectl -n kube-system logs {{$labels.pod}}
Module log-shipper
-
D8LogShipperAgentNotScheduledInCluster
CE
S7
Pods of log-shipper-agent cannot be scheduled in the cluster.
A number of log-shipper-agents are not scheduled.
To check the state of the d8-log-shipper/log-shipper-agent DaemonSet:
kubectl -n d8-log-shipper get daemonsets --selector=app=log-shipper
To check the state of the d8-log-shipper/log-shipper-agent Pods:
kubectl -n d8-log-shipper get pods --selector=app=log-shipper-agent
The following command might help identify the problematic nodes, provided you know where the DaemonSet should be scheduled in the first place:
kubectl -n d8-log-shipper get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="log-shipper-agent")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
-
Required authorization params for ClusterLogDestination.
Found ClusterLogDestination resource {{$labels.resource_name}} without authorization params. You should add authorization params to the ClusterLogDestination resource.
-
D8LogShipperCollectLogErrors
CE
S4
Pods of log-shipper-agent cannot collect logs to the {{ $labels.component_id }} on the {{ $labels.node }} node.
The {{ $labels.host }} log-shipper agent on the {{ $labels.node }} node failed to collect metrics for more than 10 minutes. The reason is {{ $labels.error_type }} errors that occurred during the {{ $labels.stage }} stage while reading {{ $labels.component_type }}.
Consider checking the logs of the pod or following the advanced debug instructions.
kubectl -n d8-log-shipper logs {{ $labels.host }} -c vector
-
D8LogShipperDestinationErrors
CE
S4
Pods of log-shipper-agent cannot send logs to the {{ $labels.component_id }} on the {{ $labels.node }} node.
Logs do not reach their destination: the {{ $labels.host }} log-shipper agent on the {{ $labels.node }} node has been unable to send logs for more than 10 minutes. The reason is {{ $labels.error_type }} errors that occurred during the {{ $labels.stage }} stage while sending logs to {{ $labels.component_type }}.
Consider checking the logs of the pod or following the advanced debug instructions.
kubectl -n d8-log-shipper logs {{ $labels.host }} -c vector
-
D8LogShipperLogsDroppedByRateLimit
CE
S4
Pods of log-shipper-agent drop logs to the {{ $labels.component_id }} on the {{ $labels.node }} node.
Rate limit rules are applied: the log-shipper agent on the {{ $labels.node }} node has been dropping logs for more than 10 minutes.
Consider checking logs of the pod or follow advanced debug instructions.
kubectl -n d8-log-shipper get pods -o wide | grep {{ $labels.node }}
Module metallb
-
D8MetalLBBGPSessionDown
EE
S4
MetalLB BGP session down.
{{ $labels.job }} — MetalLB {{ $labels.container }} on {{ $labels.pod}} has BGP session {{ $labels.peer }} down. Details are in logs:
kubectl -n d8-metallb logs daemonset/speaker -c speaker
-
D8MetalLBConfigNotLoaded
EE
S4
MetalLB config not loaded.
{{ $labels.job }} — MetalLB {{ $labels.container }} on {{ $labels.pod}} has not loaded. To figure out the problem, check controller logs:
kubectl -n d8-metallb logs deploy/controller -c controller
-
D8MetalLBConfigStale
EE
S4
MetalLB running on a stale configuration, because the latest config failed to load.
{{ $labels.job }} — MetalLB {{ $labels.container }} on {{ $labels.pod}} has run on a stale configuration, because the latest config failed to load. To figure out the problem, check controller logs:
kubectl -n d8-metallb logs deploy/controller -c controller
-
D8MetallbNotSupportedServiceAnnotationsDetected
SE
S4
D8 MetalLB settings are outdated
The ‘{{$labels.annotation}}’ annotation is deprecated for the ‘{{$labels.name}}’ Service in the ‘{{$labels.namespace}}’ namespace.
The following Service annotations no longer take effect:
- metallb.universe.tf/ip-allocated-from-pool: remove it.
- metallb.universe.tf/address-pool: use the .spec.loadBalancerClass parameter or the network.deckhouse.io/metal-load-balancer-class annotation with a reference to the corresponding MetalLoadBalancerClass.
- metallb.universe.tf/loadBalancerIPs: use network.deckhouse.io/load-balancer-ips: <ip> instead.
Important! Existing Deckhouse LoadBalancer Services were migrated automatically, but new ones will not be.
-
D8MetallbUpdateMCVersionRequired
SE
S5
D8 MetalLB settings are outdated
The ModuleConfig version for MetalLB needs to be increased.
-
L2LoadBalancerOrphanServiceFound
SE
S4
Found orphan service with irrelevant L2LoadBalancer name
There is an orphan Service named {{$labels.name}} in the {{$labels.namespace}} namespace with an irrelevant L2LoadBalancer name.
It is recommended to check the L2LoadBalancer name in the annotations (network.deckhouse.io/l2-load-balancer-name).
Module monitoring-applications
-
D8OldPrometheusTargetFormat
FE
S6
Services with the prometheus-target label are used to collect metrics in the cluster.
Services with the prometheus-target label are used to collect metrics in the cluster.
Use the following command to filter them:
kubectl get service --all-namespaces --show-labels | grep prometheus-target
Note that the label format has changed. You need to replace the prometheus-target label with prometheus.deckhouse.io/target.
Module monitoring-custom
-
CustomPodMonitorFoundInCluster
CE
S9
There are PodMonitors in Deckhouse namespace that were not created by Deckhouse.
There are PodMonitors in Deckhouse namespace that were not created by Deckhouse.
Use the following command for filtering:
kubectl get podmonitors --all-namespaces -l heritage!=deckhouse
They must be moved from the Deckhouse namespace to a user namespace (one that is not labeled as heritage: deckhouse).
A detailed description of the metric collecting process is available in the documentation.
-
CustomServiceMonitorFoundInD8Namespace
CE
S9
There are ServiceMonitors in Deckhouse namespace that were not created by Deckhouse.
There are ServiceMonitors in Deckhouse namespace that were not created by Deckhouse.
Use the following command for filtering:
kubectl get servicemonitors --all-namespaces -l heritage!=deckhouse
They must be moved from the Deckhouse namespace to a user namespace (one that is not labeled as heritage: deckhouse).
A detailed description of the metric collecting process is available in the documentation.
-
D8CustomPrometheusRuleFoundInCluster
CE
S9
There are PrometheusRules in the cluster that were not created by Deckhouse.
There are PrometheusRules in the cluster that were not created by Deckhouse.
Use the following command for filtering:
kubectl get prometheusrules --all-namespaces -l heritage!=deckhouse
They must be abandoned and replaced with the CustomPrometheusRules object.
Refer to the documentation for information about adding alerts and/or recording rules.
-
D8OldPrometheusCustomTargetFormat
CE
S9
Services with the prometheus-custom-target label are used to collect metrics in the cluster.
Services with the prometheus-custom-target label are used to collect metrics in the cluster.
Use the following command for filtering:
kubectl get service --all-namespaces --show-labels | grep prometheus-custom-target
Note that the label format has changed. You need to change the prometheus-custom-target label to prometheus.deckhouse.io/custom-target.
For more information, refer to the documentation.
-
D8ReservedNodeLabelOrTaintFound
CE
S6
Node {{ $labels.name }} needs fixing up
Node {{ $labels.name }} uses:
- a reserved metadata.labels key node-role.deckhouse.io/ with an ending not in (system|frontend|monitoring|_deckhouse_module_name_);
- or a reserved spec.taints key dedicated.deckhouse.io with a value not in (system|frontend|monitoring|_deckhouse_module_name_).
Module monitoring-deckhouse
-
D8CNIEnabledMoreThanOne
CE
S2
More than one CNI is enabled in the cluster.
Several CNIs are enabled in the cluster. For the cluster to work correctly, only one CNI must be enabled.
-
D8DeckhouseConfigInvalid
CE
S5
Deckhouse config is invalid.
Deckhouse config contains errors.
Check the Deckhouse logs by running:
kubectl -n d8-system logs -f -l app=deckhouse
Edit the Deckhouse global configuration by running:
kubectl edit mc global
or the configuration of a specific module by running:
kubectl edit mc <MODULE_NAME>
-
D8DeckhouseCouldNotDeleteModule
CE
S4
Deckhouse is unable to delete the {{ $labels.module }} module.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseCouldNotDiscoverModules
CE
S4
Deckhouse is unable to discover modules.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseCouldNotRunGlobalHook
CE
S5
Deckhouse is unable to run the {{ $labels.hook }} global hook.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseCouldNotRunModule
CE
S4
Deckhouse is unable to start the {{ $labels.module }} module.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseCouldNotRunModuleHook
CE
S7
Deckhouse is unable to run the {{ $labels.module }}/{{ $labels.hook }} module hook.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseCustomTargetDown
CE
S4
Prometheus is unable to scrape custom metrics generated by Deckhouse hooks.
-
D8DeckhouseDeprecatedConfigmapManagedByArgoCD
CE
S4
Deprecated deckhouse configmap managed by Argo CD
The deckhouse configmap is no longer used. You need to remove configmap “d8-system/deckhouse” from ArgoCD
-
D8DeckhouseGlobalHookFailsTooOften
CE
S9
The {{ $labels.hook }} Deckhouse global hook crashes way too often.
The {{ $labels.hook }} has failed in the last __SCRAPE_INTERVAL_X_4__.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseHasNoAccessToRegistry
CE
S7
Deckhouse is unable to connect to the registry.
Deckhouse is unable to connect to the registry (registry.deckhouse.io in most cases) to check for a new Docker image (checks are performed every 15 seconds). Deckhouse does not have access to the registry; automatic updates are not available.
Usually, this alert means that the Deckhouse Pod is having difficulties with connecting to the Internet.
-
D8DeckhouseIsHung
CE
S4
Deckhouse is down.
Deckhouse is probably down since the deckhouse_live_ticks metric in Prometheus is no longer increasing (it is supposed to increment every 10 seconds).
-
D8DeckhouseIsNotOnReleaseChannel
CE
S9
Deckhouse in the cluster is not subscribed to one of the regular release channels.
Deckhouse is on a custom branch instead of one of the regular release channels.
It is recommended that Deckhouse be subscribed to one of the following channels: Alpha, Beta, EarlyAccess, Stable, RockSolid.
Use the command below to find out what release channel is currently in use:
kubectl -n d8-system get deploy deckhouse -o json | jq '.spec.template.spec.containers[0].image' -r
Subscribe the cluster to one of the regular release channels.
-
D8DeckhouseModuleHookFailsTooOften
CE
S9
The {{ $labels.module }}/{{ $labels.hook }} Deckhouse hook crashes way too often.
The {{ $labels.hook }} hook of the {{ $labels.module }} module has failed in the last __SCRAPE_INTERVAL_X_4__.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseModuleUpdatePolicyNotFound
CE
S5
Module update policy not found for {{ $labels.module_release }}
Module update policy not found for {{ $labels.module_release }}
Remove the label from the ModuleRelease:
kubectl label mr {{ $labels.module_release }} modules.deckhouse.io/update-policy-
A new suitable policy will be detected automatically.
-
D8DeckhousePodIsNotReady
CE
S4
The Deckhouse Pod is NOT Ready.
-
D8DeckhousePodIsNotRunning
CE
S4
The Deckhouse Pod is NOT Running.
-
D8DeckhousePodIsRestartingTooOften
CE
S9
Excessive Deckhouse restarts detected.
The number of restarts in the last hour: {{ $value }}.
Excessive Deckhouse restarts indicate that something is wrong. Normally, Deckhouse should be up and running all the time.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseQueueIsHung
CE
S7
The {{ $labels.queue }} Deckhouse queue has hung; there are {{ $value }} task(s) in the queue.
Deckhouse cannot finish processing of the {{ $labels.queue }} queue with {{ $value }} tasks piled up.
Refer to the corresponding logs:
kubectl -n d8-system logs -f -l app=deckhouse
-
D8DeckhouseSelfTargetAbsent
CE
S4
There is no Deckhouse target in Prometheus.
-
D8DeckhouseSelfTargetDown
CE
S4
Prometheus is unable to scrape Deckhouse metrics.
-
D8DeckhouseWatchErrorOccurred
CE
S5
Possible apiserver connection error in the client-go informer, check logs and snapshots.
Error occurred in the client-go informer, possible problems with connection to apiserver.
Check Deckhouse logs for more information by running:
kubectl -n d8-system logs deploy/deckhouse | grep error | grep -i watch
This alert is an attempt to detect the correlation between the faulty snapshot invalidation and apiserver connection errors, especially for the handle-node-template hook in the node-manager module. Check the difference between the snapshot and actual node objects for this hook:
diff -u <(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'|sort) <(kubectl -n d8-system exec svc/deckhouse-leader -c deckhouse -- deckhouse-controller module snapshots node-manager -o json | jq '."040-node-manager/hooks/handle_node_templates.go"' | jq '.nodes.snapshot[] | .filterResult.Name' -r | sort)
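The comparison performed by the diff command above boils down to a set difference between two lists of node names: the nodes the cluster reports and the nodes held in the hook snapshot. A minimal Python sketch of that comparison, assuming both lists have already been collected; the function name and the sample node names are hypothetical and are not part of Deckhouse:

```python
def compare_node_sets(cluster_nodes, snapshot_nodes):
    """Return (missing_in_snapshot, stale_in_snapshot) as sorted lists.

    missing_in_snapshot: nodes that exist in the cluster but not in the snapshot.
    stale_in_snapshot: nodes that the snapshot still holds but the cluster lacks.
    """
    cluster = set(cluster_nodes)
    snapshot = set(snapshot_nodes)
    return sorted(cluster - snapshot), sorted(snapshot - cluster)


if __name__ == "__main__":
    # Hypothetical node names, used only for illustration.
    missing, stale = compare_node_sets(
        ["master-0", "worker-0", "worker-1"],
        ["master-0", "worker-1", "worker-9"],
    )
    print("missing in snapshot:", missing)
    print("stale in snapshot:", stale)
```

Two empty lists mean that the snapshot and the actual node objects agree.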
-
D8HasModuleConfigAllowedToDisable
CE
S4
The ModuleConfig annotation that allows disabling the module is set.
The ModuleConfig is waiting to be disabled.
It is recommended to keep your module configurations free of approval annotations.
If you ignore this alert and do not clear the annotation, the module may be accidentally removed from the cluster.
Removing a module from a cluster can have irreparable consequences.
To stop this alert, run:
kubectl annotate moduleconfig {{ $labels.module }} modules.deckhouse.io/allow-disabling-
-
D8NodeHasDeprecatedOSVersion
CE
S4
Nodes have deprecated OS versions.
Some nodes have deprecated OS versions. Update these nodes to a supported OS version.
To list the affected nodes, use the following expression in Prometheus:
kube_node_info{os_image=~"Ubuntu 18.04.*|Debian GNU/Linux 10.*|CentOS Linux 7.*"}
-
D8NodeHasUnmetKernelRequirements
CE
S4
Nodes have unmet kernel requirements
Some nodes have unmet kernel constraints, which means that some modules cannot run on those nodes. Current kernel requirements:
- For the Cilium module, the kernel should be >= 4.9.17.
- For Cilium with the Istio module, the kernel should be >= 5.7.
- For Cilium with the OpenVPN module, the kernel should be >= 5.7.
- For Cilium with the Node-local-dns module, the kernel should be >= 5.7.
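The kernel requirements above amount to a version comparison. A minimal Python sketch, assuming the kernel release string has the usual `major.minor[.patch]-suffix` shape (as printed by `uname -r`); the function names are hypothetical, and this is not how Deckhouse itself computes the d8_node_kernel_does_not_satisfy_requirements metric:

```python
import re

# Minimum kernel versions quoted in the alert description.
CILIUM_MIN = (4, 9, 17)
CILIUM_COMBINED_MIN = (5, 7, 0)  # Cilium with Istio, OpenVPN, or Node-local-dns


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a string such as '5.4.0-150-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def has_unmet_requirements(release, cilium_with_other_modules=False):
    """Return True if the kernel is older than the applicable minimum."""
    required = CILIUM_COMBINED_MIN if cilium_with_other_modules else CILIUM_MIN
    return parse_kernel_version(release) < required


if __name__ == "__main__":
    # Hypothetical kernel release strings, used only for illustration.
    for release in ("4.9.16", "5.4.0-150-generic", "5.15.0-91-generic"):
        print(release, has_unmet_requirements(release, cilium_with_other_modules=True))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where "5.10" would sort before "5.7".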
To list the affected nodes, use the following expression in Prometheus:
d8_node_kernel_does_not_satisfy_requirements == 1
-
DeckhouseReleaseDisruptionApprovalRequired
CE
S4
Deckhouse release disruption approval required.
The Deckhouse release contains a disruptive update.
To find out more details, run:
kubectl describe DeckhouseRelease {{ $labels.name }}
If you are ready to deploy this release, run:
kubectl annotate DeckhouseRelease {{ $labels.name }} release.deckhouse.io/disruption-approved=true
-
DeckhouseReleaseIsBlocked
CE
S5
Deckhouse release requirements unmet.
Deckhouse release requirements are not met.
For details, run:
kubectl describe DeckhouseRelease {{ $labels.name }}
-
DeckhouseReleaseIsWaitingManualApproval
CE
S3
Deckhouse release is waiting for manual approval.
Deckhouse release is waiting for manual approval.
To approve the release, run:
kubectl patch DeckhouseRelease {{ $labels.name }} --type=merge -p='{"approved": true}'
-
DeckhouseReleaseIsWaitingManualApproval
CE
S6
Deckhouse release is waiting for manual approval.
Deckhouse release is waiting for manual approval.
To approve the release, run:
kubectl patch DeckhouseRelease {{ $labels.name }} --type=merge -p='{"approved": true}'
-
DeckhouseReleaseIsWaitingManualApproval
CE
S9
Deckhouse release is waiting for manual approval.
Deckhouse release is waiting for manual approval.
To approve the release, run:
kubectl patch DeckhouseRelease {{ $labels.name }} --type=merge -p='{"approved": true}'
-
DeckhouseReleaseNotificationNotSent
CE
S4
Deckhouse release notification webhook not sent.
Failed to send the Deckhouse release notification webhook.
Check the notification webhook address by running:
kubectl get mc deckhouse -o yaml
-
DeckhouseUpdating
CE
S4
Deckhouse is being updated.
-
DeckhouseUpdatingFailed
CE
S4
Deckhouse update has failed.
Failed to update Deckhouse.
The Deckhouse image for the next minor/patch version is not available in the registry, or the image is corrupted. Current version: {{ $labels.version }}.
Make sure that the Deckhouse image for the next version is available in the registry.
-
MigrationRequiredFromRBDInTreeProvisionerToCSIDriver
CE
S9
Storage class {{ $labels.storageclass }} uses the deprecated rbd provisioner. It is necessary to migrate the volumes to the Ceph CSI driver.
To migrate volumes, use this script: https://github.com/deckhouse/deckhouse/blob//modules/031-ceph-csi/tools/rbd-in-tree-to-ceph-csi-migration-helper.sh
A description of how the migration is performed can be found here: https://github.com/deckhouse/deckhouse/blob//modules/031-ceph-csi/docs/internal/INTREE_MIGRATION.md
-
ModuleAtConflict
CE
S4
Conflict detected for module {{ $labels.moduleName }}.
Conflicting sources for the {{ $labels.moduleName }} module. Please specify the proper source in the module configuration.
-
ModuleConfigObsoleteVersion
CE
S4
ModuleConfig {{ $labels.name }} is outdated.
ModuleConfig {{ $labels.name }} is outdated. Update ModuleConfig {{ $labels.name }} to the latest version.
-
ModuleHasDeprecatedUpdatePolicy
CE
S4
The '{{ $labels.moduleName }}' module has a deprecated module update policy.
The ‘{{ $labels.moduleName }}’ module has a deprecated module update policy, the policy selector does not work, and the module will use deckhouse update policy.
Specify the proper update policy in the module configuration to continue receiving updates.
-
ModuleReleaseIsBlockedByRequirements
CE
S6
Module release is blocked by the requirements.
Module {{ $labels.moduleName }} release is blocked by the requirements.
Check the requirements with the following command:
kubectl get mr {{ $labels.name }} -o json | jq .spec.requirements
-
ModuleReleaseIsWaitingManualApproval
CE
S6
Module release is waiting for manual approval.
Module {{ $labels.moduleName }} release is waiting for manual approval.
To approve the release, run:
kubectl annotate mr {{ $labels.name }} modules.deckhouse.io/approved="true"
Module monitoring-kubernetes
-
CPUStealHigh
CE
S4
CPU Steal on the {{ $labels.node }} Node is too high.
The CPU steal is too high on the {{ $labels.node }} Node in the last 30 minutes.
Probably, some other component is stealing Node resources (e.g., a neighboring virtual machine). This may be the result of “overselling” the hypervisor. In other words, there are more virtual machines than the hypervisor can handle.
-
DeadMansSwitch
CE
S4
Alerting DeadMansSwitch
This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
-
DeploymentGenerationMismatch
CE
S4
Deployment is outdated
Observed deployment generation does not match expected one for deployment {{$labels.namespace}}/{{$labels.deployment}}
-
EbpfExporterKernelNotSupported
CE
S8
The BTF module required for ebpf_exporter is missing in the kernel. Possible actions to resolve the problem:
- Build the kernel with BTF type information.
- Disable ebpf_exporter.
-
FdExhaustionClose
CE
S3
file descriptors soon exhausted
{{ $labels.job }}: the {{ $labels.instance }} instance will exhaust its file/socket descriptors within the next hour
-
FdExhaustionClose
CE
S3
file descriptors soon exhausted
{{ $labels.job }}: the {{ $labels.namespace }}/{{ $labels.pod }} instance will exhaust its file/socket descriptors within the next hour
-
FdExhaustionClose
CE
S4
file descriptors soon exhausted
{{ $labels.job }}: the {{ $labels.instance }} instance will exhaust its file/socket descriptors within the next 4 hours
-
FdExhaustionClose
CE
S4
file descriptors soon exhausted
{{ $labels.job }}: the {{ $labels.namespace }}/{{ $labels.pod }} instance will exhaust its file/socket descriptors within the next 4 hours
-
HelmReleasesHasResourcesWithDeprecatedVersions
CE
S5
At least one HELM release contains resources with deprecated apiVersion, which will be removed in Kubernetes v{{ $labels.k8s_version }}.
To list all such resources, use the following expression in Prometheus:
max by (helm_release_namespace, helm_release_name, helm_version, resource_namespace, resource_name, api_version, kind, k8s_version) (resource_versions_compatibility) == 1
You can find more details on migration in the deprecation guide: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v{{ $labels.k8s_version | reReplaceAll "\." "-" }}.
Attention: the check runs once per hour, so this alert should clear within an hour after the deprecated resources are migrated.
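The anchor in the deprecation guide link above is derived from the Kubernetes version by replacing each dot with a hyphen, which is what the reReplaceAll call in the alert template does. A minimal Python sketch of the same transformation; the function name and the sample version are hypothetical:

```python
import re

GUIDE_URL = "https://kubernetes.io/docs/reference/using-api/deprecation-guide/"


def deprecation_guide_link(k8s_version):
    """Build the guide link for a version such as '1.25' (anchor '#v1-25')."""
    anchor = "v" + re.sub(r"\.", "-", k8s_version)
    return f"{GUIDE_URL}#{anchor}"


if __name__ == "__main__":
    # '1.25' is only a sample value for the k8s_version label.
    print(deprecation_guide_link("1.25"))
```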
-
HelmReleasesHasResourcesWithUnsupportedVersions
CE
S4
At least one HELM release contains resources with unsupported apiVersion for Kubernetes v{{ $labels.k8s_version }}.
To list all such resources, use the following expression in Prometheus:
max by (helm_release_namespace, helm_release_name, helm_version, resource_namespace, resource_name, api_version, kind, k8s_version) (resource_versions_compatibility) == 2
You can find more details on migration in the deprecation guide: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v{{ $labels.k8s_version | reReplaceAll "\." "-" }}.
Attention: the check runs once per hour, so this alert should clear within an hour after the deprecated resources are migrated.
-
K8SKubeletDown
CE
S3
Many kubelets cannot be scraped
Prometheus failed to scrape {{ $value }}% of kubelets.
-
K8SKubeletDown
CE
S4
A few kubelets cannot be scraped
Prometheus failed to scrape {{ $value }}% of kubelets.
-
K8SKubeletTooManyPods
CE
S7
Kubelet is close to pod limit
Kubelet {{ $labels.node }} is running {{ $value }} pods, close to the limit of {{ printf "kube_node_status_capacity{job=\"kube-state-metrics\",resource=\"pods\",unit=\"integer\",node=\"%s\"}" $labels.node | query | first | value }}
-
K8SManyNodesNotReady
CE
S3
Too many nodes are not ready
{{ $value }}% of Kubernetes nodes are not ready
-
K8SNodeNotReady
CE
S3
Node status is NotReady
The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than 10 minutes
-
KubeletImageFSBytesUsage
CE
S5
No more free bytes on imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
No more free bytes on imagefs (filesystem that the container runtime uses for storing images and container writable layers) on node {{$labels.node}} mountpoint {{$labels.mountpoint}}.
-
KubeletImageFSBytesUsage
CE
S6
Hard eviction of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Hard eviction of imagefs (filesystem that the container runtime uses for storing images and container writable layers) on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_imagefs_bytes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletImageFSBytesUsage
CE
S7
Close to hard eviction threshold of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
Close to hard eviction threshold of imagefs (filesystem that the container runtime uses for storing images and container writable layers) on node {{$labels.node}} mountpoint {{$labels.mountpoint}}.
Threshold at: {{ printf "kubelet_eviction_imagefs_bytes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletImageFSBytesUsage
CE
S9
Soft eviction of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Soft eviction of imagefs (filesystem that the container runtime uses for storing images and container writable layers) on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_imagefs_bytes{type=\"soft\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletImageFSInodesUsage
CE
S5
No more free inodes on imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
-
KubeletImageFSInodesUsage
CE
S6
Hard eviction of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Hard eviction of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_imagefs_inodes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletImageFSInodesUsage
CE
S7
Close to hard eviction threshold of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
Close to hard eviction threshold of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
Threshold at: {{ printf "kubelet_eviction_imagefs_inodes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletImageFSInodesUsage
CE
S9
Soft eviction of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Soft eviction of imagefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_imagefs_inodes{type=\"soft\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletNodeFSBytesUsage
CE
S5
No more free space on nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
-
KubeletNodeFSBytesUsage
CE
S6
Hard eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Hard eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_nodefs_bytes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletNodeFSBytesUsage
CE
S7
Close to hard eviction threshold of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
Close to hard eviction threshold of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
Threshold at: {{ printf "kubelet_eviction_nodefs_bytes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletNodeFSBytesUsage
CE
S9
Soft eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Soft eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_nodefs_bytes{type=\"soft\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletNodeFSInodesUsage
CE
S5
No more free inodes on nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
-
KubeletNodeFSInodesUsage
CE
S6
Hard eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Hard eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_nodefs_inodes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletNodeFSInodesUsage
CE
S7
Close to hard eviction threshold of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
Close to hard eviction threshold of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint.
Threshold at: {{ printf "kubelet_eviction_nodefs_inodes{type=\"hard\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubeletNodeFSInodesUsage
CE
S9
Soft eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Soft eviction of nodefs on the {{$labels.node}} Node at the {{$labels.mountpoint}} mountpoint is in progress.
Threshold at: {{ printf "kubelet_eviction_nodefs_inodes{type=\"soft\", node=\"%s\", mountpoint=\"%s\"}" $labels.node $labels.mountpoint | query | first | value }}%
Currently at: {{ .Value }}%
-
KubernetesDnsTargetDown
CE
S5
Kube-dns or CoreDNS are not under monitoring.
Prometheus is unable to collect metrics from kube-dns. Thus its status is unknown.
To debug the problem, use the following commands:
kubectl -n kube-system describe deployment -l k8s-app=kube-dns
kubectl -n kube-system describe pod -l k8s-app=kube-dns
-
KubeStateMetricsDown
CE
S3
Kube-state-metrics is not working in the cluster.
There are no metrics about cluster resources for 5 minutes.
Most alerts and monitoring panels aren’t working.
To debug the problem:
- Check kube-state-metrics pods:
kubectl -n d8-monitoring describe pod -l app=kube-state-metrics
- Check its logs:
kubectl -n d8-monitoring describe deploy kube-state-metrics
-
LoadBalancerServiceWithoutExternalIP
CE
S4
A load balancer has not been created.
One or more services with the LoadBalancer type cannot get an external address.
The list of services can be obtained with the following command:
kubectl get svc -Ao json | jq -r '.items[] | select(.spec.type == "LoadBalancer") | select(.status.loadBalancer.ingress[0].ip == null) | "namespace: \(.metadata.namespace), name: \(.metadata.name), ip: \(.status.loadBalancer.ingress[0].ip)"'
Check the cloud-controller-manager logs in the 'd8-cloud-provider-*' namespace.
If you are using a bare-metal cluster with the metallb module enabled, check that the address space of the pool has not been exhausted.
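The jq filter above selects Services of the LoadBalancer type whose first ingress entry has no IP address. A minimal Python sketch of the same selection, assuming the input is shaped like the output of `kubectl get svc -A -o json`; the function name and the sample data are hypothetical:

```python
import json

# Hypothetical sample shaped like the output of `kubectl get svc -A -o json`.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "web", "name": "front"},
   "spec": {"type": "LoadBalancer"},
   "status": {"loadBalancer": {}}},
  {"metadata": {"namespace": "web", "name": "api"},
   "spec": {"type": "LoadBalancer"},
   "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}},
  {"metadata": {"namespace": "db", "name": "pg"},
   "spec": {"type": "ClusterIP"},
   "status": {"loadBalancer": {}}}
]}
""")


def services_without_external_ip(service_list):
    """Return (namespace, name) pairs of LoadBalancer Services with no ingress IP."""
    result = []
    for item in service_list.get("items", []):
        if item.get("spec", {}).get("type") != "LoadBalancer":
            continue
        ingress = item.get("status", {}).get("loadBalancer", {}).get("ingress") or []
        # Like the jq filter, only the first ingress entry is inspected, so a
        # Service that exposes only a hostname is reported as well.
        ip = ingress[0].get("ip") if ingress else None
        if ip is None:
            meta = item["metadata"]
            result.append((meta["namespace"], meta["name"]))
    return result


if __name__ == "__main__":
    for namespace, name in services_without_external_ip(SAMPLE):
        print(f"namespace: {namespace}, name: {name}")
```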
-
NodeConntrackTableFull
CE
S3
The conntrack table is full.
The conntrack table on the {{ $labels.node }} Node is full!
No new connections are created or accepted on the Node; note that this may result in strange software issues.
The recommended course of action is to identify the source of “excess” conntrack entries using Okmeter or Grafana charts.
-
NodeConntrackTableFull
CE
S4
The conntrack table is close to the maximum size.
The conntrack table on the {{ $labels.node }} is {{ $value }}% of the maximum size.
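The percentage above is the ratio of tracked connections to the table limit. A minimal Python sketch for checking it directly on a node, assuming the standard Linux netfilter counters nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter; the alert's own expression is not shown in this document, and the function names are hypothetical:

```python
from pathlib import Path


def conntrack_usage_percent(count, maximum):
    """Return how full the conntrack table is, as a percentage."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return 100.0 * count / maximum


def read_conntrack_usage(base="/proc/sys/net/netfilter"):
    """Read the counters from procfs (requires the conntrack module to be loaded)."""
    count = int(Path(base, "nf_conntrack_count").read_text())
    maximum = int(Path(base, "nf_conntrack_max").read_text())
    return conntrack_usage_percent(count, maximum)


if __name__ == "__main__":
    # Sample numbers for illustration; call read_conntrack_usage() on a real node.
    print(f"{conntrack_usage_percent(196608, 262144):.1f}%")
```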
There’s nothing to worry about yet if the conntrack table is only 70-80 percent full. However, if it fills up, new connections will fail and software may behave strangely.
The recommended course of action is to identify the source of “excess” conntrack entries using Okmeter or Grafana charts.
-
NodeExporterDown
CE
S3
Prometheus could not scrape a node-exporter
Prometheus could not scrape a node-exporter for more than 10m, or node-exporters have disappeared from discovery
-
NodeFilesystemIsRO
CE
S4
The file system of the node is in read-only mode.
The file system on the node has switched to read-only mode.
See the node logs to find out the cause and fix it.
-
NodeSUnreclaimBytesUsageHigh
CE
S4
The {{ $labels.node }} Node has high kernel memory usage.
The {{ $labels.node }} Node has a potential kernel memory leak. There is one known issue that can cause it.
Check the cgroupDriver on the {{ $labels.node }} Node:
cat /var/lib/kubelet/config.yaml | grep 'cgroupDriver: systemd'
If cgroupDriver is set to systemd, a reboot is required to roll back to the cgroupfs driver. Drain and reboot the node.
You can check this issue for extra information.
-
NodeSystemExporterDoesNotExistsForNode
CE
S4
Some of the Node system exporters don’t work correctly for the {{ $labels.node }} Node.
The recommended course of action:
- Find the Node exporter Pod for this Node:
kubectl -n d8-monitoring get pod -l app=node-exporter -o json | jq -r ".items[] | select(.spec.nodeName==\"{{$labels.node}}\") | .metadata.name"
- Describe the Node exporter Pod:
kubectl -n d8-monitoring describe pod <pod_name>
- Check that kubelet is running on the {{ $labels.node }} node.
-
NodeTCPMemoryExhaust
CE
S6
The {{ $labels.node }} node has high TCP stack memory usage.
The TCP stack on the {{ $labels.node }} node is experiencing high memory pressure. This could be caused by applications with intensive TCP networking functionality. Investigate the relevant applications and consider adjusting the system’s TCP memory configuration or addressing the source of increased network traffic.
-
NodeUnschedulable
CE
S8
The {{ $labels.node }} Node is cordon-protected; no new Pods can be scheduled onto it.
The {{ $labels.node }} Node is cordon-protected; no new Pods can be scheduled onto it.
This means that someone has executed one of the following commands on that Node:
- kubectl cordon {{ $labels.node }}
- kubectl drain {{ $labels.node }} (running for more than 20 minutes)
This is probably due to maintenance of this Node.
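To see every Node that is currently cordoned, not just the one named in the alert, the Node list can be filtered on spec.unschedulable. This is a minimal sketch, assuming the input is the parsed JSON from kubectl get nodes -o json; the function name cordoned_nodes is hypothetical.

```python
def cordoned_nodes(node_list):
    """Names of Nodes whose spec.unschedulable is true (set by cordon or drain)."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if node.get("spec", {}).get("unschedulable") is True
    ]
```

Once maintenance is finished, a cordoned Node is returned to service with kubectl uncordon.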
-
PodStatusIsIncorrect
CE
The state of the {{ $labels.namespace }}/{{ $labels.pod }} Pod running on the {{ $labels.node }} Node is incorrect. You need to restart kubelet.
There is a {{ $labels.namespace }}/{{ $labels.pod }} Pod in the cluster that runs on the {{ $labels.node }} Node and is listed as NotReady while all the Pod's containers are Ready.
This could be due to a Kubernetes bug.
The recommended course of action:
- Find all the Pods having this state:
kubectl get pod -o json --all-namespaces | jq '.items[] | select(.status.phase == "Running") | select(.status.conditions[] | select(.type == "ContainersReady" and .status == "True")) | select(.status.conditions[] | select(.type == "Ready" and .status == "False")) | "\(.spec.nodeName)/\(.metadata.namespace)/\(.metadata.name)"'
- Find all the Nodes affected:
kubectl get pod -o json --all-namespaces | jq '.items[] | select(.status.phase == "Running") | select(.status.conditions[] | select(.type == "ContainersReady" and .status == "True")) | select(.status.conditions[] | select(.type == "Ready" and .status == "False")) | .spec.nodeName' -r | sort | uniq -c
- Restart kubelet on each Node:
systemctl restart kubelet
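The two jq queries above can be combined into one pass that counts affected Pods per Node. This is a minimal sketch, assuming the input is the parsed JSON from kubectl get pod -o json --all-namespaces; the function names are hypothetical.

```python
from collections import Counter


def _condition_status(pod, cond_type):
    """Return the status string of the given Pod condition, or None if absent."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def stuck_pods_per_node(pod_list):
    """Count Running Pods with ContainersReady=True but Ready=False, per Node."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Running":
            continue
        if (_condition_status(pod, "ContainersReady") == "True"
                and _condition_status(pod, "Ready") == "False"):
            counts[pod.get("spec", {}).get("nodeName")] += 1
    return counts
```

Nodes with a non-zero count are the ones where restarting kubelet is worth trying first.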
-
StorageClassCloudManual
CE
S6
Manually deployed StorageClass {{ $labels.name }} found in the cluster
A StorageClass with a cloud-provider provisioner shouldn't be deployed manually. Such StorageClasses are managed by the cloud-provider module; you only need to change the module configuration to fit your needs.
-
StorageClassDefaultDuplicate
CE
S6
Multiple default StorageClasses found in the cluster
More than one StorageClass in the cluster is annotated as the default. Probably, a manually deployed StorageClass exists that overlaps with the default storage configuration of the cloud-provider module.
-
UnsupportedContainerRuntimeVersion
CE
Unsupported version of CRI {{$labels.container_runtime_version}} installed for Kubernetes version: {{$labels.kubelet_version}}
An unsupported CRI version, {{$labels.container_runtime_version}}, is installed on the {{$labels.node}} node. Supported CRI versions for Kubernetes {{$labels.kubelet_version}}:
- Containerd 1.4.*
- Containerd 1.5.*
- Containerd 1.6.*
- Containerd 1.7.*
Module monitoring-kubernetes-control-plane
-
K8SApiserverDown
CE
S3
No API servers are reachable
No API servers are reachable or all have disappeared from service discovery
-
K8sCertificateExpiration
CE
S5
Kubernetes has API clients with soon expiring certificates
Some clients connect to {{$labels.component}} with a certificate that expires soon (in less than 1 day) on node {{$labels.node}}.
You need to use kubeadm to check control plane certificates.
- Install kubeadm:
apt install kubeadm=1.24.*
- Check certificates:
kubeadm certs check-expiration
To check kubelet certificates, on each node you need to:
- Check kubelet config:
ps aux | grep "/usr/bin/kubelet" | grep -o -e "--kubeconfig=\S*" | cut -f2 -d"=" | xargs cat
- Find the client-certificate or client-certificate-data field.
- Check the certificate using openssl.
There are no tools to help you find other stale kubeconfigs. It is better to enable the control-plane-manager module to be able to debug in this case.
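For the openssl step, one possible way to turn the certificate end date into a number of remaining days is sketched below. It assumes the date was printed with openssl x509 -noout -enddate, which emits a line of the form notAfter=Jun  1 12:00:00 2025 GMT, and that month names are parsed in an English or C locale; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_not_after(line):
    """Parse a 'notAfter=Jun  1 12:00:00 2025 GMT' line into an aware datetime."""
    value = line.strip().split("=", 1)[1]
    if not value.endswith(" GMT"):
        raise ValueError("unexpected time zone in: " + value)
    # Month names are locale-dependent; an English/C locale is assumed.
    naive = datetime.strptime(value[:-4], "%b %d %H:%M:%S %Y")
    return naive.replace(tzinfo=timezone.utc)


def days_until_expiry(line, now):
    """Days left until the certificate expires (negative if already expired)."""
    return (parse_not_after(line) - now).total_seconds() / 86400
```

Passing datetime.now(timezone.utc) as now gives the current margin, which can be compared against the 1-day and 7-day thresholds used by this alert.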
-
K8sCertificateExpiration
CE
S6
Kubernetes has API clients with soon expiring certificates
Some clients connect to {{$labels.component}} with a certificate that expires soon (in less than 7 days) on node {{$labels.node}}.
You need to use kubeadm to check control plane certificates.
- Install kubeadm:
apt install kubeadm=1.24.*
- Check certificates:
kubeadm certs check-expiration
To check kubelet certificates, on each node you need to:
- Check kubelet config:
ps aux | grep "/usr/bin/kubelet" | grep -o -e "--kubeconfig=\S*" | cut -f2 -d"=" | xargs cat
- Find the client-certificate or client-certificate-data field.
- Check the certificate using openssl.
There are no tools to help you find other stale kubeconfigs. It is better to enable the control-plane-manager module to be able to debug in this case.
-
K8SControllerManagerTargetDown
CE
S3
Controller manager is down
There is no running kube-controller-manager. Deployments and replication controllers are not making progress.
-
K8SSchedulerTargetDown
CE
S3
Scheduler is down
There is no running K8S scheduler. New pods are not being assigned to nodes.
-
KubeEtcdHighFsyncDurations
CE
S7
Syncing (fsync) WAL files to disk is slow.
In the last 15 minutes, the 99th percentile of the fsync duration for WAL files is longer than 0.5 seconds: {{ $value }}.
Possible causes:
- High latency of the disk where the etcd data is located;
- High CPU usage on the Node.
-
KubeEtcdHighNumberOfLeaderChanges
CE
S5
The etcd cluster re-elects the leader too often.
There were {{ $value }} leader re-elections for the etcd cluster member running on the {{ $labels.node }} Node in the last 10 minutes.
Possible causes:
- High latency of the disk where the etcd data is located;
- High CPU usage on the Node;
- Degradation of network connectivity between cluster members in the multi-master mode.
-
KubeEtcdInsufficientMembers
CE
S4
There are insufficient members in the etcd cluster; the cluster will fail if one of the remaining members becomes unavailable.
Check the status of the etcd pods:
kubectl -n kube-system get pod -l component=etcd
-
KubeEtcdNoLeader
CE
S4
The etcd cluster member running on the {{ $labels.node }} Node has lost the leader.
Check the status of the etcd Pods:
kubectl -n kube-system get pod -l component=etcd | grep {{ $labels.node }}
-
KubeEtcdTargetAbsent
CE
S5
There is no etcd target in Prometheus.
Check the status of the etcd Pods:
kubectl -n kube-system get pod -l component=etcd
or Prometheus logs:
kubectl -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus
-
KubeEtcdTargetDown
CE
S5
Prometheus is unable to scrape etcd metrics.
Check the status of the etcd Pods:
kubectl -n kube-system get pod -l component=etcd
or Prometheus logs:
kubectl -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus
Module monitoring-ping
-
NodePingPacketLoss
CE
S4
Ping loss more than 5%
ICMP packet loss to node {{$labels.destination_node}} is more than 5%
Module node-manager
-
There are unavailable instances in the {{ $labels.machine_deployment_name }} MachineDeployment.
In the {{ $labels.machine_deployment_name }} MachineDeployment, the number of unavailable instances is {{ $value }}. Check the state of the instances in the cluster:
kubectl get instance -l node.deckhouse.io/group={{ $labels.machine_deployment_name }}
-
ClusterHasOrphanedDisks
CE
S6
Cloud data discoverer finds disks in the cloud for which there is no PersistentVolume in the cluster
Cloud data discoverer finds disks in the cloud for which there is no PersistentVolume in the cluster. You can manually delete these disks from your cloud: ID: {{ $labels.id }}, Name: {{ $labels.name }}
-
D8BashibleApiserverLocked
CE
S6
Bashible-apiserver is locked for too long
Check that the bashible-apiserver pods are up-to-date and running:
kubectl -n d8-cloud-instance-manager get pods -l app=bashible-apiserver
-
D8CloudDataDiscovererCloudRequestError
CE
S6
Cloud data discoverer cannot get data from cloud
Cloud data discoverer cannot get data from cloud. See cloud data discoverer logs for more information:
kubectl -n {{ $labels.namespace }} logs deploy/cloud-data-discoverer
-
D8CloudDataDiscovererSaveError
CE
S6
Cloud data discoverer cannot save data to k8s resource
Cloud data discoverer cannot save data to k8s resource. See cloud data discoverer logs for more information:
kubectl -n {{ $labels.namespace }} logs deploy/cloud-data-discoverer
-
D8ClusterAutoscalerManagerPodIsNotReady
CE
S8
The {{$labels.pod}} Pod is NOT Ready.
-
D8ClusterAutoscalerPodIsNotRunning
CE
S8
The cluster-autoscaler Pod is NOT Running.
The {{$labels.pod}} Pod is {{$labels.phase}}.
Run the following command to check its status:
kubectl -n {{$labels.namespace}} get pods {{$labels.pod}} -o json | jq .status
-
D8ClusterAutoscalerPodIsRestartingTooOften
CE
S9
Too many cluster-autoscaler restarts have been detected.
The number of restarts in the last hour: {{ $value }}.
Excessive cluster-autoscaler restarts indicate that something is wrong. Normally, it should be up and running all the time.
Please, refer to the corresponding logs:
kubectl -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler
-
D8ClusterAutoscalerTargetAbsent
CE
S8
There is no cluster-autoscaler target in Prometheus.
Cluster-autoscaler automatically scales Nodes in the cluster; its unavailability will result in the inability to add new Nodes if there is a lack of resources to schedule Pods. In addition, the unavailability of cluster-autoscaler may result in over-spending due to provisioned but inactive cloud instances.
The recommended course of action:
- Check the availability and status of cluster-autoscaler Pods:
kubectl -n d8-cloud-instance-manager get pods -l app=cluster-autoscaler
- Check whether the cluster-autoscaler deployment is present:
kubectl -n d8-cloud-instance-manager get deploy cluster-autoscaler
- Check the status of the cluster-autoscaler deployment:
kubectl -n d8-cloud-instance-manager describe deploy cluster-autoscaler
-
D8ClusterAutoscalerTargetDown
CE
S8
Prometheus is unable to scrape cluster-autoscaler's metrics.
-
D8ClusterAutoscalerTooManyErrors
CE
S8
Cluster-autoscaler issues too many errors.
Cluster-autoscaler’s scaling attempt resulted in an error from the cloud provider.
Please, refer to the corresponding logs:
kubectl -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler
-
D8MachineControllerManagerPodIsNotReady
CE
S8
The {{$labels.pod}} Pod is NOT Ready.
-
D8MachineControllerManagerPodIsNotRunning
CE
S8
The machine-controller-manager Pod is NOT Running.
The {{$labels.pod}} Pod is {{$labels.phase}}.
Run the following command to check the status of the Pod:
kubectl -n {{$labels.namespace}} get pods {{$labels.pod}} -o json | jq .status
-
D8MachineControllerManagerPodIsRestartingTooOften
CE
S9
The machine-controller-manager module restarts too often.
The number of restarts in the last hour: {{ $value }}.
Excessive machine-controller-manager restarts indicate that something is wrong. Normally, it should be up and running all the time.
Please, refer to the logs:
kubectl -n d8-cloud-instance-manager logs -f -l app=machine-controller-manager -c controller
-
D8MachineControllerManagerTargetAbsent
CE
S8
There is no machine-controller-manager target in Prometheus.
Machine controller manager manages ephemeral Nodes in the cluster. Its unavailability will result in the inability to add/delete Nodes.
The recommended course of action:
- Check the availability and status of machine-controller-manager Pods:
kubectl -n d8-cloud-instance-manager get pods -l app=machine-controller-manager
- Check the availability of the machine-controller-manager Deployment:
kubectl -n d8-cloud-instance-manager get deploy machine-controller-manager
- Check the status of the machine-controller-manager Deployment:
kubectl -n d8-cloud-instance-manager describe deploy machine-controller-manager
-
D8MachineControllerManagerTargetDown
CE
S8
Prometheus is unable to scrape machine-controller-manager's metrics.
-
D8NodeGroupIsNotUpdating
CE
S8
The {{ $labels.node_group }} node group is not handling the update correctly.
There is a new update for Nodes of the {{ $labels.node_group }} group; Nodes have learned about the update. However, no Node can get approval to start updating.
Most likely, there is a problem with the update_approval hook of the node-manager module.
-
D8NodeIsNotUpdating
CE
S7
The {{ $labels.node }} Node cannot complete the update.
There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, started the update, and ran into a step that causes possible downtime. The update manager (the update_approval hook of the node-manager module) performed the update, and the Node received downtime approval. However, there is no success message about the update.
Here is how you can view Bashible logs on the Node:
journalctl -fu bashible
-
D8NodeIsNotUpdating
CE
S8
The {{ $labels.node }} Node cannot complete the update.
There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, but cannot complete the update.
Here is how you can view Bashible logs on the Node:
journalctl -fu bashible
-
D8NodeIsNotUpdating
CE
S9
The {{ $labels.node }} Node does not update.
There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group, but the Node has neither received the update nor attempted to apply it.
Most likely, Bashible is not handling the update correctly for some reason. At this point, it must add the update.node.deckhouse.io/waiting-for-approval annotation to the Node so that it can be approved.
You can find out the most current version of the update using this command:
kubectl -n d8-cloud-instance-manager get secret configuration-checksums -o jsonpath={.data.{{ $labels.node_group }}} | base64 -d
Use the following command to find out the version on the Node:
kubectl get node {{ $labels.node }} -o jsonpath='{.metadata.annotations.node\.deckhouse\.io/configuration-checksum}'
Here is how you can view Bashible logs on the Node:
journalctl -fu bashible
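The two checksum commands above can be compared programmatically. This is a minimal sketch, assuming the first argument is the still base64-encoded value taken from the configuration-checksums secret (before the base64 -d step) and the second is the value of the node.deckhouse.io/configuration-checksum annotation; the function name node_is_up_to_date is hypothetical.

```python
import base64


def node_is_up_to_date(secret_value_b64, node_checksum):
    """True if the Node annotation matches the checksum stored in the secret."""
    expected = base64.b64decode(secret_value_b64).decode().strip()
    return expected == node_checksum.strip()
```

A False result means the Node is still on an older configuration, which matches the situation this alert describes.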
-
D8NodeIsUnmanaged
CE
S9
The {{ $labels.node }} Node is not managed by the node-manager module.
The {{ $labels.node }} Node is not managed by the node-manager module.
The recommended actions are as follows:
- Follow these instructions to clean up the node and add it to the cluster: https://deckhouse.io/products/kubernetes-platform/documentation/v1/modules/node-manager/faq.html#how-to-clean-up-a-node-for-adding-to-the-cluster
-
D8NodeUpdateStuckWaitingForDisruptionApproval
CE
S8
The {{ $labels.node }} Node cannot get disruption approval.
There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, started the update, and ran into a stage that causes possible downtime. For some reason, the Node cannot get that approval (it is issued fully automatically by the update_approval hook of the node-manager module).
-
D8ProblematicNodeGroupConfiguration
CE
S8
The {{ $labels.node }} Node cannot begin the update.
There is a new update for Nodes of the {{ $labels.node_group }} group; Nodes have learned about the update. However, {{ $labels.node }} Node cannot be updated.
Node {{ $labels.node }} has no node.deckhouse.io/configuration-checksum annotation. Perhaps the bootstrap process of the Node did not complete correctly. Check the cloud-init logs (/var/log/cloud-init-output.log) of the Node. There is probably a problematic NodeGroupConfiguration resource for the {{ $labels.node_group }} NodeGroup.
-
EarlyOOMPodIsNotReady
CE
S8
The {{$labels.pod}} Pod has detected unavailable PSI subsystem.
Check logs for additional information:
kubectl -n d8-cloud-instance-manager logs {{$labels.pod}}
Possible actions to resolve the problem:
- Upgrade kernel to version 4.20 or higher.
- Enable Pressure Stall Information.
- Disable early oom.
-
NodeGroupHasStaticInternalNetworkCIDRsField
CE
S9
NodeGroup {{ $labels.name }} has deprecated field spec.static.internalNetworkCIDRs
The internal network CIDRs setting is now located in the static cluster configuration. Delete this field from NodeGroup {{ $labels.name }} to fix this alert. Do not worry, it has already been migrated to another place.
-
NodeGroupMasterTaintIsAbsent
CE
S4
The 'master' node group does not contain desired taint.
The master node group has no node-role.kubernetes.io/control-plane taint. Probably control-plane nodes are misconfigured and are able to run not only control-plane Pods. Please, add:
nodeTemplate:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
to the master node group spec.
The key: node-role.kubernetes.io/master taint was deprecated and will have no effect in Kubernetes 1.24+.
-
There are no available instances in the {{ $labels.node_group }} node group.
Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:
- Cloud provider limits on available resources;
- No access to the cloud provider API;
- Cloud provider or instance class misconfiguration;
- Problems with bootstrapping the Machine.
The recommended course of action:
- Run kubectl get ng {{ $labels.node_group }} -o yaml. In the .status.lastMachineFailures field you can find all errors related to the creation of Machines;
- The absence of Machines in the list that have been in Pending status for more than a couple of minutes means that Machines are continuously being created and deleted because of some error:
kubectl -n d8-cloud-instance-manager get machine
- Refer to the Machine description if the logs do not include error messages and the Machine continues to be Pending:
kubectl -n d8-cloud-instance-manager get machine <machine_name> -o json | jq .status.bootstrapStatus
- The output similar to the one below means that you have to use nc to examine the bootstrap logs:
{ "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.", "tcpEndpoint": "192.168.199.158" }
- The absence of information about the endpoint for getting logs means that cloudInit is not working correctly. This may be due to the incorrect configuration of the instance class for the cloud provider.
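The bootstrapStatus object shown above can be turned into a connection target in code. This is a minimal sketch, assuming the object has the description and tcpEndpoint fields from the sample output and that the port appears only inside the description text; the function names are hypothetical, and read_bootstrap_logs is a plain TCP read that stands in for the nc command.

```python
import re
import socket


def bootstrap_log_endpoint(bootstrap_status):
    """Return (host, port) for reading bootstrap logs, or None if no endpoint is set."""
    host = bootstrap_status.get("tcpEndpoint")
    if not host:
        # Per the text above, a missing endpoint suggests cloudInit is not working.
        return None
    # The port is only present in the human-readable description.
    match = re.search(r"nc\s+\S+\s+(\d+)", bootstrap_status.get("description", ""))
    port = int(match.group(1)) if match else None
    return host, port


def read_bootstrap_logs(host, port, timeout=10.0):
    """Read everything the endpoint sends until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")
```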
-
The number of simultaneously unavailable instances in the {{ $labels.node_group }} node group exceeds the allowed value.
Possibly, autoscaler has provisioned too many Nodes. Take a look at the state of the Machine in the cluster. Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:
- Cloud provider limits on available resources;
- No access to the cloud provider API;
- Cloud provider or instance class misconfiguration;
- Problems with bootstrapping the Machine.
The recommended course of action:
- Run kubectl get ng {{ $labels.node_group }} -o yaml. In the .status.lastMachineFailures field you can find all errors related to the creation of Machines;
- The absence of Machines in the list that have been in Pending status for more than a couple of minutes means that Machines are continuously being created and deleted because of some error:
kubectl -n d8-cloud-instance-manager get machine
- Refer to the Machine description if the logs do not include error messages and the Machine continues to be Pending:
kubectl -n d8-cloud-instance-manager get machine <machine_name> -o json | jq .status.bootstrapStatus
- The output similar to the one below means that you have to use nc to examine the bootstrap logs:
{ "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.", "tcpEndpoint": "192.168.199.158" }
- The absence of information about the endpoint for getting logs means that cloudInit is not working correctly. This may be due to the incorrect configuration of the instance class for the cloud provider.
-
There are unavailable instances in the {{ $labels.node_group }} node group.
The number of unavailable instances is {{ $value }}. See the relevant alerts for more information. Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:
- Cloud provider limits on available resources;
- No access to the cloud provider API;
- Cloud provider or instance class misconfiguration;
- Problems with bootstrapping the Machine.
The recommended course of action:
- Run kubectl get ng {{ $labels.node_group }} -o yaml. In the .status.lastMachineFailures field you can find all errors related to the creation of Machines;
- The absence of Machines in the list that have been in Pending status for more than a couple of minutes means that Machines are continuously being created and deleted because of some error:
kubectl -n d8-cloud-instance-manager get machine
- Refer to the Machine description if the logs do not include error messages and the Machine continues to be Pending:
kubectl -n d8-cloud-instance-manager get machine <machine_name> -o json | jq .status.bootstrapStatus
- The output similar to the one below means that you have to use nc to examine the bootstrap logs:
{ "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.", "tcpEndpoint": "192.168.199.158" }
- The absence of information about the endpoint for getting logs means that cloudInit is not working correctly. This may be due to the incorrect configuration of the instance class for the cloud provider.
-
NodeRequiresDisruptionApprovalForUpdate
CE
S8
The {{ $labels.node }} Node requires disruption approval to proceed with the update
There is a new update for Nodes and the {{ $labels.node }} Node of the {{ $labels.node_group }} group has learned about the update, requested and received approval, started the update, and ran into a step that causes possible downtime.
You have to manually approve the disruption since the Manual mode is active in the group settings (disruptions.approvalMode).
Grant approval to the Node using the update.node.deckhouse.io/disruption-approved= annotation if it is ready for unsafe updates (e.g., drained).
Caution! The Node will not be drained automatically since the manual mode is enabled (disruptions.approvalMode: Manual).
Caution! No need to drain the master node.
- Use the following commands to drain the Node and grant it update approval:
kubectl drain {{ $labels.node }} --delete-local-data=true --ignore-daemonsets=true --force=true && kubectl annotate node {{ $labels.node }} update.node.deckhouse.io/disruption-approved=
- Note that you need to uncordon the node after the update is complete (i.e., after removing the update.node.deckhouse.io/approved annotation).
while kubectl get node {{ $labels.node }} -o json | jq -e '.metadata.annotations | has("update.node.deckhouse.io/approved")' > /dev/null; do sleep 1; done
kubectl uncordon {{ $labels.node }}
Note that if there are several Nodes in a NodeGroup, you will need to repeat this operation for each Node since only one Node can be updated at a time. Perhaps it makes sense to temporarily enable the automatic disruption approval mode (disruptions.approvalMode: Automatic).
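The wait loop above (poll until the update.node.deckhouse.io/approved annotation disappears, then uncordon) can be sketched in Python for use in automation. This is a minimal sketch: fetch_node is a hypothetical callable that returns the parsed output of kubectl get node ... -o json, and sleep is injected so the loop can be exercised without waiting.

```python
APPROVED = "update.node.deckhouse.io/approved"


def update_in_progress(node):
    """True while the Node still carries the approved annotation."""
    annotations = node.get("metadata", {}).get("annotations") or {}
    return APPROVED in annotations


def wait_until_updated(fetch_node, sleep, interval=1.0):
    """Poll fetch_node() until the annotation is gone; return the number of polls."""
    polls = 1
    while update_in_progress(fetch_node()):
        sleep(interval)
        polls += 1
    return polls
```

When the function returns, the Node can be uncordoned, as in the shell version. In real use, time.sleep would be passed as sleep.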
-
NodeStuckInDraining
CE
S6
The {{ $labels.node }} Node is stuck in draining.
The {{ $labels.node }} Node of the {{ $labels.node_group }} NodeGroup is stuck in draining.
You can get more info by running:
kubectl -n default get event --field-selector involvedObject.name={{ $labels.node }},reason=DrainFailed --sort-by='.metadata.creationTimestamp'
The error message is: {{ $labels.message }}
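When many events are returned, the drain failures for one Node can be extracted in chronological order. This is a minimal sketch, assuming the input is the parsed JSON from kubectl -n default get event -o json and that creation timestamps are in the usual UTC form, which sorts correctly as plain strings; the function name drain_failure_messages is hypothetical.

```python
def drain_failure_messages(event_list, node_name):
    """Messages of DrainFailed events for node_name, oldest first."""
    events = [
        event for event in event_list.get("items", [])
        if event.get("reason") == "DrainFailed"
        and event.get("involvedObject", {}).get("name") == node_name
    ]
    # Timestamps such as 2024-01-02T03:04:05Z sort chronologically as strings.
    events.sort(key=lambda event: event["metadata"]["creationTimestamp"])
    return [event.get("message", "") for event in events]
```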
-
NodeStuckInDrainingForDisruptionDuringUpdate
CE
S6
The {{ $labels.node }} Node is stuck in draining.
There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} NodeGroup. The Node has learned about the update, requested and received approval, started the update, ran into a step that causes possible downtime, and is stuck in draining while trying to get disruption approval automatically.
You can get more info by running:
kubectl -n default get event --field-selector involvedObject.name={{ $labels.node }},reason=ScaleDown --sort-by='.metadata.creationTimestamp'
Module okmeter
-
D8OkmeterAgentPodIsNotReady
CE
S6
Okmeter agent is not Ready
Module operator-prometheus
-
D8PrometheusOperatorPodIsNotReady
CE
S7
The prometheus-operator Pod is NOT Ready.
The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).
The recommended course of action:
- Analyze the Deployment info:
kubectl -n d8-operator-prometheus describe deploy prometheus-operator
- Examine the status of the Pod and try to figure out why it is not running:
kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator
-
D8PrometheusOperatorPodIsNotRunning
CE
S7
The prometheus-operator Pod is NOT Running.
The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).
The recommended course of action:
- Analyze the Deployment info:
kubectl -n d8-operator-prometheus describe deploy prometheus-operator
- Examine the status of the Pod and try to figure out why it is not running:
kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator
-
D8PrometheusOperatorTargetAbsent
CE
S7
There is no prometheus-operator target in Prometheus.
The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).
The recommended course of action is to analyze the deployment information:
kubectl -n d8-operator-prometheus describe deploy prometheus-operator
-
D8PrometheusOperatorTargetDown
CE
S8
Prometheus is unable to scrape prometheus-operator metrics.
The prometheus-operator Pod is not available.
The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).
The recommended course of action:
- Analyze the Deployment info:
kubectl -n d8-operator-prometheus describe deploy prometheus-operator
- Examine the status of the Pod and try to figure out why it is not running:
kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator
Module prometheus
-
One or more Grafana Pods are NOT Running.
The number of Grafana replicas is less than the specified number.
The Deployment is in the MinimumReplicasUnavailable state.
Run the following command to check the status of the Deployment:
kubectl -n d8-monitoring get deployment grafana-v10 -o json | jq .status
Run the following command to check the status of the Pods:
kubectl -n d8-monitoring get pods -l app=grafana-v10 -o json | jq '.items[] | {(.metadata.name):.status}'
-
D8GrafanaDeprecatedCustomDashboardDefinition
CE
S9
The deprecated ConfigMap for defining Grafana dashboards is detected.
The grafana-dashboard-definitions-custom ConfigMap was found in the d8-monitoring namespace. This means that the deprecated method of registering custom dashboards in Grafana is being used.
This method is no longer used! Please, use the custom GrafanaDashboardDefinition resource instead.
-
D8GrafanaPodIsNotReady
CE
S6
The Grafana Pod is NOT Ready.
-
D8GrafanaPodIsRestartingTooOften
CE
S9
Excessive Grafana restarts are detected.
The number of restarts in the last hour: {{ $value }}.
Excessive Grafana restarts indicate that something is wrong. Normally, Grafana should be up and running all the time.
Please, refer to the corresponding logs:
kubectl -n d8-monitoring logs -f -l app=grafana-v10 -c grafana
-
D8GrafanaTargetAbsent
CE
S6
There is no Grafana target in Prometheus.
Grafana visualizes metrics collected by Prometheus. Grafana is critical for some tasks, such as monitoring the state of applications and the cluster as a whole. Additionally, Grafana unavailability can negatively impact users who actively use it in their work.
The recommended course of action:
- Check the availability and status of Grafana Pods:
kubectl -n d8-monitoring get pods -l app=grafana-v10
- Check the availability of the Grafana Deployment:
kubectl -n d8-monitoring get deployment grafana-v10
- Examine the status of the Grafana Deployment:
kubectl -n d8-monitoring describe deployment grafana-v10
-
D8GrafanaTargetDown
CE
S6
Prometheus is unable to scrape Grafana metrics.
-
D8PrometheusLongtermFederationTargetDown
CE
S5
prometheus-longterm cannot scrape prometheus.
prometheus-longterm cannot scrape “/federate” endpoint from Prometheus. Check error cause in prometheus-longterm WebUI or logs.
-
D8PrometheusLongtermTargetAbsent
CE
S7
There is no prometheus-longterm target in Prometheus.
This Prometheus component is only used to display historical data and is not crucial. However, if it remains unavailable for long enough, you will not be able to view the statistics.
Usually, Pods of this type have problems because of disk unavailability (e.g., the disk cannot be mounted to a Node for some reason).
The recommended course of action:
- Take a look at the StatefulSet data:
kubectl -n d8-monitoring describe statefulset prometheus-longterm
- Explore its PVC (if used):
kubectl -n d8-monitoring describe pvc prometheus-longterm-db-prometheus-longterm-0
- Explore the Pod's state:
kubectl -n d8-monitoring describe pod prometheus-longterm-0
-
D8TricksterTargetAbsent
CE
S5
There is no Trickster target in Prometheus.
The following modules use this component:
- prometheus-metrics-adapter: the unavailability of the component means that HPA (auto scaling) is not running and you cannot view resource consumption using kubectl;
- vertical-pod-autoscaler: this module is quite capable of surviving a short-term unavailability, as VPA looks at the consumption history for 8 days;
- grafana: by default, all dashboards use Trickster for caching requests to Prometheus. You can retrieve data directly from Prometheus (bypassing the Trickster). However, this may lead to high memory usage by Prometheus and, hence, to its unavailability.
The recommended course of action:
- Analyze the Deployment information:
kubectl -n d8-monitoring describe deployment trickster
- Analyze the Pod information:
kubectl -n d8-monitoring describe pod -l app=trickster
- Usually, Trickster is unavailable due to Prometheus-related issues because the Trickster's readinessProbe checks the Prometheus availability. Thus, make sure that Prometheus is running:
kubectl -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main
-
D8TricksterTargetAbsent
CE
S5
There is no Trickster target in Prometheus.
The following modules use this component:
- prometheus-metrics-adapter: the unavailability of the component means that HPA (autoscaling) is not running and you cannot view resource consumption using kubectl;
- vertical-pod-autoscaler: this module can survive a short-term unavailability, as VPA looks at the consumption history for 8 days;
- grafana: by default, all dashboards use Trickster for caching requests to Prometheus. You can retrieve data directly from Prometheus (bypassing Trickster). However, this may lead to high memory usage by Prometheus and, hence, to its unavailability.
The recommended course of action:
- Analyze the Deployment stats:
kubectl -n d8-monitoring describe deployment trickster
- Analyze the Pod stats:
kubectl -n d8-monitoring describe pod -l app=trickster
- Usually, Trickster is unavailable due to Prometheus-related issues because Trickster’s readinessProbe checks the Prometheus availability. Make sure that Prometheus is running:
kubectl -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main
-
DeckhouseModuleUseEmptyDir
CE
S9
Deckhouse module {{ $labels.module_name }} uses emptyDir as storage.
Deckhouse module {{ $labels.module_name }} uses emptyDir as storage.
-
GrafanaDashboardAlertRulesDeprecated
CE
S8
Deprecated Grafana alerts have been found.
Before updating to Grafana 10, you must migrate outdated alerts from Grafana to an external Alertmanager (or the exporter-alertmanager stack). To list all deprecated alert rules, use the following expression:
sum by (dashboard, panel, alert_rule) (d8_grafana_dashboards_deprecated_alert_rule) > 0
Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources are migrated.
-
GrafanaDashboardPanelIntervalDeprecated
CE
S8
Deprecated Grafana panel intervals have been found.
Before updating to Grafana 10, you must rewrite outdated expressions that use $interval_rv, interval_sx3, or interval_sx4 so that they use $__rate_interval. To list all deprecated panel intervals, use the following expression:
sum by (dashboard, panel, interval) (d8_grafana_dashboards_deprecated_interval) > 0
Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources are migrated.
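If the dashboards are also kept as exported JSON files, the affected panels can be located offline with a plain text search. The following is a minimal sketch under that assumption; the function name and the example path are hypothetical, and only the three variable names come from the alert text.

```shell
#!/bin/sh
# Prints "file:line:text" for every line of the given dashboard JSON files
# that mentions one of the deprecated interval variables named in the alert.
# The function name is hypothetical; it is not part of Deckhouse or Grafana.
find_deprecated_intervals() {
  # -H forces the file name even for a single file, -n adds line numbers.
  grep -HnE '\$interval_rv|interval_sx3|interval_sx4' "$@"
}

# Example (the path is an assumption):
# find_deprecated_intervals ./dashboards/*.json
```

Each reported line is a candidate for replacing the old variable with $__rate_interval; the search only finds the occurrences and does not rewrite anything.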
-
GrafanaDashboardPluginsDeprecated
CE
S8
Deprecated Grafana plugins have been found.
Before updating to Grafana 10, check whether the currently installed plugins will work correctly with Grafana 10. To list all potentially outdated plugins, use the following expression:
sum by (dashboard, panel, plugin) (d8_grafana_dashboards_deprecated_plugin) > 0
The “flant-statusmap-panel” plugin is being deprecated and won’t be supported in the near future. We recommend migrating to the State Timeline plugin: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/state-timeline/
Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources are migrated.
-
K8STooManyNodes
CE
S7
The number of nodes is close to the maximum allowed.
The cluster is running {{ $value }} nodes, close to the maximum of {{ print "d8_max_nodes_amount{}" | query | first | value }} nodes.
-
PrometheusDiskUsage
CE
S4
Prometheus disk is over 95% used.
For more information, use the command:
kubectl -n {{ $labels.namespace }} exec -ti {{ $labels.pod_name }} -c prometheus -- df -PBG /prometheus
Consider increasing the disk size: https://deckhouse.io/products/kubernetes-platform/documentation/v1/modules/prometheus/faq.html#how-to-expand-disk-size
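The df output from the command above can also be checked in a script. The following is a minimal sketch; both function names are hypothetical, and the default threshold of 95 is taken from the alert summary.

```shell
#!/bin/sh
# Reads `df -P` style output on stdin and prints the Capacity column
# (without the % sign) for the given mount point.
# Both function names are hypothetical helpers, not part of any tool.
disk_used_percent() {
  mount_point="$1"
  awk -v mp="$mount_point" 'NR > 1 && $NF == mp { sub(/%/, "", $5); print $5 }'
}

# Succeeds (exit code 0) when usage is at or above the threshold
# (default 95, matching the alert summary).
disk_over_threshold() {
  used="$1"
  threshold="${2:-95}"
  [ "$used" -ge "$threshold" ]
}

# Example (requires cluster access; -t is dropped so that the output has no
# carriage returns; <namespace> and <pod_name> are placeholders):
# kubectl -n <namespace> exec -i <pod_name> -c prometheus -- df -PBG /prometheus \
#   | disk_used_percent /prometheus
```

The parsing relies on the fixed column layout that the -P flag guarantees, so it should not be used on df output produced without that flag.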
-
PrometheusLongtermRotatingEarlierThanConfiguredRetentionDays
CE
S4
Prometheus-longterm data is being rotated earlier than configured retention days
You need to increase the disk size, reduce the number of metrics, or decrease the longtermRetentionDays module parameter.
-
PrometheusMainRotatingEarlierThanConfiguredRetentionDays
CE
S4
Prometheus-main data is being rotated earlier than configured retention days
You need to increase the disk size, reduce the number of metrics, or decrease the retentionDays module parameter.
-
PrometheusScapeConfigDeclarationDeprecated
CE
S8
AdditionalScrapeConfigs from secrets will be deprecated soon.
The old way of describing additional scrape configs via secrets will be deprecated in prometheus-operator > v0.65.1. Please use the ScrapeConfig CRD instead.
https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/proposals/202212-scrape-config.md
-
PrometheusServiceMonitorDeprecated
CE
S8
A deprecated Prometheus ServiceMonitor has been found.
The Kubernetes cluster uses a more advanced network mechanism, EndpointSlice. Your ServiceMonitor {{ $labels.namespace }}/{{ $labels.name }} has relabeling rules based on the old Endpoints mechanism (labels starting with __meta_kubernetes_endpoints_). Support for these relabeling rules, based on the _endpoint_ label, will be removed in the future (Deckhouse release 1.60). Please migrate to EndpointSlice relabeling rules. To do this, modify the ServiceMonitor by changing the following labels:
__meta_kubernetes_endpoints_name -> __meta_kubernetes_endpointslice_name
__meta_kubernetes_endpoints_label_XXX -> __meta_kubernetes_endpointslice_label_XXX
__meta_kubernetes_endpoints_labelpresent_XXX -> __meta_kubernetes_endpointslice_labelpresent_XXX
__meta_kubernetes_endpoints_annotation_XXX -> __meta_kubernetes_endpointslice_annotation_XXX
__meta_kubernetes_endpoints_annotationpresent_XXX -> __meta_kubernetes_endpointslice_annotationpresent_XXX
__meta_kubernetes_endpoint_node_name -> __meta_kubernetes_endpointslice_endpoint_topology_kubernetes_io_hostname
__meta_kubernetes_endpoint_ready -> __meta_kubernetes_endpointslice_endpoint_conditions_ready
__meta_kubernetes_endpoint_port_name -> __meta_kubernetes_endpointslice_port_name
__meta_kubernetes_endpoint_port_protocol -> __meta_kubernetes_endpointslice_port_protocol
__meta_kubernetes_endpoint_address_target_kind -> __meta_kubernetes_endpointslice_address_target_kind
__meta_kubernetes_endpoint_address_target_name -> __meta_kubernetes_endpointslice_address_target_name
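The label mapping above is mechanical, so it can be applied to an exported manifest with a text filter. The following is a minimal sketch, assuming the ServiceMonitor is available as a text file; the function name is hypothetical, and the result should be reviewed before it is applied to the cluster.

```shell
#!/bin/sh
# Rewrites the deprecated Endpoints-based label names to their EndpointSlice
# equivalents, following the mapping listed in the alert description.
# Reads text (for example, a ServiceMonitor manifest) on stdin and writes the
# result to stdout. The function name is a hypothetical helper.
migrate_endpoint_labels() {
  # The specific __meta_kubernetes_endpoint_* rules go first; the generic
  # __meta_kubernetes_endpoints_* prefix rule goes last.
  sed \
    -e 's/__meta_kubernetes_endpoint_node_name/__meta_kubernetes_endpointslice_endpoint_topology_kubernetes_io_hostname/g' \
    -e 's/__meta_kubernetes_endpoint_ready/__meta_kubernetes_endpointslice_endpoint_conditions_ready/g' \
    -e 's/__meta_kubernetes_endpoint_port_/__meta_kubernetes_endpointslice_port_/g' \
    -e 's/__meta_kubernetes_endpoint_address_target_/__meta_kubernetes_endpointslice_address_target_/g' \
    -e 's/__meta_kubernetes_endpoints_/__meta_kubernetes_endpointslice_/g'
}

# Example (file names are assumptions):
# migrate_endpoint_labels < servicemonitor.yaml > servicemonitor.migrated.yaml
```

Running the filter a second time leaves the output unchanged, because none of the new label names match the old patterns.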
-
TargetDown
CE
S5
Target is down
{{ $labels.job }} target is down.
-
TargetDown
CE
S6
Target is down
{{ $labels.job }} target is down.
-
TargetDown
CE
S7
Target is down
{{ $labels.job }} target is down.
-
TargetSampleLimitExceeded
CE
S6
Scrapes are exceeding sample limit
Targets are down because the sample limit was exceeded.
-
TargetSampleLimitExceeded
CE
S7
The sample limit is close.
The target is close to exceeding the sample limit: less than 10% is left before the limit is reached.
Module runtime-audit-engine
-
D8RuntimeAuditEngineNotScheduledInCluster
EE
S4
Pods of runtime-audit-engine cannot be scheduled in the cluster.
A number of runtime-audit-engine pods are not scheduled. Security audit is not fully operational.
Consider checking the state of the d8-runtime-audit-engine/runtime-audit-engine DaemonSet:
kubectl -n d8-runtime-audit-engine get daemonset,pod --selector=app=runtime-audit-engine
Get a list of nodes that have pods in a not Ready state:
kubectl -n {{$labels.namespace}} get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="{{$labels.daemonset}}")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
Module secret-copier
-
D8SecretCopierDeprecatedLabels
CE
S9
Obsolete antiopa_secret_copier=yes label has been found.
The secret-copier module has changed the service label for the original secrets in the default namespace.
Soon we will abandon the old antiopa-secret-copier: "yes" label.
You have to replace the antiopa-secret-copier: "yes" label with secret-copier.deckhouse.io/enabled: "" for all secrets that the secret-copier module uses in the default namespace.
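The label replacement can be prepared as a list of kubectl label commands and reviewed before anything is changed. The following is a minimal sketch; the function name and the OLD_LABEL_KEY variable are hypothetical, and the old label key is an assumption taken from the description above (the alert summary spells it with underscores, so check which form your secrets actually carry).

```shell
#!/bin/sh
# Reads secret names (one per line) on stdin and prints, without running them,
# the kubectl commands that would swap the old label for the new one.
# OLD_LABEL_KEY is an assumption taken from the alert text; adjust if needed.
OLD_LABEL_KEY="${OLD_LABEL_KEY:-antiopa-secret-copier}"
NEW_LABEL_KEY="secret-copier.deckhouse.io/enabled"

print_relabel_commands() {
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    # In `kubectl label`, "key-" removes a label and "key=" sets it to "".
    printf 'kubectl -n default label secret %s %s- %s=\n' \
      "$name" "$OLD_LABEL_KEY" "$NEW_LABEL_KEY"
  done
}

# Example (requires cluster access, so it is left commented out):
# kubectl -n default get secret -l "$OLD_LABEL_KEY=yes" -o name \
#   | sed 's|^secret/||' | print_relabel_commands
```

Printing the commands instead of executing them keeps the step reversible: the output can be inspected and then piped to sh once it looks right.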
Module snapshot-controller
-
D8SnapshotControllerPodIsNotReady
CE
S8
The snapshot-controller Pod is NOT Ready.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-snapshot-controller describe deploy snapshot-controller
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-snapshot-controller describe pod -l app=snapshot-controller
-
D8SnapshotControllerPodIsNotRunning
CE
S8
The snapshot-controller Pod is NOT Running.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-snapshot-controller describe deploy snapshot-controller
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-snapshot-controller describe pod -l app=snapshot-controller
-
D8SnapshotControllerTargetAbsent
CE
S8
There is no snapshot-controller target in Prometheus.
The recommended course of action:
- Check the Pod status:
kubectl -n d8-snapshot-controller get pod -l app=snapshot-controller
- Or check the Pod logs:
kubectl -n d8-snapshot-controller logs -l app=snapshot-controller -c snapshot-controller
-
D8SnapshotControllerTargetDown
CE
S8
Prometheus cannot scrape the snapshot-controller metrics.
The recommended course of action:
- Check the Pod status:
kubectl -n d8-snapshot-controller get pod -l app=snapshot-controller
- Or check the Pod logs:
kubectl -n d8-snapshot-controller logs -l app=snapshot-controller -c snapshot-controller
-
D8SnapshotValidationWebhookPodIsNotReady
CE
S8
The snapshot-validation-webhook Pod is NOT Ready.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-snapshot-controller describe deploy snapshot-validation-webhook
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-snapshot-controller describe pod -l app=snapshot-validation-webhook
-
D8SnapshotValidationWebhookPodIsNotRunning
CE
S8
The snapshot-validation-webhook Pod is NOT Running.
The recommended course of action:
- Retrieve details of the Deployment:
kubectl -n d8-snapshot-controller describe deploy snapshot-validation-webhook
- View the status of the Pod and try to figure out why it is not running:
kubectl -n d8-snapshot-controller describe pod -l app=snapshot-validation-webhook
Module terraform-manager
-
D8TerraformStateExporterClusterStateChanged
CE
S8
Terraform-state-exporter cluster state changed
The real Kubernetes cluster state is {{ $labels.status }} compared to the Terraform state.
It’s important to make them equal. First, run the dhctl terraform check command to see what will change. To converge the state of the Kubernetes cluster, use the dhctl converge command.
-
D8TerraformStateExporterClusterStateError
CE
S8
Terraform-state-exporter cluster state error
Terraform-state-exporter can’t check the difference between the Kubernetes cluster state and the Terraform state.
This probably occurred because Terraform-state-exporter failed to run terraform with the current state and config. First, run the dhctl terraform check command to see what will change. To converge the state of the Kubernetes cluster, use the dhctl converge command.
-
D8TerraformStateExporterHasErrors
CE
S8
Terraform-state-exporter has errors
Errors occurred while terraform-state-exporter was working.
Check the pod logs for more details:
kubectl -n d8-system logs -l app=terraform-state-exporter -c exporter
-
D8TerraformStateExporterNodeStateChanged
CE
S8
Terraform-state-exporter node state changed
The real state of Node {{ $labels.node_group }}/{{ $labels.name }} is {{ $labels.status }} compared to the Terraform state.
It’s important to make them equal. First, run the dhctl terraform check command to see what will change. To converge the state of the Kubernetes cluster, use the dhctl converge command.
-
D8TerraformStateExporterNodeStateError
CE
S8
Terraform-state-exporter node state error
Terraform-state-exporter can’t check the difference between the state of Node {{ $labels.node_group }}/{{ $labels.name }} and the Terraform state.
This probably occurred because Terraform-manager failed to run terraform with the current state and config. First, run the dhctl terraform check command to see what will change. To converge the state of the Kubernetes cluster, use the dhctl converge command.
-
D8TerraformStateExporterNodeTemplateChanged
CE
S8
Terraform-state-exporter node template changed
Terraform-state-exporter found a difference between the node template from the cluster provider configuration and the one from NodeGroup {{ $labels.name }}. The node template is {{ $labels.status }}.
First, run the dhctl terraform check command to see what will change. Use the dhctl converge command or manually adjust the NodeGroup settings to fix the issue.
-
D8TerraformStateExporterPodIsNotReady
CE
S8
Pod terraform-state-exporter is not Ready
Terraform-state-exporter doesn’t check the difference between real Kubernetes cluster state and Terraform state.
Please check:
- Deployment description:
kubectl -n d8-system describe deploy terraform-state-exporter
- Pod status:
kubectl -n d8-system describe pod -l app=terraform-state-exporter
-
D8TerraformStateExporterPodIsNotRunning
CE
S8
Pod terraform-state-exporter is not Running
Terraform-state-exporter doesn’t check the difference between real Kubernetes cluster state and Terraform state.
Please check:
- Deployment description:
kubectl -n d8-system describe deploy terraform-state-exporter
- Pod status:
kubectl -n d8-system describe pod -l app=terraform-state-exporter
-
D8TerraformStateExporterTargetAbsent
CE
S8
Prometheus has no terraform-state-exporter target
To get more details, check the pod state:
kubectl -n d8-system get pod -l app=terraform-state-exporter
or the logs:
kubectl -n d8-system logs -l app=terraform-state-exporter -c exporter
-
D8TerraformStateExporterTargetDown
CE
S8
Prometheus can't scrape terraform-state-exporter
To get more details, check the pod state:
kubectl -n d8-system get pod -l app=terraform-state-exporter
or the logs:
kubectl -n d8-system logs -l app=terraform-state-exporter -c exporter
Module upmeter
-
D8SmokeMiniNotBoundPersistentVolumeClaims
CE
S9
Smoke-mini has unbound or lost persistent volume claims.
{{ $labels.persistentvolumeclaim }} persistent volume claim status is {{ $labels.phase }}.
There is a problem with PV provisioning. Check the status of the PVC to find the problem:
kubectl -n d8-upmeter get pvc {{ $labels.persistentvolumeclaim }}
If you have no disk provisioning system in the cluster, you can disable ordering volumes for smoke-mini through the module settings.
-
D8UpmeterAgentPodIsNotReady
CE
S6
Upmeter agent is not Ready
-
One or more Upmeter agent pods is NOT Running
Check DaemonSet status:
kubectl -n d8-upmeter get daemonset upmeter-agent -o json | jq .status
Check the status of its pod:
kubectl -n d8-upmeter get pods -l app=upmeter-agent -o json | jq '.items[] | {(.metadata.name):.status}'
-
D8UpmeterProbeGarbageConfigmap
CE
S9
Garbage produced by basic probe is not being cleaned.
Probe configmaps found.
Upmeter agents should clean ConfigMaps produced by control-plane/basic probe. There should not be more configmaps than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.
This might be an indication of a problem with kube-apiserver. Or, possibly, the configmaps were left by old upmeter-agent pods due to Upmeter update.
- Check upmeter-agent logs
kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "basic-functionality") | [.time, .level, .msg] | @tsv'
-
Check that control plane is functional.
-
Delete configmaps manually:
kubectl -n d8-upmeter delete cm -l heritage=upmeter
-
D8UpmeterProbeGarbageDeployment
CE
S9
Garbage produced by controller-manager probe is not being cleaned.
Average probe deployments count per upmeter-agent pod: {{ $value }}.
Upmeter agents should clean Deployments produced by control-plane/controller-manager probe. There should not be more deployments than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.
This might be an indication of a problem with kube-apiserver. Or, possibly, the deployments were left by old upmeter-agent pods due to Upmeter update.
- Check upmeter-agent logs
kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'
-
Check that control plane is functional, kube-controller-manager in particular.
-
Delete deployments manually:
kubectl -n d8-upmeter delete deploy -l heritage=upmeter
-
D8UpmeterProbeGarbageNamespaces
CE
S9
Garbage produced by namespace probe is not being cleaned.
Average probe namespace per upmeter-agent pod: {{ $value }}.
Upmeter agents should clean namespaces produced by control-plane/namespace probe. There should not be more of these namespaces than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.
This might be an indication of a problem with kube-apiserver. Or, possibly, the namespaces were left by old upmeter-agent pods due to Upmeter update.
- Check upmeter-agent logs
kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "namespace") | [.time, .level, .msg] | @tsv'
-
Check that control plane is functional.
-
Delete namespaces manually:
kubectl -n d8-upmeter delete ns -l heritage=upmeter
-
D8UpmeterProbeGarbagePods
CE
S9
Garbage produced by scheduler probe is not being cleaned.
Average probe pods count per upmeter-agent pod: {{ $value }}.
Upmeter agents should clean Pods produced by control-plane/scheduler probe. There should not be more of these pods than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.
This might be an indication of a problem with kube-apiserver. Or, possibly, the pods were left by old upmeter-agent pods due to Upmeter update.
- Check upmeter-agent logs
kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "scheduler") | [.time, .level, .msg] | @tsv'
-
Check that control plane is functional.
-
Delete pods manually:
kubectl -n d8-upmeter delete po -l upmeter-probe=scheduler
-
D8UpmeterProbeGarbagePodsFromDeployments
CE
S9
Garbage produced by controller-manager probe is not being cleaned.
Average probe pods count per upmeter-agent pod: {{ $value }}.
Upmeter agents should clean Deployments produced by control-plane/controller-manager probe, and hence kube-controller-manager should clean their pods. There should not be more of these pods than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.
This might be an indication of a problem with kube-apiserver or kube-controller-manager. Or, probably, the pods were left by old upmeter-agent pods due to Upmeter update.
- Check upmeter-agent logs
kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'
-
Check that control plane is functional, kube-controller-manager in particular.
-
Delete pods manually:
kubectl -n d8-upmeter delete po -l upmeter-probe=controller-manager
-
D8UpmeterProbeGarbageSecretsByCertManager
CE
S9
Garbage produced by cert-manager probe is not being cleaned.
Probe secrets found.
Upmeter agents should clean certificates, and thus the secrets produced by cert-manager should be cleaned, too. There should not be more secrets than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.
This might be an indication of a problem with kube-apiserver, cert-manager, or upmeter itself. It is also possible that the secrets were left by old upmeter-agent pods due to Upmeter update.
- Check upmeter-agent logs
kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "cert-manager") | [.time, .level, .msg] | @tsv'
-
Check that control plane and cert-manager are functional.
-
Delete certificates manually, and secrets, if needed:
kubectl -n d8-upmeter delete certificate -l upmeter-probe=cert-manager
kubectl -n d8-upmeter get secret -ojson | jq -r '.items[] | .metadata.name' | grep upmeter-cm-probe | xargs -n 1 -- kubectl -n d8-upmeter delete secret
-
D8UpmeterServerPodIsNotReady
CE
S6
Upmeter server is not Ready
-
D8UpmeterServerPodIsRestartingTooOften
CE
S9
Upmeter server is restarting too often.
Restarts for the last hour: {{ $value }}.
Upmeter server should not restart too often. It should always be running and collecting episodes. Check its logs to find the problem:
kubectl -n d8-upmeter logs -f upmeter-0 upmeter
-
One or more Upmeter server pods is NOT Running
Check StatefulSet status:
kubectl -n d8-upmeter get statefulset upmeter -o json | jq .status
Check the status of its pod:
kubectl -n d8-upmeter get pods upmeter-0 -o json | jq '{(.metadata.name):.status}'
-
D8UpmeterSmokeMiniMoreThanOnePVxPVC
CE
S9
Unnecessary smoke-mini volumes in cluster
The number of unnecessary smoke-mini PVs: {{ $value }}.
Smoke-mini PVs should be deleted when released. Probably the smoke-mini storage class has the Retain policy by default, or there is a CSI/cloud issue.
These PVs have no valuable data on them and should be deleted.
The list of PVs:
kubectl get pv | grep disk-smoke-mini
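To narrow the list down to volumes that are no longer in use, the output can be filtered by the STATUS column. The following is a minimal sketch; the function name is hypothetical, and the column order is an assumption based on the default kubectl get pv layout.

```shell
#!/bin/sh
# Reads `kubectl get pv --no-headers` output on stdin and prints the names of
# smoke-mini PVs whose STATUS column is "Released".
# Assumed column order: NAME CAPACITY ACCESS-MODES RECLAIM-POLICY STATUS CLAIM ...
# The function name is a hypothetical helper.
released_smoke_mini_pvs() {
  awk '/disk-smoke-mini/ && $5 == "Released" { print $1 }'
}

# Example (requires cluster access, so it is left commented out):
# kubectl get pv --no-headers | released_smoke_mini_pvs
```

Review the resulting names before deleting anything, since the filter only matches on the text of each row.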
-
D8UpmeterTooManyHookProbeObjects
CE
S9
Too many UpmeterHookProbe objects in cluster
Average UpmeterHookProbe count per upmeter-agent pod is {{ $value }}, but should be strictly 1.
Some of the objects were left by old upmeter-agent pods due to Upmeter update or downscale.
Once the reason is investigated, leave only the newest objects corresponding to upmeter-agent pods.
See:
kubectl get upmeterhookprobes.deckhouse.io
Module user-authn
-
D8DexAllTargetsDown
CE
S6
Prometheus is unable to scrape Dex metrics.