This page lists all monitoring alerts in the Deckhouse Kubernetes Platform (DKP).

Alerts are grouped by module. Icons to the right of the alert name indicate the minimum DKP edition in which the alert is available and the alert severity.

Each alert includes a summary; if a detailed description is available, expand the alert to view it.

Alert severity

Alert descriptions contain the Severity (S) parameter, which indicates how critical the alert is. Its value ranges from S1 to S9 and can be interpreted as follows:

  • S1 — maximum level, critical failure/crash (immediate action required);
  • S2 — high level, close to maximum, possible accident (rapid response required);
  • S3 — medium level, potentially serious problem (verification required);
  • S4-S9 — low level. There is a problem, but overall performance is not impaired.

The severity level is determined by combining likelihood and impact, as shown in the following table:

Likelihood / Impact   Deadly   Catastrophic   Critical   Marginal   Negligible
Certain               S1       S2             S3         S4         S5
Likely                S2       S3             S4         S5         S6
Possible              S3       S4             S5         S6         S7
Unlikely              S4       S5             S6         S7         S8
Rare                  S5       S6             S7         S8         S9

Here:

  • Likelihood:
    • Certain — the incident has already occurred;
    • Likely — the incident almost occurred;
    • Possible — the incident has a moderate probability of occurring;
    • Unlikely — the incident has a low probability of occurring;
    • Rare — the probability of an incident occurring is nearly zero.
  • Impact:
    • Deadly — a complete loss of a system component that will take a long time to recover. The component is critical for the health of the entire system. Catastrophic data loss for an indefinite period;
    • Catastrophic — all or almost all users of the system noticed the failure;
    • Critical — a significant percentage of users report problems with the system;
    • Marginal — a minimal percentage of users experience periodic or permanent failures;
    • Negligible — diagnostic information. Users do not notice any anomalies in the system operation.

Module admission-policy-engine

  • D8AdmissionPolicyEngineNotBootstrapped CE S7
    Admission-policy-engine module hasn't been bootstrapped for 10 minutes.

    The admission-policy-engine module couldn’t bootstrap.

    Steps to troubleshoot:

    1. Verify that the module’s components are up and running:

      kubectl get pods -n d8-admission-policy-engine
      
    2. Check logs for issues, such as missing constraint templates or incomplete CRD creation:

      kubectl logs -n d8-system -lapp=deckhouse --tail=1000 | grep admission-policy-engine
      
  • OperationPolicyViolation CE S7
    At least one object violates the configured cluster operation policies.

    You have configured operation policies for the cluster, and one or more existing objects are violating these policies.

    To identify violating objects:

    • Run the following Prometheus query:

      count by (violating_namespace, violating_kind, violating_name, violation_msg) (
        d8_gatekeeper_exporter_constraint_violations{
          violation_enforcement="deny",
          source_type="OperationPolicy"
        }
      )
      
    • Alternatively, check the admission-policy-engine Grafana dashboard.

  • PodSecurityStandardsViolation CE S7
    At least one pod violates the configured cluster pod security standards.

    You have configured Pod Security Standards, and one or more running pods are violating these standards.

    To identify violating pods:

    • Run the following Prometheus query:

      count by (violating_namespace, violating_name, violation_msg) (
        d8_gatekeeper_exporter_constraint_violations{
          violation_enforcement="deny",
          violating_namespace=~".*",
          violating_kind="Pod",
          source_type="PSS"
        }
      )
      
    • Alternatively, check the admission-policy-engine Grafana dashboard.

  • SecurityPolicyViolation CE S7
    At least one object violates the configured cluster security policies.

    You have configured security policies for the cluster, and one or more existing objects are violating these policies.

    To identify violating objects:

    • Run the following Prometheus query:

      count by (violating_namespace, violating_kind, violating_name, violation_msg) (
        d8_gatekeeper_exporter_constraint_violations{
          violation_enforcement="deny",
          source_type="SecurityPolicy"
        }
      )
      
    • Alternatively, check the admission-policy-engine Grafana dashboard.

Module cert-manager

  • CertmanagerCertificateExpired CE S4
    Certificate NAMESPACE/NAME is not provided.

    Certificate is not provided.

    To check the certificate details, run the following command:

    kubectl -n NAMESPACE describe certificate NAME
    
  • CertmanagerCertificateExpiredSoon CE S4
    Certificate will expire soon.

    The certificate NAMESPACE/NAME will expire in less than two weeks.

    To check the certificate details, run the following command:

    kubectl -n NAMESPACE describe certificate CERTIFICATE-NAME
    
  • CertmanagerCertificateOrderErrors CE S5
    Cert-manager couldn't order a certificate.

    Cert-manager received responses with the status code STATUS_REFERENCE when requesting SCHEME://HOST_NAME/PATH.

    This can affect certificate ordering and prolongation in the future. For details, check the cert-manager logs using the following command:

    kubectl -n d8-cert-manager logs -l app=cert-manager -c cert-manager
    

Module chrony

  • NodeTimeOutOfSync CE S5
    Clock on the node NODE_NAME is drifting.

    Time on the node NODE_NAME is out of sync and drifts apart from the NTP server clock by VALUE seconds.

    To resolve the time synchronization issues:

    • Fix network errors:
      • Ensure the upstream time synchronization servers defined in the chrony configuration are available.
      • Eliminate large packet loss and excessive latency to upstream time synchronization servers.
    • Modify the NTP servers list defined in the chrony configuration (see the example below).
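
    If the NTP server list is managed through the chrony ModuleConfig in your setup (a sketch; the exact parameter name may differ between DKP versions), review and edit the module configuration:

    kubectl get mc chrony -o yaml
    kubectl edit mc chrony
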
  • NTPDaemonOnNodeDoesNotSynchronizeTime CE S5
    NTP daemon on the node NODE_NAME hasn't synchronized time for too long.

    Steps to troubleshoot:

    1. Check if the chrony pod is running on the node by executing the following command:

      kubectl -n d8-chrony --field-selector spec.nodeName="NODE_NAME" get pods
      
    2. Verify the chrony daemon’s status by executing the following command:

      kubectl -n d8-chrony exec POD_NAME -- /opt/chrony-static/bin/chronyc sources
      
    3. Resolve the time synchronization issues:

      • Fix network errors:
        • Ensure the upstream time synchronization servers defined in the chrony configuration are available.
        • Eliminate large packet loss and excessive latency to upstream time synchronization servers.
      • Modify the NTP servers list defined in the chrony configuration.

Module cloud-provider-yandex

  • D8YandexNatInstanceConnectionsQuotaUtilization CE S4
    Connection quota utilization of the Yandex NAT instance exceeds 85% over the last 5 minutes.

    The connection quota for the Yandex NAT instance has exceeded 85% utilization over the past 5 minutes.

    To prevent potential issues, contact Yandex technical support and request an increase in the connection quota.

  • NATInstanceWithDeprecatedAvailabilityZone CE S9
    NAT instance NAME is in a deprecated availability zone.

    The NAT instance NAME is located in the availability zone ru-central1-c, which has been deprecated by Yandex Cloud. To resolve this issue, migrate the NAT instance to either ru-central1-a or ru-central1-b by following the instructions below.

    The migration process involves irreversible changes and may result in significant downtime. The duration depends mainly on Yandex Cloud’s response time and can reach tens of minutes.

    1. Migrate the NAT instance. To get providerClusterConfiguration.withNATInstance, run the following command:

      kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerClusterConfiguration.withNATInstance'
      
      • If you specified withNATInstance.natInstanceInternalAddress and/or withNATInstance.internalSubnetID in providerClusterConfiguration, remove them using the following command:

        kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller edit provider-cluster-configuration
        
      • If you specified withNATInstance.externalSubnetID and/or withNATInstance.natInstanceExternalAddress in providerClusterConfiguration, change them to the appropriate values.

        To get the address and subnet ID, use the Yandex Cloud Console or CLI.

        To change withNATInstance.externalSubnetID and withNATInstance.natInstanceExternalAddress, run the following command:

        kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller edit provider-cluster-configuration
        
    2. Run the appropriate edition and version of the Deckhouse installer container on the local machine. You may have to change the container registry address to do that. After that, perform the converge.

      • Get the appropriate edition and version of Deckhouse:

        DH_VERSION=$(kubectl -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/version}')
        DH_EDITION=$(kubectl -n d8-system get deployment deckhouse -o jsonpath='{.metadata.annotations.core\.deckhouse\.io\/edition}' | tr '[:upper:]' '[:lower:]')
        echo "DH_VERSION=$DH_VERSION DH_EDITION=$DH_EDITION"
        
      • Run the installer:

        docker run --pull=always -it -v "$HOME/.ssh/:/tmp/.ssh/" registry.deckhouse.io/deckhouse/${DH_EDITION}/install:${DH_VERSION} bash
        
      • Perform the converge:

        dhctl converge --ssh-agent-private-keys=/tmp/.ssh/SSH_KEY_FILENAME --ssh-user=USERNAME --ssh-host MASTER-NODE-0-HOST
        
    3. Update the route table.

      • Get the route table name:

        kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.global.clusterConfiguration.cloud.prefix'
        
      • Get the NAT instance name:

        kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller module values -g cloud-provider-yandex -o json | jq -c | jq '.cloudProviderYandex.internal.providerDiscoveryData.natInstanceName'
        
      • Get the NAT instance internal IP address:

        yc compute instance list | grep -e "INTERNAL IP" -e NAT_INSTANCE_NAME_FROM_PREVIOUS_STEP
        
      • Update the route:

        yc vpc route-table update --name ROUTE_TABLE_NAME_FROM_PREVIOUS_STEP --route "destination=0.0.0.0/0,next-hop=NAT_INSTANCE_INTERNAL_IP_FROM_PREVIOUS_STEP"
        
  • NodeGroupNodeWithDeprecatedAvailabilityZone CE S9
    NodeGroup NODE_GROUP_NAME contains nodes in a deprecated availability zone.

    Certain nodes in the node group NODE_GROUP_NAME are located in the availability zone ru-central1-c, which has been deprecated by Yandex Cloud.

    Steps to troubleshoot:

    1. Identify the nodes that need to be migrated by running the following command:

      kubectl get node -l "topology.kubernetes.io/zone=ru-central1-c"
      
    2. Migrate your nodes, disks, and load balancers to one of the supported zones: ru-central1-a, ru-central1-b, or ru-central1-d. Refer to the Yandex migration guide for detailed instructions.

      You can’t migrate public IP addresses between zones. For details, refer to the migration guide.

Module cni-cilium

  • CiliumAgentEndpointsNotReady CE S4
    Over 50% of all known endpoints aren't ready in agent NAMESPACE/POD_NAME.

    For details, refer to the logs of the agent:

    kubectl -n NAMESPACE logs POD_NAME
    
  • CiliumAgentMapPressureCritical CE S4
    eBPF map MAP_NAME exceeds 90% utilization in agent NAMESPACE/POD_NAME.

    The eBPF map resource utilization limit has almost been reached.

    Check with the vendor for potential remediation steps.

  • CiliumAgentMetricNotFound CE S4
    Agent NAMESPACE/POD_NAME isn't sending some metrics.

    Steps to troubleshoot:

    1. Check the logs of the agent:

      kubectl -n NAMESPACE logs POD_NAME
      
    2. Verify the agent’s health status:

      kubectl -n NAMESPACE exec -ti POD_NAME -- cilium-health status
      
    3. Compare the metrics with those of a neighboring agent.

    Note that the absence of metrics can indirectly indicate that new pods can’t be created on the node due to connectivity issues with the agent.

  • CiliumAgentPolicyImportErrors CE S4
    Agent NAMESPACE/POD_NAME fails to import policies.

    For details, refer to the logs of the agent:

    kubectl -n NAMESPACE logs POD_NAME
    
  • CiliumAgentUnreachableHealthEndpoints CE S4
    Agent NAMESPACE/POD_NAME can't reach some of the node health endpoints.

    For details, refer to the logs of the agent:

    kubectl -n NAMESPACE logs POD_NAME
    
  • CniCiliumNonStandardVXLANPortFound CE S4
    Cilium configuration uses a non-standard VXLAN port.

    The Cilium configuration specifies a non-standard VXLAN port PORT_NUMBER. This port falls outside the recommended range:

    • 4298: When the virtualization module is enabled.
    • 4299: For a standard Deckhouse setup.

    To resolve this issue, update the tunnel-port parameter in the cilium-configmap ConfigMap located in the d8-cni-cilium namespace to match the recommended range.
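
    To inspect and change the current value (a sketch; the ConfigMap and key names are taken from this alert's description and should be verified in your cluster), run:

    kubectl -n d8-cni-cilium get configmap cilium-configmap -o jsonpath='{.data.tunnel-port}'
    kubectl -n d8-cni-cilium edit configmap cilium-configmap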

    If you configured the non-standard port on purpose, ignore this alert.

  • CniCiliumOrphanEgressGatewayPolicyFound SE-PLUS S4
    Orphaned EgressGatewayPolicy with an irrelevant EgressGateway name has been found.

    The cluster contains an orphaned EgressGatewayPolicy named NAME with an irrelevant EgressGateway name.

    To resolve this issue, verify the EgressGateway name specified in the EgressGatewayPolicy resource EGRESSGATEWAY and update it as needed.
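
    A minimal check, assuming the EgressGateway and EgressGatewayPolicy resources belong to the network.deckhouse.io API group: compare the name referenced by the policy with the EgressGateways that exist in the cluster:

    kubectl get egressgatewaypolicies.network.deckhouse.io NAME -o jsonpath='{.spec.egressGatewayName}'
    kubectl get egressgateways.network.deckhouse.io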

  • D8CNIMisconfigured CE S5
    Settings in the secret d8-cni-configuration conflict with the ModuleConfig.

    Steps to troubleshoot:

    1. Find the desired settings in the ConfigMap d8-system/desired-cni-moduleconfig by running the following command:

      kubectl -n d8-system get configmap desired-cni-moduleconfig -o yaml
      
    2. Update the conflicting settings in the CNI CNI_NAME ModuleConfig to match the desired configuration.
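
      For example (a sketch; CNI_NAME is the placeholder used above, and mc is the short name for ModuleConfig), edit the module configuration directly:

      kubectl edit mc CNI_NAME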

Module cni-flannel

  • D8CNIMisconfigured CE S5
    Settings in the secret d8-cni-configuration conflict with the ModuleConfig.

    Steps to troubleshoot:

    1. Find the desired settings in the ConfigMap d8-system/desired-cni-moduleconfig by running the following command:

      kubectl -n d8-system get configmap desired-cni-moduleconfig -o yaml
      
    2. Update the conflicting settings in the CNI CNI_NAME ModuleConfig to match the desired configuration.

Module cni-simple-bridge

  • D8CNIMisconfigured CE S5
    Settings in the secret d8-cni-configuration conflict with the ModuleConfig.

    Steps to troubleshoot:

    1. Find the desired settings in the ConfigMap d8-system/desired-cni-moduleconfig by running the following command:

      kubectl -n d8-system get configmap desired-cni-moduleconfig -o yaml
      
    2. Update the conflicting settings in the CNI CNI_NAME ModuleConfig to match the desired configuration.

Module control-plane-manager

  • D8ControlPlaneManagerPodNotRunning CE S6
    Controller pod isn't running on node NODE_NAME.

    The d8-control-plane-manager pod is either failing or hasn’t been scheduled on node NODE_NAME.

    To resolve this issue, check the status of the kube-system/d8-control-plane-manager DaemonSet and its pods by running the following command:

    kubectl -n kube-system get daemonset,pod --selector=app=d8-control-plane-manager
    
  • D8EtcdDatabaseHighFragmentationRatio CE S7
    etcd database size in use is less than 50% of the allocated storage.

    The etcd database size in use on instance INSTANCE_NAME is less than 50% of the allocated disk space, indicating potential fragmentation. Additionally, the total storage size exceeds 75% of the configured quota.

    To resolve this issue, defragment the etcd database by running the following command:

    kubectl -n kube-system exec -ti etcd-NODE_NAME -- /usr/bin/etcdctl \
      --cacert /etc/kubernetes/pki/etcd/ca.crt \
      --cert /etc/kubernetes/pki/etcd/ca.crt \
      --key /etc/kubernetes/pki/etcd/ca.key \
      --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s
    
  • D8EtcdExcessiveDatabaseGrowth CE S4
    etcd cluster database is growing rapidly.

    Based on the growth rate observed over the last six hours, Deckhouse predicts that the etcd database will run out of disk space within one day on instance INSTANCE_NAME.

    To prevent disruptions, investigate the cause and take necessary action.
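
    To see how much space the database currently uses and how close it is to the quota, check the endpoint status (a sketch that reuses the certificate paths from the defragmentation examples in this section; replace NODE_NAME with the affected control plane node):

    kubectl -n kube-system exec -ti etcd-NODE_NAME -- /usr/bin/etcdctl \
      --cacert /etc/kubernetes/pki/etcd/ca.crt \
      --cert /etc/kubernetes/pki/etcd/ca.crt \
      --key /etc/kubernetes/pki/etcd/ca.key \
      --endpoints https://127.0.0.1:2379/ endpoint status -w table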

  • D8KubeEtcdDatabaseSizeCloseToTheLimit CE S3
    etcd database size is approaching the limit.

    The etcd database size on NODE_NAME is nearing its size limit. This may be caused by a high number of events, such as pod evictions or the recent creation of numerous resources in the cluster.

    Possible solutions:

    • Defragment the etcd database by running the following command:

      kubectl -n kube-system exec -ti etcd-NODE_NAME -- /usr/bin/etcdctl \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/ca.crt \
        --key /etc/kubernetes/pki/etcd/ca.key \
        --endpoints https://127.0.0.1:2379/ defrag --command-timeout=30s
      
    • Increase node memory. Starting from 24 GB, quota-backend-bytes will increase by 1 GB for every extra 8 GB of memory.

      Example:

      Node memory   quota-backend-bytes
      16 GB         2147483648 (2 GB)
      24 GB         3221225472 (3 GB)
      32 GB         4294967296 (4 GB)
      40 GB         5368709120 (5 GB)
      48 GB         6442450944 (6 GB)
      56 GB         7516192768 (7 GB)
      64 GB         8589934592 (8 GB)
      72 GB         8589934592 (8 GB)
  • D8KubernetesVersionIsDeprecated CE S7
    Kubernetes version VERSION_NUMBER is deprecated.

    The current Kubernetes version VERSION_NUMBER has been deprecated, and support for it will be removed in upcoming releases.

    To resolve this issue, migrate to Kubernetes version 1.28 or later.

    Refer to the Kubernetes upgrade guide for instructions.
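
    In DKP, the Kubernetes version is typically set via the kubernetesVersion parameter of the ClusterConfiguration. As a sketch, you can review and change it as follows:

    kubectl -n d8-system exec -ti svc/deckhouse-leader -c deckhouse -- deckhouse-controller edit cluster-configuration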

  • D8NeedDecreaseEtcdQuotaBackendBytes CE S6
    Deckhouse suggests reducing quota-backend-bytes.

    When the control plane node memory is reduced, Deckhouse may suggest reducing quota-backend-bytes. While Deckhouse is capable of automatically increasing this value, reducing it must be done manually.

    To modify quota-backend-bytes, set the controlPlaneManager.etcd.maxDbSize parameter. Before setting a new value, check the current database usage on every control plane node by running:

    for pod in $(kubectl get pod -n kube-system -l component=etcd,tier=control-plane -o name); do
      kubectl -n kube-system exec -ti "$pod" -- /usr/bin/etcdctl \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/ca.crt \
        --key /etc/kubernetes/pki/etcd/ca.key \
        endpoint status -w json | jq --arg a "$pod" -r \
        '.[0].Status.dbSize / 1024 / 1024 | tostring | $a + ": " + . + " MB"';
    done
    

    Things to note:

    • The maximum value for controlPlaneManager.etcd.maxDbSize is 8 GB.
    • If control plane nodes have less than 24 GB, set controlPlaneManager.etcd.maxDbSize to 2 GB.
    • Starting from 24 GB, quota-backend-bytes will increase by 1 GB for every extra 8 GB of memory.

      Example:

      Node memory   quota-backend-bytes
      16 GB         2147483648 (2 GB)
      24 GB         3221225472 (3 GB)
      32 GB         4294967296 (4 GB)
      40 GB         5368709120 (5 GB)
      48 GB         6442450944 (6 GB)
      56 GB         7516192768 (7 GB)
      64 GB         8589934592 (8 GB)
      72 GB         8589934592 (8 GB)
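
    As an illustration of setting the controlPlaneManager.etcd.maxDbSize parameter mentioned above (a sketch; 2147483648 corresponds to 2 GB):

    # Open the module configuration and set spec.settings.etcd.maxDbSize,
    # for example, to 2147483648 (2 GB):
    kubectl edit mc control-plane-manager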

Module documentation

  • ModuleConfigDeprecated CE S9
    Deprecated deckhouse-web ModuleConfig detected.

    The deckhouse-web module has been renamed to documentation, and a new documentation ModuleConfig is generated automatically.

    Steps to troubleshoot:

    1. Remove the deprecated deckhouse-web ModuleConfig from the CI deployment process.
    2. Delete it using the following command:

      kubectl delete mc deckhouse-web
      

Module extended-monitoring

  • CertificateSecretExpired CE S8
    Certificate has expired.

    A certificate in Secret NAMESPACE/NAME has expired.

    Ways to resolve:

    • If the certificate is managed manually, upload a new certificate.
    • If the certificate is managed by the cert-manager module, inspect the certificate resource:
      1. Retrieve the certificate name from the Secret:

        cert=$(kubectl get secret -n NAMESPACE NAME -o 'jsonpath={.metadata.annotations.cert-manager\.io/certificate-name}')
        
      2. Check the certificate status and investigate why it hasn’t been updated:

        kubectl describe cert -n NAMESPACE "$cert"
        
  • CertificateSecretExpiredSoon CE S8
    Certificate is expiring soon.

    A certificate in Secret NAMESPACE/NAME will expire in less than two weeks.

    Ways to resolve:

    • If the certificate is managed manually, upload a new certificate.
    • If the certificate is managed by the cert-manager module, inspect the certificate resource:
      1. Retrieve the certificate name from the Secret:

        cert=$(kubectl get secret -n NAMESPACE NAME -o 'jsonpath={.metadata.annotations.cert-manager\.io/certificate-name}')
        
      2. Check the certificate status and investigate why it hasn’t been updated:

        kubectl describe cert -n NAMESPACE "$cert"
        
  • CronJobAuthenticationFailure CE S7
    Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The CronJob NAME.
    • The CONTAINER_NAME container in the registry.
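
    For example (a sketch; PULL_SECRET_NAME is a hypothetical placeholder for the secret referenced by the CronJob), verify which pull secret is used and that it decodes to valid registry credentials:

    kubectl -n NAMESPACE get cronjob NAME -o jsonpath='{.spec.jobTemplate.spec.template.spec.imagePullSecrets}'
    kubectl -n NAMESPACE get secret PULL_SECRET_NAME -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
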
  • CronJobAuthorizationFailure CE S7
    Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The CronJob NAME.
    • The CONTAINER_NAME container in the registry.
  • CronJobBadImageFormat CE S7
    The IMAGE_NAME image name is incorrect.

    Deckhouse has detected that the IMAGE_NAME image name is incorrect.

    To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:

    • The NAMESPACE namespace.
    • The CronJob NAME.
    • The CONTAINER_NAME container in the registry.
  • CronJobFailed CE S5
    Job NAMESPACE/JOB_NAME failed in CronJob NAMESPACE/CRONJOB.

    Deckhouse has detected that Job NAMESPACE/JOB_NAME failed in CronJob NAMESPACE/CRONJOB.

    Steps to resolve:

    1. Print the job details:

      kubectl -n NAMESPACE describe job JOB_NAME
      
    2. Check the job status:

      kubectl -n NAMESPACE get job JOB_NAME
      
    3. Check the status of pods created by the job:

      kubectl -n NAMESPACE get pods -l job-name=JOB_NAME
      
  • CronJobImageAbsent CE S7
    The IMAGE_NAME image is missing from the registry.

    Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

    To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:

    • The NAMESPACE namespace.
    • The CronJob NAME.
    • The CONTAINER_NAME container in the registry.
  • CronJobPodsNotCreated CE S5
    Pods set in CronJob NAMESPACE/CRONJOB haven't been created.

    Deckhouse has detected that the pods set in CronJob NAMESPACE/CRONJOB still haven’t been created.

    Steps to resolve:

    1. Print the job details:

      kubectl -n NAMESPACE describe job JOB_NAME
      
    2. Check the job status:

      kubectl -n NAMESPACE get job JOB_NAME
      
    3. Check the status of pods created by the job:

      kubectl -n NAMESPACE get pods -l job-name=JOB_NAME
      
  • CronJobRegistryUnavailable CE S7
    The container registry is not available for the IMAGE_NAME image.

    Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The CronJob NAME.
    • The CONTAINER_NAME container in the registry.
  • CronJobSchedulingError CE S6
    CronJob NAMESPACE/CRONJOB failed to schedule on time.

    Deckhouse has detected that CronJob NAMESPACE/CRONJOB failed to schedule on time.

    • Current schedule: XXX
    • Last scheduled time: XXX
    • Next projected schedule time: XXX
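
    To review the schedule and recent scheduling events, run the following commands:

    kubectl -n NAMESPACE get cronjob CRONJOB
    kubectl -n NAMESPACE describe cronjob CRONJOB
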
  • CronJobUnknownError CE S7
    An unknown error occurred with the IMAGE_NAME image.

    Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:

    • The NAMESPACE namespace.
    • The CronJob NAME.
    • The CONTAINER_NAME container in the registry.

    To resolve this issue, review the exporter logs:

    kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
    
  • D8CertExporterPodIsNotReady CE S8
    The x509-certificate-exporter pod isn't ready.

    Steps to resolve:

    1. Retrieve the deployment details:

      kubectl -n d8-monitoring describe deploy x509-certificate-exporter
      
    2. Check the pod status and investigate why it’s not ready:

      kubectl -n d8-monitoring describe pod -l app=x509-certificate-exporter
      
  • D8CertExporterPodIsNotRunning CE S8
    The x509-certificate-exporter pod isn't running.

    Steps to resolve:

    1. Retrieve the deployment details:

      kubectl -n d8-monitoring describe deploy x509-certificate-exporter
      
    2. Check the pod status and investigate why it’s not running:

      kubectl -n d8-monitoring describe pod -l app=x509-certificate-exporter
      
  • D8CertExporterTargetAbsent CE S8
    There is no x509-certificate-exporter target in Prometheus.

    Ways to resolve:

    • Check the pod status:

      kubectl -n d8-monitoring get pod -l app=x509-certificate-exporter
      
    • Check the pod logs:

      kubectl -n d8-monitoring logs -l app=x509-certificate-exporter -c x509-certificate-exporter
      
  • D8CertExporterTargetDown CE S8
    Prometheus can't scrape x509-certificate-exporter metrics.

    Ways to resolve:

    • Check the pod status:

      kubectl -n d8-monitoring get pod -l app=x509-certificate-exporter
      
    • Check the pod logs:

      kubectl -n d8-monitoring logs -l app=x509-certificate-exporter -c x509-certificate-exporter
      
  • D8ImageAvailabilityExporterMalfunctioning CE S8
    The image-availability-exporter has crashed.

    The image-availability-exporter has failed to perform any image availability checks in the container registry for over 20 minutes.

    To investigate the issue, review the exporter’s logs:

    kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
    
  • D8ImageAvailabilityExporterPodIsNotReady CE S8
    The image-availability-exporter pod is not ready.

    Deckhouse has detected that the image-availability-exporter pod is not ready. As a result, the images listed in the image field aren’t checked for availability in the container registry.

    Steps to resolve:

    1. Retrieve the deployment details:

      kubectl -n d8-monitoring describe deploy image-availability-exporter
      
    2. Check the pod status and investigate why it isn’t Ready:

      kubectl -n d8-monitoring describe pod -l app=image-availability-exporter
      
  • D8ImageAvailabilityExporterPodIsNotRunning CE S8
    The image-availability-exporter pod is not running.

    Deckhouse has detected that the image-availability-exporter pod is not running. As a result, the images listed in the image field aren’t checked for availability in the container registry.

    Steps to resolve:

    1. Retrieve the deployment details:

      kubectl -n d8-monitoring describe deploy image-availability-exporter
      
    2. Check the pod status and investigate why it isn’t running:

      kubectl -n d8-monitoring describe pod -l app=image-availability-exporter
      
  • D8ImageAvailabilityExporterTargetAbsent CE S8
    The image-availability-exporter target is missing from Prometheus.

    Deckhouse has detected that the image-availability-exporter target is missing from Prometheus.

    Steps to resolve:

    1. Check the pod status:

      kubectl -n d8-monitoring get pod -l app=image-availability-exporter
      
    2. Check the pod logs:

      kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
      
  • D8ImageAvailabilityExporterTargetDown CE S8
    Prometheus can't scrape metrics of image-availability-exporter.

    Deckhouse has detected that Prometheus is unable to scrape metrics of image-availability-exporter.

    Steps to resolve:

    1. Check the pod status:

      kubectl -n d8-monitoring get pod -l app=image-availability-exporter
      
    2. Check the pod logs:

      kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
      
  • DaemonSetAuthenticationFailure CE S7
    Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The DaemonSet NAME.
    • The CONTAINER_NAME container in the registry.
  • DaemonSetAuthorizationFailure CE S7
    Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The DaemonSet NAME.
    • The CONTAINER_NAME container in the registry.
  • DaemonSetBadImageFormat CE S7
    The IMAGE_NAME image name is incorrect.

    Deckhouse has detected that the IMAGE_NAME image name is incorrect.

    To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:

    • The NAMESPACE namespace.
    • The DaemonSet NAME.
    • The CONTAINER_NAME container in the registry.
  • DaemonSetImageAbsent CE S7
    The IMAGE_NAME image is missing from the registry.

    Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

    To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:

    • The NAMESPACE namespace.
    • The DaemonSet NAME.
    • The CONTAINER_NAME container in the registry.
  • DaemonSetRegistryUnavailable CE S7
    The container registry is not available for the IMAGE_NAME image.

    Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The DaemonSet NAME.
    • The CONTAINER_NAME container in the registry.
  • DaemonSetUnknownError CE S7
    An unknown error occurred with the IMAGE_NAME image.

    Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:

    • The NAMESPACE namespace.
    • The DaemonSet NAME.
    • The CONTAINER_NAME container in the registry.

    To resolve this issue, review the exporter logs:

    kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
    
  • DeploymentAuthenticationFailure CE S7
    Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The Deployment NAME.
    • The CONTAINER_NAME container in the registry.
  • DeploymentAuthorizationFailure CE S7
    Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The Deployment NAME.
    • The CONTAINER_NAME container in the registry.
  • DeploymentBadImageFormat CE S7
    The IMAGE_NAME image name is incorrect.

    Deckhouse has detected that the IMAGE_NAME image name is incorrect.

    To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:

    • The NAMESPACE namespace.
    • The Deployment NAME.
    • The CONTAINER_NAME container in the registry.
  • DeploymentImageAbsent CE S7
    The IMAGE_NAME image is missing from the registry.

    Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

    To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:

    • The NAMESPACE namespace.
    • The Deployment NAME.
    • The CONTAINER_NAME container in the registry.
  • DeploymentRegistryUnavailable CE S7
    The container registry is not available for the IMAGE_NAME image.

    Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The Deployment NAME.
    • The CONTAINER_NAME container in the registry.
  • DeploymentUnknownError CE S7
    An unknown error occurred with the IMAGE_NAME image.

    Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:

    • The NAMESPACE namespace.
    • The Deployment NAME.
    • The CONTAINER_NAME container in the registry.

    To resolve this issue, review the exporter logs:

    kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
    
  • ExtendedMonitoringDeprecatatedAnnotation CE S4
    Deprecated annotation is used in the cluster.

    Deckhouse has detected that the deprecated annotation extended-monitoring.flant.com/enabled is used in the cluster.

    Steps to resolve:

    1. Check the d8_deprecated_legacy_annotation metric in Prometheus for a list of all detected usages.
    2. Migrate to the extended-monitoring.deckhouse.io/enabled label.
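
    As a sketch (assuming the deprecated annotation is set on namespaces), you can list the namespaces that still use it:

    kubectl get namespaces -o json | jq -r '.items[] | select(.metadata.annotations."extended-monitoring.flant.com/enabled" != null) | .metadata.name'
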
  • ExtendedMonitoringTargetDown CE S5
    Extended monitoring is unavailable.

    The pod running extended-monitoring-exporter is currently unavailable.

    As a result, the following alerts will not be triggered:

    • Low disk space and inode usage on volumes.
    • CPU overloads and container throttling.
    • 500 errors on Ingress.
    • Insufficient replicas of Deployments, StatefulSets, and DaemonSets.
    • Other alerts associated with this exporter.

    To resolve this issue, investigate its possible causes:

    1. Print detailed information about the extended-monitoring-exporter deployment:

      kubectl -n d8-monitoring describe deploy extended-monitoring-exporter
      
    2. Print detailed information about the pods associated with the extended-monitoring-exporter:

      kubectl -n d8-monitoring describe pod -l app=extended-monitoring-exporter
      
  • IngressResponses5xx CE S4
    URL VHOST/LOCATION on Ingress INGRESS has more than XXX% of 5xx responses from the backend.

    Deckhouse has detected that URL VHOST/LOCATION on Ingress INGRESS, using service SERVICE_NAME on port SERVICE_PORT, has more than XXX% of 5xx responses from the backend.

    Current rate of 5xx responses: VALUE%

  • IngressResponses5xx CE S5
    URL VHOST/LOCATION on Ingress INGRESS has more than XXX% of 5xx responses from the backend.

    Deckhouse has detected that URL VHOST/LOCATION on Ingress INGRESS, using service SERVICE_NAME on port SERVICE_PORT, has more than XXX% of 5xx responses from the backend.

    Current rate of 5xx responses: VALUE%

  • KubernetesDaemonSetNotUpToDate CE S9
    There were VALUE outdated pods in DaemonSet NAMESPACE/DAEMONSET_NAME over the last 15 minutes.

    Deckhouse has detected VALUE outdated pods in DaemonSet NAMESPACE/DAEMONSET_NAME over the last 15 minutes.

    Steps to resolve:

    1. Check the DaemonSet’s status:

      kubectl -n NAMESPACE get ds DAEMONSET_NAME
      
    2. Analyze the DaemonSet’s description:

      kubectl -n NAMESPACE describe ds DAEMONSET_NAME
      
    3. If the parameter Number of Nodes Scheduled with Up-to-date Pods does not match Current Number of Nodes Scheduled, check the DaemonSet’s updateStrategy:

      kubectl -n NAMESPACE get ds DAEMONSET_NAME -o json | jq '.spec.updateStrategy'
      

      If updateStrategy is set to OnDelete, the DaemonSet is updated only when pods are deleted.

  • KubernetesDaemonSetReplicasUnavailable CE S5
    No available replicas remaining in DaemonSet NAMESPACE/DAEMONSET_NAME.

    Deckhouse has detected that there are no available replicas remaining in DaemonSet NAMESPACE/DAEMONSET_NAME.

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    

    If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.

    kubectl -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    
  • KubernetesDaemonSetReplicasUnavailable CE S6
    The number of unavailable replicas in DaemonSet NAMESPACE/DAEMONSET_NAME exceeds the threshold.

    Deckhouse has detected that the number of unavailable replicas in DaemonSet NAMESPACE/DAEMONSET_NAME exceeds the threshold.

    • Current number: VALUE unavailable replica(s).
    • Threshold number: XXX unavailable replica(s).

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    

    If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.

    kubectl -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    
  • KubernetesDeploymentReplicasUnavailable CE S5
    No available replicas remaining in deployment NAMESPACE/DEPLOYMENT_NAME.

    Deckhouse has detected that there are no available replicas remaining in deployment NAMESPACE/DEPLOYMENT_NAME.

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    
  • KubernetesDeploymentReplicasUnavailable CE S6
    The number of unavailable replicas in deployment NAMESPACE/DEPLOYMENT_NAME exceeds spec.strategy.rollingupdate.maxunavailable.

    Deckhouse has detected that the number of unavailable replicas in deployment NAMESPACE/DEPLOYMENT_NAME exceeds the value set in spec.strategy.rollingupdate.maxunavailable.

    • Current number: VALUE unavailable replica(s).
    • Threshold number: XXX unavailable replica(s).

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    
  • KubernetesStatefulSetReplicasUnavailable CE S5
    No ready replicas remaining in StatefulSet NAMESPACE/STATEFULSET.

    Deckhouse has detected that there are no ready replicas remaining in StatefulSet NAMESPACE/STATEFULSET.

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    
  • KubernetesStatefulSetReplicasUnavailable CE S6
    The number of unavailable replicas in StatefulSet NAMESPACE/STATEFULSET exceeds the threshold.

    Deckhouse has detected that the number of unavailable replicas in StatefulSet NAMESPACE/STATEFULSET exceeds the threshold.

    • Current number: VALUE unavailable replica(s).
    • Threshold number: XXX unavailable replica(s).

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    
  • LoadAverageHigh CE S4
    Average load on node NODE_NAME is too high.

    Over the last 5 minutes, the average load on node NODE_NAME has been higher than XXX per core.

    There are more processes in the queue than the CPU can handle.

    Possible causes:

    • A process has created too many threads or child processes.
    • The CPU is overloaded.
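
    To identify the heaviest processes and those with many threads, run the following on the node (a generic example, not specific to DKP):

    ps -eo pid,ppid,nlwp,pcpu,comm --sort=-pcpu | head -n 20
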
  • LoadAverageHigh CE S5
    Average load on node NODE_NAME is too high.

    Over the last 30 minutes, the average load on node NODE_NAME has been higher than or equal to XXX per core.

    There are more processes in the queue than the CPU can handle.

    Possible causes:

    • A process has created too many threads or child processes.
    • The CPU is overloaded.
  • NodeDiskBytesUsage CE S5
    Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

    Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

    Current storage usage: VALUE%

    Steps to resolve:

    1. Retrieve disk usage information on the node:

      ncdu -x MOUNTPOINT
      
    2. If the output shows high disk usage in the /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ directory, identify the pods with the highest usage:

      crictl stats -o json | jq '.stats[] | select((.writableLayer.usedBytes.value | tonumber) > 1073741824) | { meta: .attributes.labels, diskUsage: ((.writableLayer.usedBytes.value | tonumber) / 1073741824 * 100 | round / 100 | tostring + " GiB")}'
      
  • NodeDiskBytesUsage CE S6
    Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

    Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its storage capacity.

    Current storage usage: VALUE%

    Steps to resolve:

    1. Retrieve disk usage information on the node:

      ncdu -x MOUNTPOINT
      
    2. If the output shows high disk usage in the /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/ directory, identify the pods with the highest usage:

      crictl stats -o json | jq '.stats[] | select((.writableLayer.usedBytes.value | tonumber) > 1073741824) | { meta: .attributes.labels, diskUsage: ((.writableLayer.usedBytes.value | tonumber) / 1073741824 * 100 | round / 100 | tostring + " GiB")}'
      
  • NodeDiskInodesUsage CE S5
    Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its inode capacity.

    Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its inode capacity.

    Current inode usage: VALUE%
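
    To confirm inode usage on the node, check the filesystem statistics (a generic example):

    df -i MOUNTPOINT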

  • NodeDiskInodesUsage CE S6
    Node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its inode capacity.

    Deckhouse has detected that node disk NODE_DISK_NAME on mount point MOUNTPOINT is using more than XXX% of its inode capacity.

    Current inode usage: VALUE%

  • PersistentVolumeClaimBytesUsage CE S4
    PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.

    Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.

    Current volume storage usage: VALUE%

    PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:

    XXXXXX, XXXPOD_NAMEXXX
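
    If you need to find the pods that mount this PersistentVolumeClaim, the following command can help (a sketch):

    kubectl -n NAMESPACE get pods -o json | jq -r '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "PVC_NAME") | .metadata.name'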
    
  • PersistentVolumeClaimBytesUsage CE S5
    PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.

    Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume storage capacity.

    Current volume storage usage: VALUE%

    PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:

    XXXXXX, XXXPOD_NAMEXXX
    
  • PersistentVolumeClaimInodesUsed CE S4
    PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.

    Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.

    Current volume inode usage: VALUE%

    PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:

    XXXXXX, XXXPOD_NAMEXXX
    
  • PersistentVolumeClaimInodesUsed CE S5
    PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.

    Deckhouse has detected that PersistentVolumeClaim NAMESPACE/PVC_NAME is using more than XXX% of the volume inode capacity.

    Current volume inode usage: VALUE%

    PersistentVolumeClaim NAMESPACE/PVC_NAME is used by the following pods:

    XXXXXX, XXXPOD_NAMEXXX
    
  • StatefulSetAuthenticationFailure CE S7
    Unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    Deckhouse was unable to log in to the container registry using imagePullSecrets for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The StatefulSet NAME.
    • The CONTAINER_NAME container in the registry.
  • StatefulSetAuthorizationFailure CE S7
    Insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    Deckhouse has insufficient privileges to pull the IMAGE_NAME image using the specified imagePullSecrets.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The StatefulSet NAME.
    • The CONTAINER_NAME container in the registry.
  • StatefulSetBadImageFormat CE S7
    The IMAGE_NAME image name is incorrect.

    Deckhouse has detected that the IMAGE_NAME image name is incorrect.

    To resolve this issue, check that the IMAGE_NAME image name is spelled correctly in the following sources:

    • The NAMESPACE namespace.
    • The StatefulSet NAME.
    • The CONTAINER_NAME container in the registry.
  • StatefulSetImageAbsent CE S7
    The IMAGE_NAME image is missing from the registry.

    Deckhouse has detected that the IMAGE_NAME image is missing from the container registry.

    To resolve this issue, check whether the IMAGE_NAME image is available in the following sources:

    • The NAMESPACE namespace.
    • The StatefulSet NAME.
    • The CONTAINER_NAME container in the registry.
  • StatefulSetRegistryUnavailable CE S7
    The container registry is not available for the IMAGE_NAME image.

    Deckhouse has detected that the container registry is not available for the IMAGE_NAME image.

    To resolve this issue, investigate the possible causes in the following sources:

    • The NAMESPACE namespace.
    • The StatefulSet NAME.
    • The CONTAINER_NAME container in the registry.
  • StatefulSetUnknownError CE S7
    An unknown error occurred with the IMAGE_NAME image.

    Deckhouse has detected an unknown error with the IMAGE_NAME image in the following sources:

    • The NAMESPACE namespace.
    • The StatefulSet NAME.
    • The CONTAINER_NAME container in the registry.

    To resolve this issue, review the exporter logs:

    kubectl -n d8-monitoring logs -l app=image-availability-exporter -c image-availability-exporter
    

Module flow-schema

  • KubernetesAPFRejectRequests CE S9
    APF flow schema d8-serviceaccounts has rejected API requests.

    This alert is experimental.

    To display the APF schema queue requests, use the following expression:

    apiserver_flowcontrol_current_inqueue_requests{flow_schema="d8-serviceaccounts"}
    

Module ingress-nginx

  • D8NginxIngressKruiseControllerPodIsRestartingTooOften CE S8
    Too many Kruise controller restarts detected.

    VALUE Kruise controller restarts detected in the last hour.

    Excessive Kruise controller restarts indicate that something is wrong. Normally, it should be up and running all the time.

    Steps to resolve:

    1. Check events associated with kruise-controller-manager in the d8-ingress-nginx namespace. Look for issues related to node failures or memory shortages (OOM events):

      kubectl -n d8-ingress-nginx get events | grep kruise-controller-manager
      
    2. Analyze the controller’s pod descriptions to identify restarted containers and possible causes. Pay attention to exit codes and other details:

      kubectl -n d8-ingress-nginx describe pod -lapp=kruise,control-plane=controller-manager
      
    3. In case the kruise container has restarted, get a list of relevant container logs to identify any meaningful errors:

      kubectl -n d8-ingress-nginx logs -lapp=kruise,control-plane=controller-manager -c kruise
      
  • DeprecatedGeoIPVersion CE S9
    Deprecated GeoIP version 1 is used in the cluster.

    An IngressNginxController and/or Ingress object in the cluster is using variables from the deprecated NGINX GeoIPv1 module. Support for this module has been discontinued in Ingress NGINX Controller version 1.10 and higher.

    It’s recommended that you update your configuration to use the GeoIPv2 module.

    To get a list of the IngressNginxControllers using GeoIPv1 variables, run the following command:

    kubectl get ingressnginxcontrollers.deckhouse.io -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | .metadata.name'
    

    To get a list of the Ingress objects using GeoIPv1 variables, run the following command:

    kubectl get ingress -A -o json | jq '.items[] | select(..|strings | test("\\$geoip_(country_(code3|code|name)|area_code|city_continent_code|city_country_(code3|code|name)|dma_code|latitude|longitude|region|region_name|city|postal_code|org)([^_a-zA-Z0-9]|$)+")) | "\(.metadata.namespace)/\(.metadata.name)"' | sort | uniq
    
  • NginxIngressConfigTestFailed CE S4
    Configuration test failed on NGINX Ingress CONTROLLER_NAMESPACE/CONTROLLER_NAME.

    The configuration test (nginx -t) for the CONTROLLER_NAME Ingress controller in the CONTROLLER_NAMESPACE namespace has failed.

    Steps to resolve:

    1. Check the controller logs:

      kubectl -n CONTROLLER_NAMESPACE logs CONTROLLER_POD_NAME -c controller
      
    2. Find the most recently created Ingress in the cluster:

      kubectl get ingress --all-namespaces --sort-by="metadata.creationTimestamp"
      
    3. Check for errors in the configuration-snippet or server-snippet annotations.
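
      For example, the following command lists the Ingress objects that use snippet annotations (a sketch based on the annotation name patterns):

      kubectl get ingress -A -o json | jq -r '.items[] | select((.metadata.annotations // {}) | keys | any(test("configuration-snippet|server-snippet"))) | "\(.metadata.namespace)/\(.metadata.name)"'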

  • NginxIngressDaemonSetNotUpToDate CE S9
    There were VALUE outdated pods in NGINX Ingress DaemonSet NAMESPACE/DAEMONSET_NAME over the last 20 minutes.

    Deckhouse has detected VALUE outdated pods in NGINX Ingress DaemonSet NAMESPACE/DAEMONSET_NAME over the last 20 minutes.

    Steps to resolve:

    1. Check the DaemonSet’s status:

      kubectl -n NAMESPACE get ads DAEMONSET_NAME
      
    2. Analyze the DaemonSet’s description:

      kubectl -n NAMESPACE describe ads DAEMONSET_NAME
      
    3. If the parameter Number of Nodes Scheduled with Up-to-date Pods does not match Current Number of Nodes Scheduled, check the nodeSelector and tolerations settings of the corresponding NGINX Ingress Controller and compare them with the labels and taints of the relevant nodes, as shown in the commands below.
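
      For example (a sketch; CONTROLLER_NAME is the controller from the alert), review the controller specification and the node labels and taints to compare against:

      kubectl get ingressnginxcontrollers.deckhouse.io CONTROLLER_NAME -o yaml
      kubectl get nodes --show-labels
      kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.taints)"'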

  • NginxIngressDaemonSetReplicasUnavailable CE S4
    No available replicas remaining in NGINX Ingress DaemonSet NAMESPACE/DAEMONSET_NAME.

    Deckhouse has detected that there are no available replicas remaining in NGINX Ingress DaemonSet NAMESPACE/DAEMONSET_NAME.

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    

    If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.

    kubectl -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    
  • NginxIngressDaemonSetReplicasUnavailable CE S6
    Some replicas of NGINX Ingress DaemonSet NAMESPACE/DAEMONSET_NAME are unavailable.

    Deckhouse has detected that some replicas of NGINX Ingress DaemonSet NAMESPACE/DAEMONSET_NAME are unavailable.

    Current number: VALUE unavailable replica(s).

    List of unavailable pods:

    XXXXXX, XXXPOD_NAMEXXX
    

    If you know where the DaemonSet should be scheduled, run the command below to identify the problematic nodes. Use a label selector for pods, if needed.

    kubectl -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    
  • NginxIngressPodIsRestartingTooOften CE S4
    Too many NGINX Ingress restarts detected.

    VALUE NGINX Ingress controller restarts detected in the last hour.

    Excessive NGINX Ingress restarts indicate that something is wrong. Normally, it should be up and running all the time.
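
    To investigate, review the controller pod description and the logs of the previously restarted container (a sketch; the label selector follows the one used in the NginxIngressProtobufExporterHasErrors alert below):

    kubectl -n d8-ingress-nginx describe pod -l app=controller,name=CONTROLLER_NAME
    kubectl -n d8-ingress-nginx logs -l app=controller,name=CONTROLLER_NAME -c controller --previous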

  • NginxIngressProtobufExporterHasErrors CE S8
    The Ingress NGINX sidecar container with protobuf_exporter has ERROR_TYPE errors.

    Deckhouse has detected that the Ingress NGINX sidecar container with protobuf_exporter has ERROR_TYPE errors.

    To resolve the issue, check the Ingress controller’s logs:

    kubectl -n d8-ingress-nginx logs $(kubectl -n d8-ingress-nginx get pods -l app=controller,name=CONTROLLER_NAME -o wide | grep NODE_NAME | awk '{print $1}') -c protobuf-exporter
    
  • NginxIngressSslExpired CE S4
    Certificate has expired.

    The SSL certificate for HOST_NAME in the NAMESPACE namespace has expired.

    To verify the certificate, run the following command:

    kubectl -n NAMESPACE get secret SECRET_NAME -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates
    

    The site at https://HOST_NAME is not accessible.

  • NginxIngressSslWillExpire CE S5
    Certificate is expiring soon.

    The SSL certificate for HOST_NAME in the NAMESPACE namespace will expire in less than two weeks.

    To verify the certificate, run the following command:

    kubectl -n NAMESPACE get secret SECRET_NAME -o json | jq -r '.data."tls.crt" | @base64d' | openssl x509 -noout -alias -subject -issuer -dates
    

Module istio

  • D8IstioActualDataPlaneVersionNotEqualDesired EE S8
    There are pods with Istio data plane version VERSION_NUMBER, but desired version is XXX.

    There are pods in the NAMESPACE namespace with Istio data plane version VERSION_NUMBER, while the desired version is XXX. As a result, the Istio version will be changed after the pod is restarted.

    To resolve the issue, use the following cheat sheet:

    ### Namespace-wide configuration
    # `istio.io/rev=vXYZ`: Use a specific revision.
    # `istio-injection=enabled`: Use the global revision.
    kubectl get ns NAMESPACE --show-labels
    
    ### Pod-wide configuration
    kubectl -n NAMESPACE get pods -l istio.io/rev=DESIRED_REVISION
    
  • D8IstioActualVersionIsNotInstalled EE S4
    The control plane version for pods with injected sidecars isn't installed.

    There are pods in the NAMESPACE namespace with injected sidecars of version VERSION_NUMBER (revision REVISION_NUMBER), but the corresponding control plane version is not installed. As a result, these pods have lost synchronization with the state in Kubernetes.

    To resolve this issue, install the required control plane version. Alternatively, update the namespace or pod configuration to match an installed control plane version.

    To identify orphaned pods, run the following command:

    kubectl -n NAMESPACE get pods -l 'service.istio.io/canonical-name' -o json | jq --arg revision REVISION_NUMBER '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
    
  • D8IstioAdditionalControlplaneDoesntWork CE S4
    Additional control plane isn't working.

    Deckhouse has detected that the additional Istio control plane ISTIO_REVISION_LABEL isn’t working.

    As a result, sidecar injection for pods with ISTIO_REVISION_LABEL isn’t working as well.

    To check the status of the control plane pods, run the following command:

    kubectl get pods -n d8-istio -l istio.io/rev=ISTIO_REVISION_LABEL
    
  • D8IstioDataPlaneVersionMismatch EE S8
    There are pods with data plane version different from the control plane version.

    There are pods in the NAMESPACE namespace with Istio data plane version VERSION_NUMBER, which is different from the control plane version DESIRED_VERSION.

    Steps to resolve the issue:

    1. Restart affected pods and use the following PromQL query to get a full list:

      max by (namespace, dataplane_pod) (d8_istio_dataplane_metadata{full_version="VERSION_NUMBER"})
      
    2. Use the automatic Istio data plane upgrade described in the guide.

  • D8IstioDataPlaneWithoutIstioInjectionConfigured EE S4
    Detected pods with Istio sidecars but istio-injection isn't configured.

    There are pods in the NAMESPACE namespace with Istio sidecars, but istio-injection isn’t configured. As a result, these pods will lose their Istio sidecars after being recreated.

    To identify the affected pods, run the following command:

    kubectl -n NAMESPACE get pods -o json | jq -r --arg revision REVISION_NUMBER '.items[] | select(.metadata.annotations."sidecar.istio.io/status" // "{}" | fromjson | .revision == $revision) | .metadata.name'
    
  • D8IstioDeprecatedIstioVersionInstalled CE
    The installed Istio version has been deprecated.

    Deckhouse has detected that a deprecated Istio version VERSION_NUMBER is installed.

    Support for this version will be removed in upcoming Deckhouse releases. The higher the alert severity, the greater the probability of support being discontinued.

    To learn how to upgrade Istio, refer to the upgrade guide.

  • D8IstioDesiredVersionIsNotInstalled EE S6
    Desired control plane version isn't installed.

    There is a desired Istio control plane version XXX (revision REVISION_NUMBER) configured for pods in the NAMESPACE namespace, but that version isn’t installed. As a result, pods can’t be recreated in the NAMESPACE namespace.

    To resolve this issue, install the desired control plane version. Alternatively, update the namespace or pod configuration to match an installed control plane version.

    Use the following cheat sheet:

    ### Namespace-wide configuration
    # `istio.io/rev=vXYZ`: Use a specific revision.
    # `istio-injection=enabled`: Use the global revision.
    kubectl get ns NAMESPACE --show-labels
    
    ### Pod-wide configuration
    kubectl -n NAMESPACE get pods -l istio.io/rev=REVISION_NUMBER
    
  • D8IstioFederationMetadataEndpointDoesntWork EE S6
    Federation metadata endpoint failed.

    The metadata endpoint ENDPOINT_NAME for IstioFederation FEDERATION_NAME has failed to fetch via the Deckhouse hook.

    To reproduce the request to the public endpoint, run the following command:

    curl ENDPOINT_NAME
    

    To reproduce the request to private endpoints (run from the Deckhouse pod), run the following:

    KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
    LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
    REMOTE_CLUSTER_UUID="$(kubectl get istiofederation FEDERATION_NAME -o json | jq -r .status.metadataCache.public.clusterUUID)"
    TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-federation --ttl 1h)"
    curl -H "Authorization: Bearer $TOKEN" ENDPOINT_NAME
    
  • D8IstioGlobalControlplaneDoesntWork CE S4
    Global control plane isn't working.

    Deckhouse has detected that the global Istio control plane ISTIO_REVISION_LABEL isn’t working.

    As a result, sidecar injection for pods with global revision isn’t working as well, and the validating webhook for Istio resources is absent.

    To check the status of the control plane pods, run the following command:

    kubectl get pods -n d8-istio -l istio.io/rev=ISTIO_REVISION_LABEL
    
  • D8IstioMulticlusterMetadataEndpointDoesntWork EE S6
    Multicluster metadata endpoint failed.

    The metadata endpoint ENDPOINT_NAME for IstioMulticluster MULTICLUSTER_NAME has failed to fetch via the Deckhouse hook.

    To reproduce the request to the public endpoint, run the following command:

    curl ENDPOINT_NAME
    

    To reproduce the request to private endpoints (run from the d8-system/deckhouse pod), run the following:

    KEY="$(deckhouse-controller module values istio -o json | jq -r .internal.remoteAuthnKeypair.priv)"
    LOCAL_CLUSTER_UUID="$(deckhouse-controller module values -g istio -o json | jq -r .global.discovery.clusterUUID)"
    REMOTE_CLUSTER_UUID="$(kubectl get istiomulticluster MULTICLUSTER_NAME -o json | jq -r .status.metadataCache.public.clusterUUID)"
    TOKEN="$(deckhouse-controller helper gen-jwt --private-key-path <(echo "$KEY") --claim iss=d8-istio --claim sub=$LOCAL_CLUSTER_UUID --claim aud=$REMOTE_CLUSTER_UUID --claim scope=private-multicluster --ttl 1h)"
    curl -H "Authorization: Bearer $TOKEN" ENDPOINT_NAME
    
  • D8IstioMulticlusterRemoteAPIHostDoesntWork EE S6
    Multicluster remote API host health check failed.

    The remote API host API_HOST for IstioMulticluster MULTICLUSTER_NAME has failed the health check performed by the Deckhouse monitoring hook.

    To reproduce the request (run from the d8-system/deckhouse pod), run the following:

    TOKEN="$(deckhouse-controller module values istio -o json | jq -r --arg ah API_HOST '.internal.multiclusters[]| select(.apiHost == $ah)| .apiJWT ')"
    curl -H "Authorization: Bearer $TOKEN" https://API_HOST/version
    
  • D8IstioOperatorReconcileError CE S5
    The istio-operator is unable to reconcile Istio control plane setup.

    Deckhouse has detected an error in the istio-operator reconciliation loop.

    To investigate the issue, check the operator logs:

    kubectl -n d8-istio logs -l app=operator,revision=REVISION_NUMBER
    
  • D8IstioPodsWithoutIstioSidecar EE S4
    Detected pods without Istio sidecars but with istio-injection configured.

    There is a pod POD_NAME in the NAMESPACE namespace without Istio sidecars, but with istio-injection configured.

    To identify the affected pods, run the following command:

    kubectl -n NAMESPACE get pods -l '!service.istio.io/canonical-name' -o json | jq -r '.items[] | select(.metadata.annotations."sidecar.istio.io/inject" != "false") | .metadata.name'
    
  • D8IstioVersionIsIncompatibleWithK8sVersion CE S3
    The installed Istio version is incompatible with the Kubernetes version.

    The installed Istio version VERSION_NUMBER may not work properly with the current Kubernetes version VERSION_NUMBER because it’s not supported officially.

    To resolve the issue, upgrade Istio following the guide.

  • IstioIrrelevantExternalServiceFound CE S5
    External service found with irrelevant ports specifications.

    A service NAME in the NAMESPACE namespace has an irrelevant port specification.

    The .spec.ports[] field isn’t applicable for services of the ExternalName type. However, Istio renders port listeners for external services as 0.0.0.0:port, which captures all traffic to the specified port. This can cause problems for services that aren’t registered in the Istio registry.

    To resolve the issue, remove the .spec.ports section from the service configuration. This change is safe, as the field has no effect for services of the ExternalName type.
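    For example, the section can be removed with a JSON patch (a sketch; NAMESPACE and NAME are the placeholders above, verify the change against your service definition first):

    kubectl -n NAMESPACE patch service NAME --type=json -p='[{"op": "remove", "path": "/spec/ports"}]'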

Module kube-dns

  • KubernetesCoreDNSHasCriticalErrors CE S5
    Critical errors found in CoreDNS.

    Deckhouse has detected at least one critical error in the CoreDNS pod POD_NAME.

    To resolve the issue, review the container logs:

    kubectl -n kube-system logs POD_NAME
    

Module log-shipper

  • D8LogShipperAgentNotScheduledInCluster CE S7
    The log-shipper-agent pods can't be scheduled in the cluster.

    Deckhouse has detected that a number of log-shipper-agent pods are not scheduled.

    To resolve this issue, do the following:

    1. Check the state of the d8-log-shipper/log-shipper-agent DaemonSet:

      kubectl -n d8-log-shipper get daemonsets --selector=app=log-shipper
      
    2. Check the state of the d8-log-shipper/log-shipper-agent pods:

      kubectl -n d8-log-shipper get pods --selector=app=log-shipper-agent
      
    3. If you know where the DaemonSet should be scheduled, run the following command to identify the problematic nodes:

      kubectl -n d8-log-shipper get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="log-shipper-agent")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
      
  • D8LogShipperClusterLogDestinationD8LokiAuthorizationRequired CE S9
    Authorization parameters required for the ClusterLogDestination resource.

    Deckhouse has detected the ClusterLogDestination resource RESOURCE_NAME without authorization parameters.

    Add the authorization parameters to the ClusterLogDestination resource following the instructions.

  • D8LogShipperCollectLogErrors CE S4
    The log-shipper-agent pods can't collect logs to COMPONENT_ID on the NODE_NAME node.

    Deckhouse has detected that the HOST_NAME log-shipper-agent on the NODE_NAME node has failed to collect logs for more than 10 minutes.

    This is caused by ERROR_TYPE errors that occurred during the STAGE_NAME stage while reading COMPONENT_TYPE.

    To resolve this, check the pod logs or follow advanced instructions:

    kubectl -n d8-log-shipper logs HOST_NAME -c vector
    
  • D8LogShipperDestinationErrors CE S4
    The log-shipper-agent pods can't send logs to COMPONENT_ID on the NODE_NAME node.

    Deckhouse has detected that the HOST_NAME log-shipper-agent on the NODE_NAME node has failed to send a log for more than 10 minutes.

    This is caused by ERROR_TYPE errors that occurred during the STAGE_NAME stage while sending logs to COMPONENT_TYPE.

    To resolve this, check the pod logs or follow advanced instructions:

    kubectl -n d8-log-shipper logs HOST_NAME -c vector
    
  • D8LogShipperLogsDroppedByRateLimit CE S4
    The log-shipper-agent pods are dropping logs to COMPONENT_ID on the NODE_NAME node.

    Rate-limiting rules have been applied, and the log-shipper-agent on the NODE_NAME node has been dropping logs for more than 10 minutes.

    To resolve this, check the pod logs or follow advanced instructions:

    kubectl -n d8-log-shipper get pods -o wide | grep NODE_NAME
    

Module metallb

  • D8MetallbBGPSessionDown EE S4
    MetalLB BGP session is down.

    XXX, MetalLB CONTAINER_NAME on POD_NAME has BGP session PEER down.

    Check the logs for details:

    kubectl -n d8-metallb logs daemonset/speaker -c speaker
    
  • D8MetallbConfigNotLoaded EE S4
    The MetalLB configuration hasn't been loaded.

    XXX, the MetalLB CONTAINER_NAME configuration on POD_NAME hasn't been loaded.

    To find the cause of the issue, review the controller logs:

    kubectl -n d8-metallb logs deploy/controller -c controller
    
  • D8MetallbConfigStale EE S4
    MetalLB is running on a stale configuration.

    XXX, MetalLB CONTAINER_NAME on POD_NAME is running on a stale configuration because the latest configuration failed to load.

    To find the cause of the issue, review the controller logs:

    kubectl -n d8-metallb logs deploy/controller -c controller
    
  • D8MetallbNotSupportedServiceAnnotationsDetected SE S4
    The annotation 'ANNOTATION_NAME' has been deprecated for the service 'NAME' in the 'NAMESPACE' namespace.

    The annotation ‘ANNOTATION_NAME’ has been deprecated for the service ‘NAME’ in the ‘NAMESPACE’ namespace.

    The following service annotations are no longer effective:

    • metallb.universe.tf/ip-allocated-from-pool: Remove this annotation.
    • metallb.universe.tf/address-pool: Replace it with the .spec.loadBalancerClass parameter or use the network.deckhouse.io/metal-load-balancer-class annotation, referencing the appropriate MetalLoadBalancerClass.
    • metallb.universe.tf/loadBalancerIPs: Replace it with network.deckhouse.io/load-balancer-ips: IP.
    • metallb.universe.tf/allow-shared-ip: Replace it with network.deckhouse.io/load-balancer-shared-ip-key.

    Note that existing Deckhouse LoadBalancer services have been migrated automatically, but new ones will not be.

  • D8MetallbObsoleteLayer2PoolsAreUsed SE S7
    The metallb module has obsolete layer2 pools configured.

    In ModuleConfig version 2, address pools of type “layer2” (such as ‘NAME’) are ignored and should be removed from the configuration.

  • D8MetallbUpdateMCVersionRequired SE S5
    The metallb ModuleConfig settings are outdated.

    D8 MetalLB settings are outdated.

    To resolve this issue, increase the version in the metallb ModuleConfig.
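    A minimal sketch of bumping the version (the target value 2 is an assumption based on the layer2 alert above; check the module documentation for the current ModuleConfig version):

    kubectl patch moduleconfig metallb --type=merge -p '{"spec": {"version": 2}}'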

  • L2LoadBalancerOrphanServiceFound SE S4
    Orphaned service with an irrelevant L2LoadBalancer name has been found.

    The cluster contains an orphaned service NAME in the NAMESPACE namespace with an irrelevant L2LoadBalancer name.

    To resolve this issue, verify the L2LoadBalancer name specified in the annotation network.deckhouse.io/l2-load-balancer-name.
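    For example, the annotation value can be printed with the following command (a sketch; NAME and NAMESPACE are the placeholders above):

    kubectl -n NAMESPACE get service NAME -o jsonpath='{.metadata.annotations.network\.deckhouse\.io/l2-load-balancer-name}'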

Module monitoring-applications

  • D8OldPrometheusTargetFormat FE S6
    Services with the prometheus-target label are used to collect metrics in the cluster.

    Services with the prometheus-target label are used to collect metrics in the cluster.

    Use the following command to filter them: kubectl get service --all-namespaces --show-labels | grep prometheus-target

    Note that the label format has changed. You need to replace the prometheus-target label with prometheus.deckhouse.io/target.
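    A possible way to relabel a service in one step (a sketch; SERVICE_NAME, NAMESPACE, and APP_NAME are illustrative placeholders, and the new label should keep the value of the old prometheus-target label):

    kubectl -n NAMESPACE label service SERVICE_NAME prometheus-target- prometheus.deckhouse.io/target=APP_NAME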

Module monitoring-custom

  • CustomPodMonitorFoundInCluster CE S9
    There are PodMonitors in Deckhouse namespace that were not created by Deckhouse.

    There are PodMonitors in Deckhouse namespace that were not created by Deckhouse.

    Use the following command for filtering: kubectl get podmonitors --all-namespaces -l heritage!=deckhouse.

    They must be moved from the Deckhouse namespace to a user namespace (one not labeled with heritage: deckhouse).

    The detailed description of the metric collecting process is available in the documentation.

  • CustomServiceMonitorFoundInD8Namespace CE S9
    There are ServiceMonitors in Deckhouse namespace that were not created by Deckhouse.

    There are ServiceMonitors in Deckhouse namespace that were not created by Deckhouse.

    Use the following command for filtering: kubectl get servicemonitors --all-namespaces -l heritage!=deckhouse.

    They must be moved from the Deckhouse namespace to a user namespace (one not labeled with heritage: deckhouse).

    The detailed description of the metric collecting process is available in the documentation.

  • D8CustomPrometheusRuleFoundInCluster CE S9
    There are PrometheusRules in the cluster that were not created by Deckhouse.

    There are PrometheusRules in the cluster that were not created by Deckhouse.

    Use the following command for filtering: kubectl get prometheusrules --all-namespaces -l heritage!=deckhouse.

    They must be abandoned and replaced with the CustomPrometheusRules object.

    Please, refer to the documentation for information about adding alerts and/or recording rules.

  • D8OldPrometheusCustomTargetFormat CE S9
    Services with the prometheus-custom-target label are used to collect metrics in the cluster.

    Services with the prometheus-custom-target label are used to collect metrics in the cluster.

    Use the following command for filtering: kubectl get service --all-namespaces --show-labels | grep prometheus-custom-target.

    Note that the label format has changed. You need to change the prometheus-custom-target label to prometheus.deckhouse.io/custom-target.

    For more information, refer to the documentation.

  • D8ReservedNodeLabelOrTaintFound CE S6
    Node NAME needs fixing up

    Node NAME uses:

    • a reserved metadata.labels key node-role.deckhouse.io/ with a suffix not in (system|frontend|monitoring|_deckhouse_module_name_)
    • or a reserved spec.taints key dedicated.deckhouse.io with a value not in (system|frontend|monitoring|_deckhouse_module_name_)

    Get instructions on how to fix it here.

Module monitoring-deckhouse

  • D8CNIEnabledMoreThanOne CE S2
    More than one CNI is enabled in the cluster.

    Several CNIs are enabled in the cluster. For the cluster to work correctly, only one CNI must be enabled.

  • D8DeckhouseConfigInvalid CE S5
    Deckhouse config is invalid.

    Deckhouse config contains errors.

    Please check Deckhouse logs by running kubectl -n d8-system logs -f -l app=deckhouse.

    Edit the Deckhouse global configuration by running kubectl edit mc global, or the configuration of a specific module by running kubectl edit mc MODULE_NAME.

  • D8DeckhouseCouldNotDeleteModule CE S4
    Deckhouse is unable to delete the MODULE_NAME module.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotDiscoverModules CE S4
    Deckhouse is unable to discover modules.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotRunGlobalHook CE S5
    Deckhouse is unable to run the HOOK_NAME global hook.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotRunModule CE S4
    Deckhouse is unable to start the MODULE_NAME module.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCouldNotRunModuleHook CE S7
    Deckhouse is unable to run the MODULE_NAME/HOOK_NAME module hook.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseCustomTargetDown CE S4

    Prometheus is unable to scrape custom metrics generated by Deckhouse hooks.

  • D8DeckhouseDeprecatedConfigmapManagedByArgoCD CE S4
    Deprecated deckhouse configmap managed by Argo CD

    The deckhouse ConfigMap is no longer used. You need to remove the d8-system/deckhouse ConfigMap from Argo CD.

  • D8DeckhouseGlobalHookFailsTooOften CE S9
    The HOOK_NAME Deckhouse global hook crashes way too often.

    The HOOK_NAME global hook has failed in the last __SCRAPE_INTERVAL_X_4__.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseHasNoAccessToRegistry CE S7
    Deckhouse is unable to connect to the registry.

    Deckhouse is unable to connect to the registry (registry.deckhouse.io in most cases) to check for a new Docker image (checks are performed every 15 seconds). Deckhouse does not have access to the registry; automatic updates are not available.

    Usually, this alert means that the Deckhouse Pod is having difficulties connecting to the Internet.

  • D8DeckhouseIsHung CE S4
    Deckhouse is down.

    Deckhouse is probably down since the deckhouse_live_ticks metric in Prometheus is no longer increasing (it is supposed to increment every 10 seconds).

  • D8DeckhouseIsNotOnReleaseChannel CE S9
    Deckhouse in the cluster is not subscribed to one of the regular release channels.

    Deckhouse is on a custom branch instead of one of the regular release channels.

    It is recommended that Deckhouse be subscribed to one of the following channels: Alpha, Beta, EarlyAccess, Stable, RockSolid.

    Use the command below to find out what release channel is currently in use: kubectl -n d8-system get deploy deckhouse -o json | jq '.spec.template.spec.containers[0].image' -r

    Subscribe the cluster to one of the regular release channels.

  • D8DeckhouseModuleHookFailsTooOften CE S9
    The MODULE_NAME/HOOK_NAME Deckhouse hook crashes way too often.

    The HOOK_NAME hook of the MODULE_NAME module has failed in the last __SCRAPE_INTERVAL_X_4__.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseModuleUpdatePolicyNotFound CE S5
    Module update policy not found for MODULE_RELEASE

    Module update policy not found for MODULE_RELEASE

    You need to remove the label from the ModuleRelease: kubectl label mr MODULE_RELEASE modules.deckhouse.io/update-policy-. A suitable new policy will be detected automatically.

  • D8DeckhousePodIsNotReady CE S4

    The Deckhouse Pod is NOT Ready.

  • D8DeckhousePodIsNotRunning CE S4

    The Deckhouse Pod is NOT Running.

  • D8DeckhousePodIsRestartingTooOften CE S9
    Excessive Deckhouse restarts detected.

    The number of restarts in the last hour: VALUE.

    Excessive Deckhouse restarts indicate that something is wrong. Normally, Deckhouse should be up and running all the time.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseQueueIsHung CE S7
    The QUEUE_NAME Deckhouse queue has hung; there are VALUE task(s) in the queue.

    Deckhouse cannot finish processing of the QUEUE_NAME queue with VALUE tasks piled up.

    Please, refer to the corresponding logs: kubectl -n d8-system logs -f -l app=deckhouse.

  • D8DeckhouseSelfTargetAbsent CE S4

    There is no Deckhouse target in Prometheus.

  • D8DeckhouseSelfTargetDown CE S4

    Prometheus is unable to scrape Deckhouse metrics.

  • D8DeckhouseWatchErrorOccurred CE S5
    Possible apiserver connection error in the client-go informer, check logs and snapshots.

    An error occurred in the client-go informer; there may be problems with the connection to the apiserver.

    Check Deckhouse logs for more information by running: kubectl -n d8-system logs deploy/deckhouse | grep error | grep -i watch

    This alert is an attempt to detect a correlation between faulty snapshot invalidation and apiserver connection errors, especially for the handle-node-template hook in the node-manager module. Check the difference between the snapshot and the actual node objects for this hook:

    diff -u <(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'|sort) <(kubectl -n d8-system exec svc/deckhouse-leader -c deckhouse -- deckhouse-controller module snapshots node-manager -o json | jq '."040-node-manager/hooks/handle_node_templates.go"' | jq '.nodes.snapshot[] | .filterResult.Name' -r | sort)

  • D8HasModuleConfigAllowedToDisable CE S4
    The ModuleConfig annotation allowing the module to be disabled is set.

    The ModuleConfig is waiting to be disabled.

    It is recommended to keep your module configurations clear of approval annotations.

    If you ignore this alert and do not clear the annotation, it may cause the module to be accidentally removed from the cluster.

    Removing a module from a cluster can lead to a number of irreparable consequences.

    Please run kubectl annotate moduleconfig MODULE_NAME modules.deckhouse.io/allow-disabling- to stop this alert.

  • D8NodeHasDeprecatedOSVersion CE S4
    Nodes have deprecated OS versions.

    Some nodes have deprecated OS versions. Please update these nodes to a supported OS version.

    To observe affected nodes use the expr kube_node_info{os_image=~"Ubuntu 18.04.*|Debian GNU/Linux 10.*|CentOS Linux 7.*"} in Prometheus.

  • D8NodeHasUnmetKernelRequirements CE S4
    Nodes have unmet kernel requirements

    Some nodes have unmet kernel constraints, which means that some modules cannot run on those nodes. Current kernel requirements:

    • For the Cilium module, the kernel should be >= 4.9.17.
    • For Cilium with Istio, the kernel should be >= 5.7.
    • For Cilium with OpenVPN, the kernel should be >= 5.7.
    • For Cilium with node-local-dns, the kernel should be >= 5.7.

    To observe affected nodes use the expr d8_node_kernel_does_not_satisfy_requirements == 1 in Prometheus.

  • DeckhouseReleaseDisruptionApprovalRequired CE S4
    Deckhouse release disruption approval required.

    The Deckhouse release contains a disruptive update.

    You can figure out more details by running kubectl describe DeckhouseRelease NAME. If you are ready to deploy this release, run: kubectl annotate DeckhouseRelease NAME release.deckhouse.io/disruption-approved=true.

  • DeckhouseReleaseIsBlocked CE S5
    Deckhouse release requirements unmet.

    The Deckhouse release requirements are not met.

    Please run kubectl describe DeckhouseRelease NAME for details.

  • DeckhouseReleaseIsWaitingManualApproval CE S3
    Deckhouse release is waiting for manual approval.

    Deckhouse release is waiting for manual approval.

    Please run kubectl patch DeckhouseRelease NAME --type=merge -p='{"approved": true}' for confirmation.

  • DeckhouseReleaseIsWaitingManualApproval CE S6
    Deckhouse release is waiting for manual approval.

    Deckhouse release is waiting for manual approval.

    Please run kubectl patch DeckhouseRelease NAME --type=merge -p='{"approved": true}' for confirmation.

  • DeckhouseReleaseIsWaitingManualApproval CE S9
    Deckhouse release is waiting for manual approval.

    Deckhouse release is waiting for manual approval.

    Please run kubectl patch DeckhouseRelease NAME --type=merge -p='{"approved": true}' for confirmation.

  • DeckhouseReleaseNotificationNotSent CE S4
    Deckhouse release notification webhook not sent.

    Failed to send the Deckhouse release notification webhook.

    Check the notification webhook address by running kubectl get mc deckhouse -o yaml.

  • DeckhouseUpdating CE S4

    Deckhouse is being updated.

  • DeckhouseUpdatingFailed CE S4
    Deckhouse update has failed.

    Failed to update Deckhouse.

    The next minor/patch version of the Deckhouse image is not available in the registry, or the image is corrupted. Current version: VERSION_NUMBER.

    Make sure that the next version Deckhouse image is available in the registry.
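    One way to see the pending release and its status (not part of the original alert text):

    kubectl get deckhousereleases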

  • MigrationRequiredFromRBDInTreeProvisionerToCSIDriver CE S9
    Storage class STORAGECLASS_NAME uses the deprecated rbd provisioner. It is necessary to migrate the volumes to the Ceph CSI driver.

    To migrate the volumes, use this script: https://github.com/deckhouse/deckhouse/blob//modules/031-ceph-csi/tools/rbd-in-tree-to-ceph-csi-migration-helper.sh

    A description of how the migration is performed can be found here: https://github.com/deckhouse/deckhouse/blob//modules/031-ceph-csi/docs/internal/INTREE_MIGRATION.md

  • ModuleAtConflict CE S4
    Conflict detected for module MODULE_NAME.

    Conflicting sources for the MODULE_NAME module. Please specify the proper source in the module configuration.

  • ModuleConfigObsoleteVersion CE S4
    ModuleConfig NAME is outdated.

    ModuleConfig NAME is outdated. Update ModuleConfig NAME to the latest version.

  • ModuleHasDeprecatedUpdatePolicy CE S4
    The 'MODULE_NAME' module is matched by the 'XXX' deprecated module update policy.

    The ‘MODULE_NAME’ module is matched by the ‘XXX’ deprecated module update policy. The policy's v1alpha1 version has a selector that no longer works.

    Specify the update policy in the module config by running the following command:

    kubectl patch moduleconfig MODULE_NAME --type='json' -p='[{"op": "add", "path": "/spec/updatePolicy", "value": "XXX"}]'
    

    After solving all alerts for the ‘XXX’ update policy, use this command to clear the selector:

    kubectl patch moduleupdatepolicies.v1alpha1.deckhouse.io XXX --type='json' -p='[{"op": "replace", "path": "/spec/moduleReleaseSelector/labelSelector/matchLabels", "value": {"": ""}}]'
    
  • ModuleReleaseIsBlockedByRequirements CE S6
    Module release is blocked by the requirements.

    Module MODULE_NAME release is blocked by the requirements.

    Please check the requirements with the following command kubectl get mr NAME -o json | jq .spec.requirements.

  • ModuleReleaseIsWaitingManualApproval CE S6
    Module release is waiting for manual approval.

    Module MODULE_NAME release is waiting for manual approval.

    Please run kubectl annotate mr NAME modules.deckhouse.io/approved="true" for confirmation.

Module monitoring-kubernetes

  • CPUStealHigh CE S4
    CPU Steal on the NODE_NAME Node is too high.

    The CPU steal is too high on the NODE_NAME Node in the last 30 minutes.

    Probably, some other component is stealing Node resources (e.g., a neighboring virtual machine). This may be the result of “overselling” the hypervisor. In other words, there are more virtual machines than the hypervisor can handle.

  • DeadMansSwitch CE S4
    Alerting DeadMansSwitch

    This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.

  • DeploymentGenerationMismatch CE S4
    Deployment is outdated

    Observed deployment generation does not match expected one for deployment NAMESPACE/DEPLOYMENT_NAME

  • EbpfExporterKernelNotSupported CE S8

    The BTF module required for ebpf_exporter is missing in the kernel.

    Possible actions to resolve the problem:

    • Build the kernel with BTF type information.
    • Disable ebpf_exporter.
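    One way to verify whether the running kernel exposes BTF information (a sketch, not from the alert itself; the config file path depends on the distribution):

    ls /sys/kernel/btf/vmlinux
    grep CONFIG_DEBUG_INFO_BTF /boot/config-$(uname -r)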

  • FdExhaustionClose CE S3
    file descriptors soon exhausted

    XXX: INSTANCE_NAME instance will exhaust its file/socket descriptors within the next hour

  • FdExhaustionClose CE S3
    file descriptors soon exhausted

    XXX: NAMESPACE/POD_NAME instance will exhaust its file/socket descriptors within the next hour

  • FdExhaustionClose CE S4
    file descriptors soon exhausted

    XXX: INSTANCE_NAME instance will exhaust its file/socket descriptors within the next 4 hours

  • FdExhaustionClose CE S4
    file descriptors soon exhausted

    XXX: NAMESPACE/POD_NAME instance will exhaust its file/socket descriptors within the next 4 hours

  • HelmReleasesHasResourcesWithDeprecatedVersions CE S5
    At least one HELM release contains resources with deprecated apiVersion, which will be removed in Kubernetes vVERSION_NUMBER.

    To observe all resources use the expr max by (helm_release_namespace, helm_release_name, helm_version, resource_namespace, resource_name, api_version, kind, k8s_version) (resource_versions_compatibility) == 1 in Prometheus.

    You can find more details for migration in the deprecation guide: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#vXXX.

    Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources are migrated.

  • HelmReleasesHasResourcesWithUnsupportedVersions CE S4
    At least one HELM release contains resources with unsupported apiVersion for Kubernetes vVERSION_NUMBER.

    To observe all resources use the expr max by (helm_release_namespace, helm_release_name, helm_version, resource_namespace, resource_name, api_version, kind, k8s_version) (resource_versions_compatibility) == 2 in Prometheus.

    You can find more details for migration in the deprecation guide: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#vXXX.

    Attention: The check runs once per hour, so this alert should resolve within an hour after the deprecated resources are migrated.

  • K8SKubeletDown CE S3
    Many kubelets cannot be scraped

    Prometheus failed to scrape VALUE% of kubelets.

  • K8SKubeletDown CE S4
    A few kubelets cannot be scraped

    Prometheus failed to scrape VALUE% of kubelets.

  • K8SKubeletTooManyPods CE S7
    Kubelet is close to pod limit

    Kubelet NODE_NAME is running VALUE pods, close to the limit of XXX

  • K8SManyNodesNotReady CE S3
    Too many nodes are not ready

    VALUE% of Kubernetes nodes are not ready

  • K8SNodeNotReady CE S3
    Node status is NotReady

    The Kubelet on NODE_NAME has not checked in with the API, or has set itself to NotReady, for more than 10 minutes

  • KubeletImageFSBytesUsage CE S5
    No more free bytes on imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    No more free bytes on imagefs (filesystem that the container runtime uses for storing images and container writable layers) on node NODE_NAME mountpoint MOUNTPOINT.

  • KubeletImageFSBytesUsage CE S6
    Hard eviction of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Hard eviction of imagefs (filesystem that the container runtime uses for storing images and container writable layers) on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletImageFSBytesUsage CE S7
    Close to hard eviction threshold of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    Close to hard eviction threshold of imagefs (filesystem that the container runtime uses for storing images and container writable layers) on node NODE_NAME mountpoint MOUNTPOINT.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletImageFSBytesUsage CE S9
    Soft eviction of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Soft eviction of imagefs (filesystem that the container runtime uses for storing images and container writable layers) on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletImageFSInodesUsage CE S5

    No more free inodes on imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

  • KubeletImageFSInodesUsage CE S6
    Hard eviction of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Hard eviction of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletImageFSInodesUsage CE S7
    Close to hard eviction threshold of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    Close to hard eviction threshold of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletImageFSInodesUsage CE S9
    Soft eviction of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Soft eviction of imagefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletNodeFSBytesUsage CE S5

    No more free space on nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

  • KubeletNodeFSBytesUsage CE S6
    Hard eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Hard eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletNodeFSBytesUsage CE S7
    Close to hard eviction threshold of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    Close to hard eviction threshold of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletNodeFSBytesUsage CE S9
    Soft eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Soft eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletNodeFSInodesUsage CE S5

    No more free inodes on nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

  • KubeletNodeFSInodesUsage CE S6
    Hard eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Hard eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletNodeFSInodesUsage CE S7
    Close to hard eviction threshold of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    Close to hard eviction threshold of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubeletNodeFSInodesUsage CE S9
    Soft eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Soft eviction of nodefs on the NODE_NAME Node at the MOUNTPOINT mountpoint is in progress.

    Threshold at: XXX%

    Currently at: VALUE%

  • KubernetesDnsTargetDown CE S5
    Kube-dns or CoreDNS are not under monitoring.

    Prometheus is unable to collect metrics from kube-dns. Thus its status is unknown.

    To debug the problem, use the following commands:

    1. kubectl -n kube-system describe deployment -l k8s-app=kube-dns
    2. kubectl -n kube-system describe pod -l k8s-app=kube-dns
  • KubeStateMetricsDown CE S3
    Kube-state-metrics is not working in the cluster.

    There are no metrics about cluster resources for 5 minutes.

    Most alerts and monitoring panels aren't working.

    To debug the problem:

    1. Check kube-state-metrics pods: kubectl -n d8-monitoring describe pod -l app=kube-state-metrics
    2. Check its logs: kubectl -n d8-monitoring logs deploy/kube-state-metrics
  • LoadBalancerServiceWithoutExternalIP CE S4
    A load balancer has not been created.

    One or more services with the LoadBalancer type cannot get an external address.

    The list of services can be obtained with the following command:

    kubectl get svc -Ao json | jq -r '.items[] | select(.spec.type == "LoadBalancer") | select(.status.loadBalancer.ingress[0].ip == null) | "namespace: \(.metadata.namespace), name: \(.metadata.name), ip: \(.status.loadBalancer.ingress[0].ip)"'

    Check the cloud-controller-manager logs in the 'd8-cloud-provider-*' namespace. If you are using a bare-metal cluster with the metallb module enabled, check that the address space of the pool has not been exhausted.

  • NodeConntrackTableFull CE S3
    The conntrack table is full.

    The conntrack table on the NODE_NAME Node is full!

    No new connections are created or accepted on the Node; note that this may result in strange software issues.

    The recommended course of action is to identify the source of “excess” conntrack entries using Okmeter or Grafana charts.

  • NodeConntrackTableFull CE S4
    The conntrack table is close to the maximum size.

    The conntrack table on the NODE_NAME Node is at VALUE% of the maximum size.

    There's nothing to worry about yet if the conntrack table is only 70-80 percent full. However, if it fills up completely, you will experience problems with new connections, and software may behave strangely.

    The recommended course of action is to identify the source of “excess” conntrack entries using Okmeter or Grafana charts.
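    To see how full the table currently is, you can compare the current and maximum values on the node (a sketch, not from the alert itself):

    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max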

  • NodeExporterDown CE S3
    Prometheus could not scrape a node-exporter

    Prometheus could not scrape a node-exporter for more than 10m, or node-exporters have disappeared from discovery

  • NodeFilesystemIsRO CE S4
    The file system of the node is in read-only mode.

    The file system on the node has switched to read-only mode.

    See the node logs to find out the cause and fix it.

  • NodeSUnreclaimBytesUsageHigh CE S4
    The NODE_NAME Node has high kernel memory usage.

    The NODE_NAME Node has a potential kernel memory leak. There is one known issue that can cause it.

    You should check cgroupDriver on the NODE_NAME Node:

    • cat /var/lib/kubelet/config.yaml | grep 'cgroupDriver: systemd'

    If cgroupDriver is set to systemd, a reboot is required to roll back to the cgroupfs driver. Please drain and reboot the node.

    You can check this issue for extra information.
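    A minimal sketch of draining the node before the reboot (flag names may differ between kubectl versions):

    kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data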

  • NodeSystemExporterDoesNotExistsForNode CE S4

    Some of the Node system exporters don’t work correctly for the NODE_NAME Node.

    The recommended course of action:

    1. Find the Node exporter Pod for this Node: kubectl -n d8-monitoring get pod -l app=node-exporter -o json | jq -r ".items[] | select(.spec.nodeName==\"NODE_NAME\") | .metadata.name";
    2. Describe the Node exporter Pod: kubectl -n d8-monitoring describe pod POD_NAME;
    3. Check that kubelet is running on the NODE_NAME node.
  • NodeTCPMemoryExhaust CE S6
    The NODE_NAME node has high TCP stack memory usage.

    The TCP stack on the NODE_NAME node is experiencing high memory pressure. This could be caused by applications with intensive TCP networking functionality. Investigate the relevant applications and consider adjusting the system’s TCP memory configuration or addressing the source of increased network traffic.
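    To inspect the current TCP memory usage and limits on the node (a sketch, not from the alert itself):

    grep TCP /proc/net/sockstat
    sysctl net.ipv4.tcp_mem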

  • NodeUnschedulable CE S8
    The NODE_NAME Node is cordon-protected; no new Pods can be scheduled onto it.

    The NODE_NAME Node is cordon-protected; no new Pods can be scheduled onto it.

    This means that someone has executed one of the following commands on that Node:

    • kubectl cordon NODE_NAME
    • kubectl drain NODE_NAME that runs for more than 20 minutes

    Probably, this is due to the maintenance of this Node.

  • PodStatusIsIncorrect CE
    The state of the NAMESPACE/POD_NAME Pod running on the NODE_NAME Node is incorrect. You need to restart kubelet.

    There is a NAMESPACE/POD_NAME Pod in the cluster that runs on the NODE_NAME Node and is listed as NotReady while all the Pod's containers are Ready.

    This could be due to a Kubernetes bug.

    The recommended course of action:

    1. Find all the Pods having this state: kubectl get pod -o json --all-namespaces | jq '.items[] | select(.status.phase == "Running") | select(.status.conditions[] | select(.type == "ContainersReady" and .status == "True")) | select(.status.conditions[] | select(.type == "Ready" and .status == "False")) | "\(.spec.nodeName)/\(.metadata.namespace)/\(.metadata.name)"';
    2. Find all the Nodes affected: kubectl get pod -o json --all-namespaces | jq '.items[] | select(.status.phase == "Running") | select(.status.conditions[] | select(.type == "ContainersReady" and .status == "True")) | select(.status.conditions[] | select(.type == "Ready" and .status == "False")) | .spec.nodeName' -r | sort | uniq -c;
    3. Restart kubelet on each Node: systemctl restart kubelet.
  • StorageClassCloudManual CE S6
    Manually deployed StorageClass NAME found in the cluster

    A StorageClass with a cloud-provider provisioner shouldn't be deployed manually. Such StorageClasses are managed by the cloud-provider module; to adjust them, change the module configuration to fit your needs.

  • StorageClassDefaultDuplicate CE S6
    Multiple default StorageClasses found in the cluster

    More than one StorageClass in the cluster is annotated as default. Probably, a manually deployed StorageClass exists that overlaps with the cloud-provider module's default storage configuration.
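    One way to list all StorageClasses annotated as default (a sketch, not from the alert itself):

    kubectl get storageclass -o json | jq -r '.items[] | select(.metadata.annotations."storageclass.kubernetes.io/is-default-class" == "true") | .metadata.name'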

  • UnsupportedContainerRuntimeVersion CE
    Unsupported version of CRI Containerd X.XX.XXX installed for Kubernetes version: VERSION_NUMBER

    An unsupported containerd version X.XX.XXX of the CRI is installed on the NODE_NAME node. Supported CRI versions for Kubernetes VERSION_NUMBER:

    • Containerd 1.4.*
    • Containerd 1.5.*
    • Containerd 1.6.*
    • Containerd 1.7.*

Module monitoring-kubernetes-control-plane

  • K8SApiserverDown CE S3
    No API servers are reachable

    No API servers are reachable or all have disappeared from service discovery

  • K8sCertificateExpiration CE S5
    Kubernetes has API clients with soon expiring certificates

    Some clients connect to COMPONENT_NAME with a certificate that expires soon (in less than 1 day) on node COMPONENT_NAME.

    You need to use kubeadm to check control plane certificates.

    1. Install kubeadm: apt install kubeadm=1.24.*.
    2. Check certificates: kubeadm certs check-expiration

    To check kubelet certificates, on each node you need to:

    1. Check kubelet config:
      ps aux \
        | grep "/usr/bin/kubelet" \
        | grep -o -e "--kubeconfig=\S*" \
        | cut -f2 -d"=" \
        | xargs cat
      
    2. Find the client-certificate or client-certificate-data field.
    3. Check the certificate using openssl (see the example below).
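    A minimal sketch of the openssl check; the kubeconfig and certificate paths below are assumptions, use the ones found in step 1:

    # Inline certificate (client-certificate-data):
    grep client-certificate-data /etc/kubernetes/kubelet.conf | awk '{print $2}' | base64 -d | openssl x509 -noout -enddate
    # Certificate stored as a file (client-certificate):
    openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem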

    There are no tools to help you find other stale kubeconfigs. Consider enabling the control-plane-manager module to make debugging easier in such cases.

  • K8sCertificateExpiration CE S6
    Kubernetes has API clients with soon expiring certificates

    Some clients connect to COMPONENT_NAME with a certificate that expires soon (in less than 7 days) on the NODE_NAME node.

    You need to use kubeadm to check control plane certificates.

    1. Install kubeadm: apt install kubeadm=1.24.*.
    2. Check certificates: kubeadm certs check-expiration

    To check kubelet certificates, on each node you need to:

    1. Check kubelet config:
      ps aux \
        | grep "/usr/bin/kubelet" \
        | grep -o -e "--kubeconfig=\S*" \
        | cut -f2 -d"=" \
        | xargs cat
      
    2. Find the client-certificate or client-certificate-data field.
    3. Check the certificate using openssl.

    There are no tools to help you find other stale kubeconfigs. Consider enabling the control-plane-manager module to make debugging easier in such cases.

  • K8SControllerManagerTargetDown CE S3
    Controller manager is down

    There is no running kube-controller-manager. Deployments and replication controllers are not making progress.

  • K8SSchedulerTargetDown CE S3
    Scheduler is down

    There is no running K8S scheduler. New pods are not being assigned to nodes.

  • KubeEtcdHighFsyncDurations CE S7
    Synching (fsync) WAL files to disk is slow.

    In the last 15 minutes, the 99th percentile of the fsync duration for WAL files is longer than 0.5 seconds: VALUE.

    Possible causes:

    1. High latency of the disk where the etcd data is located;
    2. High CPU usage on the Node.
  • KubeEtcdHighNumberOfLeaderChanges CE S5
    The etcd cluster re-elects the leader too often.

    There were VALUE leader re-elections for the etcd cluster member running on the NODE_NAME Node in the last 10 minutes.

    Possible causes:

    1. High latency of the disk where the etcd data is located;
    2. High CPU usage on the Node;
    3. Degradation of network connectivity between cluster members in the multi-master mode.
  • KubeEtcdInsufficientMembers CE S4
    There are insufficient members in the etcd cluster; the cluster will fail if one of the remaining members becomes unavailable.

    Check the status of the etcd pods: kubectl -n kube-system get pod -l component=etcd.

  • KubeEtcdNoLeader CE S4
    The etcd cluster member running on the NODE_NAME Node has lost the leader.

    Check the status of the etcd Pods: kubectl -n kube-system get pod -l component=etcd | grep NODE_NAME.

  • KubeEtcdTargetAbsent CE S5
    There is no etcd target in Prometheus.

    Check the status of the etcd Pods: kubectl -n kube-system get pod -l component=etcd or Prometheus logs: kubectl -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus

  • KubeEtcdTargetDown CE S5
    Prometheus is unable to scrape etcd metrics.

    Check the status of the etcd Pods: kubectl -n kube-system get pod -l component=etcd or Prometheus logs: kubectl -n d8-monitoring logs -l app.kubernetes.io/name=prometheus -c prometheus.

Module monitoring-ping

  • NodePingPacketLoss CE S4
    Ping loss more than 5%

    ICMP packet loss to node NODE_NAME is more than 5%

Module node-manager

  • CapsInstanceUnavailable CE S8
    There are unavailable instances in the MACHINE_DEPLOYMENT_NAME MachineDeployment.

    The MACHINE_DEPLOYMENT_NAME MachineDeployment has VALUE unavailable instance(s). Check the state of the instances in the cluster: kubectl get instance -l node.deckhouse.io/group=MACHINE_DEPLOYMENT_NAME

  • ClusterHasOrphanedDisks CE S6
    Cloud data discoverer finds disks in the cloud for which there is no PersistentVolume in the cluster

    The cloud data discoverer has found disks in the cloud that have no corresponding PersistentVolume in the cluster. You can manually delete these disks from your cloud: ID: ID, Name: NAME

  • D8BashibleApiserverLocked CE S6
    Bashible-apiserver is locked for too long

    Check that the bashible-apiserver pods are up-to-date and running: kubectl -n d8-cloud-instance-manager get pods -l app=bashible-apiserver

  • D8CloudDataDiscovererCloudRequestError CE S6
    Cloud data discoverer cannot get data from cloud

    Cloud data discoverer cannot get data from cloud. See cloud data discoverer logs for more information: kubectl -n NAMESPACE logs deploy/cloud-data-discoverer

  • D8CloudDataDiscovererSaveError CE S6
    Cloud data discoverer cannot save data to k8s resource

    Cloud data discoverer cannot save data to k8s resource. See cloud data discoverer logs for more information: kubectl -n NAMESPACE logs deploy/cloud-data-discoverer

  • D8ClusterAutoscalerManagerPodIsNotReady CE S8

    The POD_NAME Pod is NOT Ready.

  • D8ClusterAutoscalerPodIsNotRunning CE S8
    The cluster-autoscaler Pod is NOT Running.

    The POD_NAME Pod is STATUS.

    Run the following command to check its status: kubectl -n NAMESPACE get pods POD_NAME -o json | jq .status.

  • D8ClusterAutoscalerPodIsRestartingTooOften CE S9
    Too many cluster-autoscaler restarts have been detected.

    The number of restarts in the last hour: VALUE.

    Excessive cluster-autoscaler restarts indicate that something is wrong. Normally, it should be up and running all the time.

    Please, refer to the corresponding logs: kubectl -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler.

  • D8ClusterAutoscalerTargetAbsent CE S8
    There is no cluster-autoscaler target in Prometheus.

    Cluster-autoscaler automatically scales Nodes in the cluster; its unavailability will result in the inability to add new Nodes if there is a lack of resources to schedule Pods. In addition, the unavailability of cluster-autoscaler may result in over-spending due to provisioned but inactive cloud instances.

    The recommended course of action:

    1. Check the availability and status of cluster-autoscaler Pods: kubectl -n d8-cloud-instance-manager get pods -l app=cluster-autoscaler
    2. Check whether the cluster-autoscaler deployment is present: kubectl -n d8-cloud-instance-manager get deploy cluster-autoscaler
    3. Check the status of the cluster-autoscaler deployment: kubectl -n d8-cloud-instance-manager describe deploy cluster-autoscaler
  • D8ClusterAutoscalerTargetDown CE S8

    Prometheus is unable to scrape cluster-autoscaler’s metrics.

  • D8ClusterAutoscalerTooManyErrors CE S8
    Cluster-autoscaler issues too many errors.

    Cluster-autoscaler’s scaling attempt resulted in an error from the cloud provider.

    Please, refer to the corresponding logs: kubectl -n d8-cloud-instance-manager logs -f -l app=cluster-autoscaler -c cluster-autoscaler.

  • D8MachineControllerManagerPodIsNotReady CE S8

    The POD_NAME Pod is NOT Ready.

  • D8MachineControllerManagerPodIsNotRunning CE S8
    The machine-controller-manager Pod is NOT Running.

    The POD_NAME Pod is STATUS.

    Run the following command to check the status of the Pod: kubectl -n NAMESPACE get pods POD_NAME -o json | jq .status.

  • D8MachineControllerManagerPodIsRestartingTooOften CE S9
    The machine-controller-manager module restarts too often.

    The number of restarts in the last hour: VALUE.

    Excessive machine-controller-manager restarts indicate that something is wrong. Normally, it should be up and running all the time.

    Please, refer to the logs: kubectl -n d8-cloud-instance-manager logs -f -l app=machine-controller-manager -c controller.

  • D8MachineControllerManagerTargetAbsent CE S8
    There is no machine-controller-manager target in Prometheus.

    Machine controller manager manages ephemeral Nodes in the cluster. Its unavailability will result in the inability to add/delete Nodes.

    The recommended course of action:

    1. Check the availability and status of machine-controller-manager Pods: kubectl -n d8-cloud-instance-manager get pods -l app=machine-controller-manager;
    2. Check the availability of the machine-controller-manager Deployment: kubectl -n d8-cloud-instance-manager get deploy machine-controller-manager;
    3. Check the status of the machine-controller-manager Deployment: kubectl -n d8-cloud-instance-manager describe deploy machine-controller-manager.
  • D8MachineControllerManagerTargetDown CE S8

    Prometheus is unable to scrape machine-controller-manager’s metrics.

  • D8NodeGroupIsNotUpdating CE S8
    The NODE_GROUP_NAME node group is not handling the update correctly.

    There is a new update for Nodes of the NODE_GROUP_NAME group; Nodes have learned about the update. However, no Node can get approval to start updating.

    Most likely, there is a problem with the update_approval hook of the node-manager module.

  • D8NodeIsNotUpdating CE S7
    The NODE_NAME Node cannot complete the update.

    There is a new update for the NODE_NAME Node of the NODE_GROUP_NAME group; the Node has learned about the update, requested and received approval, started the update, and ran into a step that causes possible downtime. The update manager (the update_approval hook of the node-manager module) performed the update, and the Node received downtime approval. However, there is no success message about the update.

    Here is how you can view Bashible logs on the Node:

    journalctl -fu bashible
    
  • D8NodeIsNotUpdating CE S8
    The NODE_NAME Node cannot complete the update.

    There is a new update for the NODE_NAME Node of the NODE_GROUP_NAME group; the Node has learned about the update, requested and received approval, but cannot complete the update.

    Here is how you can view Bashible logs on the Node:

    journalctl -fu bashible
    
  • D8NodeIsNotUpdating CE S9
    The NODE_NAME Node does not update.

    There is a new update for the NODE_NAME Node of the NODE_GROUP_NAME group, but the Node has not received the update and is not trying to.

    Most likely, Bashible is not handling the update correctly for some reason. At this point, it must add the update.node.deckhouse.io/waiting-for-approval annotation to the Node so that the update can be approved.

    You can find out the most current version of the update using this command:

    kubectl -n d8-cloud-instance-manager get secret configuration-checksums -o jsonpath={.data.NODE_GROUP_NAME} | base64 -d
    

    Use the following command to find out the version on the Node:

    kubectl get node NODE_NAME -o jsonpath='{.metadata.annotations.node\.deckhouse\.io/configuration-checksum}'
    

    Here is how you can view Bashible logs on the Node:

    journalctl -fu bashible
    
  • D8NodeIsUnmanaged CE S9
    The NODE_NAME Node is not managed by the node-manager module.

    The NODE_NAME Node is not managed by the node-manager module.

    The recommended actions are as follows:

    • Follow these instructions to clean up the node and add it to the cluster: https://deckhouse.io/products/kubernetes-platform/documentation/v1/modules/node-manager/faq.html#how-to-clean-up-a-node-for-adding-to-the-cluster
  • D8NodeUpdateStuckWaitingForDisruptionApproval CE S8
    The NODE_NAME Node cannot get disruption approval.

    There is a new update for the NODE_NAME Node of the NODE_GROUP_NAME group; the Node has learned about the update, requested and received approval, started the update, and ran into a stage that causes possible downtime. For some reason, the Node cannot get that approval (it is issued fully automatically by the update_approval hook of the node-manager).

  • D8ProblematicNodeGroupConfiguration CE S8
    The NODE_NAME Node cannot begin the update.

    There is a new update for Nodes of the NODE_GROUP_NAME group; Nodes have learned about the update. However, NODE_NAME Node cannot be updated.

    Node NODE_NAME has no node.deckhouse.io/configuration-checksum annotation. Perhaps the bootstrap process of the Node did not complete correctly. Check the cloud-init logs (/var/log/cloud-init-output.log) of the Node. There is probably a problematic NodeGroupConfiguration resource for NODE_GROUP_NAME NodeGroup.

  • EarlyOOMPodIsNotReady CE S8

    The POD_NAME Pod has detected unavailable PSI subsystem. Check logs for additional information: kubectl -n d8-cloud-instance-manager logs POD_NAME Possible actions to resolve the problem: * Upgrade kernel to version 4.20 or higher. * Enable Pressure Stall Information. * Disable early oom.

  • NodeGroupHasStaticInternalNetworkCIDRsField CE S9
    NodeGroup NAME has the deprecated field spec.static.internalNetworkCIDRs

    The internal network CIDRs setting is now located in the static cluster configuration. Delete this field from the NAME NodeGroup to resolve this alert; the value has already been migrated to the new location.
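    For example, the field can be removed with a JSON patch (a sketch; the patch fails if the field is already absent):

    kubectl patch ng NAME --type=json -p '[{"op": "remove", "path": "/spec/static/internalNetworkCIDRs"}]'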

  • NodeGroupMasterTaintIsAbsent CE S4
    The 'master' node group does not contain the desired taint.

    The master node group has no node-role.kubernetes.io/control-plane taint. The control-plane nodes are probably misconfigured and can run Pods other than control-plane ones. Please add the following:

      nodeTemplate:
        taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
    

    to the master node group spec. Note that the node-role.kubernetes.io/master taint is deprecated and has no effect in Kubernetes 1.24 and later.
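    You can first check which taints are currently defined in the node group template, for example:

    kubectl get ng master -o jsonpath='{.spec.nodeTemplate.taints}'

    and then add the missing taint with kubectl edit ng master.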

  • NodeGroupReplicasUnavailable CE S7
    There are no available instances in the NODE_GROUP_NAME node group.

    Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:

    1. Cloud provider limits on available resources;
    2. No access to the cloud provider API;
    3. Cloud provider or instance class misconfiguration;
    4. Problems with bootstrapping the Machine.

    The recommended course of action:

    1. Run kubectl get ng NODE_GROUP_NAME -o yaml. The .status.lastMachineFailures field contains all errors related to the creation of Machines;
    2. If the list contains no Machines that have been in the Pending state for more than a couple of minutes, Machines are continuously being created and deleted because of an error: kubectl -n d8-cloud-instance-manager get machine;
    3. If the logs contain no error messages and the Machine stays in the Pending state, check the Machine description: kubectl -n d8-cloud-instance-manager get machine MACHINE_NAME -o json | jq .status.bootstrapStatus;
    4. Output similar to the one below means you have to use nc to examine the bootstrap logs:
      {
        "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.",
        "tcpEndpoint": "192.168.199.158"
      }

    5. If there is no information about the log endpoint, cloud-init is not working correctly. This may be caused by an incorrect instance class configuration for the cloud provider.
  • NodeGroupReplicasUnavailable CE S8
    The number of simultaneously unavailable instances in the NODE_GROUP_NAME node group exceeds the allowed value.

    Possibly, autoscaler has provisioned too many Nodes. Take a look at the state of the Machine in the cluster. Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:

    1. Cloud provider limits on available resources;
    2. No access to the cloud provider API;
    3. Cloud provider or instance class misconfiguration;
    4. Problems with bootstrapping the Machine.

    The recommended course of action:

    1. Run kubectl get ng NODE_GROUP_NAME -o yaml. The .status.lastMachineFailures field contains all errors related to the creation of Machines;
    2. If the list contains no Machines that have been in the Pending state for more than a couple of minutes, Machines are continuously being created and deleted because of an error: kubectl -n d8-cloud-instance-manager get machine;
    3. If the logs contain no error messages and the Machine stays in the Pending state, check the Machine description: kubectl -n d8-cloud-instance-manager get machine MACHINE_NAME -o json | jq .status.bootstrapStatus;
    4. Output similar to the one below means you have to use nc to examine the bootstrap logs:
      {
        "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.",
        "tcpEndpoint": "192.168.199.158"
      }

    5. If there is no information about the log endpoint, cloud-init is not working correctly. This may be caused by an incorrect instance class configuration for the cloud provider.
  • NodeGroupReplicasUnavailable CE S8
    There are unavailable instances in the NODE_GROUP_NAME node group.

    The number of unavailable instances is VALUE. See the relevant alerts for more information. Probably, machine-controller-manager is unable to create a machine using the cloud provider module. Possible causes:

    1. Cloud provider limits on available resources;
    2. No access to the cloud provider API;
    3. Cloud provider or instance class misconfiguration;
    4. Problems with bootstrapping the Machine.

    The recommended course of action:

    1. Run kubectl get ng NODE_GROUP_NAME -o yaml. The .status.lastMachineFailures field contains all errors related to the creation of Machines;
    2. If the list contains no Machines that have been in the Pending state for more than a couple of minutes, Machines are continuously being created and deleted because of an error: kubectl -n d8-cloud-instance-manager get machine;
    3. If the logs contain no error messages and the Machine stays in the Pending state, check the Machine description: kubectl -n d8-cloud-instance-manager get machine MACHINE_NAME -o json | jq .status.bootstrapStatus;
    4. Output similar to the one below means you have to use nc to examine the bootstrap logs:
      {
        "description": "Use 'nc 192.168.199.158 8000' to get bootstrap logs.",
        "tcpEndpoint": "192.168.199.158"
      }

    5. If there is no information about the log endpoint, cloud-init is not working correctly. This may be caused by an incorrect instance class configuration for the cloud provider.
  • NodeRequiresDisruptionApprovalForUpdate CE S8
    The NODE_NAME Node requires disruption approval to proceed with the update

    There is a new update for Nodes and the NODE_NAME Node of the NODE_GROUP_NAME group has learned about the update, requested and received approval, started the update, and ran into a step that causes possible downtime.

    You have to manually approve the disruption since the Manual mode is active in the group settings (disruptions.approvalMode).

    Grant approval to the Node using the update.node.deckhouse.io/disruption-approved= annotation if it is ready for unsafe updates (e.g., drained).

    Caution!!! The Node will not be drained automatically since the manual mode is enabled (disruptions.approvalMode: Manual).

    Caution!!! No need to drain the master node.

    • Use the following commands to drain the Node and grant it update approval:
      kubectl drain NODE_NAME --delete-local-data=true --ignore-daemonsets=true --force=true &&
        kubectl annotate node NODE_NAME update.node.deckhouse.io/disruption-approved=
      
    • Note that you need to uncordon the node after the update is complete (i.e., after the update.node.deckhouse.io/approved annotation has been removed).
      while kubectl get node NODE_NAME -o json | jq -e '.metadata.annotations | has("update.node.deckhouse.io/approved")' > /dev/null; do sleep 1; done
      kubectl uncordon NODE_NAME
      

    Note that if there are several Nodes in a NodeGroup, you will need to repeat this operation for each Node since only one Node can be updated at a time. Perhaps it makes sense to temporarily enable the automatic disruption approval mode (disruptions.approvalMode: Automatic).

  • NodeStuckInDraining CE S6
    The NODE_NAME Node is stuck in draining.

    The NODE_NAME Node of the NODE_GROUP_NAME NodeGroup is stuck in draining.

    You can get more info by running: kubectl -n default get event --field-selector involvedObject.name=NODE_NAME,reason=DrainFailed --sort-by='.metadata.creationTimestamp'

    The error message is: MESSAGE_CONTENTS

  • NodeStuckInDrainingForDisruptionDuringUpdate CE S6
    The NODE_NAME Node is stuck in draining.

    There is a new update for the NODE_NAME Node of the NODE_GROUP_NAME NodeGroup. The Node has learned about the update, requested and received approval, started the update, ran into a step that causes possible downtime, and got stuck in draining while waiting to automatically receive disruption approval.

    You can get more info by running: kubectl -n default get event --field-selector involvedObject.name=NODE_NAME,reason=ScaleDown --sort-by='.metadata.creationTimestamp'
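    It can also help to see which Pods are still running on the Node and may be blocking the drain (for example, Pods with local storage or without a controller):

    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=NODE_NAME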

Module okmeter

  • D8OkmeterAgentPodIsNotReady CE S6

    Okmeter agent is not Ready

Module operator-prometheus

  • D8PrometheusOperatorPodIsNotReady CE S7
    The prometheus-operator Pod is NOT Ready.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action:

    1. Analyze the Deployment info: kubectl -n d8-operator-prometheus describe deploy prometheus-operator;
    2. Examine the status of the Pod and try to figure out why it is not running: kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator.
  • D8PrometheusOperatorPodIsNotRunning CE S7
    The prometheus-operator Pod is NOT Running.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action:

    1. Analyze the Deployment info: kubectl -n d8-operator-prometheus describe deploy prometheus-operator;
    2. Examine the status of the Pod and try to figure out why it is not running: kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator.
  • D8PrometheusOperatorTargetAbsent CE S7
    There is no prometheus-operator target in Prometheus.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action is to analyze the deployment information: kubectl -n d8-operator-prometheus describe deploy prometheus-operator.

  • D8PrometheusOperatorTargetDown CE S8
    Prometheus is unable to scrape prometheus-operator metrics.

    The prometheus-operator Pod is not available.

    The new Prometheus, PrometheusRules, ServiceMonitor settings cannot be applied in the cluster; however, all existing and configured components continue to operate correctly. This problem will not affect alerting or monitoring in the short term (a few days).

    The recommended course of action:

    1. Analyze the Deployment info: kubectl -n d8-operator-prometheus describe deploy prometheus-operator;
    2. Examine the status of the Pod and try to figure out why it is not running: kubectl -n d8-operator-prometheus describe pod -l app=prometheus-operator.

Module prometheus

  • D8GrafanaDeploymentReplicasUnavailable CE S6
    One or more Grafana Pods are NOT Running.

    The number of Grafana replicas is less than the specified number.

    The Deployment is in the MinimumReplicasUnavailable state.

    Run the following command to check the status of the Deployment: kubectl -n d8-monitoring get deployment grafana-v10 -o json | jq .status.

    Run the following command to check the status of the Pods: kubectl -n d8-monitoring get pods -l app=grafana-v10 -o json | jq '.items[] | {(.metadata.name):.status}'.

  • D8GrafanaDeprecatedCustomDashboardDefinition CE S9
    The deprecated ConfigMap for defining Grafana dashboards is detected.

    The grafana-dashboard-definitions-custom ConfigMap was found in the d8-monitoring namespace. This means that the deprecated method of registering custom dashboards in Grafana is being used.

    This method is no longer supported. Please use the GrafanaDashboardDefinition custom resource instead.
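    A minimal sketch of such a resource is shown below (the resource name and dashboard contents are hypothetical; refer to the prometheus module reference for the exact GrafanaDashboardDefinition schema):

    apiVersion: deckhouse.io/v1
    kind: GrafanaDashboardDefinition
    metadata:
      name: my-dashboard          # hypothetical name
    spec:
      folder: My folder           # folder shown in Grafana
      definition: |
        {
          "title": "My dashboard",
          "panels": []
        }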

  • D8GrafanaPodIsNotReady CE S6

    The Grafana Pod is NOT Ready.

  • D8GrafanaPodIsRestartingTooOften CE S9
    Excessive Grafana restarts are detected.

    The number of restarts in the last hour: VALUE.

    Excessive Grafana restarts indicate that something is wrong. Normally, Grafana should be up and running all the time.

    Please, refer to the corresponding logs: kubectl -n d8-monitoring logs -f -l app=grafana-v10 -c grafana.

  • D8GrafanaTargetAbsent CE S6
    There is no Grafana target in Prometheus.

    Grafana visualizes metrics collected by Prometheus. Grafana is critical for some tasks, such as monitoring the state of applications and the cluster as a whole. Additionally, Grafana unavailability can negatively impact users who actively use it in their work.

    The recommended course of action:

    1. Check the availability and status of Grafana Pods: kubectl -n d8-monitoring get pods -l app=grafana-v10;
    2. Check the availability of the Grafana Deployment: kubectl -n d8-monitoring get deployment grafana-v10;
    3. Examine the status of the Grafana Deployment: kubectl -n d8-monitoring describe deployment grafana-v10.
  • D8GrafanaTargetDown CE S6

    Prometheus is unable to scrape Grafana metrics.

  • D8PrometheusLongtermFederationTargetDown CE S5
    prometheus-longterm cannot scrape prometheus.

    prometheus-longterm cannot scrape the /federate endpoint of Prometheus. Check the cause of the error in the prometheus-longterm web UI or logs.

  • D8PrometheusLongtermTargetAbsent CE S7
    There is no prometheus-longterm target in Prometheus.

    This Prometheus component is only used to display historical data and is not crucial. However, if it remains unavailable long enough, you will not be able to view the statistics.

    Usually, Pods of this type have problems because of disk unavailability (e.g., the disk cannot be mounted to a Node for some reason).

    The recommended course of action:

    1. Take a look at the StatefulSet data: kubectl -n d8-monitoring describe statefulset prometheus-longterm;
    2. Explore its PVC (if used): kubectl -n d8-monitoring describe pvc prometheus-longterm-db-prometheus-longterm-0;
    3. Explore the Pod’s state: kubectl -n d8-monitoring describe pod prometheus-longterm-0.
  • D8TricksterTargetAbsent CE S5
    There is no Trickster target in Prometheus.

    The following modules use this component:

    • prometheus-metrics-adapter — the unavailability of the component means that HPA (auto scaling) is not running and you cannot view resource consumption using kubectl;
    • vertical-pod-autoscaler — this module is quite capable of surviving a short-term unavailability, as VPA looks at the consumption history for 8 days;
    • grafana — by default, all dashboards use Trickster for caching requests to Prometheus. You can retrieve data directly from Prometheus (bypassing the Trickster). However, this may lead to high memory usage by Prometheus and, hence, to its unavailability.

    The recommended course of action:

    1. Analyze the Deployment information: kubectl -n d8-monitoring describe deployment trickster;
    2. Analyze the Pod information: kubectl -n d8-monitoring describe pod -l app=trickster;
    3. Usually, Trickster is unavailable due to Prometheus-related issues because the Trickster’s readinessProbe checks the Prometheus availability. Thus, make sure that Prometheus is running: kubectl -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main.
  • D8TricksterTargetAbsent CE S5
    There is no Trickster target in Prometheus.

    The following modules use this component:

    • prometheus-metrics-adapter — the unavailability of the component means that HPA (auto scaling) is not running and you cannot view resource consumption using kubectl;
    • vertical-pod-autoscaler — this module is quite capable of surviving a short-term unavailability, as VPA looks at the consumption history for 8 days;
    • grafana — by default, all dashboards use Trickster for caching requests to Prometheus. You can retrieve data directly from Prometheus (bypassing the Trickster). However, this may lead to high memory usage by Prometheus and, hence, to unavailability.

    The recommended course of action:

    1. Analyze the Deployment stats: kubectl -n d8-monitoring describe deployment trickster;
    2. Analyze the Pod stats: kubectl -n d8-monitoring describe pod -l app=trickster;
    3. Usually, Trickster is unavailable due to Prometheus-related issues because the Trickster’s readinessProbe checks the Prometheus availability. Thus, make sure that Prometheus is running: kubectl -n d8-monitoring describe pod -l app.kubernetes.io/name=prometheus,prometheus=main.
  • DeckhouseModuleUseEmptyDir CE S9
    Deckhouse module MODULE_NAME uses emptyDir as storage.

    The MODULE_NAME Deckhouse module uses emptyDir as storage.

  • GrafanaDashboardAlertRulesDeprecated CE S8
    Deprecated Grafana alerts have been found.

    Before updating to Grafana 10, you must migrate the outdated alerts from Grafana to an external Alertmanager (or the exporter-alertmanager stack). To list all deprecated alert rules, use the following expression:

    sum by (dashboard, panel, alert_rule) (d8_grafana_dashboards_deprecated_alert_rule) > 0

    Attention: the check runs once per hour, so this alert should resolve within an hour after the deprecated resources have been migrated.

  • GrafanaDashboardPanelIntervalDeprecated CE S8
    Deprecated Grafana panel intervals have been found.

    Before updating to Grafana 10, you must rewrite the outdated expressions that use $interval_rv, interval_sx3, or interval_sx4 so that they use $__rate_interval. To list all deprecated panel intervals, use the following expression:

    sum by (dashboard, panel, interval) (d8_grafana_dashboards_deprecated_interval) > 0

    Attention: the check runs once per hour, so this alert should resolve within an hour after the deprecated resources have been migrated.

  • GrafanaDashboardPluginsDeprecated CE S8
    Deprecated Grafana plugins have been found.

    Before updating to Grafana 10, you must check whether the currently installed plugins will work correctly with Grafana 10. To list all potentially outdated plugins, use the following expression:

    sum by (dashboard, panel, plugin) (d8_grafana_dashboards_deprecated_plugin) > 0

    The flant-statusmap-panel plugin is deprecated and will not be supported in the near future. We recommend migrating to the State Timeline plugin: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/state-timeline/

    Attention: the check runs once per hour, so this alert should resolve within an hour after the deprecated resources have been migrated.

  • K8STooManyNodes CE S7
    The number of nodes is close to the maximum allowed.

    The cluster is running VALUE nodes, which is close to the maximum allowed number of XXX nodes.
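    For example, you can check the current number of nodes with:

    kubectl get nodes --no-headers | wc -l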

  • PrometheusDiskUsage CE S4
    Prometheus disk is over 95% used.

    For more information, use the command:

    kubectl -n NAMESPACE exec -ti POD_NAME -c prometheus -- df -PBG /prometheus
    

    Consider increasing the disk size: https://deckhouse.io/products/kubernetes-platform/documentation/v1/modules/prometheus/faq.html#how-to-expand-disk-size

  • PrometheusLongtermRotatingEarlierThanConfiguredRetentionDays CE S4
    Prometheus-longterm data is being rotated earlier than configured retention days

    You need to increase the disk size, reduce the number of metrics, or decrease the longtermRetentionDays module parameter.

  • PrometheusMainRotatingEarlierThanConfiguredRetentionDays CE S4
    Prometheus-main data is being rotated earlier than configured retention days

    You need to increase the disk size, reduce the number of metrics, or decrease the retentionDays module parameter.

  • PrometheusScapeConfigDeclarationDeprecated CE S8
    Additional scrape configs defined via secrets will be deprecated soon

    The old way of describing additional scrape configs via secrets is deprecated in prometheus-operator versions above v0.65.1. Please use the ScrapeConfig CRD instead: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/proposals/202212-scrape-config.md
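    A minimal ScrapeConfig sketch (the resource name, namespace, and target are hypothetical; the monitoring.coreos.com/v1alpha1 API version is assumed, as used by recent prometheus-operator releases):

    apiVersion: monitoring.coreos.com/v1alpha1
    kind: ScrapeConfig
    metadata:
      name: my-scrape-config
      namespace: d8-monitoring
    spec:
      staticConfigs:
        - targets:
            - my-service.my-namespace.svc:8080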

  • PrometheusServiceMonitorDeprecated CE S8
    A deprecated Prometheus ServiceMonitor has been found.

    The Kubernetes cluster uses a more advanced networking mechanism, EndpointSlice. Your ServiceMonitor NAMESPACE/NAME has relabeling rules based on the old Endpoints mechanism (labels starting with __meta_kubernetes_endpoints_). Support for these endpoint-based relabeling rules will be removed in the future (Deckhouse release 1.60). Please migrate to EndpointSlice relabeling rules. To do this, modify the ServiceMonitor by changing the following labels:

    __meta_kubernetes_endpoints_name -> __meta_kubernetes_endpointslice_name
    __meta_kubernetes_endpoints_label_XXX -> __meta_kubernetes_endpointslice_label_XXX
    __meta_kubernetes_endpoints_labelpresent_XXX -> __meta_kubernetes_endpointslice_labelpresent_XXX
    __meta_kubernetes_endpoints_annotation_XXX -> __meta_kubernetes_endpointslice_annotation_XXX
    __meta_kubernetes_endpoints_annotationpresent_XXX -> __meta_kubernetes_endpointslice_annotationpresent_XXX
    __meta_kubernetes_endpoint_node_name -> __meta_kubernetes_endpointslice_endpoint_topology_kubernetes_io_hostname
    __meta_kubernetes_endpoint_ready -> __meta_kubernetes_endpointslice_endpoint_conditions_ready
    __meta_kubernetes_endpoint_port_name -> __meta_kubernetes_endpointslice_port_name
    __meta_kubernetes_endpoint_port_protocol -> __meta_kubernetes_endpointslice_port_protocol
    __meta_kubernetes_endpoint_address_target_kind -> __meta_kubernetes_endpointslice_address_target_kind
    __meta_kubernetes_endpoint_address_target_name -> __meta_kubernetes_endpointslice_address_target_name
    
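    For example, a relabeling rule in the ServiceMonitor would change as follows (a sketch with a hypothetical targetLabel; only the sourceLabels values need to be replaced):

    # Before
    relabelings:
      - sourceLabels: [__meta_kubernetes_endpoints_name]
        targetLabel: endpoint_name
    # After
    relabelings:
      - sourceLabels: [__meta_kubernetes_endpointslice_name]
        targetLabel: endpoint_name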
  • TargetDown CE S5
    Target is down

    XXX target is down.

  • TargetDown CE S6
    Target is down

    XXX target is down.

  • TargetDown CE S7
    Target is down

    XXX target is down.

  • TargetSampleLimitExceeded CE S6
    Scrapes are exceeding the sample limit.

    Targets are down because the sample limit was exceeded.

  • TargetSampleLimitExceeded CE S7
    The sample limit is close to being exceeded.

    The target is close to exceeding the sample limit: less than 10% of the limit remains.
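    To see how close targets are to their limits, you can compare the number of scraped samples with the configured limit, assuming the scrape_sample_limit metric is exposed (it requires the extra-scrape-metrics Prometheus feature):

    scrape_samples_scraped / (scrape_sample_limit > 0) > 0.9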

Module runtime-audit-engine

  • D8RuntimeAuditEngineNotScheduledInCluster EE S4
    Pods of runtime-audit-engine cannot be scheduled in the cluster.

    Some runtime-audit-engine Pods are not scheduled, so the security audit is not fully operational.

    Check the state of the d8-runtime-audit-engine/runtime-audit-engine DaemonSet:

    kubectl -n d8-runtime-audit-engine get daemonset,pod --selector=app=runtime-audit-engine

    To get a list of nodes whose Pods are not in the Ready state, run:

    kubectl -n NAMESPACE get pod -ojson | jq -r '.items[] | select(.metadata.ownerReferences[] | select(.name =="DAEMONSET_NAME")) | select(.status.phase != "Running" or ([ .status.conditions[] | select(.type == "Ready" and .status == "False") ] | length ) == 1 ) | .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[].matchFields[].values[]'
    

Module secret-copier

  • D8SecretCopierDeprecatedLabels CE S9
    Obsolete antiopa_secret_copier=yes label has been found.

    The secrets copier module has changed the service label for the original secrets in the default namespace.

    The old antiopa-secret-copier: "yes" label will soon no longer be supported.

    You have to replace the antiopa-secret-copier: "yes" label with secret-copier.deckhouse.io/enabled: "" for all secrets that the secret-copier module uses in the default namespace.
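    For example, you can list the affected secrets and relabel them (a sketch, assuming the old label key is antiopa-secret-copier as mentioned above; SECRET_NAME is a placeholder):

    kubectl -n default get secret -l antiopa-secret-copier=yes
    kubectl -n default label secret SECRET_NAME secret-copier.deckhouse.io/enabled="" antiopa-secret-copier-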

Module snapshot-controller

  • D8SnapshotControllerPodIsNotReady CE S8
    The snapshot-controller Pod is NOT Ready.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-controller
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-controller
  • D8SnapshotControllerPodIsNotRunning CE S8
    The snapshot-controller Pod is NOT Running.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-controller
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-controller
  • D8SnapshotControllerTargetAbsent CE S8
    There is no snapshot-controller target in Prometheus.

    The recommended course of action:

    1. Check the Pod status: kubectl -n d8-snapshot-controller get pod -l app=snapshot-controller
    2. Or check the Pod logs: kubectl -n d8-snapshot-controller logs -l app=snapshot-controller -c snapshot-controller
  • D8SnapshotControllerTargetDown CE S8
    Prometheus cannot scrape the snapshot-controller metrics.

    The recommended course of action:

    1. Check the Pod status: kubectl -n d8-snapshot-controller get pod -l app=snapshot-controller
    2. Or check the Pod logs: kubectl -n d8-snapshot-controller logs -l app=snapshot-controller -c snapshot-controller
  • D8SnapshotValidationWebhookPodIsNotReady CE S8
    The snapshot-validation-webhook Pod is NOT Ready.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-validation-webhook
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-validation-webhook
  • D8SnapshotValidationWebhookPodIsNotRunning CE S8
    The snapshot-validation-webhook Pod is NOT Running.

    The recommended course of action:

    1. Retrieve details of the Deployment: kubectl -n d8-snapshot-controller describe deploy snapshot-validation-webhook
    2. View the status of the Pod and try to figure out why it is not running: kubectl -n d8-snapshot-controller describe pod -l app=snapshot-validation-webhook

Module terraform-manager

  • D8TerraformStateExporterClusterStateChanged CE S8
    Terraform-state-exporter cluster state changed

    The real Kubernetes cluster state is STATUS_REFERENCE compared to the Terraform state.

    It is important to make them equal. First, run the dhctl terraform check command to see what will change. To converge the Kubernetes cluster state, use the dhctl converge command.
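    A rough sketch of the invocation (dhctl is run from the Deckhouse installer image, and the connection flags below are assumptions that depend on your setup):

    dhctl terraform check --ssh-host MASTER_HOST --ssh-user USER --ssh-agent-private-keys /path/to/key
    dhctl converge --ssh-host MASTER_HOST --ssh-user USER --ssh-agent-private-keys /path/to/key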

  • D8TerraformStateExporterClusterStateError CE S8
    Terraform-state-exporter cluster state error

    Terraform-state-exporter cannot check the difference between the Kubernetes cluster state and the Terraform state.

    This probably happened because terraform-state-exporter failed to run terraform with the current state and config. First, run the dhctl terraform check command to see what will change. To converge the Kubernetes cluster state, use the dhctl converge command.

  • D8TerraformStateExporterHasErrors CE S8
    Terraform-state-exporter has errors

    Errors occurred while terraform-state-exporter was running.

    Check the Pod logs for more details: kubectl -n d8-system logs -l app=terraform-state-exporter -c exporter

  • D8TerraformStateExporterNodeStateChanged CE S8
    Terraform-state-exporter node state changed

    The real state of the Node NODE_GROUP_NAME/NAME is STATUS_REFERENCE compared to the Terraform state.

    It is important to make them equal. First, run the dhctl terraform check command to see what will change. To converge the Kubernetes cluster state, use the dhctl converge command.

  • D8TerraformStateExporterNodeStateError CE S8
    Terraform-state-exporter node state error

    Terraform-state-exporter cannot check the difference between the state of the Node NODE_GROUP_NAME/NAME and the Terraform state.

    This probably happened because terraform-manager failed to run terraform with the current state and config. First, run the dhctl terraform check command to see what will change. To converge the Kubernetes cluster state, use the dhctl converge command.

  • D8TerraformStateExporterNodeTemplateChanged CE S8
    Terraform-state-exporter node template changed

    Terraform-state-exporter found a difference between the node template in the cluster provider configuration and the one in the NAME NodeGroup. The node template is STATUS_REFERENCE.

    First, run the dhctl terraform check command to see what will change. Use the dhctl converge command or manually adjust the NodeGroup settings to fix the issue.

  • D8TerraformStateExporterPodIsNotReady CE S8
    Pod terraform-state-exporter is not Ready

    Terraform-state-exporter cannot check the difference between the real Kubernetes cluster state and the Terraform state.

    Please check:

    1. Deployment description: kubectl -n d8-system describe deploy terraform-state-exporter
    2. Pod status: kubectl -n d8-system describe pod -l app=terraform-state-exporter
  • D8TerraformStateExporterPodIsNotRunning CE S8
    Pod terraform-state-exporter is not Running

    Terraform-state-exporter cannot check the difference between the real Kubernetes cluster state and the Terraform state.

    Please check:

    1. Deployment description: kubectl -n d8-system describe deploy terraform-state-exporter
    2. Pod status: kubectl -n d8-system describe pod -l app=terraform-state-exporter
  • D8TerraformStateExporterTargetAbsent CE S8
    Prometheus has no terraform-state-exporter target

    To get more details, check the Pod state: kubectl -n d8-system get pod -l app=terraform-state-exporter, or the logs: kubectl -n d8-system logs -l app=terraform-state-exporter -c exporter

  • D8TerraformStateExporterTargetDown CE S8
    Prometheus can't scrape terraform-state-exporter

    To get more details, check the Pod state: kubectl -n d8-system get pod -l app=terraform-state-exporter, or the logs: kubectl -n d8-system logs -l app=terraform-state-exporter -c exporter

Module upmeter

  • D8SmokeMiniNotBoundPersistentVolumeClaims CE S9
    Smoke-mini has unbound or lost persistent volume claims.

    The PVC_NAME persistent volume claim status is STATUS.

    There is a problem with PV provisioning. Check the status of the PVC to find the problem: kubectl -n d8-upmeter get pvc PVC_NAME

    If you have no disk provisioning system in the cluster, you can disable ordering volumes for the smoke-mini through the module settings.

  • D8UpmeterAgentPodIsNotReady CE S6

    Upmeter agent is not Ready

  • D8UpmeterAgentReplicasUnavailable CE S6
    One or more Upmeter agent Pods are NOT Running

    Check DaemonSet status: kubectl -n d8-upmeter get daemonset upmeter-agent -o json | jq .status

    Check the status of its pod: kubectl -n d8-upmeter get pods -l app=upmeter-agent -o json | jq '.items[] | {(.metadata.name):.status}'

  • D8UpmeterProbeGarbageConfigmap CE S9
    Garbage produced by basic probe is not being cleaned.

    Probe configmaps found.

    Upmeter agents should clean ConfigMaps produced by control-plane/basic probe. There should not be more configmaps than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the configmaps were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "basic-functionality") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional.

    3. Delete the ConfigMaps manually:

    kubectl -n d8-upmeter delete cm -l heritage=upmeter

  • D8UpmeterProbeGarbageDeployment CE S9
    Garbage produced by controller-manager probe is not being cleaned.

    Average probe deployments count per upmeter-agent pod: VALUE.

    Upmeter agents should clean Deployments produced by control-plane/controller-manager probe. There should not be more deployments than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the deployments were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional, kube-controller-manager in particular.

    3. Delete the deployments manually:

    kubectl -n d8-upmeter delete deploy -l heritage=upmeter

  • D8UpmeterProbeGarbageNamespaces CE S9
    Garbage produced by namespace probe is not being cleaned.

    Average probe namespace per upmeter-agent pod: VALUE.

    Upmeter agents should clean namespaces produced by control-plane/namespace probe. There should not be more of these namespaces than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the namespaces were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "namespace") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional.

    3. Delete the namespaces manually: kubectl -n d8-upmeter delete ns -l heritage=upmeter

  • D8UpmeterProbeGarbagePods CE S9
    Garbage produced by scheduler probe is not being cleaned.

    Average probe pods count per upmeter-agent pod: VALUE.

    Upmeter agents should clean Pods produced by control-plane/scheduler probe. There should not be more of these pods than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver. Or, possibly, the pods were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "scheduler") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional.

    3. Delete the pods manually:

    kubectl -n d8-upmeter delete po -l upmeter-probe=scheduler

  • D8UpmeterProbeGarbagePodsFromDeployments CE S9
    Garbage produced by controller-manager probe is not being cleaned.

    Average probe pods count per upmeter-agent pod: VALUE.

    Upmeter agents should clean Deployments produced by control-plane/controller-manager probe, and hence kube-controller-manager should clean their pods. There should not be more of these pods than master nodes (upmeter-agent is a DaemonSet with master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver or kube-controller-manager. Or, probably, the pods were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "controller-manager") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane is functional, kube-controller-manager in particular.

    3. Delete the pods manually:

    kubectl -n d8-upmeter delete po -l upmeter-probe=controller-manager

  • D8UpmeterProbeGarbageSecretsByCertManager CE S9
    Garbage produced by cert-manager probe is not being cleaned.

    Probe secrets found.

    Upmeter agents should clean up certificates, and the secrets produced by cert-manager should thus be cleaned up as well. There should not be more secrets than master nodes (upmeter-agent is a DaemonSet with a master nodeSelector). Also, they should be deleted within seconds.

    This might be an indication of a problem with kube-apiserver, or cert-manager, or upmeter itself. It is also possible, that the secrets were left by old upmeter-agent pods due to Upmeter update.

    1. Check the upmeter-agent logs:

    kubectl -n d8-upmeter logs -l app=upmeter-agent --tail=-1 | jq -rR 'fromjson? | select(.group=="control-plane" and .probe == "cert-manager") | [.time, .level, .msg] | @tsv'

    2. Check that the control plane and cert-manager are functional.

    3. Delete the certificates manually, and the secrets if needed:

    kubectl -n d8-upmeter delete certificate -l upmeter-probe=cert-manager
    kubectl -n d8-upmeter get secret -ojson | jq -r '.items[] | .metadata.name' | grep upmeter-cm-probe | xargs -n 1 -- kubectl -n d8-upmeter delete secret
    
  • D8UpmeterServerPodIsNotReady CE S6

    Upmeter server is not Ready

  • D8UpmeterServerPodIsRestartingTooOften CE S9
    Upmeter server is restarting too often.

    Restarts for the last hour: VALUE.

    Upmeter server should not restart too often. It should always be running and collecting episodes. Check its logs to find the problem: kubectl -n d8-upmeter logs -f upmeter-0 upmeter

  • D8UpmeterServerReplicasUnavailable CE S6
    One or more Upmeter server Pods are NOT Running

    Check StatefulSet status: kubectl -n d8-upmeter get statefulset upmeter -o json | jq .status

    Check the status of its Pod: kubectl -n d8-upmeter get pods upmeter-0 -o json | jq '{(.metadata.name): .status}'

  • D8UpmeterSmokeMiniMoreThanOnePVxPVC CE S9
    Unnecessary smoke-mini volumes in cluster

    The number of unnecessary smoke-mini PVs: VALUE.

    Smoke-mini PVs should be deleted when released. The smoke-mini storage class probably has the Retain reclaim policy by default, or there is a CSI/cloud issue.

    These PVs contain no valuable data and should be deleted.

    The list of PVs: kubectl get pv | grep disk-smoke-mini.
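    For example, released smoke-mini PVs can be removed like this (review the list before deleting anything):

    kubectl get pv | grep disk-smoke-mini | grep Released | awk '{print $1}' | xargs kubectl delete pv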

  • D8UpmeterTooManyHookProbeObjects CE S9
    Too many UpmeterHookProbe objects in cluster

    Average UpmeterHookProbe count per upmeter-agent pod is VALUE, but should be strictly 1.

    Some of the objects were left by old upmeter-agent pods due to Upmeter update or downscale.

    Once the cause has been investigated, leave only the newest objects corresponding to the existing upmeter-agent pods.

    See kubectl get upmeterhookprobes.deckhouse.io.

Module user-authn

  • D8DexAllTargetsDown CE S6

    Prometheus is unable to scrape Dex metrics.