Every 15 minutes, the module analyzes the cluster state and performs pod eviction according to the conditions described in the active strategies. Evicted pods go through the scheduling process again, considering the current state of the cluster. This helps redistribute workloads according to the chosen strategy.

The module is based on the descheduler project.

Features of the module

  • The module can take the pod priority class into account (the spec.priorityClassThreshold parameter), restricting its operation to pods whose priority class is lower than the specified threshold (see the sketch after this list);
  • The module does not evict pods in the following cases:
    • a pod is in the d8-* or kube-system namespaces;
    • a pod has a priorityClassName system-cluster-critical or system-node-critical;
    • a pod uses local storage;
    • a pod is associated with a DaemonSet;
    • evicting the pod would violate its Pod Disruption Budget (PDB);
    • there are no available nodes to run the evicted pod.
  • Pods with the Best effort QoS class are evicted before those with the Burstable and Guaranteed classes.
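
For illustration, a minimal sketch of setting the priority class threshold, assuming a Descheduler custom resource. The apiVersion, kind, and the shape of the priorityClassThreshold value (a priority class name, as in the upstream descheduler's priorityThreshold) are assumptions; consult the module's CRD for the exact schema:

    apiVersion: deckhouse.io/v1alpha2   # assumed API group/version
    kind: Descheduler                   # assumed resource kind
    metadata:
      name: example
    spec:
      # Only pods with a priority class lower than this threshold
      # are considered for eviction.
      priorityClassThreshold:
        name: low-priority              # hypothetical priority class name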

Descheduler uses parameters with the labelSelector syntax from Kubernetes to filter pods and nodes:

  • podLabelSelector — limits pods by labels;
  • namespaceLabelSelector — limits pods by the labels of their namespaces;
  • nodeLabelSelector — selects nodes by labels.
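
A sketch of how these selectors might be combined, assuming they sit directly under spec of the same custom resource; all label keys and values are hypothetical, and the selector syntax itself is the standard Kubernetes labelSelector:

    spec:
      # Evict only pods that carry this label.
      podLabelSelector:
        matchLabels:
          app: my-app
      # Consider only pods in namespaces that carry this label.
      namespaceLabelSelector:
        matchLabels:
          environment: develop
      # Consider only nodes that carry this label.
      nodeLabelSelector:
        matchLabels:
          node.example.com/role: worker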

Strategies

HighNodeUtilization

Places pods more compactly onto fewer nodes. Requires configuring the scheduler and enabling auto-scaling.

To use HighNodeUtilization, you must explicitly specify the high-node-utilization scheduler profile for each pod (this profile cannot be set as the default).
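
A sketch of a pod that opts into this profile, assuming the profile is selected through the pod's schedulerName field; the pod name and image are hypothetical:

    apiVersion: v1
    kind: Pod
    metadata:
      name: compactable-app
    spec:
      # Explicitly request the scheduler profile required by HighNodeUtilization.
      schedulerName: high-node-utilization
      containers:
      - name: app
        image: registry.example.com/app:1.0
        resources:
          requests:
            cpu: 100m
            memory: 128Mi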

This strategy identifies underutilized nodes and evicts their pods so that workloads are packed more compactly onto fewer nodes.

Underutilized node — A node whose resource usage is below all the threshold values specified in the spec.strategies.highNodeUtilization.thresholds section.

The strategy is enabled by the parameter spec.strategies.highNodeUtilization.enabled.
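
A sketch of enabling the strategy with thresholds; the threshold keys and percentage values follow the upstream descheduler convention and are assumptions here:

    spec:
      strategies:
        highNodeUtilization:
          enabled: true
          # A node below ALL of these values is considered underutilized,
          # and its pods become candidates for compaction onto other nodes.
          thresholds:
            cpu: 20       # percent of allocatable CPU requested by pods
            memory: 20    # percent of allocatable memory requested by pods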

In GKE, you cannot configure the default scheduler, but you can use the optimize-utilization strategy or deploy a second custom scheduler.

Node resource usage calculations take extended resources into account and are based on pod resource requests and limits, not on actual consumption. This approach is consistent with the kube-scheduler, which follows the same principle when scheduling pods onto nodes. It also means that the resource usage reported by the kubelet (or tools like kubectl top) may differ from the calculated values, because the kubelet and related tools show actual resource consumption.
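
For example, on a node with 8 allocatable CPU cores whose pods together request 2 cores, the strategy considers CPU utilization to be 25%, even if kubectl top shows those same pods actually consuming noticeably more or less than that.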

LowNodeUtilization

Spreads the load across nodes more evenly.

This strategy identifies underutilized nodes and evicts pods from overutilized nodes. It assumes that the evicted pods will be recreated on the underutilized nodes (following normal scheduler behavior).

Underutilized node — A node whose resource usage is below all the threshold values specified in the spec.strategies.lowNodeUtilization.thresholds section.

Overutilized node — A node whose resource usage exceeds at least one of the threshold values specified in the spec.strategies.lowNodeUtilization.targetThresholds section.

Nodes with resource usage in the range between thresholds and targetThresholds are considered optimally utilized. Pods on these nodes will not be evicted.

The strategy is enabled by the parameter spec.strategies.lowNodeUtilization.enabled.
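
A sketch of enabling the strategy with both threshold sections; the keys and percentage values follow the upstream descheduler convention and are assumptions here:

    spec:
      strategies:
        lowNodeUtilization:
          enabled: true
          # A node below ALL of these values is underutilized:
          # it is expected to receive the rescheduled pods.
          thresholds:
            cpu: 20
            memory: 20
          # A node above ANY of these values is overutilized:
          # pods are evicted from it.
          targetThresholds:
            cpu: 70
            memory: 70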

Node resource usage calculations take extended resources into account and are based on pod resource requests and limits, not on actual consumption. This approach is consistent with the kube-scheduler, which follows the same principle when scheduling pods onto nodes. It also means that the resource usage reported by the kubelet (or tools like kubectl top) may differ from the calculated values, because the kubelet and related tools show actual resource consumption.

RemoveDuplicates

Prevents multiple pods of the same controller (ReplicaSet, ReplicationController, StatefulSet) or the same Job from running on the same node.

The strategy ensures that no more than one pod of the same ReplicaSet, ReplicationController, StatefulSet, or Job runs on a given node. If there are two or more such pods on a node, the module evicts the excess pods so that they are better distributed across the cluster.

This situation can occur if some cluster nodes fail for any reason and their pods are moved to other nodes. Once the failed nodes are available to accept load again, this strategy can be used to evict the duplicate pods from the other nodes.

The strategy is enabled by the parameter spec.strategies.removeDuplicates.enabled.
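
A minimal sketch of enabling the strategy:

    spec:
      strategies:
        removeDuplicates:
          enabled: true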

RemovePodsViolatingInterPodAntiAffinity

Evicts pods violating inter-pod affinity and anti-affinity rules to ensure compliance.

The strategy ensures that pods violating inter-pod affinity and anti-affinity rules are evicted from nodes.

For example, suppose podA is running on a node, and podB and podC (running on the same node) have anti-affinity rules that prohibit them from running on the same node as podA. In this case, podA is evicted from the node so that podB and podC can run. This situation can occur when the anti-affinity rules for podB and podC are created while they are already running on the node.
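
A sketch of the kind of anti-affinity rule podB and podC might carry in this example; the labels and image are hypothetical:

    apiVersion: v1
    kind: Pod
    metadata:
      name: podB
    spec:
      affinity:
        podAntiAffinity:
          # Forbid running on the same node as any pod labeled app=podA.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: podA
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: registry.example.com/app:1.0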

The strategy is enabled by the parameter spec.strategies.removePodsViolatingInterPodAntiAffinity.enabled.

RemovePodsViolatingNodeAffinity

Evicts pods violating node affinity rules to ensure compliance.

The strategy makes sure all pods violating node affinity are eventually removed from nodes.

Essentially, depending on the value of the spec.strategies.removePodsViolatingNodeAffinity.nodeAffinityType parameter, the strategy temporarily treats the pod's requiredDuringSchedulingIgnoredDuringExecution node affinity rule as requiredDuringSchedulingRequiredDuringExecution, and the preferredDuringSchedulingIgnoredDuringExecution rule as preferredDuringSchedulingPreferredDuringExecution.

Example for nodeAffinityType: requiredDuringSchedulingIgnoredDuringExecution. A pod is scheduled to a node that satisfies its requiredDuringSchedulingIgnoredDuringExecution node affinity rule at scheduling time. If over time this node no longer satisfies the rule, and another available node does, the strategy evicts the pod from the node it was originally scheduled to.
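
A sketch of the node affinity from this example; the label key and value are hypothetical:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # The pod may only be scheduled to nodes labeled node-type=compute.
              # If the node later loses this label and another matching node is
              # available, the strategy evicts the pod so it can be rescheduled.
              - key: node-type
                operator: In
                values:
                - compute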

Example for nodeAffinityType: preferredDuringSchedulingIgnoredDuringExecution. There is a pod scheduled to a node because at the time of scheduling there were no other nodes that satisfied the node affinity rule preferredDuringSchedulingIgnoredDuringExecution. If over time an available node that satisfies this rule appears in the cluster, the strategy evicts the pod from the node it was originally scheduled to.

The strategy is enabled by the parameter spec.strategies.removePodsViolatingNodeAffinity.enabled.
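
A sketch of enabling the strategy; whether nodeAffinityType takes a single value or a list is not specified above, so the list form shown here is an assumption:

    spec:
      strategies:
        removePodsViolatingNodeAffinity:
          enabled: true
          # Which affinity rules to enforce after scheduling (assumed list form).
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution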