The module runs a descheduler with strategies defined in a Descheduler custom resource.

descheduler every 15 minutes evicts Pods that satisfy strategies enabled in the Descheduler custom resource. This leads to forced run the scheduling process for evicted Pods.

Nuances of descheduler operation

  • descheduler takes into account the priority class when evicting Pods from a high-loaded node (check out the priority-class module);
  • Pods with priorityClassName set to system-cluster-critical or system-node-critical (critical Pods) are never evicts;
  • Pods that are associated with a DaemonSet or aren’t covered by a controller are never evicts;
  • Pods with local storage enabled are never evicts;
  • The Best effort Pods are evicted before Burstable and Guaranteed ones;
  • descheduler takes into account the Pod Disruption Budget: the Pod will not be evicted if descheduling violates the PDB.

Strategies

You can enable, disable, and configure a strategy in the Descheduler custom resource.

HighNodeUtilization

This strategy finds nodes that are under utilized and evicts Pods in the hope that these Pods will be scheduled compactly into fewer nodes. This strategy must be used with the scheduler strategy MostRequestedPriority.

LowNodeUtilization

This strategy finds underutilized or overutilized nodes using cpu/memory/Pods (in %) thresholds and evicts Pods from overutilized nodes hoping that these Pods will be rescheduled on underutilized nodes. Note that this strategy takes into account Pod requests instead of actual resources consumed.

PodLifeTime

This strategy evicts Pods that are Pending for more than 24 hours.

RemoveDuplicates

This strategy makes sure that no more than one Pod of the same controller (RS, RC, Deploy, Job) is running on the same node. If there are two such Pods on one node, the descheduler kills one of them.

Suppose there are three nodes (say, the first node bears the greater load than the other two), and we want to deploy six application replicas. In this case, the scheduler will schedule 0 or 1 Pod to that overutilized node, while other replicas will be distributed between two other nodes. Thus, the descheduler will be killing “extra” Pods on those two nodes every 15 minutes, hoping that the scheduler will bind those Pods to the first node.

RemovePodsHavingTooManyRestarts

This strategy ensures that Pods having over a hundred container restarts (including init-containers) are removed from nodes.

RemovePodsViolatingInterPodAntiAffinity

This strategy ensures that Pods violating inter-pod anti-affinity are removed from nodes. We find it hard to imagine a situation when inter-pod anti-affinity can be violated, while the official descheduler documentation does not provide much guidance either:

This strategy makes sure that Pods violating inter-pod anti-affinity are removed from nodes. For example, if there is podA on node and podB and podC (running on same node) have anti-affinity rules which prohibit them to run on the same node, then podA will be evicted from the node so that podB and podC could run. This issue could happen, when the anti-affinity rules for Pods B, C are created when they are already running on node.

RemovePodsViolatingNodeAffinity

This strategy removes a Pod from a node if the latter no longer satisfies a Pod’s affinity rule (requiredDuringSchedulingIgnoredDuringExecution). The descheduler notices that and evicts the Pod if another node is available that satisfies the affinity rule.

RemovePodsViolatingNodeTaints

This strategy evicts Pods violating NoSchedule taints on nodes. Suppose a Pod set to tolerate some taint is running on a node with this taint. If the node’s taint is updated or removed, the Pod will be evicted.

RemovePodsViolatingTopologySpreadConstraint

This strategy ensures that Pods violating the Pod Topology Spread Constraints will be evicted from nodes.