How Kubernetes picks which pods to delete during scale-in
Source: https://rpadovani.com/k8s-algorithm-pick-pod-scale-in
Have you ever wondered how K8s chooses which pods to delete when a deployment is scaled down? Since it is not documented, I dug into the source code to find out.
TL;DR:
There are 8 rules: when comparing two pods, the rules are applied in order until one of them tells the two pods apart.
- The first thing compared is whether a pod is assigned to a node: pods that are not assigned are deleted first; (Unassigned < assigned;)
- Then, the phase of the pods is the next criterion. A pod in the Pending phase will be deleted before a pod in the Unknown phase, and pods in the Running phase will be deleted last; (PodPending < PodUnknown < PodRunning;)
- Then, the Ready status is compared: pods not Ready will be deleted before pods marked as Ready; (Not ready < ready;)
- If the pod-deletion-cost feature is enabled (we will speak about it later, as it is the only way to influence which pod gets deleted), the pod with the lower controller.kubernetes.io/pod-deletion-cost annotation (if any) will be deleted first; (Lower pod-deletion-cost < higher pod-deletion cost;)
- Then, Kubernetes uses the rank of the pod, that is, the number of related pods running on the same node. The pod with the higher rank will be deleted first; (Doubled up < not doubled up;)
- Then, if both pods are Ready, the pod that has been ready for a shorter amount of time will be deleted before the pod that has been ready for longer; (Been ready for empty time < less time < more time;)
- Then, everything else equal, the pods that have restarted the most will be deleted first; (Pods with containers with higher restart counts < lower restart counts;)
- If nothing else matches, the pod that has been created most recently, according to the CreationTimestamp field, will be deleted first. (Empty creation time pods < newer pods < older pods;)
If two pods are equal on all 8 criteria, so that there is no clear indication of which one should be deleted first, they are sorted by UUID to provide a pseudorandom order. The one that comes first in alphabetical order will be deleted first.
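To make the ordering concrete, here is a minimal Python sketch of the rules above. It is not the Kubernetes code: the `Pod` class, its field names, and the helper functions are hypothetical simplifications made up for illustration. Where the list describes comparing pods pairwise, rule by rule, the sketch instead builds one tuple per pod and sorts on it. For these simplified fields this gives the same order, because tuples are compared element by element until one differs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Phase ordering taken from the list above: Pending < Unknown < Running.
PHASE_ORDER = {"Pending": 0, "Unknown": 1, "Running": 2}


@dataclass
class Pod:
    """Hypothetical, simplified stand-in for a pod (not the Kubernetes API type)."""
    uid: str
    node: str = ""                       # empty string: not assigned to a node
    phase: str = "Running"
    ready: bool = True
    deletion_cost: int = 0               # assumed to be 0 when the annotation is absent
    rank: int = 1                        # number of related pods on the same node
    ready_since: Optional[float] = None  # Unix time; None stands for an empty time
    restarts: int = 0
    created: Optional[float] = None      # Unix time; None stands for an empty time


def newest_first(ts: Optional[float]) -> float:
    """Key under which empty times sort first, then newer, then older."""
    return float("-inf") if ts is None else -ts


def deletion_key(pod: Pod) -> tuple:
    """Build the sort key; a smaller key means the pod is deleted earlier."""
    return (
        bool(pod.node),                  # 1. unassigned < assigned
        PHASE_ORDER[pod.phase],          # 2. Pending < Unknown < Running
        pod.ready,                       # 3. not ready < ready
        pod.deletion_cost,               # 4. lower cost < higher cost
        -pod.rank,                       # 5. higher rank is deleted first
        # 6. Ready time only matters between ready pods; not-ready pods
        #    all get the same constant so this rule never separates them.
        newest_first(pod.ready_since) if pod.ready else 0.0,
        -pod.restarts,                   # 7. more restarts are deleted first
        newest_first(pod.created),       # 8. newer pods are deleted first
        pod.uid,                         # tie-break: alphabetical order
    )


def deletion_order(pods: List[Pod]) -> List[Pod]:
    """Return the pods in the order they would be picked for deletion."""
    return sorted(pods, key=deletion_key)


pods = [
    Pod(uid="a", node="n1", ready_since=110.0, created=100.0),
    Pod(uid="b", node="n1", ready_since=210.0, created=200.0),
    Pod(uid="c", node="n1", phase="Pending", ready=False, created=300.0),
    Pod(uid="d", node="", phase="Pending", ready=False, created=50.0),
    Pod(uid="e", node="n2", ready_since=110.0, created=100.0, deletion_cost=-5),
]

print([p.uid for p in deletion_order(pods)])  # ['d', 'c', 'e', 'b', 'a']
```

In this example, `d` goes first because it is not assigned to a node, `c` follows because it is still Pending, `e` comes next thanks to its lower deletion cost, and `b` is picked before `a` because it has been ready for a shorter time.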