Commit

Add user docs for pod priority and preemption
bsalamat committed Sep 7, 2017
1 parent 8d79723 commit bcee54e
Showing 2 changed files with 177 additions and 0 deletions.
1 change: 1 addition & 0 deletions _data/concepts.yml
@@ -61,6 +61,7 @@ toc:
- docs/concepts/configuration/taint-and-toleration.md
- docs/concepts/configuration/secret.md
- docs/concepts/configuration/organize-cluster-access-kubeconfig.md
- docs/concepts/configuration/pod-priority-preemption.md

- title: Services, Load Balancing, and Networking
section:
176 changes: 176 additions & 0 deletions docs/concepts/configuration/pod-priority-preemption.md
@@ -0,0 +1,176 @@
---
approvers:
- davidopp
- wojtek-t
title: Pod Priority and Preemption
---

[Pods](/docs/user-guide/pods) in Kubernetes 1.8 and later can have priority. Priority
indicates the importance of a pod relative to other pods. When a pod cannot be scheduled, the scheduler tries
to preempt (evict) lower priority pods in order to make scheduling of the pending pod possible.

* TOC
{:toc}

## How to use it
To use priority and preemption in Kubernetes 1.8, follow these
steps:
1. Enable Priority and Preemption.
1. Add one or more PriorityClasses.
1. Create pods with `priorityClassName` set to one of the added PriorityClasses.

The following sections provide more information about these steps.

## Enable Priority and Preemption
Pod priority and preemption is disabled by default in Kubernetes 1.8 because it is an
__alpha__ feature. To enable it, pass the following command-line flag to the API server and the scheduler:

```
--feature-gates=PodPriority=true
```

Once the feature is enabled, you can add PriorityClasses and create pods with `priorityClassName` set.
If you try the feature and then decide to disable it, you must remove this command-line flag or
set it to false, and then restart the API server and the scheduler. Once the feature is disabled, existing
pods may keep their priority fields, but preemption is disabled and the priority
fields are ignored.

**Note:** Alpha features should not be used in production systems! Alpha
features are more likely to have bugs and future changes to them are not guaranteed to
be backward compatible.

## PriorityClass
A PriorityClass object defines a mapping from a PriorityClassName to the integer
value of the priority. The higher the value, the higher the priority. PriorityClass
objects can have any 32-bit integer value smaller than or equal to 1 billion. Larger
numbers are reserved for system use.

PriorityClass also has two optional fields: `globalDefault` and `description`.
`globalDefault` indicates that the value of this PriorityClass should be used for
pods without a `priorityClassName`. Only one PriorityClass with `globalDefault`
set to true can exist in the system. If there is no PriorityClass with `globalDefault`
set, the priority of pods with no `priorityClassName` is zero.

`description` is an arbitrary string. It is meant to tell users of the cluster
when they should use this PriorityClass.


**Note 1:** If you upgrade your existing cluster and enable this feature, the priority
of your existing pods will be considered to be zero.

**Note 2:** Adding a PriorityClass with `globalDefault` set to true does not
change the priority of existing pods. The value of such a PriorityClass is used only
for pods created after the PriorityClass is added.

#### Example PriorityClass
```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
```
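The `globalDefault` behavior described above can be sketched with a second PriorityClass. This is a minimal illustration rather than a recommended setup: the name `default-priority` and the value `1000` are assumptions chosen for the example, and the manifest assumes the alpha API group `scheduling.k8s.io/v1alpha1` that serves PriorityClass in Kubernetes 1.8.

```yaml
# Hypothetical cluster-wide default; name and value are illustrative only.
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: default-priority
value: 1000
# Only one PriorityClass in the cluster may set globalDefault to true.
globalDefault: true
description: "Default priority for pods that do not set priorityClassName."
```

With a class like this in place, pods created afterwards without a `priorityClassName` would resolve to priority 1000 instead of zero. Pods that already exist keep the priority they had.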

## Pod Priority
Once you have one or more PriorityClasses, you can create pods that specify one
of those PriorityClass names in their spec.
The following YAML is an example of a pod configuration that uses the PriorityClass
created above. The Priority admission controller checks the spec and resolves the
priority of the pod to 1,000,000.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
```
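Pods are often created through a controller rather than directly. As a hedged sketch, the same field can be set in the pod template of a Deployment. The Deployment name, the `app: xyz-service` label, and the replica count are assumptions made for this example, and the manifest assumes the `apps/v1beta2` API available in Kubernetes 1.8.

```yaml
# Hypothetical Deployment whose pods all use the high-priority class.
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: xyz-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: xyz-service
  template:
    metadata:
      labels:
        app: xyz-service
    spec:
      # Same field as in the standalone pod example.
      priorityClassName: high-priority
      containers:
      - name: nginx
        image: nginx
```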

## Preemption
When pods are created, they go to a queue and wait to be scheduled. The scheduler picks a pod
from the queue and tries to schedule it on a node. If no node is found that satisfies
all the specified requirements of the pod, the pod is determined infeasible. At this
point, preemption logic is triggered for the pending pod. Let's call the pending pod P.
Preemption logic tries to find a node where removal of pods with lower priority than
P helps schedule P. If such a node is found, one or more lower priority pods are
deleted from the node. Once the pods are gone, P may be scheduled on the node.
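As a hedged illustration of the scenario above, the manifests below sketch a possible victim and a pending pod P. The `low-priority` class, the pod names, and the CPU requests are assumptions made for this example; `high-priority` is the class defined earlier in this document.

```yaml
# Hypothetical lower priority class for work that may be preempted.
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
description: "For batch work that may be preempted."
---
# Possible victim: a low priority pod that is already running on a node.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  priorityClassName: low-priority
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "2"
---
# Pending pod P: higher priority, same CPU request.
apiVersion: v1
kind: Pod
metadata:
  name: pod-p
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "2"
```

If no node has 2 free CPUs, but removing `batch-worker` from its node would free enough, the scheduler may preempt `batch-worker` to make room for `pod-p`.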

### Limitations of Preemption (alpha version)

#### Starvation of Preempting Pod
When pods are preempted, the victims get their
[graceful termination period](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
They have that much time to finish their work and exit. If they don't, they are
killed. This graceful termination period creates a time gap between the point at which the
scheduler preempts pods and the time when the pending pod (P) can be scheduled on the node (N).

When there are multiple victims on node N, they may exit or get terminated at
various points in time. When one exits, it frees some room on node N. Pod P can
only be scheduled on node N when __all__ the victims have exited, but the scheduling
queue may contain other, smaller pods that fit on node N once some of the victims have
exited. The scheduler may schedule them on the node. In that case, it is likely that
when all victims have exited, pod P no longer fits on node N, so the scheduler has to
preempt other pods on node N or on another node to let P schedule. This scenario may
repeat in the second and subsequent rounds of preemption, and P may not
get scheduled for a while. It can cause problems in various clusters, but
is particularly problematic in clusters where many new pods are created all the time.

We intend to address this problem in the beta version of pod preemption. The solution
we plan to implement is [provided here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-preemption.md#preemption-mechanics).

#### PodDisruptionBudget is not supported
A [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
allows application owners to limit the number of pods of a replicated application that
are down simultaneously from voluntary disruptions. However, the alpha version of
preemption does not respect PDBs when choosing preemption victims.

We plan to add PDB support in beta, but even in beta, respecting PDBs will be best
effort. The scheduler will try to find victims whose
PDBs won't be violated by preemption, but if no such victims are found, preemption
will still happen and lower priority pods will be removed despite their PDBs
being violated.
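For reference, a PDB of the kind discussed here might look like the following sketch. The name, the `app: xyz-service` label, and the `minAvailable` value are assumptions made for this example, and the manifest assumes the `policy/v1beta1` API.

```yaml
# Hypothetical PDB: asks that at least 2 matching pods stay available
# during voluntary disruptions.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: xyz-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: xyz-service
```

In the alpha version of preemption, a budget like this does not protect the matching pods from being chosen as preemption victims.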

#### Inter-Pod Affinity on Lower Priority Pods
The current implementation of preemption considers a node for preemption only when
the answer to this question is positive: "If all the pods with lower priority than
the pending pod are removed from the node, can the pending pod be scheduled on
the node?"

If the answer is no, that node is not chosen for preemption. If the pending
pod has inter-pod affinity to one or more of those lower priority pods on the node, the
inter-pod affinity rule cannot be satisfied in the absence of the lower priority
pods, and the scheduler finds the pending pod infeasible on the node. As a result,
it does not try to preempt any pods on that node.
The scheduler tries to find other nodes for preemption and could possibly find another
one, but there is no guarantee that such a node will be found.

We may address this issue in future versions, but we don't have a clear plan. Part
of the reason is that finding the set of lower priority pods that satisfy all
inter-pod affinity/anti-affinity rules is computationally expensive and adds
substantial complexity to the preemption logic. Besides, even if preemption keeps the lower
priority pods to satisfy inter-pod affinity, the lower priority pods may be preempted
later by other pods, which removes the benefit of the complex logic for
respecting inter-pod affinity to lower priority pods.

Our recommended solution for this problem is to create inter-pod affinity only towards
equal or higher priority pods.
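A minimal sketch of that recommendation follows. It assumes a set of pods labeled `app: cache` that were themselves created with `high-priority`; the pod name, labels, and topology key are illustrative choices, not part of the feature.

```yaml
# Hypothetical pod with required affinity to "cache" pods on the same node.
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
  labels:
    app: web-frontend
spec:
  priorityClassName: high-priority
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx
```

Nothing in the manifest enforces that the `cache` pods run at an equal or higher priority. That has to be arranged by whoever creates them, for example by giving both workloads the same `priorityClassName`.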

#### Cross Node Preemption
When node N is considered for preemption in order to schedule pending pod P,
P may become feasible on N only if pods on other nodes are preempted. For
example, if existing lower priority pods in a zone have anti-affinity
towards pod P, P can be scheduled in the zone only when those lower priority pods
are preempted. The current preemption algorithm does not preempt pods
on nodes other than N when considering N for preemption.

We may consider adding cross node preemption in future versions if we find an
algorithm with reasonable performance.
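The zone example can be sketched as follows. The pod name, the `low-priority` class, the `app: xyz-service` label assumed to be carried by pod P, and the zone topology key are all assumptions made for this illustration.

```yaml
# Hypothetical low priority pod that refuses to share a zone with
# pods labeled app=xyz-service (the label assumed for pending pod P).
apiVersion: v1
kind: Pod
metadata:
  name: zone-exclusive-job
spec:
  priorityClassName: low-priority
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - xyz-service
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
```

If this pod runs on a node other than N but in the same zone, P stays infeasible on N until the pod is removed, and the current algorithm does not remove it while evaluating N.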
