@bsalamat
May 2017
- How to specify priority for workloads in Kubernetes API.
- Define how the order of these priorities are specified.
- Define how new priority levels are added.
- Effect of priority on scheduling and preemption.
- How preemption works in Kubernetes.
- How quota allocation and accounting works for each priority.
It is fairly common in clusters to have more tasks than what the cluster resources can handle. Often times the workload is a mix of high priority critical tasks, and non-urgent tasks that can wait. Cluster management should be able to distinguish these workloads in order to decide which ones should acquire the resources sooner and which ones can wait. Priority of the workload is one of the key metrics that provides the information to the cluster. This document is a more detailed design proposal for part of the high-level architecture described in Resource sharing architecture for batch and serving workloads in Kubernetes.
This design doc introduces the concept of priorities for pods in Kubernetes and how the priority impacts scheduling and preemption of pods when the cluster runs out of resources. A pod can specify a priority at the creation time. The priority must be one of the valid values and there is a total order on the values. The priority of a pod is independent of its workload type. The priority is global and not specific to a particular namespace.
One could generally expect a pod with higher priority has a higher chance of getting scheduled than the same pod with lower priority. However, there are many other parameters that affect scheduling decisions. So, a high priority pod may or may not be scheduled before lower priority pods. The details of what determines the order at which pods are scheduled are beyond the scope of this document.
Generally, lower priority pods are more likely to get preempted by higher priority pods when cluster has reached a threshold. In such a case, scheduler may decide to preempt lower priority pods to release enough resources for higher priority pending pods. As mentioned before, there are many other parameters that affect scheduling decisions, such as affinity and anti-affinity. If scheduler determines that a high priority pod cannot be scheduled even if lower priority pods are preempted, it will not preempt lower priority pods. Scheduler may have other restrictions on preempting pods, for example, it may refuse to preempt a pod if PodDisruptionBudget is violated. The details of scheduling and preemption decisions are beyond the scope of this document.
Pods may have priority in their pod spec. PodSpec will have two new fields called "PriorityClassName" which is specified by user, and "Priority" which will be populated by Kubernetes. User-specified priority (PriorityClassName) is a string and all of the valid priority classes are defined by a system wide mapping that maps each string to an integer. The PriorityClassName specified in a pod spec must be found in this map or the pod creation request will be rejected. If PriorityClassName is empty, it will resolve to the default priority (See below for more info on name resolution). Once the PriorityClassName is resolved to an integer, it is placed in "Priority" field of PodSpec.
type PodSpec struct {
...
PriorityClassName string
Priority *int32 // Populated by Admission Controller. Users are not allowed to set it directly.
}
The cluster may have many user defined priority classes for various use cases. The following list is an example of how the priorities and their values may look like. Kubernetes will also have special priority class names reserved for critical system pods. Please see System Priority Class Names for more information. Any priority value above 1 billion is reserved for system use. Aside from those system priority classes, Kubernetes is not shipped with predefined priority classes usable by user pods. The main goal of having no built-in priority classes for user pods is to avoid creating defacto standard names which may be hard to change in the future.
system 2147483647 (int_max)
tier1 4000
tier2 2000
tier3 1000
The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority:
- Kubernetes system daemons (per-node like fluentd, and cluster-level like Heapster)
- Critical user infrastructure (e.g. storage servers, monitoring system like Prometheus, etc.)
- Components that are in the user-facing request serving path and must be able to scale up arbitrarily in response to load spikes (web servers, middleware, etc.)
- Important interruptible workloads that need strong guarantee of schedulability and of not being interrupted
- Less important interruptible workloads that need a less strong guarantee of schedulability and of not being interrupted
- Best effort / opportunistic
User requests sent to Kubernetes may have PriorityClassName
in their PodSpec.
Admission controller resolves a PriorityClassName to its corresponding number
and populates the "Priority" field of the pod spec. The rest of Kubernetes
components look at the "Priority" field of pod status and work with the integer
value. In other words, PriorityClassName
will be ignored by the rest of the
system.
We are going to add a new API object called PriorityClass. The priority class defines the mapping between the priority name and its value. It can have an optional description. It is an arbitrary string and is provided only as a guideline for users.
A priority class can be marked as "Global Default" by setting its
GlobalDefault
field to true. If a pod does not specify any PriorityClassName
,
the system resolves it to the value of the global default priority class if
exists. If there is no global default, the pod's priority will be resolved to
zero. Priority admission controller ensures that there is only one global
default priority class.
type PriorityClass struct {
metav1.TypeMeta
// +optional
metav1.ObjectMeta
// The value of this priority class. This is the actual priority that pods
// receive when they have the above name in their pod spec.
Value int32
GlobalDefault bool
Description string
}
As mentioned earlier, a PriorityClassName is resolved by the admission controller to its integral value and Kubernetes components use the integral value. The higher the value, the higher the priority.
There will be special priority class names reserved for system use only. These classes have a value larger than one billion. Priority admission controller ensures that new priority classes will be not created with those names. They are used for critical system pods that must not be preempted. We set default policies that deny creation of pods with PriorityClassNames corresponding to these priorities. Cluster admins can authorize users or service accounts to create pods with these priorities. When non-authorized users set PriorityClassName to one of these priority classes in their pod spec, their pod creation request will be rejected. For pods created by controllers, the service account must be authorized by cluster admins.
Priority classes can be added or removed, but their name and value cannot be
updated. We allow updating GlobalDefault
and Description
as long as there is
a maximum of one global default. While
Kubernetes can work fine if priority classes are changed at run-time, the change
can be confusing to users as pods with a priority class which were created
before the change will have a different priority value than those created after
the change. Deletion of priority classes is allowed, despite the fact that there
may be existing pods that have specified such priority class names in their pod
spec. In other words, there will be no referential integrity for priority
classes. This is another reason that all system components should only work with
the integer value of the priority and not with the PriorityClassName
.
One could delete an existing priority class and create another one with the same name and a different value. By doing so, they can achieve the same effect as updating a priority class, but we still do not allow updating priority classes to prevent accidental changes.
Newly added priority classes cannot have a value higher than what is reserved for "system". The reason for this restriction is that Kubernetes critical system pods will have one of the "system" priorities and no pod should be able to preempt them.
While Kubernetes effectively allows changing priority classes (by deleting and adding them with a different value), it should be done only when absolutely needed. Changing priority classes has the following disadvantages:
- May remove config portability: pod specs written for one cluster are no longer guaranteed to work on a different cluster if the same priority classes do not exist in the second cluster.
- If quota is specified for existing priority classes (at the time of this writing, we don't have this feature in Kubernetes), adding or deleting priority classes will require reconfiguration of quota allocations.
- An existing pods may have an integer value of priority that does not reflect the current value of its PriorityClass.
Kubernetes has three QoS classes which are derived from request and limit of pods. Priority is introduced as an independent concept; meaning that any QoS class may have any valid priority. When a node is out of resources and pods needs to be preempted, we give priority a higher weight over QoS classes. In other words, we preempt the lowest priority pod and break ties with some other metrics, such as, QoS class, usage above request, etc. This is not finalized yet. We will discuss and finalize preemption in a separate doc.