[Exploration] Introducing Priority to Kubelet Memory Eviction #846

Closed
wants to merge 4 commits into from
Closed
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 60 additions & 0 deletions contributors/design-proposals/priority-eviction.md
@@ -0,0 +1,60 @@
# Introducing Priority to Kubelet Memory Eviction

**Author**: David Ashpole (@dashpole)

**Last Updated**: 7/25/2017

**Status**: Proposal

This document explores various schemes for including priority in kubelet memory evictions.

## Introduction

### Definitions
["Priority"](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-priority-api.md) is an integer, somehow set by users when running their pods. Assumed to be intentionally set, and controlled properly by mechanisms outside of this proposal.
["Quality of Service"](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md), or QoS, is the performance SLO kubernetes provides based on their resource requests and limits.
- Guaranteed: Requests == Limits
- Burstable: Requests < Limits
- Besteffort: No Requests
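
For illustration only, the following is a minimal sketch of how these classes map to a single container's memory request and limit. The helper is hypothetical and simplified; the kubelet's actual QoS logic also considers CPU and every container in the pod.

```go
package main

import "fmt"

// qosClass mirrors the simplified definitions above for a single-container pod,
// considering only memory. A value of 0 means the field is not set.
// Hypothetical helper; not the kubelet's actual implementation.
func qosClass(requestBytes, limitBytes int64) string {
	switch {
	case requestBytes == 0 && limitBytes == 0:
		return "BestEffort" // no requests
	case requestBytes == limitBytes:
		return "Guaranteed" // requests == limits
	default:
		return "Burstable" // requests set, but below limits
	}
}

func main() {
	fmt.Println(qosClass(0, 0))             // BestEffort
	fmt.Println(qosClass(256<<20, 256<<20)) // Guaranteed
	fmt.Println(qosClass(256<<20, 512<<20)) // Burstable
}
```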

["Memory Eviction"](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-eviction.md): is the process of removing a pod from the kubelet when under memory pressure in order to free up resources. Eviction decisions are made at the node level by the kubelet.
"Preemption": is the process of deleting one pod from a node in order to make room for another pod, which is deemed by the scheduler to be more important to run. Preemption decisions are made at the cluster level by the scheduler.

### Background and Motivation
Prior to kubernetes v1.6, the [critical pod](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling-for-critical-pods.md) annotation was introduced to prevent the permanent eviction of "critical" system pods. Static pods are never evicted, since an evicted static pod is never re-run. Non-static critical pods are guaranteed to be rescheduled. The [Kubernetes Priority](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-priority-api.md) proposal introduced the concept of priority in kubernetes v1.7 as a long-term solution for ensuring that more important pods are run ahead of less important ones. Even though critical pods can no longer be permanently evicted, evictions can still disrupt critical cluster functionality. Integrating priority into the existing eviction algorithm can improve cluster stability by decreasing the likelihood of evictions for critical pods in most cases. This proposal explores possible implementations for integrating priority into the kubelet's eviction process.

The current method for ranking pods for eviction is by QoS, then usage over requests. This preserves the invariant that a pod that does not exceed its requests is not evicted, since the sum of the requests on the node cannot exceed the allocatable memory on the node.
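
As a rough sketch, the current ranking can be thought of as a two-key sort. The types, field names, and example values below are hypothetical, not the kubelet's implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// podStats is a hypothetical summary of a pod's memory state used for ranking.
type podStats struct {
	name         string
	qosRank      int   // BestEffort=0, Burstable=1, Guaranteed=2; lower ranks are evicted first
	usageBytes   int64 // current memory working set
	requestBytes int64 // memory request
}

// rankByQoSThenUsageOverRequests sorts pods so the first element is evicted first:
// lower QoS tier first, then larger (usage - request) first.
func rankByQoSThenUsageOverRequests(pods []podStats) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].qosRank != pods[j].qosRank {
			return pods[i].qosRank < pods[j].qosRank
		}
		return pods[i].usageBytes-pods[i].requestBytes > pods[j].usageBytes-pods[j].requestBytes
	})
}

func main() {
	pods := []podStats{
		{"guaranteed-under", 2, 100 << 20, 200 << 20},
		{"burstable-over", 1, 300 << 20, 100 << 20},
		{"besteffort", 0, 50 << 20, 0},
	}
	rankByQoSThenUsageOverRequests(pods)
	fmt.Println(pods[0].name) // besteffort: evicted first despite the smallest usage
}
```
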
> **Member:** for clarity, is oom_score_adj behavior worth explaining here as i think they go hand-in-hand.
>
> **Contributor Author:** added


If the kubelet is unable to respond to memory pressure in time, an OOM Kill may be triggered. In this case, processes are killed based on their OOM Score. See the [OOM Score configuration docs](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#oom-score-configuration-at-the-nodes) for more details.
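
For reference, the QoS-based oom_score_adj assignment described in that document looks roughly like the following sketch. The values and names here are illustrative; consult the linked doc for the authoritative behavior.

```go
package main

import "fmt"

// oomScoreAdj sketches the QoS-based oom_score_adj scheme from the linked
// resource-qos doc: best-effort pods are killed first, guaranteed pods last,
// and burstable pods fall in between based on how much memory they request.
func oomScoreAdj(qos string, memoryRequestBytes, machineMemoryCapacityBytes int64) int {
	switch qos {
	case "Guaranteed":
		return -998 // killed last
	case "BestEffort":
		return 1000 // killed first by the kernel OOM killer
	default: // Burstable: larger requests => lower score => killed later
		adj := 1000 - int(1000*memoryRequestBytes/machineMemoryCapacityBytes)
		if adj < 2 {
			adj = 2
		}
		if adj > 999 {
			adj = 999
		}
		return adj
	}
}

func main() {
	// A burstable pod requesting 4GiB on a 16GiB node.
	fmt.Println(oomScoreAdj("Burstable", 4<<30, 16<<30)) // 750
}
```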

### Goals
> **Member:** i like how you laid this out.

- Transparency. Users should be able to understand why their pods are evicted, and what, if anything, they can do to prevent future evictions.
- Low Abuse. High-priority pods should not be able to intentionally or unintentionally disrupt large numbers of well-behaved pods.
- Respect Priority. Pods that have higher priority should be less likely to be evicted.

The goal of this proposal is to build consensus on the general design of how priority is integrated with memory eviction. This proposal itself will not be merged, but will result in a set of changes to the [kubelet eviction documentation](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-eviction.md) after a broad design is settled on.
> **Member:** +1


### Non Goals
The implementation of priority itself is outside the scope of this proposal, and is covered in the [Priority](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-priority-api.md) proposal. This includes mechanisms to control which pods are allowed to be given which priority levels.

The scope of this design is restricted to the node, and does not make any proposals regarding cluster-level controllers.

## Proposed Implementations
This list is not expected to be exhaustive, but rather to explore options. New options may be added if there is support for them.

The following are proposals for how to rank pods when the node is under memory pressure.
> **Member:** i am aware of a minimum of two priority bands:
>
> - critical pods: deployed by a cluster operator to each node
> - normal pods: everything else
>
> i am biased to evaluate the preferred ranking approach on that relationship.

> **Member:** I think we need to consider at least three: critical pods (BTW that's not just one-per-node pods, but also one-or-a-small-number-per-cluster pods like DNS and Heapster), pods that are in the serving path for external user requests, and opportunistic work (e.g. batch jobs with a flexible deadline). When HPA kicks in to scale up that second category, you want it to push out the third category but not the first category. (Obviously this assumes you can't or don't want to increase the number of nodes, e.g. a non-virtualized on-prem environment or a cloud environment where you don't want to pay more.)


### By QoS, priority, then usage over requests
> **Member:** the cluster operator that does not reserve resources for peak density may prefer this option.
>
> pros:
>
> - operator can size critical pods for an avg or target density
> - cluster operators may get better node utilization
>
> cons:
>
> - i think transparency decreases as the number of priorities increases
> - doesn't align as well with how we calculate oom_score_adj

> **Contributor:** Given that priority is being added to the APIs, wouldn't this option be deviating from the priority APIs?

This solution is closest to the current behavior, since it only makes a small modification: considering priority before usage over requests. Because this is a small change from the current implementation, it should be an easy transition for cluster admins and users. High-priority pods are only able to disrupt pods in their own QoS tier or below, which lowers the potential for abuse, since users can run their workloads as Guaranteed if they need to avoid evictions. However, this means that burstable pods consuming less than their requests could be evicted if a different high-priority burstable pod bursts. High-priority pods gain availability depending on the QoS of the pods they share the node with: if the node has many guaranteed pods, a high-priority pod could still be evicted; if it does not, a high-priority pod is able to consume all memory not consumed by guaranteed pods.
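
A sketch of this ranking as a three-key comparator follows; the types, field names, and example values are hypothetical, not a proposed implementation:

```go
package main

import "fmt"

// Sketch of the "By QoS, priority, then usage over requests" ranking.
type podInfo struct {
	qosRank      int   // BestEffort=0, Burstable=1, Guaranteed=2
	priority     int32 // lower priority is considered for eviction first
	usageBytes   int64
	requestBytes int64
}

// evictedBefore reports whether pod a would be ranked for eviction before pod b.
func evictedBefore(a, b podInfo) bool {
	if a.qosRank != b.qosRank {
		return a.qosRank < b.qosRank // QoS tier still dominates
	}
	if a.priority != b.priority {
		return a.priority < b.priority // priority only breaks ties within a tier
	}
	return a.usageBytes-a.requestBytes > b.usageBytes-b.requestBytes
}

func main() {
	lowPrioBurstable := podInfo{qosRank: 1, priority: 0, usageBytes: 150 << 20, requestBytes: 200 << 20}
	highPrioBurstable := podInfo{qosRank: 1, priority: 1000, usageBytes: 600 << 20, requestBytes: 200 << 20}
	// The low-priority burstable pod is evicted first even though it is under
	// its requests, which is the caveat noted in the paragraph above.
	fmt.Println(evictedBefore(lowPrioBurstable, highPrioBurstable)) // true
}
```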

### By priority, then usage over requests, but only evicting pods where usage > requests
> **Member (@derekwaynecarr, Jul 29, 2017):** the cluster operator that reserves resources for peak density may prefer this option.
>
> pros:
>
> - easy to explain to normal users
> - maps clearly to how the scheduler works with allocatable
> - very transparent (the operator is required to state their reservation)
> - hard to see a clear abuse vector
>
> cons:
>
> - requires more resources reserved for critical pods
>
> i wonder, if we had vertical pod autoscaling (in-place), whether this option would work well for all cases; right now, usage of a daemonset doesn't really allow for variable pod sizing per node, but if we could somehow allow a pod to right-size its requests based on usage, it would feel easiest to explain.

This solution is similar in practice to the "By QoS, priority, then usage over requests" proposal, but preserves the current invariant that pods are guaranteed to be able to consume their requests without facing eviction. Like that proposal, it exempts guaranteed pods from eviction, and thus lowers the potential for abuse by high-priority pods. Users can easily understand that their pods are evicted because they exceed their requests. This solution provides additional availability for high-priority pods by giving them priority access to the node's remaining allocatable memory and to memory that other pods request but do not use.
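
A sketch of this option, with hypothetical types and names, first filters out pods that are within their requests and then ranks the remainder:

```go
package main

import (
	"fmt"
	"sort"
)

// Sketch of "by priority, then usage over requests, only evicting pods whose
// usage exceeds requests".
type victim struct {
	name         string
	priority     int32
	usageBytes   int64
	requestBytes int64
}

// evictionOrder drops pods that stay within their requests, then sorts the
// rest: lowest priority first, then largest usage over requests first.
func evictionOrder(pods []victim) []victim {
	candidates := make([]victim, 0, len(pods))
	for _, p := range pods {
		if p.usageBytes > p.requestBytes { // pods within requests are never evicted
			candidates = append(candidates, p)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		if candidates[i].priority != candidates[j].priority {
			return candidates[i].priority < candidates[j].priority
		}
		return candidates[i].usageBytes-candidates[i].requestBytes >
			candidates[j].usageBytes-candidates[j].requestBytes
	})
	return candidates
}

func main() {
	pods := []victim{
		{"within-requests", 0, 100 << 20, 200 << 20},
		{"low-prio-over", 0, 300 << 20, 100 << 20},
		{"high-prio-over", 1000, 400 << 20, 100 << 20},
	}
	fmt.Println(evictionOrder(pods)[0].name) // low-prio-over
}
```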

### By Priority, QoS, then usage over requests
> **Member:** for me, this violates the low abuse goal.

This solution allows high-priority pods to consume up to their limits with very little chance of eviction, unless other high-priority pods are also present on the same node. It would require using quota controls on pod limits rather than requests, since high-priority pods' requests make only a minor difference in their chance of eviction, and limits are a better indicator of what they can consume. Evictions are easy for users to understand, as they can reason that a higher-priority pod burst, but this option provides no course of action to prevent evictions other than raising the priority of their own pods.
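
A sketch with hypothetical names, showing how priority dominates both QoS and usage under this option:

```go
package main

import "fmt"

// Sketch of "By Priority, QoS, then usage over requests".
type candidate struct {
	priority     int32
	qosRank      int // BestEffort=0, Burstable=1, Guaranteed=2
	usageBytes   int64
	requestBytes int64
}

// evictBefore reports whether a should be evicted before b: priority dominates,
// QoS breaks ties, and usage over requests breaks further ties.
func evictBefore(a, b candidate) bool {
	if a.priority != b.priority {
		return a.priority < b.priority
	}
	if a.qosRank != b.qosRank {
		return a.qosRank < b.qosRank
	}
	return a.usageBytes-a.requestBytes > b.usageBytes-b.requestBytes
}

func main() {
	lowPrioGuaranteed := candidate{priority: 0, qosRank: 2, usageBytes: 100 << 20, requestBytes: 100 << 20}
	highPrioBursting := candidate{priority: 1000, qosRank: 1, usageBytes: 900 << 20, requestBytes: 100 << 20}
	// A well-behaved low-priority guaranteed pod is evicted before a bursting
	// high-priority pod, which is why this option raises abuse concerns.
	fmt.Println(evictBefore(lowPrioGuaranteed, highPrioBursting)) // true
}
```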

### Function(priority, QoS, usage over requests)
> **Member:** i do not want to explain this to our users.

> **Contributor:** Why should users care about eviction policies? Unless a cluster is oversubscribed or a pod doesn't express memory limits, evictions shouldn't matter to users, right?

This solution specifies a mapping between priority, QoS, and usage minus requests. For example, a possible implementation could specify that 100 points of priority, 1 QoS level, and 100MB of usage over requests are equivalent, and then rank pods based on their "score". It has the potential to balance prioritizing high-priority pods against preventing abuse by them. However, it would require cluster administrators to understand this mapping in order to specify priority levels correctly, and would be prone to configuration errors. Users would have little insight into why their pods were evicted.
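
As a sketch of the kind of function this option envisions, using the illustrative equivalence above (100 points of priority, 1 QoS level, and 100MB over requests are worth the same); the weights and names are hypothetical, not a proposed final formula:

```go
package main

import "fmt"

// 100MB of usage over requests costs one point in this illustrative scoring.
const bytesPerPoint = 100 << 20

// evictionScore combines priority, QoS, and usage over requests into a single
// score; the pod with the LOWEST score is evicted first.
func evictionScore(priority int32, qosRank int, usageBytes, requestBytes int64) float64 {
	overage := float64(usageBytes-requestBytes) / float64(bytesPerPoint)
	return float64(priority)/100.0 + float64(qosRank) - overage
}

func main() {
	// A priority-200 burstable pod running 300MB over its request: 2 + 1 - 3 = 0.
	a := evictionScore(200, 1, 500<<20, 200<<20)
	// A priority-0 guaranteed pod at its request: 0 + 2 - 0 = 2.
	b := evictionScore(0, 2, 200<<20, 200<<20)
	fmt.Println(a < b) // true: the bursting high-priority pod is evicted first
}
```
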
> **Contributor:** This option would help in building a resource economy of sorts for power users of Priority. Imagine each level of priority associated with a cost: the less a user pays, the weaker the SLOs they get.
> On the other hand, regular users can ensure that higher priority bands aren't oversubscribed, and reserve oversubscription just for batch workloads, which can tolerate evictions and (typically) do not care about why an eviction happened.

> **Contributor Author (@dashpole, Aug 2, 2017):** For the sake of evaluation, did you have a specific function in mind, or a class of functions (e.g. priority/usage, priority - usage, etc.)?
> I think that in all options, there would be a resource economy for cluster resources. In some cases (e.g. usage < request, priority, usage - request), this resource economy would be primarily based around request quota, and priority would play a less prominent role. In others (e.g. priority, QoS, usage - request), the resource economy would be based entirely around priority. In this case, it would be a combination of both priority and request quota.

> **Contributor:** My understanding is that priority is meant to take the front seat where the underlying capacity may be oversubscribed. For example, at priority 10000, a user gets 10% of cluster capacity with many 9s of SLA. At priority 1000, a user gets 50% of cluster capacity with two 9s of SLA. At priority 0, a user gets access to 80% of cluster capacity with little to no SLA. The fact that usage is above request is only one of the facets considered while choosing a victim.

> **Contributor:** @bsalamat any thoughts? I see you raise similar points in #846 (comment)
> For some reason, this option 3 has made the most sense to me as well since the beginning of this conversation.

> **Member:** I agree. As I've said in this comment, I like this option as well; however, I am not so sure we actually need to add QoS class to the formula. IMO, the percentage of usage above request is enough, as QoS is implied in it. For example, best-effort pods have an infinite percentage of usage above request, which makes them the first candidates for eviction. Similarly, guaranteed pods get the lowest amount of usage above request, and burstable pods get something in between.

> **Contributor Author:**
> > at priority 10000, a user gets 10% of cluster capacity with many 9s of SLA.
>
> What does it mean to "get" or "have access to" a portion of the cluster? What mechanism are we using to enforce that a priority is actually limited to 10% of capacity, if not requests? Can you clarify what the SLA would mean in this case? It doesn't seem possible to provide an SLA without respect to usage.
> The scenario you provided could easily be configured in other options using quota on requests (for the first two) or limits (for the last solution), and doesn't seem unique to the function solution.
>
> I'll update the document for the third option to exclude QoS for now.


## Implementation Timeline
The integration of priority and evictions is targeted for kubernetes v1.8. For clusters that have priority disabled, behavior will be as if all pods had equal priority.