From 770913ef6aee4ace1486791257fcc1c2cc5420cf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Wojciech=20Tyczy=C5=84ski?= Date: Thu, 22 May 2025 09:08:23 +0200 Subject: [PATCH 1/2] Extending the semantics of nominated node name --- keps/prod-readiness/sig-scheduling/5278.yaml | 3 + .../README.md | 983 ++++++++++++++++++ .../kep.yaml | 36 + 3 files changed, 1022 insertions(+) create mode 100644 keps/prod-readiness/sig-scheduling/5278.yaml create mode 100644 keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md create mode 100644 keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml diff --git a/keps/prod-readiness/sig-scheduling/5278.yaml b/keps/prod-readiness/sig-scheduling/5278.yaml new file mode 100644 index 00000000000..b3de9d3e79c --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/5278.yaml @@ -0,0 +1,3 @@ +kep-number: 5278 +alpha: + approver: "@soltysh" diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md new file mode 100644 index 00000000000..802c51a8e8a --- /dev/null +++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md @@ -0,0 +1,983 @@ + +# KEP-5278: Nominated node name for an expected pod placement + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [External components need to know the pod is going to be bound](#external-components-need-to-know-the-pod-is-going-to-be-bound) + - [External components want to specify a preferred pod placement](#external-components-want-to-specify-a-preferred-pod-placement) + - [Retain the scheduling decision](#retain-the-scheduling-decision) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1: Prevent inappropriate scale downs by Cluster Autoscaler](#story-1-prevent-inappropriate-scale-downs-by-cluster-autoscaler) + - [Story 2: Cluster Autoscaler specifies NominatedNodeName to indicate where pods can go after new nodes are created/registered](#story-2-cluster-autoscaler-specifies-nominatednodename-to-indicate-where-pods-can-go-after-new-nodes-are-createdregistered) + - [Risks and Mitigations](#risks-and-mitigations) + - [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver) + - [Race condition](#race-condition) + - [Confusion if NominatedNodeName is different from NodeName after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all) + - [What if there are multiple components that could set NominatedNodeName on the same pod](#what-if-there-are-multiple-components-that-could-set-nominatednodename-on-the-same-pod) + - [[CA scenario] If the cluster autoscaler puts unexisting node's name on NominatedNodeName, the scheduler clears it](#ca-scenario-if-the-cluster-autoscaler-puts-unexisting-nodes-name-on-nominatednodename-the-scheduler-clears-it) + - [[CA scenario] A new node's taint prevents the pod from going there, and the scheduler ends up clearing NominatedNodeName](#ca-scenario-a-new-nodes-taint-prevents-the-pod-from-going-there-and-the-scheduler-ends-up-clearing-nominatednodename) +- [Design Details](#design-details) + - [The scheduler puts NominatedNodeName](#the-scheduler-puts-nominatednodename) + - [External components put NominatedNodeName](#external-components-put-nominatednodename) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit 
tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +Use `NominatedNodeName` to express an pod placement, expected by the scheduler or expected by other components. + +The scheduler puts `NominatedNodeName` at the beginning of binding cycles to show an expected pod placement to other components. +And, also other components can put `NominatedNodeName` on pending pods to indicate the pod is prefered to be scheduled on a specific node. + +## Motivation + +### External components need to know the pod is going to be bound + +The scheduler reserves the place for the pod when the pod is entering the binding cycle. +This reservation is internally implemented in the scheduler's cache, and is not visible to other components. 
+ +The specific problem is, as shown at [#125491](https://github.com/kubernetes/kubernetes/issues/125491), +if the binding cycle takes time before binding pods to nodes (e.g., PreBind takes time to handle volumes) +the cluster autoscaler cannot understand the pod is going to be bound there, +misunderstands the node is low-utilized (because the scheduler keeps the place of the pod), and deletes the node. + +We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action +based on the expected pod placement. + +### External components want to specify a preferred pod placement + +The cluster autoscaler or Kueue internally calculates the pod placement, +and create new nodes or un-gate pods based on the calculation result. + +So, they know where those pods are likely going to be scheduled. + +By specifing their expectation on `NominatedNodeName`, the scheduler can first check whether the pod can go to the nominated node, +speeding up the filter phase. + +### Retain the scheduling decision + +At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod's scheduling result. + +If the scheduler restarts while it's handling some pods at binding cycles, +kube-scheduler could decide to schedule a pod to a different node. +If we can keep where the pod was going to go at `NominatedNodeName`, the scheduler likely picks up the same node, and the PreBind plugins can restart their work from where they were before the restart. + +### Goals + +- The scheduler use `NominatedNodeName` to express where the pod is going to go before actually bound them. +- Make sure external components can use `NominatedNodeName` to express where they prefer the pod is going to. + - Probably, you can do this with a today's scheduler as well. This proposal wants to discuss/make sure if it actually works, and then add tests etc. + +### Non-Goals + +- Extenral components can enforce the scheduler to pick up a specific node via `NominatedNodeName`. + - `NominatedNodeName` is just a hint for scheduler and doesn't represent a hard requirement + +## Proposal + +### User Stories (Optional) + +Here is the all use cases of NominatedNodeNames that we're taking into consideration: +- The scheduler puts it after the preemption (already implemented) +- The scheduler puts it at the beginning of binding cycles (only if the binding cycles invole PreBind phase) +- The cluster autoscaler puts it after creating a new node for pending pod(s) so that the scheduler can find a place faster when the node is created. +- Kueue uses it to determine a prefered node for the pod based on their internal calculation (Topology aware scheduling etc) + +(Possibly, our future initiative around the workload scheduling (including gang scheduling) can also utilize it, +but we don't discuss it here because it's not yet concreted at all.) + +#### Story 1: Prevent inappropriate scale downs by Cluster Autoscaler + +The scheduler starts to expose where the pod is going to with `NominatedNodeName` at the beginning of binding cycles. +And, the cluster autoscaler takes `NominatedNodeName` into consideration when calculating which nodes they delete. + +It helps the scenarios where the binding cycles take time, for example, VolumeBinding plugin takes time at PreBind extension point. 
+ +#### Story 2: Cluster Autoscaler specifies `NominatedNodeName` to indicate where pods can go after new nodes are created/registered + +Usually, the scheduler scans all the nodes in the cluster when scheduling pods. + +When the cluster autoscaler creates instances for pending pods, it calculate which new node might get which pending pod. +If they can put `NominatedNodeName` based on those calculation, it could tell the scheduler that the node can probably picked up for the pod's scheduling, +prevenging the double effort of scanning/calculating all nodes again at the scheduling retries. + +#### Story 3: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to + +When Kueue determines where pods are prefered to being scheduled on, based on their internal scheduling soft constraints (Preferred Topology Aware Scheduling, etc) +currently, they just put the node selector to tell the scheduler about their preference, and then un-gate the pods. + +After this proposal, they can specify `NominatedNodeName` instead of a prefered node selector, +which makes the probability of pods being scheduled onto the node higher. + +### Risks and Mitigations + + + +#### Increasing the load to kube-apiserver + +If we simply implement this, we'd double the API calls during a simple binding cycle (NNN + actual binding), +which would increase the load to kube-apiserver significantly. + +To prevent that, we'll skip setting `NominatedNodeName` when all PreBind plugins have nothing to do with the pod. +(We'll discuss how-to in the later section.) +Then, setting `NominatedNodeName` happens only when, for example, a pod has a volume that VolumeBinding plugin needs to handle at PreBind. + +Of course, the API calls would still be increasing especially if most of pods have delayed binding. +However, those cases should actually be ok to have those additional calls because these will have other calls related to those operations (e.g., PV creation, etc.) - so the overhead of an additional call is effectively a smaller percentage of the e2e flow. + +#### Race condition + +If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle, +`NominatedNodeName` isn't taken into account (of course), and the pod could be scheduled onto a different node. + +But, this should be fine because, either way, we're not saying `NominatedNodeName` is something forcing the scheduler to pick up the node, +rather it's just a preference. + +#### Confusion if `NominatedNodeName` is different from `NodeName` after all + +If an external component adds `NominatedNodeName`, but the scheduler picks up a different node, +`NominatedNodeName` is just overwritten by a final decision of the scheduler. + +But, if an external component updates `NominatedNodeName` that is set by the scheduler, +the pod could end up having different `NominatedNodeName` and `NodeName`. + +Probably we should clear `NominatedNodeName` when the pod is bound. (at binding api) + +#### What if there are multiple components that could set `NominatedNodeName` on the same pod + +Multiple controllers might keep overwriting NominatedNodeName that is set by the others. +Of course, we can regard that just as user's fault though, that'd be undesired situation. + +There could be several ideas to mitigate, or even completely solve by adding a new API. +But, we wouldn't like to introduce any complexity right now because we're not sure how many users would start using this, +and hit this problem. 
+ +So, for now, we'll just document it somewhere as a risk, unrecommended situation, and in the future, we'll consider something +if we actually observe this problem getting bigger by many people starting using it. + +#### [CA scenario] If the cluster autoscaler puts unexisting node's name on `NominatedNodeName`, the scheduler clears it + +The current scheduler clears the node name from `NominatedNodeName` if the pod goes through the scheduling cycle, +and the node doesn't exist. + +In order for the cluster autoscaler to levarage this feature, +it has to put unexisting node's name, which is supposed to be registered later after its scale up, +so that the scheduler can schedule pending pods on those new nodes as soon as possible after nodes are registered. + +So, we need to keep the node's name on `NominatedNodeName` even when the node doesn't exist. +We'll discuss it at [Only modifying `NominatedNodeName`](#only-modifying-nominatednodename) section. + +#### [CA scenario] A new node's taint prevents the pod from going there, and the scheduler ends up clearing `NominatedNodeName` + +With the current scheduler, what happens if CA puts `NominatedNodeName` is: +1. Pods are unschedulable. For the simplicity, let's say all of them are rejected by NodeResourceFit plugin. (i.e., no node has enough CPU/memory for pod's request) +2. CA finds them, calculates nodes necessary to be created +3. CA puts `NominatedNodeName` on each pod +4. The scheduler keeps trying to schedule those pending pods though, here let's say they're unschedulable (no cluster event happens that could make pods schedulable) until the node is created. +5. The nodes are created, and registered to kube-apiserver. Let's say, at this point, nodes have un-ready taints. +6. The scheduler observes `Node/Create` event, `NodeResourceFit` plugin QHint returns `Queue`, and those pending pods are requeued to activeQ. +7. The scheduling cycle starts handling those pending pods. +8. However, because nodes have un-ready taints, pods are rejected by `TaintToleration` plugin. +9. The scheduler clears `NominatedNodeName` because it finds the nominated node (= new node) unschedulable. + +So, after all, `NominatedNodeName` added by CA in this scaling up scenario doesn't add any value, +unless the taints are removed in a short time (between 6 and 7). + +So, we need to keep the node's name on `NominatedNodeName` even when the node doesn't fit right now. +We'll discuss it at [Only modifying `NominatedNodeName`](#only-modifying-nominatednodename) section. + +## Design Details + + +### The scheduler puts `NominatedNodeName` + +After the pod is permitted at `WaitOnPermit`, the scheduler needs to update `NominatedNodeName` with the node that it determines the pod is going to. + +Also, in order to set `NominatedNodeName` only when some PreBind plugins work, we need to add a new function (or create a new extension point, if we are concerned about the breaking change to the existing PreBind plugins). + +```go +type PreBindPlugin interface { + Plugin + // **New Function** (or we can have a separate Plugin interface for this, if we're concerned about a breaking change for custom plugins) + // It's called before PreBind, and the plugin is supposed to return Success, Skip, or Error status. + // If it returns Skip, it means this PreBind plugin has nothing to do with the pod. 
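+	// For example (illustrative only, not prescribing plugin behavior): a volume-handling plugin
+	// such as VolumeBinding would be expected to return Skip here when the pod has no volumes it
+	// needs to prepare at PreBind, and a non-Skip status otherwise, so that the scheduler only
+	// sets NominatedNodeName for pods whose binding cycle actually performs delayed work.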
+ // This function should be lightweight, and shouldn't do any actual operation, e.g., creating a volume etc + PreBindPreFlight(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status + + PreBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status +} +``` + +The scheduler would run a new function `PreBindPreFlight()` before `PreBind()` functions, +and if all PreBind plugins return Skip status from new functions, we can skip setting `NominatedNodeName`. + +This is a similar approach we're doing with PreFilter/PreScore -> Filter/Score. +We determine if each plugin is relevant to the pod by Skip status from PreFilter/PreScore, and then determine whether to run Filter/Score function accordingly. + +In this way, even if users have some PreBind custom plugins, they can implement `PreBindPreFlight()` appropriately +so that the scheduler can wisely skip setting `NominatedNodeName`, taking their custom logic into consideration. + +### External components put `NominatedNodeName` + +There aren't any restrictions preventing other components from setting NominatedNodeName as of now. +However, we don't have any validation of how that currently works. +To support the usecases mentioned above we will adjust the scheduler to do the following: +- if NominatedNodeName is set, but corresponding Node doesn't exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon] +- We will rely on the fact that a pod with NominatedNodeName set is resulting in the in-memory reservation for requested resources. +Higher-priority pods can ignore it, but pods with equal or lower priority don't have access to these resources. +This allows us to prioritize nominated pods when nomination was done by external components. +We just need to ensure that in case when NominatedNodeName was assigned by an external component, this nomination will get reflected in scheduler memory. + +We will implement integration tests simulating the above behavior of external components. + +#### The scheduler only modifies `NominatedNodeName`, not clears it in any cases + +As described at the risk section, there are two problematic scenarios where this use case wouldn't work. +- [[CA scenario] If the cluster autoscaler puts unexisting node's name on `NominatedNodeName`, the scheduler clears it](#ca-scenario-if-the-cluster-autoscaler-puts-unexisting-nodes-name-on-nominatednodename-the-scheduler-clears-it) +- [[CA scenario] A new node's taint prevents the pod from going there, and the scheduler ends up clearing `NominatedNodeName`](#ca-scenario-a-new-nodes-taint-prevents-the-pod-from-going-there-and-the-scheduler-ends-up-clearing-nominatednodename) + +Currently, the scheduler clears `NominatedNodeName` at the end of failed scheduling cycles if it found the nominated node unschedulable for the pod. +In order to avoid above two scenarios, we have to remove this clearing logic; change the scheduler not to clear `NominatedNodeName` in any cases. +It means, even if the node on `NominatedNodeName` isn't valid anymore, the scheduler keeps trying the node first. +We regard the additional cost of checking `NominatedNodeName` first unnecessarily isn't reletively big (especially for big clusters, where the performance is critical) because it's just one iteration of Filter plugins. +e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of Filter plugins, approximately. So, adding one iteration on top of that doesn't matter. 
+ +Also, note that we still allow the scheduler overwrite `NominatedNodeName` when it triggers the preemption for the pod. + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +##### e2e tests + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? 
+ + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml new file mode 100644 index 00000000000..bc4e0db3d96 --- /dev/null +++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml @@ -0,0 +1,36 @@ +title: Nominated node name for an expected pod placement +kep-number: 5278 +authors: + - "@sanposhiho" + - "@wojtek-t" +owning-sig: sig-scheduling +participating-sigs: + - sig-autoscaling +status: provisional +creation-date: 2025-05-07 +reviewers: + - "@macsko" + - "@dom4ha" +approvers: + - "@macsko" + - "@dom4ha" + +stage: alpha + +latest-milestone: "v1.34" + +milestone: + alpha: "v1.34" + beta: "v1.35" + stable: "v1.36" + +feature-gates: + - name: NominatedNodeNameForExpectation + components: + - kube-scheduler + - kube-apiserver +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - tbd From 2553b15879eacabca136c0806cbc079e7df3b338 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Wojciech=20Tyczy=C5=84ski?= Date: Thu, 22 May 2025 11:08:05 +0200 Subject: [PATCH 2/2] NominatedNodeName KEP adjustments --- .../README.md | 228 ++++++++++++------ 1 file changed, 148 insertions(+), 80 deletions(-) diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md index 802c51a8e8a..8f25f45311d 100644 --- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md +++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md @@ -205,13 +205,13 @@ based on the expected pod placement. ### External components want to specify a preferred pod placement -The cluster autoscaler or Kueue internally calculates the pod placement, -and create new nodes or un-gate pods based on the calculation result. +The ClusterAutoscaler or Karpenter internally calculate the pod placement, +and create new nodes or un-gate pods based on the calculation result. +The shape and count of newly added nodes assumes some particular pod placement +and the pods may not fit or satisfy scheduling constraints if placed differently. -So, they know where those pods are likely going to be scheduled. - -By specifing their expectation on `NominatedNodeName`, the scheduler can first check whether the pod can go to the nominated node, -speeding up the filter phase. 
+By specifying their expectation in `NominatedNodeName`, the scheduler can first check
+whether the pod can go to the nominated node, reducing the end-to-end scheduling time.
 
 ### Retain the scheduling decision
 
@@ -219,7 +219,8 @@ At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod's scheduling result.
 
 If the scheduler restarts while it's handling some pods at binding cycles,
 kube-scheduler could decide to schedule a pod to a different node.
-If we can keep where the pod was going to go at `NominatedNodeName`, the scheduler likely picks up the same node, and the PreBind plugins can restart their work from where they were before the restart.
+If we record where the pod was going to go in `NominatedNodeName`, the scheduler will likely pick the same node,
+and the PreBind plugins can resume their work from where they were before the restart.
 
 ### Goals
 
@@ -229,7 +230,7 @@ If we can keep where the pod was going to go at `NominatedNodeName`, the schedul
 
 ### Non-Goals
 
-- Extenral components can enforce the scheduler to pick up a specific node via `NominatedNodeName`.
+- External components can force the scheduler to pick a specific node via `NominatedNodeName`.
   - `NominatedNodeName` is just a hint for scheduler and doesn't represent a hard requirement
 
 ## Proposal
 
@@ -239,68 +240,129 @@ If we can keep where the pod was going to go at `NominatedNodeName`, the schedul
 Here is the all use cases of NominatedNodeNames that we're taking into consideration:
 - The scheduler puts it after the preemption (already implemented)
 - The scheduler puts it at the beginning of binding cycles (only if the binding cycles invole PreBind phase)
-- The cluster autoscaler puts it after creating a new node for pending pod(s) so that the scheduler can find a place faster when the node is created.
-- Kueue uses it to determine a prefered node for the pod based on their internal calculation (Topology aware scheduling etc)
+- The ClusterAutoscaler or Karpenter puts it after creating a new node for pending pod(s) so that the scheduler
+  can utilize the results of scheduling simulations already made by those components
 
 (Possibly, our future initiative around the workload scheduling (including gang scheduling) can also utilize it,
-but we don't discuss it here because it's not yet concreted at all.)
+but we don't discuss it here because it's not yet concrete.)
 
 #### Story 1: Prevent inappropriate scale downs by Cluster Autoscaler
 
-The scheduler starts to expose where the pod is going to with `NominatedNodeName` at the beginning of binding cycles.
-And, the cluster autoscaler takes `NominatedNodeName` into consideration when calculating which nodes they delete.
+Pod binding may take a significant amount of time (even on the order of minutes, e.g. due to volume binding).
+During that time, components other than the scheduler don't have the information that such a placement decision
+has already been made and is already being executed. Without this information, other components may decide
+to take conflicting actions (e.g. ClusterAutoscaler or Karpenter may decide to delete that particular node).
 
-It helps the scenarios where the binding cycles take time, for example, VolumeBinding plugin takes time at PreBind extension point.
+We need a way to share information about already-made scheduling decisions with those components to prevent that.
 
-#### Story 2: Cluster Autoscaler specifies `NominatedNodeName` to indicate where pods can go after new nodes are created/registered
+#### Story 2: Scheduler can resume its work after restart
 
-Usually, the scheduler scans all the nodes in the cluster when scheduling pods.
+Pod binding may take a significant amount of time (even on the order of minutes, e.g. due to volume binding).
+During that time, the scheduler may be restarted, lose its leader lock, etc.
+Given the placement decision was only stored in the scheduler's memory, the new incarnation of the scheduler
+has no visibility into it and may decide to put the pod on a different node. This would result in wasting
+the work that has already been done and increase the end-to-end pod startup latency.
 
-When the cluster autoscaler creates instances for pending pods, it calculate which new node might get which pending pod.
-If they can put `NominatedNodeName` based on those calculation, it could tell the scheduler that the node can probably picked up for the pod's scheduling,
-prevenging the double effort of scanning/calculating all nodes again at the scheduling retries.
+We need a mechanism to resume the already started work in the majority of such situations.
 
-#### Story 3: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to
+#### Story 3: ClusterAutoscaler or Karpenter can influence scheduling decisions
 
-When Kueue determines where pods are prefered to being scheduled on, based on their internal scheduling soft constraints (Preferred Topology Aware Scheduling, etc)
-currently, they just put the node selector to tell the scheduler about their preference, and then un-gate the pods.
+ClusterAutoscaler or Karpenter performs scheduling simulations to decide what nodes should be
+added to make pending pods schedulable. Their decisions assume a certain placement - if pending
+pods are placed differently, they may not fit on the newly added nodes or may not satisfy their
+scheduling constraints.
 
-After this proposal, they can specify `NominatedNodeName` instead of a prefered node selector,
-which makes the probability of pods being scheduled onto the node higher.
+In order to improve the end-to-end pod startup latency when cluster scale-up is needed, we need a
+mechanism to communicate the results of scheduling simulations from ClusterAutoscaler or Karpenter
+to the scheduler.
 
 ### Risks and Mitigations
 
-
+#### NominatedNodeName can already be set by other components
+
+There aren't any guardrails preventing other components from setting NominatedNodeName today.
+In such cases, the semantics are not well defined and the outcome may not match user
+expectations.
+
+This KEP is a step towards clarifying these semantics instead of maintaining the status quo.
+
+#### Confusing semantics of `NominatedNodeName`
+
+Up until now, `NominatedNodeName` was expressing the decision made by the scheduler to put a given
+pod on a given node while waiting for preemption. The decision could be changed later, so
+it didn't have to be the final decision, but it described the "current plan of record".
+
+If we put more components into the picture (e.g. ClusterAutoscaler and Karpenter), we effectively
+get a more complex state machine, with the following states:
+
+1. pending pod
+1. pod proposed to node (by external component) [not approved by scheduler]
+1. pod nominated to node (based on external proposal) and waiting for node (e.g. being created & ready)
+1. pod nominated to node and waiting for preemption
+1. pod allocated to node and waiting for binding
+1. pod bound
+
+The important part is that if we decide to use `NominatedNodeName` to store all that information,
+we're effectively losing the ability to distinguish between those states.
+
+We may argue that as long as the decision was made by the scheduler, the exact reason and state
+probably aren't that important - the content of `NominatedNodeName` can be interpreted as
+"the current plan of record for this pod from the scheduler's perspective".
+
+But the `pod proposed to node` state is visibly different. In particular, external components
+may overallocate the pods on the node, and those pods may not match scheduling constraints, etc.
+We can't claim that it's the current plan of record of the scheduler. It's a hint that we want
+the scheduler to take into account.
+
+In other words, from the state machine perspective, there is a visible difference in who set
+`NominatedNodeName`. If it was the scheduler, it may mean that there is already an ongoing preemption.
+If it was an external component, it's just a hint that may even be ignored.
+However, from the consumption point of view these are effectively the same. We want
+to expose the information that, as of now, a given node is considered a potential placement
+for a given pod. It may change, but for now that's what is considered.
+
+Eventually, we may introduce some state machine, where external components could also approve
+the scheduler's decisions by exposing these states more concretely via the API. But we will be
+able to achieve it in an additive way by exposing the information about the state.
+
+However, we don't need this state machine now, so we just introduce the following rules:
+- Any component can set `NominatedNodeName` if it is currently unset.
+- The scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
+at the beginning of the binding cycle.
+- No external component can overwrite a `NominatedNodeName` set by a different component.
+- If `NominatedNodeName` is set, the component that set it is responsible for updating or
+clearing it if its plans change (using PUT or APPLY to ensure it won't conflict with
+a potential update from the scheduler) to reflect the new hint.
+
+Moreover:
+- Regardless of who set `NominatedNodeName`, its readers should always take that into
+consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
+- In case of faulty components (e.g. overallocating the nodes), these decisions will
+simply be rejected by the scheduler (although the `NominatedNodeName` will remain set
+for the unschedulability period).
 
 #### Increasing the load to kube-apiserver
 
-If we simply implement this, we'd double the API calls during a simple binding cycle (NNN + actual binding),
-which would increase the load to kube-apiserver significantly.
+Setting NominatedNodeName is an additional API call that multiple components in the system then
+need to process. In the extreme case where it is always set before binding the pod, this would
+double the number of API calls from the scheduler, which isn't acceptable for scalability and
+performance reasons.
 
-To prevent that, we'll skip setting `NominatedNodeName` when all PreBind plugins have nothing to do with the pod.
+To mitigate this problem, we:
+- skip setting `NNN` when all `Permit` and `PreBind` plugins have no work to do for this pod.
 (We'll discuss how-to in the later section.)
-Then, setting `NominatedNodeName` happens only when, for example, a pod has a volume that VolumeBinding plugin needs to handle at PreBind.
 
-Of course, the API calls would still be increasing especially if most of pods have delayed binding.
-However, those cases should actually be ok to have those additional calls because these will have other calls related to those operations (e.g., PV creation, etc.) - so the overhead of an additional call is effectively a smaller percentage of the e2e flow.
+For cases with delayed binding, we argue that the additional calls are acceptable, as
+there are other calls related to those operations (e.g. PV creation, PVC binding, etc.) - so the
+overhead of setting `NNN` is a smaller percentage of the whole e2e pod startup flow.
 
 #### Race condition
 
 If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle,
 `NominatedNodeName` isn't taken into account (of course), and the pod could be scheduled onto a different node.
 
-But, this should be fine because, either way, we're not saying `NominatedNodeName` is something forcing the scheduler to pick up the node,
-rather it's just a preference.
+But, this should be fine because, either way, we're not saying `NominatedNodeName` is something
+forcing the scheduler to pick up the node; rather, it's just a preference.
 
 #### Confusion if `NominatedNodeName` is different from `NodeName` after all
 
@@ -310,7 +372,11 @@ If an external component adds `NominatedNodeName`, but the scheduler picks up a
 
 But, if an external component updates `NominatedNodeName` that is set by the scheduler,
 the pod could end up having different `NominatedNodeName` and `NodeName`.
 
-Probably we should clear `NominatedNodeName` when the pod is bound. (at binding api)
+We will update the logic so that:
+- the `NominatedNodeName` field is cleared during the `binding` call
+
+We believe that ensuring that `NominatedNodeName` can't be set after the pod is already bound
+is a niche enough feature that it doesn't justify an attempt to strengthen the validation.
 
 #### What if there are multiple components that could set `NominatedNodeName` on the same pod
 
@@ -324,36 +390,25 @@ Multiple controllers might keep overwriting NominatedNodeName that is set by the
 Of course, we can regard that just as user's fault though, that'd be undesired situation.
 
 There could be several ideas to mitigate, or even completely solve by adding a new API.
 But, we wouldn't like to introduce any complexity right now because we're not sure how many users would start using this,
 and hit this problem.
 
 So, for now, we'll just document it somewhere as a risk, unrecommended situation, and in the future, we'll consider something
 if we actually observe this problem getting bigger by many people starting using it.
 
-#### [CA scenario] If the cluster autoscaler puts unexisting node's name on `NominatedNodeName`, the scheduler clears it
-
-The current scheduler clears the node name from `NominatedNodeName` if the pod goes through the scheduling cycle,
-and the node doesn't exist.
+#### Invalid `NominatedNodeName` prevents the pod from scheduling
 
-In order for the cluster autoscaler to levarage this feature,
-it has to put unexisting node's name, which is supposed to be registered later after its scale up,
-so that the scheduler can schedule pending pods on those new nodes as soon as possible after nodes are registered.
+Currently, the `NominatedNodeName` field is cleared at the end of a failed scheduling cycle if the scheduler found the nominated node
+unschedulable for the pod. However, in order to make it work for ClusterAutoscaler and Karpenter, we will remove this
+logic, and `NominatedNodeName` could stay set on the pod forever, despite no longer being a valid suggestion.
+As an example, imagine a scenario where ClusterAutoscaler created a new node and nominated a pod to it, but
+before that pod was scheduled, a new higher-priority pod appeared and used the space on that newly created node.
+In such a case, everything worked as expected, but we ended up with `NominatedNodeName` set incorrectly.
 
-So, we need to keep the node's name on `NominatedNodeName` even when the node doesn't exist.
-We'll discuss it at [Only modifying `NominatedNodeName`](#only-modifying-nominatednodename) section.
+As a mitigation:
+- an external component that originally set the `NominatedNodeName` is responsible for clearing or updating
+the field to reflect the current state
+- if that doesn't happen, given that `NominatedNodeName` is just a hint for the scheduler, it will continue processing
+the pod with just a minor performance hit (trying the node set via `NNN` first, but falling back to
+all nodes anyway). We claim that the additional cost of checking `NominatedNodeName` first is acceptable (even
+for big clusters where the performance is critical) because it's just one iteration of Filter plugins
+(e.g., with 1000 nodes and the default parallelism of 16, the scheduler needs roughly 1000 / 16 ≈ 62 iterations of
+Filter plugins, so adding one iteration on top of that doesn't matter).
 
-#### [CA scenario] A new node's taint prevents the pod from going there, and the scheduler ends up clearing `NominatedNodeName`
-
-With the current scheduler, what happens if CA puts `NominatedNodeName` is:
-1. Pods are unschedulable. For the simplicity, let's say all of them are rejected by NodeResourceFit plugin. (i.e., no node has enough CPU/memory for pod's request)
-2. CA finds them, calculates nodes necessary to be created
-3. CA puts `NominatedNodeName` on each pod
-4. The scheduler keeps trying to schedule those pending pods though, here let's say they're unschedulable (no cluster event happens that could make pods schedulable) until the node is created.
-5. The nodes are created, and registered to kube-apiserver. Let's say, at this point, nodes have un-ready taints.
-6. The scheduler observes `Node/Create` event, `NodeResourceFit` plugin QHint returns `Queue`, and those pending pods are requeued to activeQ.
-7. The scheduling cycle starts handling those pending pods.
-8. However, because nodes have un-ready taints, pods are rejected by `TaintToleration` plugin.
-9. The scheduler clears `NominatedNodeName` because it finds the nominated node (= new node) unschedulable.
-
-So, after all, `NominatedNodeName` added by CA in this scaling up scenario doesn't add any value,
-unless the taints are removed in a short time (between 6 and 7).
-
-So, we need to keep the node's name on `NominatedNodeName` even when the node doesn't fit right now.
-We'll discuss it at [Only modifying `NominatedNodeName`](#only-modifying-nominatednodename) section.
 
 ## Design Details
 
@@ -402,21 +457,34 @@ Higher-priority pods can ignore it, but pods with equal or lower priority don't
 This allows us to prioritize nominated pods when nomination was done by external components.
 We just need to ensure that in case when NominatedNodeName was assigned by an external component, this nomination will get reflected in scheduler memory.
 
+TODO: We need to ensure that this works for non-existing nodes too, and that, if those nodes never appear, it doesn't leak memory.
+
 We will implement integration tests simulating the above behavior of external components.
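+
+Below is a minimal, illustrative sketch (not part of the required implementation) of how an external
+component could publish such a nomination with server-side apply against the pod's status subresource,
+using its own field manager so that a later update or clear by the same component doesn't conflict with
+a potential update from the scheduler. The package/function names and the `cluster-autoscaler` field
+manager are assumptions made only for this example.
+
+```go
+package nomination
+
+import (
+	"context"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
+	"k8s.io/client-go/kubernetes"
+)
+
+// Nominate records the component's placement expectation for a pending pod by applying
+// only status.nominatedNodeName with a dedicated field manager.
+func Nominate(ctx context.Context, client kubernetes.Interface, namespace, podName, nodeName string) error {
+	podApply := corev1ac.Pod(podName, namespace).
+		WithStatus(corev1ac.PodStatus().
+			WithNominatedNodeName(nodeName))
+
+	// ApplyStatus targets the pod's status subresource; the field manager identifies this component.
+	// Without Force, a conflict with another owner of the field surfaces as an error instead of
+	// silently overwriting someone else's nomination.
+	_, err := client.CoreV1().Pods(namespace).ApplyStatus(ctx, podApply, metav1.ApplyOptions{
+		FieldManager: "cluster-autoscaler", // example manager name, an assumption for this sketch
+	})
+	return err
+}
+```
+
+Clearing the hint later (when the component's plans change) would be the same apply with the field
+omitted, since server-side apply removes fields that a manager previously owned but no longer applies.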
 
 #### The scheduler only modifies `NominatedNodeName`, not clears it in any cases
 
-As described at the risk section, there are two problematic scenarios where this use case wouldn't work.
-- [[CA scenario] If the cluster autoscaler puts unexisting node's name on `NominatedNodeName`, the scheduler clears it](#ca-scenario-if-the-cluster-autoscaler-puts-unexisting-nodes-name-on-nominatednodename-the-scheduler-clears-it)
-- [[CA scenario] A new node's taint prevents the pod from going there, and the scheduler ends up clearing `NominatedNodeName`](#ca-scenario-a-new-nodes-taint-prevents-the-pod-from-going-there-and-the-scheduler-ends-up-clearing-nominatednodename)
+As of now, the scheduler clears the `NominatedNodeName` field at the end of a failed scheduling cycle if it
+found the nominated node unschedulable for the pod. However, this won't work if ClusterAutoscaler or Karpenter
+set it during a scale-up.
+
+In the most basic case, the node may not yet exist, so clearly it would be unschedulable for the pod.
+However, the potential mitigation of ignoring non-existing nodes wouldn't work either, as the following case shows:
 
-Currently, the scheduler clears `NominatedNodeName` at the end of failed scheduling cycles if it found the nominated node unschedulable for the pod.
-In order to avoid above two scenarios, we have to remove this clearing logic; change the scheduler not to clear `NominatedNodeName` in any cases.
-It means, even if the node on `NominatedNodeName` isn't valid anymore, the scheduler keeps trying the node first.
-We regard the additional cost of checking `NominatedNodeName` first unnecessarily isn't reletively big (especially for big clusters, where the performance is critical) because it's just one iteration of Filter plugins.
-e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of Filter plugins, approximately. So, adding one iteration on top of that doesn't matter.
+1. Pods are unschedulable. For simplicity, let's say all of them are rejected by the NodeResourceFit plugin (i.e., no node has enough CPU/memory for the pod's requests).
+2. CA finds them and calculates the nodes necessary to be created.
+3. CA puts `NominatedNodeName` on each pod.
+4. The scheduler keeps retrying those pending pods; let's say they stay unschedulable (no cluster event happens that could make them schedulable) until the nodes are created.
+5. The nodes are created and registered to kube-apiserver. Let's say that, at this point, the nodes have un-ready taints.
+6. The scheduler observes the `Node/Create` event, the `NodeResourceFit` plugin QHint returns `Queue`, and those pending pods are requeued to activeQ.
+7. The scheduling cycle starts handling those pending pods.
+8. However, because the nodes have un-ready taints, the pods are rejected by the `TaintToleration` plugin.
+9. The scheduler clears `NominatedNodeName` because it finds the nominated node (= the new node) unschedulable.
 
-Also, note that we still allow the scheduler overwrite `NominatedNodeName` when it triggers the preemption for the pod.
+In order to avoid the above scenarios, we simply remove the clearing logic. This means that the scheduler
+will never clear `NominatedNodeName` - it may still update it if, based on its scheduling algorithm,
+it decides to ignore the current value of `NominatedNodeName` and nominate a different node (either to
+signal preemption, or to record the decision before binding, as described in the above sections).
+
 
 ### Test Plan