KEP-5278: update KEP for NominatedNodeName, narrowing down the scope of the feature and moving it to beta #5618
Conversation
/hold
PRR shadow review
simply be rejected by the scheduler (and the `NominatedNodeName` will be cleared before
moving the rejected pod to unschedulable).

#### Increasing the load to kube-apiserver
I don't understand scheduling all that well, so correct me if my assumptions are incorrect.
Is the assumption here that if `PreBind()` plugins skip, the binding operation will never take too much time, and so we don't need to expose `NominatedNodeName`? This is important for the KEP to not be in conflict with "User story 1".
It might be worth noting that updating `NominatedNodeName` for every pod would only double the API requests per pod in the happy path. If I understand the docs at https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind correctly, if there were some good-looking (from a scheduling perspective) nodes that nevertheless caused the prebind plugins to fail often, that might increase the API requests N times, where N>=2.
Correct, the assumption is that all tasks related to binding that may take long to complete (e.g. creating volumes, attaching DRA devices) are executed in `PreBind()`, and `Bind()` should not take too much time.
As far as I know, increasing the number of API requests by 2x is not acceptable, as this would happen for every pod being scheduled (or re-scheduled), so it might add up to a huge number.
Also, adding an extra API call before binding makes the entire procedure a bit longer and the scheduling throughput a bit lower - so if we assume that `Bind()` will be quick, we should avoid that extra cost.
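For illustration, here is a rough Go sketch (not the actual kube-scheduler code, which uses its own status-patching helpers) of the two API writes per pod that the happy path implies when `NominatedNodeName` is persisted before binding: one status patch plus the binding request.

```go
// Sketch only: persist the nomination, then bind. Error handling trimmed.
package sketch

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func nominateAndBind(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod, node string) error {
	// Write 1: record the decision before the (potentially long) PreBind work,
	// by patching the pod's status subresource.
	patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, node))
	if _, err := cs.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status"); err != nil {
		return err
	}

	// Write 2: the binding itself, issued once PreBind has finished.
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Target:     v1.ObjectReference{Kind: "Node", Name: node},
	}
	return cs.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}
```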
The feature can be disabled in Beta version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.

###### What happens if we reenable the feature if it was previously rolled back?
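As a hedged sketch of what "disable by restarting with the feature-gate off" means in code, assuming the gate names from this KEP and the standard component-base feature-gate machinery (this is not the actual kube-scheduler wiring):

```go
// Sketch only: register a Beta, default-on gate and consult it before persisting
// NominatedNodeName. It can be turned off with
// --feature-gates=NominatedNodeNameForExpectation=false on restart.
package sketch

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

const NominatedNodeNameForExpectation featuregate.Feature = "NominatedNodeNameForExpectation"

func init() {
	// Hypothetical registration for this sketch; in-tree the gate lives in pkg/features.
	utilruntime.Must(utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		NominatedNodeNameForExpectation: {Default: true, PreRelease: featuregate.Beta},
	}))
}

// exposeNomination reports whether NominatedNodeName should be persisted before
// binding; returning false restores the pre-KEP behaviour.
func exposeNomination() bool {
	return utilfeature.DefaultFeatureGate.Enabled(NominatedNodeNameForExpectation)
}
```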
In `###### Are there any tests for feature enablement/disablement?`:
> This feature is only changing when a NominatedNodeName field will be set - it doesn't introduce a new API.

Is that correct? `NominatedNodeName` sounds like a new field in the Pod API.
In `###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?`:
> We will do the following manual test after implementing the feature:

What was the result of the test?
The `NominatedNodeName` field wasn't added by this KEP (it was added a long time ago). This KEP's purpose is to extend the usage of this field.
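A minimal Go sketch to underline that point: `NominatedNodeName` is an existing field of `core/v1` `PodStatus`, so consuming it requires no new API surface.

```go
// Sketch only: read the long-standing status field from the Pod API.
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// describeNomination summarizes where, if anywhere, the pod is nominated to run.
func describeNomination(pod *v1.Pod) string {
	if pod.Status.NominatedNodeName == "" {
		return "pod has no nominated node"
	}
	return fmt.Sprintf("pod %s/%s is nominated to run on %s",
		pod.Namespace, pod.Name, pod.Status.NominatedNodeName)
}
```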
During the beta period, the feature gates `NominatedNodeNameForExpectation` and `ClearingNominatedNodeNameAfterBinding` are enabled by default, no action is needed.

**Downgrade**
In `### Version Skew Strategy`:
What happens to the pods that already have `NominatedNodeName` set in a cluster with a kube-apiserver that does not understand that field?
What happens if a scheduler tries to set `NominatedNodeName` on all of its scheduled pods while contacting an older kube-apiserver that does not know the field?
These questions are related to the rollout/rollback section of the PRR questionnaire.
kube-apiserver has known this field for a long time, but does not interpret it - setting / using the `NominatedNodeName` field in components other than kube-scheduler is out of scope of this KEP.
This field was introduced in 2018 (kubernetes/kubernetes@384a86c) - I assume that if we tried using a pre-2018 kube-apiserver with kube-scheduler v1.35, it would cause way bigger problems than just trouble with handling `NominatedNodeName`.
I didn't know we had the field for such a long time; we don't need to worry about it not being present, then 👍
###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History
Per the template, `What steps should be taken if SLOs are not being met to determine the problem?` must be completed when targeting beta.
I tried describing the general approach, PTAL
expectations.

This KEP is a step towards clarifying this semantic instead of maintaining status-quo.
This KEP is a step towards clarifying this semantic and scheduler's behavior instead of maintaining status-quo.
Now, this KEP is not clarifying this point, so I think it's no longer valid
I think we can remove this `NominatedNodeName can already be set by other components now` section. Now, this KEP is just expanding how the scheduler uses NNN. Whether external components can already set NNN or not is not related here anymore.
+1 to @sanposhiho
Done
You can add yourself to authors
thanks
- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
  the beginning of the binding cycle.
- No external components can overwrite `NominatedNodeName` set by a different component.
- If `NominatedNodeName` is set, the component who set it is responsible for updating or
  clearing it if its plans were changed (using PUT or APPLY to ensure it won't conflict with
  potential update from scheduler) to reflect the new hint.
- No external components are expected to overwrite `NominatedNodeName` set by the scheduler (although technically there are no guardrails).
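For the "using PUT or APPLY" guidance quoted above, a minimal server-side-apply sketch might look like the following; the field manager name is made up for illustration, and this is not an endorsement of external components setting the field.

```go
// Sketch only: apply just the nominatedNodeName status field under a dedicated
// field manager so concurrent scheduler updates to other status fields don't conflict.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	applycorev1 "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

func applyNomination(ctx context.Context, cs kubernetes.Interface, ns, podName, node string) error {
	ac := applycorev1.Pod(podName, ns).
		WithStatus(applycorev1.PodStatus().WithNominatedNodeName(node))
	_, err := cs.CoreV1().Pods(ns).ApplyStatus(ctx, ac, metav1.ApplyOptions{
		FieldManager: "external-nominator", // hypothetical manager name
		Force:        false,                // surface conflicts instead of stealing ownership
	})
	return err
}
```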
Are those two points necessary to be mentioned now? Because we simply don't expect any components other than the scheduler to put NNN after this KEP change.
Re: "Scheduler is allowed to overwrite..." -> In the first place, NNN is supposed to be put by the scheduler only.
Re: "No external components are expected to overwrite..." -> No external components are expected to put it (i.e., not only overwrite it).
I think we can simplify the `Confusing semantics of NominatedNodeName` section more, or maybe even just remove it.
I wouldn't necessarily remove it - the "state machine" part still makes sense.
What I would remove are the last two paragraphs:
- The "On top of the simple state machine ..." one
- The paragraph below "Moreover..." (the first point doesn't make sense - it's always the scheduler; the second doesn't make sense either - if the scheduler is faulty, then we have bigger problems...)
#### Confusion if `NominatedNodeName` is different from `NodeName` after all

If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
We don't support external component setting it - so I think this whole subsection should be removed.
### External components put `NominatedNodeName`

There aren't any restrictions preventing other components from setting NominatedNodeName as of now.
However, we don't have any validation of how that currently works.
It's more than that - we don't have well-defined semantics for that now.
However, with almost everything being removed from this subsection, I would actually remove it completely to avoid additional confusion.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ania-borowiec. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
/lgtm
I believe it's important to cover DRA resource accounting as well to completely address the scheduler->CA resource accounting problem for pods in delayed-binding cases.
@ania-borowiec Sorry that I suggested the wording of entire sections, but explaining what I suggest to add would be almost identical to what I wrote anyway. Feel free to reword and rearrange it.
the cluster autoscaler cannot understand the pod is going to be bound there,
misunderstands the node is low-utilized (because the scheduler keeps the place of the pod), and deletes the node.

We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
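A rough sketch (not actual Cluster Autoscaler code) of what taking the nomination into account could look like: when judging node utilization, count pods nominated to the node in addition to pods already bound to it.

```go
// Sketch only: pods "expected" on a node are those bound to it or nominated to it.
package sketch

import v1 "k8s.io/api/core/v1"

// podsExpectedOnNode returns pods whose resources the node should be assumed to hold,
// so a scale-down decision does not treat a nominated-but-unbound pod's node as empty.
func podsExpectedOnNode(nodeName string, pods []*v1.Pod) []*v1.Pod {
	var expected []*v1.Pod
	for _, p := range pods {
		if p.Spec.NodeName == nodeName || p.Status.NominatedNodeName == nodeName {
			expected = append(expected, p)
		}
	}
	return expected
}
```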
You can add another paragraph on how DRA interacts with NNN, as it's important in scheduler->CA resource accounting:
> Please note that `NominatedNodeName` can express reservation of node resources only, while some resources can be managed by a DRA plugin and expressed in the form of a ResourceClaim allocation. To correctly account for all the resources that a pod needs, both the nomination and the ResourceClaim status update need to be reflected in the api-server.
@x13n Can you confirm my understanding for the Cluster Autoscaler part? Would the DRA resources be correctly accounted as in-use as soon as the ResourceClaim allocation is reflected in the api-server?
IIUC, the accounting of a ResourceClaim allocation does not depend on the pod that is using it being either bound (`NodeName` set) or nominated (`NominatedNodeName` set).
If we look from consumption point of view - these are effectively the same. We want
to expose the information, that as of now a given node is considered as a potential placement
for a given pod. It may change, but for now that's what considered.
Maybe we should add a specific section here:

**Nominations and DRA resources**

> The semantics of node nomination is in fact resource reservation, either in scheduler memory or in external components after the nomination got persisted to the api-server. Since a pod consumes not only node resources but also DRA resources, it's important to persist them as well at around the same time. This currently happens because the ResourceClaim allocation is stored in its status in the PreBinding phase, so in conjunction with node nomination it effectively allows reserving the complete set of resources (both node and DRA) and enables their correct accounting.
> Note that node nomination is set before WaitOnPermit, while the ResourceClaim status is published in PreBinding, therefore pods waiting on WaitOnPermit would have nominations published but not ResourceClaim statuses. This is however not considered a problem as long as there are no in-tree plugins supporting WaitOnPermit; since the Gang Scheduling feature is starting in alpha, the fix for this issue will block Gang Scheduling beta promotion.
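As a hedged illustration of the accounting point above (assuming the `resource.k8s.io/v1beta1` Go types; the helper below is hypothetical), a component could treat a pod's reservation as fully visible only once both the node-side nomination (or binding) and all of its ResourceClaim allocations are published:

```go
// Sketch only: both halves of the reservation must be visible in the API.
package sketch

import (
	v1 "k8s.io/api/core/v1"
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
)

// resourcesReservedFor reports whether the pod's node placement is published
// (bound or nominated) and every ResourceClaim it uses carries an allocation.
func resourcesReservedFor(pod *v1.Pod, claims []*resourcev1beta1.ResourceClaim) bool {
	nodeReserved := pod.Spec.NodeName != "" || pod.Status.NominatedNodeName != ""
	for _, c := range claims {
		if c.Status.Allocation == nil { // allocation not yet published
			return false
		}
	}
	return nodeReserved
}
```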
it decides to ignore the current value of `NominatedNodeName` and put it on a different node (either to
signal the preemption, or record the decision before binding as described in the above sections).
As of now the scheduler clears the `NominatedNodeName` field at the end of failed scheduling cycle, if it
found the nominated node unschedulable for the pod. This logic remains unchanged.
I think we need to mention that the previous version of this KEP deliberately left nominated nodes set, so we should clearly state that we're reverting these changes for the time being.
As discussed at [Confusion if `NominatedNodeName` is different from `NodeName` after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all),
we update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
**Handling ResourceClaim status updates**

> Since ResourceClaim status updates are complementary to node nomination (they reserve resources in a similar way), it's desirable that they are set at the beginning of the PreBinding phase (before it waits). The order of actions in the devicemanagement plugin is correct, however the scheduler performs the binding actions of different plugins sequentially, so, for instance, long-lasting PVC provisioning may delay exporting the ResourceClaim allocation status. This is not desired as it leaves a gap of unreserved DRA resources, causing problems similar to the ones originally fixed by this KEP - kubernetes/kubernetes#125491
FYI @x13n