KEP-5278: update KEP for NominatedNodeName, narrowing down the scope of the feature and moving it to beta #5618
Conversation
/hold
PRR shadow review
simply be rejected by the scheduler (and the `NominatedNodeName` will be cleared before
moving the rejected pod to unschedulable).

#### Increasing the load to kube-apiserver
I don't understand scheduling all that well, so correct me if my assumptions are incorrect.
Is the assumption here that if `PreBind()` plugins skip, the binding operation will never take too much time, and so we don't need to expose `NominatedNodeName`? This is important for the KEP to not be in conflict with "User story 1".
It might be worth noting that updating `NominatedNodeName` for every pod would only double the API requests per pod in the happy path. If I understand the docs at https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#pre-bind correctly, if there were some good-looking (from a scheduling perspective) nodes that nevertheless caused the prebind plugins to fail often, that might increase the API requests N times, where N>=2.
Correct, the assumption is that all tasks related to binding that may take long to complete (e.g. creating volumes, attaching DRA devices) are executed in `PreBind()`, and `Bind()` should not take too much time.
As far as I know, increasing the number of API requests by 2x is not acceptable, as this would happen for every pod being scheduled (or re-scheduled), so it might add up to a huge number.
Also, adding an extra API call before binding makes the entire procedure a bit longer and the scheduling throughput a bit lower - so if we assume that `Bind()` will be quick, we should avoid that extra cost.
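For illustration, here is a rough Go sketch (not the actual kube-scheduler code, which uses its own status-patching helpers) of the two API writes per pod that the happy path implies when `NominatedNodeName` is persisted before binding: one status patch plus the binding request.

```go
// Sketch only: persist the nomination, then bind. Error handling trimmed.
package sketch

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func nominateAndBind(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod, node string) error {
	// Write 1: record the decision before the (potentially long) PreBind work,
	// by patching the pod's status subresource.
	patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, node))
	if _, err := cs.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status"); err != nil {
		return err
	}

	// Write 2: the binding itself, issued once PreBind has finished.
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Target:     v1.ObjectReference{Kind: "Node", Name: node},
	}
	return cs.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}
```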
The feature can be disabled in Beta version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.

###### What happens if we reenable the feature if it was previously rolled back?
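As a hedged sketch of what "disable by restarting with the feature-gate off" means in code, assuming the gate names from this KEP and the standard component-base feature-gate machinery (this is not the actual kube-scheduler wiring):

```go
// Sketch only: register a Beta, default-on gate and consult it before persisting
// NominatedNodeName. It can be turned off with
// --feature-gates=NominatedNodeNameForExpectation=false on restart.
package sketch

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

const NominatedNodeNameForExpectation featuregate.Feature = "NominatedNodeNameForExpectation"

func init() {
	// Hypothetical registration for this sketch; in-tree the gate lives in pkg/features.
	utilruntime.Must(utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		NominatedNodeNameForExpectation: {Default: true, PreRelease: featuregate.Beta},
	}))
}

// exposeNomination reports whether NominatedNodeName should be persisted before
// binding; returning false restores the pre-KEP behaviour.
func exposeNomination() bool {
	return utilfeature.DefaultFeatureGate.Enabled(NominatedNodeNameForExpectation)
}
```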
In `###### Are there any tests for feature enablement/disablement?`:
> This feature is only changing when a NominatedNodeName field will be set - it doesn't introduce a new API.

Is that correct? `NominatedNodeName` sounds like a new field in the Pod API.
In `###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?`:
> We will do the following manual test after implementing the feature:

What was the result of the test?
The `NominatedNodeName` field wasn't added by this KEP (it was added a long time ago). This KEP's purpose is to extend the usage of this field.
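A minimal Go sketch to underline that point: `NominatedNodeName` is an existing field of `core/v1` `PodStatus`, so consuming it requires no new API surface.

```go
// Sketch only: read the long-standing status field from the Pod API.
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// describeNomination summarizes where, if anywhere, the pod is nominated to run.
func describeNomination(pod *v1.Pod) string {
	if pod.Status.NominatedNodeName == "" {
		return "pod has no nominated node"
	}
	return fmt.Sprintf("pod %s/%s is nominated to run on %s",
		pod.Namespace, pod.Name, pod.Status.NominatedNodeName)
}
```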
During the beta period, the feature gates `NominatedNodeNameForExpectation` and `ClearingNominatedNodeNameAfterBinding` are enabled by default, no action is needed.

**Downgrade**
In `### Version Skew Strategy`:
What happens to the pods that already have `NominatedNodeName` set in a cluster with a kube-apiserver that does not understand that field?
What happens if a scheduler tries to set `NominatedNodeName` on all of its scheduled pods while contacting an older kube-apiserver that does not know the field?
These questions are related to the rollout/rollback section of the PRR questionnaire.
kube-apiserver has known this field for a long time, but does not interpret it - setting / using the `NominatedNodeName` field in components other than kube-scheduler is out of scope of this KEP.
This field was introduced in 2018 (kubernetes/kubernetes@384a86c) - I assume that if we tried using a pre-2018 kube-apiserver with kube-scheduler v1.35, it would cause way bigger problems than just trouble with handling `NominatedNodeName`.
I didn't know we had the field for such a long time; we don't need to worry about it not being present, then 👍
###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History
Per the template, `What steps should be taken if SLOs are not being met to determine the problem?` must be completed when targeting beta.
I tried describing the general approach, PTAL
expectations.

This KEP is a step towards clarifying this semantic instead of maintaining status-quo.
This KEP is a step towards clarifying this semantic and scheduler's behavior instead of maintaining status-quo.
Now, this KEP is not clarifying this point, so I think it's no longer valid
I think we can remove this `NominatedNodeName can already be set by other components now` section. Now, this KEP is just expanding how the scheduler uses NNN. Whether external components can already set NNN or not is not related here anymore.
+1 to @sanposhiho
Done
You can add yourself to authors
thanks
- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
  the beginning of the binding cycle.
- No external components can overwrite `NominatedNodeName` set by a different component.
- If `NominatedNodeName` is set, the component who set it is responsible for updating or
  clearing it if its plans were changed (using PUT or APPLY to ensure it won't conflict with
  potential update from scheduler) to reflect the new hint.
- No external components are expected to overwrite `NominatedNodeName` set by the scheduler (although technically there are no guardrails).
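For the "using PUT or APPLY" guidance quoted above, a minimal server-side-apply sketch might look like the following; the field manager name is made up for illustration, and this is not an endorsement of external components setting the field.

```go
// Sketch only: apply just the nominatedNodeName status field under a dedicated
// field manager so concurrent scheduler updates to other status fields don't conflict.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	applycorev1 "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

func applyNomination(ctx context.Context, cs kubernetes.Interface, ns, podName, node string) error {
	ac := applycorev1.Pod(podName, ns).
		WithStatus(applycorev1.PodStatus().WithNominatedNodeName(node))
	_, err := cs.CoreV1().Pods(ns).ApplyStatus(ctx, ac, metav1.ApplyOptions{
		FieldManager: "external-nominator", // hypothetical manager name
		Force:        false,                // surface conflicts instead of stealing ownership
	})
	return err
}
```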
Are those two points necessary to be mentioned now? Because we simply don't expect any components other than the scheduler to put NNN after this KEP change.
Re: "Scheduler is allowed to overwrite..." -> In the first place, NNN is supposed to be put by the scheduler only.
Re: "No external components are expected to overwrite..." -> No external components are expected to put it (i.e., not only overwrite it).
I think we can simplify the `Confusing semantics of NominatedNodeName` section more, or maybe even just remove it.
I wouldn't necessarily remove it - the "state machine" part still makes sense.
What I would remove are the last two paragraphs:
- The "On top of the simple state machine ..." one
- The paragraph below "Moreover..." (the first point doesn't make sense - it's always the scheduler; the second doesn't make sense either - if the scheduler is faulty, then we have bigger problems...)
#### Confusion if `NominatedNodeName` is different from `NodeName` after all

If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
We don't support external component setting it - so I think this whole subsection should be removed.
### External components put `NominatedNodeName`

There aren't any restrictions preventing other components from setting NominatedNodeName as of now.
However, we don't have any validation of how that currently works.
It's more than that - we don't have well-defined semantics for that now.
However, with almost everything being removed from this subsection, I would actually remove it completely to avoid additional confusion.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ania-borowiec. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
/lgtm
I believe it's important to cover DRA resource accounting as well to completely address the scheduler->CA resource accounting problem for pods in delayed-binding cases.
@ania-borowiec Sorry that I suggested the wording of entire sections, but explaining what I suggest to add would be almost identical to what I wrote anyway. Feel free to reword and rearrange it.
the cluster autoscaler cannot understand the pod is going to be bound there,
misunderstands the node is low-utilized (because the scheduler keeps the place of the pod), and deletes the node.

We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
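A rough sketch (not actual Cluster Autoscaler code) of what taking the nomination into account could look like: when judging node utilization, count pods nominated to the node in addition to pods already bound to it.

```go
// Sketch only: pods "expected" on a node are those bound to it or nominated to it.
package sketch

import v1 "k8s.io/api/core/v1"

// podsExpectedOnNode returns pods whose resources the node should be assumed to hold,
// so a scale-down decision does not treat a nominated-but-unbound pod's node as empty.
func podsExpectedOnNode(nodeName string, pods []*v1.Pod) []*v1.Pod {
	var expected []*v1.Pod
	for _, p := range pods {
		if p.Spec.NodeName == nodeName || p.Status.NominatedNodeName == nodeName {
			expected = append(expected, p)
		}
	}
	return expected
}
```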
You can add another paragraph on how DRA interacts with NNN, as it's important in scheduler->CA resource accounting:
> Please note that `NominatedNodeName` can express reservation of node resources only, while some resources can be managed by a DRA plugin and expressed in the form of a ResourceClaim allocation. To correctly account for all the resources that a pod needs, both the nomination and the ResourceClaim status update need to be reflected in the api-server.
@x13n Can you confirm my understanding for the Cluster Autoscaler part? Would the DRA resources be correctly accounted as in-use as soon as the ResourceClaim allocation is reflected in the api-server?
IIUC, the accounting of a ResourceClaim allocation does not depend on the pod that is using it being either bound (`NodeName` set) or nominated (`NominatedNodeName` set).
If we look from consumption point of view - these are effectively the same. We want
to expose the information, that as of now a given node is considered as a potential placement
for a given pod. It may change, but for now that's what considered.
Maybe we should add a specific section here:

**Nominations and DRA resources**

> The semantics of node nomination is in fact resource reservation, either in scheduler memory or in external components after the nomination got persisted to the api-server. Since a pod consumes not only node resources but also DRA resources, it's important to persist them as well at around the same time. This currently happens because the ResourceClaim allocation is stored in its status in the PreBinding phase, so in conjunction with node nomination it effectively allows reserving the complete set of resources (both node and DRA) and enables their correct accounting.
> Note that node nomination is set before WaitOnPermit, while the ResourceClaim status is published in PreBinding, therefore pods waiting on WaitOnPermit would have nominations published but not ResourceClaim statuses. This is however not considered a problem as long as there are no in-tree plugins supporting WaitOnPermit; since the Gang Scheduling feature is starting in alpha, the fix for this issue will block Gang Scheduling beta promotion.
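As a hedged illustration of the accounting point above (assuming the `resource.k8s.io/v1beta1` Go types; the helper below is hypothetical), a component could treat a pod's reservation as fully visible only once both the node-side nomination (or binding) and all of its ResourceClaim allocations are published:

```go
// Sketch only: both halves of the reservation must be visible in the API.
package sketch

import (
	v1 "k8s.io/api/core/v1"
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
)

// resourcesReservedFor reports whether the pod's node placement is published
// (bound or nominated) and every ResourceClaim it uses carries an allocation.
func resourcesReservedFor(pod *v1.Pod, claims []*resourcev1beta1.ResourceClaim) bool {
	nodeReserved := pod.Spec.NodeName != "" || pod.Status.NominatedNodeName != ""
	for _, c := range claims {
		if c.Status.Allocation == nil { // allocation not yet published
			return false
		}
	}
	return nodeReserved
}
```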
it decides to ignore the current value of `NominatedNodeName` and put it on a different node (either to
signal the preemption, or record the decision before binding as described in the above sections).
As of now the scheduler clears the `NominatedNodeName` field at the end of failed scheduling cycle, if it
found the nominated node unschedulable for the pod. This logic remains unchanged.
I think we need to mention that the previous version of this KEP deliberately left nominated nodes set, so we should clearly state that we're reverting these changes for the time being.
As discussed at [Confusion if `NominatedNodeName` is different from `NodeName` after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all),
we update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
**Handling ResourceClaim status updates**

> Since ResourceClaim status updates are complementary to node nomination (they reserve resources in a similar way), it's desirable that they are set at the beginning of the PreBinding phase (before it waits). The order of actions in the devicemanagement plugin is correct, however the scheduler performs the binding actions of different plugins sequentially, so, for instance, long-lasting PVC provisioning may delay exporting the ResourceClaim allocation status. This is not desired as it leaves a gap of unreserved DRA resources, causing problems similar to the ones originally fixed by this KEP - kubernetes/kubernetes#125491
FYI @x13n