keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md (36 additions & 3 deletions)
@@ -90,6 +90,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Story 2: Scheduler can resume its work after restart](#story-2-scheduler-can-resume-its-work-after-restart)
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Confusing semantics of <code>NominatedNodeName</code>](#confusing-semantics-of-nominatednodename)
+- [External components may set <code>NominatedNodeName</code>](#external-components-may-set-nominatednodename)
 - [Node nominations need to be considered together with reserving DRA resources](#node-nominations-need-to-be-considered-together-with-reserving-dra-resources)
 - [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver)
 - [Design Details](#design-details)
@@ -290,6 +291,21 @@ If we look from consumption point of view - these are effectively the same. We w
 to expose the information, that as of now a given node is considered as a potential placement
 for a given pod. It may change, but for now that's what is being considered.

+#### External components may set `NominatedNodeName`
+
+Currently the `NominatedNodeName` (NNN) field is intended to be read-only for components other than kube-scheduler. However, nothing prevents other actors from overwriting the field. This is not considered a substantial risk to scheduling.
+
+The scheduler interprets `NominatedNodeName` as a suggestion for the optimal placement of a pod. If NNN is set at the beginning of a scheduling cycle (e.g. to `N1`), the scheduler starts the scheduling attempt by trying to place the pod on `N1`. This can go two ways:
+
+A. The pod fits on `N1`. The pod is bound, and after binding NNN gets cleared in the api-server. The only risk here is that `N1` might not be the optimal placement for the pod.
+
+B. The pod does not fit on `N1` (or `N1` is invalid). The scheduler restarts the scheduling cycle, ignoring the NNN value. Filtering, Scoring and the other phases get executed, and the standard scheduling procedure continues. If the pod is deemed unschedulable, the scheduler clears the NNN field before moving the pod to the unschedulable / backoff queue. The risk in this case is that the scheduler first spends time trying to fit the pod on `N1`, which is not a huge overhead compared to the entire scheduling cycle.
+
+
+If `NominatedNodeName` gets overwritten later in the scheduling cycle, or while the pod is waiting in a scheduling queue, it does not impact kube-scheduler's work.
+
+Note that this logic is not newly introduced by this KEP; it has been present in kube-scheduler since v1.22 and [KEP-1923](https://github.com/kubernetes/enhancements/tree/94277fd2b7683836465e97f1f7b974ff11fa58b0/keps/sig-scheduling/1923-prefer-nominated-node).
+
 #### Node nominations need to be considered together with reserving DRA resources

 The semantics of node nomination are in fact resource reservation, either in scheduler memory or in external components (after the nomination got persisted to the api-server). Since pods consume both node resources and DRA resources, it's important to persist both at the same (or almost the same) point in time.
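The two outcomes (A and B) described under "External components may set `NominatedNodeName`" can be illustrated with a small, self-contained Go sketch. This is a toy model, not kube-scheduler code: the `Pod` and `Node` types, the single CPU dimension, and the helpers `fits`, `bind`, and `scheduleOne` are all hypothetical names introduced for illustration, and "most free CPU" stands in for the real Scoring phase.

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod (assumption: only the
// fields needed for this illustration are modeled).
type Pod struct {
	Name              string
	CPURequest        int
	NominatedNodeName string // NNN; may have been written by any actor
	NodeName          string // set once the pod is bound
}

// Node is a simplified node with a single resource dimension.
type Node struct {
	Name    string
	FreeCPU int
}

// fits is a hypothetical stand-in for the Filter phase. A nil node
// (e.g. NNN pointing at a node that does not exist) never fits.
func fits(p *Pod, n *Node) bool {
	return n != nil && n.FreeCPU >= p.CPURequest
}

// bind simulates a successful binding; NNN is cleared once the pod is bound.
func bind(p *Pod, n *Node) {
	n.FreeCPU -= p.CPURequest
	p.NodeName = n.Name
	p.NominatedNodeName = ""
}

// scheduleOne first tries the nominated node (case A). If that fails, it
// ignores NNN and evaluates all nodes (case B). If no node fits, it clears
// NNN and reports the pod as unschedulable.
func scheduleOne(p *Pod, nodes map[string]*Node) bool {
	if p.NominatedNodeName != "" {
		if n := nodes[p.NominatedNodeName]; fits(p, n) {
			bind(p, n) // case A: NNN is honored, possibly a non-optimal node
			return true
		}
	}
	// Case B: full cycle. Toy scoring: prefer the node with the most free CPU.
	var best *Node
	for _, n := range nodes {
		if fits(p, n) && (best == nil || n.FreeCPU > best.FreeCPU) {
			best = n
		}
	}
	if best == nil {
		p.NominatedNodeName = "" // cleared before the pod is requeued
		return false
	}
	bind(p, best)
	return true
}

func main() {
	nodes := map[string]*Node{
		"N1": {Name: "N1", FreeCPU: 2},
		"N2": {Name: "N2", FreeCPU: 8},
	}
	a := &Pod{Name: "a", CPURequest: 1, NominatedNodeName: "N1"}
	fmt.Println(scheduleOne(a, nodes), a.NodeName)

	b := &Pod{Name: "b", CPURequest: 4, NominatedNodeName: "N1"}
	fmt.Println(scheduleOne(b, nodes), b.NodeName)

	c := &Pod{Name: "c", CPURequest: 16, NominatedNodeName: "no-such-node"}
	fmt.Println(scheduleOne(c, nodes), c.NominatedNodeName == "")
}
```

In this model, an externally written NNN costs at most one extra fit check before the regular cycle runs, which mirrors the argument that an overwritten value is not a substantial risk to scheduling.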
@@ -401,12 +417,19 @@ to implement this enhancement.
- scheduler sets NNN before PreBind and WaitOnPermit, and does not set NNN when PreBind and Permit phases are skipped for the pod
+
+More tests are WIP: https://github.com/kubernetes/kubernetes/pull/133215
+
 We're going to add these integration tests:
 - The scheduler prefers picking nodes based on NominatedNodeName on pods, if the nodes are available.
 - The scheduler ignores NominatedNodeName reservations on pods when it's scheduling higher priority pods.
 - The scheduler overwrites NominatedNodeName when it performs preemption, or when it finds a spot on another node and proceeds to the binding cycle (assuming there's a PreBind plugin).
-- The scheduler puts NominatedNodeName at the beginning of binding cycles if Permit or PreBind plugin will do some work.
-- And, the scheduler (actually kube-apiserver, when receiving a binding request) clears NominatedNodeName when the pod is actually bound.
+- And, the scheduler (actually kube-apiserver, when receiving a binding request) clears NominatedNodeName when the pod is actually bound.

 Also, with [scheduler-perf](https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler_perf), we'll make sure the scheduling throughput for pods that go through Permit or PreBind doesn't regress too much.
 We need to accept a small regression to some extent since there'll be a new API call to set NominatedNodeName.
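The rule tested in this hunk (NNN is set before PreBind and WaitOnPermit, and is not set when those phases are skipped) and the resulting extra API call can be sketched as a simple predicate. This is an illustrative model only: `bindingPlan`, `shouldSetNNN`, and `apiWrites` are hypothetical names, and the count of "one binding request plus an optional NNN update" is an assumption made for the example, not a measurement of kube-scheduler.

```go
package main

import "fmt"

// bindingPlan is a hypothetical summary of what the binding cycle will do
// for a single pod.
type bindingPlan struct {
	WaitOnPermit bool // a Permit plugin asked the pod to wait
	RunPreBind   bool // at least one PreBind plugin is not skipped
}

// shouldSetNNN reports whether NNN would be persisted before the binding
// cycle: only when the feature is enabled and Permit or PreBind does work.
func shouldSetNNN(featureEnabled bool, p bindingPlan) bool {
	return featureEnabled && (p.WaitOnPermit || p.RunPreBind)
}

// apiWrites counts writes to kube-apiserver for one pod in this toy model:
// one binding request, plus one NNN update when shouldSetNNN is true.
func apiWrites(featureEnabled bool, p bindingPlan) int {
	writes := 1 // the binding request itself
	if shouldSetNNN(featureEnabled, p) {
		writes++ // the additional call that causes the accepted regression
	}
	return writes
}

func main() {
	fast := bindingPlan{}                 // Permit and PreBind both skipped
	slow := bindingPlan{RunPreBind: true} // e.g. a pod that uses volumes

	fmt.Println(apiWrites(true, fast))
	fmt.Println(apiWrites(true, slow))
	fmt.Println(apiWrites(false, slow))
}
```

Under these assumptions, only pods whose binding cycle does real work pay for the extra write, which is why the throughput check focuses on pods that go through Permit or PreBind.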
@@ -519,7 +542,17 @@ there'll be nothing behaving wrong in the scheduling flow, see [Version Skew Str

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-TODO: update the test scenario
+We will do the following manual test after implementing the feature:
+
+1. upgrade
+2. request scheduling of a pod that will need a long preBinding phase (e.g. uses volumes)
+3. check that NNN gets set for that pod
+4. before binding completes, restart the scheduler with nominatedNodeNameForExpectationEnabled = false
+5. check that the pod gets scheduled and bound successfully to the same node
+6. request scheduling of another pod with an expected long preBinding phase
+7. check that NNN does not get set in PreBind
+8. restart the scheduler with nominatedNodeNameForExpectationEnabled = true
+9. check that the pod gets scheduled and bound to any node

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?