`keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md`
We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
based on the expected pod placement.
Please note that `NominatedNodeName` can express the reservation of node resources only, while some resources may be managed by a DRA plugin and expressed through a ResourceClaim allocation. In order to correctly account for all the resources needed by a pod, both the nomination and the ResourceClaim status update need to be reflected in the kube-apiserver.
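
For illustration, a minimal sketch (hypothetical helper, standard `k8s.io/api` types) of how an external component could decide which node's resources a not-yet-bound pod should count against:

```go
package accounting

import corev1 "k8s.io/api/core/v1"

// effectiveNode (hypothetical helper) returns the node whose resources should
// be treated as consumed by the pod: the bound node if the pod is already
// bound, otherwise the node the scheduler has nominated via NominatedNodeName.
// An empty result means no node-level reservation is visible yet.
func effectiveNode(pod *corev1.Pod) string {
	if pod.Spec.NodeName != "" {
		return pod.Spec.NodeName
	}
	return pod.Status.NominatedNodeName
}
```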
### Retain the scheduling decision
At the binding cycle (e.g., PreBind), some plugins may need to act on the pod's scheduling result, for example to provision volumes or prepare devices.
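
As a rough sketch only (the real extension point lives in `k8s.io/kubernetes/pkg/scheduler/framework` and its exact signature may differ), such a plugin is handed the pod together with the node chosen by the scheduling cycle:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// preBindPlugin approximates the scheduler framework's PreBind extension
// point: a plugin receives the pod and the node chosen by the scheduling
// cycle and can prepare volumes, devices, etc. on that node before binding.
type preBindPlugin interface {
	PreBind(ctx context.Context, pod *corev1.Pod, nodeName string) error
}
```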
If we look at it from the consumption point of view, these are effectively the same. We want
to expose the information that, as of now, a given node is considered a potential placement
for a given pod. It may change, but for now that is what is being considered.
#### Node nominations need to be considered together with reserving DRA resources
Node nomination is, in effect, a resource reservation, tracked either in the scheduler's memory or by external components (once the nomination has been persisted to the kube-apiserver). Since pods consume both node resources and DRA resources, it is important to persist both at the same (or almost the same) point in time.
This is consistent with the current implementation: the ResourceClaim allocation is stored in the claim's status during the PreBinding phase, so in conjunction with the node nomination it effectively reserves the complete set of resources (both node and DRA) and enables their correct accounting.
Note that the node nomination is set before the WaitOnPermit phase, while the ResourceClaim status is only published in PreBinding; therefore pods waiting on WaitOnPermit would have only their nominations published, and not their ResourceClaim statuses. This is not considered an issue for now, as there are no in-tree plugins implementing WaitOnPermit and the Gang Scheduling feature is starting in alpha. It does mean, however, that fixing this gap will block the promotion of Gang Scheduling to beta.
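
A hedged sketch of the resulting accounting rule (hypothetical helper; the DRA API group version `resource.k8s.io/v1beta1` is assumed and may differ by release):

```go
package accounting

import (
	corev1 "k8s.io/api/core/v1"
	resourcev1 "k8s.io/api/resource/v1beta1"
)

// fullyReserved (hypothetical helper) reports whether both parts of a pod's
// expected usage are persisted: a node-level reservation (binding or
// nomination) and the allocation of every ResourceClaim the pod depends on.
// A pod parked in WaitOnPermit typically has only the nomination published.
func fullyReserved(pod *corev1.Pod, claims []*resourcev1.ResourceClaim) bool {
	if pod.Spec.NodeName == "" && pod.Status.NominatedNodeName == "" {
		return false // no node-level reservation visible yet
	}
	for _, claim := range claims {
		if claim.Status.Allocation == nil {
			return false // DRA resources for this claim are not reserved yet
		}
	}
	return true
}
```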
#### Increasing the load on kube-apiserver
As of now the scheduler clears the `NominatedNodeName` field at the end of a failed scheduling cycle, if it
found the nominated node unschedulable for the pod. This logic remains unchanged.
NOTE: The previous version of this KEP, which allowed external components to set `NominatedNodeName`, deliberately left the field unchanged after a scheduling failure. With the KEP update for v1.35 this logic is being reverted, and the scheduler goes back to clearing the field after a scheduling failure.
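
For reference, clearing the field corresponds to a patch of the pod's status subresource; the snippet below is only an illustration of the equivalent client-go operation, not the scheduler's internal helper:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// clearNominatedNodeName clears status.nominatedNodeName with a strategic
// merge patch against the pod's status subresource (setting the field to
// null removes it). Illustration only; the scheduler uses its own helper.
func clearNominatedNodeName(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	patch := []byte(`{"status":{"nominatedNodeName":null}}`)
	_, err := client.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch,
		metav1.PatchOptions{}, "status",
	)
	return err
}
```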
### Kube-apiserver clears `NominatedNodeName` when receiving binding requests
We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
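
For context, a binding request issued through client-go looks roughly like the following sketch (illustrative function, not the scheduler's binder); with this change, kube-apiserver clears `status.nominatedNodeName` while handling it:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bindPod issues a binding request against the pods/binding subresource.
// With this KEP, kube-apiserver clears status.nominatedNodeName when it
// receives such a request.
func bindPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, nodeName string) error {
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Target:     corev1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	return client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}
```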
### Handling ResourceClaim status updates
Since the ResourceClaim status update is complementary to node nomination (it reserves resources in a similar way), it is desirable that both are set at the beginning of the PreBinding phase (before the pod starts waiting for resources to be ready for binding). The order of actions in the device management plugin is correct; however, the scheduler runs the PreBind actions of different plugins sequentially. As a result, a long-lasting operation such as PVC provisioning may delay publishing the ResourceClaim allocation status. This is not desired, as it leaves a window in which DRA resources are not reserved, causing problems similar to the ones originally fixed by this KEP (kubernetes/kubernetes#125491).
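
A simplified sketch of that sequential dispatch (not the actual scheduler code; the plugin type is a local stand-in for the framework's PreBind interface):

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// preBindFunc stands in for a plugin's PreBind action.
type preBindFunc func(ctx context.Context, pod *corev1.Pod, nodeName string) error

// runPreBind illustrates why ordering matters: plugins run one after another,
// so a slow action early in the list (e.g. waiting for PVC provisioning)
// delays a later plugin that would publish the ResourceClaim allocation.
func runPreBind(ctx context.Context, plugins []preBindFunc, pod *corev1.Pod, nodeName string) error {
	for _, preBind := range plugins {
		if err := preBind(ctx, pod, nodeName); err != nil {
			return err
		}
	}
	return nil
}
```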
### Test Plan
[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.