
Commit 72bef55
apply dom4ha's comments
1 parent e2533f7

File tree: keps/sig-scheduling/5278-nominated-node-name-for-expectation
1 file changed (+17 -1)

keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md

Lines changed: 17 additions & 1 deletion
@@ -211,6 +211,8 @@ misunderstands the node is low-utilized (because the scheduler keeps the place o
 We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
 based on the expected pod placement.
 
+Please note that `NominatedNodeName` can express the reservation of node resources only; some resources may be managed by a DRA plugin and expressed through a ResourceClaim allocation. To correctly account for all the resources a pod needs, both the nomination and the ResourceClaim status update need to be reflected in the api-server.
+
 ### Retain the scheduling decision
 
 At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod's scheduling result.
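
As an illustration (not part of the KEP text), here is a minimal Go sketch of the accounting rule the added note implies: a consumer should treat a pending pod's footprint as fully persisted only once both the nomination and every referenced ResourceClaim allocation are visible. The `resource.k8s.io/v1beta1` types are real API types, but the helper name and the caller-supplied claim list are assumptions.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
)

// podFootprintPersisted is a hypothetical helper: it reports whether both
// halves of a pending pod's reservation are visible in the api-server - the
// node nomination (node resources) and an allocation on every ResourceClaim
// the pod references (DRA resources).
func podFootprintPersisted(pod *corev1.Pod, claims []*resourcev1beta1.ResourceClaim) bool {
	if pod.Status.NominatedNodeName == "" {
		return false // node resources are not reserved yet
	}
	for _, claim := range claims {
		if claim.Status.Allocation == nil {
			return false // this claim's DRA resources are not reserved yet
		}
	}
	return true
}
```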
@@ -284,7 +286,15 @@ probably isn't that important - the content of `NominatedNodeName` can be interp
 
 If we look from consumption point of view - these are effectively the same. We want
 to expose the information, that as of now a given node is considered as a potential placement
-for a given pod. It may change, but for now that's what considered.
+for a given pod. It may change, but for now that's what is being considered.
+
+#### Node nominations need to be considered together with reserving DRA resources
+
+The semantics of node nomination are in fact resource reservation, either in scheduler memory or in external components (after the nomination is persisted to the api-server). Since pods consume both node resources and DRA resources, it is important to persist both at the same (or almost the same) point in time.
+
+This is consistent with the current implementation: the ResourceClaim allocation is stored in its status during the PreBinding phase, so in conjunction with node nomination it effectively reserves a complete set of resources (both node and DRA) and enables their correct accounting.
+
+Note that node nomination is set before the WaitOnPermit phase, while the ResourceClaim status is published in PreBinding; pods waiting on WaitOnPermit would therefore have only nominations published, and not ResourceClaim statuses. This is not considered an issue for now, since no in-tree plugins support WaitOnPermit and the Gang Scheduling feature is starting in alpha; fixing it will, however, block the promotion of Gang Scheduling to beta.
 
 #### Increasing the load to kube-apiserver
 
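The ordering described in the paragraphs added above can be compressed into a sketch. The phase helpers below (`nominate`, `waitOnPermit`, `preBind`, `bind`) are hypothetical stand-ins, not the scheduler framework's actual code; only the sequence, and the visibility gap it creates, is the point.

```go
package sketch

import "context"

// Hypothetical stand-ins for the scheduler's phases; only their order matters.
func nominate(ctx context.Context, pod, node string) error { return nil } // persists NominatedNodeName
func waitOnPermit(ctx context.Context, pod string) error   { return nil } // may block for a long time
func preBind(ctx context.Context, pod, node string) error  { return nil } // publishes ResourceClaim allocation
func bind(ctx context.Context, pod, node string) error     { return nil }

// bindingCycle shows why a pod waiting on WaitOnPermit has only its nomination
// published: the nomination is persisted before the permit wait, while the
// ResourceClaim allocation status becomes visible only once PreBind runs.
func bindingCycle(ctx context.Context, pod, node string) error {
	if err := nominate(ctx, pod, node); err != nil { // node reservation visible from here on
		return err
	}
	if err := waitOnPermit(ctx, pod); err != nil { // DRA reservation still not visible
		return err
	}
	if err := preBind(ctx, pod, node); err != nil { // DRA reservation visible from here on
		return err
	}
	return bind(ctx, pod, node)
}
```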
@@ -362,11 +372,17 @@ We'll ensure this scenario works correctly via tests.
 
 As of now the scheduler clears the `NominatedNodeName` field at the end of failed scheduling cycle, if it
 found the nominated node unschedulable for the pod. This logic remains unchanged.
+
+NOTE: The previous version of this KEP, which allowed external components to set `NominatedNodeName`, deliberately left the field unchanged after a scheduling failure. With the KEP update for v1.35 this logic is reverted, and the scheduler goes back to clearing the field after a scheduling failure.
 
 ### Kube-apiserver clears `NominatedNodeName` when receiving binding requests
 
 We update kube-apiserver so that it clears `NominatedNodeName` when receiving binding requests.
 
+### Handling ResourceClaim status updates
+
+Since the ResourceClaim status update is complementary to node nomination (it reserves resources in a similar way), both should be set at the beginning of the PreBinding phase, before the pod starts waiting for resources to become ready for binding. The order of actions within the device management plugin is correct; however, the scheduler runs the PreBind actions of different plugins sequentially, so a long-lasting operation such as PVC provisioning may delay publication of the ResourceClaim allocation status. This is undesirable because it leaves a window in which DRA resources are not reserved, causing problems similar to the ones originally fixed by this KEP - kubernetes/kubernetes#125491.
 
 ### Test Plan
 
 [x] I/we understand the owners of the involved components may require updates to
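
The sequential-PreBind concern in the added "Handling ResourceClaim status updates" section can be seen in a simplified sketch. The `PreBindPlugin` interface and loop below mirror the framework's sequential execution but are assumptions for illustration, not the actual scheduler code.

```go
package sketch

import "context"

// PreBindPlugin is a trimmed-down, hypothetical version of the scheduling
// framework's PreBind extension point.
type PreBindPlugin interface {
	Name() string
	PreBind(ctx context.Context, pod string) error
}

// runPreBindPlugins mirrors the sequential execution described above: each
// plugin waits for the previous one to finish, so a slow plugin (e.g. PVC
// provisioning) that runs before the DRA plugin delays publication of the
// ResourceClaim allocation status, leaving DRA resources unreserved meanwhile.
func runPreBindPlugins(ctx context.Context, plugins []PreBindPlugin, pod string) error {
	for _, p := range plugins {
		if err := p.PreBind(ctx, pod); err != nil {
			return err // a PreBind failure aborts the whole binding cycle
		}
	}
	return nil
}
```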
