KEP-5832: Decouple PodGroup API from Workload API #5833
k8s-ci-robot merged 20 commits into kubernetes:master from
Conversation
keps/sig-apps/3541-add-recreate-strategy-to-statefulset/README.md
```go
	Conditions []metav1.Condition
}
```
Can we add one more subsection about deletion protection?
I would like to ensure that a PodGroup will not get deleted if there are still some pods in a non-terminal state in that PodGroup (i.e. not in the Succeeded/Failed state).
The way I envision it is that the PodGroup will be created with some dedicated finalizer (name TBD).
We will have a controller that watches PodGroups with initiated deletion (deletionTimestamp set), waits for all pods linking to them to terminate, and only then removes the finalizer.
Thinking about it, this may not be a requirement for Alpha (though it would be nice), but let's put it into the KEP and add it to the beta criteria so we won't forget about it.
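The gating rule described in this comment could be sketched roughly as follows. This is a minimal illustration, not the KEP's implementation: the finalizer name is a placeholder (the KEP leaves it TBD), and a tiny local `PodPhase` type stands in for `corev1.PodPhase` so the sketch is self-contained.

```go
package main

import "fmt"

// Hypothetical finalizer name; the actual name is TBD in the KEP.
const podGroupFinalizer = "example.k8s.io/podgroup-protection"

// Minimal stand-in for corev1.PodPhase to keep the sketch self-contained.
type PodPhase string

const (
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
	PodRunning   PodPhase = "Running"
)

// isTerminal mirrors the rule above: only Succeeded/Failed count as
// terminal; any other phase blocks PodGroup deletion.
func isTerminal(p PodPhase) bool {
	return p == PodSucceeded || p == PodFailed
}

// canRemoveFinalizer returns true only when every pod still referencing
// the PodGroup has reached a terminal phase.
func canRemoveFinalizer(referencingPods []PodPhase) bool {
	for _, phase := range referencingPods {
		if !isTerminal(phase) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(canRemoveFinalizer([]PodPhase{PodSucceeded, PodRunning})) // false
	fmt.Println(canRemoveFinalizer([]PodPhase{PodSucceeded, PodFailed}))  // true
}
```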
For alpha, could we at least mention that true workload controllers (or whoever creates a PodGroup) should (must?) add a finalizer to make sure a Pod doesn't outlive its PodGroup? I'm depending on some mechanism like that to exist for #5736. Or should we document that expectation there instead?
> Can we add one more subsection about deletion protection?

Added. PTAL.
> For alpha, could we at least mention that true workload controllers

I believe we should document this expectation and add it as a beta blocker, similar to here and https://github.com/kubernetes/enhancements/pull/5871/files#diff-03402ccdde6d2da9ed283ef0c1b203ef09baec88c05c9631e6ec0e7a8463a29dR214
### Risks and Mitigations
- Increase API calls volume: More objects means more API calls for creation, updates, and watches. The mitigation is split the responsibility. `Workload` is rarely updated (as a policy object) while `PodGroup` handles runtime state. In addition, `PodGroups` allow per-replica sharding of status updates.
NIT: ... The mitigation is to split the responsibility. The Workload object is rarely updated (as a template object), while the PodGroup handles runtime state:
```go
// PodGroupTemplateReference references the PodGroupTemplate object that
// defines the template used to create the PodGroup.
type PodGroupTemplateReference struct {
	// WorkloadName defines the name of the Workload object (Scheduling Policy) this Pod belongs to.
```
NIT: I'd remove the "(Scheduling Policy)" or rename to "(Scheduling Policy Template)".
In the decoupled model, the actual policy is inside the PodGroup.
- Read `pod.spec.podGroupRef.name` to identify the `PodGroup`
- Look up the `PodGroup` object to check its existence and to get the scheduling policy
- Read `pod.spec.podGroupRef.workloadName` to identify the Workload and check its existence
+1. The TL;DR change for scheduler is: Instead of reading the Workload read the PodGroup. It contains all the information needed by scheduler.
I'd write it explicitly (one sentence) here - maybe as the first sentence.
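The one-sentence summary suggested here could be illustrated with a minimal sketch of the lookup. The types below are tiny stand-ins (a map plays the role of an informer-backed `PodGroup` lister), not the real scheduler code:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-in for the real API type; in the decoupled model the
// PodGroup carries the scheduling policy the scheduler needs.
type PodGroup struct {
	Name   string
	Policy string
}

// Minimal stand-in for a Pod; PodGroupName models pod.spec.podGroupRef.name.
type Pod struct {
	Name         string
	PodGroupName string
}

// lookupPolicy sketches the TL;DR above: instead of reading the Workload,
// the scheduler reads the referenced PodGroup, which contains all the
// information it needs.
func lookupPolicy(pod Pod, podGroups map[string]PodGroup) (string, error) {
	pg, ok := podGroups[pod.PodGroupName]
	if !ok {
		return "", errors.New("referenced PodGroup not found")
	}
	return pg.Policy, nil
}

func main() {
	groups := map[string]PodGroup{"pg-0": {Name: "pg-0", Policy: "gang"}}
	policy, err := lookupPolicy(Pod{Name: "pod-a", PodGroupName: "pg-0"}, groups)
	fmt.Println(policy, err)
}
```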
Signed-off-by: helayoty <heelayot@microsoft.com>
The `PodGroup` lifecycle needs to make sure that `PodGroup` will not be deleted while any pod that references it is in a non-terminal phase (i.e. not `Succeeded` or `Failed`).
`PodGroup` objects are created with a dedicated finalizer that the controller is responsible for removing only when the deletion-safe condition is met. The mechanism for this is:
Could we clarify here whether the controller is the true workload controller or one run by kube-controller-manager?
There was a problem hiding this comment.
updated. @wojtek-t can you please confirm the update is correct?
- If any referencing pod is non-terminal, the controller leaves the finalizer in place and re-enqueues (e.g., on pod updates)
- To find the referencing pods, we can use an index keyed by `workloadRef.podGroupName` (and optionally namespace) so the controller can efficiently list pods that reference a given `PodGroup`
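The indexing idea in the second bullet could be sketched like this. In real code this would be an informer indexer over `corev1.Pod` objects; here a plain map and a minimal local `Pod` type stand in so the sketch runs on its own:

```go
package main

import "fmt"

// Minimal stand-in for a Pod; PodGroupName models the
// workloadRef.podGroupName field mentioned above.
type Pod struct {
	Namespace    string
	Name         string
	PodGroupName string
}

// buildIndex keys pods by namespace/podGroupName so the controller can
// list all pods referencing a given PodGroup without scanning every pod.
func buildIndex(pods []Pod) map[string][]Pod {
	idx := make(map[string][]Pod)
	for _, p := range pods {
		key := p.Namespace + "/" + p.PodGroupName
		idx[key] = append(idx[key], p)
	}
	return idx
}

func main() {
	idx := buildIndex([]Pod{
		{Namespace: "default", Name: "a", PodGroupName: "pg-0"},
		{Namespace: "default", Name: "b", PodGroupName: "pg-0"},
		{Namespace: "default", Name: "c", PodGroupName: "pg-1"},
	})
	fmt.Println(len(idx["default/pg-0"])) // 2
}
```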
Deletion protection is not required for alpha (nice-to-have), however it is required for beta graduation.
If this section is describing new functionality that will become part of kube-controller-manager only when this KEP graduates to beta, do we need to say that true workload controllers need to handle this themselves in the meantime? Or is this only relevant to #5736 so I should mention this there?
For me it's actually relevant here (it's rather infrastructure of the PodGroup API).
The reason I'm saying it's not alpha is that I want to ensure that we will be able to deliver everything. Alpha should prove the feature, and the finalizer stuff is part of making it production-ready.
@nojnhuh - my claim is that given that no one will really use Alpha in production anyway, we can ignore that problem in Alpha and ensure that it's solved by the controller for Beta; I don't think you should be adding any custom logic for that on your side.
dom4ha left a comment
All changes look good; waiting with approval for the update to workloadRef.
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
The increase of CPU/MEM consumption of kube-apiserver and kube-scheduler should be a negligible percentage of the current resource usage.
Can you revisit this section, especially the use of 'negligible'?
As you mentioned above in the "Informers and Watches" section:
> The kube-scheduler will add a new informer to watch `PodGroup` objects and stop watching `Workload` objects.
While the design replaces a Workload informer with a PodGroup informer, the effective cardinality changes significantly. Instead of the scheduler's cache scaling with the number of high-level Workloads, it will now scale with the number of individual PodGroups.
What do you think?
I agree that this section is a bit optimistic. Updated. PTAL.
This looks great both with my scheduling hat as well as with my PRR hat. /lgtm
@dom4ha - I think this is ready for your (hopefully last) pass
dom4ha left a comment
/approve
I noticed two leftovers, the rest is great. Thank you Heba!
`PodGroup` status mirrors `Pod` status semantics:
- If pods are unschedulable (e.g., timeout, resources, affinity, etc.), the scheduler updates the `PodGroupScheduled` condition to `False` and sets the reason fields accordingly.
- If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the last pod in the gang completes binding.
Suggested change:
- If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the last pod in the gang completes binding.
+ If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the group got accepted by the Permit phase.
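The status semantics discussed in this thread could be condensed into a small decision function. This is only a sketch of the logic, with a local `ConditionStatus` type standing in for `metav1.ConditionStatus`; the function name and parameters are placeholders, not the KEP's API:

```go
package main

import "fmt"

// Minimal stand-in for metav1.ConditionStatus.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// podGroupScheduledStatus mirrors Pod status semantics for the group:
// False when the scheduler marked the group unschedulable, True once the
// whole gang is placed, Unknown while scheduling is still in flight.
func podGroupScheduledStatus(boundPods, gangSize int, unschedulable bool) ConditionStatus {
	switch {
	case unschedulable:
		return ConditionFalse
	case boundPods >= gangSize:
		return ConditionTrue
	default:
		return ConditionUnknown
	}
}

func main() {
	fmt.Println(podGroupScheduledStatus(2, 3, false)) // Unknown
	fmt.Println(podGroupScheduledStatus(3, 3, false)) // True
	fmt.Println(podGroupScheduledStatus(0, 3, true))  // False
}
```

Whether `boundPods >= gangSize` means "last pod completed binding" or "group accepted by the Permit phase" is exactly the wording question raised in the suggestion above.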
#### GangScheduling plugin
The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Suggested change:
- The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
+ The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists. This is in addition to the following changes:
mm4tt left a comment
One minor comment. Other than that
/lgtm
Thanks!
#### GangScheduling plugin
The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Above we're saying that the scheduler will stop watching Workload objects. Here we're saying that it "will check existence of PodGroup along with the Workload". Let's make it consistent - i.e. remove the "along with the Workload object".
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dom4ha, helayoty, wojtek-t
```go
// - "Scheduled": All required pods have been successfully scheduled.
// - "Unschedulable": The PodGroup cannot be scheduled due to resource constraints,
//   affinity/anti-affinity rules, or insufficient capacity for the gang.
// - "SchedulingGated": One or more pods in the PodGroup have scheduling gates
```
I think setting the SchedulingGated condition would be difficult since sending any synchronous API calls from the scheduling queue is undesirable. The apiserver applies this condition for scheduling gates on pods when it sees that the pod has scheduling gates. However, I don't think we can do the same for pod groups.
Are you saying we should remove this? Or can we just send the API call from other place in scheduler code (e.g. a separate go-routine)?
I'd rather remove it and reconsider when we have the fully working async API calls feature (#5229)
+1 for removing it - I think that deciding exact reasons we support is the discussion for the code review
Anyway, if this is non-trivial to implement then I don't think this should be part of alpha. Removing in #5912.
We can remove it. We can send the Unschedulable status for Pods (maybe for the PodGroup as well) when they are waiting too long in PreEnqueue for the min or desired count, but it's indeed dependent on the Async API calls feature, which introduces such a possibility.
/area workload-aware
One-line PR description: Decouple PodGroup API as a runtime object from Workload API
Issue link: WAS: Decouple PodGroup API #5832
Other comments: This PR is part of the workload-aware scheduling workstream to implement gang-scheduling. It introduces a new separate runtime object, `PodGroup`, that helps keep the `Workload` API simple and user-friendly.
/sig scheduling