KEP-4671: Introduce Workload Scheduling Cycle and Delayed Preemption #5730

k8s-ci-robot merged 23 commits into kubernetes:master
Conversation
erictune left a comment:
This looks great overall.
I have one proposal to consider renaming part of the API, and some clarifying questions.
> // +required
> Name string
>
> // PodGroup is the name of the PodGroup within the Workload that this Pod
The term PodGroup has become ambiguous. It can mean:
1. a list item in `Workload.podGroups`
2. a specific set of pods having the same `PodGroupReplicaKey`

These are often the same thing, but not always.
If we want to be clear what we mean, we can either:
- Always call (1) a PodGroup and always call (2) a PodGroupReplica, including in the case when podGroupReplicaKey is nil
- Or, rename PodGroup to PodGroup Template, and call (2) a PodGroup
This may seem pedantic, but when reviewing this KEP, I felt that the current double meaning made the text imprecise in a number of places. We do have a chance to fix the naming with v1beta1, but I don't think we are supposed to rename when we go to GA.
My personal preference: when talking about (2), I would just call that either PodGroupReplica or alternatively (which is what I was using) PodGroup instance. With that, I wouldn't rename it - but these are my personal preferences.
If we rename it to PodGroupReplica, are we planning to introduce PodSubGroup under it, as we discussed here: #5711 (comment) ? And what about PodSubGroupReplica?
Agree with andreyvelich, PodSubGroupReplica and PodSetReplica are confusing.
This would be less confusing:
- 1 Workload object in the API server names exactly 1 group of pods at runtime.
- 1 Workload.podGroupTemplates[i] is a template for N independent PodGroups which are distinguished by PodGroupReplicaKey
Later, if PodSubGroup is added:
- 1 Workload.podGroupTemplate[i].podSubGroup[j] names exactly one PodSubGroup within a single group of pods (PodGroup) defined as above.
Later, if PodSet is added under PodSubGroup:
- 1 Workload.podGroupTemplate[i].podSubGroup[j].podSet[k] names exactly one PodSet within PodSubGroup, defined as above.
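This naming scheme could be sketched as Go types (purely illustrative; none of these types or fields are the actual Workload API):

```go
package main

import "fmt"

// Illustrative sketch of the naming scheme above; not the real API.

type PodSet struct {
	Name string
}

type PodSubGroup struct {
	Name    string
	PodSets []PodSet // only if PodSet is added later
}

type PodGroupTemplate struct {
	// A template for N independent PodGroups at runtime, distinguished
	// by their PodGroupReplicaKey.
	Name         string
	PodSubGroups []PodSubGroup // only if PodSubGroup is added later
}

type WorkloadSpec struct {
	PodGroupTemplates []PodGroupTemplate
}

func main() {
	w := WorkloadSpec{PodGroupTemplates: []PodGroupTemplate{
		{
			Name:         "workers",
			PodSubGroups: []PodSubGroup{{Name: "shard", PodSets: []PodSet{{Name: "replica"}}}},
		},
	}}
	// One template; its runtime instances would be keyed by PodGroupReplicaKey.
	fmt.Println(w.PodGroupTemplates[0].PodSubGroups[0].PodSets[0].Name) // prints "replica"
}
```

The point of the sketch is that the 8-entry limit and the names all attach to templates, while runtime instances are identified separately.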
This needs a longer discussion than belongs in the review comments of this PR.
Suggestion to KEP author:
- Keep Workload Scheduling Cycle proposal in this PR.
- Keep Basic policy enhancements in this PR.
- Remove advancement to beta.
Let's all, as reviewers, approve the above. Then we can take a week to discuss the "PodGroup as separate object" proposal, which may affect the alpha/beta status of Workload.
Agree with @johnbelamaric and @erictune that we should separate out the discussion of whether we need a separate object for PodGroup.
> Remove advancement to beta.
Is my understanding correct that we will create another KEP for beta graduation in that case?
I'm going back and forth on whether we should separate PodGroup into a separate object (in fact, my original counter-proposal went that way, though for different reasons: https://docs.google.com/document/d/13UkLjVMj_edMh7biqVU6SVyNNTIfGAT35p6-pNsF5AY/edit?resourcekey=0-dqUEiwiXWICLwAg6Tqkupw&tab=t.9le0fmf90j3w#heading=h.vf43rjyfidc6).
I agree it's the last moment to change that before going to Beta.
But I also agree that pretty much none of the proposed changes (basicPolicy, workload scheduling cycle, scheduling algorithm, ...) depend on this decision, and they can proceed independently.
So I'm heavily +1 on Eric's suggestion above to focus this PR on those changes but still leave the KEP in Alpha after this PR.
And have a separate PR that will be focused on the API itself (it can't be a separate KEP, but it can be a separate PR and discussion).
Also - we should probably start a dedicated document for that to clearly describe the pros/cons of both options, to make a more data-driven decision.
@macsko - I'll be OOO for the next 2 weeks; would you be able to start such a doc? I'm happy to contribute once I'm back.
Okay, so I'll remove the beta graduation part from this PR, and the discussion about the API can be moved to a new document.
So I went ahead and started a doc: https://docs.google.com/document/d/1zVdNyMGuSi861Uw16LAKXzKkBgZaICOWdPRQB9YAwTk/edit?resourcekey=0-bD8cjW_B6ZfOpSGgDrU6Mg&tab=t.0#heading=h.c4vrtnmf9f4o
It's only a starter and requires a lot of work, so I would appreciate all contributions, especially given that I will be OOO until Jan 7th.
> ControllerRef *TypedLocalObjectReference
>
> // PodGroups is the list of pod groups that make up the Workload.
> // The maximum number of pod groups is 8. This field is immutable.
But the maximum number of PodGroup Replicas is not limited.
That's right. It would even be hard to enforce such a limitation in the current form of replication, since it's based solely on pods' workloadRefs.
I was suggesting making this clearer in the code comment.
Splitting PodGroup into its own resource would make it a lot clearer that there is a distinction between the template and the instance, and that the limit applies to defined templates, not instances.
> When the scheduler pops a Pod from the active queue, it checks if that Pod
> belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
> initiates the Workload Scheduling Cycle.
If a pod belongs to an already scheduled PodGroup, it is not clear what to do. We could:
- Hold it back, on the assumption that it is a replacement for an already scheduled pod. When `minCount` additional pods show up, handle all of those at once.
- Treat it as a "gang scale up", and try to place it too (on a best-effort basis). If we do this, does it go through the Workload cycle, or just the normal pod-at-a-time path?

I'm thinking about races that could happen when a workload is failing and getting pods replaced. And I'm thinking about the case when a workload wants to tolerate a small number of failures by having actual count > minCount.
Taking a step back from the implementation, we should think what we really need to do conceptually in that case.
If the pod is part of a PodGroup, then what we conceptually want is to schedule it in the context of its PodGroup instance. I think there are effectively three cases here:
- this PodGroup instance doesn't satisfy its minCount now, but with this pod it will satisfy it
- this PodGroup instance doesn't satisfy its minCount now and with that pod it won't satisfy it either
- this PodGroup instance already satisfies its minCount even without this pod
It's worth noting at this point that, with topology-aware scheduling introduced, a PodGroup instance could have TAS requirements too, which matters when deciding what we should do with it.
The last point, in my opinion, means that we effectively should always go through the Workload cycle, because we always want to consider the pod in the context of the whole workload.
The primary question is whether we want to kick this workload cycle off with an individual pod or wait for more, and if so, based on what criteria.
I think the exact criteria will evolve in the future (I can definitely imagine them depending on the "preemption unit" that we're introducing in KEP-5710). So for now, I wouldn't try to settle on a final solution.
I would suggest starting with:
- always go through the Workload cycle (to keep the whole podGroup instance in mind as context)
- for now, always schedule the individual pod, for the sake of making progress here (we will most probably adjust this in the future, but it will be a pretty localized change and I think it's fine)
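The three cases and the suggested starting behavior could be sketched as follows (the `classify` function and case names are mine, not from the KEP):

```go
package main

import "fmt"

// Illustrative classification of a pod arriving for an existing
// PodGroup instance, mirroring the three cases discussed above.
type gangCase int

const (
	// the instance doesn't satisfy minCount now, but this pod completes it
	completesMinCount gangCase = iota
	// the instance doesn't satisfy minCount, and won't with this pod either
	stillBelowMinCount
	// the instance already satisfies minCount even without this pod
	alreadyMeetsMinCount
)

// classify maps the current pod count of a PodGroup instance and its
// minCount onto the three cases. Per the suggestion above, the pod goes
// through the Workload cycle in every case.
func classify(current, minCount int) gangCase {
	switch {
	case current >= minCount:
		return alreadyMeetsMinCount
	case current+1 >= minCount:
		return completesMinCount
	default:
		return stillBelowMinCount
	}
}

func main() {
	// A gang with minCount=3 that already has 2 pods: the new pod completes it.
	fmt.Println(classify(2, 3) == completesMinCount) // prints true
}
```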
It seems that the easiest way is to do what @wojtek-t described, i.e., go best effort (as for the basic policy), so take as many pods as we have available, and try to schedule them in the workload cycle. Only pods that have passed the workload cycle will be able to move on to the pod-by-pod cycle. In the future, we can try to create more intelligent grouping, but for now, let's focus on delivering a good enough implementation.
What is the plan if a PodGroup has been scheduled with TAS constraints based on minCount, and a new Pod comes in that is part of the PodGroup but doesn't fit the TAS constraints?
That's a good point. I suppose the new Pod should go through a workload scheduling cycle, and the TAS algorithm there should take the already scheduled pods from the gang into consideration. If the pod doesn't fit, it will remain unschedulable until something changes in the cluster.
Right - any pod that is part of a PodGroup should always go through the WorkloadScheduling cycle first (which can look at all pods from it, including already scheduled ones) - no matter whether the PodGroup is unscheduled or (partially) scheduled.
I see that it was already reflected in the text below.
wojtek-t left a comment:
Added a few comments, but they are pretty localized. Overall this is a great proposal, pretty aligned with how I was thinking about it too.
> Can report dedicated logs and metrics with less confusion to the user.
> * *Cons:* Significant and non-trivial architectural change to the scheduling queue
> and `scheduleOne` loop.
> <<[/UNRESOLVED]>>
I would value opinions from people more hands-on with the scheduler code than myself, but on paper the third alternative seems preferable to me:
- Alternative 1: it will be extremely hard to reason about if pods sit in different queues (backoff, active, ...) independently. The evolution point is extremely important, IMHO: we know we will be evolving this a lot, and preparing for that is super important.
- Alternative 2: I share the concerns about corner cases (a deleted pod is exactly what I have in mind). Once we get to rescheduling cases (new pods appear, but they should trigger removing and recreating some of the existing pods), it will get even more complicated, with even harder corner cases. Given that we know rescheduling is super important, I'm also reluctant about this one.
- Alternative 3: it's clearly the cleanest option. The primary drawback, the need for non-trivial changes, is IMHO justified, given we know that we will be evolving it significantly in the future.

So I have quite a strong preference towards (3) at this point, but I'm happy to be challenged by implementation-specific counter-arguments.
That was my thought as well. I think the effort involved in adding workload queuing will be comparable to modifying the code for the previous alternatives, but maybe I don't see some significant drawbacks.
For the third alternative, do we want to introduce only the queue for PodGroups? If so, then we have the same problem of pulling pods from the backoff and unschedulable queues, right? Or do we mean by the workload queue a structure that will hold all the pods related to workload scheduling?
Let's say that we add a Pod that is part of a group but does not make us meet the `minCount`. Should it land in the unschedulable queue as it does right now? Or in some additional structure?
If the pod group failed scheduling, where do we put its pods? We need to keep them somewhere so we can check them against cluster events, in case some event makes a pod schedulable and thus potentially makes the whole pod group schedulable.
I think one way might be to introduce both:
- a queue for pod groups (let's say `podGroupsQ`), containing the pod groups that can be popped for workload scheduling
- a data structure for unscheduled pods (let's say `workloadPods`) that have to wait for workload scheduling to finish before proceeding with their pod-by-pod cycles

I imagine the pod transitions (for pods that have a workloadRef) would be:
pod is created -> when PreEnqueue passes for the pod, add it to `workloadPods`, otherwise add it to `unschedulablePods` -> [for a gang pod group] if >= minCount pods for the group are in `workloadPods`, the group is added to `podGroupsQ` -> the pod group is processed and the workload scheduling cycle is executed
When the workload cycle finishes successfully:
pod gets the NNN and is moved to the activeQ -> pod is popped and goes through its pod-by-pod scheduling cycle -> ...
When the workload cycle fails:
pod is moved to `unschedulablePods`, where it waits for a cluster change to happen -> when a change happens for this pod or another from its pod group, the pod(s) are moved to `workloadPods` -> the workload manager detects that the group should be retried and adds it to `podGroupsQ` -> processing continues as before
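A minimal sketch of those two structures and the happy-path transition (using the placeholder names `podGroupsQ` and `workloadPods` from this comment; none of this is real kube-scheduler code):

```go
package main

import "fmt"

// Placeholder sketch of the queueing structures proposed above.
type podGroupKey struct{ workload, group string }

type workloadQueues struct {
	// pods that passed PreEnqueue and wait for workload scheduling
	workloadPods map[podGroupKey][]string
	// groups ready to be popped for a workload scheduling cycle
	podGroupsQ []podGroupKey
}

func newWorkloadQueues() *workloadQueues {
	return &workloadQueues{workloadPods: map[podGroupKey][]string{}}
}

// addPod implements: if >= minCount pods for the group are in
// workloadPods, the group is added to podGroupsQ. (Enqueued exactly
// once, when the threshold is first reached; a real implementation
// would need dedup and retry handling.)
func (q *workloadQueues) addPod(key podGroupKey, pod string, minCount int) {
	q.workloadPods[key] = append(q.workloadPods[key], pod)
	if len(q.workloadPods[key]) == minCount {
		q.podGroupsQ = append(q.podGroupsQ, key)
	}
}

// pop returns the next group to run a workload scheduling cycle for,
// together with its waiting pods.
func (q *workloadQueues) pop() (podGroupKey, []string, bool) {
	if len(q.podGroupsQ) == 0 {
		return podGroupKey{}, nil, false
	}
	key := q.podGroupsQ[0]
	q.podGroupsQ = q.podGroupsQ[1:]
	return key, q.workloadPods[key], true
}

func main() {
	q := newWorkloadQueues()
	k := podGroupKey{"job-a", "workers"}
	q.addPod(k, "worker-0", 2)
	q.addPod(k, "worker-1", 2) // minCount reached, group enqueued
	if key, pods, ok := q.pop(); ok {
		fmt.Println(key.group, len(pods)) // prints "workers 2"
	}
}
```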
Again, alternative 3 is the obvious choice IMO if PodGroup is a separate resource.
In Koordinator, I used Alternative 2. The desired effect is that, in ActiveQ/BackoffQ, a PodGroup has only one Item, and this Item can carry all queuing attributes (Priority, LastScheduleTime, BackoffTime, etc.). Whether this Item is a representative pod or a real PodGroup is less important.
Another important point is that when a representative Pod or PodGroup is dequeued, its sibling Pods are also dequeued. In Koordinator, we implemented our own NextPod (koordinator-sh/koordinator#2417) to achieve this.
Regarding this KEP, I feel that a fusion of Alternative 2 and Alternative 3 might be a better solution:
- Use a QueuedPodGroupInfo that aggregates the queuing attributes of all member Pods to flow PodGroups between ActiveQ, BackoffQ, and UnschedulableQ, allowing the previous QueueHint mechanism to seamlessly integrate with PodGroups.
- Use a separate Map to store the Pods belonging to a QueuedPodGroupInfo for easy indexing.
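As a sketch of that fusion (the field names and aggregation rules here are my guesses, not Koordinator's or the KEP's actual types):

```go
package main

import (
	"fmt"
	"time"
)

// Sketch of a single queue item per PodGroup that aggregates its
// members' queuing attributes; the member pods live in a side map.
// All names and aggregation choices are illustrative.
type QueuedPodGroupInfo struct {
	Name             string
	Priority         int32     // aggregated, e.g. max of member priorities
	LastScheduleTime time.Time // aggregated, e.g. earliest among members
}

// aggregate builds the single queue item from its members' attributes.
func aggregate(name string, priorities []int32, times []time.Time) *QueuedPodGroupInfo {
	info := &QueuedPodGroupInfo{Name: name}
	for _, p := range priorities {
		if p > info.Priority {
			info.Priority = p
		}
	}
	for _, t := range times {
		if info.LastScheduleTime.IsZero() || t.Before(info.LastScheduleTime) {
			info.LastScheduleTime = t
		}
	}
	return info
}

func main() {
	info := aggregate("workers", []int32{10, 50, 20}, []time.Time{time.Now()})
	fmt.Println(info.Priority) // prints 50
}
```

With one item per group, the existing QueueHint machinery can move the whole group between ActiveQ, BackoffQ, and UnschedulableQ without per-pod bookkeeping.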
Consider that we should have PodGroups without pods at some point in time and we should be able to schedule them somehow.
However, before we get there, I'd consider scheduling a PodGroup simply whenever we encounter the first pod that refers to it and the PodGroup hasn't been scheduled yet. It's similar to option 1 but without modifying the sorting function.
I think it would be the simplest implementation, as all pods stay in the active queue unless an attempt to schedule a PodGroup they belong to makes the PodGroup unschedulable. Obviously, the question is when to reconsider an unschedulable PodGroup. At the beginning, I'd make it schedulable again periodically (after some timeout), without defining a PodGroup queue yet.
I'm not sure yet how Workload-aware preemption may interact with it.
> It's similar to option 1 but without modifying the sorting function.
Added it as an "Alternative 0" to the proposal.
I still think that alternative 3 is the best one given all the pros/cons described.
force-pushed a63f0ef to 71ef75a
force-pushed 71ef75a to 9ce3fc7
force-pushed 9ce3fc7 to eae3ddb
> // +required
> Name string
>
> // PodGroup is the name of the PodGroup within the Workload that this Pod
@thockin - I would appreciate your feedback from an API approver perspective.
> the standard pod-by-pod scheduling cycle.
>
> When the scheduler pops a Pod from the active queue, it checks if that Pod
> belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
As discussed in the Basic policy section above, I actually think that all pods that belong to workloads should go through this phase (whether they form gangs or are just from the basic policy).
I acknowledge that for basic it will be kind of best-effort (if more pods were created, we will get all of them; if we only observed 1, it will be just one), but that better opens the door for future extensions once we have pod templates etc. defined in Workload.
+1
I'd say that Workload scheduling is a phase scheduling PodGroups. Depending on what policy type the group is, the logic is different.
In case of the Basic policy, the group becomes scheduled unconditionally; still, pods belonging to that group cannot be scheduled until the PG itself is scheduled.
Removed "Gang" from this sentence. I think we are aligned that the basic-policy pods should be scheduled by this phase.
> *Proposed:* Implement it on PodGroup Level for Beta. However, future migration
> to the Workload Level might necessitate non-trivial changes to the phase
> introduced by this KEP.
I suggest resolving that (and potentially mentioning the risk of non-trivial changes in the Risks section).
> 2. The scheduler nominates the victims for preemption and the gang Pod
> for scheduling in their place. This way, the gang can be attempted
> without making any intermediate disruptions to the cluster.
> * If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod.
This should be adjusted once we settle on the details in #5711.
Currently these are not aligned :)
[It's a comment for myself too]
dom4ha left a comment:
/lgtm
/hold for @sanposhiho's lgtm
In case someone overlooked the recent change. Pods belonging to a PodGroup that got successfully scheduled will go directly to the binding phase. It simplifies a few things, so it's great we did that.
The KEP is good to go from my perspective. There are a few purely wording comments.
The changes LGTM
/lgtm
/lgtm for enforcing the same scheduling profile for a PodGroup cc @wojtek-t
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: macsko, sanposhiho.
/unhold
/hold |
This LGTM too. @macsko - thanks a lot for pushing that.
/lgtm
…ubernetes#5730)
* Add a section about scheduler changes for v1.36
* Add a section about basic policy update
* Remove beta graduation from the PR, extend sections about workload scheduling cycle
* Expand queueing alternatives. Add unresolved section about enforcing minCount
* Apply comments
* Add NNN risks, extend workload algorithm with more details
* Move delayed preemption details to this KEP
* Update toc
* Make delayed preemption description consistent. Extend integration tests section
* Resolve queueing strategy and feasibility plugin. List algorithm limitations. Make NNN a hard requirement. Apply comments
* Fix lint
* Apply review comments
* Apply comments
* Apply comments
* Apply comments
* Remove Basic policy desiredCount from the KEP scope
* Apply comments
* Update toc
* Apply comments
* Update the KEP with a decision to skip pod-by-pod scheduling phase after workload cycle
* Apply comments
* Update toc
* Add a paragraph about requirement of consistent schedulerName

/area workload-aware