[KEP-2724] Introduce two-level scheduling #5449
Conversation
Hi @lchrzaszcz. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Change to `Part of`, or `KEP for`. We have a robot which will close the issue once a PR with `Fixes` before the issue number is merged.
| Extension: support for indicating the order of pods for external Jobs. | ||
| #### Story 5 |
Yes, extending to "LWS" is definitely on our radar. However, for LWS we need to co-schedule PodSets rather than "chunk" them. So, we consider it a follow-up.
| mode    | PodSet size | 6 | 5 | 4 | 3 | 2 |
| ------- | ----------- | - | - | - | - | - |
| BestFit | 12          | 6 | . | 4 | . | 2 |
I think it's worth noting that even with the BestFit algorithm there is no guarantee that there won't be any leftover capacity in the selected nodes, as there is already some rounding in the assumption that nodes can accommodate an integer number of pods. The BestFit algorithm mitigates resource fragmentation very well, and there's no doubt it's the way to go, but let's keep in mind that even it has limits, and we won't mitigate fragmentation completely. Of course, this reasoning applies to both single-level and two-level scheduling.
I've added a one-sentence disclaimer that the "tight fit" I'm referring to does not guarantee no dangling resources.
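To make that limit concrete, here is a small worked reading of the table above (an illustration only, assuming a slice size of 2 as in the BestFit explanation later in this review): the node with free capacity 5 can host at most 2 slices, i.e. 4 pods, so 1 pod's worth of capacity is stranded on it no matter which algorithm runs. In this particular example BestFit avoids that node by preferring the capacity-4 node as the tight fit, but if only the capacity-5 node remained, the leftover would be unavoidable.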
| metadata: | ||
| annotations: | ||
| kueue.x-k8s.io/podset-slice-required-topology: cloud.provider.com/topology-host | ||
| kueue.x-k8s.io/podset-slice-size: 4 |
What if this annotation kueue.x-k8s.io/podset-slice-size is not provided by user?
Do we:
- return error from validation webhook
- default it implicitly?
- default it explicitly in our Jobset webhook?
It seems setting it to anything different than "parallelism" in the case of JobSet will be very exceptional. So, I would lean towards defaulting the annotation to parallelism when kueue.x-k8s.io/podset-slice-required-topology is set. Leaving this to be set every time by the user makes room for error.
I think it's a good idea to provide the default for a JobSet. Right now in this PR: #5353 I'm assuming that the default is 1 pod (because technically TAS scheduling without slice topology is TAS scheduling with slice topology of the lowest topology level with size of 1 pod).
From the validation point of view I would say we should assume a default value of parallelism if it's JobSet, but return an error if it's something else (of course assuming the slice topology is requested).
So, IIUC, we throw an error for all CRDs, requiring kueue.x-k8s.io/podset-slice-size to be set explicitly, except for JobSet. For JobSet we have the sensible default: parallelism.
We will decide during implementation if we default it implicitly or explicitly.
Added a sentence to clarify that for JobSet we will introduce a default of parallelism. Added also a separate section about validation of podset slice size.
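To illustrate the defaulting discussed above, here is a minimal, hedged JobSet sketch (the JobSet name, queue name, image, and resource requests are illustrative; the annotation key and topology level come from the KEP examples in this PR). Only the slice topology annotation is set, and the idea is that Kueue's JobSet webhook would default `kueue.x-k8s.io/podset-slice-size` to the Job's `parallelism`:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: slice-default-example              # illustrative name
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # assumed local queue
spec:
  replicatedJobs:
  - name: workers
    replicas: 2
    template:
      spec:
        parallelism: 4
        completions: 4
        template:
          metadata:
            annotations:
              # Only the slice topology is requested; kueue.x-k8s.io/podset-slice-size
              # is intentionally omitted. Per the discussion above, the JobSet webhook
              # would default it to the Job's parallelism (4 here); for other CRDs a
              # validation error would be returned instead.
              kueue.x-k8s.io/podset-slice-required-topology: cloud.provider.com/topology-host
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.k8s.io/pause:3.9  # placeholder image
              resources:
                requests:
                  cpu: "1"
```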
| annotations: | ||
| kueue.x-k8s.io/podset-preferred-topology: cloud.provider.com/topology-block | ||
| kueue.x-k8s.io/podset-slice-required-topology: cloud.provider.com/topology-host | ||
| kueue.x-k8s.io/podset-slice-size: 4 |
Same question about handling JobSets without this annotation, but with kueue.x-k8s.io/podset-slice-required-topology
Added a sentence to clarify that for JobSet we will introduce a default of parallelism. Added also a separate section about validation of podset slice size.
mimowo
left a comment
LGTM overall. I just have a question to clarify about handling the case without "kueue.x-k8s.io/podset-slice-size".
tenzen-y
left a comment
Could you update the Story 2 Notes, since we are working on this as part of the Alpha stage?
Note: not planned for [Alpha](https://github.com/kubernetes-sigs/kueue/tree/main/keps/2724-topology-aware-scheduling#alpha), to be evaluated for [Beta](https://github.com/kubernetes-sigs/kueue/tree/main/keps/2724-topology-aware-scheduling#beta).
| In this example there will be 8 (2 ReplicatedJob instances with 4 | ||
| "completions" each) worker pods and we say that we want to split |
What if completions and parallelism are different values?
Do you want to say 2 ReplicatedJob instances with 4 "parallelism" each?
Yes, thanks for spotting that!
| Since slice size is equal to "completions" for the Job, the end result | ||
| is that each Job will be placed within a "host". | ||
| For now PodSet Slice topology has to be required (cannot be "preferred"). |
What do you mean by "for now"? Does this mean "preferred" will be supported in the Beta stage?
That's not something we've planned, so I think it's just a poor choice of words that suggests it is planned. I'll change it to just state that, without any assumption.
We don't currently have any use-cases going beyond "required". I would suggest indeed dropping "For now", because it invites questions. When we have such use-cases we will need to update the KEP anyway.
Thank you for the explanation. Could you add it to the Non-Goals section?
Once we decide to support it, we can remove it.
Sure. I've added it there.
| // PodSetSliceRequiredTopology indicates the topology level required by the PodSet slice, as | ||
| // indicated by the `kueue.x-k8s.io/podset-slice-required-topology` annotation. | ||
| // | ||
| // +optional | ||
| PodSetSliceRequiredTopology *string `json:"podSetSliceRequiredTopology,omitempty"` | ||
| // PodSetSliceSize indicates the size of a subgroup of pods in a PodSet for which | ||
| // Kueue finds a requested topology domain on a level defined | ||
| // in `kueue.x-k8s.io/podset-slice-required-topology` annotation. | ||
| // | ||
| // +optional | ||
| PodSetSliceSize *int32 `json:"podSetSliceSize,omitempty"` |
topologyRequest:
  podSetSlice:
    requiredTopology: xxx
    size: yyy
    preferred: zzz  # not included in this proposal; left for future expansion

Do we need to consider future API expansion?
@mimowo Do you have any plans to expand the PodSet Slice capability in TAS?
@mimowo Do you have any plans to expand the PodSet Slice capability in TAS?
No, the current API proposal is already covering all use-cases we currently have regarding PodSet slices.
There might be future extensions, which are hard to predict, as is typical with the future :).
Personally, I think the flat structure, as in the current KEP, is OK for simplicity.
Alright. In that case, let's keep the current form.
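For illustration, here is a hedged sketch of how the flat fields from the quoted API snippet could surface on a Workload's PodSet `topologyRequest`, reusing the topology level and slice size from the annotation example above (the Workload name, PodSet name, and count are illustrative, and the pod template is omitted):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: jobset-workers-example       # illustrative name
spec:
  podSets:
  - name: workers                    # illustrative PodSet name
    count: 8                         # e.g. 2 replicas x parallelism 4
    topologyRequest:
      preferred: cloud.provider.com/topology-block
      # New flat fields proposed in this KEP update:
      podSetSliceRequiredTopology: cloud.provider.com/topology-host
      podSetSliceSize: 4
    # template: (pod template omitted in this sketch)
```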
| mode. See the table below where the numbers are pods assigned to a particular node. | ||
| The headers of the columns dedicated to nodes correspond to each node's initial capacity. | ||
| | mode | PodSet size | 6 | 5 | 4 | 3 | 2 | |
What does PodSet size mean? PodSet count? PodSet Slice size?
It's the number of pods in the PodSet. To make it more readable, I have a few ideas for how to rephrase it:
- PodSet count - consistent with the code I guess
- Pods count
- Pods count in PodSet
- Number of pods in PodSet
What do you think suits it best?
IIUC, is it calculated by parallelism (Job) × count (PodSet)? If yes, let's say Pods count.
The PodSet count indicates .spec.podSet[*].count, I think.
If I'm not mistaken, PodSet count is parallelism x replicas (ReplicatedJob), at least that's the whole PodSet.
In the case of JobSet, replicated Job replicas corresponds to the PodSet count.
So, in case of JobSet, PodSet count (.spec.podSet[*].count) is replicatedJob replicas (.spec.template.spec.replicatedJobs[*].replicas).
I'm a new contributor, so I might be mistaken, but looking here https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/jobset/jobset_controller.go#L232, we multiply parallelism/completions and replicas
Nevertheless, I suggest that I'll just rename that column to "Pods count", so it's more obvious. WDYT?
I'm a new contributor, so I might be mistaken, but looking here https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/jobset/jobset_controller.go#L232, we multiply parallelism/completions and replicas
Yes, that's right. My yaml paths above were not correct. Let me show them more accurately.
I wanted to show the following (see the sketch after this list):
- PodSet Count: `.spec.podSet[*].count` (Workload)
- Pod Count: `.spec.replicatedJobs[*].replicas` (JobSet) × `.spec.replicatedJobs[*].template.spec.parallelism`
- Pod Count, more generically: `.spec.podSet[*].count` (Workload) × `.spec.replicatedJobs[*].template.spec.parallelism`
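As a concrete illustration of the formula above (a hedged sketch showing only the fields that matter for the computation), a JobSet with 2 replicas of a ReplicatedJob, each running 4 pods in parallel, gives a Pods count of 2 × 4 = 8, matching the "8 worker pods" example quoted earlier in this review:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
spec:
  replicatedJobs:
  - name: workers
    replicas: 2            # ReplicatedJob replicas
    template:
      spec:
        parallelism: 4     # pods running in parallel per Job
        completions: 4
        # (pod template omitted in this sketch)
# Pods count = replicas x parallelism = 2 x 4 = 8
```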
Nevertheless, I suggest that I'll just rename that column to "Pods count", so it's more obvious. WDYT?
If this indicates my above definition, "Pods count" would be better.
Yup. Replaced it with Pods count
| Explanation: | ||
| - `BestFit` - We prioritized 3rd node over 2nd node because 3rd node was a tight fit among all domains that could fit 2 slices. The last domain has been "optimized" to find the tight fit. | ||
| - `MostFreeCapacity` - We prioritized 3rd node over 2nd node because 3rd node was a tight fit among all domains that could fit 2 slices, so we assigned as many pods as possible there and filled it entirely. |
Uhm, it sounds weird. It looks like this violates the MostFree strategy.
Why do we need to violate it only for MostFreeCapacity?
That's a great question. I was thinking about sorting domains based on state, as that feels intuitive for the "MostFree" strategy. However, @mimowo and I have discussed that this mode is deprecated and we do not really want to invest time in it, so this is the result of just using the "BestFit" logic without optimizing the last domain.
Do you think we should just apply sorting by "state" to the MostFree mode?
Yes, and in fact I would like to drop MostFreeCapacity in 0.13. It was implemented first, but we quickly realized BestFit is better in all known cases; we introduced the MostFreeCapacity feature gate just as a bailout option in case there is a bug in BestFit. It has been deprecated since 0.11. It makes life harder, so I would suggest sending a preparatory or follow-up PR to drop it.
wdyt @tenzen-y ?
btw. we would like to keep maintaining LeastFreeCapacity for a bit, because there might be use-cases for reducing fragmentation.
@lchrzaszcz I think the current algorithm looks like a mixed strategy between MostFree and LeastFree as we can see in your example: | MostFreeCapacity | 12 | 6 | 2 | 4 | . | . |
If we want to keep maintaining the MostFreeCapacity strategy, we should not violate it in the case of Two-Level scheduling either. However, as @mimowo mentioned above, he wants to drop support for the MostFreeCapacity strategy. So, let's delete the strategy first, before we implement the Two-Level scheduling feature.
@mimowo Yes, that makes sense. As we discussed in TAS profiles KEP, we should drop those step by step.
Surely, we should remove the MostFreeCapacity TAS profile first.
For LeastFreeCapacity, let us consider how to grow the maturity and support TAS profiles in a future issue and KEP.
@lchrzaszcz @mimowo I'm ok with refining MostFreeCapacity description when we delete the TAS profile.
In other words, we can move all MostFreeCapacity descriptions to the Alternative section.
I've removed MostFreeCapacity from two-level scheduling descriptions (this PR). In the other PR (#5536) I'm moving MostFreeCapacity documentation to Alternatives section.
Thank you 👍 This is a great extension to TAS capabilities. I'm pretty sure some questions will pop up as we go (as always 😄). /lgtm
LGTM label has been added. Git tree hash: 50634042ff1e05fcee89777c523205bcc9710f04
tenzen-y
left a comment
Awesome!
Thank you!
/lgtm
/approve
LGTM label has been added. Git tree hash: e45813b9fb7e9fddede8c2717c7c4b82d1321e0c
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lchrzaszcz, tenzen-y. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@lchrzaszcz could you fix CI with |
Once you perform |
Oh, right, sorry, I thought it got merged already and it failed on the main branch. I've just updated the TOC but it resulted in no changes. I'll just try to rerun that CI. If it fails again, it might be the case that the merge with the current main makes it fail (if CI merges with main before tests); I'll rebase it then.
mimowo
left a comment
/lgtm
LGTM label has been added. Git tree hash: d6feb784e214e7277430da87b398c01985c63e09
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR updates the Topology-Aware Scheduling KEP to introduce Two-Level Scheduling and the PodSet Chunk topology feature.
KEP for #5439
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?