Two level TAS scheduling / PodSet chunk topology #5353
Conversation
Hi @lchrzaszcz. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
/ok-to-test

/test all
@lchrzaszcz @mimowo Could you extend the TAS KEP? Or do we need a new KEP? I am not fully sure what your objectives are or where you are going with this.
/test pull-kueue-verify-main

/test pull-kueue-test-unit-main

/test pull-kueue-test-integration-baseline-main

/test pull-kueue-test-e2e-tas-main

/test pull-kueue-verify-main

/retest
pkg/cache/tas_flavor_snapshot.go
Outdated
lowerFitDomains := s.lowerLevelDomains(currFitDomain)
sortedLowerDomains := s.sortedDomains(lowerFitDomains, unconstrained)
currFitDomain = s.updateCountsToMinimum(sortedLowerDomains, count, unconstrained)
if levelIdx < chunkLevelIdx {
In the previous algorithm there is counterintuitive behavior with preferred topologies. Suppose there are two domains, b1 and b2. b1 contains 3 hosts (x1, x2, x3), each with a capacity of 3 pods. b2 contains a single host x4 with a capacity of 6. We want to schedule 12 pods. Clearly they cannot fit in a single block, so in phase 2a of the algorithm we "assign" 9 pods to b1 and 3 pods to b2. However, in phase 2b we list all child domains (x1, x2, x3, x4), sort them, and greedily assign pods to those domains, which results in 6 pods in x4, 3 in x1 and 3 in x2, i.e. 6 pods in b1 and 6 pods in b2. This means our "BestFit" algorithm does not always fill domains in a tight-fit fashion.
After introducing chunks I'm doing the same, assuming that's the behavior we would like to stick to. However, on levels below chunks we have to be precise in pod assignment, so that's why I'm iterating through each domain and its children separately.
I wrote a test for that behavior: block preferred for podset; host required for chunks; 2 blocks with unbalanced subdomains; BestFit
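To make the effect concrete, here is a minimal, self-contained sketch of the greedy phase-2b behavior described above; it is not the Kueue implementation, and all names (host, perBlock) are made up for illustration. It assumes Go 1.21+ for the built-in min.

```go
package main

import (
	"fmt"
	"sort"
)

// host is a made-up stand-in for a lowest-level domain.
type host struct {
	name     string
	block    string
	capacity int32
}

func main() {
	hosts := []host{
		{"x1", "b1", 3}, {"x2", "b1", 3}, {"x3", "b1", 3}, // block b1: 9 pods total
		{"x4", "b2", 6}, // block b2: 6 pods total
	}
	// Phase 2b as described above: sort all hosts of the chosen blocks by
	// capacity and fill them greedily, ignoring the 9/3 split picked at the
	// block level in phase 2a.
	sort.Slice(hosts, func(i, j int) bool { return hosts[i].capacity > hosts[j].capacity })

	remaining := int32(12)
	perBlock := map[string]int32{}
	for _, h := range hosts {
		if remaining == 0 {
			break
		}
		assigned := min(h.capacity, remaining)
		perBlock[h.block] += assigned
		remaining -= assigned
	}
	fmt.Println(perBlock) // map[b1:6 b2:6], not the tight 9/3 fit
}
```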
pkg/cache/tas_cache_test.go
Outdated
},
enableFeatureGates: []featuregate.Feature{features.TASProfileLeastFreeCapacity},
},
"block preferred for podset; host required for chunks; 2 blocks with unbalanced subdomains; BestFit": {
This is a test that shows the counter-intuitive behavior of "forgetting" the tight fit from higher domains and greedily assigning pods on each level, regardless of the number of pods assigned at the higher level. Details are described in the other comment.
pkg/cache/tas_flavor_snapshot.go
Outdated
// if this is where slices topology is requested then we calculate the number of slices
// that can fit. Otherwise we assign 0 and this value won't be used.
if (len(s.levelKeys) - 1) == sliceLevelIdx {
Actually, this check and the comment really made me wonder what is going on.
I checked locally and all tests pass if we replace it all with leaf.sliceState = leaf.state / sliceSize.
Do we need that if? If so, we need some test to prove it is worth having.
We don't need it. A similar check (and the most important one) is in the fillInCountsHelper method:

	for _, child := range domain.children {
		addChildrenCapacity, addChildrenSliceCapacity := s.fillInCountsHelper(child, sliceSize, sliceLevelIdx, level+1)
		childrenCapacity += addChildrenCapacity
		sliceCapacity += addChildrenSliceCapacity
	}
	domain.state = childrenCapacity
	if level == sliceLevelIdx {
		sliceCapacity = domain.state / sliceSize
	}
	domain.sliceState = sliceCapacity

So even if you get rid of the "if", it'll work. It'll just assign gibberish numbers to lower-level domains, and then we ignore them entirely in the method I mentioned. By gibberish I mean it: if the user requested slices to be scheduled on the rack level, then saying that we can fit 2 slices in a host makes no sense, because that's not where we want to assign slices. In other words, if we have a rack with 2 hosts, each of which can fit 3 pods, and we're trying to fit a podset of 6 pods with a slice size of 2 (so 3 slices in total), we would state that each host has a sliceState of 1. But on the rack level we cannot just sum up the sliceStates of the hosts, because we can clearly fit all 3 slices in the rack (one slice will just have 1 pod on the 1st host and 1 pod on the 2nd host). That's why calculating sliceState for domains below the slice-requested level makes no sense, and that's why I thought I'd just assign 0 to those domains, so that anyone debugging the algorithm doesn't try to interpret those numbers in any way.
Either way the algorithm will work just fine, so it's just a matter of readability. If you think this "if" is confusing, I'm happy to just remove it.
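As a small illustration of the arithmetic in the example above (made-up variable names, not the Kueue code): summing per-host slice counts undercounts what the rack can really hold when slices are requested at the rack level, which is why those lower-level numbers are better left at 0.

```go
package main

import "fmt"

func main() {
	const sliceSize int32 = 2
	hostStates := []int32{3, 3} // a rack with 2 hosts, each fits 3 pods

	var rackState, summedHostSlices int32
	for _, state := range hostStates {
		rackState += state
		summedHostSlices += state / sliceSize // per-host "sliceState"
	}
	fmt.Println(summedHostSlices)      // 2: misleading if propagated upwards
	fmt.Println(rackState / sliceSize) // 3: the rack really fits 3 slices of 2 pods
}
```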
I see, thanks for clarifying. I understand now.
However, it makes me wonder whether this will still be tricky for others. To simplify we could:
- move the computation entirely to fillInCountsHelper; or
- improve the comment
I would lean towards (1), but if you prefer (2), then maybe:
// We only compute the number of slices for leaves if the requested level for slices
// indicates the lowest level. Otherwise the information would be ignored by the
// algorithm anyway.
I've moved this part into leaf logic inside fillInCountsHelper. Good idea!
pkg/cache/tas_flavor_snapshot.go
Outdated
}

required := isRequired(tasPodSetRequests.PodSet.TopologyRequest)
nit, revert formatting
pkg/cache/tas_flavor_snapshot.go
Outdated
sliceTopologyKey := s.sliceLevelKeyWithDefault(&tasPodSetRequests, ptr.To(s.lowestLevel()))

unconstrained := isUnconstrained(tasPodSetRequests.PodSet.TopologyRequest, &tasPodSetRequests)
nit, revert formatting
mimowo
left a comment
LGTM overall. This is a really great extension to TAS 👍
}
domain.state = childrenCapacity
return childrenCapacity
if level == sliceLevelIdx {
Suggested change:
-	if level == sliceLevelIdx {
+	if level <= sliceLevelIdx {
To avoid filling sliceState with unnecessary non-zero values as described in https://github.com/kubernetes-sigs/kueue/pull/5353/files/4bfd8844c197791804491904ff032376d2f83975..904b7a02548624789bf9c8942a9f7834cbc90fef#r2142562024
wdyt?
That's a tricky part. There are two ways of calculating sliceState:
- sum the children's sliceState
- state / sliceSize

As an example, let's assume we have 4 topology levels: zone, block, rack, hostname. We request slices on the block level. We start by assigning 0 as sliceState to all "hostname" domains (there is special logic for leaves). Then we iterate through racks and sum the sliceStates of their children: we get 0, because the children have 0 everywhere, so racks also have sliceState 0. Then we iterate through blocks, but that's the slice-requested level, so we switch to the 2nd way of calculating sliceState and use state / sliceSize. Then we iterate through zones and go back to the 1st method of summing children's sliceStates, but the children no longer all have sliceState 0, so we propagate sliceState upwards.
If we change that "==" to "<=" we start using the 2nd method of calculating sliceState for all domains above the slice level, which will override whatever sliceState we can actually fit on the slice-requested level. This will cause bugs similar to the following example:
- sliceSize = 3 pods
- Number of pods = 6 (so 2 slices)
- Slice level = host
- A rack has 3 hosts. Each host can fit 2 pods
- So the rack has state = 6, but sliceState = 0, because no host can fit a single slice, so we should noFit this workload
- If we change "==" to "<=" we will calculate the rack's sliceState = state / sliceSize = 2, so the algorithm will say that we can fit that workload
I'll write a test case for that as it seems it's easy to break.
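Here is a minimal sketch of that failure mode with assumed types and names; it is not the actual fillInCountsHelper, just a toy recursion that reproduces the "==" vs "<=" difference on the example above.

```go
package main

import "fmt"

// domain is a made-up stand-in for the snapshot domains discussed above.
type domain struct {
	state      int32
	sliceState int32
	children   []*domain
}

// fill computes state and sliceState bottom-up. Slices are only counted at
// sliceLevelIdx; above it, a parent just sums its children's sliceState.
// useLTE switches the check from "==" to "<=" to reproduce the bug.
func fill(d *domain, sliceSize int32, sliceLevelIdx, level int, useLTE bool) {
	if len(d.children) == 0 {
		if level == sliceLevelIdx {
			d.sliceState = d.state / sliceSize
		}
		return
	}
	d.state, d.sliceState = 0, 0
	for _, c := range d.children {
		fill(c, sliceSize, sliceLevelIdx, level+1, useLTE)
		d.state += c.state
		d.sliceState += c.sliceState
	}
	if (useLTE && level <= sliceLevelIdx) || (!useLTE && level == sliceLevelIdx) {
		d.sliceState = d.state / sliceSize
	}
}

func main() {
	// A rack (level 0) with 3 hosts (level 1), each able to fit 2 pods;
	// sliceSize = 3, slices requested on the host level.
	newRack := func() *domain {
		return &domain{children: []*domain{{state: 2}, {state: 2}, {state: 2}}}
	}
	const sliceSize, sliceLevelIdx = int32(3), 1

	withEq := newRack()
	fill(withEq, sliceSize, sliceLevelIdx, 0, false)
	fmt.Println(withEq.sliceState) // 0: no host fits a 3-pod slice, so the workload should not fit

	withLte := newRack()
	fill(withLte, sliceSize, sliceLevelIdx, 0, true)
	fmt.Println(withLte.sliceState) // 2: "<=" overrides the children and wrongly reports room for 2 slices
}
```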
Thank you for explaining. Since this is tricky and easy to break, as I just did without breaking any tests, +1 for the unit test. It can be a follow-up.
Got it. I'll add that test in a follow-up then.
/lgtm

LGTM label has been added. Git tree hash: 54b639b9db5a438b64cc7ec3ac84db00070c8ff2
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lchrzaszcz, mimowo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/kind feature

/release-note-edit
What type of PR is this?
/kind feature
What this PR does / why we need it:
Currently Topology-Aware Scheduling considers the whole PodSet as a single unit for which we look for a matching domain. However, there are use cases in which we would like to allow the user to define a sub-PodSet topology and two-level topologies (schedule the whole PodSet in one higher-level topology and schedule each subset of the PodSet in another, lower-level topology).
An example is a JobSet in which the user can define a topology for all Jobs resulting from a single ReplicatedJob. We want to allow the user to define a topology for each such Job and possibly co-locate all such Jobs in a higher-level topology.
This PR introduces two-level scheduling, allowing the user to define a required topology for PodSet chunks and a required or preferred topology for the whole PodSet. We also allow the user to define the size of a chunk.
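To make the shape of such a request concrete, here is a hedged sketch. The struct and field names (chunkedTopologyRequest, podSetPreferred, chunkRequired, chunkSize) and the topology level keys are illustrative assumptions, not the API introduced by this PR.

```go
// Hypothetical illustration only; the names below are made up and do not
// reflect the actual Kueue API.
package main

import "fmt"

// chunkedTopologyRequest sketches a two-level request: a preferred (or
// required) topology for the whole PodSet plus a required topology and a
// size for each chunk of the PodSet.
type chunkedTopologyRequest struct {
	podSetPreferred string // best-effort level for the whole PodSet
	chunkRequired   string // hard requirement for every chunk
	chunkSize       int32  // number of pods per chunk
}

func main() {
	// Example: a JobSet ReplicatedJob with 4 Jobs of 8 pods each maps to a
	// 32-pod PodSet split into 8-pod chunks; prefer one block for everything,
	// but always keep each Job's pods within a single rack.
	req := chunkedTopologyRequest{
		podSetPreferred: "cloud.provider.com/topology-block",
		chunkRequired:   "cloud.provider.com/topology-rack",
		chunkSize:       8,
	}
	fmt.Printf("podset: prefer %q; each chunk of %d pods requires %q\n",
		req.podSetPreferred, req.chunkSize, req.chunkRequired)
}
```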
Which issue(s) this PR fixes:
Fixes #5439
Special notes for your reviewer:
There are follow-up PRs planned:
Does this PR introduce a user-facing change?