Conversation

@lchrzaszcz (Contributor) commented May 26, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, Topology-Aware Scheduling treats the whole PodSet as a single unit for which we look for a matching domain. However, there are use cases in which we would like to allow the user to define a sub-PodSet topology, i.e. a two-level topology (schedule the whole PodSet within one higher-level topology domain, and schedule each subset of the PodSet within another, lower-level topology domain).

An example is a JobSet, in which a user can define a topology for all Jobs resulting from a single ReplicatedJob. We want to allow the user to define a topology for each such Job and, optionally, co-locate all such Jobs within a higher-level topology domain.

This PR introduces two-level scheduling, allowing the user to define a required topology for PodSet chunks and a required or preferred topology for the whole PodSet. The user can also define the size of a chunk.
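
To make the shape of such a request concrete, here is a minimal Go sketch. The struct and field names (twoLevelTopologyRequest, PodSetPreferredTopology, ChunkRequiredTopology, ChunkSize) and the topology label values are hypothetical illustrations, not the API added by this PR:

```go
package main

import "fmt"

// twoLevelTopologyRequest is an illustrative stand-in only; the actual Kueue
// PodSet topology request fields may differ.
type twoLevelTopologyRequest struct {
	// Preferred (or required) topology level for the whole PodSet,
	// e.g. co-locate all pods within one block if possible.
	PodSetPreferredTopology string
	// Required topology level for each PodSet chunk,
	// e.g. every chunk must land within a single rack.
	ChunkRequiredTopology string
	// Number of pods that form one chunk.
	ChunkSize int32
}

func main() {
	// A JobSet-like example: each child Job (chunk of 4 pods) must fit in a
	// rack, and all Jobs should preferably share one block.
	req := twoLevelTopologyRequest{
		PodSetPreferredTopology: "cloud.provider.com/topology-block",
		ChunkRequiredTopology:   "cloud.provider.com/topology-rack",
		ChunkSize:               4,
	}
	fmt.Printf("%+v\n", req)
}
```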

Which issue(s) this PR fixes:

Fixes #5439

Special notes for your reviewer:

There are follow-up PRs planned:

  • Introduce PodSet-chunk-only TAS scheduling (without a topology requirement for the whole PodSet)
  • Robust validation and corner cases

Does this PR introduce a user-facing change?

TAS: Introduce two-level scheduling

@k8s-ci-robot added the do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, and cncf-cla: yes labels May 26, 2025
@k8s-ci-robot requested review from kannon92 and mimowo May 26, 2025 14:53
@k8s-ci-robot added the needs-ok-to-test label May 26, 2025
@k8s-ci-robot (Contributor)

Hi @lchrzaszcz. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XL label May 26, 2025
netlify bot commented May 26, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 6c5966a
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/684ad363afe32b0008cf2247

@k8s-ci-robot added the size/XXL label and removed the size/XL label May 27, 2025
@lchrzaszcz changed the title from "Two level TAS scheduling / PodSet chunk topology" to "WIP Two level TAS scheduling / PodSet chunk topology" May 28, 2025
@mimowo (Contributor) commented May 28, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label May 28, 2025
@mimowo (Contributor) commented May 28, 2025

/test all

@tenzen-y (Member)

@lchrzaszcz @mimowo Could you extend the TAS KEP? Or do we need a new KEP? I am not fully sure what your objectives are or where you are going with this.

https://github.com/kubernetes-sigs/kueue/tree/main/keps/2724-topology-aware-scheduling

@lchrzaszcz (Contributor, Author)

/test pull-kueue-verify-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-test-unit-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-test-integration-baseline-main
/test pull-kueue-verify-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-test-e2e-tas-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-verify-main

@lchrzaszcz (Contributor, Author)

/retest

lowerFitDomains := s.lowerLevelDomains(currFitDomain)
sortedLowerDomains := s.sortedDomains(lowerFitDomains, unconstrained)
currFitDomain = s.updateCountsToMinimum(sortedLowerDomains, count, unconstrained)
if levelIdx < chunkLevelIdx {
@lchrzaszcz (Contributor, Author) commented May 29, 2025

In the previous algorithm there is counterintuitive behavior with preferred topologies. Suppose there are two domains, b1 and b2: b1 contains 3 hosts (x1, x2, x3), each with a capacity of 3 pods, and b2 contains a single host x4 with a capacity of 6. We want to schedule 12 pods. Clearly they cannot fit in a single block, so in phase 2a of the algorithm we "assign" 9 pods to b1 and 3 pods to b2. However, in phase 2b we list all child domains (x1, x2, x3, x4), sort them, and greedily assign pods to those domains, which results in 6 pods on x4, 3 on x1, and 3 on x2. That means 6 pods in b1 and 6 pods in b2, so our "BestFit" algorithm does not always fill domains in a tight-fit fashion.

After introducing chunks I'm doing the same, assuming that's the behavior we want to stick to. However, on levels below the chunk level we have to be precise in the pod assignment, which is why I'm iterating through each domain and its children separately.

I wrote a test for that behavior: block preferred for podset; host required for chunks; 2 blocks with unbalanced subdomains; BestFit
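
For illustration only, here is a toy Go sketch of the greedy child-level pass described above (the capacities and the descending-capacity sort order are assumptions taken from this example; this is not the Kueue BestFit implementation):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Toy capacities from the example: b1 = {x1, x2, x3} with 3 pods each, b2 = {x4} with 6 pods.
	capacity := map[string]int{"x1": 3, "x2": 3, "x3": 3, "x4": 6}
	block := map[string]string{"x1": "b1", "x2": "b1", "x3": "b1", "x4": "b2"}

	hosts := []string{"x1", "x2", "x3", "x4"}
	// Phase 2b sorts all child domains together (assumed: largest capacity first).
	sort.Slice(hosts, func(i, j int) bool { return capacity[hosts[i]] > capacity[hosts[j]] })

	pods := 12
	perBlock := map[string]int{}
	for _, h := range hosts {
		n := capacity[h]
		if n > pods {
			n = pods
		}
		perBlock[block[h]] += n
		pods -= n
	}
	// Prints map[b1:6 b2:6], not the 9/3 split chosen one level up in phase 2a.
	fmt.Println(perBlock)
}
```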

},
enableFeatureGates: []featuregate.Feature{features.TASProfileLeastFreeCapacity},
},
"block preferred for podset; host required for chunks; 2 blocks with unbalanced subdomains; BestFit": {
@lchrzaszcz (Contributor, Author)

This test demonstrates the counterintuitive behavior of "forgetting" the tight fit from higher domains and greedily assigning pods on each level, regardless of how many pods were assigned at the higher level. Details are described in the other comment.

@lchrzaszcz marked this pull request as ready for review May 29, 2025 17:24
@k8s-ci-robot requested a review from gabesaba May 29, 2025 17:24
@lchrzaszcz changed the title from "WIP Two level TAS scheduling / PodSet chunk topology" to "Two level TAS scheduling / PodSet chunk topology" Jun 2, 2025
@k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 2, 2025

// if this is where slices topology is requested then we calculate the number of slices
// that can fit. Otherwise we assign 0 and this value won't be used.
if (len(s.levelKeys) - 1) == sliceLevelIdx {
Contributor

Actually, this check and the comment really made me wonder what is going on.

I checked locally and all tests pass if we replace it all with leaf.sliceState = leaf.state / sliceSize.

Do we need that if? If so, we need a test to prove it is worth having.

@lchrzaszcz (Contributor, Author)

We don't need it. A similar check (and the most important one) is in the fillInCountsHelper method:

for _, child := range domain.children {
	addChildrenCapacity, addChildrenSliceCapacity := s.fillInCountsHelper(child, sliceSize, sliceLevelIdx, level+1)
	childrenCapacity += addChildrenCapacity
	sliceCapacity += addChildrenSliceCapacity
}
domain.state = childrenCapacity
if level == sliceLevelIdx {
	sliceCapacity = domain.state / sliceSize
}
domain.sliceState = sliceCapacity

So even if you get rid of the "if", it'll work. It'll just assign meaningless numbers to lower-level domains, which we then ignore entirely in the method I mentioned. By meaningless I mean it: if the user requested slices to be scheduled at the rack level, then saying that a host can fit some number of slices makes no sense, because that's not where we want to assign slices. For example, take a rack with 2 hosts, each able to fit 3 pods, and a PodSet with 6 pods and a slice size of 2 (so 3 slices in total). We would state that each host has a sliceState of 1, but at the rack level we cannot just sum up the sliceStates of the hosts, because we can clearly fit all 3 slices in the rack (one slice will simply have 1 pod on the 1st host and 1 pod on the 2nd host). That's why calculating sliceState for domains below the slice-requested level makes no sense, and why I thought I'd just assign 0 to those domains, so that anyone debugging the algorithm does not try to interpret those numbers in any way.

Either way the algorithm will work just fine, so it's just a matter of readability. If you think this "if" is confusing, I'm happy to just remove it.
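
For the rack/host example above, a minimal Go sketch of the arithmetic (toy numbers, not Kueue code) showing why sliceState cannot be obtained by summing per-host values below the slice-requested level:

```go
package main

import "fmt"

func main() {
	// Toy numbers from the comment: a rack with 2 hosts, each fitting 3 pods; slice size 2.
	hostState := []int{3, 3}
	sliceSize := 2

	rackState := 0
	sumOfHostSlices := 0
	for _, s := range hostState {
		rackState += s
		sumOfHostSlices += s / sliceSize // each host fits only 1 whole slice
	}

	fmt.Println(sumOfHostSlices)       // 2: summing per-host slice counts undercounts
	fmt.Println(rackState / sliceSize) // 3: slices must be counted at the requested (rack) level
}
```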

Contributor

I see, thanks for clarifying. I understand now.

However, it makes me wonder whether this will still be tricky for others. To simplify, we could:

  1. move the computation entirely to fillInCountsHelper; or
  2. improve the comment

Contributor

I would lean towards (1), but if you prefer (2), then maybe:

// We only compute the number of slices for leaves if the requested level for slices
// indicates the lowest level. Otherwise the information would be ignored by the 
// algorithm anyway.

@lchrzaszcz (Contributor, Author)

I've moved this part into the leaf logic inside fillInCountsHelper. Good idea!

}

required := isRequired(tasPodSetRequests.PodSet.TopologyRequest)

Contributor

nit, revert formatting

sliceTopologyKey := s.sliceLevelKeyWithDefault(&tasPodSetRequests, ptr.To(s.lowestLevel()))

unconstrained := isUnconstrained(tasPodSetRequests.PodSet.TopologyRequest, &tasPodSetRequests)

Contributor

nit, revert formatting

@mimowo (Contributor) left a comment

LGTM overall. This is a really great extension to TAS 👍

@k8s-ci-robot removed the lgtm label Jun 12, 2025
}
domain.state = childrenCapacity
return childrenCapacity
if level == sliceLevelIdx {
Contributor

Suggested change:
- if level == sliceLevelIdx {
+ if level <= sliceLevelIdx {

To avoid filling sliceState with unnecessary non-zero values as described in https://github.com/kubernetes-sigs/kueue/pull/5353/files/4bfd8844c197791804491904ff032376d2f83975..904b7a02548624789bf9c8942a9f7834cbc90fef#r2142562024

wdyt?

@lchrzaszcz (Contributor, Author)

That's the tricky part. There are two ways of calculating sliceState:

  • Sum children's sliceState
  • state / sliceSize

As an example, let's assume we have 4 topology levels: zone, block, rack, hostname. We request slices at the block level. We start by assigning a sliceState of 0 to all "hostname" domains (there is special logic for leaves). Then we iterate through racks and sum the sliceStates of their children; we get 0, because all children have 0, so racks also have sliceState 0. Then we iterate through blocks, but that's the slice-requested level, so we switch to the 2nd way of calculating sliceState and use state / sliceSize. Then we iterate through zones and go back to the 1st method of summing the children's sliceStates, but the children no longer all have sliceState 0, so we propagate the sliceState upwards.

If we change that "==" to "<=", we start using the 2nd method of calculating sliceState for all domains above the slice level, which overrides the actual sliceState computed at the slice-requested level. This will cause bugs like the following example:

  • SliceSize = 3 pods
  • Number of pods = 6 (so 2 slices)
  • Slice level = host
  • A rack has 3 hosts; each host can fit 2 pods
  • So the rack has state = 6, but sliceState = 0, because no host can fit a single slice, so we should report noFit for this workload
  • If we change "==" to "<=", we will calculate the rack's sliceState = state / sliceSize = 2, so the algorithm will say that we can fit the workload.

I'll write a test case for that, as it seems easy to break.
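
To make the "==" vs "<=" difference concrete, here is a minimal Go sketch using the numbers above (toy types, not the actual fillInCountsHelper; the domain struct and fill function are illustrative assumptions):

```go
package main

import "fmt"

// domain is a toy stand-in for a TAS topology domain (assumption, not the real type).
type domain struct {
	state, sliceState int
	children          []*domain
}

// fill computes state and sliceState bottom-up. Slices are counted only at
// sliceLevelIdx; above it, children's sliceStates are summed.
func fill(d *domain, level, sliceLevelIdx, sliceSize int) {
	if len(d.children) == 0 {
		if level == sliceLevelIdx {
			d.sliceState = d.state / sliceSize
		}
		return
	}
	d.state, d.sliceState = 0, 0
	for _, c := range d.children {
		fill(c, level+1, sliceLevelIdx, sliceSize)
		d.state += c.state
		d.sliceState += c.sliceState
	}
	if level == sliceLevelIdx { // changing this to <= would report 6/3 = 2 slices for the rack
		d.sliceState = d.state / sliceSize
	}
}

func main() {
	// A rack with 3 hosts, each fitting 2 pods; slices of 3 pods requested at the host level.
	rack := &domain{children: []*domain{{state: 2}, {state: 2}, {state: 2}}}
	fill(rack, 0, 1, 3) // levels: 0 = rack, 1 = host; slice size 3
	fmt.Println(rack.state, rack.sliceState) // 6 0: no host fits a whole slice, so noFit
}
```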

Contributor

Thank you for explaining. Since this is tricky and easy to break (as I just did, without breaking any tests), +1 for the unit test. It can be a follow-up.

@lchrzaszcz (Contributor, Author)

Got it. I'll add that test in a follow-up then.

@mimowo (Contributor) commented Jun 12, 2025

/lgtm
/approve
Nice, thank you for working relentlessly on addressing all comments 👍

@k8s-ci-robot added the lgtm label Jun 12, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 54b639b9db5a438b64cc7ec3ac84db00070c8ff2

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lchrzaszcz, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Jun 12, 2025
@k8s-ci-robot merged commit d81f629 into kubernetes-sigs:main Jun 12, 2025
23 checks passed
@k8s-ci-robot added this to the v0.13 milestone Jun 12, 2025
@mimowo (Contributor) commented Jul 7, 2025

/kind feature

@k8s-ci-robot added the kind/feature label Jul 7, 2025
@tenzen-y (Member)

/release-note-edit

TAS: Introduce two-level scheduling


Labels

approved, cncf-cla: yes, kind/feature, lgtm, ok-to-test, release-note, size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TAS: Two-level JobSet scheduling

6 participants