Conversation

@lchrzaszcz (Contributor) commented May 26, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, Topology-Aware Scheduling treats the whole PodSet as a single unit for which we look for a matching domain. However, there are use cases in which we would like to allow the user to define a sub-PodSet topology, i.e. a two-level topology (schedule the whole PodSet within one higher-level topology domain, and schedule each subset of the PodSet within another, lower-level topology domain).

An example is a JobSet, in which a user can define a topology for all Jobs resulting from a single ReplicatedJob. We want to allow the user to define a topology for each such Job and, optionally, co-locate all such Jobs within a higher-level topology domain.

This PR introduces two-level scheduling, allowing the user to define a required topology for PodSet chunks and a required or preferred topology for the whole PodSet. The user can also define the size of a chunk.
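
To make the shape of such a request concrete, here is a minimal Go sketch. The struct and field names (twoLevelTopologyRequest, PodSetPreferredTopology, ChunkRequiredTopology, ChunkSize) and the topology label values are hypothetical illustrations, not the API added by this PR:

```go
package main

import "fmt"

// twoLevelTopologyRequest is an illustrative stand-in only; the actual Kueue
// PodSet topology request fields may differ.
type twoLevelTopologyRequest struct {
	// Preferred (or required) topology level for the whole PodSet,
	// e.g. co-locate all pods within one block if possible.
	PodSetPreferredTopology string
	// Required topology level for each PodSet chunk,
	// e.g. every chunk must land within a single rack.
	ChunkRequiredTopology string
	// Number of pods that form one chunk.
	ChunkSize int32
}

func main() {
	// A JobSet-like example: each child Job (chunk of 4 pods) must fit in a
	// rack, and all Jobs should preferably share one block.
	req := twoLevelTopologyRequest{
		PodSetPreferredTopology: "cloud.provider.com/topology-block",
		ChunkRequiredTopology:   "cloud.provider.com/topology-rack",
		ChunkSize:               4,
	}
	fmt.Printf("%+v\n", req)
}
```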

Which issue(s) this PR fixes:

Fixes #5439

Special notes for your reviewer:

There are follow-up PRs planned:

  • Introduce PodSet-chunk-only TAS scheduling (without a topology requirement for the whole PodSet)
  • Robust validation and corner cases

Does this PR introduce a user-facing change?

TAS: Introduce two-level scheduling

@k8s-ci-robot added the do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, and cncf-cla: yes labels May 26, 2025
@k8s-ci-robot requested review from kannon92 and mimowo May 26, 2025 14:53
@k8s-ci-robot added the needs-ok-to-test label May 26, 2025
@k8s-ci-robot (Contributor)

Hi @lchrzaszcz. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XL label May 26, 2025
netlify bot commented May 26, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 6c5966a
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/684ad363afe32b0008cf2247

@k8s-ci-robot added the size/XXL label and removed the size/XL label May 27, 2025
@lchrzaszcz changed the title from "Two level TAS scheduling / PodSet chunk topology" to "WIP Two level TAS scheduling / PodSet chunk topology" May 28, 2025
@mimowo (Contributor) commented May 28, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label May 28, 2025
@mimowo (Contributor) commented May 28, 2025

/test all

@tenzen-y (Member)

@lchrzaszcz @mimowo Could you extend the TAS KEP? Or do we need a new KEP? I am not fully sure what your objectives are or where you are going with this.

https://github.com/kubernetes-sigs/kueue/tree/main/keps/2724-topology-aware-scheduling

@lchrzaszcz (Contributor, Author)

/test pull-kueue-verify-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-test-unit-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-test-integration-baseline-main
/test pull-kueue-verify-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-test-e2e-tas-main

@lchrzaszcz (Contributor, Author)

/test pull-kueue-verify-main

@lchrzaszcz (Contributor, Author)

/retest

lowerFitDomains := s.lowerLevelDomains(currFitDomain)
sortedLowerDomains := s.sortedDomains(lowerFitDomains, unconstrained)
currFitDomain = s.updateCountsToMinimum(sortedLowerDomains, count, unconstrained)
if levelIdx < chunkLevelIdx {
@lchrzaszcz (Contributor, Author) commented May 29, 2025

In the previous algorithm there is counterintuitive behavior with preferred topologies. Suppose there are two domains, b1 and b2: b1 contains 3 hosts (x1, x2, x3), each with a capacity of 3 pods, and b2 contains a single host x4 with a capacity of 6. We want to schedule 12 pods. Clearly they cannot fit in a single block, so in phase 2a of the algorithm we "assign" 9 pods to b1 and 3 pods to b2. However, in phase 2b we list all child domains (x1, x2, x3, x4), sort them, and greedily assign pods to those domains, which results in 6 pods on x4, 3 on x1, and 3 on x2. That means 6 pods in b1 and 6 pods in b2, so our "BestFit" algorithm does not always fill domains in a tight-fit fashion.

After introducing chunks I'm doing the same, assuming that's the behavior we want to stick to. However, on levels below the chunk level we have to be precise in the pod assignment, which is why I'm iterating through each domain and its children separately.

I wrote a test for that behavior: block preferred for podset; host required for chunks; 2 blocks with unbalanced subdomains; BestFit
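
For illustration only, here is a toy Go sketch of the greedy child-level pass described above (the capacities and the descending-capacity sort order are assumptions taken from this example; this is not the Kueue BestFit implementation):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Toy capacities from the example: b1 = {x1, x2, x3} with 3 pods each, b2 = {x4} with 6 pods.
	capacity := map[string]int{"x1": 3, "x2": 3, "x3": 3, "x4": 6}
	block := map[string]string{"x1": "b1", "x2": "b1", "x3": "b1", "x4": "b2"}

	hosts := []string{"x1", "x2", "x3", "x4"}
	// Phase 2b sorts all child domains together (assumed: largest capacity first).
	sort.Slice(hosts, func(i, j int) bool { return capacity[hosts[i]] > capacity[hosts[j]] })

	pods := 12
	perBlock := map[string]int{}
	for _, h := range hosts {
		n := capacity[h]
		if n > pods {
			n = pods
		}
		perBlock[block[h]] += n
		pods -= n
	}
	// Prints map[b1:6 b2:6], not the 9/3 split chosen one level up in phase 2a.
	fmt.Println(perBlock)
}
```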

},
enableFeatureGates: []featuregate.Feature{features.TASProfileLeastFreeCapacity},
},
"block preferred for podset; host required for chunks; 2 blocks with unbalanced subdomains; BestFit": {
@lchrzaszcz (Contributor, Author)

This test demonstrates the counterintuitive behavior of "forgetting" the tight fit from higher domains and greedily assigning pods on each level, regardless of how many pods were assigned at the higher level. Details are described in the other comment.

@lchrzaszcz marked this pull request as ready for review May 29, 2025 17:24
@k8s-ci-robot requested a review from gabesaba May 29, 2025 17:24
@lchrzaszcz changed the title from "WIP Two level TAS scheduling / PodSet chunk topology" to "Two level TAS scheduling / PodSet chunk topology" Jun 2, 2025
@k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 2, 2025

// if this is where slices topology is requested then we calculate the number of slices
// that can fit. Otherwise we assign 0 and this value won't be used.
if (len(s.levelKeys) - 1) == sliceLevelIdx {
Contributor

Actually, this check and the comment really made me wonder what is going on.

I checked locally and all tests pass if we replace it all with leaf.sliceState = leaf.state / sliceSize.

Do we need that if? If so, we need a test to prove it is worth having.

@lchrzaszcz (Contributor, Author)

We don't need it. A similar check (and the most important one) is in the fillInCountsHelper method:

for _, child := range domain.children {
	addChildrenCapacity, addChildrenSliceCapacity := s.fillInCountsHelper(child, sliceSize, sliceLevelIdx, level+1)
	childrenCapacity += addChildrenCapacity
	sliceCapacity += addChildrenSliceCapacity
}
domain.state = childrenCapacity
if level == sliceLevelIdx {
	sliceCapacity = domain.state / sliceSize
}
domain.sliceState = sliceCapacity

So even if you get rid of the "if", it'll work. It'll just assign meaningless numbers to lower-level domains, which we then ignore entirely in the method I mentioned. By meaningless I mean it: if the user requested slices to be scheduled at the rack level, then saying that a host can fit some number of slices makes no sense, because that's not where we want to assign slices. For example, take a rack with 2 hosts, each able to fit 3 pods, and a PodSet with 6 pods and a slice size of 2 (so 3 slices in total). We would state that each host has a sliceState of 1, but at the rack level we cannot just sum up the sliceStates of the hosts, because we can clearly fit all 3 slices in the rack (one slice will simply have 1 pod on the 1st host and 1 pod on the 2nd host). That's why calculating sliceState for domains below the slice-requested level makes no sense, and why I thought I'd just assign 0 to those domains, so that anyone debugging the algorithm does not try to interpret those numbers in any way.

Either way the algorithm will work just fine, so it's just a matter of readability. If you think this "if" is confusing, I'm happy to just remove it.
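
For the rack/host example above, a minimal Go sketch of the arithmetic (toy numbers, not Kueue code) showing why sliceState cannot be obtained by summing per-host values below the slice-requested level:

```go
package main

import "fmt"

func main() {
	// Toy numbers from the comment: a rack with 2 hosts, each fitting 3 pods; slice size 2.
	hostState := []int{3, 3}
	sliceSize := 2

	rackState := 0
	sumOfHostSlices := 0
	for _, s := range hostState {
		rackState += s
		sumOfHostSlices += s / sliceSize // each host fits only 1 whole slice
	}

	fmt.Println(sumOfHostSlices)       // 2: summing per-host slice counts undercounts
	fmt.Println(rackState / sliceSize) // 3: slices must be counted at the requested (rack) level
}
```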

Contributor

I see, thanks for clarifying. I understand now.

However, it makes me wonder whether this will still be tricky for others. To simplify, we could:

  1. move the computation entirely to fillInCountsHelper; or
  2. improve the comment

Contributor

I would lean towards (1), but if you prefer (2), then maybe:

// We only compute the number of slices for leaves if the requested level for slices
// indicates the lowest level. Otherwise the information would be ignored by the 
// algorithm anyway.

@lchrzaszcz (Contributor, Author)

I've moved this part into the leaf logic inside fillInCountsHelper. Good idea!

}

required := isRequired(tasPodSetRequests.PodSet.TopologyRequest)

Contributor

nit, revert formatting

sliceTopologyKey := s.sliceLevelKeyWithDefault(&tasPodSetRequests, ptr.To(s.lowestLevel()))

unconstrained := isUnconstrained(tasPodSetRequests.PodSet.TopologyRequest, &tasPodSetRequests)

Contributor

nit, revert formatting

@mimowo (Contributor) left a comment

LGTM overall. This is a really great extension to TAS 👍

@k8s-ci-robot removed the lgtm label Jun 12, 2025
}
domain.state = childrenCapacity
return childrenCapacity
if level == sliceLevelIdx {
Contributor

Suggested change:
- if level == sliceLevelIdx {
+ if level <= sliceLevelIdx {

To avoid filling sliceState with unnecessary non-zero values as described in https://github.com/kubernetes-sigs/kueue/pull/5353/files/4bfd8844c197791804491904ff032376d2f83975..904b7a02548624789bf9c8942a9f7834cbc90fef#r2142562024

wdyt?

@lchrzaszcz (Contributor, Author)

That's the tricky part. There are two ways of calculating sliceState:

  • Sum children's sliceState
  • state / sliceSize

As an example, let's assume we have 4 topology levels: zone, block, rack, hostname. We request slices at the block level. We start by assigning a sliceState of 0 to all "hostname" domains (there is special logic for leaves). Then we iterate through racks and sum the sliceStates of their children; we get 0, because all children have 0, so racks also have sliceState 0. Then we iterate through blocks, but that's the slice-requested level, so we switch to the 2nd way of calculating sliceState and use state / sliceSize. Then we iterate through zones and go back to the 1st method of summing the children's sliceStates, but the children no longer all have sliceState 0, so we propagate the sliceState upwards.

If we change that "==" to "<=", we start using the 2nd method of calculating sliceState for all domains above the slice level, which overrides the actual sliceState computed at the slice-requested level. This will cause bugs like the following example:

  • SliceSize = 3 pods
  • Number of pods = 6 (so 2 slices)
  • Slice level = host
  • A rack has 3 hosts; each host can fit 2 pods
  • So the rack has state = 6, but sliceState = 0, because no host can fit a single slice, so we should report noFit for this workload
  • If we change "==" to "<=", we will calculate the rack's sliceState = state / sliceSize = 2, so the algorithm will say that we can fit the workload.

I'll write a test case for that, as it seems easy to break.
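
To make the "==" vs "<=" difference concrete, here is a minimal Go sketch using the numbers above (toy types, not the actual fillInCountsHelper; the domain struct and fill function are illustrative assumptions):

```go
package main

import "fmt"

// domain is a toy stand-in for a TAS topology domain (assumption, not the real type).
type domain struct {
	state, sliceState int
	children          []*domain
}

// fill computes state and sliceState bottom-up. Slices are counted only at
// sliceLevelIdx; above it, children's sliceStates are summed.
func fill(d *domain, level, sliceLevelIdx, sliceSize int) {
	if len(d.children) == 0 {
		if level == sliceLevelIdx {
			d.sliceState = d.state / sliceSize
		}
		return
	}
	d.state, d.sliceState = 0, 0
	for _, c := range d.children {
		fill(c, level+1, sliceLevelIdx, sliceSize)
		d.state += c.state
		d.sliceState += c.sliceState
	}
	if level == sliceLevelIdx { // changing this to <= would report 6/3 = 2 slices for the rack
		d.sliceState = d.state / sliceSize
	}
}

func main() {
	// A rack with 3 hosts, each fitting 2 pods; slices of 3 pods requested at the host level.
	rack := &domain{children: []*domain{{state: 2}, {state: 2}, {state: 2}}}
	fill(rack, 0, 1, 3) // levels: 0 = rack, 1 = host; slice size 3
	fmt.Println(rack.state, rack.sliceState) // 6 0: no host fits a whole slice, so noFit
}
```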

Contributor

Thank you for explaining. Since this is tricky and easy to break (as I just did, without breaking any tests), +1 for the unit test. It can be a follow-up.

@lchrzaszcz (Contributor, Author)

Got it. I'll add that test in a follow-up then.

@mimowo (Contributor) commented Jun 12, 2025

/lgtm
/approve
Nice, thank you for working relentlessly on addressing all comments 👍

@k8s-ci-robot added the lgtm label Jun 12, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 54b639b9db5a438b64cc7ec3ac84db00070c8ff2

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lchrzaszcz, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Jun 12, 2025
@k8s-ci-robot merged commit d81f629 into kubernetes-sigs:main Jun 12, 2025
23 checks passed
@k8s-ci-robot added this to the v0.13 milestone Jun 12, 2025
@mimowo (Contributor) commented Jul 7, 2025

/kind feature

@k8s-ci-robot added the kind/feature label Jul 7, 2025
@tenzen-y (Member)

/release-note-edit

TAS: Introduce two-level scheduling


Labels

approved, cncf-cla: yes, kind/feature, lgtm, ok-to-test, release-note, size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TAS: Two-level JobSet scheduling

6 participants