
KEP-5547: Integrate Workload APIs with Job controller#5871

Merged
k8s-ci-robot merged 15 commits into kubernetes:master from helayoty:helayoty/5547-workload-job
Feb 11, 2026

Conversation


@helayoty helayoty commented Feb 2, 2026

Signed-off-by: helayoty <heelayot@microsoft.com>
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Feb 2, 2026
@k8s-ci-robot k8s-ci-robot requested a review from kow3ns February 2, 2026 16:53
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Feb 2, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Feb 2, 2026
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 2, 2026

helayoty commented Feb 2, 2026

cc @mm4tt @erictune @soltysh @kannon92

### Goals

- Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
- Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
Contributor:

This would break if JobSet also adds gang support.

How can someone opt out of this even if parallelism > 1?

Member:

What failure mode do you have on your mind - if JobSet creates its own Workload/PodGroup for the whole JobSet?

I believe that we need a mechanism that if the Workload (PodGroup?) already exists, that should be adopted and no new one should be created. I also treat it as an "opt-out" mechanism.

Contributor:

Maybe ownerReferences check would be sufficient here.

Member Author:

Maybe ownerReferences check would be sufficient here.

This is what's stated in the KEP. You can find it in the notes/constraints section in addition to the unit tests and integration tests.

Contributor:

You could require that for the automatic creation to happen, Job.spec.template.spec.workloadRef must be empty. If it is set to anything, this is a signal that preexisting PodGroups or Workloads may be involved, and so it does not create them.

Member Author:

Addressed according to @erictune suggestion and based on our conversation on the Workload meeting today. PTAL.


helayoty commented Feb 2, 2026

/sig apps

We will add the following integration tests to the Job controller `https://github.com/kubernetes/kubernetes/blob/v1.35.0/test/integration/job/job_test.go`:
- Gang and Basic Scheduling Lifecycle Test (create, update, delete Job, verify Workload and PodGroup creation, verify pods have workloadRef, verify Job deletion cascades to Workload and PodGroup deletion)
- Failure Recovery Test (create Job with Workload API unavailable, verify Job controller retries, verify Workload is eventually created)
- Feature gate disable/enable (Jobs work without Workload/PodGroup creation (Jobs with ownerReferences managed by higher-level controllers do not create Workload/PodGroup))
Contributor:

I see a few areas we need to cover in alpha:

  • How does this feature work with suspended jobs?
  • If a job has ownerreferences set can we verify that no workload is created?
  • ElasticJob is forbidden. We should test/verify this.

Member Author (Feb 6, 2026):

Points 2 and 3 are already stated in the Proposal section. I'll add the suspended jobs case.

Member Author:

Addressed. PTAL.

@helayoty helayoty moved this to In Progress in SIG Scheduling Feb 5, 2026
@helayoty helayoty moved this from Needs Triage to In Progress in SIG Apps Feb 5, 2026
@wojtek-t wojtek-t self-assigned this Feb 5, 2026
### Goals

- Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
- Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
Member:

What failure mode do you have on your mind - if JobSet creates its own Workload/PodGroup for the whole JobSet?

I believe that we need a mechanism that if the Workload (PodGroup?) already exists, that should be adopted and no new one should be created. I also treat it as an "opt-out" mechanism.

### Goals

- Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
- Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
Member:

So we want to also check "parallelism=completions" - if the completions are larger than parallelism, then clearly it is not a gang...

@soltysh for your thoughts too

Contributor:

We don't have any strong requirements wrt parallelism == completions for defining parallel jobs. If you look at our docs we call parallel everything that has parallelism > 1.

There's also the question of indexed and non-indexed jobs, should we differentiate between the two? I remember there was some discussion at one point to only limit to indexed jobs, but I'm open.

Honestly, I'm inclined to start with stricter rules for creating the gang, and we can expand as we go, and we see it makes sense.


Contributor:

I don't really see a reason to restrict to only IndexedJobs.

Contributor:

Non-indexed jobs can make a workload with Basic policy.

Member:

Honestly, I'm inclined to start with stricter rules for creating the gang, and we can expand as we go, and we see it makes sense.

+1 to it
The goal is to prove the integration in Alpha, not to support every potential usecase for gang-scheduling. We should focus on not breaking anyone in this phase.

So the question is - what are the exact rules to use for gang-scheduling policy. Are we suggesting:

  • parallelism > 1
  • parallelism == completions
  • Indexed only

[For everything else we can still create Workload, but use the Basic scheduling policy]
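The narrow alpha criteria proposed here (gang only for Indexed Jobs with parallelism > 1 and parallelism == completions, Basic for everything else) can be sketched as a small decision function. All names (`JobShape`, `choosePolicy`) are hypothetical stand-ins, not the actual controller code.

```go
package main

import "fmt"

// SchedulingPolicy mirrors the two Workload policies discussed in the thread.
type SchedulingPolicy string

const (
	PolicyGang  SchedulingPolicy = "Gang"
	PolicyBasic SchedulingPolicy = "Basic"
)

// JobShape is a simplified stand-in for the Job fields the rule inspects.
type JobShape struct {
	Indexed     bool
	Parallelism int32
	Completions int32
}

// choosePolicy applies the strict alpha rules: gang scheduling only for
// Indexed Jobs where parallelism > 1 and parallelism == completions;
// everything else still gets a Workload, but with the Basic policy.
// For gang-scheduled Jobs, minCount mirrors parallelism.
func choosePolicy(j JobShape) (SchedulingPolicy, int32) {
	if j.Indexed && j.Parallelism > 1 && j.Parallelism == j.Completions {
		return PolicyGang, j.Parallelism
	}
	return PolicyBasic, 0
}

func main() {
	p, minCount := choosePolicy(JobShape{Indexed: true, Parallelism: 8, Completions: 8})
	fmt.Println(p, minCount) // Gang 8

	// Completions larger than parallelism: clearly not a gang.
	p, minCount = choosePolicy(JobShape{Indexed: true, Parallelism: 4, Completions: 16})
	fmt.Println(p, minCount) // Basic 0
}
```

Starting with these strict rules makes it cheap to widen the criteria later without breaking anyone, as suggested above.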

Member:

Also - along those lines, I don't think "parallel jobs" per-se are a goal. I think we want to bring the "first step of value" to users, so I would rather take one example where we know we need gang-scheduling and do that. The above criteria are relatively narrow, but maybe that's exactly what we want to start with.

Member Author:

Updated Goals to address the feedback. PTAL.

- The alpha release targets simple, static batch workloads where the workload requirements are known at creation time.
- Each Job maps to one `PodGroup`. All pods in the Job are identical from a scheduling policy perspective.
- The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
- There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
Member:

I think there has to be, but maybe it can be as simple as "create your own Workload that explicitly states the Basic policy" and teach the job controller to adopt it in that case (create a Workload only if it doesn't exist - i.e. support BYOW - bring-your-own-workload).

Contributor:

Basic policy" and teach job controller to adopt it in that case

That's roughly my previous question. How do we define the existing workload for a job controller to recognize, and adopt.

Member Author:

Updated based on our conversation today. PTAL.


- Check if a `Workload` object already exists for this `Job`.
- If not, determine the appropriate scheduling policy and create the `Workload` object with the determined policy.
- If it already exists, verify the existing `Workload` matches the `Job` spec. If not, update the `Workload` object.
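Given the points raised elsewhere in this thread (Workload is immutable in 1.36, and a mismatched Workload should be treated as an opt-out rather than updated), the reconcile step above might be sketched like this. The types and helper names (`Workload`, `ensureWorkload`) are illustrative assumptions, not the controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Workload is a simplified stand-in for the Workload API object; the
// structural check here is "exactly one PodGroup", per the discussion.
type Workload struct {
	Name      string
	PodGroups int
}

var errNotFound = errors.New("workload not found")

// ensureWorkload sketches the reconcile step: create a Workload when none
// exists, adopt a structurally matching one, and give up (emit an event and
// proceed without gang semantics) on a mismatch. It never updates, since
// the Workload object is immutable.
func ensureWorkload(lookup func(job string) (*Workload, error), create func(job string) *Workload, job string) (*Workload, string) {
	existing, err := lookup(job)
	switch {
	case errors.Is(err, errNotFound):
		return create(job), "created"
	case err != nil:
		return nil, "error" // transient lookup failure: requeue and retry
	case existing.PodGroups == 1:
		return existing, "adopted" // structure matches: BYOW adoption
	default:
		return nil, "skipped" // mismatch: don't break existing things
	}
}

func main() {
	store := map[string]*Workload{"job-b": {Name: "job-b", PodGroups: 2}}
	lookup := func(job string) (*Workload, error) {
		if w, ok := store[job]; ok {
			return w, nil
		}
		return nil, errNotFound
	}
	create := func(job string) *Workload { return &Workload{Name: job, PodGroups: 1} }

	_, action := ensureWorkload(lookup, create, "job-a")
	fmt.Println(action) // created
	_, action = ensureWorkload(lookup, create, "job-b")
	fmt.Println(action) // skipped
}
```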
Member:

What do you mean by "verify if it matches"?

I think we probably want to verify the structure (there is a single PodGroup) or sth, but if someone set Basic and we believe it should be Gang - I definitely wouldn't update it. We should treat it as an opt-out mechanism.

Member:

FWIW, I believe that updating in general sounds like a bad pattern - I would rather say that in case the structure doesn't match at all, we rather give up and create the pods without the Workload/PodGroup at all.

Contributor:

Also Workload is immutable in 1.36, so that's not an option.

Should this be a "silent give-up" (I ignore Workload and proceed with old logic) or should we have an admission blocking creation of Job objects with illegal Workloads attached?

Contributor:

I'd like to see here clear criteria for what it means that a Workload exists for a Job. In one place I've seen information about ownerRef, but I'm not seeing it here.

Member:

Should this be a "silent give-up" (I ignore Workload and proceed with old logic) or should we have an admission blocking creation of Job objects with illegal Workloads attached?

You can always have races:
I admit the Job because no workload exists but at the same time a Workload object is created.
So that's not a full solution.

I didn't want to say it's a "silent give-up" - we should set some condition/emit event/whatever.
But if we don't know what to do with that Workload, at least for now we should error on the side of "don't break existing things".

Member Author:

Updated this section according to the discussion here. PTAL.

- If it already exists, verify the existing `Workload` matches the `Job` spec. If not, update the `Workload` object.
- Check if a `PodGroup` object already exists for this `Job`.
- If not, create the `PodGroup` object referencing the `Workload`
- If it already exists, verify the existing `PodGroup` correctly references the `Workload`.
Member:

Hmm - if it isn't, how we even know that it is "the PodGroup" that we should use?

Contributor:

I'll ask differently, if we're assuming that the order of ownerRefs is always from Workload->PodGroup, do we have a validation in place to ensure that? If not, we should either establish one, otherwise users will start creating different combination and whatever logic we come up in the job controller will either be over-complicated or won't work for most cases.

Member Author:

Rewrote this section to address your feedback. PTAL.

@andreyvelich (Member) left a comment:

Thank you @helayoty!
I left a few comments.

- Each Job maps to one `PodGroup`. All pods in the Job are identical from a scheduling policy perspective.
- The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
- There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
- When gang scheduling is active (parallel jobs), changes to `spec.parallelism` are blocked via admission validation because this would require changing `minCount`
Member:

As I mentioned above, I don't understand why do we need this limitation?

Contributor:

Why admission validation? We can do this conditionally (based on FG on/off state) in job validation, no?

Member Author:

Agree. No need for admission validation here. Updated. PTAL.

- The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
- There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
- When gang scheduling is active (parallel jobs), changes to `spec.parallelism` are blocked via admission validation because this would require changing `minCount`
- If a Job has `ownerReferences` indicating it is managed by another controller (i.e., JobSet), the Job controller
Member:

Suggested change
- If a Job has `ownerReferences` indicating it is managed by another controller (i.e., JobSet), the Job controller
- If a Job has `ownerReferences` indicating it is managed by another controller (i.e., JobSet, TrainJob), the Job controller

Contributor:

That answers my question, roughly. But at the same time this means that a CronJob-owned Jobs will not be capable of using Workloads. Unless we introduce workloads there as well.

Member Author:

But at the same time this means that a CronJob-owned Jobs will not be capable of using Workloads. Unless we introduce workloads there as well.

This is my assumption.

Member:

But at the same time this means that a CronJob-owned Jobs will not be capable of using Workloads

I would say it's desired - let's not try to boil the ocean here.

Comment on lines +233 to +235
The Job controller must create objects in a strict order to ensure that the scheduler can properly validate pods
against their scheduling policy before attempting to schedule them. The order is as follows:
1. `Workload` object
Member:

Is that really needed? I asked previously, it doesn't matter in which order objects will be created since kube-scheduler will wait for Workload and PodGroup objects if Pods have workloadRef

Member Author:

I'd say it's needed. Added more justification. PTAL.

### Goals

- Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
- Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
Contributor:

I guess this is fine for alpha, but AFAIK it's a quite common case to start a job with parallelism=1 and later scale it up as a gang. If we go with what is proposed here, the Job will be created without gang-scheduling initially and it's not clear how to change it later.

Overall, I think I'm fine with having this "default gang iff parallelism > 1" in alpha - given that alpha features need to be enabled explicitly and the contract is kind of "use at your own risk".

However, for beta promotion we need to have a full API on the Job side figured out - including things like opt-in / opt-out and support for common use-cases (like going from parallelism 1->N and having gang scheduling).

Contributor:

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/feature-gates.md#features-which-add-a-new-api-field

When introducing a new API field the feature must start in alpha.

So if we want to add an API I see the following path:
alpha - 1.36 (rough sketch of implementation without API)
alpha - 1.37 (api for opt in / opt out)
beta - 1.38


- Check if a `Workload` object already exists for this `Job`.
- If not, determine the appropriate scheduling policy and create the `Workload` object with the determined policy.
- If it already exists, verify the existing `Workload` matches the `Job` spec. If not, update the `Workload` object.
Contributor:

Also Workload is immutable in 1.36, so that's not an option.

Should this be a "silent give-up" (I ignore Workload and proceed with old logic) or should we have an admission blocking creation of Job objects with illegal Workloads attached?

### Goals

- Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
- Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any strong requirements wrt parallelism == completions for defining parallel jobs. If you look at our docs we call parallel everything that has parallelism > 1.

There's also the question of indexed and non-indexed jobs, should we differentiate between the two? I remember there was some discussion at one point to only limit to indexed jobs, but I'm open.

Honestly, I'm inclined to start with stricter rules for creating the gang, and we can expand as we go, and we see it makes sense.

- Parallelism change is blocked for gang-scheduled Jobs and allowed for basic-scheduled Jobs
- Job deletion cascades to Workload and PodGroup deletion
- Feature gate disabled: Jobs work without Workload/PodGroup creation
- Jobs with ownerReferences (managed by higher-level controllers) do not create Workload/PodGroup
Contributor:

Probably also add tests verifying the actual ownerRefs values for job, workload and podgroup, so that it matches the expected ordering.

Member Author:

Added. PTAL.

Signed-off-by: helayoty <heelayot@microsoft.com>
### Goals

- Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
- Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
Member:

Also - along those lines, I don't think "parallel jobs" per-se are a goal. I think we want to bring the "first step of value" to users, so I would rather take one example where we know we need gang-scheduling and do that. The above criteria are relatively narrow, but maybe that's exactly what we want to start with.

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 11, 2026
Signed-off-by: helayoty <heelayot@microsoft.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2026
@soltysh (Contributor) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2026
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: helayoty, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2026
@k8s-ci-robot k8s-ci-robot merged commit 5e67e40 into kubernetes:master Feb 11, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 11, 2026
@github-project-automation github-project-automation bot moved this from Needs Approval to Done in SIG Apps Feb 11, 2026
@github-project-automation github-project-automation bot moved this from Needs Final Approver to Done in SIG Scheduling Feb 11, 2026
@dom4ha (Member) left a comment:

Haven't managed before merge, but all looks good!

I wonder if we could not block elastic jobs by allowing them to opt-out by defining Basic policy?

Thanks @helayoty @wojtek-t and all others for the work on integration!


### Validation for Parallelism Changes

The Job API validation rejects updates that change `spec.parallelism` when the feature gate is enabled and the Job uses gang scheduling, since changing this field would require changing `minCount` in the `Workload` object, which is immutable.
Member:

If there is a workload object created with basic policy (to opt-out) is the elastic job still not allowed?

Contributor:

We decided not to create workload+podgroup in all other cases, except for when it's requesting gang scheduling. At least in the initial alpha. See #5871 (comment) for discussion.

Member:

In the case of Indexed Jobs, index 0 is sometimes treated as a special role, then we want to schedule index-0 and other indexes separately.

In this situation, are we able to manually create 2 PodGroups, each for the index-0 Pod and other index Pods by using the Workload Basic policy?

Contributor:

In this situation, are we able to manually create 2 PodGroups, each for the index-0 Pod and other index Pods by using the Workload Basic policy?

Yes, you can always create Workload+PodGroups in your desired configuration. In that case the job controller won't do it for you.

Member:

Thank you for describing that.
That sounds reasonable. We might be able to natively support various PodGroup creation patterns based on real usecases in the future.

But, I agree with keeping a minimum at this time.

Member:

If there is a workload object created with basic policy (to opt-out) is the elastic job still not allowed?

@dom4ha - the answer by @soltysh above answers that.
You can always create your own Workload/PodGroup if you want. That will just work.
What we wanted to ensure is that if the Workload is created by us, we will not allow scaling it.


soltysh commented Feb 11, 2026

I wonder if we could not block elastic jobs by allowing them to opt-out by defining Basic policy?

Creating your own workload resources basically allows you to freely manage the job, so it's definitely a reasonable option. Definitely, worth documenting.

@tenzen-y (Member):

I wonder if we could not block elastic jobs by allowing them to opt-out by defining Basic policy?

Creating your own workload resources basically allows you to freely manage the job, so it's definitely a reasonable option. Definitely, worth documenting.

+1
I think so too.


atiratree commented Feb 13, 2026

From User Stories:

Standard Batch Job with Workload Tracking

As a data engineer, I want to run a batch processing job that processes files sequentially without gang scheduling requirements.

  1. The data engineer usually has enough resources on their cluster; setting Indexed and Completions == Parallelism to achieve fast execution.
  2. The job controller creates all of the pods.
  3. Pods get scheduled and executed.

or

  1. The data engineer usually has enough resources on their cluster; setting Indexed and Completions == Parallelism to achieve fast execution.
  2. There are fewer resources on the shared cluster due to new workloads running there.
  3. The job controller creates all of the pods.
  4. Half of the pods get scheduled and executed.
  5. Second half of the pods get scheduled and executed.

IIUIC, the job controller will create Workload + PodGroup + Gang scheduling policy automatically for both of these cases and might block the second example from running.

What was the main reason for making gang scheduling the default behavior instead of opt-in? Please let me know, if you have already discussed this scenario or if I have missed something?


soltysh commented Feb 16, 2026

What was the main reason for making gang scheduling the default behavior instead of opt-in? Please let me know, if you have already discussed this scenario or if I have missed something?

The goal was to experiment and NOT introduce API fields in the Job resource. We haven't reached an agreement what the settings should be on the Job side for that functionality and this approach allows us to experiment (while the feature is still alpha) and better understand how to expose the necessary knobs. We want to avoid expanding the Job API in ways that we know will change, since both Workload API and PodGroup API are under heavy development currently.


soltysh commented Feb 16, 2026

@atiratree #5548 is roughly what I would like to avoid, and that triggered the entire discussion about not expanding the API, until we have a clear picture.

@atiratree (Member):

I am not suggesting that we introduce a new API / fields. I am just curious about the scheduling aspect. I will bring this up at the next SIG scheduling meeting for further discussion.

@helayoty (Member Author):

/area workload-aware

@k8s-ci-robot k8s-ci-robot added the area/workload-aware Categorizes an issue or PR as relevant to Workload-aware and Topology-aware scheduling subprojects. label Feb 17, 2026
@helayoty (Member Author):

IIUIC, the job controller will create Workload + PodGroup + Gang scheduling policy automatically for both of these cases and might block the second example from running.

What was the main reason for making gang scheduling the default behavior instead of opt-in? Please let me know, if you have already discussed this scenario or if I have missed something?

@atiratree , There are few points that we need to clarify:

  1. Creating Workload/PodGroup doesn't enforce gang scheduling by itself.

    The Job controller creates the Workload and PodGroup objects; it does not change how the scheduler places pods. The scheduler's gang scheduling behavior is controlled by a separate feature gate, GangScheduling. If only EnableWorkloadWithJob is enabled without GangScheduling on the scheduler, the Workload/PodGroup objects are created as metadata, but pods are scheduled normally, which means scenario 2 would continue to work as expected.

  2. When both gates are enabled, the behavior change is intentional. In that case, yes, the second scenario would wait for all pods to be schedulable simultaneously rather than making partial progress, which is expected and intended.

  3. On auto-detection vs explicit opt-in.

    The structural criteria (Indexed + Completions = Parallelism) are a strong signal of fixed-size parallel workloads that typically benefit from gang scheduling. The rationale for auto-detection:

    • It avoids requiring users to learn about the Workload/PodGroup APIs to get the benefit.
    • Higher-level controllers (JobSet, LWS, etc.) can opt out by setting schedulingGroup themselves.

That said, your data engineer scenario is a real one: not every Indexed Job with Completions == Parallelism needs gang scheduling. In that case, the user can create an empty Workload and reference it from their Job object.
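The structural criteria and the opt-out path described above can be sketched roughly as follows. This is only an illustration of the decision logic under discussion, not the actual Job controller code; in particular, the `SchedulingGroup` field name is an assumption standing in for whatever opt-out reference the final API uses.

```go
package main

import "fmt"

// Minimal stand-ins for the relevant Job spec fields. SchedulingGroup is a
// hypothetical field representing an existing Workload reference set by a
// higher-level controller (JobSet, LWS, etc.) or by the user.
type CompletionMode string

const Indexed CompletionMode = "Indexed"

type JobSpec struct {
	CompletionMode  CompletionMode
	Completions     *int32
	Parallelism     *int32
	SchedulingGroup string // non-empty means the Job already references a Workload
}

// shouldAutoCreateWorkload sketches the structural criteria: an Indexed Job
// with Completions == Parallelism, unless an explicit opt-out is present.
func shouldAutoCreateWorkload(spec JobSpec) bool {
	if spec.SchedulingGroup != "" {
		return false // opt-out: a Workload reference already exists
	}
	if spec.CompletionMode != Indexed {
		return false
	}
	if spec.Completions == nil || spec.Parallelism == nil {
		return false
	}
	return *spec.Completions == *spec.Parallelism
}

func main() {
	n := int32(4)
	fmt.Println(shouldAutoCreateWorkload(JobSpec{
		CompletionMode: Indexed, Completions: &n, Parallelism: &n,
	})) // true: fixed-size Indexed Job, no opt-out
	fmt.Println(shouldAutoCreateWorkload(JobSpec{
		CompletionMode: Indexed, Completions: &n, Parallelism: &n,
		SchedulingGroup: "jobset-owned",
	})) // false: higher-level controller opted out
}
```

Note that even when this predicate returns true, gang scheduling only takes effect if the scheduler's GangScheduling gate is also enabled; otherwise the created objects are inert metadata, as described in point 1 above.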

@helayoty helayoty deleted the helayoty/5547-workload-job branch February 17, 2026 13:39
@atiratree
Copy link
Member

When both gates are enabled, the behavior change is intentional. In that case, yes, the second scenario would wait for all pods to be schedulable simultaneously rather than making partial progress, which is expected and intended.

It is intentional now, but it will become the default behavior when both of these feature gates graduate.

@helayoty
Copy link
Member Author

When both gates are enabled, the behavior change is intentional. In that case, yes, the second scenario would wait for all pods to be schedulable simultaneously rather than making partial progress, which is expected and intended.

It is intentional now, but it will become the default behavior when both of these feature gates graduate.

True. This was by design. The default is to gang-schedule these types of Jobs. If users don't want that, they can opt out by creating their own Workload.

@wojtek-t @mm4tt

@soltysh
Copy link
Contributor

soltysh commented Feb 18, 2026

It is intentional now, but it will become the default behavior when both of these feature gates graduate.

The beta graduation criteria clearly state that for beta we will expose the necessary knobs for users to tweak when and how the Workload/PodGroup are created, which will give users the ability to opt in or opt out on demand. As I stated before, the alpha stage is to figure out WHAT the API should look like, because we don't have a clear picture. Heba, Wojtek, Eric, Matt and I spent significant time going back and forth on the shape, and we've decided that this path will ensure we can come up with a long-term API, rather than ad-hoc changes which are then costly to maintain.

@atiratree
Copy link
Member

After further clarification from SIG Scheduling, this feature is being implemented as experimental and will most likely have to be changed in Beta. The use cases and breaking changes (against the stable Job API) will have to be analysed and discussed again.

@wojtek-t
Copy link
Member

True. This was by design. The default is to gang-schedule these types of Jobs. If users don't want that, they can opt out by creating their own Workload.

To clarify - this was only a decision for Alpha.
There is a very high chance (I would say almost 100%) that the actual defaulting mechanism will be different once we reach Beta.
So as @soltysh already wrote above - the goal for Alpha is to prove that we can integrate through the whole stack. The actual knobs for how users can configure that were explicitly put out of scope for Alpha; they are Beta criteria and will be figured out by then.

