KEP-5547: Integrate Workload APIs with Job controller #5871
k8s-ci-robot merged 15 commits into kubernetes:master from
Conversation
Signed-off-by: helayoty <heelayot@microsoft.com>
> ### Goals
>
> - Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
> - Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
This would break if JobSet also adds gang support.
How can someone opt out of this even if parallelism > 1?
What failure mode do you have on your mind - if JobSet creates its own Workload/PodGroup for the whole JobSet?
I believe that we need a mechanism that if the Workload (PodGroup?) already exists, that should be adopted and no new one should be created. I also treat it as an "opt-out" mechanism.
Maybe ownerReferences check would be sufficient here.
> Maybe ownerReferences check would be sufficient here.
This is what's stated in the KEP. You can find it in the notes/constraints section in addition to the unit tests and integration tests.
You could require that for the automatic creation to happen, Job.spec.template.spec.workloadRef must be empty. If it is set to anything, this is a signal that preexisting PodGroups or Workloads may be involved, and so it does not create them.
Addressed according to @erictune's suggestion and based on our conversation at the Workload meeting today. PTAL.
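The opt-out mechanism settled on above can be sketched roughly as follows. This is a hypothetical, simplified model (plain structs standing in for the real Job API; `workloadRef` is richer than a string in the actual proposal), not the KEP's implementation:

```go
package main

import "fmt"

// Hypothetical, simplified shapes; the real fields live in the Kubernetes
// Job and pod template APIs.
type PodTemplateSpec struct {
	WorkloadRef string // empty means no preexisting Workload is involved
}

type Job struct {
	Template PodTemplateSpec
}

// shouldAutoCreateWorkload encodes the opt-out: the Job controller creates
// Workload/PodGroup objects only when workloadRef is unset, so a
// user-provided (bring-your-own) Workload suppresses automatic creation.
func shouldAutoCreateWorkload(j Job) bool {
	return j.Template.WorkloadRef == ""
}

func main() {
	fmt.Println(shouldAutoCreateWorkload(Job{})) // true: no workloadRef set
	fmt.Println(shouldAutoCreateWorkload(Job{Template: PodTemplateSpec{WorkloadRef: "my-wl"}})) // false: BYO Workload
}
```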
/sig apps
> We will add the following integration tests to the Job controller `https://github.com/kubernetes/kubernetes/blob/v1.35.0/test/integration/job/job_test.go`:
> - Gang and Basic Scheduling Lifecycle Test (create, update, delete Job, verify Workload and PodGroup creation, verify pods have workloadRef, verify Job deletion cascades to Workload and PodGroup deletion)
> - Failure Recovery Test (create Job with Workload API unavailable, verify Job controller retries, verify Workload is eventually created)
> - Feature gate disable/enable (Jobs work without Workload/PodGroup creation; Jobs with ownerReferences managed by higher-level controllers do not create Workload/PodGroup)
I see a few areas we need to cover in alpha:
- How does this feature work with suspended jobs?
- If a job has ownerReferences set, can we verify that no workload is created?
- ElasticJob is forbidden. We should test/verify this.
Points 2 and 3 are already stated in the Proposal section. I'll add the suspended jobs case.
> ### Goals
>
> - Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
> - Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
So we want to also check "parallelism=completions" - if the completions are larger than parallelism, then clearly it is not a gang...
@soltysh for your thoughts too
We don't have any strong requirements wrt parallelism == completions for defining parallel jobs. If you look at our docs, we call parallel everything that has parallelism > 1.
There's also the question of indexed and non-indexed jobs - should we differentiate between the two? I remember there was some discussion at one point to limit this to indexed jobs only, but I'm open.
Honestly, I'm inclined to start with stricter rules for creating the gang, and we can expand as we go if we see it makes sense.
I don't really see a reason to restrict to only IndexedJobs.
Non-indexed jobs can make a workload with Basic policy.
> Honestly, I'm inclined to start with stricter rules for creating the gang, and we can expand as we go if we see it makes sense.
+1 to it
The goal is to prove the integration in Alpha, not to support every potential usecase for gang-scheduling. We should focus on not breaking anyone in this phase.
So the question is - what are the exact rules to use for gang-scheduling policy. Are we suggesting:
- parallelism > 1
- parallelism == completions
- Indexed only
[For everything else we can still create Workload, but use the Basic scheduling policy]
Also - along those lines, I don't think "parallel jobs" per-se are a goal. I think we want to bring the "first step of value" to users, so I would rather take one example where we know we need gang-scheduling and do that. The above criteria are relatively narrow, but maybe that's exactly what we want to start with.
Updated Goals to address the feedback. PTAL.
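The narrow alpha criteria proposed in this thread (gang only for Indexed Jobs with parallelism > 1 and parallelism == completions, Basic for everything else) can be sketched as a single decision function. This is a hypothetical simplification, not the merged KEP wording:

```go
package main

import "fmt"

// Hypothetical, simplified inputs; names do not match the real Job API exactly.
type Job struct {
	Parallelism    int32
	Completions    int32
	CompletionMode string // "Indexed" or "NonIndexed"
}

// schedulingPolicyFor applies the stricter alpha rules discussed above:
// gang scheduling only for Indexed Jobs where parallelism > 1 and
// parallelism == completions; everything else falls back to Basic.
func schedulingPolicyFor(j Job) string {
	if j.CompletionMode == "Indexed" && j.Parallelism > 1 && j.Parallelism == j.Completions {
		return "Gang" // minCount = parallelism
	}
	return "Basic"
}

func main() {
	fmt.Println(schedulingPolicyFor(Job{Parallelism: 4, Completions: 4, CompletionMode: "Indexed"}))    // Gang
	fmt.Println(schedulingPolicyFor(Job{Parallelism: 4, Completions: 8, CompletionMode: "Indexed"}))    // Basic
	fmt.Println(schedulingPolicyFor(Job{Parallelism: 4, Completions: 4, CompletionMode: "NonIndexed"})) // Basic
}
```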
> - The alpha release targets simple, static batch workloads where the workload requirements are known at creation time.
> - Each Job maps to one `PodGroup`. All pods in the Job are identical from a scheduling policy perspective.
> - The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
> - There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
I think there has to be one, but maybe it can be as simple as "create your own Workload that basically explicitly states the Basic policy" and teach the job controller to adopt it in that case (create Workload only if it doesn't exist - i.e. support BYOW - bring-your-own-workload).
> Basic policy" and teach job controller to adopt it in that case

That's roughly my previous question: how do we define the existing workload for the job controller to recognize and adopt?
Updated based on our conversation today. PTAL.
> - Check if a `Workload` object already exists for this `Job`.
> - If not, determine the appropriate scheduling policy and create the `Workload` object with the determined policy.
> - If it already exists, verify the existing `Workload` matches the `Job` spec. If not, update the `Workload` object.
What do you mean by "verify if it matches"?
I think we probably want to verify the structure (there is a single PodGroup) or sth, but if someone set Basic and we believe it should be Gang - I definitely wouldn't update it. We should treat it as an opt-out mechanism.
FWIW, I believe that updating in general sounds like a bad pattern - I would rather say that in case the structure doesn't match at all, we should give up and create the pods without the Workload/PodGroup at all.
Also Workload is immutable in 1.36, so that's not an option.
Should this be a "silent give-up" (I ignore Workload and proceed with old logic) or should we have an admission blocking creation of Job objects with illegal Workloads attached?
I'd like to see here clear criteria for what it means that a Workload exists for a Job. In one place I've seen information about ownerRef, but I'm not seeing it here.
> Should this be a "silent give-up" (I ignore Workload and proceed with old logic) or should we have an admission blocking creation of Job objects with illegal Workloads attached?
You can always have races:
I admit the Job because no workload exists but at the same time a Workload object is created.
So that's not a full solution.
I didn't want to say it's a "silent give-up" - we should set some condition/emit event/whatever.
But if we don't know what to do with that Workload, at least for now we should error on the side of "don't break existing things".
Updated this section according to the discussion here. PTAL.
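The "create, adopt, or give up" outcome of this thread can be sketched with hypothetical, simplified types (real code would use the API machinery and set a condition or emit an event on give-up, as noted above):

```go
package main

import "fmt"

// Hypothetical minimal shapes for the decision sketch.
type OwnerRef struct{ Kind, Name string }

type Workload struct {
	Name      string
	Owners    []OwnerRef
	PodGroups int // structural check: the controller expects exactly one
}

// syncDecision returns what the Job controller should do for the Job named
// jobName: create a Workload, adopt an existing one it recognizes via
// ownerRef, or give up (surface an event/condition and fall back to the
// pre-Workload pod-creation logic rather than break existing things).
func syncDecision(existing *Workload, jobName string) string {
	if existing == nil {
		return "create"
	}
	ownedByJob := false
	for _, o := range existing.Owners {
		if o.Kind == "Job" && o.Name == jobName {
			ownedByJob = true
		}
	}
	// Workload is immutable in 1.36, so a mismatched structure is never
	// updated in place.
	if !ownedByJob || existing.PodGroups != 1 {
		return "give-up"
	}
	return "adopt"
}

func main() {
	fmt.Println(syncDecision(nil, "train"))                                                           // create
	fmt.Println(syncDecision(&Workload{Owners: []OwnerRef{{"Job", "train"}}, PodGroups: 1}, "train")) // adopt
	fmt.Println(syncDecision(&Workload{Owners: []OwnerRef{{"Job", "train"}}, PodGroups: 2}, "train")) // give-up
}
```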
> - If it already exists, verify the existing `Workload` matches the `Job` spec. If not, update the `Workload` object.
> - Check if a `PodGroup` object already exists for this `Job`.
> - If not, create the `PodGroup` object referencing the `Workload`
> - If it already exists, verify the existing `PodGroup` correctly references the `Workload`.
Hmm - if it isn't, how do we even know that it is "the PodGroup" that we should use?
I'll ask differently: if we're assuming that the order of ownerRefs is always Workload->PodGroup, do we have a validation in place to ensure that? If not, we should establish one; otherwise users will start creating different combinations, and whatever logic we come up with in the job controller will either be over-complicated or won't work for most cases.
Rewrote this section to address your feedback. PTAL.
andreyvelich left a comment
Thank you @helayoty!
I left a few comments.
> - Each Job maps to one `PodGroup`. All pods in the Job are identical from a scheduling policy perspective.
> - The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
> - There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
> - When gang scheduling is active (parallel jobs), changes to `spec.parallelism` are blocked via admission validation because this would require changing `minCount`
As I mentioned above, I don't understand why we need this limitation.
Why admission validation? We can do this conditionally (based on FG on/off state) in job validation, no?
Agree. No need for admission validation here. Updated. PTAL.
> - The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
> - There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
> - When gang scheduling is active (parallel jobs), changes to `spec.parallelism` are blocked via admission validation because this would require changing `minCount`
> - If a Job has `ownerReferences` indicating it is managed by another controller (i.e., JobSet), the Job controller
```suggestion
- If a Job has `ownerReferences` indicating it is managed by another controller (i.e., JobSet, TrainJob), the Job controller
```
That answers my question, roughly. But at the same time this means that CronJob-owned Jobs will not be capable of using Workloads, unless we introduce workloads there as well.
> But at the same time this means that CronJob-owned Jobs will not be capable of using Workloads, unless we introduce workloads there as well.
This is my assumption.
> But at the same time this means that CronJob-owned Jobs will not be capable of using Workloads
I would say it's desired - let's not try to boil the ocean here.
> The Job controller must create objects in a strict order to ensure that the scheduler can properly validate pods
> against their scheduling policy before attempting to schedule them. The order is as follows:
> 1. `Workload` object
Is that really needed? I asked previously; it doesn't matter in which order objects are created, since kube-scheduler will wait for Workload and PodGroup objects if Pods have workloadRef.
I'd say it's needed. Added more justification. PTAL.
> ### Goals
>
> - Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
> - Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
I guess this is fine for alpha, but AFAIK it's a quite common case to start a job with parallelism=1 and later scale it up as a gang. If we go with what is proposed here, the Job will be created without gang-scheduling initially and it's not clear how to change it later.
Overall, I think I'm fine with having this "default gang iff parallelism > 1" in alpha - given that alpha features need to be enabled explicitly and the contract is kind of "use at your own risk".
However, for beta promotion we need to have a full API on the Job side figured out - with things like opt-in / opt-out and support for common use-cases (like going from parallelism 1->N and having gang scheduling).
When introducing a new API field the feature must start in alpha.
So if we want to add an API I see the following path:
alpha - 1.36 (rough sketch of implementation without API)
alpha - 1.37 (api for opt in / opt out)
beta - 1.38
> - Parallelism change is blocked for gang-scheduled Jobs and allowed for basic-scheduled Jobs
> - Job deletion cascades to Workload and PodGroup deletion
> - Feature gate disabled: Jobs work without Workload/PodGroup creation
> - Jobs with ownerReferences (managed by higher-level controllers) do not create Workload/PodGroup
Probably also add tests verifying the actual ownerRefs values for job, workload and podgroup, so that it matches the expected ordering.
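The ownerRef-chain assertion suggested here could look roughly like this in a test, using simplified types (a real integration test would read the objects from the API server): the Workload is owned by the Job, and the PodGroup by the Workload.

```go
package main

import "fmt"

// ref is a hypothetical minimal owner reference.
type ref struct{ Kind, Name string }

// ownedBy reports whether owners contains a reference of the given kind/name.
func ownedBy(owners []ref, kind, name string) bool {
	for _, o := range owners {
		if o.Kind == kind && o.Name == name {
			return true
		}
	}
	return false
}

func main() {
	workloadOwners := []ref{{"Job", "train"}}
	podGroupOwners := []ref{{"Workload", "train-workload"}}

	// Expected chain: Job -> Workload -> PodGroup.
	fmt.Println(ownedBy(workloadOwners, "Job", "train"))               // true
	fmt.Println(ownedBy(podGroupOwners, "Workload", "train-workload")) // true
	fmt.Println(ownedBy(podGroupOwners, "Job", "train"))               // false
}
```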
Signed-off-by: helayoty <heelayot@microsoft.com>
Signed-off-by: helayoty <heelayot@microsoft.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: helayoty, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
> ### Validation for Parallelism Changes
>
> The Job API validation rejects updates that change `spec.parallelism` when the feature gate is enabled and the Job uses gang scheduling, since changing this field would require changing `minCount` in the `Workload` object, which is immutable.
If there is a workload object created with basic policy (to opt-out) is the elastic job still not allowed?
We decided not to create workload+podgroup in all other cases, except for when it's requesting gang scheduling. At least in the initial alpha. See #5871 (comment) for discussion.
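The conditional validation discussed in this thread can be sketched as follows. This is a hypothetical simplification of Job update validation (not the actual kube-apiserver code): the update is rejected only when the feature gate is on and the Job is gang-scheduled.

```go
package main

import "fmt"

// validateParallelismUpdate rejects a parallelism change only when the
// feature gate is enabled and the Job is gang-scheduled, because the
// mirrored minCount on the Workload object is immutable. Basic-scheduled
// Jobs and clusters with the gate off keep today's behavior.
func validateParallelismUpdate(featureGateOn, gangScheduled bool, oldP, newP int32) error {
	if featureGateOn && gangScheduled && oldP != newP {
		return fmt.Errorf("spec.parallelism is immutable for gang-scheduled Jobs (would change Workload minCount)")
	}
	return nil
}

func main() {
	fmt.Println(validateParallelismUpdate(true, true, 4, 8) != nil)  // true: blocked
	fmt.Println(validateParallelismUpdate(true, false, 4, 8) != nil) // false: basic policy, allowed
	fmt.Println(validateParallelismUpdate(false, true, 4, 8) != nil) // false: gate off
}
```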
In the case of Indexed Jobs, index 0 is sometimes treated as a special role, then we want to schedule index-0 and other indexes separately.
In this situation, are we able to manually create 2 PodGroups, each for the index-0 Pod and other index Pods by using the Workload Basic policy?
> In this situation, are we able to manually create 2 PodGroups, each for the index-0 Pod and other index Pods by using the Workload `Basic` policy?
Yes, you can always create Workload+PodGroups in your desired configuration. In that case the job controller won't do it for you.
Thank you for describing that.
That sounds reasonable. We might be able to natively support various PodGroup creation patterns based on real usecases in the future.
But, I agree with keeping a minimum at this time.
> If there is a workload object created with basic policy (to opt-out) is the elastic job still not allowed?
@dom4ha - the answer by @soltysh above answers that.
You can always create your own Workload/PodGroup if you want. That will just work.
What we wanted to ensure is that if the Workload is created by us, we will not allow for scaling it.
Creating your own workload resources basically allows you to freely manage the job, so it's definitely a reasonable option. Definitely worth documenting.

+1
From User Stories:
or
IIUIC, the job controller will create Workload + PodGroup + Gang scheduling policy automatically for both of these cases and might block the second example from running. What was the main reason for making gang scheduling the default behavior instead of opt-in? Please let me know if you have already discussed this scenario or if I have missed something.
The goal was to experiment and NOT introduce API fields in the Job resource. We haven't reached an agreement what the settings should be on the Job side for that functionality, and this approach allows us to experiment (while the feature is still alpha) and better understand how to expose the necessary knobs. We want to avoid expanding the Job API in ways that we know will change, since both Workload API and PodGroup API are under heavy development currently.
@atiratree #5548 is roughly what I would like to avoid, and that triggered the entire discussion about not expanding the API until we have a clear picture.
I am not suggesting that we introduce a new API / fields. I am just curious about the scheduling aspect. I will bring this up at the next SIG scheduling meeting for further discussion.
/area workload-aware
@atiratree, there are a few points that we need to clarify:
That said, your data engineer scenario is a real one, as not every Indexed job with C==P needs gang scheduling; in that case the user can create an empty workload to reference on their Job object.
It is intentional now, but it will become the default behavior when both of these feature gates graduate.
True. This was by design. The default is to gang for these types of Jobs. If users don't want to, they can opt-out by creating their own Workload.
The beta graduation criteria clearly state that for beta we will expose necessary knobs for users to tweak when and how workload/podgroup are created, which will give users the ability to opt-in or opt-out on demand. As I stated before, the alpha stage is to figure out WHAT the API should look like, b/c we don't have a clear picture. Heba, Wojtek, Eric, Matt and I spent significant time going back and forth on the shape, and we've decided that this path will ensure we can come up with a long-term API, rather than ad-hoc changes which are then costly to maintain.
After further clarification from SIG Scheduling, this feature is being implemented as experimental and will most likely have to be changed in Beta. The use cases and breaking changes (against the stable Job API) will have to be analysed and discussed again.
To clarify - this was only a decision for Alpha.
One-line PR description: Integrate `Workload` and `PodGroup` APIs with the Job controller to support gang-scheduling.
Issue link: WAS: Integrate Workload APIs with Job controller #5547
Other comments: See other KEPs
/sig scheduling