KEP-5547: Integrate Workload APIs with Job controller #5871
k8s-ci-robot merged 15 commits into kubernetes:master from
Conversation
Signed-off-by: helayoty <heelayot@microsoft.com>
> ### Goals
>
> - Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
> - Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
This would break if JobSet also adds gang support.
How can someone opt out of this even if parallelism > 1?
What failure mode do you have on your mind - if JobSet creates its own Workload/PodGroup for the whole JobSet?
I believe that we need a mechanism that if the Workload (PodGroup?) already exists, that should be adopted and no new one should be created. I also treat it as an "opt-out" mechanism.
Maybe ownerReferences check would be sufficient here.
> Maybe ownerReferences check would be sufficient here.
This is what's stated in the KEP. You can find it in the notes/constraints section in addition to the unit tests and integration tests.
You could require that for the automatic creation to happen, Job.spec.template.spec.workloadRef must be empty. If it is set to anything, this is a signal that preexisting PodGroups or Workloads may be involved, and so it does not create them.
Addressed according to @erictune's suggestion and based on our conversation at the Workload meeting today. PTAL.
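The opt-out mechanism settled on above can be sketched roughly as follows. This is a hypothetical, simplified model (plain structs standing in for the real Job API; `workloadRef` is richer than a string in the actual proposal), not the KEP's implementation:

```go
package main

import "fmt"

// Hypothetical, simplified shapes; the real fields live in the Kubernetes
// Job and pod template APIs.
type PodTemplateSpec struct {
	WorkloadRef string // empty means no preexisting Workload is involved
}

type Job struct {
	Template PodTemplateSpec
}

// shouldAutoCreateWorkload encodes the opt-out: the Job controller creates
// Workload/PodGroup objects only when workloadRef is unset, so a
// user-provided (bring-your-own) Workload suppresses automatic creation.
func shouldAutoCreateWorkload(j Job) bool {
	return j.Template.WorkloadRef == ""
}

func main() {
	fmt.Println(shouldAutoCreateWorkload(Job{})) // true: no workloadRef set
	fmt.Println(shouldAutoCreateWorkload(Job{Template: PodTemplateSpec{WorkloadRef: "my-wl"}})) // false: BYO Workload
}
```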
/sig apps
> We will add the following integration tests to the Job controller `https://github.com/kubernetes/kubernetes/blob/v1.35.0/test/integration/job/job_test.go`:
> - Gang and Basic Scheduling Lifecycle Test (create, update, delete Job, verify Workload and PodGroup creation, verify pods have workloadRef, verify Job deletion cascades to Workload and PodGroup deletion)
> - Failure Recovery Test (create Job with Workload API unavailable, verify Job controller retries, verify Workload is eventually created)
> - Feature gate disable/enable (Jobs work without Workload/PodGroup creation; Jobs with ownerReferences managed by higher-level controllers do not create Workload/PodGroup)
I see a few areas we need to cover in alpha:
- How does this feature work with suspended jobs?
- If a job has ownerReferences set, can we verify that no workload is created?
- ElasticJob is forbidden. We should test/verify this.
Points 2 and 3 are already stated in the Proposal section. I'll add the suspended jobs case.
> ### Goals
>
> - Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
> - Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
So we want to also check "parallelism=completions" - if the completions are larger than parallelism, then clearly it is not a gang...
@soltysh for your thoughts too
We don't have any strong requirements wrt parallelism == completions for defining parallel jobs. If you look at our docs, we call parallel everything that has parallelism > 1.
There's also the question of indexed and non-indexed jobs - should we differentiate between the two? I remember there was some discussion at one point to limit this to indexed jobs only, but I'm open.
Honestly, I'm inclined to start with stricter rules for creating the gang, and we can expand as we go if we see it makes sense.
I don't really see a reason to restrict to only IndexedJobs.
Non-indexed jobs can make a workload with Basic policy.
> Honestly, I'm inclined to start with stricter rules for creating the gang, and we can expand as we go if we see it makes sense.
+1 to it
The goal is to prove the integration in Alpha, not to support every potential usecase for gang-scheduling. We should focus on not breaking anyone in this phase.
So the question is - what are the exact rules to use for gang-scheduling policy. Are we suggesting:
- parallelism > 1
- parallelism == completions
- Indexed only
[For everything else we can still create Workload, but use the Basic scheduling policy]
Also - along those lines, I don't think "parallel jobs" per-se are a goal. I think we want to bring the "first step of value" to users, so I would rather take one example where we know we need gang-scheduling and do that. The above criteria are relatively narrow, but maybe that's exactly what we want to start with.
Updated Goals to address the feedback. PTAL.
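The narrow alpha criteria proposed in this thread (gang only for Indexed Jobs with parallelism > 1 and parallelism == completions, Basic for everything else) can be sketched as a single decision function. This is a hypothetical simplification, not the merged KEP wording:

```go
package main

import "fmt"

// Hypothetical, simplified inputs; names do not match the real Job API exactly.
type Job struct {
	Parallelism    int32
	Completions    int32
	CompletionMode string // "Indexed" or "NonIndexed"
}

// schedulingPolicyFor applies the stricter alpha rules discussed above:
// gang scheduling only for Indexed Jobs where parallelism > 1 and
// parallelism == completions; everything else falls back to Basic.
func schedulingPolicyFor(j Job) string {
	if j.CompletionMode == "Indexed" && j.Parallelism > 1 && j.Parallelism == j.Completions {
		return "Gang" // minCount = parallelism
	}
	return "Basic"
}

func main() {
	fmt.Println(schedulingPolicyFor(Job{Parallelism: 4, Completions: 4, CompletionMode: "Indexed"}))    // Gang
	fmt.Println(schedulingPolicyFor(Job{Parallelism: 4, Completions: 8, CompletionMode: "Indexed"}))    // Basic
	fmt.Println(schedulingPolicyFor(Job{Parallelism: 4, Completions: 4, CompletionMode: "NonIndexed"})) // Basic
}
```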
> - The alpha release targets simple, static batch workloads where the workload requirements are known at creation time.
> - Each Job maps to one `PodGroup`. All pods in the Job are identical from a scheduling policy perspective.
> - The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
> - There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
I think there has to be one, but maybe it can be as simple as "create your own Workload that basically explicitly states the Basic policy" and teach the job controller to adopt it in that case (create Workload only if it doesn't exist - i.e. support BYOW - bring-your-own-workload).
> Basic policy" and teach job controller to adopt it in that case

That's roughly my previous question: how do we define the existing workload for the job controller to recognize and adopt?
Updated based on our conversation today. PTAL.
> - Check if a `Workload` object already exists for this `Job`.
> - If not, determine the appropriate scheduling policy and create the `Workload` object with the determined policy.
> - If it already exists, verify the existing `Workload` matches the `Job` spec. If not, update the `Workload` object.
What do you mean by "verify if it matches"?
I think we probably want to verify the structure (there is a single PodGroup) or sth, but if someone set Basic and we believe it should be Gang - I definitely wouldn't update it. We should treat it as an opt-out mechanism.
FWIW, I believe that updating in general sounds like a bad pattern - I would rather say that in case the structure doesn't match at all, we should give up and create the pods without the Workload/PodGroup at all.
Also Workload is immutable in 1.36, so that's not an option.
Should this be a "silent give-up" (I ignore Workload and proceed with old logic) or should we have an admission blocking creation of Job objects with illegal Workloads attached?
I'd like to see here clear criteria for what it means that a Workload exists for a Job. In one place I've seen information about ownerRef, but I'm not seeing it here.
> Should this be a "silent give-up" (I ignore Workload and proceed with old logic) or should we have an admission blocking creation of Job objects with illegal Workloads attached?
You can always have races:
I admit the Job because no workload exists but at the same time a Workload object is created.
So that's not a full solution.
I didn't want to say it's a "silent give-up" - we should set some condition/emit event/whatever.
But if we don't know what to do with that Workload, at least for now we should error on the side of "don't break existing things".
Updated this section according to the discussion here. PTAL.
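The "create, adopt, or give up" outcome of this thread can be sketched with hypothetical, simplified types (real code would use the API machinery and set a condition or emit an event on give-up, as noted above):

```go
package main

import "fmt"

// Hypothetical minimal shapes for the decision sketch.
type OwnerRef struct{ Kind, Name string }

type Workload struct {
	Name      string
	Owners    []OwnerRef
	PodGroups int // structural check: the controller expects exactly one
}

// syncDecision returns what the Job controller should do for the Job named
// jobName: create a Workload, adopt an existing one it recognizes via
// ownerRef, or give up (surface an event/condition and fall back to the
// pre-Workload pod-creation logic rather than break existing things).
func syncDecision(existing *Workload, jobName string) string {
	if existing == nil {
		return "create"
	}
	ownedByJob := false
	for _, o := range existing.Owners {
		if o.Kind == "Job" && o.Name == jobName {
			ownedByJob = true
		}
	}
	// Workload is immutable in 1.36, so a mismatched structure is never
	// updated in place.
	if !ownedByJob || existing.PodGroups != 1 {
		return "give-up"
	}
	return "adopt"
}

func main() {
	fmt.Println(syncDecision(nil, "train"))                                                           // create
	fmt.Println(syncDecision(&Workload{Owners: []OwnerRef{{"Job", "train"}}, PodGroups: 1}, "train")) // adopt
	fmt.Println(syncDecision(&Workload{Owners: []OwnerRef{{"Job", "train"}}, PodGroups: 2}, "train")) // give-up
}
```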
> - If it already exists, verify the existing `Workload` matches the `Job` spec. If not, update the `Workload` object.
> - Check if a `PodGroup` object already exists for this `Job`.
> - If not, create the `PodGroup` object referencing the `Workload`
> - If it already exists, verify the existing `PodGroup` correctly references the `Workload`.
Hmm - if it isn't, how do we even know that it is "the PodGroup" that we should use?
I'll ask differently: if we're assuming that the order of ownerRefs is always Workload->PodGroup, do we have a validation in place to ensure that? If not, we should establish one; otherwise users will start creating different combinations, and whatever logic we come up with in the job controller will either be over-complicated or won't work for most cases.
Rewrote this section to address your feedback. PTAL.
andreyvelich left a comment
Thank you @helayoty!
I left a few comments.
> - Each Job maps to one `PodGroup`. All pods in the Job are identical from a scheduling policy perspective.
> - The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
> - There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
> - When gang scheduling is active (parallel jobs), changes to `spec.parallelism` are blocked via admission validation because this would require changing `minCount`
As I mentioned above, I don't understand why we need this limitation.
Why admission validation? We can do this conditionally (based on FG on/off state) in job validation, no?
Agree. No need for admission validation here. Updated. PTAL.
> - The `minCount` field in the Workload's `GangSchedulingPolicy` mirrors the Job's parallelism.
> - There is no mechanism to opt-out of `Workload`/`PodGroup` creation for indexed (parallel) jobs if feature gate is enabled.
> - When gang scheduling is active (parallel jobs), changes to `spec.parallelism` are blocked via admission validation because this would require changing `minCount`
> - If a Job has `ownerReferences` indicating it is managed by another controller (i.e., JobSet), the Job controller
```suggestion
- If a Job has `ownerReferences` indicating it is managed by another controller (i.e., JobSet, TrainJob), the Job controller
```
That answers my question, roughly. But at the same time this means that CronJob-owned Jobs will not be capable of using Workloads, unless we introduce workloads there as well.
> But at the same time this means that CronJob-owned Jobs will not be capable of using Workloads, unless we introduce workloads there as well.
This is my assumption.
> But at the same time this means that CronJob-owned Jobs will not be capable of using Workloads
I would say it's desired - let's not try to boil the ocean here.
> The Job controller must create objects in a strict order to ensure that the scheduler can properly validate pods
> against their scheduling policy before attempting to schedule them. The order is as follows:
> 1. `Workload` object
Is that really needed? I asked previously; it doesn't matter in which order objects are created, since kube-scheduler will wait for Workload and PodGroup objects if Pods have workloadRef.
I'd say it's needed. Added more justification. PTAL.
> ### Goals
>
> - Job controller automatically creates `Workload` and `PodGroup` objects for Jobs that require gang scheduling.
> - Job with `parallelism > 1` will use `GangSchedulingPolicy` with `minCount = parallelism`
I guess this is fine for alpha, but AFAIK it's a quite common case to start a job with parallelism=1 and later scale it up as a gang. If we go with what is proposed here, the Job will be created without gang-scheduling initially and it's not clear how to change it later.
Overall, I think I'm fine with having this "default gang iff parallelism > 1" in alpha - given that alpha features need to be enabled explicitly and the contract is kind of "use at your own risk".
However, for beta promotion we need to have a full API on the Job side figured out - with things like opt-in / opt-out and support for common use-cases (like going from parallelism 1->N and having gang scheduling).
When introducing a new API field the feature must start in alpha.
So if we want to add an API I see the following path:
alpha - 1.36 (rough sketch of implementation without API)
alpha - 1.37 (api for opt in / opt out)
beta - 1.38
> - Parallelism change is blocked for gang-scheduled Jobs and allowed for basic-scheduled Jobs
> - Job deletion cascades to Workload and PodGroup deletion
> - Feature gate disabled: Jobs work without Workload/PodGroup creation
> - Jobs with ownerReferences (managed by higher-level controllers) do not create Workload/PodGroup
Probably also add tests verifying the actual ownerRefs values for job, workload and podgroup, so that it matches the expected ordering.
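The ownerRef-chain assertion suggested here could look roughly like this in a test, using simplified types (a real integration test would read the objects from the API server): the Workload is owned by the Job, and the PodGroup by the Workload.

```go
package main

import "fmt"

// ref is a hypothetical minimal owner reference.
type ref struct{ Kind, Name string }

// ownedBy reports whether owners contains a reference of the given kind/name.
func ownedBy(owners []ref, kind, name string) bool {
	for _, o := range owners {
		if o.Kind == kind && o.Name == name {
			return true
		}
	}
	return false
}

func main() {
	workloadOwners := []ref{{"Job", "train"}}
	podGroupOwners := []ref{{"Workload", "train-workload"}}

	// Expected chain: Job -> Workload -> PodGroup.
	fmt.Println(ownedBy(workloadOwners, "Job", "train"))               // true
	fmt.Println(ownedBy(podGroupOwners, "Workload", "train-workload")) // true
	fmt.Println(ownedBy(podGroupOwners, "Job", "train"))               // false
}
```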
Signed-off-by: helayoty <heelayot@microsoft.com>
Signed-off-by: helayoty <heelayot@microsoft.com>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: helayoty, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
> ### Validation for Parallelism Changes
>
> The Job API validation rejects updates that change `spec.parallelism` when the feature gate is enabled and the Job uses gang scheduling, since changing this field would require changing `minCount` in the `Workload` object, which is immutable.
If there is a workload object created with basic policy (to opt-out) is the elastic job still not allowed?
We decided not to create workload+podgroup in all other cases, except for when it's requesting gang scheduling. At least in the initial alpha. See #5871 (comment) for discussion.
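The conditional validation discussed in this thread can be sketched as follows. This is a hypothetical simplification of Job update validation (not the actual kube-apiserver code): the update is rejected only when the feature gate is on and the Job is gang-scheduled.

```go
package main

import "fmt"

// validateParallelismUpdate rejects a parallelism change only when the
// feature gate is enabled and the Job is gang-scheduled, because the
// mirrored minCount on the Workload object is immutable. Basic-scheduled
// Jobs and clusters with the gate off keep today's behavior.
func validateParallelismUpdate(featureGateOn, gangScheduled bool, oldP, newP int32) error {
	if featureGateOn && gangScheduled && oldP != newP {
		return fmt.Errorf("spec.parallelism is immutable for gang-scheduled Jobs (would change Workload minCount)")
	}
	return nil
}

func main() {
	fmt.Println(validateParallelismUpdate(true, true, 4, 8) != nil)  // true: blocked
	fmt.Println(validateParallelismUpdate(true, false, 4, 8) != nil) // false: basic policy, allowed
	fmt.Println(validateParallelismUpdate(false, true, 4, 8) != nil) // false: gate off
}
```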
In the case of Indexed Jobs, index 0 is sometimes treated as a special role, then we want to schedule index-0 and other indexes separately.
In this situation, are we able to manually create 2 PodGroups, each for the index-0 Pod and other index Pods by using the Workload Basic policy?
> In this situation, are we able to manually create 2 PodGroups, each for the index-0 Pod and other index Pods by using the Workload `Basic` policy?
Yes, you can always create Workload+PodGroups in your desired configuration. In that case the job controller won't do it for you.
Thank you for describing that.
That sounds reasonable. We might be able to natively support various PodGroup creation patterns based on real usecases in the future.
But, I agree with keeping a minimum at this time.
> If there is a workload object created with basic policy (to opt-out) is the elastic job still not allowed?
@dom4ha - the answer by @soltysh above answers that.
You can always create your own Workload/PodGroup if you want. That will just work.
What we wanted to ensure is that if the Workload is created by us, we will not allow for scaling it.
Creating your own workload resources basically allows you to freely manage the job, so it's definitely a reasonable option. Definitely worth documenting.

+1
From User Stories:
or
IIUIC, the job controller will create Workload + PodGroup + Gang scheduling policy automatically for both of these cases and might block the second example from running. What was the main reason for making gang scheduling the default behavior instead of opt-in? Please let me know if you have already discussed this scenario or if I have missed something.
The goal was to experiment and NOT introduce API fields in the Job resource. We haven't reached an agreement what the settings should be on the Job side for that functionality, and this approach allows us to experiment (while the feature is still alpha) and better understand how to expose the necessary knobs. We want to avoid expanding the Job API in ways that we know will change, since both Workload API and PodGroup API are under heavy development currently.
@atiratree #5548 is roughly what I would like to avoid, and that triggered the entire discussion about not expanding the API until we have a clear picture.
I am not suggesting that we introduce a new API / fields. I am just curious about the scheduling aspect. I will bring this up at the next SIG scheduling meeting for further discussion.
/area workload-aware
@atiratree, there are a few points that we need to clarify:
That said, your data engineer scenario is a real one, as not every Indexed job with C==P needs gang scheduling; in that case the user can create an empty workload to reference on their Job object.
It is intentional now, but it will become the default behavior when both of these feature gates graduate.
True. This was by design. The default is to gang for these types of Jobs. If users don't want to, they can opt-out by creating their own Workload.
The beta graduation criteria clearly state that for beta we will expose necessary knobs for users to tweak when and how workload/podgroup are created, which will give users the ability to opt-in or opt-out on demand. As I stated before, the alpha stage is to figure out WHAT the API should look like, b/c we don't have a clear picture. Heba, Wojtek, Eric, Matt and I spent significant time going back and forth on the shape, and we've decided that this path will ensure we can come up with a long-term API, rather than ad-hoc changes which are then costly to maintain.
After further clarification from SIG Scheduling, this feature is being implemented as experimental and will most likely have to be changed in Beta. The use cases and breaking changes (against the stable Job API) will have to be analysed and discussed again.
To clarify - this was only a decision for Alpha.
One-line PR description: Integrate `Workload` and `PodGroup` APIs with the Job controller to support gang-scheduling.
Issue link: WAS: Integrate Workload APIs with Job controller #5547
Other comments: See other KEPs
/sig scheduling