-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Add TrainJob and TrainingRuntime APIs #2223
KEP-2170: Add TrainJob and TrainingRuntime APIs #2223
Conversation
Signed-off-by: Andrey Velichkevich <[email protected]>
// Number of training nodes. | ||
// TODO (andreyvelich): Do we want to support dynamic num of nodes in TrainJob for PyTorch elastic: `--nnodes=1:4` ? | ||
NumNodes *int32 `json:"numNodes,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y What do you think about using *string
here since for the elastic training user can set:
torchrun --nnodes=1:4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer int
here to refer to active/effective number instead of string
.
IMO, the string value of nnodes if calculated from a config from like ElasticPolicy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, we can specify the elastically policy in the JobSetSpec. So, let's use the typed int here.
After we find some advantages for string, we can introduce IntOrString like Deployment rollingUpdate.
Pull Request Test Coverage Report for Build 10566761788Details
💛 - Coveralls |
} | ||
|
||
// TranJobList is a collection of training jobs. | ||
type TranJobList struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrainJobList
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch!
metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
||
// Specification of the desired ClusterTrainingRuntime. | ||
Spec TrainingRuntimeSpec `json:"spec,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the spec ever be empty for ClusterTrainingRUntime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, but I noticed that for all Kubernetes APIs the spec
is set with omitempty
: https://github.com/kubernetes/api/blob/master/apps/v1/types.go#L820.
@tenzen-y @kannon92 Any specific reason why we do this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, in any case, the spec
field is defined as an optional field in the Kubernetes. So, the optional TrainingRuntime spec would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL.
type TorchSpec struct { | ||
// Number of processes per node. | ||
// This value is inserted into the `--nproc-per-node` argument of the `torchrun` CLI. | ||
// Supported values: `auto`, `cpu`, `gpu`, or int value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably use KubeBuilder validations for the enums here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed offline, we will add validations in the separate PRs.
Tracking issue:
type MLSpec struct { | ||
// Configuration for the PyTorch runtime. | ||
TorchSpec *TorchSpec `json:"torchSpec,omitempty"` | ||
|
||
// Configuration for the MPI Runtime. | ||
MPISpec *MPISpec `json:"mpiSpec,omitempty"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern I have is that frameworks configurations/spec change quite often. Have we considered using configmap for this so that we don't have a lot of responsibilities to maintain the compatibility, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our goal is not to add all frameworks configurations here, but only parameters that require additional orchestration such as Elastic Policy, MPISpec. For example, in the future we can add SlurmSpec or FluxSpec here as we discussed with @vsoch here: #2171 (comment).
For Torch
our assumption is that torchrun
CLI is quite stable and probably won't change in the near future.
@terrytangyuan How do you think we can use ConfigMap for those parameters ?
@andreyvelich First of all, could you generate / create Unless those functions, we can not use the API in the controllers/webhooks. |
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
I registered APIs with scheme and added deepcopygen via controller-gen. I think, we can add the defaulters and other parameters required for clients, listers, informers, in the following PRs. |
That sounds good to me. We can create a separate issue "KEP-2170: Provide client-go library for TrainJob and TrainingRuntime". |
Created: #2224 |
|
||
// PodGroupSpec represents a PodGroup configuration to enable gang-scheduling. | ||
type PodGroupSpec struct { | ||
// Plugin for the gang-scheduling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going forward with a default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, the gang-scheduling is disabled for TrainJob, since it requires plugin to be installed (coscheduling or volcano).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
NumNodes *int32 `json:"numNodes,omitempty"` | ||
|
||
// JobSet configuration which will be used by TrainJob. | ||
JobSetSpec *jobsetv1alpha2.JobSetSpec `json:",inline"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why inline here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we discussed it before, here is the example: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2#pytorch-distributed-runtime.
This will make API consistent with the JobSet. WDYT @kannon92 @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking because I don't know what that argument actually does. I usually only see if for really small objects. Never a Spec so not sure if that means literally we are putting the object "inline" or it skips protobuf or api generation? like I've seen it for type..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to give user functionality to set the whole JobSet spec under Training Runtimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type TrainingRuntimeSpec struct {
MLPolicy *MLPolicy `json:"mlPolicy,omitempty"`
JobSetSpec jonsetv2alpha2.JobSetSpec `json:"spec"`
}
type MLPolicy struct {
// Number of training nodes.
// Defaults to 1.
NumNodes *int32 `json:"numNodes,omitempty"`
MLPolicySource `json:",inline"`
}
type MLPolicySource struct {
PyTorch ...
}
Maybe, we want to dedicated field for the JobSetSpec so that we can identify the JobSetSpec.
@andreyvelich @kannon92 WDYT?
metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
||
// Specification of the desired TrainingRuntime. | ||
Spec TrainingRuntimeSpec `json:"spec"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spec TrainingRuntimeSpec `json:"spec"` | |
Spec TrainingRuntimeSpec `json:"spec, omitempty"` |
Similar to other specs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice Catch!
} | ||
|
||
// PodSpecOverrides represents the custom overrides that will be applied for the TrainJob's resources. | ||
type PodSpecOverrides struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y do we need schedulingGates here? I can't remember what we decided a few weeks ago..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do not need to add the schedulingGate
since we would expect the schedulingGates are added by the webhook based on the Pod creation event.
Indeed, Kueue adds the schedulingGates to Pods, not Job by webhook, when the Pods are created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work !
Suspend *bool `json:"suspend,omitempty"` | ||
|
||
// ManagedBy field indicates the controller that manages a TrainJob. | ||
ManagedBy *string `json:"managedBy,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the type of ManagedBy
is string, I propose ControllerName
inspired by SchedulerName
.
Since ManagedBy
reminder me of managedFields
which is another struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a byproduct of https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/4368-support-managed-by-for-batch-jobs so I think we should not change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've missed it.
// Number of training nodes. | ||
// TODO (andreyvelich): Do we want to support dynamic num of nodes in TrainJob for PyTorch elastic: `--nnodes=1:4` ? | ||
NumNodes *int32 `json:"numNodes,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer int
here to refer to active/effective number instead of string
.
IMO, the string value of nnodes if calculated from a config from like ElasticPolicy.
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
NumProcPerNode *int32 `json:"numProcPerNode,omitempty"` | ||
|
||
// Implementation name for the MPI to create the appropriate hostfile. | ||
MPIImplementation *MPIImplementation `json:"mpiImplementation"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since MPIImplementation is a const, does this need to be a pointer? With a pointer, a nil
is a valid value. Do we want user to pass a nil value?
Also with omitempty
not specified, the zero value of this would be nil
. The zero value will be included in all patch requests which seems unnecessary.
Maybe you meant to include the omitempty
but accept a nil value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we should remove pointer from here. @tenzen-y What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Implementation name for the MPI to create the appropriate hostfile.
// Defaults to OpenMPI
MPIImplementation *MPIImplementation `json:"mpiImplementation,omitempty"`
In most cases of optional fields, we should use the omitempty
tag.
Please refer to more details here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#optional-vs-required
// Every training runtime contains `trainer` container which represents Trainer. | ||
type Trainer struct { | ||
// Docker image for the training container. | ||
Image string `json:"image,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could Image be a pointer to the string? The validation can differentiate between unspecified and empty string.
This marks the field as required in the CRD but a nil value can be passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @shravan-achar Do we want to introduce pointer or +optional
for image since user can skip the image for Trainer?
E.g. Kubernetes for container users +optional
validation: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2708C6-L2709
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, we can skip specifying this image field when the TrainingRuntime has the image.
// +optional
Image *string `json:"image,omitempty"`
Regarding to the optional vs required, please refer to the above my comment.
// the `dataset-initializer` container in the `Initializer` Job. | ||
type DatasetConfig struct { | ||
// Storage uri for the dataset provider. | ||
StorageUri string `json:"storageUri"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could StorageUri be a pointer to the string? The validation could differentiate between unspecified and empty string.
Also omitempty
is still necessary? From a PATCH perspective, if this field is unspecified, its zero value will be included in the serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the storageUri optional given that Training Runtime can contain the default dataset and model.
// InputModel represents the desired pre-trained model configuration. | ||
type InputModel struct { | ||
// Storage uri for the model provider. | ||
StorageUri string `json:"storageUri"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here as on L141
Env []corev1.EnvVar `json:"env,omitempty"` | ||
|
||
// Reference to the TrainJob's secrets to download dataset. | ||
SecretRef corev1.SecretReference `json:"secretRef,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not take a pointer to corev1.SecretReference
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
// PodSpecOverrides represents the custom overrides that will be applied for the TrainJob's resources. | ||
type PodSpecOverrides struct { | ||
// Names of the training job replicas in the training runtime template to apply the overrides. | ||
TargetReplicatedJobs []string `json:"targetReplicatedJobs"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't see the benefit of skipping omitempty
. Probably we should include almost everywhere. Let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we skip it from here, how we can make the targetReplicatedJobs
mandatory parameter ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's possible to use Kubebuilder tags to make it required on the CRD. The omitempty will help with serialization. kubernetes-sigs/controller-tools#944
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TargetReplicatedJobs []string `json:"targetReplicatedJobs"` | |
// +required | |
TargetReplicatedJobs []string `json:"targetReplicatedJobs"` |
We can make this field as required like this.
@andreyvelich: GitHub didn't allow me to assign the following users: shravan-achar, kannon92. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Signed-off-by: Andrey Velichkevich <[email protected]>
d70db47
to
49a004c
Compare
Signed-off-by: Andrey Velichkevich <[email protected]>
LGTM on my end. |
We had some discussions with @tenzen-y about APIs and we proposed the following changes:
|
/rerun-all |
} | ||
|
||
// TorchMLSpecSource represents a PyTorch runtime configuration. | ||
type TorchMLSpecSource struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I removed standalone
from TorchMLSpecSource API.
I think, we can detect when numNodes=1
and numProcPerNode>1
to set the standalone
parameter which sets the default values for rendezvous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When users will need it, we can add it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me.
Thank you for the update.
Signed-off-by: Andrey Velichkevich <[email protected]>
We made the final changes with @tenzen-y:
If we don't have any followup suggestions, we can merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates!
/hold
/lgtm
/approve
Feel free to merge this PR.
Additionally, could you update the KEP in a separate PR?
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold |
Signed-off-by: Andrey Velichkevich <[email protected]>
338cc29
to
7aa4094
Compare
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks everyone for the review, we should be ready to merge it.
I will create followup PR to update KEP with the correct APIs.
/hold cancel
Fixes: #2206
I added APIs for TrainJob, TrainingRuntime, and ClusterTrainingRuntime resources.
/assign @kubeflow/wg-training-leads @kannon92 @mimowo @vsoch @ahg-g @kuizhiqing @alculquicondor @zw0610 @franciscojavierarceo @shravan-achar
/hold for review.