-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set correct ENV for PytorchJob to support torchrun #1840
Conversation
06910d6
to
d29362c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @kuizhiqing!
I left few comments.
@kuizhiqing Can you address @tenzen-y 's comments as well? We can merge post that. Thanks for this |
e3953d6
to
b50fd83
Compare
Done |
@@ -79,4 +79,9 @@ func SetDefaults_PyTorchJob(job *PyTorchJob) { | |||
} | |||
// Set default elastic policy. | |||
setElasticPolicy(job) | |||
|
|||
if job.Spec.NprocPerNode == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check elasticPolicy? if elasticPolicy.NProcPerNode is set, validateNprocPerNode
rejects to create the Job, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you add a unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I'm little bit confused here, for now, we are going to just leave nprocPerNode in elasticPolicy work with some warning and deprecate it in the future or it will not work since this version ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuizhiqing Sorry for the confusion.
I meant the following checks:
if job.Spec.NprocPerNode == nil { | |
if (job.Spec.ElasticPolicy != nil && job.Spec.ElasticPolicy.NProcPerNode == nil) || (job.Spec.ElasticPolicy == nil) { | |
if job.Spec.NprocPerNode == nil { |
If we don't check elastciPolicy and then both elastciPolicy.NProcPerNode
and job.Spec.NprocPerNode
are set, the validateNprocPerNode
function rejects the request, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y OK,Thanks for clarifying. I will handle this and those UT as you say next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -79,4 +79,9 @@ func SetDefaults_PyTorchJob(job *PyTorchJob) { | |||
} | |||
// Set default elastic policy. | |||
setElasticPolicy(job) | |||
|
|||
if job.Spec.NprocPerNode == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you add a unit test?
return nil | ||
} | ||
|
||
func validateNprocPerNode(pytorchJob *PyTorchJob) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@tenzen-y @johnugeorge PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuizhiqing I left some comments.
@@ -19,6 +19,10 @@ import ( | |||
"k8s.io/apimachinery/pkg/runtime" | |||
) | |||
|
|||
var ( | |||
defaultNprocPerNode = "auto" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great constants :)
@@ -152,3 +152,25 @@ func TestSetElasticPolicy(t *testing.T) { | |||
}) | |||
} | |||
} | |||
|
|||
func TestSetDefaultNprocPerNode(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this test doesn't verify some edge cases.
For example, the current case can not verify that .spec.nprocPerNode
isn't overridden when elasticPolicy
is nil and .spec.elasticPolicy
isn't nil.
So should we add more test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -60,7 +60,7 @@ func setPodEnv(obj interface{}, podTemplateSpec *corev1.PodTemplateSpec, rtype, | |||
// Ref https://stackoverflow.com/questions/59812009/what-is-the-use-of-pythonunbuffered-in-docker-file. | |||
podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{ | |||
Name: "PYTHONUNBUFFERED", | |||
Value: "0", | |||
Value: "1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we change this in a separate PR since PaddleJob isn't related to PyTorchJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
} | ||
return 1 | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? IIUC, we default spec.NproPerNode
to auto
if spec.NproPerNode
is nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getNprocPerNodeInt
is a helper function to calculate world size, it will not effect the env.
When nproc_per_node set to auto
, it means the number of process will be determinate in the user process phase, in this case, world size env will not be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks for the clarification!
So should we add a comment to this function? And can we add validation to pkg/apis/kubeflow.org/v1/pytorch_validation.go like the following?
if job.spec.NrocPerNode != nil {
if np, err := strconv.Atoi(job.spec.NrocPerNode); err != nil && (np != "auto" || np != "CPU" || np == "GPU") {
Error("error")
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should do that, let's say if PyTorch framework support values other than those in the list, e.g. "XPU", the operator will not work anymore.
Anyway, in my opinion, the operator should decouple with the framework in this level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense.
Is there any document about supported values by PyTorch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I found the logic for the nproc_per_node
: https://github.com/pytorch/pytorch/blob/26f7f470df64d90e092081e39507e4ac751f55d6/torch/distributed/run.py#L629-L658
Can you add this link to pkg/apis/kubeflow.org/v1/pytorch_types.go.
And can you add the following comment you posted to this function?
When nproc_per_node set to auto, it means the number of process will be determinate in the user process phase, in this case, world size env will not be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Others LGTM. Thanks for the big contribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
return 1 | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks for the clarification!
So should we add a comment to this function? And can we add validation to pkg/apis/kubeflow.org/v1/pytorch_validation.go like the following?
if job.spec.NrocPerNode != nil {
if np, err := strconv.Atoi(job.spec.NrocPerNode); err != nil && (np != "auto" || np != "CPU" || np == "GPU") {
Error("error")
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuizhiqing Thanks for the great work!
/lgtm
/assign @johnugeorge @andreyvelich
Thank you for this contribution @kuizhiqing ! |
Thank you for this contribution @kuizhiqing ! |
Thanks @kuizhiqing for contributing this /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnugeorge, kuizhiqing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
This PR adds environment variables to support different distributed training launch methods:
python train.py
python -m torch.distributed.launch train.py
python -m torch.distributed.run train.py
This PR makes the following changes:
nprocPerNode
at the top API level. Note that this is different from the previousnProcPerNode
and relates tonproc_per_node
.nprocPerNode
type to string, which is consistent with PyTorch.RemovesnProcPerNode
from thespec.elasticPolicy
API section.EnvNNodes
toEnvNnodes
to match--nnodes
.WORLD_SIZE
environment variable tototalReplicas
*nprocPerNode
.PET_NPROC_PER_NODE
for each pod.PET_NODE_RANK
for each pod.PET_NNODES
for non-elastic mode.PET_MASTER_PORT
/PET_MASTER_ADDR
equals toMASTER_PORT
/MASTER_ADDR
for compatibility.References:
Checklist: