Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set correct ENV for PytorchJob to support torchrun #1840

Merged
merged 9 commits into from
Jul 12, 2023

Conversation

kuizhiqing
Copy link
Member

@kuizhiqing kuizhiqing commented Jun 26, 2023

What this PR does / why we need it:

This PR adds environment variables to support different distributed training launch methods:

  • python train.py
  • python -m torch.distributed.launch train.py
  • python -m torch.distributed.run train.py

This PR makes the following changes:

  • Adds nprocPerNode at the top API level. Note that this is different from the previous nProcPerNode and relates to nproc_per_node.
  • Change nprocPerNode type to string, which is consistent with PyTorch.
  • Removes nProcPerNode from the spec.elasticPolicy API section.
  • Changes EnvNNodes to EnvNnodes to match --nnodes.
  • Sets the WORLD_SIZE environment variable to totalReplicas * nprocPerNode.
  • Adds PET_NPROC_PER_NODE for each pod.
  • Adds PET_NODE_RANK for each pod.
  • Adds PET_NNODES for non-elastic mode.
  • Sets PET_MASTER_PORT/PET_MASTER_ADDR equals to MASTER_PORT/MASTER_ADDR for compatibility.

References:

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Jun 26, 2023

Pull Request Test Coverage Report for Build 5520229973

  • 44 of 68 (64.71%) changed or added relevant lines in 7 files are covered.
  • 777 unchanged lines in 23 files lost coverage.
  • Overall coverage decreased (-1.2%) to 33.134%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controller.v1/pytorch/elastic.go 3 4 75.0%
pkg/apis/kubeflow.org/v1/pytorch_defaults.go 5 7 71.43%
pkg/apis/kubeflow.org/v1/pytorch_validation.go 4 8 50.0%
pkg/controller.v1/pytorch/envvar.go 24 28 85.71%
pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go 0 5 0.0%
pkg/apis/kubeflow.org/v1/openapi_generated.go 0 8 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/apis/kubeflow.org/v1/paddlepaddle_defautls.go 2 0%
pkg/apis/kubeflow.org/v1/pytorch_defaults.go 2 39.62%
pkg/reconciler.v1/common/job.go 2 8.1%
pkg/controller.v1/tensorflow/util.go 3 93.75%
pkg/controller.v1/paddlepaddle/envvar.go 4 74.11%
pkg/controller.v1/pytorch/elastic.go 5 76.92%
pkg/controller.v1/pytorch/envvar.go 5 86.46%
pkg/common/util/util.go 6 42.22%
pkg/controller.v1/mpi/mpijob.go 10 91.06%
pkg/util/status.go 11 81.82%
Totals Coverage Status
Change from base Build 5371475460: -1.2%
Covered Lines: 3211
Relevant Lines: 9691

💛 - Coveralls

@kuizhiqing kuizhiqing changed the title fix pytorch WORLD_SIZE env inconsistent [WIP] fix pytorch WORLD_SIZE env inconsistent Jun 26, 2023
@kuizhiqing kuizhiqing changed the title [WIP] fix pytorch WORLD_SIZE env inconsistent Set correct ENV for PytorchJob to support torchrun Jun 28, 2023
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @kuizhiqing!
I left few comments.

pkg/apis/kubeflow.org/v1/pytorch_types.go Outdated Show resolved Hide resolved
pkg/apis/kubeflow.org/v1/pytorch_types.go Outdated Show resolved Hide resolved
pkg/controller.v1/pytorch/envvar.go Show resolved Hide resolved
pkg/controller.v1/pytorch/envvar.go Show resolved Hide resolved
pkg/apis/kubeflow.org/v1/pytorch_types.go Outdated Show resolved Hide resolved
pkg/controller.v1/pytorch/elastic.go Outdated Show resolved Hide resolved
pkg/controller.v1/pytorch/envvar.go Outdated Show resolved Hide resolved
@johnugeorge
Copy link
Member

johnugeorge commented Jun 30, 2023

@kuizhiqing Can you address @tenzen-y 's comments as well? We can merge post that. Thanks for this

@kuizhiqing
Copy link
Member Author

@kuizhiqing Can you address @tenzen-y 's comments as well? We can merge post that. Thanks for this

Done

@@ -79,4 +79,9 @@ func SetDefaults_PyTorchJob(job *PyTorchJob) {
}
// Set default elastic policy.
setElasticPolicy(job)

if job.Spec.NprocPerNode == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check elasticPolicy? if elasticPolicy.NProcPerNode is set, validateNprocPerNode rejects to create the Job, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you add a unit test?

Copy link
Member Author

@kuizhiqing kuizhiqing Jul 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I'm little bit confused here, for now, we are going to just leave nprocPerNode in elasticPolicy work with some warning and deprecate it in the future or it will not work since this version ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kuizhiqing Sorry for the confusion.
I meant the following checks:

Suggested change
if job.Spec.NprocPerNode == nil {
if (job.Spec.ElasticPolicy != nil && job.Spec.ElasticPolicy.NProcPerNode == nil) || (job.Spec.ElasticPolicy == nil) {
if job.Spec.NprocPerNode == nil {

If we don't check elastciPolicy and then both elastciPolicy.NProcPerNode and job.Spec.NprocPerNode are set, the validateNprocPerNode function rejects the request, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y OK,Thanks for clarifying. I will handle this and those UT as you say next week.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -79,4 +79,9 @@ func SetDefaults_PyTorchJob(job *PyTorchJob) {
}
// Set default elastic policy.
setElasticPolicy(job)

if job.Spec.NprocPerNode == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you add a unit test?

return nil
}

func validateNprocPerNode(pytorchJob *PyTorchJob) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a unit test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@kuizhiqing
Copy link
Member Author

@tenzen-y @johnugeorge PTAL

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kuizhiqing I left some comments.

@@ -19,6 +19,10 @@ import (
"k8s.io/apimachinery/pkg/runtime"
)

var (
defaultNprocPerNode = "auto"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great constants :)

@@ -152,3 +152,25 @@ func TestSetElasticPolicy(t *testing.T) {
})
}
}

func TestSetDefaultNprocPerNode(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this test doesn't verify some edge cases.
For example, the current case can not verify that .spec.nprocPerNode isn't overridden when elasticPolicy is nil and .spec.elasticPolicy isn't nil.

So should we add more test cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -60,7 +60,7 @@ func setPodEnv(obj interface{}, podTemplateSpec *corev1.PodTemplateSpec, rtype,
// Ref https://stackoverflow.com/questions/59812009/what-is-the-use-of-pythonunbuffered-in-docker-file.
podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
Name: "PYTHONUNBUFFERED",
Value: "0",
Value: "1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change this in a separate PR since PaddleJob isn't related to PyTorchJob.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

}
return 1
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? IIUC, we default spec.NproPerNode to auto if spec.NproPerNode is nil.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getNprocPerNodeInt is a helper function to calculate world size, it will not effect the env.

When nproc_per_node set to auto, it means the number of process will be determinate in the user process phase, in this case, world size env will not be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the clarification!
So should we add a comment to this function? And can we add validation to pkg/apis/kubeflow.org/v1/pytorch_validation.go like the following?

if job.spec.NrocPerNode != nil {
   if np, err := strconv.Atoi(job.spec.NrocPerNode); err != nil && (np != "auto" || np != "CPU" || np == "GPU") {
     Error("error")
  }
}

Copy link
Member Author

@kuizhiqing kuizhiqing Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should do that, let's say if PyTorch framework support values other than those in the list, e.g. "XPU", the operator will not work anymore.

Anyway, in my opinion, the operator should decouple with the framework in this level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense.
Is there any document about supported values by PyTorch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I found the logic for the nproc_per_node: https://github.com/pytorch/pytorch/blob/26f7f470df64d90e092081e39507e4ac751f55d6/torch/distributed/run.py#L629-L658

Can you add this link to pkg/apis/kubeflow.org/v1/pytorch_types.go.

And can you add the following comment you posted to this function?

When nproc_per_node set to auto, it means the number of process will be determinate in the user process phase, in this case, world size env will not be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Others LGTM. Thanks for the big contribution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

pkg/controller.v1/pytorch/pytorchjob_controller_test.go Outdated Show resolved Hide resolved
}
return 1
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the clarification!
So should we add a comment to this function? And can we add validation to pkg/apis/kubeflow.org/v1/pytorch_validation.go like the following?

if job.spec.NrocPerNode != nil {
   if np, err := strconv.Atoi(job.spec.NrocPerNode); err != nil && (np != "auto" || np != "CPU" || np == "GPU") {
     Error("error")
  }
}

pkg/controller.v1/xgboost/xgboost.go Outdated Show resolved Hide resolved
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kuizhiqing Thanks for the great work!
/lgtm
/assign @johnugeorge @andreyvelich

@andreyvelich
Copy link
Member

Thank you for this contribution @kuizhiqing !
/lgtm
/assign @johnugeorge

@itayvallach
Copy link

Thank you for this contribution @kuizhiqing !

@johnugeorge
Copy link
Member

Thanks @kuizhiqing for contributing this

/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, kuizhiqing

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants