Set correct ENV for PytorchJob to support torchrun #1840

kuizhiqing · 2023-06-26T08:41:35Z

What this PR does / why we need it:

This PR adds environment variables to support different distributed training launch methods:

python train.py
python -m torch.distributed.launch train.py
python -m torch.distributed.run train.py

This PR makes the following changes:

Adds nprocPerNode at the top level of the API. Note that this differs from the previous nProcPerNode under spec.elasticPolicy and corresponds to torchrun's --nproc_per_node.
Changes the nprocPerNode type to string, which is consistent with PyTorch.
~~Removes nProcPerNode from the spec.elasticPolicy API section.~~
Changes EnvNNodes to EnvNnodes to match --nnodes.
Sets the WORLD_SIZE environment variable to totalReplicas * nprocPerNode.
Adds PET_NPROC_PER_NODE for each pod.
Adds PET_NODE_RANK for each pod.
Adds PET_NNODES for non-elastic mode.
Sets PET_MASTER_PORT/PET_MASTER_ADDR equal to MASTER_PORT/MASTER_ADDR for compatibility.
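
The environment variables listed above can be sketched as follows. This is a minimal illustration, not the operator's actual code: `podParams` and `buildTorchEnv` are hypothetical names, and setting PET_NNODES to the total number of replicas in non-elastic mode is an assumption about the value, since the description only says the variable is added.

```go
package main

import (
	"fmt"
	"strconv"
)

// podParams is a hypothetical container for the per-pod inputs.
type podParams struct {
	TotalReplicas int    // number of pods (nodes) in the job
	NodeRank      int    // index of this pod
	NprocInt      int    // integer form of nprocPerNode, used only for WORLD_SIZE
	NprocPerNode  string // raw string value, passed through unchanged
	MasterAddr    string
	MasterPort    int
	Elastic       bool
}

// buildTorchEnv returns the variables named in the PR description for one pod.
func buildTorchEnv(p podParams) map[string]string {
	port := strconv.Itoa(p.MasterPort)
	env := map[string]string{
		"MASTER_ADDR":        p.MasterAddr,
		"MASTER_PORT":        port,
		"PET_MASTER_ADDR":    p.MasterAddr, // mirrors MASTER_ADDR for compatibility
		"PET_MASTER_PORT":    port,         // mirrors MASTER_PORT for compatibility
		"WORLD_SIZE":         strconv.Itoa(p.TotalReplicas * p.NprocInt),
		"PET_NPROC_PER_NODE": p.NprocPerNode,
		"PET_NODE_RANK":      strconv.Itoa(p.NodeRank),
	}
	if !p.Elastic {
		// Assumption: in non-elastic mode the node count equals the replica count.
		env["PET_NNODES"] = strconv.Itoa(p.TotalReplicas)
	}
	return env
}

func main() {
	env := buildTorchEnv(podParams{
		TotalReplicas: 2, NodeRank: 1, NprocInt: 4, NprocPerNode: "4",
		MasterAddr: "job-master-0", MasterPort: 23456,
	})
	fmt.Println(env["WORLD_SIZE"], env["PET_NNODES"], env["PET_NODE_RANK"])
}
```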

References:

Checklist:

Docs included if any changes are user facing

coveralls · 2023-06-26T08:44:55Z

Pull Request Test Coverage Report for Build 5520229973

44 of 68 (64.71%) changed or added relevant lines in 7 files are covered.
777 unchanged lines in 23 files lost coverage.
Overall coverage decreased (-1.2%) to 33.134%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controller.v1/pytorch/elastic.go	3	4	75.0%
pkg/apis/kubeflow.org/v1/pytorch_defaults.go	5	7	71.43%
pkg/apis/kubeflow.org/v1/pytorch_validation.go	4	8	50.0%
pkg/controller.v1/pytorch/envvar.go	24	28	85.71%
pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go	0	5	0.0%
pkg/apis/kubeflow.org/v1/openapi_generated.go	0	8	0.0%

Files with Coverage Reduction	New Missed Lines	%
pkg/apis/kubeflow.org/v1/paddlepaddle_defautls.go	2	0%
pkg/apis/kubeflow.org/v1/pytorch_defaults.go	2	39.62%
pkg/reconciler.v1/common/job.go	2	8.1%
pkg/controller.v1/tensorflow/util.go	3	93.75%
pkg/controller.v1/paddlepaddle/envvar.go	4	74.11%
pkg/controller.v1/pytorch/elastic.go	5	76.92%
pkg/controller.v1/pytorch/envvar.go	5	86.46%
pkg/common/util/util.go	6	42.22%
pkg/controller.v1/mpi/mpijob.go	10	91.06%
pkg/util/status.go	11	81.82%

Totals
Change from base Build 5371475460:	-1.2%
Covered Lines:	3211
Relevant Lines:	9691

💛 - Coveralls

andreyvelich

Thank you for this @kuizhiqing!
I left a few comments.

johnugeorge · 2023-06-30T06:11:42Z

@kuizhiqing Can you address @tenzen-y's comments as well? We can merge after that. Thanks for this.

kuizhiqing · 2023-06-30T17:09:12Z

> @kuizhiqing Can you address @tenzen-y's comments as well? We can merge after that. Thanks for this.

Done

tenzen-y · 2023-06-30T17:53:10Z

pkg/apis/kubeflow.org/v1/pytorch_defaults.go

@@ -79,4 +79,9 @@ func SetDefaults_PyTorchJob(job *PyTorchJob) {
 	}
 	// Set default elastic policy.
 	setElasticPolicy(job)
+
+	if job.Spec.NprocPerNode == nil {


Can we check elasticPolicy? If elasticPolicy.NProcPerNode is set, validateNprocPerNode refuses to create the Job, right?

Also, can you add a unit test?

@tenzen-y I'm a little bit confused here. For now, are we going to leave nprocPerNode in elasticPolicy working with a warning and deprecate it in the future, or will it stop working as of this version?

@kuizhiqing Sorry for the confusion.
I meant the following checks:

Suggested change

-	if job.Spec.NprocPerNode == nil {
+	if (job.Spec.ElasticPolicy != nil && job.Spec.ElasticPolicy.NProcPerNode == nil) || (job.Spec.ElasticPolicy == nil) {
+		if job.Spec.NprocPerNode == nil {

If we don't check elasticPolicy and both elasticPolicy.NProcPerNode and job.Spec.NprocPerNode are then set, the validateNprocPerNode function rejects the request, right?

@tenzen-y OK, thanks for clarifying. I will handle this and the unit tests as you suggest next week.
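
The guard discussed in this thread can be sketched as below. This is a simplified illustration under stated assumptions, not the merged code: the types are minimal stand-ins that keep only the two fields under discussion, the `*int32` type of the deprecated field is an assumption, and `setDefaultNprocPerNode` is a hypothetical name.

```go
package main

import "fmt"

// ElasticPolicy is a stand-in carrying only the deprecated field.
type ElasticPolicy struct {
	NProcPerNode *int32 // assumed type of the deprecated spec.elasticPolicy field
}

// PyTorchJobSpec is a stand-in carrying only the fields discussed here.
type PyTorchJobSpec struct {
	NprocPerNode  *string // new top-level field
	ElasticPolicy *ElasticPolicy
}

const defaultNprocPerNode = "auto"

// setDefaultNprocPerNode defaults the top-level field only when the
// deprecated elastic field is unset, so validation never sees both set
// as a side effect of defaulting.
func setDefaultNprocPerNode(spec *PyTorchJobSpec) {
	if spec.ElasticPolicy != nil && spec.ElasticPolicy.NProcPerNode != nil {
		return // the user chose the deprecated field; leave the new one alone
	}
	if spec.NprocPerNode == nil {
		v := defaultNprocPerNode
		spec.NprocPerNode = &v
	}
}

func main() {
	n := int32(2)
	legacy := PyTorchJobSpec{ElasticPolicy: &ElasticPolicy{NProcPerNode: &n}}
	setDefaultNprocPerNode(&legacy)
	fmt.Println("legacy defaulted:", legacy.NprocPerNode != nil)

	plain := PyTorchJobSpec{}
	setDefaultNprocPerNode(&plain)
	fmt.Println("plain value:", *plain.NprocPerNode)
}
```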

tenzen-y · 2023-06-30T17:54:05Z

pkg/apis/kubeflow.org/v1/pytorch_validation.go

+	return nil
+}
+
+func validateNprocPerNode(pytorchJob *PyTorchJob) error {


Should we add a unit test?
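
As a rough illustration of what such a test could exercise, the sketch below pairs a simplified validation rule with a table of cases. It assumes, based on the review discussion, that the rule rejects a job when both the top-level field and the deprecated elastic field are set. The types, the `validateNprocPerNodeSketch` name, and the error text are illustrative and not taken from the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the API types (assumed shapes).
type ElasticPolicy struct{ NProcPerNode *int32 }

type PyTorchJobSpec struct {
	NprocPerNode  *string
	ElasticPolicy *ElasticPolicy
}

// validateNprocPerNodeSketch rejects specs that set both fields at once.
func validateNprocPerNodeSketch(spec PyTorchJobSpec) error {
	if spec.NprocPerNode != nil && spec.ElasticPolicy != nil && spec.ElasticPolicy.NProcPerNode != nil {
		return errors.New("set either .spec.nprocPerNode or .spec.elasticPolicy.nProcPerNode, not both")
	}
	return nil
}

func main() {
	auto := "auto"
	n := int32(2)
	cases := []struct {
		name string
		spec PyTorchJobSpec
	}{
		{"only top-level", PyTorchJobSpec{NprocPerNode: &auto}},
		{"only elastic", PyTorchJobSpec{ElasticPolicy: &ElasticPolicy{NProcPerNode: &n}}},
		{"both set", PyTorchJobSpec{NprocPerNode: &auto, ElasticPolicy: &ElasticPolicy{NProcPerNode: &n}}},
	}
	for _, c := range cases {
		fmt.Printf("%s: rejected=%t\n", c.name, validateNprocPerNodeSketch(c.spec) != nil)
	}
}
```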

kuizhiqing · 2023-07-10T14:47:39Z

@tenzen-y @johnugeorge PTAL

tenzen-y

@kuizhiqing I left some comments.

tenzen-y · 2023-07-11T07:10:09Z

pkg/apis/kubeflow.org/v1/pytorch_defaults.go

@@ -19,6 +19,10 @@ import (
 	"k8s.io/apimachinery/pkg/runtime"
 )

+var (
+	defaultNprocPerNode = "auto"


Great constants :)

tenzen-y · 2023-07-11T07:20:59Z

pkg/apis/kubeflow.org/v1/pytorch_defaults_test.go

@@ -152,3 +152,25 @@ func TestSetElasticPolicy(t *testing.T) {
 		})
 	}
 }
+
+func TestSetDefaultNprocPerNode(t *testing.T) {


I think that this test doesn't verify some edge cases.
For example, the current case cannot verify that .spec.nprocPerNode isn't overridden when .spec.elasticPolicy.nProcPerNode is nil and .spec.elasticPolicy isn't nil.

So should we add more test cases?

tenzen-y · 2023-07-11T07:24:15Z

pkg/controller.v1/paddlepaddle/envvar.go

@@ -60,7 +60,7 @@ func setPodEnv(obj interface{}, podTemplateSpec *corev1.PodTemplateSpec, rtype,
 		// Ref https://stackoverflow.com/questions/59812009/what-is-the-use-of-pythonunbuffered-in-docker-file.
 		podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
 			Name:  "PYTHONUNBUFFERED",
-			Value: "0",
+			Value: "1",


Should we change this in a separate PR, since PaddleJob isn't related to PyTorchJob?

tenzen-y · 2023-07-11T07:31:44Z

pkg/controller.v1/pytorch/envvar.go

+	}
+	return 1
+}
+


Why do we need this? IIUC, we default spec.NprocPerNode to auto if spec.NprocPerNode is nil.

getNprocPerNodeInt is a helper function for calculating the world size; it does not affect the env.

When nproc_per_node is set to auto, the number of processes is determined in the user process phase; in this case, the world size env is not used.

I see. Thanks for the clarification!
So should we add a comment to this function? And can we add validation to pkg/apis/kubeflow.org/v1/pytorch_validation.go like the following?

if job.Spec.NprocPerNode != nil { v := *job.Spec.NprocPerNode; if _, err := strconv.Atoi(v); err != nil && v != "auto" && v != "CPU" && v != "GPU" { return fmt.Errorf("invalid nprocPerNode: %q", v) } }

I do not think we should do that. If the PyTorch framework later supports values other than those in the list, e.g. "XPU", the operator would stop working.

Anyway, in my opinion, the operator should be decoupled from the framework at this level.

It makes sense.
Is there any document about supported values by PyTorch?

Ah, I found the logic for the nproc_per_node: https://github.com/pytorch/pytorch/blob/26f7f470df64d90e092081e39507e4ac751f55d6/torch/distributed/run.py#L629-L658

Can you add this link to pkg/apis/kubeflow.org/v1/pytorch_types.go?

And can you add the following comment you posted to this function?

> When nproc_per_node is set to auto, the number of processes is determined in the user process phase; in this case, the world size env is not used.

Others LGTM. Thanks for the big contribution.
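
The helper discussed in this thread can be sketched as follows. Only the fallback `return 1` is visible in the diff hunk above, so the rest of the body is an assumption reconstructed from the explanation in the comments, and `nprocPerNodeInt` is a hypothetical stand-in for getNprocPerNodeInt that takes the string pointer directly.

```go
package main

import (
	"fmt"
	"strconv"
)

// nprocPerNodeInt converts the string-typed nprocPerNode into an integer
// that is used only to compute the world size.
func nprocPerNodeInt(nprocPerNode *string) int {
	if nprocPerNode == nil {
		return 1
	}
	if n, err := strconv.Atoi(*nprocPerNode); err == nil {
		return n
	}
	// Non-numeric values such as "auto" are resolved later, in the user
	// process phase, so the world size env is not used in that case and
	// falling back to 1 is harmless.
	return 1
}

func main() {
	for _, v := range []string{"4", "auto", "gpu"} {
		value := v
		fmt.Println(value, nprocPerNodeInt(&value))
	}
}
```

Passing the raw string through unchanged and using the integer only for the world size keeps the operator decoupled from the exact set of values the framework accepts, which is the design point made earlier in the thread.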

tenzen-y

@kuizhiqing Thanks for the great work!
/lgtm
/assign @johnugeorge @andreyvelich

andreyvelich · 2023-07-11T13:24:11Z

Thank you for this contribution @kuizhiqing !
/lgtm
/assign @johnugeorge

itayvallach · 2023-07-11T19:42:53Z

Thank you for this contribution @kuizhiqing !

johnugeorge · 2023-07-12T16:36:04Z

Thanks @kuizhiqing for contributing this

/lgtm
/approve

google-oss-prow · 2023-07-12T16:36:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, kuizhiqing

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fix pytorch WORLD_SIZE env inconsistent

ad5c282

google-oss-prow bot requested review from jinchihe and tenzen-y June 26, 2023 08:41

google-oss-prow bot added the size/L label Jun 26, 2023

kuizhiqing changed the title ~~fix pytorch WORLD_SIZE env inconsistent~~ [WIP] fix pytorch WORLD_SIZE env inconsistent Jun 26, 2023

google-oss-prow bot added the do-not-merge/work-in-progress label Jun 26, 2023

add correct env node_rank, nnodes for torch

d29362c

kuizhiqing force-pushed the torch-env-fix branch from 06910d6 to d29362c Compare June 28, 2023 07:29

kuizhiqing changed the title ~~[WIP] fix pytorch WORLD_SIZE env inconsistent~~ Set correct ENV for PytorchJob to support torchrun Jun 28, 2023

google-oss-prow bot removed the do-not-merge/work-in-progress label Jun 28, 2023

kuizhiqing mentioned this pull request Jun 28, 2023

[Discussion] PyTorch Operator Improvement #1836

Open

tenzen-y reviewed Jun 28, 2023

restore elastic nnproc

731a12e

johnugeorge reviewed Jun 28, 2023

johnugeorge reviewed Jun 28, 2023

johnugeorge reviewed Jun 28, 2023

andreyvelich reviewed Jun 28, 2023

tenzen-y reviewed Jun 29, 2023

use string for nproc_per_node

f98077d

tenzen-y reviewed Jun 29, 2023

add defaults in api

aae9541

add validation for two nproc_per_node, use auto for defaulter

b50fd83

kuizhiqing force-pushed the torch-env-fix branch from e3953d6 to b50fd83 Compare June 30, 2023 15:19

tenzen-y reviewed Jun 30, 2023

add ut for defaults and validation

741a4de

tenzen-y reviewed Jul 11, 2023

fix ut

2f1b5df

tenzen-y reviewed Jul 11, 2023

add doc for nproc_per_node

a54264c

tenzen-y reviewed Jul 11, 2023

google-oss-prow bot assigned andreyvelich, johnugeorge and tenzen-y Jul 11, 2023

google-oss-prow bot added the lgtm label Jul 11, 2023

kuizhiqing mentioned this pull request Jul 12, 2023

Bug: NProcPerNode in ElasticPolicy is wrong type #1861

Closed

google-oss-prow bot added the approved label Jul 12, 2023

google-oss-prow bot merged commit 2f18ab7 into kubeflow:master Jul 12, 2023

kuizhiqing mentioned this pull request Aug 1, 2023

Training Operator - Pytorch, multi-gpu + multi-worker distributed training #1872

Closed

johnugeorge mentioned this pull request Aug 5, 2023

[Release] Training operator 1.7.0 release #1809

Closed

8 tasks

brannondorsey mentioned this pull request Sep 27, 2023

The WORLD_SIZE environment variable in PyTorch is different from its definition #1790

Closed

This was referenced Nov 17, 2023

The WORLD_SIZE environment variable for elastic policy is not getting set correctly #1947

Closed

fix nproc env in elastic mode for pytorchjob #1948

Merged

kuizhiqing mentioned this pull request Dec 15, 2023

[SDK] Train API #1962

Merged

1 task

kuizhiqing mentioned this pull request Oct 5, 2024

Add DeepSpeed Example with Pytorch Operator #2235

Merged

1 task
