Conversation

@amithrm (Contributor) commented May 25, 2022

This pull request enables the code needed to integrate the torchrun launcher with the XLA backend.
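
For context (not part of this PR's diff): torchrun starts one process per local worker (--nproc_per_node) and exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to each process, which the backend integration can then map onto its own XLA worker configuration. A minimal launch sketch, where train_xla.py is a placeholder name for any PyTorch/XLA script:

# Sketch only; train_xla.py is a placeholder script name.
# torchrun provides RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
# to every process it launches.
torchrun --nnodes=1 --nproc_per_node=2 \
         --master_addr=127.0.0.1 --master_port=2020 \
         train_xla.py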

@JackCaoG (Collaborator)

@amithrm Can you provide a test for this new feature?

@amithrm (Contributor, Author) commented May 25, 2022

@JackCaoG Sure, will add tests.

@amithrm (Contributor, Author) commented Jun 8, 2022

@JackCaoG I changed the initialization a bit to take into account how Slurm configures the devices. Please take a look at it and also the test cases. All of these would need more modifications after we discuss.

@JackCaoG (Collaborator) commented Jun 9, 2022

I have to admit that I am not an expert on torchrun; let me read up on some documentation first, lol. Looping in @will-cromar to make sure this does not conflict with our future PJRT runtime.

@amithrm (Contributor, Author) commented Jun 22, 2022

We did some internal testing. It appears that at scale, we see issues with the setup of gRPC channels. We should understand whether you see similar issues on your end too.

@amithrm (Contributor, Author) commented Jun 23, 2022

@JackCaoG
A simple test that you can run on GPU-XLA:

import os

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
  # Print the XRT configuration that the launcher set up for this worker.
  print('XRT_LOCAL_WORKER:{}'.format(os.environ['XRT_LOCAL_WORKER']))
  print('XRT_DEVICE_MAP:{}'.format(os.environ['XRT_DEVICE_MAP']))
  print('XRT_WORKERS:{}'.format(os.environ['XRT_WORKERS']))
  print('XRT_HOST_WORLD_SIZE:{}'.format(os.environ['XRT_HOST_WORLD_SIZE']))
  device = xm.xla_device()
  world_size = xm.xrt_world_size()
  # All-reduce each worker's ordinal; every worker should end up with the
  # same value (the sum of all ordinals).
  ordinal_tensor = torch.tensor([index], dtype=torch.float).to(device)
  print('rank:{}, value:{}'.format(index, ordinal_tensor))
  result = xm.all_reduce('sum', ordinal_tensor)

  cpu_result = result.cpu()
  print('rank:{}, value:{}'.format(index, cpu_result))


if __name__ == '__main__':
  xmp.spawn(_mp_fn, args=(), nprocs=2, join=True)

@amithrm (Contributor, Author) commented Jun 23, 2022

Run command:

GPU_NUM_DEVICES=2 python3 allreduce_xla.py

This will output:

XRT_LOCAL_WORKER:localservice:0
XRT_DEVICE_MAP:GPU:0;/job:localservice/replica:0/task:0/device:XLA_GPU:0|GPU:1;/job:localservice/replica:0/task:1/device:XLA_GPU:0
XRT_WORKERS:localservice:0;grpc://dfda805bbe4b:49887|localservice:1;grpc://dfda805bbe4b:33097
XRT_LOCAL_WORKER:localservice:1
XRT_DEVICE_MAP:GPU:0;/job:localservice/replica:0/task:0/device:XLA_GPU:0|GPU:1;/job:localservice/replica:0/task:1/device:XLA_GPU:0
XRT_WORKERS:localservice:0;grpc://dfda805bbe4b:49887|localservice:1;grpc://dfda805bbe4b:33097

If you look at XRT_WORKERS, it has the gRPC string for every worker. This won't scale with the number of workers.
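
A rough illustration (hypothetical hostnames and ports, not taken from the PR) of the concern: each worker's XRT_WORKERS value carries one gRPC entry per worker, so the string every process must build and parse grows linearly with world size.

# Hypothetical hosts/ports, for illustration only: the per-worker XRT_WORKERS
# string contains one grpc entry per worker, so its length is O(world size).
for n in 2 8 64; do
  workers=""
  for i in $(seq 0 $((n - 1))); do
    workers+="localservice:${i};grpc://host${i}:$((49000 + i))|"
  done
  workers=${workers%|}
  echo "${n} workers -> ${#workers} characters in XRT_WORKERS"
done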

@JackCaoG (Collaborator) commented Nov 2, 2022

We want to support torchrun with the new PJRT runtime; in the meantime, if this torchrun utility can unblock the AWS folks we can also take it.

I am a bit hesitant about claiming official support for XRT:TPU + torchrun. @will-cromar let's investigate what the gap is here; if it is free we might as well just take it.

@will-cromar (Collaborator) left a comment

This doesn't conflict with PJRT, but we will have to add a new code path for it. Runtime configuration is much simpler under PJRT, so I expect that we can simplify a lot of this.

@JackCaoG what do you think of moving this under torch_xla.experimental for now?

cmd = "torchrun --nproc_per_node=2 --master_addr=127.0.0.1 --master_port=2020 allreduce_torchrun.py "

new_env0 = os.environ.copy()
new_env0['NEURON_RT_VISIBLE_CORES'] = '0,1'
Collaborator:

I'm not familiar with Neuron at all. Do we need to install anything extra beyond PyTorch and PyTorch/XLA to run this test? If not, can you add a GPU CI test like this?

xla/test/run_tests.sh

Lines 85 to 90 in 590dee5

if [ -x "$(command -v nvidia-smi)" ]; then
  PJRT_DEVICE=GPU run_test "$@"
else
  # TODO(darisoy): run these tests with multiple CPU devices, this fails due to TF issue.
  PJRT_DEVICE=CPU CPU_NUM_DEVICES=1 run_test "$@"
fi

Contributor Author:

Yes, these are Neuron-specific env variables. Will add the GPU-specific ones. But we have PJRT_DEVICE listed above; will this interfere with torchrun?

Collaborator:

Nope, there shouldn't be an issue. PJRT won't work with this PR, but you can add a new function to run_tests.sh that sets up anything Neuron-related and skips setting PJRT_DEVICE (i.e. make a run_neuron function like run_pjrt).
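
For illustration, a hypothetical run_neuron helper (not the actual change in this PR; the Neuron environment setup is an assumption based on the test above) could follow the same pattern as the existing wrappers around run_test:

# Hypothetical sketch only: configure the Neuron cores and invoke run_test
# without setting PJRT_DEVICE, since this PR targets the XRT path.
function run_neuron {
  echo "Running Neuron test: $@"
  NEURON_RT_VISIBLE_CORES=0,1 run_test "$@"
}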

Contributor Author:

Hi @will-cromar, looks like in the CI tests the file allreduce_torchrun.py is not picked up. What is a good way to fix this? Is there a global prefix that I can add before the file name?

@JackCaoG (Collaborator) commented Nov 2, 2022

In terms of moving the code to experimental, I guess it is related to how we want to set user expectations. Is there any known caveat for this feature? If it works well we don't need to put it in experimental. However, we should add a README similar to https://github.com/pytorch/xla/blob/master/docs/ddp.md

@JackCaoG (Collaborator) commented Dec 5, 2022

@will-cromar can you take another pass at this PR when you have some time?

@will-cromar (Collaborator) left a comment

Thanks! Overall LGTM. Please add a CI test (run_tests.sh) and then we can merge this

@amithrm (Contributor, Author) commented Dec 8, 2022

Looks like the file needed for the test (allreduce_torchrun.py) is not getting picked up. Checking with @will-cromar on how to fix this. And some yapf fixes are pending in one file.

@amithrm force-pushed the xrt_init branch 3 times, most recently from f263be6 to e71414a on December 10, 2022 16:55

@amithrm (Contributor, Author) commented Dec 13, 2022

@will-cromar I see build failure: NameError: name 'sympy' is not defined

@JackCaoG (Collaborator)

weird, head is green right now https://github.com/pytorch/xla/commits/master

@JackCaoG (Collaborator)

Ah ok https://github.com/pytorch/xla/pull/4313/files should fix it, can you rebase again?

@amithrm (Contributor, Author) commented Dec 13, 2022

@JackCaoG All 4 pass. @will-cromar, is there anything else needed?

@will-cromar (Collaborator) left a comment

LGTM

@will-cromar requested a review from @JackCaoG on December 13, 2022 19:10
@JackCaoG (Collaborator) left a comment

Thanks!

@JackCaoG merged commit 334d1d6 into pytorch:master on Dec 13, 2022
@jeffhataws (Collaborator)

Thanks @JackCaoG and @amithrm !
