Conversation

@jeffhataws
Contributor

@jeffhataws jeffhataws commented Dec 17, 2022

What does this PR do?

This PR adds torchrun support for the AWS Neuron SDK.

The existing HF tutorial for the Neuron SDK requires users to modify the HF example scripts (e.g., run_glue.py). This change helps minimize the modifications required.

This change will require the upcoming AWS Neuron PyTorch 1.13 support.

This is an update to #19907.
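
For context, here is a minimal sketch of the launch flow this enables on a Trainium instance. It assumes the Neuron PyTorch-XLA build is installed, and the module names shown (torch_xla.distributed.xla_backend, the "xla" process-group backend) follow the public XLA/Neuron documentation rather than this PR's diff:

import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" backend for torch.distributed

# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, etc., so the training script
# no longer needs hand-written process-group setup.
if os.environ.get("WORLD_SIZE"):
    torch.distributed.init_process_group("xla")

device = xm.xla_device()  # NeuronCore-backed XLA device

With this in place, an unmodified example script could be launched with, e.g., torchrun --nproc_per_node=2 run_glue.py ... on the instance.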

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger

@jeffhataws jeffhataws changed the title from "Add aws neuron torchrun support" to "Add AWS Neuron torchrun support and transformer model-type support for compiler" on Dec 17, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 17, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment

Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?

@philschmid
Contributor

@jeffhataws could you please explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose, users would still need to modify the scripts, e.g.:

import os

import torch
import transformers
from packaging import version
from transformers import Trainer, __version__

# Fixup to enable distributed training with XLA: bypass the DDP wrapper
# (the _wrap_model signature gained a dataloader argument in transformers 4.20.0)
if version.parse(__version__) < version.parse("4.20.0"):
    Trainer._wrap_model = lambda self, model, training=True: model
else:
    Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

# Workaround for NaNs seen with transformers version >= 4.21.0
# https://github.com/aws-neuron/aws-neuron-sdk/issues/593
if os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16"):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

@jeffhataws
Contributor Author

jeffhataws commented Dec 20, 2022

Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?

Yes, for this test we will need a Trainium instance. Over time, once pytorch/xla#3609 is released, we can make it more generic for GPU/XLA. For now, the Neuron team will test this; the test is currently passing on a Trainium instance.
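
As a rough illustration of how such a hardware-gated test could be skipped on machines without the Neuron runtime (a sketch only, not necessarily how this PR gates its test; the torch_neuronx module name and helper below are assumptions):

import importlib.util
import unittest


def is_neuroncore_available() -> bool:
    # Hypothetical helper: treat the presence of the Neuron PyTorch package
    # as a proxy for running on a Trainium instance.
    return importlib.util.find_spec("torch_neuronx") is not None


@unittest.skipUnless(is_neuroncore_available(), "requires a Trainium (NeuronCore) instance")
class NeuronTorchrunSmokeTest(unittest.TestCase):
    def test_neuron_runtime_present(self):
        self.assertTrue(is_neuroncore_available())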

@jeffhataws
Contributor Author

@jeffhataws could you please explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose, users would still need to modify the scripts, e.g.:

import os

import torch
import transformers
from packaging import version
from transformers import Trainer, __version__

# Fixup to enable distributed training with XLA: bypass the DDP wrapper
# (the _wrap_model signature gained a dataloader argument in transformers 4.20.0)
if version.parse(__version__) < version.parse("4.20.0"):
    Trainer._wrap_model = lambda self, model, training=True: model
else:
    Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

# Workaround for NaNs seen with transformers version >= 4.21.0
# https://github.com/aws-neuron/aws-neuron-sdk/issues/593
if os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16"):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

The first workaround is for missing DDP support, which will be available in Neuron's PyTorch-XLA version 1.13 (a future release). The second workaround is no longer needed as of transformers==4.25.1, which includes the fix from #20562.
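
A hedged sketch of how a user script could keep the second workaround only on releases that predate the fix, given that #20562 is included in transformers==4.25.1 (illustration only, not part of this PR):

import os

import torch
import transformers
from packaging import version

# Apply the bf16 NaN workaround only on transformers releases older than 4.25.1,
# which is the first release that includes the fix from #20562.
needs_nan_workaround = version.parse(transformers.__version__) < version.parse("4.25.1")
if needs_nan_workaround and (os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16")):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16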

@jeffhataws jeffhataws changed the title from "Add AWS Neuron torchrun support and transformer model-type support for compiler" to "Add AWS Neuron torchrun support" on Dec 22, 2022
@sgugger
Collaborator

sgugger commented Jan 3, 2023

Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?

@jeffhataws
Contributor Author

Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?

@sgugger since we already have a workaround for the DDP wrapper (overriding the _wrap_model function), we can actually merge this first. The reasons are that 1) we want this in the next transformers release, ahead of PyTorch-XLA 1.13, and 2) I will need this change in order to post another PR for the default compiler flag for the transformer model type. Let me know if this is acceptable.

@jeffhataws jeffhataws requested a review from sgugger January 6, 2023 20:58
@sgugger
Collaborator

sgugger commented Jan 18, 2023

Thanks for your patience on this.

@sgugger sgugger merged commit c59d71b into huggingface:main Jan 18, 2023
@jeffhataws jeffhataws deleted the add_aws_neuron_torchrun_support branch January 19, 2023 04:12