Conversation

@jeffhataws
Contributor

@jeffhataws jeffhataws commented Dec 17, 2022

What does this PR do?

This PR adds torchrun support for the AWS Neuron SDK.

The existing HF tutorial for the Neuron SDK requires users to modify the HF example scripts (e.g., run_glue.py). This change helps minimize the modifications required.

This change will require the upcoming AWS Neuron PyTorch 1.13 support.

This is an update to #19907.
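
For context, here is a minimal sketch of the launch flow this enables on a Trainium instance. It assumes the Neuron PyTorch-XLA build is installed, and the module names shown (torch_xla.distributed.xla_backend, the "xla" process-group backend) follow the public XLA/Neuron documentation rather than this PR's diff:

import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" backend for torch.distributed

# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, etc., so the training script
# no longer needs hand-written process-group setup.
if os.environ.get("WORLD_SIZE"):
    torch.distributed.init_process_group("xla")

device = xm.xla_device()  # NeuronCore-backed XLA device

With this in place, an unmodified example script could be launched with, e.g., torchrun --nproc_per_node=2 run_glue.py ... on the instance.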

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger

@jeffhataws jeffhataws changed the title from "Add aws neuron torchrun support" to "Add AWS Neuron torchrun support and transformer model-type support for compiler" on Dec 17, 2022
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 17, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment

Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?

@philschmid
Contributor

@jeffhataws could you please explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose, users would still need to modify the scripts, e.g.:

import os

import torch
import transformers
from packaging import version
from transformers import Trainer, __version__

# Fixup to enable distributed training with XLA: bypass the DDP wrapper
# (the _wrap_model signature gained a dataloader argument in transformers 4.20.0)
if version.parse(__version__) < version.parse("4.20.0"):
    Trainer._wrap_model = lambda self, model, training=True: model
else:
    Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

# Workaround for NaNs seen with transformers version >= 4.21.0
# https://github.com/aws-neuron/aws-neuron-sdk/issues/593
if os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16"):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

@jeffhataws
Contributor Author

jeffhataws commented Dec 20, 2022

Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?

Yes, for this test we will need a Trainium instance. Over time, once pytorch/xla#3609 is released, we can make it more generic for GPU/XLA. For now, the Neuron team will test this; the test is currently passing on a Trainium instance.
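
As a rough illustration of how such a hardware-gated test could be skipped on machines without the Neuron runtime (a sketch only, not necessarily how this PR gates its test; the torch_neuronx module name and helper below are assumptions):

import importlib.util
import unittest


def is_neuroncore_available() -> bool:
    # Hypothetical helper: treat the presence of the Neuron PyTorch package
    # as a proxy for running on a Trainium instance.
    return importlib.util.find_spec("torch_neuronx") is not None


@unittest.skipUnless(is_neuroncore_available(), "requires a Trainium (NeuronCore) instance")
class NeuronTorchrunSmokeTest(unittest.TestCase):
    def test_neuron_runtime_present(self):
        self.assertTrue(is_neuroncore_available())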

@jeffhataws
Contributor Author

@jeffhataws could you please explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose, users would still need to modify the scripts, e.g.:

import os

import torch
import transformers
from packaging import version
from transformers import Trainer, __version__

# Fixup to enable distributed training with XLA: bypass the DDP wrapper
# (the _wrap_model signature gained a dataloader argument in transformers 4.20.0)
if version.parse(__version__) < version.parse("4.20.0"):
    Trainer._wrap_model = lambda self, model, training=True: model
else:
    Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

# Workaround for NaNs seen with transformers version >= 4.21.0
# https://github.com/aws-neuron/aws-neuron-sdk/issues/593
if os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16"):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

The first workaround is for missing DDP support, which will be available in Neuron's PyTorch-XLA version 1.13 (a future release). The second workaround is no longer needed as of transformers==4.25.1, which includes the fix from #20562.
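
A hedged sketch of how a user script could keep the second workaround only on releases that predate the fix, given that #20562 is included in transformers==4.25.1 (illustration only, not part of this PR):

import os

import torch
import transformers
from packaging import version

# Apply the bf16 NaN workaround only on transformers releases older than 4.25.1,
# which is the first release that includes the fix from #20562.
needs_nan_workaround = version.parse(transformers.__version__) < version.parse("4.25.1")
if needs_nan_workaround and (os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16")):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16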

@jeffhataws jeffhataws changed the title from "Add AWS Neuron torchrun support and transformer model-type support for compiler" to "Add AWS Neuron torchrun support" on Dec 22, 2022
@sgugger
Collaborator

sgugger commented Jan 3, 2023

Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?

@jeffhataws
Contributor Author

Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?

@sgugger since we already have a workaround for the DDP wrapper (overriding the _wrap_model function), we can actually merge this first. The reasons are that 1) we want this in the next transformers release, ahead of PyTorch-XLA 1.13, and 2) I will need this change in order to post another PR for the default compiler flag for the transformer model type. Let me know if this is acceptable.

@jeffhataws jeffhataws requested a review from sgugger January 6, 2023 20:58
@sgugger
Collaborator

sgugger commented Jan 18, 2023

Thanks for your patience on this.

@sgugger sgugger merged commit c59d71b into huggingface:main Jan 18, 2023
@jeffhataws jeffhataws deleted the add_aws_neuron_torchrun_support branch January 19, 2023 04:12