Add AWS Neuron torchrun support #20806
Conversation
The documentation is not available anymore as the PR was closed or merged.
sgugger left a comment:
Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?
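(For context, tests like this are usually gated behind an availability check so they are skipped on machines without the Neuron stack. The sketch below is a hypothetical illustration, assuming torch_neuronx is the module to probe for; it is not necessarily the exact transformers implementation.)

import importlib.util
import unittest

def is_torch_neuroncore_available():
    # Assumption: the presence of torch_neuronx signals a Trainium/Neuron environment.
    return importlib.util.find_spec("torch_neuronx") is not None

def require_torch_neuroncore(test_case):
    # Skip the decorated test (function or class) unless a NeuronCore stack is installed.
    return unittest.skipUnless(is_torch_neuroncore_available(), "test requires torch NeuronCore")(test_case)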
@jeffhataws could you explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose users would still need to modify the scripts, e.g. with the following workaround:

# Fixup to enable distributed training with XLA
import os
import torch
import transformers
from packaging import version
from transformers import Trainer, __version__

if version.parse(__version__) < version.parse("4.20.0"):
    Trainer._wrap_model = lambda self, model, training=True: model
else:
    Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

# Workaround for NaNs seen with transformers versions >= 4.21.0
# https://github.com/aws-neuron/aws-neuron-sdk/issues/593
if os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16"):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16
Yes, for this test we will need a Trainium instance. Over time, once pytorch/xla#3609 is released, we can make it more generic for GPU/XLA. For now, the Neuron team will test this. The test is currently passing on a Trainium instance.
The first workaround is for missing DDP support, which will be available in Neuron's PyTorch-XLA version 1.13 (a future release). The second workaround is already fixed in transformers==4.25.1 by #20562.
Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?
@sgugger since we already have a workaround for the DDP wrapper by overwriting the _wrap_model function, we can actually merge this first. The reasons are that 1) we want it in the next transformers release, ahead of 1.13, and 2) I will need this change to post another PR for the default compiler flag for the transformer model type. Let me know if this is acceptable.
Thanks for your patience on this. |
What does this PR do?
This PR adds support for torchrun with the AWS Neuron SDK.
The existing HF tutorial for the Neuron SDK requires users to modify the HF example scripts (e.g., run_glue.py). This change helps minimize the modifications required.
This change will require the future AWS Neuron PyTorch 1.13 release.
This is an update to #19907.
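To illustrate the mechanism, here is a minimal sketch of how a torchrun-launched script can initialize the XLA process group. It assumes torchrun's TORCHELASTIC_RUN_ID environment variable as the launch signal and that importing torch_xla.distributed.xla_backend registers the "xla" backend; it is not the exact code in this PR.

import os
import torch.distributed as dist
import torch_xla.distributed.xla_backend  # importing this registers the "xla" backend

# torchrun (torch elastic) sets TORCHELASTIC_RUN_ID, RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT for each worker process.
if os.environ.get("TORCHELASTIC_RUN_ID") is not None:
    # RANK/WORLD_SIZE are read from the environment by init_process_group.
    dist.init_process_group(backend="xla")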
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger