Enables torchrun for XLA-based accelerators #19907
Conversation
sgugger left a comment:
Thanks for your PR! There are a few problems with it though, as it breaks current support for PyTorch XLA.
src/transformers/trainer.py (Outdated)
No, the model needs to be wrapped in a DistributedDataParallel for TPU.
Thanks. Limited the case to torchrun in the latest commit and clarified that DDP doesn't work with the torch.distributed XLA backend yet (pytorch/pytorch#79164).
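For context, a minimal sketch of the kind of gating described here, assuming a torch_xla build that registers the "xla" backend via torch_xla.distributed.xla_backend (pytorch/xla#3609). The helper name is hypothetical and this is not the code in the PR:

```python
# Illustrative only: detect a torchrun launch and initialize the XLA process
# group, while skipping the DistributedDataParallel wrap that the
# torch.distributed XLA backend does not support yet (pytorch/pytorch#79164).
import os

import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401  (registers the "xla" backend)


def maybe_init_xla_process_group():
    # torchrun (torch.distributed.elastic) exports RANK/WORLD_SIZE/LOCAL_RANK.
    launched_by_torchrun = "TORCHELASTIC_RUN_ID" in os.environ or "RANK" in os.environ
    if launched_by_torchrun and not dist.is_initialized():
        dist.init_process_group(backend="xla")
    return xm.xla_device()
```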
In this case we should wait until it's supported. This will break the current TPU support.
src/transformers/training_args.py (Outdated)
This should be moved to the _setup_devices method of TrainingArguments. It shouldn't be done on import.
Thanks. Unfortunately I can't move it into _setup_devices for some reason, as it runs into a device-not-found error. Will investigate.
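For illustration, a rough sketch of what the reviewer's suggestion might look like, assuming the XLA process group can be initialized lazily. The structure only loosely mirrors the real TrainingArguments._setup_devices and is not the merged code:

```python
# Illustrative only: perform the XLA setup inside the lazy device property
# instead of at import time. Names and checks are simplified.
import os

import torch
import torch.distributed as dist


class TrainingArguments:
    @property
    def _setup_devices(self) -> "torch.device":
        try:
            import torch_xla.core.xla_model as xm
            import torch_xla.distributed.xla_backend  # noqa: F401  (registers "xla")
        except ImportError:
            return torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Only initialize the process group when launched via torchrun.
        if os.environ.get("RANK") is not None and not dist.is_initialized():
            dist.init_process_group(backend="xla")
        return xm.xla_device()
```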
FYI, we don't have any runner testing on XLA for now, so this test will never be run.
Thanks. Will get help with running these tests.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Force-pushed from f882cb6 to f339bd9.
What does this PR do?
This PR enables torchrun for XLA-based accelerators (TPU/NeuronCore) by using torch.distributed XLA backend. It is dependent on the torch/xla change pytorch/xla#3609.
Example application is the AWS Neuron tutorial with HF Trainer that uses torchrun:
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/finetune_hftrainer.html
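To make the intended flow concrete, here is a minimal sketch of a script launched with torchrun against the torch.distributed XLA backend. It assumes a torch_xla build that includes pytorch/xla#3609 and is not taken from the tutorial itself:

```python
# Illustrative only. Launch with, e.g.:  torchrun --nproc_per_node=2 train.py
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401  (registers the "xla" backend)

# Rank and world size come from the environment variables torchrun sets.
dist.init_process_group(backend="xla")
device = xm.xla_device()  # the TPU / NeuronCore device owned by this process
print(f"rank {dist.get_rank()}/{dist.get_world_size()} running on {device}")
```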
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sgugger