Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make rdzv_backend configurable #739

Merged
merged 1 commit into from
Jul 26, 2023

Conversation

MichaelClifford
Copy link
Contributor

PR Summary:
This PR allows the user to configure the rdzv_backend they want to use with ddp and sets the default to "c10d". This also removes the previously hard-coded c10d rdzv_backend.

Test plan:
This has been tested locally against dist_test.py with all 10 tests passing.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 21, 2023
Copy link
Collaborator

@kiukchung kiukchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, LGTM.

Two small notes:

  1. Certain rdzv_backends require additional rendezvous configs to be passed as rdzv_conf KEY=VAL,KEY=VAL,..., so you might end up having to expose rdzv_conf as well.
  2. The spmd component (in the same file) calls ddp and has a much cleaner argument list. So depending on what you are after, you might want to switch over to spmd and plumb rdzv argument through to that component as well.

Both are non-blocking comments. Approving.

@MichaelClifford
Copy link
Contributor Author

@kiukchung thanks for the review. And happy to add those noted changes in an additional PR.

After the approval, is there anything else I need to do before the PR can be mereged?

@d4l3k
Copy link
Member

d4l3k commented Jul 26, 2023

Code looks good -- there's some CI failures in trunk which are unrelated to this (pyyaml regression + disk space issue on the underlying runners)

@d4l3k d4l3k merged commit c9340a7 into pytorch:main Jul 26, 2023
12 of 20 checks passed
KPostOffice pushed a commit to KPostOffice/torchx that referenced this pull request Sep 7, 2023
MichaelClifford added a commit to project-codeflare/torchx that referenced this pull request Sep 8, 2023
KPostOffice pushed a commit to KPostOffice/torchx that referenced this pull request Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants