-
Notifications
You must be signed in to change notification settings - Fork 147
dist.ddp: make rendezvous work out of the box on all schedulers #400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #400 +/- ##
==========================================
- Coverage 94.57% 94.50% -0.08%
==========================================
Files 66 66
Lines 3593 3638 +45
==========================================
+ Hits 3398 3438 +40
- Misses 195 200 +5
Continue to review full report at Codecov.
|
|
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
|
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
|
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
|
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@d4l3k has updated the pull request. You must reimport the pull request before landing. |
|
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary:
Every OSS scheduler we support provides a special environment variable for the rank0 host address. This creates a new macro that provides that for all the schedulers and updates `dist.ddp` to use it.
This macro is a bit different since it's a `rank0_env` macro which provides the name of the ENV variable to read from. To use that, the component either needs to read that or do something like `f"/bin/bash -c main.py $${macros.rank0_env}`. We wrap dist.ddp with `/bin/bash` to achieve this. Due to the macro template language we need to use `$$` instead of just `$`.
Other misc changes:
* fixed ray tests to skip if `ray` isn't installed
* deleted TORCHX_IMAGE_EXAMPLES since it doesn't exist anymore
* updated dist.ddp to have `-m`, `--max_restarts` and set LOGLEVEL by default
Pull Request resolved: #400
Test Plan:
Updated slurm, docker and aws batch integ tests to use `dist.ddp` instead of bespoke usage.
Tested k8s and local_cwd manually.
I haven't tested ray
Updated component test framework to use two workers
Reviewed By: kiukchung
Differential Revision: D34437397
Pulled By: d4l3k
fbshipit-source-id: f2f78e7d293c9b50393e75cf4e9ca0eb5388d0b4
|
This pull request was exported from Phabricator. Differential Revision: D34437397 |
Every OSS scheduler we support provides a special environment variable for the rank0 host address. This creates a new macro that provides that for all the schedulers and updates
dist.ddpto use it.This macro is a bit different since it's a
rank0_envmacro which provides the name of the ENV variable to read from. To use that, the component either needs to read that or do something likef"/bin/bash -c main.py $${macros.rank0_env}. We wrap dist.ddp with/bin/bashto achieve this. Due to the macro template language we need to use$$instead of just$.Other misc changes:
rayisn't installed-m,--max_restartsand set LOGLEVEL by defaultTest plan:
Updated slurm, docker and aws batch integ tests to use
dist.ddpinstead of bespoke usage.Tested k8s and local_cwd manually.
I haven't tested ray
Updated component test framework to use two workers