Add support for DDP fork #13405
Merged
awaelchli force-pushed the feature/ddp-fork2 branch from b059f73 to 9cea979 on June 25, 2022 03:09
awaelchli commented Jun 25, 2022
justusschock approved these changes Jun 27, 2022
mergify bot added the "ready" label (PRs ready to be merged) and removed the "has conflicts" and "ready" labels Jul 20, 2022
This reverts commit 3d7095d.
for more information, see https://pre-commit.ci
Co-authored-by: Akihiro Nitta <[email protected]>
Borda approved these changes Jul 22, 2022
mergify bot added the "ready" label (PRs ready to be merged) and removed the "has conflicts" and "ready" labels Jul 22, 2022
This was referenced Jul 25, 2022
This was referenced Aug 7, 2022
This was referenced Aug 19, 2022
Labels
feature (Is an improvement or enhancement)
pl (Generic label for PyTorch Lightning package)
priority: 0 (High priority task)
ready (PRs ready to be merged)
strategy: ddp (DistributedDataParallel)
What does this PR do?
Fixes #7550
Fixes #8230
Adds support for DDP Fork. Unlike ddp_spawn, this variant of DDP can be used in Jupyter notebooks with GPUs.
Usage:
See the added docs for a comparison of ddp_spawn vs. ddp_fork
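A minimal sketch of how the new strategy might be selected, assuming it is registered under the string "ddp_fork" (the name used in the docs comparison above) and that the usual Trainer arguments apply. The helper names `ddp_fork_trainer_kwargs` and `make_trainer` are hypothetical and exist only for illustration; the device count is arbitrary.

```python
def ddp_fork_trainer_kwargs(num_devices=2):
    """Keyword arguments for a Trainer that launches DDP workers via fork.

    Hypothetical helper: the "ddp_fork" string is assumed to be the
    registered strategy name; num_devices is an arbitrary example value.
    """
    return {
        "accelerator": "gpu",
        "devices": num_devices,
        "strategy": "ddp_fork",
    }


def make_trainer(num_devices=2):
    # Imported lazily so this sketch can be loaded without Lightning installed.
    from pytorch_lightning import Trainer

    return Trainer(**ddp_fork_trainer_kwargs(num_devices))
```

In a notebook cell, one would call `make_trainer()` and then `trainer.fit(model)` as usual. Per the remarks below, CUDA should not have been initialized in the notebook process before the workers are forked.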
Remarks:
Q: Why did we need to replace almost all instances of torch.cuda.device_count() and torch.cuda.is_available()?
A: Unfortunately, these function calls create a CUDA context, i.e., they initialize CUDA memory and tie it to the current process. Once this happens, CUDA can no longer be re-initialized in the forked processes. This is a limitation of combining torch with forking.
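The inheritance problem can be illustrated with the standard library alone. The sketch below uses no real CUDA: `fake_device_count` is a hypothetical stand-in for a call with a process-wide side effect, and the value 2 is made up. It shows that state created in the parent is inherited by a forked child, and that running the query in a short-lived forked child leaves the parent untouched. Whether the PR uses exactly this approach is an assumption, not something stated here.

```python
import multiprocessing as mp
import os

# Stand-in for "a CUDA context exists": the PID that initialized it, or None.
_context_owner_pid = None


def context_owner():
    return _context_owner_pid


def fake_device_count():
    """Hypothetical stand-in for a query that initializes state as a side effect."""
    global _context_owner_pid
    _context_owner_pid = os.getpid()
    return 2  # made-up device count


def _report_owner(queue):
    # Runs in the child: report the owner PID it inherited at fork time.
    queue.put(_context_owner_pid)


def _probe_count(queue):
    # Runs in the child: the side effect stays in the child's memory.
    queue.put(fake_device_count())


def _run_forked(target):
    ctx = mp.get_context("fork")  # POSIX only
    queue = ctx.Queue()
    proc = ctx.Process(target=target, args=(queue,))
    proc.start()
    result = queue.get(timeout=30)
    proc.join()
    return result


def owner_seen_by_forked_child():
    return _run_forked(_report_owner)


def device_count_in_subprocess():
    return _run_forked(_probe_count)
```

Calling `device_count_in_subprocess()` returns the count while `context_owner()` in the parent stays `None`. Calling `fake_device_count()` directly in the parent instead makes every later forked child inherit the parent's PID as owner, which mirrors the situation the answer above describes.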
Q: Now that we support two different start methods in DDPSpawnStrategy, shouldn't we rename the strategy?
A: Technically, yes, especially since TPUSpawnStrategy also uses the fork start method and does not even support spawn. However, renaming everything now would be premature, considering that in the long term the DDPSpawn and DDP strategies will be merged anyway. What we can do in a follow-up is rename the internal launcher classes _SpawnLauncher and _XLASpawnLauncher, along with the associated terminology in docs, comments, etc.
Open questions
strategy=None and devices>1 in a Jupyter notebook?
Follow-up work
Does your PR introduce any breaking changes? If yes, please list them.
No known ones.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃
cc @Borda @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta