Fix hang in DDP HPC accelerators #5157
init_device was never called
Codecov Report
@@          Coverage Diff          @@
##          master   #5157   +/-   ##
=====================================
  Coverage     93%     93%
=====================================
  Files        134     134
  Lines       9905    9905
=====================================
  Hits        9204    9204
  Misses       701     701
LGTM, mind adding a changelog entry?
I think the changes are fine; apologies for the mistake. This is correct: devices now have to be initialised separately from the model due to changes for the RPC plugin, and before initialising any model or connections.
* Fix hang in DDP HPC accelerators: init_device was never called
* Update CHANGELOG.md
What does this PR do?
`init_device` was not called, so the `root_gpu` is 0 throughout. This leads to a hang later on in `ddp_train` when `configure_ddp` was called, as the `LightningDistributedDataParallel` call never completes with the redundant device ids set.

This is a recent change from #5016, and there aren't tests in OSS for this particular accelerator.
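The failure mode can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lightning's accelerator code: `ToyAccelerator`, `ddp_device_ids`, and `device_ids_per_rank` are hypothetical names, and the mapping from process index to GPU inside `init_device` is assumed for illustration. It shows that when `init_device` is skipped, every process keeps the default `root_gpu` of 0 and so would hand the same device id to the DDP wrapper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyAccelerator:
    """Hypothetical stand-in for an HPC DDP accelerator (not Lightning's class)."""

    gpu_ids: List[int]                # GPUs visible on this node
    root_gpu: int = 0                 # default that remains if init_device is skipped
    local_rank: Optional[int] = None

    def set_world_ranks(self, process_idx: int) -> None:
        # Record which local process this is.
        self.local_rank = process_idx

    def init_device(self, process_idx: int) -> None:
        # Assumed behaviour: select the GPU that belongs to this process.
        self.root_gpu = self.gpu_ids[process_idx]

    def ddp_device_ids(self) -> List[int]:
        # Device ids that would be passed on to the DDP wrapper.
        return [self.root_gpu]


def device_ids_per_rank(gpu_ids: List[int], call_init_device: bool) -> List[List[int]]:
    """Simulate one accelerator per local process and collect its device ids."""
    collected = []
    for process_idx in range(len(gpu_ids)):
        acc = ToyAccelerator(gpu_ids=list(gpu_ids))
        acc.set_world_ranks(process_idx)
        if call_init_device:
            # The ordering proposed in this PR: right after set_world_ranks.
            acc.init_device(process_idx)
        collected.append(acc.ddp_device_ids())
    return collected


print(device_ids_per_rank([0, 1, 2, 3], call_init_device=False))  # [[0], [0], [0], [0]]
print(device_ids_per_rank([0, 1, 2, 3], call_init_device=True))   # [[0], [1], [2], [3]]
```

In the first call all four simulated processes report device 0, which corresponds to the redundant device ids described above; in the second, each process gets its own GPU.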
2 main questions:
* `init_device` immediately after `set_world_ranks` - is that correct?
* `init_device` as a no-op, but I'm not sure if that's the most robust.

Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃