-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix] Add barriers before and after setup hook is run #7202
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7202 +/- ##
=======================================
- Coverage 91% 87% -4%
=======================================
Files 198 198
Lines 12647 12725 +78
=======================================
- Hits 11531 11063 -468
- Misses 1116 1662 +546 |
Hi, thanks for the patch! It looks sensible, but it doesn't seem to fix the issue on my end. The issue seem to be that at this point, DDP is not yet set up so the barrier is a no-op. I think the right place to put it would be between environment setup and module setup |
@awaelchli @tchaton this feels like a bug that prepare data can be called without other ranks waiting for it to finish, especially if setup depends on the results of prepare data being available. how can we conditionally move the process group initialization sooner so that we can use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you still need a barrier before running setup()
?
Only the barrier before the setup hook can solve the reported issue right? The barrier after setup may be optional. |
Yes you're right. Ideally we have a barrier immediately after prepare_data, not just before setup
the barrier here could be helpful in DDP if users define layers in |
IMO inside each connector. This way the trainer code is a tiny bit easier to read :) |
6ee1c8d
to
092e767
Compare
@Evpok mind taking another look to see if this would mitigate/solve the issue for you? |
@ananthsub Yes it does, thank you for that! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM !
Careful! You merged with TPU tests failing. We probably need to add an additional check in the barrier for TPU plugin. |
What does this PR do?
Reported on slack:
Currently we delay the process group initialization until
accelerator.setup_environment
.prepare_data
is called before, which works fine for when lightning launches subprocess/spawns the trainer processes for distributed training, but not so for when the user has launched training outside of the trainer.This adds a barrier immediately before and after the setup hook is called.
We indirectly address this by adding barriers before setup runs (to hedge against the
prepare_data
case) as well as aftersetup
hooks run in case the model defines layers inside ofsetup
in order to synchronize before the module is wrapped in DDP.Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃