Destroy process group in DDP destructor #8080
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master   #8080     +/-   ##
========================================
- Coverage      93%      88%       -5%
========================================
  Files         212      212
  Lines       13720    13726        +6
========================================
- Hits        12751    12069      -682
- Misses        969     1657      +688
Force-pushed from 1acdb23 to e5602c9
LGTM! Great catch!
@awaelchli caught the fact that we probably want to keep the communication alive until the end of all trainer stages. Not entirely sure what the preferred approach would be if this is the case!
@ananthsub is destroying the process group on trainer destruction breaking any use cases you know of? For example, this will create and destroy the group N times:

    for n in range(1, N + 1):
        trainer = Trainer(max_epochs=n, ...)
        trainer.fit(...)
        ...

whereas in current Lightning, it would only create it once.
So this fix is necessary because if the trainer config is different on each iteration (maybe a different plugin is used), side effects from the previous process group can impact the new trainer.
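For illustration, here is a minimal sketch of the idea behind this PR, assuming a plugin-like object that owns the group (the class and method names are invented, not Lightning's actual ones): destroy the default process group when the owner is garbage-collected, so the next trainer starts from a clean slate.

    import torch.distributed as dist

    class DDPOwnerSketch:
        """Illustrative stand-in for the object that initialized the group."""

        def setup(self) -> None:
            # Real code would call dist.init_process_group(...) exactly once here.
            ...

        def __del__(self) -> None:
            # Guard the call: the group may never have been initialized,
            # or it may already have been destroyed elsewhere.
            if dist.is_available() and dist.is_initialized():
                dist.destroy_process_group()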
@carmocca I think this is not needed anymore, right?
@awaelchli We still don't destroy the process group after the trainer finishes. However, it's unclear whether this is desired, and even if we want to, it's a breaking change. So I'm fine with closing this. It can always be revisited later.
What does this PR do?
Discovered by running a DeepSpeed test without special=True, which would produce the error:

    RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

This is because a previous test initialized the process group with the gloo backend and we never destroyed it because it raised a SystemExit on setup. And since special tests are run separately, this wasn't caught before.

https://github.com/PyTorchLightning/pytorch-lightning/blob/f79f0f9de112bc0a810587147b2ff5d1f3bfb256/tests/accelerators/test_ddp.py#L98-L122
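For illustration, a standalone sketch of that failure mode (the address, port, and function names below are placeholders, not the repo's actual tests): a SystemExit raised after init_process_group leaves the gloo group alive for whatever runs next in the same process.

    import torch.distributed as dist

    def earlier_test():
        # The earlier test initializes the default group with the gloo backend...
        dist.init_process_group(
            backend="gloo",
            init_method="tcp://127.0.0.1:29501",
            rank=0,
            world_size=1,
        )
        # ...and bails out before any teardown has a chance to run.
        raise SystemExit("requires special=True")

    try:
        earlier_test()
    except SystemExit:
        pass

    # The stale group leaks into the next test in the same process:
    assert dist.is_initialized()         # still True
    assert dist.get_backend() == "gloo"  # still the old backend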
So the fix is:
- Destroy the process group in teardown (edit: in the destructor, since we want to reuse the process group across fit/validate/test calls).
- Do not raise a SystemExit in setup, as teardown won't get called in that case (see the sketch after this list). We should eventually also clean up after ourselves in cases like that.
- The changes in #8070 need this to work, so that will act as a test.
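On the second point, a small sketch of how the skip can happen before any distributed setup, assuming a pytest-style suite (the environment variable name is an assumption, not necessarily the one this repo uses): if the decision is made before init_process_group, there is never a half-initialized group left behind.

    import os
    import pytest

    # Assumed gate: only run in the dedicated "special tests" job.
    requires_special = pytest.mark.skipif(
        os.getenv("PL_RUNNING_SPECIAL_TESTS", "0") != "1",
        reason="needs the special-tests job",
    )

    @requires_special
    def test_something_distributed():
        # init_process_group only ever runs here, once we know the test will
        # actually execute, so every setup has a matching teardown.
        ...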
Closes #8115