Destroy process group in DDP destructor #8080

Closed
carmocca wants to merge 37 commits into master from bug/teardown-ddp-process-group

Conversation

Contributor

@carmocca carmocca commented Jun 22, 2021

What does this PR do?

Discovered by running a DeepSpeed test without special=True, which produced the following error:

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

This happens because a previous test initialized the process group with the gloo backend, and we never destroyed it because that test raised a SystemExit during setup. Since special tests are run separately, this wasn't caught before.

https://github.com/PyTorchLightning/pytorch-lightning/blob/f79f0f9de112bc0a810587147b2ff5d1f3bfb256/tests/accelerators/test_ddp.py#L98-L122
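
For illustration, a minimal standalone sketch of the leak (an assumption written for this description, not the Lightning test code): torch.distributed keeps the default process group in module-level state, so a group that is never destroyed leaks into whatever runs next in the same process.

import torch.distributed as dist

def first_test():
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)
    raise SystemExit("aborted in setup, so teardown never runs")

def second_test():
    # Anything guarded by `if not dist.is_initialized()` skips its own init and
    # silently reuses the leftover gloo group from first_test.
    assert dist.is_initialized()
    assert dist.get_backend() == "gloo"

try:
    first_test()
except SystemExit:
    pass
second_test()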

So the fix is:

  • Properly release the group in the destructor (originally in teardown, but we want to reuse the process group across fit/validate/test calls); see the sketch below the list.
  • Do not raise a SystemExit in setup, as teardown won't be called in that case. We should eventually also clean up after ourselves in cases like that.

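The idea behind the destructor change, as a sketch with illustrative names (DDPPluginSketch is not the real Lightning class): destroy the default process group when the object that created it is garbage collected, rather than in teardown(), so the group can be reused across fit/validate/test calls on the same Trainer.

import torch.distributed as dist

class DDPPluginSketch:
    def setup_environment(self):
        # Only create a group if one does not already exist, so the same group
        # is reused across fit/validate/test calls.
        if not dist.is_initialized():
            dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1)

    def __del__(self):
        # Release the global state this object created; guard against the case
        # where setup never ran or the group was already destroyed.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
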
The changes in #8070 need this to work, so that PR will act as a test.

Closes #8115

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@carmocca carmocca added the bug and distributed labels Jun 22, 2021
@carmocca carmocca added this to the v1.3.x milestone Jun 22, 2021
@carmocca carmocca self-assigned this Jun 22, 2021
@codecov

codecov bot commented Jun 22, 2021

Codecov Report

Merging #8080 (fc6338e) into master (0e19d16) will decrease coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #8080    +/-   ##
=======================================
- Coverage      93%     88%    -5%     
=======================================
  Files         212     212            
  Lines       13720   13726     +6     
=======================================
- Hits        12751   12069   -682     
- Misses        969    1657   +688     

@carmocca carmocca marked this pull request as draft June 22, 2021 21:33
@carmocca carmocca force-pushed the bug/teardown-ddp-process-group branch from 1acdb23 to e5602c9 Compare June 22, 2021 22:37
Contributor

@tchaton tchaton left a comment


LGTM! Great catch!

@SeanNaren SeanNaren self-requested a review June 23, 2021 09:11
@SeanNaren
Contributor

@awaelchli caught the fact that we probably want to keep the communication alive till the end of all trainer stages. Not entirely sure what would be the preferred approach if this is the case!

CHANGELOG.md (outdated review thread, resolved)
@awaelchli
Contributor

awaelchli commented Jun 23, 2021

@ananthsub is destroying the process group on trainer destruction breaking any use cases you know of?

for example, this will create and destroy the group N times:

for n in range(1, N + 1):
    trainer = Trainer(max_epochs=n, ...)
    trainer.fit(...)
    ...

in current Lightning, it would only create it once

@carmocca
Contributor Author

this will create and destroy the group N times

So this fix is necessary: if the trainer configuration differs between iterations (for example, a different plugin is used), side effects from the previous process group can impact the new trainer.

@carmocca carmocca changed the title Destroy process group in DDP teardown Destroy process group in DDP destructor Jun 23, 2021
@Borda Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021
@edenlightning edenlightning modified the milestones: v1.4, v1.5, v1.4.x Jul 6, 2021
@tchaton tchaton modified the milestones: v1.4.x, v1.5, v1.6.x Nov 1, 2021
@awaelchli awaelchli modified the milestones: v1.6.x, 1.5.x Nov 3, 2021
@Borda Borda modified the milestones: 1.5.x, 1.6 Mar 21, 2022
@carmocca carmocca modified the milestones: 1.6, future Mar 28, 2022
@awaelchli
Contributor

@carmocca I think this is not needed anymore, right?

@carmocca
Contributor Author

@awaelchli We still don't destroy the process group after the Trainer ends. We do it in our CI through conftest.
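
A hypothetical conftest.py fixture, as an assumption of what cleaning up "through conftest" could look like (not the actual Lightning CI code): after every test, tear down any default process group that was left behind.

import pytest
import torch.distributed as dist

@pytest.fixture(autouse=True)
def teardown_process_group():
    # Run the test first, then tear down any default process group it (or an
    # earlier aborted test) left behind.
    yield
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()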

However, it's unclear whether this is desired, and even if we decide it is, it's a breaking change.

So I'm fine with closing this. Can always be revisited later.

@carmocca carmocca closed this Jul 28, 2022
@carmocca carmocca deleted the bug/teardown-ddp-process-group branch July 28, 2022 17:22
@carmocca carmocca removed this from the future milestone Nov 29, 2023