Destroy process group in DDP destructor #8080

Closed
carmocca wants to merge 37 commits into master from bug/teardown-ddp-process-group

Conversation

Contributor

@carmocca carmocca commented Jun 22, 2021

What does this PR do?

Discovered by running a DeepSpeed test without special=True, which produced the following error:

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

This happens because a previous test initialized the process group with the gloo backend, and we never destroyed it because that test raised a SystemExit during setup. Since special tests are run separately, this wasn't caught before.

https://github.com/PyTorchLightning/pytorch-lightning/blob/f79f0f9de112bc0a810587147b2ff5d1f3bfb256/tests/accelerators/test_ddp.py#L98-L122
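
For illustration, a minimal standalone sketch of the leak (an assumption written for this description, not the Lightning test code): torch.distributed keeps the default process group in module-level state, so a group that is never destroyed leaks into whatever runs next in the same process.

import torch.distributed as dist

def first_test():
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)
    raise SystemExit("aborted in setup, so teardown never runs")

def second_test():
    # Anything guarded by `if not dist.is_initialized()` skips its own init and
    # silently reuses the leftover gloo group from first_test.
    assert dist.is_initialized()
    assert dist.get_backend() == "gloo"

try:
    first_test()
except SystemExit:
    pass
second_test()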

So the fix is:

  • Properly release the group in the destructor (originally in teardown, but we want to reuse the process group across fit/validate/test calls); see the sketch below the list.
  • Do not raise a SystemExit in setup, as teardown won't be called in that case. We should eventually also clean up after ourselves in cases like that.

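The idea behind the destructor change, as a sketch with illustrative names (DDPPluginSketch is not the real Lightning class): destroy the default process group when the object that created it is garbage collected, rather than in teardown(), so the group can be reused across fit/validate/test calls on the same Trainer.

import torch.distributed as dist

class DDPPluginSketch:
    def setup_environment(self):
        # Only create a group if one does not already exist, so the same group
        # is reused across fit/validate/test calls.
        if not dist.is_initialized():
            dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1)

    def __del__(self):
        # Release the global state this object created; guard against the case
        # where setup never ran or the group was already destroyed.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
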
The changes in #8070 need this to work, so that PR will act as a test.

Closes #8115

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@carmocca carmocca added the bug and distributed labels Jun 22, 2021
@carmocca carmocca added this to the v1.3.x milestone Jun 22, 2021
@carmocca carmocca self-assigned this Jun 22, 2021
@codecov

codecov bot commented Jun 22, 2021

Codecov Report

Merging #8080 (fc6338e) into master (0e19d16) will decrease coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #8080    +/-   ##
=======================================
- Coverage      93%     88%    -5%     
=======================================
  Files         212     212            
  Lines       13720   13726     +6     
=======================================
- Hits        12751   12069   -682     
- Misses        969    1657   +688     

@carmocca carmocca marked this pull request as draft June 22, 2021 21:33
@carmocca carmocca force-pushed the bug/teardown-ddp-process-group branch from 1acdb23 to e5602c9 Compare June 22, 2021 22:37
Contributor

@tchaton tchaton left a comment


LGTM! Great catch!

@SeanNaren SeanNaren self-requested a review June 23, 2021 09:11
@SeanNaren
Contributor

@awaelchli caught the fact that we probably want to keep the communication alive till the end of all trainer stages. Not entirely sure what would be the preferred approach if this is the case!

CHANGELOG.md (outdated review thread, resolved)
@awaelchli
Contributor

awaelchli commented Jun 23, 2021

@ananthsub is destroying the process group on trainer destruction breaking any use cases you know of?

for example, this will create and destroy the group N times:

for n in range(1, N + 1):
    trainer = Trainer(max_epochs=n, ...)
    trainer.fit(...)
    ...

in current Lightning, it would only create it once

@carmocca
Contributor Author

this will create and destroy the group N times

So this fix is necessary: if the trainer configuration differs between iterations (for example, a different plugin is used), side effects from the previous process group can impact the new trainer.

@carmocca carmocca changed the title Destroy process group in DDP teardown Destroy process group in DDP destructor Jun 23, 2021
@Borda Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021
@edenlightning edenlightning modified the milestones: v1.4, v1.5, v1.4.x Jul 6, 2021
@tchaton tchaton modified the milestones: v1.4.x, v1.5, v1.6.x Nov 1, 2021
@awaelchli awaelchli modified the milestones: v1.6.x, 1.5.x Nov 3, 2021
@Borda Borda modified the milestones: 1.5.x, 1.6 Mar 21, 2022
@carmocca carmocca modified the milestones: 1.6, future Mar 28, 2022
@awaelchli
Contributor

@carmocca I think this is not needed anymore, right?

@carmocca
Contributor Author

@awaelchli We still don't destroy the process group after the Trainer ends. We do it in our CI through conftest.
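
A hypothetical conftest.py fixture, as an assumption of what cleaning up "through conftest" could look like (not the actual Lightning CI code): after every test, tear down any default process group that was left behind.

import pytest
import torch.distributed as dist

@pytest.fixture(autouse=True)
def teardown_process_group():
    # Run the test first, then tear down any default process group it (or an
    # earlier aborted test) left behind.
    yield
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()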

However, it's unclear whether this is desired, and even if we decide it is, it's a breaking change.

So I'm fine with closing this. Can always be revisited later.

@carmocca carmocca closed this Jul 28, 2022
@carmocca carmocca deleted the bug/teardown-ddp-process-group branch July 28, 2022 17:22
@carmocca carmocca removed this from the future milestone Nov 29, 2023