Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move init_ddp_connection to DDP Plugin #4407

Merged
merged 10 commits into from
Nov 18, 2020

Conversation

ananthsub
Copy link
Contributor

What does this PR do?

This moves init_ddp_connection to the DDP Plugin, making it easily overridable in the same place as the configure_ddp for the model. This means custom implementations can be shared easily cross accelerator backends

Some questions:

  • Should this live in the specific DDP accelerators? i made this change to avoid copy/paste across accelerators by stick everything in the base class and to get feedback. DDP doesn't apply in the base class so it sticks out right now
  • On the other hand, that copy/paste already exists for configure_ddp

cc @williamFalcon @awaelchli

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@justusschock
Copy link
Member

Do we need the init_ddp_connection in the accelerator at all if it just forwards to the plugin? Or is there anything different across different types of DDP?

@ananthsub
Copy link
Contributor Author

@justusschock init_ddp_connection used to be a hook on the lightning module, then it was moved to the accelerator. there are use cases which want to customize this (e.g. create custom process groups), but currently they'd need to write a whole accelerator backend just to do so

Comment on lines +67 to +73
def init_ddp_connection(
self,
trainer,
cluster_environment,
global_rank: int,
world_size: int,
is_slurm_managing_tasks: bool = True,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't know if this is the right interface. is this leaking too much? should the plugin have a reference to the trainer?

Copy link
Member

@justusschock justusschock Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we already have too many things with trainer references. Besides that, I think the interface is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned earlier, we’re cleaning up trainer references for the next stage of refactoring.

but i think that’s for a different PR.

@stale
Copy link

stale bot commented Nov 11, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Nov 11, 2020
@stale
Copy link

stale bot commented Nov 16, 2020

This pull request is going to be closed. Please feel free to reopen it create a new from the actual master.

@stale stale bot closed this Nov 16, 2020
@SeanNaren SeanNaren reopened this Nov 18, 2020
@stale stale bot removed the won't fix This will not be worked on label Nov 18, 2020
@SeanNaren
Copy link
Contributor

Opening this because I'm running into a case where I need to define an additional torch RPC connection. This would allow the additional connection to exist within the plugin.

@SeanNaren SeanNaren added this to the 1.1 milestone Nov 18, 2020
@SeanNaren SeanNaren added the distributed Generic distributed-related topic label Nov 18, 2020
@codecov
Copy link

codecov bot commented Nov 18, 2020

Codecov Report

Merging #4407 (f77ffdb) into master (e7134a9) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4407   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         117     117           
  Lines        8949    8954    +5     
======================================
+ Hits         8319    8321    +2     
- Misses        630     633    +3     

@awaelchli awaelchli added refactor design Includes a design discussion labels Nov 18, 2020
@williamFalcon williamFalcon merged commit 45c5760 into Lightning-AI:master Nov 18, 2020
rohitgr7 pushed a commit that referenced this pull request Nov 21, 2020
* Move init_ddp_connection to DDP Plugin

* cluster-env

* trainer?

* imports

* Update ddp_plugin.py

Co-authored-by: Sean Naren <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Includes a design discussion distributed Generic distributed-related topic refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants