
Replace PostLocalSGDOptimizer with a dedicated model averaging component #12378

Merged
12 commits merged into Lightning-AI:master from the model_averager branch on Mar 25, 2022

Conversation

@daniellepintz (Contributor) commented on Mar 19, 2022:

What does this PR do?

In this PR I am finishing #9446, which was created by @wayi1 before the Strategy refactor. The description below is copied from that PR.

This is an improvement of #8967.

Replace PostLocalSGDOptimizer with a dedicated model averaging component that can run after optimizer_step.

The previous implementation could cause data races and fail training in some cases. For example, if the data loading phase also involves an allreduce (e.g., to check whether the input data stream has reached its end), that allreduce can be kicked off during optimizer.step() and race with the communication done inside the optimizer. The implementation in this PR avoids this issue.
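
Roughly, the user-facing configuration on the Lightning side looks like the sketch below. This is only an illustration: the DDPStrategy arguments (ddp_comm_state, ddp_comm_hook, model_averaging_period) and the hyperparameter values are assumptions based on the strategy around this release, not an exact copy of the code in this PR.

```python
# Sketch: enabling post-local SGD through the Lightning DDP strategy.
# The strategy registers the post-local SGD comm hook on the DDP module and runs
# a periodic model averager after optimizer_step, instead of wrapping the
# optimizer in PostLocalSGDOptimizer.
import torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook as post_localSGD
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(
        ddp_comm_state=post_localSGD.PostLocalSGDState(
            process_group=None,      # use the default process group
            subgroup=None,           # default subgroup (intra-node)
            start_localSGD_iter=8,   # run regular DDP allreduce for the first 8 steps
        ),
        ddp_comm_hook=post_localSGD.post_localSGD_hook,
        model_averaging_period=4,    # average parameters across replicas every 4 steps
    ),
)
```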

Proposal: pytorch/pytorch#59699

There are two options to enable post-local SGD with vanilla PyTorch:

  1. Post-LocalSGD Comm Hook + Periodic Model Averager
    https://github.com/pytorch/pytorch/blob/master/torch/distributed/algorithms/model_averaging/averagers.py#L48-L78

    This option requires a 1-line addition to the training loop (such as averager.average_parameters(model.parameters())); see the sketch after this list.

  2. Post-LocalSGD Comm Hook + Post-LocalSGD Optimizer
    https://github.com/pytorch/pytorch/blob/master/torch/distributed/optim/post_localSGD_optimizer.py#L15-L48

    This option does not need any code change in the training loop, and I tried it earlier. However, one limitation is that if another phase, such as data loading, also runs communication like an allreduce and overlaps with optimizer_step(), it can cause a data race and fail the training. Since such a concurrent phase is outside the control of model averaging, I didn't choose this implementation.
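
For reference, here is a minimal, self-contained sketch of option 1 in vanilla PyTorch. It mirrors the upstream example linked above; the toy model, hyperparameters, and torchrun launch are assumptions for illustration, and it requires torch >= 1.10 with one GPU per process.

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook as post_localSGD
import torch.distributed.algorithms.model_averaging.averagers as averagers
import torch.nn as nn
import torch.nn.functional as F

# Assumes launch via `torchrun --nproc_per_node=<N> script.py`.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.parallel.DistributedDataParallel(
    nn.Linear(32, 2).cuda(), device_ids=[local_rank]
)

# 1. Register the post-local SGD communication hook: regular allreduce for the
#    first `start_localSGD_iter` steps, then switch to local SGD afterwards.
state = post_localSGD.PostLocalSGDState(
    process_group=None, subgroup=None, start_localSGD_iter=100
)
model.register_comm_hook(state, post_localSGD.post_localSGD_hook)

# 2. Create a periodic model averager and call it right after every optimizer step.
averager = averagers.PeriodicModelAverager(period=4, warmup_steps=100)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(200):
    inputs = torch.randn(8, 32).cuda()
    targets = torch.randint(0, 2, (8,)).cuda()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # The "1-line addition" to the training loop: periodic cross-replica averaging.
    averager.average_parameters(model.parameters())
```

With option 2, that last line disappears and the optimizer is instead wrapped as torch.distributed.optim.PostLocalSGDOptimizer(optim=optimizer, averager=averager), which is exactly the coupling this PR avoids in the DDP strategy.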

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@daniellepintz added this to the 1.7 milestone Mar 19, 2022
@daniellepintz marked this pull request as ready for review March 20, 2022 00:16
@awaelchli (Contributor) left a comment:

I wonder if this is compatible with ddp spawn. I don't immediately see a reason why not.
Since launchers have been extracted, we want to align the two strategies as much as possible and eventually merge them into one.

@ananthsub (Contributor) left a comment:

please add a test and changelog entry as well

4 review comments on pytorch_lightning/strategies/ddp.py (outdated, resolved)
@krshrimali (Contributor) left a comment:

Thanks, @daniellepintz for the PR. Looks good to me, left a couple of comments for your review. :)

2 review comments on pytorch_lightning/strategies/ddp.py (outdated, resolved)
@mergify bot added the "ready" (PRs ready to be merged) label Mar 21, 2022
@daniellepintz (Contributor, Author) commented:

For some reason the tests for PyTorch < 1.10 are complaining that name 'post_localSGD' is not defined, even though the test is marked with min_torch="1.10.0". It seems to be an issue with pytest.

@tchaton (Contributor) left a comment:

LGTM!

2 review comments on pytorch_lightning/strategies/ddp.py (outdated, resolved)
@mergify bot removed the "ready" (PRs ready to be merged) label Mar 23, 2022
@ananthsub (Contributor) left a comment:

@daniellepintz please also update the docs for DDP optimizations with this support

@mergify bot added the "ready" (PRs ready to be merged) label and removed the "has conflicts" and "ready" labels Mar 23, 2022
@daniellepintz (Contributor, Author) commented:

I updated the docs, but something seems wrong with how they render in the CI build. The "note" and code sections get combined: https://153988-178626720-gh.circle-artifacts.com/0/html/advanced/model_parallel.html#ddp-optimizations compared to the current docs: https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html?highlight=ddp#ddp-optimizations

However, it seems this is an issue in all CI jobs; for example, here is the same page from #12245: https://153982-178626720-gh.circle-artifacts.com/0/html/advanced/model_parallel.html#ddp-optimizations

@daniellepintz (Contributor, Author) commented:

I believe this is ready to be merged - can someone please enable auto merge?

@ananthsub merged commit 6329be6 into Lightning-AI:master Mar 25, 2022
@daniellepintz deleted the model_averager branch March 25, 2022 01:00
@daniellepintz (Contributor, Author) commented:

Confirmed the docs on master look good: https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html?highlight=ddp#ddp-optimizations

So it was just a weird thing with the CI docs build.

Labels
optimization, optimizer, ready (PRs ready to be merged), strategy: ddp (DistributedDataParallel)
9 participants