Replace PostLocalSGDOptimizer with a dedicated model averaging component #12378
Conversation
I wonder if this is compatible with ddp spawn. I don't immediately see a reason why not.
Since launchers have been extracted, we want to align the two strategies as much as possible and eventually merge them into one.
please add a test and changelog entry as well
Thanks, @daniellepintz for the PR. Looks good to me, left a couple of comments for your review. :)
For some reason the tests for PT < 1.10 are complaining.
LGTM!
@daniellepintz please also update the docs for DDP optimizations with this support
I updated the docs, but something seems wrong with how they display in the CI: the "note" and code sections get combined (https://153988-178626720-gh.circle-artifacts.com/0/html/advanced/model_parallel.html#ddp-optimizations) compared to the current docs (https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html?highlight=ddp#ddp-optimizations). However, this seems to be an issue in all CI jobs; for example, here is the same page from #12245: https://153982-178626720-gh.circle-artifacts.com/0/html/advanced/model_parallel.html#ddp-optimizations
I believe this is ready to be merged - can someone please enable auto merge?
Confirmed the docs on master look good (https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html?highlight=ddp#ddp-optimizations), so it was just a weird thing with the CI.
What does this PR do?
In this PR I am finishing #9446, which was created by @wayi1 before the Strategy refactor. The description below is copied from that PR.
This is an improvement of #8967.
Replace `PostLocalSGDOptimizer` with a dedicated model averaging component that can run after `optimizer_step`. The previous implementation could cause data races and fail training in some cases: for example, if the data loading phase also involves an allreduce (e.g., to check whether the input data stream has reached the end), that allreduce can be kicked off during `optimizer.step()`. The implementation in this PR fixes such issues.
Proposal: pytorch/pytorch#59699
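For reference, here is a rough sketch of how post-local SGD is then enabled from the Lightning side. This is based on the DDP optimizations docs linked later in this thread; argument names such as `model_averaging_period` on `DDPStrategy` are taken from those docs and may vary between versions:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy
from torch.distributed.algorithms.ddp_comm_hooks import post_localSGD_hook as post_localSGD

# Switch from global allreduce to local SGD after 8 steps, and average the
# model across all ranks every 4 optimizer steps afterwards.
trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(
        ddp_comm_state=post_localSGD.PostLocalSGDState(
            process_group=None,
            subgroup=None,
            start_localSGD_iter=8,
        ),
        ddp_comm_hook=post_localSGD.post_localSGD_hook,
        model_averaging_period=4,
    ),
)
```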
There are two options to enable post-local SGD using vanilla PyTorch:

Option 1: Post-LocalSGD Comm Hook + Periodic Model Averager
https://github.com/pytorch/pytorch/blob/master/torch/distributed/algorithms/model_averaging/averagers.py#L48-L78
This option requires a one-line addition in the training loop (such as `averager.average_parameters(model.parameters())`), as shown in the sketch below.
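A minimal sketch of Option 1, assuming a single-process-group setup with placeholder model, data, and hyperparameters (the hook and averager come from the modules linked above):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState,
    post_localSGD_hook,
)
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")  # placeholder backend
model = DDP(nn.Linear(32, 32))
dataloader = [torch.randn(8, 32) for _ in range(200)]  # placeholder data

# Allreduce gradients globally for the first 100 steps, then switch to local SGD.
state = PostLocalSGDState(process_group=None, subgroup=None, start_localSGD_iter=100)
model.register_comm_hook(state, post_localSGD_hook)
averager = PeriodicModelAverager(period=4, warmup_steps=100)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for inputs in dataloader:
    optimizer.zero_grad()
    model(inputs).sum().backward()
    optimizer.step()
    # The one extra line: average parameters across ranks every `period` steps.
    averager.average_parameters(model.parameters())
```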
Option 2: Post-LocalSGD Comm Hook + Post-LocalSGD Optimizer
https://github.com/pytorch/pytorch/blob/master/torch/distributed/optim/post_localSGD_optimizer.py#L15-L48
This option does not need any code change in the training loop; I tried it earlier. However, one limitation is that if another phase, such as data loading, also runs communication like allreduce and overlaps with `optimizer_step()`, this can cause a potential data race and fail the training. Since such a concurrent phase is outside the control of model averaging, I didn't choose this implementation. A sketch of this option follows.
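For comparison, a minimal sketch of Option 2, continuing from the previous sketch (same placeholder `model`, `averager`, and `dataloader`; `PostLocalSGDOptimizer` comes from the file linked above):

```python
from torch.distributed.optim import PostLocalSGDOptimizer

# Wrap the local optimizer so that step() also runs the periodic model
# averaging; the training loop itself stays unchanged.
local_optim = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = PostLocalSGDOptimizer(optim=local_optim, averager=averager)

for inputs in dataloader:
    optimizer.zero_grad()
    model(inputs).sum().backward()
    # Averaging happens inside step(); a concurrent allreduce (e.g. from data
    # loading) overlapping with this call is what can trigger the data race.
    optimizer.step()
```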
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃