Replace PostLocalSGDOptimizer with a dedicated model averaging component #12378
Conversation
I wonder if this is compatible with ddp spawn. I don't immediately see a reason why not.
Since launchers have been extracted, we want to align the two strategies as much as possible and eventually merge them into one.
please add a test and changelog entry as well
Thanks, @daniellepintz for the PR. Looks good to me, left a couple of comments for your review. :)
For some reason the tests for PT < 1.10 are complaining.
LGTM!
@daniellepintz please also update the docs for DDP optimizations with this support
I updated the docs, but something seems wrong with how they display in the CI: the "note" and code sections get combined (https://153988-178626720-gh.circle-artifacts.com/0/html/advanced/model_parallel.html#ddp-optimizations) compared to the current docs (https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html?highlight=ddp#ddp-optimizations). However, this seems to be an issue in all CI jobs; for example, here is the same page from #12245: https://153982-178626720-gh.circle-artifacts.com/0/html/advanced/model_parallel.html#ddp-optimizations
I believe this is ready to be merged - can someone please enable auto merge?
Confirmed the docs on master look good (https://pytorch-lightning.readthedocs.io/en/latest/advanced/model_parallel.html?highlight=ddp#ddp-optimizations), so it was just a weird thing with the CI.
What does this PR do?
In this PR I am finishing #9446, which was created by @wayi1 before the Strategy refactor. The description below is copied from that PR.
This is an improvement of #8967.
Replace `PostLocalSGDOptimizer` with a dedicated model averaging component that can run after `optimizer_step`. The previous implementation could cause data races and fail training in some cases: for example, if the data loading phase also involves an allreduce (e.g., to check whether the input data stream has reached the end), that allreduce can be kicked off during `optimizer.step()`. The implementation in this PR fixes such issues.
Proposal: pytorch/pytorch#59699
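For reference, here is a rough sketch of how post-local SGD is then enabled from the Lightning side. This is based on the DDP optimizations docs linked later in this thread; argument names such as `model_averaging_period` on `DDPStrategy` are taken from those docs and may vary between versions:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy
from torch.distributed.algorithms.ddp_comm_hooks import post_localSGD_hook as post_localSGD

# Switch from global allreduce to local SGD after 8 steps, and average the
# model across all ranks every 4 optimizer steps afterwards.
trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(
        ddp_comm_state=post_localSGD.PostLocalSGDState(
            process_group=None,
            subgroup=None,
            start_localSGD_iter=8,
        ),
        ddp_comm_hook=post_localSGD.post_localSGD_hook,
        model_averaging_period=4,
    ),
)
```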
There are two options to enable post-local SGD using vanilla PyTorch:

Option 1: Post-LocalSGD Comm Hook + Periodic Model Averager
https://github.com/pytorch/pytorch/blob/master/torch/distributed/algorithms/model_averaging/averagers.py#L48-L78
This option requires a one-line addition in the training loop (such as `averager.average_parameters(model.parameters())`), as shown in the sketch below.
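A minimal sketch of Option 1, assuming a single-process-group setup with placeholder model, data, and hyperparameters (the hook and averager come from the modules linked above):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState,
    post_localSGD_hook,
)
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")  # placeholder backend
model = DDP(nn.Linear(32, 32))
dataloader = [torch.randn(8, 32) for _ in range(200)]  # placeholder data

# Allreduce gradients globally for the first 100 steps, then switch to local SGD.
state = PostLocalSGDState(process_group=None, subgroup=None, start_localSGD_iter=100)
model.register_comm_hook(state, post_localSGD_hook)
averager = PeriodicModelAverager(period=4, warmup_steps=100)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for inputs in dataloader:
    optimizer.zero_grad()
    model(inputs).sum().backward()
    optimizer.step()
    # The one extra line: average parameters across ranks every `period` steps.
    averager.average_parameters(model.parameters())
```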
Option 2: Post-LocalSGD Comm Hook + Post-LocalSGD Optimizer
https://github.com/pytorch/pytorch/blob/master/torch/distributed/optim/post_localSGD_optimizer.py#L15-L48
This option does not need any code change in the training loop; I tried it earlier. However, one limitation is that if another phase, such as data loading, also runs communication like allreduce and overlaps with `optimizer_step()`, this can cause a potential data race and fail the training. Since such a concurrent phase is outside the control of model averaging, I didn't choose this implementation. A sketch of this option follows.
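For comparison, a minimal sketch of Option 2, continuing from the previous sketch (same placeholder `model`, `averager`, and `dataloader`; `PostLocalSGDOptimizer` comes from the file linked above):

```python
from torch.distributed.optim import PostLocalSGDOptimizer

# Wrap the local optimizer so that step() also runs the periodic model
# averaging; the training loop itself stays unchanged.
local_optim = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = PostLocalSGDOptimizer(optim=local_optim, averager=averager)

for inputs in dataloader:
    optimizer.zero_grad()
    model(inputs).sum().backward()
    # Averaging happens inside step(); a concurrent allreduce (e.g. from data
    # loading) overlapping with this call is what can trigger the data race.
    optimizer.step()
```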
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃