Support checkpoint save and load with Stochastic Weight Averaging #9938

adamreeve · 2021-10-15T01:47:20Z

What does this PR do?

Partly addresses #6074 by supporting saving and loading the StochasticWeightAveraging callback data in checkpoints. Support for using SWA during validation will be added in a follow-up PR.
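As a rough illustration of why the callback data has to be part of the checkpoint, here is a plain-Python sketch. It is not the Lightning implementation: `RunningAverager`, its fields and `averaged_after_resume` are hypothetical names. The point it tries to show is that both the averaged weights and the number of models averaged so far must be saved, otherwise a resumed run would restart the average from scratch.

```python
from typing import Dict, List


class RunningAverager:
    """Toy stand-in for an SWA callback: keeps an equal-weight running average."""

    def __init__(self) -> None:
        self.n_averaged = 0
        self.average: List[float] = []

    def update(self, weights: List[float]) -> None:
        if self.n_averaged == 0:
            self.average = list(weights)
        else:
            # avg <- avg + (w - avg) / (n + 1)
            self.average = [
                a + (w - a) / (self.n_averaged + 1)
                for a, w in zip(self.average, weights)
            ]
        self.n_averaged += 1

    def state_dict(self) -> Dict[str, object]:
        # Everything needed to continue averaging after a restart.
        return {"n_averaged": self.n_averaged, "average": list(self.average)}

    def load_state_dict(self, state: Dict[str, object]) -> None:
        self.n_averaged = int(state["n_averaged"])
        self.average = list(state["average"])


def averaged_after_resume(snapshots: List[List[float]], split: int) -> List[float]:
    """Average `snapshots`, interrupting and resuming from saved state at `split`."""
    first = RunningAverager()
    for weights in snapshots[:split]:
        first.update(weights)
    saved = first.state_dict()  # what would be written into the checkpoint

    resumed = RunningAverager()
    resumed.load_state_dict(saved)
    for weights in snapshots[split:]:
        resumed.update(weights)
    return resumed.average
```

With the state restored, the interrupted run ends with the same average as an uninterrupted one; dropping `n_averaged` from the saved state would give the later snapshots too much weight.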

Does your PR introduce any breaking changes? If yes, please list them.

No

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

pytorch_lightning/callbacks/stochastic_weight_avg.py

rohitgr7

Perfect!
Great work!

Apologies for the delayed review! It was complex, and you pulled it off smartly 😃 🚀

awaelchli · 2022-08-15T16:56:03Z

@rohitgr7 Will this go into 1.7.x?

rohitgr7 · 2022-08-15T18:25:16Z

@awaelchli yes

* update version and changelog for 1.7.2 release
* Reset all results on epoch end (#14061)
  Co-authored-by: Carlos Mocholí
* Skip ddp fork tests on windows (#14121)
* Fix device placement when `.cuda()` called without specifying index (#14128)
* Convert subprocess test to standalone test (#14101)
* Fix entry point test for Python 3.10 (#14154)
* Fix flaky test caused by weak reference (#14157)
* Fix saving hyperparameters in a composition where parent is not a LM or LDM (#14151)
  Co-authored-by: Rohit Gupta
* Remove DeepSpeed version restriction from Lite (#13967)
* Configure the check-group app (#14165)
  Co-authored-by: Jirka
* Update onnxruntime requirement from <=1.12.0 to <1.13.0 in /requirements (#14083)
  Updates the requirements on [onnxruntime](https://github.com/microsoft/onnxruntime) to permit the latest version.
  - [Release notes](https://github.com/microsoft/onnxruntime/releases)
  - [Changelog](https://github.com/microsoft/onnxruntime/blob/master/docs/ReleaseManagement.md)
  - [Commits](microsoft/onnxruntime@v0.1.4...v1.12.1)
  ---
  updated-dependencies:
  - dependency-name: onnxruntime
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Update gcsfs requirement from <2022.6.0,>=2021.5.0 to >=2021.5.0,<2022.8.0 in /requirements (#14079)
  Update gcsfs requirement in /requirements
  Updates the requirements on [gcsfs](https://github.com/fsspec/gcsfs) to permit the latest version.
  - [Release notes](https://github.com/fsspec/gcsfs/releases)
  - [Commits](fsspec/gcsfs@2021.05.0...2022.7.1)
  ---
  updated-dependencies:
  - dependency-name: gcsfs
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot]
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix a bug that caused spurious `AttributeError` when multiple `DataLoader` classes are imported (#14117)
  fix
* CI: Replace `_` of in GHA workflow filenames with `-` (#13917)
  * Rename workflow files
  * Update docs
  * Fix azure badges
  * Update the main readme
  * bad rebase
  * Update doc
* CI: Update Windows version from 2019 to 2022 (#14129)
  Update windows
* CI/CD: Add CUDA version to docker image tags (#13831)
  * append cuda version to tags
  * revertme: push to hub
  * Update docker readme
  * Build base-conda-py3.9-torch1.12-cuda11.3.1
  * Use new images in conda tests
  * revertme: push to hub
  * Revert "revertme: push to hub" This reverts commit 0f7d534.
  * Revert "revertme: push to hub" This reverts commit 46a05fc.
  * Run conda if workflow edited
  * Run gpu testing if workflow edited
  * Use new tags in release/Dockerfile
  * Build base-cuda and PL release images with all combinations
  * Update release docker
  * Update conda from py3.9-torch1.12 to py3.10-torch.1.12
  * Fix ubuntu version
  * Revert conda
  * revertme: push to hub
  * Don't build Python 3.10 for now...
  * Fix pl release builder
  * updating version contribute to the error? docker/buildx#456
  * Update actions' versions
  * Update slack user to notify
  * Don't use 11.6.0 to avoid bagua incompatibility
  * Don't use 11.1, and use 11.1.1
  * Update .github/workflows/ci-pytorch_test-conda.yml
    Co-authored-by: Luca Medeiros
  * Update trigger
  * Ignore artfacts from tutorials
  * Trim docker images to distribute
  * Add an image for tutorials
  * Update conda image 3.8x1.10
  * Try different conda variants
  * No need to set cuda for conda jobs
  * Update who to notify ipu failure
  * Don't push
  * update filenaem
  Co-authored-by: Luca Medeiros
* Avoid entry_points deprecation warning (#14052)
  Co-authored-by: Adam J. Stewart
  Co-authored-by: Akihiro Nitta
* Configure the check-group app (#14165)
  Co-authored-by: Jirka
* Profile batch transfer and gradient clipping hooks (#14069)
  Co-authored-by: Rohit Gupta
* Avoid false positive warning about using `sync_dist` when using torchmetrics (#14143)
  Co-authored-by: Carlos Mocholí
  Co-authored-by: Rohit Gupta
* Avoid raising the sampler warning if num_replicas=1 (#14097)
  Co-authored-by: Carlos Mocholí
  Co-authored-by: Rohit Gupta
  Co-authored-by: otaj
* Remove skipping logic in favor of path filtering (#14170)
* Support checkpoint save and load with Stochastic Weight Averaging (#9938)
  Co-authored-by: thomas chaton
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  Co-authored-by: Adrian Wälchli
  Co-authored-by: Carlos Mocholi
  Co-authored-by: Kushashwa Ravi Shrimali
  Co-authored-by: Jirka
  Co-authored-by: Rohit Gupta
* Use fsdp module to initialize precision scalar for fsdp native (#14092)
  Co-authored-by: Carlos Mocholí
  Co-authored-by: Laverne Henderson
  Co-authored-by: Rohit Gupta
* add more issues types (#14174)
  * add more issues types
  * Update .github/ISSUE_TEMPLATE/config.yml
    Co-authored-by: Mansy
  * typo
  Co-authored-by: Adrian Wälchli
  Co-authored-by: Kaushik B
  Co-authored-by: Mansy
  Co-authored-by: Adrian Wälchli
  Co-authored-by: Laverne Henderson
  Co-authored-by: Akihiro Nitta
* CI: clean building docs (#14216)
  * CI: clean building docs
  * group
  * .
* CI: docker focus on PL only (#14246)
  * CI: docker focus on PL only
  * group
* Allowed setting attributes on `DataLoader` and `BatchSampler` when instantiated inside `*_dataloader` hooks (#14212)
  Co-authored-by: otaj
* Revert "Remove skipping logic in favor of path filtering (#14170)" (#14244)
* Update defaults for WandbLogger's run name and project name (#14145)

Co-authored-by: Carlos Mocholí
Co-authored-by: Rohit Gupta
Co-authored-by: Jirka
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta
Co-authored-by: Luca Medeiros
Co-authored-by: Adam J. Stewart
Co-authored-by: otaj
Co-authored-by: Adam Reeve
Co-authored-by: thomas chaton
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kushashwa Ravi Shrimali
Co-authored-by: Laverne Henderson
Co-authored-by: Jirka Borovec
Co-authored-by: Kaushik B
Co-authored-by: Mansy

kampelmuehler · 2023-01-16T11:42:11Z

@adamreeve
Am I correct that the "follow up PR" on using the _average_model for validation hasn't yet been done?

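For those looking to do it manually, this would be a very rough sketch:

    from pytorch_lightning.callbacks import StochasticWeightAveraging

    # Methods to add to your LightningModule:

    def on_fit_start(self):
        # Keep a handle on the SWA callback, if one is attached to the trainer.
        self.swa = None
        for cb in self.trainer.callbacks:
            if isinstance(cb, StochasticWeightAveraging):
                self.swa = cb

    def validation_step(self, batch, batch_idx):
        x, y = batch

        # Loss with the current (non-averaged) weights.
        loss = self.loss_fn(self(x), y)
        print(f"{loss.item():.02f}", end="")
        # _initialized and _average_model are private attributes of the callback.
        if self.swa is not None and self.swa._initialized:
            loss_swa = self.loss_fn(self.swa._average_model(x), y)
            print(f" {loss_swa.item():.02f}", end="")
        print()
        return loss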

kampelmuehler · 2023-01-16T11:43:25Z

@adamreeve
I think that these features should also be documented somehow. Currently the docs say nothing about how to actually use the SWA model (or that the averaged weights are only transferred to the original model at the end of training). What do you think?

Edit:
A couple of users are unsure about SWA usage:
https://lightning.ai/forums/t/how-to-implement-swa/1761
https://lightning.ai/forums/t/stochasticweightaveraging-validation-logging-and-checkpoints/2023/2
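The behaviour mentioned above, where the averaged weights only reach the original model once training ends, can be pictured with a toy simulation on a single scalar weight. This is a sketch under stated assumptions, not Lightning code: `train_with_swa` and `swa_start` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def train_with_swa(epoch_weights: List[float], swa_start: int) -> Tuple[List[float], float]:
    """Simulate training where averaging starts at epoch `swa_start`.

    Returns the weight the model exposes after each epoch, and the weight it
    holds once the average is transferred at the end of training.
    """
    visible = []
    average, n_averaged = 0.0, 0
    for epoch, weight in enumerate(epoch_weights):
        if epoch >= swa_start:
            # The average is tracked separately from the model's own weight.
            n_averaged += 1
            average += (weight - average) / n_averaged
        # During training the model still holds its own, non-averaged weight.
        visible.append(weight)
    # Only now is the averaged value copied into the model.
    final = average if n_averaged else epoch_weights[-1]
    return visible, final
```

For example, `train_with_swa([4.0, 2.0, 1.0, 3.0], swa_start=2)` exposes the raw weights at every epoch and ends with the mean of the last two, which is why validation metrics logged during training do not reflect the averaged model.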

kampelmuehler · 2023-01-16T13:33:05Z

Also:
The checkpointing is not automated in any sense, or is it?
The only way I see so far is to load the checkpoint manually, get the state_dict for SWA and load it into the callback instance by hand.

If I just declare the model, add an SWA callback to the trainer and pass the checkpoint path to trainer.fit(), it won't load the SWA callback state.

EDIT:
I was wrong: passing the checkpoint path to trainer.fit() correctly loads the callback state as well. trainer.predict(), however, will not load the SWA callback state, which might be intentional?
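For anyone who does need the manual route, for example before calling trainer.predict(), a minimal sketch could look like the following. It assumes that callback states are stored in the checkpoint dictionary under a "callbacks" entry keyed by each callback's `state_key`; that layout is an assumption to verify against your Lightning version, and `restore_callback_state` is a hypothetical helper, not a Lightning API.

```python
from typing import Any, Dict


def restore_callback_state(checkpoint: Dict[str, Any], callback: Any) -> bool:
    """Copy one callback's saved state from a checkpoint dict into `callback`.

    Assumption: states live under checkpoint["callbacks"], keyed by the
    callback's `state_key` (falling back to its class name here).
    Returns True if a matching state was found and loaded.
    """
    saved_states = checkpoint.get("callbacks", {})
    key = getattr(callback, "state_key", type(callback).__qualname__)
    if key not in saved_states:
        return False
    callback.load_state_dict(saved_states[key])
    return True
```

In practice the `checkpoint` argument would be the dictionary returned by loading the checkpoint file with `torch.load`, and `callback` the StochasticWeightAveraging instance passed to the trainer. Loading the state this way only restores the callback's data; it does not by itself copy the averaged weights into the model.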

kampelmuehler · 2023-01-16T13:53:11Z

cc @awaelchli

adamreeve · 2023-01-16T20:28:40Z

Hi @kampelmuehler

> Am I correct that the "follow up PR" on using the _average_model for validation hasn't yet been done?

Yes, that was originally part of this PR but the scope was reduced, so #6074 probably shouldn't have been closed. I had started working towards that in a separate branch (https://github.com/adamreeve/pytorch-lightning/commits/swa_validation) but that's now quite far behind master and doesn't include all the changes from this PR. Supporting batch normalization in conjunction with SWA made that a lot more complicated, but I think it was working. I don't currently have any plans to continue with that work.
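To give a sense of why batch normalization complicates things: once weights are averaged, the stored batch norm running statistics no longer match the activations the averaged weights produce, so they have to be recomputed with a pass over the data. In PyTorch, `torch.optim.swa_utils.update_bn` does this kind of recomputation. The function below is only a plain-Python illustration for a single activation channel, using a cumulative average of per-batch statistics; `recompute_moments` is a hypothetical name, and the batches are assumed to already be activations produced with the averaged weights.

```python
from statistics import mean, variance
from typing import Iterable, List, Tuple


def recompute_moments(batches: Iterable[List[float]]) -> Tuple[float, float]:
    """Recompute running mean and variance from scratch over all batches.

    Each batch must contain at least two values so that the sample
    variance is defined.
    """
    running_mean, running_var, seen = 0.0, 0.0, 0
    for batch in batches:
        seen += 1
        # Cumulative average: every batch contributes equally.
        running_mean += (mean(batch) - running_mean) / seen
        running_var += (variance(batch) - running_var) / seen
    return running_mean, running_var
```

Doing this during validation as well as at the end of training means an extra pass over the training data each time the averaged weights are evaluated, which is part of the added complexity.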

This PR didn't really add any new user-visible feature; it just fixed checkpointing to work correctly with SWA so that training could be resumed. For example, it fixed #11665. But I agree that the documentation could be improved to better explain how SWA works.

It's been a while since I looked at this, so I'm not sure whether trainer.predict() not loading the SWA parameters is intentional or just a limitation of the current approach, but it sounds consistent with the behaviour of the averaged parameters not being transferred until training is completed.

kampelmuehler · 2023-01-17T07:51:07Z

Hi @adamreeve - thanks for the quick response and all the insights!

zhong-yy · 2024-05-27T12:55:47Z

Hi, is swa_validation removed in the latest version?

adamreeve mentioned this pull request Oct 15, 2021

Support training resume and saving best model for SWA #6074

Closed

adamreeve added 4 commits October 18, 2021 16:24

Save StochasticWeightAveraging callback data in checkpoints

72d0433

Add option to use SWA parameters during validation

3d2bf65

Allow restoring SWA parameters to a model from a checkpoint

1696273

Refactor SWA batch norm moment update to work with validation

c8db9d8

adamreeve force-pushed the swa_checkpoint branch from d1dc06a to c8db9d8 October 18, 2021 03:33

Add test for loading a model from a checkpoint with SWA parameters

004959b

adamreeve force-pushed the swa_checkpoint branch from 67364e9 to 004959b October 19, 2021 01:37

adamreeve added 3 commits October 19, 2021 16:57

Recompute batch norm moments when updating parameters from a checkpoint

d76528b

Handle when data batch is a list or tuple

0ea22e0

Save SWA scheduler step count in checkpoints

01ca2a7

adamreeve commented Oct 27, 2021


adamreeve added 2 commits October 28, 2021 11:28

Update SWA documentation and changelog

08d655b

Fix DeepSource code style issues

91ab357

tchaton added this to the v1.6 milestone Nov 1, 2021

tchaton added the feature and bug labels and removed the feature label Nov 1, 2021

tchaton modified the milestones: v1.6, v1.6.x Nov 1, 2021

awaelchli modified the milestones: v1.6.x, 1.5.x Nov 3, 2021

adamreeve changed the title ~~[Draft] Support checkpoint save and load with Stochastic Weight Averaging~~ Support checkpoint save and load with Stochastic Weight Averaging Nov 8, 2021

adamreeve marked this pull request as ready for review November 8, 2021 20:28

adamreeve requested review from awaelchli, Borda, carmocca, edenlightning, justusschock and kaushikb11 as code owners November 8, 2021 20:28

rohitgr7 linked an issue Aug 8, 2022 that may be closed by this pull request

Cant reload from checkpoint when using SWA #11665

Closed

rohitgr7 approved these changes Aug 9, 2022

Merge branch 'master' into swa_checkpoint

dcf5fea

mergify bot added the ready and has conflicts labels and removed the has conflicts and ready labels Aug 9, 2022

Merge branch 'master' into swa_checkpoint

ce9bcea

awaelchli enabled auto-merge (squash) August 9, 2022 22:41

mergify bot added the ready label and removed the has conflicts and ready labels Aug 9, 2022

awaelchli merged commit 975a4fc into Lightning-AI:master Aug 9, 2022

adamreeve deleted the swa_checkpoint branch August 10, 2022 08:24

awaelchli mentioned this pull request Aug 18, 2022

Profile batch transfer and gradient clipping hooks #14069

Merged
