Support gradient accumulation using Horovod's `backward_passes_per_step` #11911

krshrimali · 2022-02-14T10:17:33Z

What does this PR do?

This PR attempts to fix #11732.

Passes the `backward_passes_per_step` kwarg to Horovod's `DistributedOptimizer` to support gradient accumulation, as requested in the issue.
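
For illustration, here is a minimal sketch of how an accumulation factor might be forwarded to Horovod. It is not taken from this PR's diff: the helper names `horovod_optimizer_kwargs` and `wrap_with_horovod` are hypothetical, and the validation shown is an assumption. Only `hvd.DistributedOptimizer` and its `named_parameters` and `backward_passes_per_step` keyword arguments are Horovod's own API.

```python
from typing import Any, Dict, Iterable, Tuple


def horovod_optimizer_kwargs(
    named_parameters: Iterable[Tuple[str, Any]],
    accumulate_grad_batches: int,
) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for ``hvd.DistributedOptimizer``."""
    # bool is a subclass of int, so reject it explicitly (assumed validation).
    if isinstance(accumulate_grad_batches, bool) or not isinstance(accumulate_grad_batches, int):
        raise TypeError("accumulate_grad_batches must be an int")
    if accumulate_grad_batches < 1:
        raise ValueError("accumulate_grad_batches must be >= 1")
    return {
        "named_parameters": list(named_parameters),
        # Number of backward passes Horovod expects before step()/synchronize(),
        # which lets gradients accumulate over several mini-batches.
        "backward_passes_per_step": accumulate_grad_batches,
    }


def wrap_with_horovod(
    optimizer: Any,
    named_parameters: Iterable[Tuple[str, Any]],
    accumulate_grad_batches: int,
) -> Any:
    """Hypothetical wrapper; needs Horovod with PyTorch support when called."""
    # Imported lazily so the kwargs helper works without Horovod installed.
    import horovod.torch as hvd

    return hvd.DistributedOptimizer(
        optimizer,
        **horovod_optimizer_kwargs(named_parameters, accumulate_grad_batches),
    )
```

Under these assumptions, calling `wrap_with_horovod(optimizer, model.named_parameters(), 4)` would create the distributed optimizer with `backward_passes_per_step=4`.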

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

carmocca

Always try to keep tests as minimal as possible

tests/models/test_horovod.py


krshrimali · 2022-02-16T16:11:19Z

Thanks, @carmocca, for the review and suggestions. I'll merge the CPU and GPU tests into one as you suggested and update this PR. I've already committed your suggestions for some of the comments; the rest will come with my next commit. Thank you again! :))

carmocca

LGTM!

As a side note, we should open an issue about rethinking Horovod testing.

…ep` (#11911)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Rohit Gupta <[email protected]>
Co-authored-by: ananthsub <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>

Support gradient accumulation using Horovod's backward_passes_per_step

f4b1fa1

krshrimali requested review from tchaton, SeanNaren, awaelchli, justusschock, kaushikb11, williamFalcon, Borda, carmocca and rohitgr7 as code owners February 14, 2022 10:17

rohitgr7 requested a review from ananthsub February 14, 2022 10:18

[pre-commit.ci] auto fixes from pre-commit.com hooks

aeb85ca

for more information, see https://pre-commit.ci

rohitgr7 added the feature (Is an improvement or enhancement) and strategy: horovod (removed) labels Feb 14, 2022

rohitgr7 added this to the 1.6 milestone Feb 14, 2022

rohitgr7 added the optimization label Feb 14, 2022

rohitgr7 reviewed Feb 14, 2022

CHANGELOG.md (outdated, resolved)

CHANGELOG.md (outdated, resolved)

pytorch_lightning/strategies/horovod.py (outdated, resolved)

krshrimali and others added 4 commits February 14, 2022 16:20

Update CHANGELOG.md

0e1b93a

Co-authored-by: Rohit Gupta <[email protected]>

Update CHANGELOG.md

7db02ca

Co-authored-by: Rohit Gupta <[email protected]>

Add test and minor fix

0bf0fb9

Merge branch 'feature/11732_grad_accumulation_horovod' of github.com:…

67a0bce

…krshrimali/pytorch-lightning into feature/11732_grad_accumulation_horovod

krshrimali commented Feb 14, 2022

tests/models/test_horovod.py (outdated, resolved)

ananthsub reviewed Feb 14, 2022

pytorch_lightning/strategies/horovod.py (outdated, resolved)

pytorch_lightning/strategies/horovod.py (outdated, resolved)

krshrimali and others added 4 commits February 15, 2022 11:03

Update pytorch_lightning/strategies/horovod.py

5a17c2b

Co-authored-by: ananthsub <[email protected]>

Changes per review, pass accumulate_grad_batches as arg

53fd24a

Merge branch 'feature/11732_grad_accumulation_horovod' of github.com:…

82ebd9e

…krshrimali/pytorch-lightning into feature/11732_grad_accumulation_horovod

[pre-commit.ci] auto fixes from pre-commit.com hooks

c914e3a

for more information, see https://pre-commit.ci

ananthsub reviewed Feb 16, 2022

pytorch_lightning/strategies/horovod.py (resolved)

tests/models/test_horovod.py (outdated, resolved)

krshrimali added 2 commits February 16, 2022 14:14

Add tests to ensure error raised, replace gpus= with accelerator=gpu …

2a4c6fa

…and devices=count

Merge branch 'feature/11732_grad_accumulation_horovod' of github.com:…

2a792df

…krshrimali/pytorch-lightning into feature/11732_grad_accumulation_horovod

krshrimali and others added 3 commits February 16, 2022 14:46

Don't add tests for now

0850b0a

Use basic model, and trainer.fit instead of _run_horovod

9fd23ce

[pre-commit.ci] auto fixes from pre-commit.com hooks

04ee625

for more information, see https://pre-commit.ci

carmocca reviewed Feb 16, 2022

Apply suggestions from code review

7c73b0c

Co-authored-by: Carlos Mocholí <[email protected]>

carmocca added 3 commits February 17, 2022 14:16

Self review

8c1e3f1

Remove duplicated test

60ba198

Missed horovod in RunIf

5786ce8

ananthsub approved these changes Feb 17, 2022

Further simplification

670fbac

carmocca approved these changes Feb 17, 2022

mergify bot added the ready (PRs ready to be merged) and has conflicts labels Feb 17, 2022

Resolve merge conflict

372d322

mergify bot removed the has conflicts label Feb 18, 2022

krshrimali mentioned this pull request Feb 18, 2022

Rethink/Refactor Horovod Testing #11975

Closed

awaelchli approved these changes Feb 19, 2022

awaelchli merged commit 0374fe6 into Lightning-AI:master Feb 19, 2022
