
Do not force sync_dist=True on epoch end #13364

Merged · 22 commits · Jul 22, 2022

Conversation

krishnakalyan3
Contributor

@krishnakalyan3 krishnakalyan3 commented Jun 22, 2022

Fixes #13210

Just FYI: with fault-tolerant training we need to ensure the sync happens. During a failure (say, on a spot instance) the metrics are synced and saved to the checkpoint, so on restart the metrics on rank 0 hold the accumulated values from the previous run while the other ranks hold 0. They must therefore be synced during further training to keep the calculation correct.
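The restart scenario described above can be sketched in plain Python (a hypothetical stand-in: the rank values and the `all_reduce_mean` helper are illustrative, not Lightning or `torch.distributed` API — this is roughly the kind of reduction that `sync_dist=True` triggers):

```python
def all_reduce_mean(per_rank_values):
    """Hypothetical stand-in for a distributed mean all-reduce:
    every rank contributes its value and receives the reduced result."""
    mean = sum(per_rank_values) / len(per_rank_values)
    return [mean] * len(per_rank_values)

# After a fault-tolerant restart: rank 0 restored the accumulated
# metric from the checkpoint, while the other ranks restarted at 0.
restored = [12.0, 0.0, 0.0, 0.0]

# Without a sync, each rank would continue with its own inconsistent
# value; with a sync, all ranks agree on the reduced value.
synced = all_reduce_mean(restored)
print(synced)  # → [3.0, 3.0, 3.0, 3.0]
```

The point is consistency: once the values are reduced, every rank continues accumulating from the same state instead of diverging.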

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@rohitgr7 rohitgr7 added the `logging` (Related to the `LoggerConnector` and `log()`) and `breaking change` (Includes a breaking change) labels Jul 12, 2022
@rohitgr7 rohitgr7 added this to the pl:1.7 milestone Jul 12, 2022
@mergify mergify bot added the ready PRs ready to be merged label Jul 12, 2022
@mergify mergify bot added ready PRs ready to be merged and removed ready PRs ready to be merged labels Jul 16, 2022
@Borda Borda requested a review from carmocca July 17, 2022 18:23
@rohitgr7 rohitgr7 requested a review from carmocca July 19, 2022 08:02
@rohitgr7 rohitgr7 changed the title to Do not force sync_dist=True on epoch end Jul 19, 2022
@carmocca carmocca enabled auto-merge (squash) July 19, 2022 11:34
@krishnakalyan3
Contributor Author

Thank you @rohitgr7. It was really nice learning from you.
Thanks for the reviews @carmocca and @otaj

@mergify mergify bot added has conflicts and removed ready PRs ready to be merged labels Jul 20, 2022
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Jul 22, 2022
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Jul 22, 2022
@codecov

codecov bot commented Jul 22, 2022

Codecov Report

Merging #13364 (105ee0a) into master (9596fab) will decrease coverage by 10%.
The diff coverage is 86%.

@@            Coverage Diff            @@
##           master   #13364     +/-   ##
=========================================
- Coverage      86%      76%    -10%     
=========================================
  Files         327      327             
  Lines       25500    25505      +5     
=========================================
- Hits        21898    19455   -2443     
- Misses       3602     6050   +2448     
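The report's arithmetic can be sanity-checked from the figures in the table above (a quick sketch; the variable names are mine):

```python
# Figures taken from the Codecov report above.
before_hits, before_lines = 21898, 25500
after_hits, after_lines = 19455, 25505

# Coverage = hits / total lines, rounded to whole percent.
before_cov = round(before_hits / before_lines * 100)  # 86
after_cov = round(after_hits / after_lines * 100)     # 76

print(before_cov, after_cov)  # → 86 76  (a 10-point drop, as reported)
```

The drop is almost certainly a CI reporting artifact (about 2,400 lines flipping from hit to missed with only 5 lines added), not coverage actually lost by this diff.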

Labels
  • breaking change: Includes a breaking change
  • logging: Related to the `LoggerConnector` and `log()`
  • pl: Generic label for PyTorch Lightning package
  • ready: PRs ready to be merged
Development

Successfully merging this pull request may close these issues.

Do not force sync_dist=True on epoch end
4 participants