Fix support for dataloader with None batches #7342

ethanwharris · 2021-05-04T09:29:19Z

What does this PR do?

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
[N/A] Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-05-04T09:30:35Z

Codecov Report

Merging #7342 (108b409) into master (a6aa1a0) will decrease coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7342    +/-   ##
=======================================
- Coverage      91%     86%    -5%     
=======================================
  Files         200     200            
  Lines       12916   13210   +294     
=======================================
- Hits        11764   11317   -447     
- Misses       1152    1893   +741

tchaton

LGTM!

ananthsub · 2021-05-04T19:25:40Z

pytorch_lightning/trainer/training_loop.py

+            self.warning_cache.warn("train_dataloader yielded None. If this was on purpose, ignore this warning...")
+            return AttributeDict(
+                signal=0,
+                grad_norm_dic=grad_norm_dic,
+                training_step_output_for_epoch_end=batch_outputs,
+            )
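
The added lines above return early with an empty result when the dataloader yields `None`, warning once. A minimal sketch of that pattern follows. It is not the Lightning implementation: the `batch is None` guard is an assumption (the diff shows only the body), and `WarningCache`, `run_training_batch`, and `step_fn` are hypothetical stand-ins, with `SimpleNamespace` used in place of `AttributeDict`.

```python
import warnings
from types import SimpleNamespace


class WarningCache:
    """Hypothetical stand-in: emits each distinct message at most once."""

    def __init__(self):
        self._seen = set()

    def warn(self, message):
        if message not in self._seen:
            self._seen.add(message)
            warnings.warn(message)


def run_training_batch(batch, warning_cache, step_fn):
    """Toy per-batch routine (illustrative only, not the trainer's code)."""
    batch_outputs = []
    grad_norm_dic = {}

    # Assumed guard: skip the step entirely when the dataloader yields None.
    if batch is None:
        warning_cache.warn(
            "train_dataloader yielded None. If this was on purpose, ignore this warning..."
        )
        return SimpleNamespace(
            signal=0,
            grad_norm_dic=grad_norm_dic,
            training_step_output_for_epoch_end=batch_outputs,
        )

    # Normal path: run the (hypothetical) step function and record its output.
    batch_outputs.append(step_fn(batch))
    return SimpleNamespace(
        signal=0,
        grad_norm_dic=grad_norm_dic,
        training_step_output_for_epoch_end=batch_outputs,
    )
```

In this sketch a `None` batch produces the same result shape as a normal batch, only with no step outputs, so a caller does not need a special case for skipped batches.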


What happens in distributed training? Won't we go out of sync across ranks if the batch is None on some ranks and non-None on others?

ethanwharris · 2021-05-04

I'm not sure. It will be closer to working than it was before, but this use case may still need some work for distributed training. The new behaviour (that is, after this fix) should be the same as when returning None from a step.

ananthsub · 2021-05-04

There are two options:

Either the trainer runs an all-reduce at each step to determine whether all ranks skip. This is a general solution, but it costs performance for every user who never skips batches.

Or the user runs the all-reduce in their LightningModule and determines when to skip. This way it's specific to the modules that need to skip batches or steps for any reason.

I prefer the latter approach because I don't want to slow down the trainer for everyone to handle a few corner cases.

What do you think, @tchaton @SeanNaren @awaelchli @carmocca?

ethanwharris · 2021-05-04

Yeah, I think it's fine to make the user responsible for making all ranks skip in sync if they want distributed skipping.
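
The user-side option discussed above could look roughly like the sketch below. It is only an illustration under stated assumptions: `all_ranks_agree_to_skip`, `simulate_ranks`, and the injected `all_reduce_max` callable are hypothetical names, not part of any Lightning API. In a real job the callable might wrap `torch.distributed.all_reduce` with `ReduceOp.MAX`; here it is replaced by a single-process stand-in so the example runs without a process group.

```python
from typing import Callable, List


def all_ranks_agree_to_skip(local_skip: bool, all_reduce_max: Callable[[int], int]) -> bool:
    """Return True if any rank wants to skip, so every rank makes the same choice.

    `all_reduce_max` is a hypothetical hook: it takes this rank's integer flag
    and returns the maximum of that flag across all ranks.
    """
    flag = 1 if local_skip else 0
    return all_reduce_max(flag) == 1


def simulate_ranks(local_flags: List[bool]) -> List[bool]:
    """Single-process stand-in for a collective over len(local_flags) ranks."""
    global_max = max(int(flag) for flag in local_flags)

    def fake_all_reduce_max(value: int) -> int:
        # Every simulated rank receives the same reduced value.
        return global_max

    return [all_ranks_agree_to_skip(flag, fake_all_reduce_max) for flag in local_flags]


# One rank out of three has a None batch, so all three ranks skip together.
decisions = simulate_ranks([False, True, False])
```

Reducing with a maximum means a single rank with nothing to process makes every rank skip that step. That keeps the ranks in step with each other, at the cost of discarding valid batches on the other ranks.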

carmocca · 2021-05-04

Returning None in any kind of distributed setting or with AMP is currently unsupported. We have a (very old) PR to add support, #5359, but it's currently frozen. You can have a look at the proposal though.

ethanwharris added 2 commits May 4, 2021 10:27

Fix Dataloader None batch

d632701

Fix Dataloader None batch

eed78a3

ethanwharris requested review from awaelchli, Borda, carmocca, justusschock, kaushikb11, SeanNaren, tchaton and williamFalcon as code owners May 4, 2021 09:29

ethanwharris added 2 commits May 4, 2021 10:30

Update CHANGELOG.md

1e050e8

Fix breaking test

43d5197

ethanwharris changed the title ~~Bugfix/none batch~~ Fix support for dataloader with None batches May 4, 2021

mergify bot added the has conflicts label May 4, 2021

tchaton approved these changes May 4, 2021

Merge branch 'master' into bugfix/none_batch

ae24fcb

mergify bot removed the has conflicts label May 4, 2021

ethanwharris added the bug Something isn't working label May 4, 2021

kaushikb11 approved these changes May 4, 2021

carmocca reviewed May 4, 2021

tests/trainer/data_flow/test_train_loop_flow_scalar.py (outdated, resolved)

ethanwharris added 2 commits May 4, 2021 13:01

Address comments

b519823

Merge branch 'bugfix/none_batch' of https://github.com/PyTorchLightni…

108b409

…ng/pytorch-lightning into bugfix/none_batch

awaelchli approved these changes May 4, 2021

awaelchli added the ready PRs ready to be merged label May 4, 2021

awaelchli added this to the v1.3 milestone May 4, 2021

SkafteNicki approved these changes May 4, 2021

SkafteNicki enabled auto-merge (squash) May 4, 2021 12:20

SkafteNicki merged commit 2a740eb into master May 4, 2021

SkafteNicki deleted the bugfix/none_batch branch May 4, 2021 12:24

ananthsub reviewed May 4, 2021
