
[bugfix] Fix dataloading for iterable datasets and limit_train_batches #7306

Merged: 22 commits merged into Lightning-AI:master on May 3, 2021

Conversation

@ananthsub (Contributor) commented May 1, 2021

What does this PR do?

Fixes #7303
Fixes #6332

The check in the training loop for whether to run validation did not account for val_check_batch being inf, which occurs for iterable-style datasets that don't define len. https://github.com/PyTorchLightning/pytorch-lightning/blob/490cc57809ebeba19003b4101393a8a058217c31/pytorch_lightning/trainer/data_loading.py#L288-L290

I think the dataloading logic is too complicated here. At the very least we need to consolidate logic across the dataloading mixin, debug connector, and training loop.
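To make the failure mode concrete, here is a minimal sketch of the decision involved. The standalone function, its signature, and the fallback threshold are illustrative assumptions based on the description above, not Lightning's exact implementation; `val_check_batch`, `limit_train_batches`, and `batch_idx` mirror the Trainer attributes discussed in this PR.

```python
import math


def should_check_val(batch_idx: int, val_check_batch: float, limit_train_batches) -> bool:
    """Illustrative: decide whether to run validation after this training batch.

    For iterable-style datasets without __len__, val_check_batch is float('inf'),
    so a plain modulo check against it can never trigger validation; the idea of
    the fix is to fall back to limit_train_batches when it is an integer batch count.
    """
    if math.isinf(val_check_batch):
        # Hypothetical fallback: validate every `limit_train_batches` batches.
        if isinstance(limit_train_batches, int) and limit_train_batches > 0:
            return (batch_idx + 1) % limit_train_batches == 0
        return False
    return (batch_idx + 1) % val_check_batch == 0
```

Under these assumptions, an IterableDataset run with limit_train_batches=100 would trigger validation after batch indices 99, 199, and so on, instead of never.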

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@codecov (bot) commented May 1, 2021

Codecov Report

Merging #7306 (a735169) into master (3927427) will increase coverage by 1%.
The diff coverage is 100%.

❗ Current head a735169 differs from the pull request's most recent head 0e7b0ad. Consider uploading reports for commit 0e7b0ad to get more accurate results.

@@           Coverage Diff           @@
##           master   #7306    +/-   ##
=======================================
+ Coverage      86%     87%    +1%     
=======================================
  Files         200     200            
  Lines       12865   12875    +10     
=======================================
+ Hits        11119   11242   +123     
+ Misses       1746    1633   -113     

@ananthsub added the labels bug (Something isn't working) and data handling (Generic data-related topic) on May 1, 2021
@ananthsub added this to the v1.3 milestone on May 1, 2021
@awaelchli (Contributor) left a comment:

lgtm,
just minor comments for improvement

Review threads were opened on:

  • tests/trainer/test_dataloaders.py
  • pytorch_lightning/callbacks/progress.py
  • pytorch_lightning/trainer/training_loop.py

One review thread was anchored on this snippet:
# val_check_batch is inf for iterable datasets with no length defined
# TODO: let training/eval loop handle logic around limit_*_batches and val_check_batch
is_val_check_batch = False
if isinstance(self.trainer.limit_train_batches, int) and self.trainer.val_check_batch == float('inf'):
A contributor commented on the snippet:
Love this refactor!

@ananthsub (Contributor Author) replied:

@kaushikb11 thanks! It still feels complicated to me. Part of that comes from limit_train_batches / val_check_interval having different types and possible meanings, depending on both the user input and the dataloader specified.

I'm wondering what a better way would be to split "when to stop training mid-epoch" from "when to run validation", or whether a split is needed at all.
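As a concrete illustration of those type-dependent meanings, here is a hedged sketch: `resolve_num_batches` is a hypothetical helper, not a Lightning API, and follows the documented convention that a float is a fraction of the dataloader while an int is an absolute batch count.

```python
from typing import Union


def resolve_num_batches(value: Union[int, float], dataloader_len: float) -> float:
    """Hypothetical helper: turn a limit/interval setting into a batch count."""
    if isinstance(value, int):
        return float(value)            # int -> absolute number of batches
    return value * dataloader_len      # float -> fraction of the (possibly inf) length


# With an iterable dataset the dataloader length is effectively float('inf'):
print(resolve_num_batches(100, float("inf")))   # 100.0 -> a usable check point
print(resolve_num_batches(0.25, float("inf")))  # inf   -> no usable check point
```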

@kaushikb11 (Contributor) commented:

@ananthsub There's a failing test besides the unrelated DeepSpeed test failure.

@ananthsub (Contributor Author) commented:

@kaushikb11 For some reason I cannot reproduce the test failure locally.

But it points to an issue from #6671: the train loop force-calling checkpointing on train end is bad because we cannot guarantee the monitor value is present.

@awaelchli @carmocca Any suggestions on how to debug in this case, or why the failure isn't consistent across versions?

@pep8speaks commented May 2, 2021

Hello @ananthsub! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-03 15:28:50 UTC

@@ -114,7 +114,7 @@ def pre_dispatch(self, trainer: 'pl.Trainer') -> None:
     def _move_optimizer_state(self) -> None:
         """ Moves the state of the optimizers to the GPU if needed. """
         for opt in self.optimizers:
-            state = defaultdict(dict)
+            state: DefaultDict = defaultdict(dict)
@ananthsub (Contributor Author) commented on the diff:

Sent this out as a separate PR: #7318, but wanted it here to ensure tests were passing.
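For context on what `_move_optimizer_state` in the diff above is doing, here is a hedged sketch of the underlying idea. The standalone function, its `device` argument, and the tensor-by-tensor copy are assumptions for illustration, not Lightning's exact implementation.

```python
from collections import defaultdict
from typing import DefaultDict

import torch


def move_optimizer_state(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    """Illustrative: copy an optimizer's state tensors to `device` (e.g. a GPU)."""
    state: DefaultDict = defaultdict(dict)  # the annotation mirrors the diff above
    for param, param_state in optimizer.state.items():
        state[param] = {
            key: value.to(device) if isinstance(value, torch.Tensor) else value
            for key, value in param_state.items()
        }
    optimizer.state = state
```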

@ethanwharris (Member) left a comment:

Awesome! LGTM 😃

@carmocca added the ready label (PRs ready to be merged) on May 3, 2021
@ananthsub enabled auto-merge (squash) on May 3, 2021 16:22
@ananthsub disabled auto-merge on May 3, 2021 18:38
@ananthsub enabled auto-merge (squash) on May 3, 2021 18:38
@tchaton (Contributor) left a comment:

LGTM!

@ananthsub merged commit 14c552b into Lightning-AI:master on May 3, 2021
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request on May 4, 2021: [bugfix] Fix dataloading for iterable datasets and limit_train_batches (Lightning-AI#7306)

* bugfix-dataloading

* rm-logs

* Update CHANGELOG.md

* Update test_dataloaders.py

* Update test_dataloaders.py

* Update training_loop.py

* Update test_dataloaders.py

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update test_dataloaders.py

* Update training_loop.py

* Update training_loop.py

* comments

* address comments

* more tests

* Update progress.py

* Update test_dataloaders.py

* Update test_dataloaders.py

* Update training_loop.py

* Update training_loop.py

* test ckpt fix?

* update again
Labels: bug (Something isn't working), data handling (Generic data-related topic), ready (PRs ready to be merged)
Projects: None yet
7 participants