[bugfix] Fix dataloading for iterable datasets and limit_train_batches #7306
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #7306     +/-  ##
=========================================
+ Coverage      86%      87%      +1%
=========================================
  Files         200      200
  Lines       12865    12875      +10
=========================================
+ Hits        11119    11242     +123
+ Misses       1746     1633     -113
LGTM, just minor comments for improvement.
# val_check_batch is inf for iterable datasets with no length defined
# TODO: let training/eval loop handle logic around limit_*_batches and val_check_batch
is_val_check_batch = False
if isinstance(self.trainer.limit_train_batches, int) and self.trainer.val_check_batch == float('inf'):
Love this refactor!
@kaushikb11 thanks! It still feels complicated to me. Part of that comes from limit_train_batches / val_check_interval having different types and possible meanings, depending on both the user input and the dataloader specified.
I'm wondering what's a better way to split "when to stop training mid-epoch" from "when to run validation", or whether a split is needed at all.
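For readers following the thread, here is a rough sketch (not code from this PR) of how the meaning of these two Trainer arguments shifts with their type:

```python
from pytorch_lightning import Trainer

# Floats are treated as fractions: use 50% of the training batches per epoch
# and run validation 4 times per training epoch.
trainer = Trainer(limit_train_batches=0.5, val_check_interval=0.25)

# Ints are treated as absolute batch counts: cap each epoch at 100 training
# batches and run validation every 50 training batches.
trainer = Trainer(limit_train_batches=100, val_check_interval=50)
```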
@ananthsub There's a failing test other than the unrelated DeepSpeed test failure.
@kaushikb11 for some reason I cannot reproduce the test failure locally. But it's pointing out an issue from #6671: the train loop force-calling checkpointing on train end is bad, because we cannot guarantee the monitor value is present. @awaelchli @carmocca any suggestions on how to debug in this case, or why the failure isn't consistent across versions?
Hello @ananthsub! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-05-03 15:28:50 UTC
@@ -114,7 +114,7 @@ def pre_dispatch(self, trainer: 'pl.Trainer') -> None:
     def _move_optimizer_state(self) -> None:
         """ Moves the state of the optimizers to the GPU if needed. """
         for opt in self.optimizers:
-            state = defaultdict(dict)
+            state: DefaultDict = defaultdict(dict)
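For context, a minimal sketch of what the annotated line relies on (the imports are shown here for completeness and are not part of the diff above):

```python
from collections import defaultdict
from typing import DefaultDict

# Some type checkers cannot infer a useful type for a bare defaultdict(dict);
# the explicit annotation avoids that error without changing runtime behavior.
state: DefaultDict = defaultdict(dict)
```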
Sent this out as a different PR here: #7318, but wanted it here to ensure tests were passing.
Awesome! LGTM 😃
LGTM!
[bugfix] Fix dataloading for iterable datasets and limit_train_batches (Lightning-AI#7306)
* bugfix-dataloading
* rm-logs
* Update CHANGELOG.md
* Update test_dataloaders.py
* Update test_dataloaders.py
* Update training_loop.py
* Update test_dataloaders.py
* Update CHANGELOG.md
* Update CHANGELOG.md
* Update test_dataloaders.py
* Update training_loop.py
* Update training_loop.py
* comments
* address comments
* more tests
* Update progress.py
* Update test_dataloaders.py
* Update test_dataloaders.py
* Update training_loop.py
* Update training_loop.py
* test ckpt fix?
* update again
What does this PR do?
Fixes #7303
Fixes #6332
The check in the training loop for whether to run validation was not accounting for val_check_batch being inf, which occurs for iterable-style datasets that don't have len defined: https://github.com/PyTorchLightning/pytorch-lightning/blob/490cc57809ebeba19003b4101393a8a058217c31/pytorch_lightning/trainer/data_loading.py#L288-L290
I think the dataloading logic is too complicated here. At the very least we need to consolidate logic across the dataloading mixin, debug connector, and training loop.
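A rough reproduction sketch of the scenario being fixed (the dataset and variable names are illustrative, not taken from the linked issues): an iterable-style dataset with no __len__ leaves the dataloader length unknown, so val_check_batch is set to float('inf'), and an integer limit_train_batches previously never stopped the epoch.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset
from pytorch_lightning import Trainer


class RandomIterableDataset(IterableDataset):
    """Iterable-style dataset that intentionally defines no __len__."""

    def __iter__(self):
        for _ in range(1000):
            yield torch.randn(32)


train_loader = DataLoader(RandomIterableDataset(), batch_size=4)

# The dataloader length is unknown, so val_check_batch becomes inf; with this
# fix, training still stops after limit_train_batches batches.
trainer = Trainer(max_epochs=1, limit_train_batches=10)
# trainer.fit(model, train_loader)  # `model` is any LightningModule, defined elsewhere
```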
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃