
Add a flavor of training_step that takes dataloader_iter as an argument #8807

Merged: 16 commits merged into Lightning-AI:master on Aug 16, 2021

Conversation

@yifuwang (Contributor) commented Aug 9, 2021

What does this PR do?

Implements the feature request #8316

Introduced FlexibleOptimizationFlow, which is a "batch processor" alternative to TrainingBatchLoop when the training_step takes dataloader_iter as an argument.
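(Illustrative only: a minimal sketch of what a user-side training_step consuming dataloader_iter might look like. The module and layer sizes below are hypothetical, and the precise contract, e.g. how optimization is driven, is defined by the PR rather than by this sketch.)

    import torch
    from pytorch_lightning import LightningModule


    class IterConsumingModel(LightningModule):  # hypothetical example module
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, dataloader_iter):
            # The trainer hands over the raw iterator instead of a pre-fetched batch,
            # so the step can pull (and potentially prefetch/overlap) data itself.
            batch = next(dataloader_iter)  # assumes the dataloader yields (N, 32) tensors
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)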

New tests are WIP.

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@codecov (bot) commented Aug 9, 2021

Codecov Report

Merging #8807 (5f8bdd5) into master (89156b7) will decrease coverage by 4%.
The diff coverage is 96%.

@@           Coverage Diff           @@
##           master   #8807    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         173     176     +3     
  Lines       14182   14268    +86     
=======================================
- Hits        13174   12678   -496     
- Misses       1008    1590   +582     

@yifuwang force-pushed the training_step_dataloader_iter branch from 0fb8317 to 81a8010 on August 9, 2021 08:54
@yifuwang force-pushed the training_step_dataloader_iter branch 2 times, most recently from 9f78849 to 2297f23, on August 9, 2021 23:20
@yifuwang requested a review from kaushikb11 as a code owner on August 9, 2021 23:20
@yifuwang force-pushed the training_step_dataloader_iter branch from a2bc359 to 1cec6f8 on August 9, 2021 23:24
@yifuwang changed the title from "[WIP] Add a flavor of training_step that takes dataloader_iter as an argument" to "Add a flavor of training_step that takes dataloader_iter as an argument" on Aug 10, 2021
@pep8speaks (bot) commented Aug 10, 2021

Hello @yifuwang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 56:23: E203 whitespace before ':'

Comment last updated at 2021-08-12 21:17:43 UTC
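(For reference, E203 is pycodestyle's check for whitespace before a ':'. A minimal illustration, not taken from this PR's diff:)

    ages = {"alice" : 30}   # flagged: E203 whitespace before ':'
    ages = {"alice": 30}    # fixed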

@SeanNaren (Contributor)

Just catching up here, but by introducing a method that allows the dataloader to be passed into the training step, we're bypassing a loop and migrating it into the training step, right?

If so, I think I understand @tchaton's concerns about this being a risky hole to delve into without care. Shouldn't the logic exist within a loop passed into the Trainer, or as done within #8700?

@tchaton (Contributor) commented Aug 11, 2021

Dear @yifuwang @ananthsub @carmocca @awaelchli @justusschock @williamFalcon @SeanNaren @kaushikb11 @Borda

After chatting with @yifuwang, I understood there is some urgency on this feature to enable converting libraries to PL.

However, the plan is to start replacing this as soon as it is merged, in follow-up PRs, by introducing a LightningPrefetcher with best practices that could benefit all users, respect fault-tolerant training requirements, and be extended to new use cases.

The LightningPrefetcher should provide a way to pass an iterator directly to the training_step while tracking progress, entirely replacing this PR's changes.

@yifuwang committed to co-author the refactor once this PR has been merged, starting from the work in #8476 and #8700.

So let's get this PR merged asap :)

Best,
T.C
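(For context, below is a rough, hypothetical sketch of the kind of prefetching wrapper described above; the LightningPrefetcher named in the comment is a proposal, not an API introduced by this PR.)

    class OneAheadPrefetcher:
        """Hypothetical illustration: wrap an iterable and keep one batch in flight,
        so a consumer such as training_step can overlap work with the next fetch.
        A real LightningPrefetcher would additionally track progress for fault
        tolerance, which is out of scope for this sketch."""

        def __init__(self, iterable):
            self._iterator = iter(iterable)
            self._done = False
            self._next_batch = None
            self._advance()

        def _advance(self):
            try:
                self._next_batch = next(self._iterator)
            except StopIteration:
                self._done = True

        def __iter__(self):
            return self

        def __next__(self):
            if self._done:
                raise StopIteration
            batch = self._next_batch
            self._advance()  # eagerly fetch the following batch; progress could be tracked here
            return batch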

@yifuwang (Contributor, Author)

Hi Thomas, thanks for the review and feedback!

I'm fully onboard with enabling this through a broader design for data prefetching use cases (i.e. #8700). I'll be committed to helping execute the direction and replacing this before the next release.

@ananthsub (Contributor)

@tchaton I agree, thank you for summarizing this! I'll also echo @yifuwang's comment here:

> I'm fully onboard with enabling this through a broader design for data prefetching use cases (i.e. #8700). I'll be committed to helping execute the direction and replacing this before the next release.

+1000

@tchaton (Contributor) commented Aug 12, 2021

> @tchaton I agree, thank you for summarizing this! I'll also echo @yifuwang's comment here:
>
> > I'm fully onboard with enabling this through a broader design for data prefetching use cases (i.e. #8700). I'll be committed to helping execute the direction and replacing this before the next release.
>
> +1000

Thanks @ananthsub @yifuwang for confirming :)

IMO, here is the scope of the refactoring:

  • Drop IteratorBatchProcessor once the refactor is finalised.
  • Everything else within this PR seems like an appropriate refactor and should be preserved.

Finally, I created an RFC for the refactor redesign.

@awaelchli (Contributor) left a comment:

Premature accept due to urgency. I do not fully understand what's going on here; I hope there is still a bit of time for us to test it.

I trust that we will follow up with meetings and discussions about how all of this fits into Lightning's philosophy. Ideally the order would be the other way around.

@awaelchli added the "feature" (Is an improvement or enhancement) and "refactor" labels Aug 12, 2021
@awaelchli added this to the v1.5 milestone Aug 12, 2021
@yifuwang force-pushed the training_step_dataloader_iter branch from 13a03c7 to 377a340 on August 12, 2021 20:33
@tchaton (Contributor) left a comment:

LGTM!
The mergify bot added the "ready" (PRs ready to be merged) label Aug 13, 2021
@ananthsub (Contributor) left a comment:

Agreed with the feedback, and this is just the start for how:

  • we can build a foundation around how the trainer interacts with the training step (and especially multiple flavors of training steps)
  • we can bring better clarity to the individual batch loops, so the code is straightforward to read for both developers and users, especially those coming from pure PyTorch training
  • we can better select which loops to run based on trainer & module attributes (a rough sketch of this idea follows below)
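(As a rough illustration of the last bullet: a trainer could inspect the training_step signature to decide which batch processor to run. The helper below is hypothetical, not this PR's actual implementation.)

    import inspect

    def wants_dataloader_iter(model) -> bool:
        # Hypothetical helper: does the user's training_step declare a
        # ``dataloader_iter`` parameter?
        return "dataloader_iter" in inspect.signature(model.training_step).parameters

    # Loop selection could then branch on the result, e.g. running an
    # iterator-based batch processor instead of the default TrainingBatchLoop.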

The mergify bot removed the "has conflicts" label Aug 16, 2021
@ananthsub enabled auto-merge (squash) August 16, 2021 18:45
@ananthsub merged commit 938a191 into Lightning-AI:master Aug 16, 2021
awaelchli pushed a commit that referenced this pull request Aug 17, 2021:
Add a flavor of training_step that takes dataloader_iter as an argument (#8807)
Labels: feature (Is an improvement or enhancement), ready (PRs ready to be merged), refactor
Projects: none yet
Development: successfully merging this pull request may close the issue "Add a flavor of training_step that allows for expressing inter-batch parallelism"
9 participants