Replace prefetch with val iterator check in megatron models #7318
Conversation
nemo/collections/nlp/models/language_modeling/megatron_gpt_prompt_learning_model.py
self.trainer.limit_val_batches *= get_num_microbatches()
# Override num sanity steps equal to num of microbatches and perform one val_step
self.trainer.num_sanity_val_steps = get_num_microbatches()
This method is called only in pretraining models. We do not set limit_val_batches to a multiple of the number of microbatches for downstream models: since we pass just one batch to val_step each time, no scaling of the limit is required, and val_step runs as many times as the user-entered limit_val_batches. We also leave num_sanity_val_steps at its default value of 2 for downstream models.
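The reconfiguration described above can be sketched as follows. This is a minimal illustration, not NeMo's actual implementation: `get_num_microbatches()` mirrors the Megatron utility of the same name but returns an assumed value here, and the `Trainer` class is a stand-in for the Lightning Trainer attributes involved.

```python
def get_num_microbatches():
    """Stand-in for the Megatron utility; returns an assumed value for illustration."""
    return 4

class Trainer:
    """Minimal stand-in exposing only the two attributes being reconfigured."""
    def __init__(self, limit_val_batches, num_sanity_val_steps):
        self.limit_val_batches = limit_val_batches
        self.num_sanity_val_steps = num_sanity_val_steps

def reconfigure_val_batches(trainer, pretraining):
    # Only pretraining models scale the limit: downstream models pass a
    # single batch per val_step, so the user's limit_val_batches applies as-is.
    if pretraining:
        trainer.limit_val_batches *= get_num_microbatches()
        trainer.num_sanity_val_steps = get_num_microbatches()

t = Trainer(limit_val_batches=2, num_sanity_val_steps=2)
reconfigure_val_batches(t, pretraining=True)
print(t.limit_val_batches, t.num_sanity_val_steps)  # 8 4
```

With a user-entered `limit_val_batches` of 2 and 4 microbatches per global batch, the trainer ends up running 8 val steps, i.e. 2 full global batches.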
mode = 'test' if self.trainer.testing else 'val'
batch = next(dataloader_iter)
This was not needed for prompt learning, since we are passing just one batch.
try:
    element = next(iterator)
    elements.append(element)
_ = next(iterator)  # exhausting the iterator so that PTL knows to go to validation_epoch_end
Lightning requires hitting a StopIteration to break out of the evaluation loop; otherwise it keeps running indefinitely. It hits the StopIteration at the (n+1)th step, so we check for StopIteration after limit_val_batches is exhausted.
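The peek-and-reinsert check discussed in this thread can be sketched as below. The function name matches the `_val_iterator_done` helper the PR introduces, but the body is an illustrative reconstruction under the behavior described here (catch StopIteration to signal done; otherwise put the peeked element back), not the exact NeMo code.

```python
import itertools

def val_iterator_done(dataloader_iter):
    """Return (done, iterator). Peeks one element to detect exhaustion."""
    try:
        element = next(dataloader_iter)
    except StopIteration:
        # Exhausted: Lightning sees this at the (n+1)th step and
        # breaks out of the evaluation loop.
        return True, dataloader_iter
    # Not exhausted: reinsert the peeked element in front of the iterator.
    return False, itertools.chain([element], dataloader_iter)

it = iter(["batch0", "batch1"])
done, it = val_iterator_done(it)
print(done, next(it))  # False batch0
```

Reinserting via `itertools.chain` means the peek is non-destructive: the validation step that follows still sees every batch exactly once.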
LGTM. Thanks!
nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py
nemo/collections/nlp/models/language_modeling/megatron_gpt_prompt_learning_model.py
nemo/collections/nlp/models/language_modeling/megatron_base_model.py
Other reviewer brought up some good points.
Signed-off-by: Abhishree <[email protected]>
1) Remove the if condition on self._val_micro_batches_consumed in def _val_iterator_done and check with just try (and reinsert) / except.
2) Use _val_iterator_done in all megatron models that use dataloader_iter to maintain uniformity.

Signed-off-by: Abhishree <[email protected]>
LGTM, thank you!!
@athitten before finally merging this in, it would be great to re-ensure once again that we:
- call _val_iterator_done in all models
- do _reconfigure_val_batches only in the pre-training models

Any missing/misconfigured models above can make it super difficult for anyone using the model to figure out what might be wrong.
Overall, looks amazing!
LGTM. Thanks!
* Add counter for num_microbatches
* Reset self.total_val_micro_batches
* Replace _prefetch() with _val_iterator_done()
* Override limit_val_batches for pretraining models
* Return iterator in _val_iterator_done when iterator is not exhausted
* Temporarily comment BioMegatron Bert CI test
* Move _reconfigure_val_batches() to MegatronGPTModel
* Move self._reconfigure_val_batches to build_train_valid_test_datasets
* Avoid fetching and reinserting back to the iterator
* Increase limit_val_batches in CI tests
* Use _val_iterator_done to check for iterator end in all megatron models
  1) Remove the if condition on self._val_micro_batches_consumed in def _val_iterator_done and check with just try (and reinsert) / except
  2) Use _val_iterator_done in all megatron models that use dataloader_iter to maintain uniformity
* Minor edit to return outside of try block
* Add _val_iterator_done for megatron_nmt_model
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
What does this PR do?

Enables dataloader_iter to exit gracefully from validation_step without having to prefetch all microbatches in a step, which was being done previously. This is a temporary piece of code that is needed until we have Lightning's fix that takes care of catching the end of dataloader_iter.

Overrides limit_val_batches and num_sanity_val_steps for pretraining to be a multiple of the number of microbatches. limit_val_batches is reconfigured so that we run as many global batches or validation steps as the user-entered limit_val_batches.

Collection: [Note which collection this PR will affect]
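The overall pattern the PR describes — consume one batch per validation step and stop when the iterator check reports exhaustion, instead of prefetching all microbatches up front — can be sketched as a small loop. All names here (`_val_iterator_done`, `run_validation`) are illustrative stand-ins, not NeMo's exact API.

```python
import itertools

def _val_iterator_done(it):
    # Peek one element; StopIteration means the validation loop should end.
    try:
        element = next(it)
    except StopIteration:
        return True, it
    # Not exhausted: reinsert the peeked element in front of the iterator.
    return False, itertools.chain([element], it)

def run_validation(batches):
    # Each "validation step" consumes exactly one batch from the iterator;
    # nothing is prefetched beyond the single peeked element.
    it = iter(batches)
    processed = []
    while True:
        done, it = _val_iterator_done(it)
        if done:
            break
        processed.append(next(it))  # the actual validation-step work goes here
    return processed

print(run_validation(["b0", "b1", "b2"]))  # ['b0', 'b1', 'b2']
```

Because the done-check reinserts the peeked batch, every batch is processed exactly once and the loop terminates on its own, mirroring the graceful exit from validation_step the PR is after.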
Changelog
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information