
Replace prefetch with val iterator check in megatron models #7318

Merged
16 commits merged into main on Aug 29, 2023

Conversation

athitten
Collaborator

@athitten athitten commented Aug 25, 2023

What does this PR do ?

  1. Catch the end of dataloader_iter to exit validation_step gracefully, without having to prefetch all microbatches in a step as was done previously. This is a temporary piece of code, needed until Lightning's fix that catches the end of dataloader_iter is available.
  2. Set limit_val_batches and num_sanity_val_steps for pretraining to be a multiple of the number of microbatches. limit_val_batches is reconfigured so that we run as many global batches (validation steps) as the user-entered limit_val_batches.
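The limit_val_batches reconfiguration in point 2 amounts to rescaling the user's global-batch limit into microbatch units. A minimal sketch, assuming the scaling described above (the standalone function name is hypothetical; in the PR this happens in place on `self.trainer.limit_val_batches` using `get_num_microbatches()`):

```python
def reconfigure_val_batches(limit_val_batches: int, num_microbatches: int) -> int:
    """Scale the user's global-batch limit into microbatch units so that
    the trainer runs exactly `limit_val_batches` global batches.

    Hypothetical standalone version of the in-place update in the PR.
    """
    return limit_val_batches * num_microbatches

# e.g. the user requests 5 validation global batches, each made of 4 microbatches:
print(reconfigure_val_batches(5, 4))  # 20 microbatch steps
```

The same product is used for num_sanity_val_steps, so the sanity check covers exactly one full global batch.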

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the NLP label Aug 25, 2023
@github-actions github-actions bot added the CI label Aug 25, 2023
@github-actions github-actions bot removed the CI label Aug 25, 2023
self.trainer.limit_val_batches *= get_num_microbatches()
# Override num_sanity_val_steps to equal the number of microbatches so one full val_step is performed
self.trainer.num_sanity_val_steps = get_num_microbatches()

athitten (Collaborator, Author):

This method is called only in pretraining models. We do not set limit_val_batches to a multiple of the number of microbatches for downstream models, because we pass just one batch to val_step each time, so no scaling is required: for downstream tasks, val_step runs as many times as the user-entered limit_val_batches. num_sanity_val_steps is also left at its default value of 2 for downstream models.

mode = 'test' if self.trainer.testing else 'val'
batch = next(dataloader_iter)
athitten (Collaborator, Author):

This was not needed for prompt learning since we pass just one batch.

try:
    element = next(iterator)
    elements.append(element)
    _ = next(iterator)  # exhaust the iterator so that PTL knows to go to validation_epoch_end
athitten (Collaborator, Author):

Lightning requires hitting a StopIteration to break out of the evaluation loop; otherwise it keeps running indefinitely. It hits StopIteration at the (n+1)th step, so we check for StopIteration after limit_val_batches is exhausted.
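The peek-and-reinsert pattern discussed here can be sketched roughly as follows. This is illustrative only: the function name and return shape are assumptions, not the PR's actual `_val_iterator_done` signature; the key idea is that the peeked batch is chained back in front of the iterator so no data is lost.

```python
import itertools

def val_iterator_done(dataloader_iter):
    """Peek one element to see whether the validation iterator is exhausted.

    Returns (done, iterator). If not done, the peeked element is reinserted
    in front of the remaining iterator so the caller still sees every batch.
    Hypothetical sketch of the check described in the PR.
    """
    try:
        element = next(dataloader_iter)
    except StopIteration:
        # Iterator exhausted: the caller can now let StopIteration propagate
        # so Lightning exits the evaluation loop.
        return True, dataloader_iter
    # Reattach the consumed element ahead of the remaining items.
    return False, itertools.chain([element], dataloader_iter)

done, it = val_iterator_done(iter([10, 20]))
print(done, list(it))  # False [10, 20]
```

Calling this at the top of each validation_step lets every megatron model detect the end of the dataloader without prefetching all microbatches up front.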

ericharper previously approved these changes Aug 25, 2023

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@ericharper ericharper self-requested a review August 25, 2023 21:07
@ericharper ericharper dismissed their stale review August 25, 2023 21:08

Another reviewer brought up some good points.

@athitten athitten marked this pull request as ready for review August 26, 2023 01:10
@github-actions github-actions bot added the CI label Aug 26, 2023
@athitten athitten force-pushed the athitten/limit_val_batches branch 3 times, most recently from ab62ff4 to c6cde73 Compare August 27, 2023 05:37
Jenkinsfile review thread (outdated, resolved)
@athitten athitten force-pushed the athitten/limit_val_batches branch 2 times, most recently from a1530f8 to 7f04651 Compare August 28, 2023 20:50
1) Remove the if condition on self._val_micro_batches_consumed in _val_iterator_done and instead check with just a try (and reinsert) / except
2) Use _val_iterator_done in all megatron models that use dataloader_iter, to maintain uniformity

Signed-off-by: Abhishree <[email protected]>
aklife97 previously approved these changes Aug 28, 2023

@aklife97 (Collaborator) left a comment:

LGTM, thank you!!

@athitten before finally merging this in, it would be great to ensure once more that we:

  1. call _val_iterator_done in all models
  2. do _reconfigure_val_batches only in the pre-training models

Any missing or misconfigured models above can make it very difficult for anyone using the model to figure out what might be wrong.

Overall, looks amazing!

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@ericharper ericharper merged commit 0e3b935 into main Aug 29, 2023
15 checks passed
@ericharper ericharper deleted the athitten/limit_val_batches branch August 29, 2023 18:24
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024

* Add counter for num_microbatches

Signed-off-by: Abhishree <[email protected]>

* Reset self.total_val_micro_batches

Signed-off-by: Abhishree <[email protected]>

* Replace _prefetch() with _val_iterator_done()

Signed-off-by: Abhishree <[email protected]>

* Override limit_val_batches for pretraining models

Signed-off-by: Abhishree <[email protected]>

* Return iterator in _val_iterator_done when iterator is not exhausted

Signed-off-by: Abhishree <[email protected]>

* Temporarily comment BioMegatron Bert CI test

Signed-off-by: Abhishree <[email protected]>

* Move _reconfigure_val_batches() to MegatronGPTModel

Signed-off-by: Abhishree <[email protected]>

* Move self_reconfigure_val_batches to build_train_valid_test_datasets

Signed-off-by: Abhishree <[email protected]>

* Avoid fetching and reinserting back to the iterator

Signed-off-by: Abhishree <[email protected]>

* Increase limit_val_batches in CI tests

Signed-off-by: Abhishree <[email protected]>

* Use _val_iterator_done to check for iterator end in all megatron models

1) Remove the if condition on self._val_micro_batches_consumed in _val_iterator_done and instead check with just a try (and reinsert) / except
2) Use _val_iterator_done in all megatron models that use dataloader_iter, to maintain uniformity

Signed-off-by: Abhishree <[email protected]>

* Minor edit to return outside of try block

Signed-off-by: Abhishree <[email protected]>

* Add _val_iterator_done for megatron_nmt_model

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>