Avoid unnecessarily accessing data loader with pipeline parallelism #6164
Conversation
- for k in batch.keys():
-     batch[k] = batch[k].cuda(non_blocking=True) if k in ['attention_mask'] else None
+ # Intermediate pipeline stage doesn't need any inputs
+ batch = {k: None for k in ['tokens', 'position_ids', 'attention_mask', 'labels']}
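The intent of the changed lines can be sketched as a small helper (a minimal sketch, not the actual NeMo code; the function name and arguments are assumptions): only the boundary pipeline stages pull a batch from the iterator, while intermediate stages return an all-None batch and never touch the data loader.

```python
def get_batch_for_stage(dataloader_iter, is_first_stage, is_last_stage):
    # Hypothetical helper (name assumed): only boundary pipeline stages
    # pull a batch; intermediate stages return an all-None batch so the
    # data loader is never accessed.
    if not (is_first_stage or is_last_stage):
        # Intermediate pipeline stage doesn't need any inputs
        return {k: None for k in ['tokens', 'position_ids', 'attention_mask', 'labels']}
    return next(dataloader_iter)
```

Because intermediate stages only receive activations from the previous stage, returning None for every key keeps the forward-step signature unchanged without consuming the iterator.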
I'm uncertain how to handle this case. If we want to keep the same behavior, we need to access the data loader in the case self.get_attention_mask_from_fusion==False so that we can access attention_mask. But that will cause a hang. Do we just require self.get_attention_mask_from_fusion==True to support interleaved pipeline parallelism?
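If the answer is to require fused attention masks, the requirement could be enforced with a configuration guard like the following (a hedged sketch; the function name and argument names are assumptions, not the actual NeMo API):

```python
def check_interleaved_pipeline_config(virtual_pipeline_model_parallel_size,
                                      get_attention_mask_from_fusion):
    # Hypothetical guard (names assumed): with interleaved pipeline
    # parallelism, intermediate stages skip the data loader, so the
    # attention mask must come from the fused kernel instead of the batch.
    if (virtual_pipeline_model_parallel_size is not None
            and not get_attention_mask_from_fusion):
        raise ValueError(
            "Interleaved pipeline parallelism requires "
            "get_attention_mask_from_fusion=True, since intermediate "
            "stages do not read attention_mask from the data loader."
        )
```

Failing fast at setup time turns a silent distributed hang into an actionable error message.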
I'm fine with this. Can you also default get_attention_mask_from_fusion to True in the megatron gpt config and the model?
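The requested config default might look like this (a hypothetical excerpt; the key name comes from the discussion, but its exact placement in megatron_gpt_config.yaml is an assumption):

```yaml
model:
  # Default to the fused attention mask so intermediate pipeline stages
  # never need to read attention_mask from the data loader.
  get_attention_mask_from_fusion: true
```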
Can you please send it to r1.17.0?
Force-pushed 9a012b5 to ab9aaca. Signed-off-by: Tim Moon <[email protected]>
LGTM. Thanks!
…6164) Signed-off-by: Tim Moon <[email protected]>
…6164) (#6267) Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]>
…VIDIA#6164) (NVIDIA#6267) Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]> Signed-off-by: hsiehjackson <[email protected]>
What does this PR do?
I've been experiencing a hang in the first validation step when running GPT with interleaved pipeline parallelism. The hang first showed up in #6049, so it seems that accessing the data loader in intermediate pipeline stages is the root cause.
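The root cause can be illustrated with a toy model (plain Python, not NeMo code; the schedule and function are illustrative assumptions): ranks that unconditionally call next() on the data iterator consume more batches than ranks that only read when they actually need data, so under interleaved scheduling the pipeline ranks fall out of step and a later synchronized step can wait forever.

```python
def batches_consumed(needs_data_schedule, always_read):
    # Toy model of the desync: count how many batches a rank pulls from
    # the iterator over a microbatch schedule. If every rank reads on
    # every step (always_read=True), ranks with different schedules end
    # up consuming different numbers of batches than they actually need.
    consumed = 0
    for needs_data in needs_data_schedule:
        if always_read or needs_data:
            consumed += 1
    return consumed
```

Skipping the read on intermediate stages keeps every rank's iterator consumption matched to the batches it genuinely uses.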
Collection: NLP
Changelog
Usage
No change in usage
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information