Avoid unnecessarily accessing data loader with pipeline parallelism #6164
Conversation
- for k in batch.keys():
-     batch[k] = batch[k].cuda(non_blocking=True) if k in ['attention_mask'] else None
+ # Intermediate pipeline stage doesn't need any inputs
+ batch = {k: None for k in ['tokens', 'position_ids', 'attention_mask', 'labels']}
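The intent of the changed lines can be sketched as a small helper (a minimal sketch, not the actual NeMo code; the function name and arguments are assumptions): only the boundary pipeline stages pull a batch from the iterator, while intermediate stages return an all-None batch and never touch the data loader.

```python
def get_batch_for_stage(dataloader_iter, is_first_stage, is_last_stage):
    # Hypothetical helper (name assumed): only boundary pipeline stages
    # pull a batch; intermediate stages return an all-None batch so the
    # data loader is never accessed.
    if not (is_first_stage or is_last_stage):
        # Intermediate pipeline stage doesn't need any inputs
        return {k: None for k in ['tokens', 'position_ids', 'attention_mask', 'labels']}
    return next(dataloader_iter)
```

Because intermediate stages only receive activations from the previous stage, returning None for every key keeps the forward-step signature unchanged without consuming the iterator.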
I'm uncertain how to handle this case. If we want to keep the same behavior, we need to access the data loader in the case self.get_attention_mask_from_fusion==False so that we can access attention_mask. But that will cause a hang. Do we just require self.get_attention_mask_from_fusion==True to support interleaved pipeline parallelism?
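If the answer is to require fused attention masks, the requirement could be enforced with a configuration guard like the following (a hedged sketch; the function name and argument names are assumptions, not the actual NeMo API):

```python
def check_interleaved_pipeline_config(virtual_pipeline_model_parallel_size,
                                      get_attention_mask_from_fusion):
    # Hypothetical guard (names assumed): with interleaved pipeline
    # parallelism, intermediate stages skip the data loader, so the
    # attention mask must come from the fused kernel instead of the batch.
    if (virtual_pipeline_model_parallel_size is not None
            and not get_attention_mask_from_fusion):
        raise ValueError(
            "Interleaved pipeline parallelism requires "
            "get_attention_mask_from_fusion=True, since intermediate "
            "stages do not read attention_mask from the data loader."
        )
```

Failing fast at setup time turns a silent distributed hang into an actionable error message.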
I'm fine with this. Can you also default get_attention_mask_from_fusion to True in the megatron gpt config and the model?
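The requested config default might look like this (a hypothetical excerpt; the key name comes from the discussion, but its exact placement in megatron_gpt_config.yaml is an assumption):

```yaml
model:
  # Default to the fused attention mask so intermediate pipeline stages
  # never need to read attention_mask from the data loader.
  get_attention_mask_from_fusion: true
```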
Can you please send it to r1.17.0?
Force-pushed 9a012b5 to ab9aaca. Signed-off-by: Tim Moon <[email protected]>
LGTM. Thanks!
…6164) Signed-off-by: Tim Moon <[email protected]>
…6164) (#6267) Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]>
…VIDIA#6164) (NVIDIA#6267) Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]> Signed-off-by: hsiehjackson <[email protected]>
What does this PR do?
I've been experiencing a hang in the first validation step when running GPT with interleaved pipeline parallelism. The hang first showed up in #6049, so it seems that accessing the data loader in intermediate pipeline stages is the root cause.
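The root cause can be illustrated with a toy model (plain Python, not NeMo code; the schedule and function are illustrative assumptions): ranks that unconditionally call next() on the data iterator consume more batches than ranks that only read when they actually need data, so under interleaved scheduling the pipeline ranks fall out of step and a later synchronized step can wait forever.

```python
def batches_consumed(needs_data_schedule, always_read):
    # Toy model of the desync: count how many batches a rank pulls from
    # the iterator over a microbatch schedule. If every rank reads on
    # every step (always_read=True), ranks with different schedules end
    # up consuming different numbers of batches than they actually need.
    consumed = 0
    for needs_data in needs_data_schedule:
        if always_read or needs_data:
            consumed += 1
    return consumed
```

Skipping the read on intermediate stages keeps every rank's iterator consumption matched to the batches it genuinely uses.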
Collection: NLP
Changelog
Usage
No change in usage
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information