Extend sequence length padding for GPT SFT to account for context parallel #8869

vysarge · 2024-04-10T05:10:34Z

What does this PR do ?

Adds a check to MegatronGPTSFTModel that ensures the sequence dimension is padded to a multiple of 16 after being split across context parallel ranks, which TE requires during fp8 training. This fixes errors such as "AssertionError: FP8 execution requires 2D input matrices with height divisible by 8 and width divisible by 16, but got tensor with dims=[5504, 184]" when using context parallel with datasets that have variable sequence lengths.
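As a rough illustration of the padding rule described above (not the code merged in this PR), the sketch below rounds a sequence length up to a multiple of 16 times the context parallel size, so that each rank's equal share is itself a multiple of 16. The helper names cp_pad_multiple and pad_seq_length, and the assumption that the sequence is split evenly across context parallel ranks, are hypothetical.

```python
def cp_pad_multiple(context_parallel_size: int, base_multiple: int = 16) -> int:
    """Multiple the full sequence is padded to (hypothetical helper).

    Assumes the sequence is split evenly across context parallel ranks,
    so each rank's share stays a multiple of base_multiple.
    """
    if context_parallel_size < 1:
        raise ValueError("context_parallel_size must be >= 1")
    return base_multiple * context_parallel_size


def pad_seq_length(seq_length: int, multiple: int) -> int:
    """Round seq_length up to the nearest multiple."""
    return ((seq_length + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    cp_size = 2
    multiple = cp_pad_multiple(cp_size)
    padded = pad_seq_length(1000, multiple)
    per_rank = padded // cp_size
    # Each context parallel rank now holds a length divisible by 16.
    print(multiple, padded, per_rank, per_rank % 16 == 0)
```

A value computed this way would play the role of the pad_seq_length_to_mult input mentioned in the changelog; the exact wiring inside MegatronGPTSFTModel is not shown here.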

Collection: nlp

Changelog

Alters pad_seq_length_to_mult input to GPT SFT datasets to account for context parallel

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

ericharper · 2024-04-16T00:18:47Z

jenkins

xrennvidia · 2024-04-16T02:56:39Z

Hi @vysarge, I do not fully understand what the first input dim and the second dim mean; could you please describe them? Also, how does this work if sequence_parallel=True and context_parallel_size > 1? Thanks.

nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py

vysarge · 2024-04-16T06:02:45Z

Hm, I thought sequence parallel and context parallel were mutually exclusive. If they're used together, do CP and TP counts split on the same dimension? If so, making the change you suggested should fix this case too; let me test it.

xrennvidia · 2024-04-16T21:00:12Z

Sequence parallel and context parallel can work together. Sequence parallel only splits the sequence for LayerNorm; context parallel additionally splits the sequence across the whole transformer layer. Yes, you can test my suggested fix; I think it should work for all cases. Thanks.
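To make the interaction concrete, here is a minimal sketch of the per-rank sequence lengths under the description above, assuming even splits: context parallel divides the sequence for the whole layer, and sequence parallel further divides it by the tensor parallel size in the LayerNorm regions. The function name local_seq_lengths and its return layout are illustrative assumptions, not NeMo or Megatron APIs.

```python
from typing import Tuple


def local_seq_lengths(
    seq_length: int,
    context_parallel_size: int,
    tensor_model_parallel_size: int,
    sequence_parallel: bool,
) -> Tuple[int, int]:
    """Return (length per CP rank, length per rank inside SP regions).

    Hypothetical helper: assumes the sequence is split evenly, first by
    context parallel size and then, only where sequence parallel applies,
    by tensor parallel size.
    """
    cp_local, remainder = divmod(seq_length, context_parallel_size)
    if remainder:
        raise ValueError("seq_length is not divisible by context_parallel_size")

    if not sequence_parallel:
        # Without sequence parallel, every region sees the CP-local length.
        return cp_local, cp_local

    sp_local, remainder = divmod(cp_local, tensor_model_parallel_size)
    if remainder:
        raise ValueError("CP-local length is not divisible by tensor_model_parallel_size")
    return cp_local, sp_local


if __name__ == "__main__":
    print(local_seq_lengths(4096, 2, 4, sequence_parallel=True))
    print(local_seq_lengths(4096, 2, 4, sequence_parallel=False))
```

Because both splits act on the sequence dimension, a padding multiple chosen for the combined case has to account for both sizes.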

vysarge · 2024-04-18T18:54:35Z

Tested and this fix is working well; PR has been updated, thanks.

xrennvidia · 2024-04-18T18:57:54Z

LGTM, thanks.

xrennvidia · 2024-04-18T20:12:03Z

jenkins

github-actions · 2024-05-05T01:46:07Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

…allel (NVIDIA#8869)
* Pad outputs from GPTSFTDataset / GPTSFTPackedDataset to match TE requirements when using fp8
* Account for SP + CP case
Signed-off-by: Valerie Sarge <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>

github-actions bot added the NLP label Apr 10, 2024

ericharper requested a review from xrennvidia April 16, 2024 00:18

xrennvidia reviewed Apr 16, 2024

vysarge added 2 commits April 18, 2024 11:53

Pad outputs from GPTSFTDataset / GPTSFTPackedDataset to match TE requirements when using fp8

6fbded8

Signed-off-by: Valerie Sarge <[email protected]>

Account for SP + CP case

76c7f71

Signed-off-by: Valerie Sarge <[email protected]>

vysarge force-pushed the vsarge/cp_size_padding branch from eab5ecb to 76c7f71 Compare April 18, 2024 18:53

vysarge requested a review from xrennvidia April 18, 2024 18:53

xrennvidia approved these changes Apr 18, 2024

Merge branch 'main' into vsarge/cp_size_padding

f319702

github-actions bot added the stale label May 5, 2024

xrennvidia removed the stale label May 5, 2024

ericharper merged commit 48a2a6b into NVIDIA:main May 6, 2024
125 checks passed

vysarge deleted the vsarge/cp_size_padding branch May 6, 2024 19:42
