
Virtual pipeline parallel support for MegatronGPTSFTModel #7964

Merged
13 commits merged from vsarge/gpt_sft_vp into NVIDIA:main on Jan 17, 2024

Conversation

@vysarge (Contributor) commented Dec 2, 2023

What does this PR do?

Enables running MegatronGPTSFTModel with virtual pipeline parallelism when peft_scheme=null.

Collection: NLP (language_modeling)

Changelog

  • Alters word embedding initialization and the passing of several parameters in MegatronGPTSFTModel to accommodate virtual pipeline parallelism when self.use_peft is false
  • Adds a guard in MegatronGPTSFTModel.__init__ that raises a ValueError if self.use_peft is true while virtual pipeline parallelism is requested, since this combination is unsupported (a sketch follows this list)
  • Corrects the TP group initialization check in MegatronGPTSFTModel to account for both the TE and MCore flags, as in MegatronGPTModel
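
As a rough illustration of the new guard, the check amounts to something like the following. This is a minimal sketch, not the PR's exact code; check_vp_peft_compatibility is a hypothetical helper name, and only the config key follows standard NeMo naming:

# Minimal sketch of the init-time guard described above.
# check_vp_peft_compatibility is a hypothetical name, not a function from the PR;
# cfg can be a plain dict or an OmegaConf DictConfig.
def check_vp_peft_compatibility(cfg, use_peft):
    vp_size = cfg.get('virtual_pipeline_model_parallel_size', None)
    if use_peft and vp_size is not None:
        raise ValueError(
            'Virtual pipeline parallelism is not supported with PEFT; '
            'set peft_scheme=null or virtual_pipeline_model_parallel_size=null.'
        )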

Usage

Run a GPT model with SFT and set virtual_pipeline_model_parallel_size, for example:

python examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
++model.virtual_pipeline_model_parallel_size=10 \
...
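
Note that under virtual (interleaved) pipeline parallelism each pipeline rank holds virtual_pipeline_model_parallel_size non-contiguous chunks of the model, so the total layer count must divide evenly across pipeline_model_parallel_size * virtual_pipeline_model_parallel_size. A small illustrative check, with assumed values that are not taken from this PR:

# Illustrative divisibility check; all values are assumptions for the example.
num_layers = 40
pipeline_model_parallel_size = 2
virtual_pipeline_model_parallel_size = 10  # matches the override above
num_chunks = pipeline_model_parallel_size * virtual_pipeline_model_parallel_size
assert num_layers % num_chunks == 0, 'num_layers must be divisible by PP * VP'
print(num_layers // num_chunks)  # 2 layers per virtual chunk on each pipeline rank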

Jenkins CI

To run Jenkins, a NeMo user with write access must comment "jenkins" on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

github-actions bot added the NLP label Dec 2, 2023
vysarge marked this pull request as ready for review Dec 5, 2023
github-actions bot commented:

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions bot removed the stale label Dec 21, 2023
@erhoo82 (Collaborator) commented Jan 2, 2024:

jenkins

@ericharper (Collaborator) commented:

jenkins

@ericharper (Collaborator) left a comment:

LGTM. Thanks!

ericharper merged commit 8811946 into NVIDIA:main Jan 17, 2024
11 checks passed
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 19, 2024
* Virtual pipeline parallel support for MegatronGPTSFTModel

Signed-off-by: Valerie Sarge <[email protected]>

* Deduplicate word embedding init code in MegatronGPTModel and MegatronGPTSFTModel into one method

Signed-off-by: Valerie Sarge <[email protected]>

* Correct TP group init call in MegatronGPTSFTModel to check for both TE and MCore, as in MegatronGPTModel

Signed-off-by: Valerie Sarge <[email protected]>

* Correct accidental double pipeline parallel size check

Signed-off-by: Valerie Sarge <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Correct get_gpt_module_list -> get_model_module_list from SFT model

Signed-off-by: Valerie Sarge <[email protected]>

---------

Signed-off-by: Valerie Sarge <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
akoumpa pushed a commit to akoumpa/NeMo that referenced this pull request Jan 20, 2024
stevehuang52 pushed a commit that referenced this pull request Jan 31, 2024
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
vysarge deleted the vsarge/gpt_sft_vp branch Mar 12, 2024
pablo-garay pushed a commit that referenced this pull request Mar 19, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024