
Start using ModelParallelConfig from Megatron Core #6885

Merged: 57 commits into main from mcore_gpt_path, Aug 14, 2023
Conversation

ericharper (Collaborator)

What does this PR do?

This PR adds the ModelParallelConfig arguments to be used with the next release of Megatron Core.

Collection: NLP

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
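
A minimal usage sketch (an assumption based on this PR's description, not the exact merged code; the field names follow megatron.core's ModelParallelConfig dataclass, and the concrete values are hypothetical):

from megatron.core import ModelParallelConfig

# Hypothetical example: gather the parallelism arguments that were previously
# passed individually into a single ModelParallelConfig object.
model_parallel_config = ModelParallelConfig(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=1,
    sequence_parallel=False,
)

# hidden_size is needed for pipeline schedules but is not a ModelParallelConfig
# field, so it is attached separately (see the review discussion below).
setattr(model_parallel_config, 'hidden_size', 4096)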

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the NLP label Jun 19, 2023

@github-advanced-security bot left a comment

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@github-actions github-actions bot added CI core Changes to NeMo Core labels Jun 30, 2023
@ericharper ericharper marked this pull request as ready for review July 25, 2023 16:39

@michalivne (Collaborator) left a comment

LGTM! Very useful to collect configs into model parallel configs. See minor comments.

try:
    # hidden size is needed for pipeline schedules but is not currently in ModelParallelConfig
    setattr(model_parallel_config, 'hidden_size', self.cfg.hidden_size)
except AttributeError:
    logging.warning(

Collaborator:

Why not also fail here? If it is missing and will fail later, wouldn't this be a good place to stop?

ericharper (Collaborator, Author):

I found this was too brittle. Maybe we can add a strict argument?
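
A sketch of what that strict argument might look like (an assumption based on this thread, not the merged code; the method name build_model_parallel_config comes from the excerpt quoted later in this conversation):

def build_model_parallel_config(self, strict: bool = False):
    model_parallel_config = super().build_model_parallel_config()
    try:
        # hidden size is needed for pipeline schedules but is not
        # currently in ModelParallelConfig
        setattr(model_parallel_config, 'hidden_size', self.cfg.hidden_size)
    except AttributeError:
        if strict:
            # fail fast instead of deferring the error to the pipeline schedule
            raise
        logging.warning("hidden_size not found in cfg; pipeline schedules may fail later")
    return model_parallel_config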

ericharper (Collaborator, Author):

What do you think about the suggestion?

""" Hidden size needs to be set from the cfg.encoder for the pipeline schedule.
"""

model_parallel_config = super().build_model_parallel_config()

Collaborator:

Wouldn't the parent class emit a warning if hidden_size is not in cfg.model.hidden_size? Perhaps this argument could be passed to the parent method?

ericharper (Collaborator, Author):

Could you expand more on your suggestion? I added this because the parent class didn't have hidden_size for this model.
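
For context, a hedged sketch of the kind of subclass override being discussed (reconstructed from the excerpt above; reading hidden_size from cfg.encoder is an assumption based on the docstring, not a quote from the merged code):

def build_model_parallel_config(self):
    """Hidden size needs to be set from cfg.encoder for the pipeline schedule."""
    model_parallel_config = super().build_model_parallel_config()
    # this encoder/decoder model keeps hidden_size under cfg.encoder rather than
    # at the top level of cfg, which is where the parent class looks for it
    setattr(model_parallel_config, 'hidden_size', self.cfg.encoder.hidden_size)
    return model_parallel_config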

aklife97 previously approved these changes Aug 7, 2023

@aklife97 (Collaborator) left a comment

LGTM, thank you!
The main concern I have is MPConfig vs TransformerConfig; we probably need to discuss more how we should structure the usages. Apart from that, this looks like it covers everything.

ericharper and others added 22 commits August 8, 2023 10:46

aklife97 previously approved these changes Aug 8, 2023

@aklife97 (Collaborator) left a comment

LGTM, thank you!! Just one potential issue with the sequence length setting.

@aklife97 (Collaborator) left a comment

LGTM! I think we should merge this in now.
@michalivne: let us know what your feedback is on Eric's response, and we can send fixes in later PRs accordingly!

@ericharper ericharper merged commit 4833347 into main Aug 14, 2023
13 of 15 checks passed
@ericharper ericharper deleted the mcore_gpt_path branch August 14, 2023 04:55
guyueh1 pushed a commit to guyueh1/NeMo that referenced this pull request Aug 14, 2023
* start adding gpt from megatron core path
* set model parallel config
* use model parallel config object
* update args
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* set vp size to none if it is 1
* set vp size to none if it is 1
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add TransformerConfig
* start updating to TransformerConfig
* add todo
* revert to model parallel config
* add hidden_size to model_parallel_config
* remove imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove import
* small clean up
* update hidden size in peft base model, add mcore commit to jenkins
* update module args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add config obj to flash attention tests
* remove args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove sequence parallel arg
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update args
* add config to self
* update args
* update args
* update args
* add config to test
* get hidden_size from config
* add try except
* use default
* update config with hidden size
* remove arg
* comment out jenkins test
* revert import
* remove optimizer_idx
* prefetch num microbatches
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove import
* temporarily comment jenkins test
* update seq_length
* remove commented code
* update arg
* update mbs and gbs of test
* update batch size in test
* fix precision in test
* update precision
* move hidden_size out of conditional
* [pre-commit.ci] auto fixes from pre-commit.com hooks

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@@ -16,6 +16,7 @@

import pytest
import torch
from megatron.core import ModelParallelConfig

Collaborator:

This is breaking pytest --cpu when doing a basic setup without all the fluff.
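
A hedged sketch of an import guard that would keep the test module collectible without Megatron Core installed (following the optional-import pattern the PR checklist asks reviewers to verify; the flag name and the test itself are hypothetical):

import pytest
import torch

try:
    from megatron.core import ModelParallelConfig

    HAVE_MEGATRON_CORE = True
except (ImportError, ModuleNotFoundError):
    HAVE_MEGATRON_CORE = False


@pytest.mark.skipif(not HAVE_MEGATRON_CORE, reason="megatron.core is not installed")
def test_model_parallel_config_defaults():
    # hypothetical test: the guarded import is only touched when available
    config = ModelParallelConfig()
    assert config is not None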

dorotat-nv pushed a commit to dorotat-nv/NeMo that referenced this pull request Aug 24, 2023
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
Labels: CI, core (Changes to NeMo Core), NLP
Projects: none yet
Development: no issues linked that merging this pull request may close
4 participants