
Support tp pp conversion #6218

Merged - 44 commits merged into NVIDIA:main on Mar 25, 2023

Conversation

@titu1994 (Collaborator) commented Mar 16, 2023

What does this PR do?

Adds support for changing the pipeline parallel size of GPT models after construction

Collection: [Core, NLP]

Changelog

  • Add a new script for PP conversion to avoid breaking the old script (the old script should probably be deprecated once the new one fully supports all of its functionality)
  • Add modifications that allow partially loading model-parallel models into memory to extract their parameters (a general sketch of this pattern follows this list)
  • Add PP and TP conversion support for both Megatron GPT and Megatron T5 (the latter only when embeddings are shared between encoder and decoder)
  • Add support for forcing CPU construction of T5 and GPT models
  • Add support for loading a model based on AppState parameters instead of Trainer parameters.
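The partial-loading and CPU-construction items above share one underlying idea: inspect checkpoint shards without materializing the full model on GPU. The following is a minimal, hypothetical sketch of that general pattern in plain PyTorch; the function name and key prefix are assumptions for illustration, not this script's actual internals.

import torch

# Hypothetical sketch (not the script's actual API): load one checkpoint
# shard onto CPU and pull out only the tensors needed for repartitioning.
def extract_parameters(shard_path: str, prefix: str = "model.") -> dict:
    # map_location="cpu" keeps large checkpoints out of GPU memory entirely;
    # this assumes the file stores a plain state dict.
    state_dict = torch.load(shard_path, map_location="cpu")
    # Keep only the parameters under the given prefix.
    return {k: v for k, v in state_dict.items() if k.startswith(prefix)}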

Usage


# Megatron GPT
python megatron_change_num_partitions.py \
    --model_file=PATH_TO_SRC_FILE \
    --target_file=PATH_TO_TGT_FILE \
    --tensor_model_parallel_size=1 \
    --target_tensor_model_parallel_size=1 \
    --pipeline_model_parallel_size=1 \
    --target_pipeline_model_parallel_size=1 \
    --precision=bf16

# Megatron T5
python megatron_change_num_partitions.py \
    --model_file=PATH_TO_SRC_FILE \
    --target_file=PATH_TO_TGT_FILE \
    --model_class="nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model" \
    --tensor_model_parallel_size=1 \
    --target_tensor_model_parallel_size=1 \
    --pipeline_model_parallel_size=1 \
    --target_pipeline_model_parallel_size=1 \
    --target_pipeline_model_parallel_split_rank=0 \
    --precision=bf16

# NOTE: When converting large models, always pre-extract the .nemo file first and only then perform the conversion

$ mkdir "unpacked_nemo_file"
$ tar -xvf "<path to nemo file>" -C "<absolute path to pwd>/unpacked_nemo_file/"

python megatron_change_num_partitions.py \
    ...
    --model_extracted_dir="<Absolute path to pwd>/unpacked_nemo_file/"

# NOTE: Converting other model types.
# The default model type is MegatronGPTModel; to convert another model, pass the classpath of that model.
# For example, for MegatronT5Model:

python megatron_change_num_partitions.py \
    ...
    --model_class="nemo.collections.nlp.models.language_modeling.megatron_t5_model.MegatronT5Model"

# Additional arguments:

--num_gpu_per_node: Number of GPUs per node. Default is 8.
--megatron_legacy: Whether the model is a legacy Megatron model. Default is False. May be unsupported for
    pipeline parallelism changes.
--tokenizer_model_path: Path to tokenizer model. Default is None. When not None, overrides the tokenizer model path
    in the model config.
--tokenizer_vocab_file: Path to tokenizer vocab file. Default is None. When not None, overrides the tokenizer vocab
    file in the model config.
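The tokenizer flags above override paths in the model config. A rough, hypothetical sketch of how such an override can be applied with OmegaConf (the config key layout here is an assumption for illustration; the commit history below mentions open_dict hooks, but this is not the script's actual code):

from omegaconf import OmegaConf, open_dict

# Hypothetical sketch of a tokenizer-path override, not the script's actual code.
# The "tokenizer.model" key is an assumption for illustration.
cfg = OmegaConf.create({"tokenizer": {"model": "old/tokenizer.model"}})

tokenizer_model_path = "new/tokenizer.model"  # e.g. the --tokenizer_model_path value
if tokenizer_model_path is not None:
    # open_dict unlocks a struct-mode config so the key can be overwritten.
    with open_dict(cfg):
        cfg.tokenizer.model = tokenizer_model_path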

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

@github-actions bot added the NLP label Mar 16, 2023
@titu1994 marked this pull request as draft March 16, 2023 04:16
@titu1994 marked this pull request as ready for review March 17, 2023 07:07
@yidong72 (Collaborator) left a comment:

pretty cool PR. just have one minor comment.

@ericharper previously approved these changes Mar 24, 2023
@ericharper (Collaborator) left a comment:

LGTM. Thanks!

@aklife97 (Collaborator) left a comment:

LGTM! just a couple of minor comments, everything else looks great!

@aklife97 (Collaborator) commented:

also, can we add a PP change CI as well? would be helpful to keep testing that since the PR brings in global overrides that may cause issues if someone changes it

@titu1994 (Collaborator, Author) commented Mar 24, 2023

Good point about the Jenkins test - updated the old one from a TP reduce-and-increase test to also jointly increase PP by an even or odd number, though this only tests GPT. We need nightly tests that cover the whole matrix of TP (inc x dec) x PP (inc x dec) x {GPT, T5} - but that is super expensive w.r.t. time and storage space on this CI. Will need to look into how to set that up.
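As a rough illustration of the size of that matrix, a count of the combinations (the labels here are purely illustrative):

from itertools import product

# Rough count of the nightly test matrix described above; names are illustrative.
tp_directions = ["tp_increase", "tp_decrease"]
pp_directions = ["pp_increase", "pp_decrease"]
models = ["GPT", "T5"]

matrix = list(product(tp_directions, pp_directions, models))
print(len(matrix))  # 8 configurations, each needing its own converted checkpoints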

@aklife97 previously approved these changes Mar 24, 2023

@aklife97 (Collaborator) left a comment:

LGTM, thanks!

@aklife97 merged commit aaa0cca into NVIDIA:main Mar 25, 2023
@wdykas mentioned this pull request Mar 29, 2023
@titu1994 deleted the support_tp_pp_conversion branch March 31, 2023 22:05
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* Add required flags to partially load model
* Add cleaned up script for tp pp change
* Add cleaned up script for tp pp change
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Add support to change parameter dtypes during conversion
* Add Debug Prints flag
* Improve error logs
* Fix issues with TP > 1 for Megatron T5
* Finalize splitting of T5 models
* Update docstrings
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Finalize pp tp change for T5 models
* Fix CodeQL issue
* Fix dtype cast of num_gpu_per_node
* Update config
* Remove block for config checks
* Reduce shared embedding check for older configs
* Add support for extracted directory path
* Force CPU init for TP 1 PP 1 temp model
* Patch T5 models to init fully on CPU
* Update docstring
* Update docstring
* Update prints to logging
* Patch apex code
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Patch typo
* Fix import test of ModelType
* Add docstring comment for nlp override
* Merge new file with old file
* Update script call signature
* Remove comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update jenkins test
* Fix formatting
* Add open_dict hooks
* Fix unit test
* Fix unit test
* Retry in another directory
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Revert second test because of shutil.rename error on CI

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>