Refactor PP conversion + add support for TP only conversion #6419

titu1994 · 2023-04-13T00:47:50Z

What does this PR do ?

Updates the tensor-parallel (TP) / pipeline-parallel (PP) conversion script to support TP-only conversion (backported from NeMo 1.16).
Refactors the PP conversion mechanism.

Collection: [NLP]

Changelog

- Refactor components into functions and split the TP and PP conversion components apart
- Refactor PP conversion into model-specific handler classes
- Backport TP-only conversion support (available irrespective of whether PP conversion is supported for the model)
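
To illustrate what the PP side of the conversion has to keep track of, here is a minimal sketch of re-indexing layers when the pipeline size changes. It is not the code from this PR: the function names are hypothetical, and it assumes layers are divided evenly and contiguously across pipeline stages.

```python
from typing import Tuple


def layers_per_stage(num_layers: int, pp_size: int) -> int:
    """Layers held by each pipeline stage (assumption: an even split)."""
    if num_layers % pp_size != 0:
        raise ValueError("this sketch assumes num_layers is divisible by pp_size")
    return num_layers // pp_size


def to_global_index(pp_rank: int, local_idx: int, num_layers: int, pp_size: int) -> int:
    """Map (pipeline rank, local layer index) to a global layer index."""
    return pp_rank * layers_per_stage(num_layers, pp_size) + local_idx


def to_local_index(global_idx: int, num_layers: int, pp_size: int) -> Tuple[int, int]:
    """Map a global layer index to (pipeline rank, local layer index)."""
    return divmod(global_idx, layers_per_stage(num_layers, pp_size))


def remap_layer(pp_rank: int, local_idx: int, num_layers: int,
                src_pp: int, tgt_pp: int) -> Tuple[int, int]:
    """Find where a layer from the source PP layout lives in the target layout."""
    global_idx = to_global_index(pp_rank, local_idx, num_layers, src_pp)
    return to_local_index(global_idx, num_layers, tgt_pp)


# An 8-layer model going from PP=2 to PP=4:
# layer 2 on source rank 1 is global layer 6, which becomes layer 0 on target rank 3.
print(remap_layer(pp_rank=1, local_idx=2, num_layers=8, src_pp=2, tgt_pp=4))
```

Architectures that do not follow this simple contiguous layout are the reason a single generic mapping is not enough.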

Usage

python megatron_change_num_partitions.py \
    --model_file="ckpt/megatron_gpt_model.nemo" \
    --target_file="ckpt/megatron_gpt_model_tp2_pp1.nemo" \
    --tensor_model_parallel_size=1 \
    --target_tensor_model_parallel_size=2 \
    --pipeline_model_parallel_size=1 \
    --target_pipeline_model_parallel_size=1 \
    --tp_conversion_only
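
For intuition about what the flags above request (TP 1 to TP 2, PP unchanged), the sketch below splits a weight matrix into tensor-parallel shards and merges it back. It uses plain Python lists rather than tensors, the helper names are made up for this example, and which dimension a given layer is split along is an assumption here; the actual script is not reproduced.

```python
from typing import List

Matrix = List[List[float]]


def split_rows(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Split a matrix along dim 0 into tp_size equal shards."""
    rows = len(weight)
    if rows % tp_size != 0:
        raise ValueError("row count must be divisible by tp_size")
    step = rows // tp_size
    return [weight[i * step:(i + 1) * step] for i in range(tp_size)]


def split_cols(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Split a matrix along dim 1 into tp_size equal shards."""
    cols = len(weight[0])
    if cols % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    step = cols // tp_size
    return [[row[i * step:(i + 1) * step] for row in weight] for i in range(tp_size)]


def merge_rows(shards: List[Matrix]) -> Matrix:
    """Inverse of split_rows: stack the shards back together along dim 0."""
    return [row for shard in shards for row in shard]


def merge_cols(shards: List[Matrix]) -> Matrix:
    """Inverse of split_cols: concatenate matching rows of each shard."""
    return [sum((shard[r] for shard in shards), []) for r in range(len(shards[0]))]


# TP 1 -> TP 2 for a 4x2 weight split along dim 0.
full = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
shards = split_rows(full, tp_size=2)
assert merge_rows(shards) == full  # resharding loses no values
```

Going the other way (for example TP 2 to TP 1) is the merge step alone, and changing between two TP sizes greater than 1 can be pictured as a merge followed by a split.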

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

Signed-off-by: smajumdar <[email protected]>

aklife97 · 2023-04-13T21:05:40Z

examples/nlp/language_modeling/megatron_change_num_partitions.py

+        return offset_diff
+
+
+class T5Handler:


With this handler-based approach, how do we handle other models, say BERT? Do we add a handler for each model?

titu1994 · 2023-04-14

Yep, we'll need a handler for each model. Even a minor change in a model's architecture can shift many layers around or require duplication, which complicates merging pipeline-parallel partitions.
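
The handler-per-model idea can be sketched roughly as follows. Everything in this block is hypothetical (class names, the `layer_offset` method, the registry keys) and is not the interface used in the PR. It only shows why per-architecture classes are convenient: each one encodes its own layer-numbering rule, and a model without a handler fails PP conversion explicitly.

```python
from typing import Dict


class BaseHandler:
    """Hypothetical interface: where a pipeline stage's layers start."""

    def layer_offset(self, pp_rank: int, layers_per_stage: int) -> int:
        raise NotImplementedError


class DecoderOnlyHandler(BaseHandler):
    """Assumed layout: one contiguous stack of layers across all stages."""

    def layer_offset(self, pp_rank: int, layers_per_stage: int) -> int:
        return pp_rank * layers_per_stage


class EncoderDecoderHandler(BaseHandler):
    """Assumed layout: decoder layer numbering restarts after the encoder stages."""

    def __init__(self, encoder_stages: int) -> None:
        self.encoder_stages = encoder_stages

    def layer_offset(self, pp_rank: int, layers_per_stage: int) -> int:
        if pp_rank < self.encoder_stages:
            return pp_rank * layers_per_stage
        return (pp_rank - self.encoder_stages) * layers_per_stage


# Registry keyed by an illustrative model kind; adding a model means adding an entry.
HANDLERS: Dict[str, BaseHandler] = {
    "decoder_only": DecoderOnlyHandler(),
    "encoder_decoder": EncoderDecoderHandler(encoder_stages=2),
}


def get_handler(model_kind: str) -> BaseHandler:
    """Look up the PP handler, failing loudly for unsupported architectures."""
    try:
        return HANDLERS[model_kind]
    except KeyError:
        raise NotImplementedError(
            f"PP conversion has no handler for {model_kind!r}"
        ) from None
```

Under this layout, a model with no registered handler would still be a candidate for TP-only conversion, since that path does not need to know how layers are distributed across pipeline stages.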

aklife97

LGTM! thanks

Refactor PP conversion + add support for tp only conversion

906abac

Signed-off-by: smajumdar <[email protected]>

github-actions bot added the NLP label Apr 13, 2023

aklife97 reviewed Apr 13, 2023

aklife97 approved these changes Apr 14, 2023

titu1994 merged commit 8055411 into NVIDIA:main Apr 14, 2023

titu1994 deleted the refactor_converter branch April 14, 2023 01:29

hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023

Refactor PP conversion + add support for tp only conversion (NVIDIA#6419)

9455c31

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
