
[NLP] Support T5 with Megatron Core #6222

Merged: 7 commits from GPT_integrate_core_t5 into GPT_integrate_core on Apr 4, 2023

Conversation

@SeanNaren (Collaborator) commented Mar 16, 2023

What does this PR do?

Adds Megatron Core support for our T5 model. This currently only works for pre-training; fine-tuning appears to be broken because of the dynamic max lengths in the GLUE/XNLI datasets.

I think the fix going forward will be to pad to the maximum length, so that sequence sizes are always the same; this is a requirement of the iterator object now used in training_step.
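
For illustration, here is a minimal sketch of that padding idea; the helper name `pad_batch_to_max_length`, its arguments, and the pad id default are hypothetical and not part of this PR:

```python
import torch

def pad_batch_to_max_length(token_id_lists, max_seq_length, pad_id=0):
    """Pad (or truncate) every example to a fixed max_seq_length so that
    every batch the data iterator yields has identical sequence sizes."""
    padded, masks = [], []
    for ids in token_id_lists:
        ids = list(ids)[:max_seq_length]              # truncate overly long examples
        pad_len = max_seq_length - len(ids)
        padded.append(ids + [pad_id] * pad_len)       # right-pad with the pad token id
        masks.append([1] * len(ids) + [0] * pad_len)  # 0 marks padding positions
    return torch.tensor(padded), torch.tensor(masks)

# Two examples of different lengths come out with the same fixed shape.
tokens, mask = pad_batch_to_max_length([[101, 7592, 102], [101, 2088, 2003, 2307, 102]], max_seq_length=8)
print(tokens.shape, mask.shape)  # torch.Size([2, 8]) torch.Size([2, 8])
```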

I also need to confirm that the weights of the model are the same between main and this branch.

  • Fix support for GLUE/XNLI fine-tuning
  • Ensure convergence is the same
  • Fix support for encode/decode functions

Collection: NLP

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs for various areas.

Additional Information

  • Related to # (issue)

@github-actions bot added the NLP label Mar 16, 2023
@github-actions bot added the core (Changes to NeMo Core) label Mar 16, 2023
@SeanNaren (Collaborator, Author) commented:

It seems the formatting on the GPT_integrate_core branch was off and got included in this PR.

I've pushed formatting changes directly to the GPT_integrate_core branch. @aklife97

@github-actions bot removed the core (Changes to NeMo Core) label Mar 16, 2023
@MaximumEntropy self-requested a review March 20, 2023 17:52
@SeanNaren marked this pull request as ready for review March 29, 2023 20:12
@aklife97 merged commit a7c502b into GPT_integrate_core Apr 4, 2023
@aklife97 deleted the GPT_integrate_core_t5 branch April 4, 2023 21:35
ericharper added a commit that referenced this pull request Apr 13, 2023
* import parallel_state and tensor_parallel from megatron.core

Signed-off-by: ericharper <[email protected]>

* update column parallel async allreduce arg

Signed-off-by: ericharper <[email protected]>

* typos

Signed-off-by: ericharper <[email protected]>

* play stash + some changes

Signed-off-by: Abhinav Khattar <[email protected]>

* make grad scaler callable

Signed-off-by: ericharper <[email protected]>

* Fixed formatting

Signed-off-by: SeanNaren <[email protected]>

* Make sure RETRO integrates well with the core (#6207)

* fix tests

Signed-off-by: Yi Dong <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Yi Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [NLP] Support T5 with Megatron Core (#6222)

* Support T5 with Megatron Core

Signed-off-by: SeanNaren <[email protected]>

* Remove comment

Signed-off-by: SeanNaren <[email protected]>

* Update prediction step

Signed-off-by: SeanNaren <[email protected]>

* Further changes to fix fine-tuning

Signed-off-by: SeanNaren <[email protected]>

* Bug fixes from runs

Signed-off-by: SeanNaren <[email protected]>

* Revert changes to batch sampler, swap to pretrained sampler

Signed-off-by: SeanNaren <[email protected]>

* Address feedback

Signed-off-by: SeanNaren <[email protected]>

---------

Signed-off-by: SeanNaren <[email protected]>

* GPT P-tuning core (max_len pad -> slow)

Signed-off-by: Abhinav Khattar <[email protected]>

* add GPT p-tuning w/ global batch based passing

Signed-off-by: Abhinav Khattar <[email protected]>

* add T5 p-tuning support

Signed-off-by: Abhinav Khattar <[email protected]>

* add megatron core install to Jenkinsfile

Signed-off-by: ericharper <[email protected]>

* fix command

Signed-off-by: ericharper <[email protected]>

* add guard default for arg

Signed-off-by: ericharper <[email protected]>

* shift bert, retro, adapter + other namespace changes

Signed-off-by: Abhinav Khattar <[email protected]>

* build_model merge into one

Signed-off-by: Abhinav Khattar <[email protected]>

* Ensure fine-tuning/prompt learning work for T5 (#6385)

Signed-off-by: SeanNaren <[email protected]>

* rm extra split impl

Signed-off-by: Abhinav Khattar <[email protected]>

* fix for CI

Signed-off-by: Abhinav Khattar <[email protected]>

* temp change for tests

Signed-off-by: Abhinav Khattar <[email protected]>

* add bs=1 for log

Signed-off-by: Abhinav Khattar <[email protected]>

* fix

Signed-off-by: Abhinav Khattar <[email protected]>

* iter changes NMT

Signed-off-by: Abhinav Khattar <[email protected]>

* NMT partial fix

Signed-off-by: Abhinav Khattar <[email protected]>

* move on_train_batch_end to base_model

Signed-off-by: Abhinav Khattar <[email protected]>

* rm on_train_batch_end

Signed-off-by: Abhinav Khattar <[email protected]>

* temp remove NMT test

Signed-off-by: Abhinav Khattar <[email protected]>

* add training_step logic for T5 derived dynamic len models

Signed-off-by: Abhinav Khattar <[email protected]>

* add NMT test back

Signed-off-by: Abhinav Khattar <[email protected]>

* style fix

Signed-off-by: Abhinav Khattar <[email protected]>

* change no_async_tensor_model_parallel_allreduce

Signed-off-by: Abhinav Khattar <[email protected]>

* sequence_parallel_enabled -> sequence_parallel

Signed-off-by: Abhinav Khattar <[email protected]>

* fix T5 FT batch size

Signed-off-by: Abhinav Khattar <[email protected]>

* seq enabled

Signed-off-by: Abhinav Khattar <[email protected]>

* T5 sequence length fix

Signed-off-by: Abhinav Khattar <[email protected]>

* NMT mp fork to spawn

Signed-off-by: Abhinav Khattar <[email protected]>

* make function signatures consistent across models

Signed-off-by: Abhinav Khattar <[email protected]>

* make print log

Signed-off-by: Abhinav Khattar <[email protected]>

* rm unused import

Signed-off-by: Abhinav Khattar <[email protected]>

* update Dockerfile to install core

Signed-off-by: Abhinav Khattar <[email protected]>

* keep core path in workspace

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: SeanNaren <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: SeanNaren <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023