Make sure RETRO integrates well with the core #6207

yidong72 · 2023-03-15T20:07:43Z

What does this PR do ?

Make sure all the unit tests passes.
Make sure the training loss is the same as with apex.

Signed-off-by: Yi Dong <[email protected]>

for more information, see https://pre-commit.ci

ericharper

LGTM. Thanks!

* import parallel_state and tensor_parallel from megatron.core Signed-off-by: ericharper <[email protected]> * update column parallel async allreduce arg Signed-off-by: ericharper <[email protected]> * typos Signed-off-by: ericharper <[email protected]> * play stash + some changes Signed-off-by: Abhinav Khattar <[email protected]> * make grad scaler callable Signed-off-by: ericharper <[email protected]> * Fixed formatting Signed-off-by: SeanNaren <[email protected]> * Make sure RETRO integrates well with the core (#6207) * fix tests Signed-off-by: Yi Dong <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Yi Dong <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [NLP] Support T5 with Megatron Core (#6222) * Support T5 with Megatron Core Signed-off-by: SeanNaren <[email protected]> * Remove comment Signed-off-by: SeanNaren <[email protected]> * Update prediction step Signed-off-by: SeanNaren <[email protected]> * Further changes to fix fine-tuning Signed-off-by: SeanNaren <[email protected]> * Bug fixes from runs Signed-off-by: SeanNaren <[email protected]> * Revert changes to batch sampler, swap to pretrained sampler Signed-off-by: SeanNaren <[email protected]> * Address feedback Signed-off-by: SeanNaren <[email protected]> --------- Signed-off-by: SeanNaren <[email protected]> * GPT P-tuning core (max_len pad -> slow) Signed-off-by: Abhinav Khattar <[email protected]> * add GPT p-tuning w/ global batch based passing Signed-off-by: Abhinav Khattar <[email protected]> * add T5 p-tuning support Signed-off-by: Abhinav Khattar <[email protected]> * add megatron core install to Jenkinsfile Signed-off-by: ericharper <[email protected]> * fix command Signed-off-by: ericharper <[email protected]> * add guard efault for arg Signed-off-by: ericharper <[email protected]> * shift bert, retro, adapter + other namespace changes Signed-off-by: Abhinav Khattar <[email protected]> * build_model merge into one Signed-off-by: Abhinav Khattar <[email protected]> * Ensure fine-tuning/prompt learning work for T5 (#6385) Signed-off-by: SeanNaren <[email protected]> * rm extra split impl Signed-off-by: Abhinav Khattar <[email protected]> * fix for CI Signed-off-by: Abhinav Khattar <[email protected]> * temp change for tests Signed-off-by: Abhinav Khattar <[email protected]> * add bs=1 for log Signed-off-by: Abhinav Khattar <[email protected]> * fix Signed-off-by: Abhinav Khattar <[email protected]> * iter changes NMT Signed-off-by: Abhinav Khattar <[email protected]> * NMT partial fix Signed-off-by: Abhinav Khattar <[email protected]> * move on_train_batch_end to base_model Signed-off-by: Abhinav Khattar <[email protected]> * rm on_train_batch_end Signed-off-by: Abhinav Khattar <[email protected]> * temp remove NMT test Signed-off-by: Abhinav Khattar <[email protected]> * add training_step logic for T5 derived dynamic len models Signed-off-by: Abhinav Khattar <[email protected]> * add NMT test back Signed-off-by: Abhinav Khattar <[email protected]> * style fix Signed-off-by: Abhinav Khattar <[email protected]> * change no_async_tensor_model_parallel_allreduce Signed-off-by: Abhinav Khattar <[email protected]> * sequence_parallel_enabled -> sequence_parallel Signed-off-by: Abhinav Khattar <[email protected]> * fix T5 FT batch size Signed-off-by: Abhinav Khattar <[email protected]> * seq enabled Signed-off-by: Abhinav Khattar <[email protected]> * T5 sequence length fix Signed-off-by: Abhinav Khattar <[email protected]> * NMT mp fork to spawn Signed-off-by: Abhinav Khattar <[email protected]> * make function signatures consistent across models Signed-off-by: Abhinav Khattar <[email protected]> * make print log Signed-off-by: Abhinav Khattar <[email protected]> * rm unused import Signed-off-by: Abhinav Khattar <[email protected]> * update Dockerfile to install core Signed-off-by: Abhinav Khattar <[email protected]> * keep core path in workspace Signed-off-by: Abhinav Khattar <[email protected]> --------- Signed-off-by: ericharper <[email protected]> Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: SeanNaren <[email protected]> Signed-off-by: Yi Dong <[email protected]> Co-authored-by: ericharper <[email protected]> Co-authored-by: SeanNaren <[email protected]> Co-authored-by: Yi Dong <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* import parallel_state and tensor_parallel from megatron.core Signed-off-by: ericharper <[email protected]> * update column parallel async allreduce arg Signed-off-by: ericharper <[email protected]> * typos Signed-off-by: ericharper <[email protected]> * play stash + some changes Signed-off-by: Abhinav Khattar <[email protected]> * make grad scaler callable Signed-off-by: ericharper <[email protected]> * Fixed formatting Signed-off-by: SeanNaren <[email protected]> * Make sure RETRO integrates well with the core (NVIDIA#6207) * fix tests Signed-off-by: Yi Dong <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Yi Dong <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [NLP] Support T5 with Megatron Core (NVIDIA#6222) * Support T5 with Megatron Core Signed-off-by: SeanNaren <[email protected]> * Remove comment Signed-off-by: SeanNaren <[email protected]> * Update prediction step Signed-off-by: SeanNaren <[email protected]> * Further changes to fix fine-tuning Signed-off-by: SeanNaren <[email protected]> * Bug fixes from runs Signed-off-by: SeanNaren <[email protected]> * Revert changes to batch sampler, swap to pretrained sampler Signed-off-by: SeanNaren <[email protected]> * Address feedback Signed-off-by: SeanNaren <[email protected]> --------- Signed-off-by: SeanNaren <[email protected]> * GPT P-tuning core (max_len pad -> slow) Signed-off-by: Abhinav Khattar <[email protected]> * add GPT p-tuning w/ global batch based passing Signed-off-by: Abhinav Khattar <[email protected]> * add T5 p-tuning support Signed-off-by: Abhinav Khattar <[email protected]> * add megatron core install to Jenkinsfile Signed-off-by: ericharper <[email protected]> * fix command Signed-off-by: ericharper <[email protected]> * add guard efault for arg Signed-off-by: ericharper <[email protected]> * shift bert, retro, adapter + other namespace changes Signed-off-by: Abhinav Khattar <[email protected]> * build_model merge into one Signed-off-by: Abhinav Khattar <[email protected]> * Ensure fine-tuning/prompt learning work for T5 (NVIDIA#6385) Signed-off-by: SeanNaren <[email protected]> * rm extra split impl Signed-off-by: Abhinav Khattar <[email protected]> * fix for CI Signed-off-by: Abhinav Khattar <[email protected]> * temp change for tests Signed-off-by: Abhinav Khattar <[email protected]> * add bs=1 for log Signed-off-by: Abhinav Khattar <[email protected]> * fix Signed-off-by: Abhinav Khattar <[email protected]> * iter changes NMT Signed-off-by: Abhinav Khattar <[email protected]> * NMT partial fix Signed-off-by: Abhinav Khattar <[email protected]> * move on_train_batch_end to base_model Signed-off-by: Abhinav Khattar <[email protected]> * rm on_train_batch_end Signed-off-by: Abhinav Khattar <[email protected]> * temp remove NMT test Signed-off-by: Abhinav Khattar <[email protected]> * add training_step logic for T5 derived dynamic len models Signed-off-by: Abhinav Khattar <[email protected]> * add NMT test back Signed-off-by: Abhinav Khattar <[email protected]> * style fix Signed-off-by: Abhinav Khattar <[email protected]> * change no_async_tensor_model_parallel_allreduce Signed-off-by: Abhinav Khattar <[email protected]> * sequence_parallel_enabled -> sequence_parallel Signed-off-by: Abhinav Khattar <[email protected]> * fix T5 FT batch size Signed-off-by: Abhinav Khattar <[email protected]> * seq enabled Signed-off-by: Abhinav Khattar <[email protected]> * T5 sequence length fix Signed-off-by: Abhinav Khattar <[email protected]> * NMT mp fork to spawn Signed-off-by: Abhinav Khattar <[email protected]> * make function signatures consistent across models Signed-off-by: Abhinav Khattar <[email protected]> * make print log Signed-off-by: Abhinav Khattar <[email protected]> * rm unused import Signed-off-by: Abhinav Khattar <[email protected]> * update Dockerfile to install core Signed-off-by: Abhinav Khattar <[email protected]> * keep core path in workspace Signed-off-by: Abhinav Khattar <[email protected]> --------- Signed-off-by: ericharper <[email protected]> Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: SeanNaren <[email protected]> Signed-off-by: Yi Dong <[email protected]> Co-authored-by: ericharper <[email protected]> Co-authored-by: SeanNaren <[email protected]> Co-authored-by: Yi Dong <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]>

fix tests

891070a

Signed-off-by: Yi Dong <[email protected]>

yidong72 requested review from aklife97 and ericharper March 15, 2023 20:07

github-actions bot added the NLP label Mar 15, 2023

[pre-commit.ci] auto fixes from pre-commit.com hooks

60cc2f5

for more information, see https://pre-commit.ci

github-actions bot added the core Changes to NeMo Core label Mar 15, 2023

ericharper approved these changes Mar 15, 2023

View reviewed changes

yidong72 merged commit dea8098 into GPT_integrate_core Mar 17, 2023

yidong72 deleted the RETRO_integrate_core branch March 17, 2023 14:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make sure RETRO integrates well with the core #6207

Make sure RETRO integrates well with the core #6207

yidong72 commented Mar 15, 2023

ericharper left a comment

Make sure RETRO integrates well with the core #6207

Make sure RETRO integrates well with the core #6207

Conversation

yidong72 commented Mar 15, 2023

What does this PR do ?

ericharper left a comment

Choose a reason for hiding this comment