Huvu/mcore retro (NVIDIA#8861)
* update branch

Signed-off-by: eharper <[email protected]>

* Add dist ckpt support for regular optimizers (NVIDIA#7749)

* Add dist ckpt support for regular optimizers

Signed-off-by: Mikołaj Błaż <[email protected]>

* [tutorial] fixed missing RIR scripts file. (NVIDIA#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* fix imports

Signed-off-by: dimapihtar <[email protected]>

* imports fix

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci imports fix

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert asr notebook

Signed-off-by: dimapihtar <[email protected]>

* revert asr notebook

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Pin lhotse=1.19.2 in r1.23.0 (NVIDIA#8303)

Signed-off-by: Piotr Żelasko <[email protected]>

* Cache Aware Streaming tutorial notebook (NVIDIA#8296)

* add notebook

Signed-off-by: Elena Rastorgueva <[email protected]>

* rename old notebook to Buffered_Streaming

Signed-off-by: Elena Rastorgueva <[email protected]>

* call setup_streaming_params in set_default_att_context_size method

Signed-off-by: Elena Rastorgueva <[email protected]>

* update links in docs

Signed-off-by: Elena Rastorgueva <[email protected]>

* update links to tutorials in docs

Signed-off-by: Elena Rastorgueva <[email protected]>

* remove hard-coding

Signed-off-by: Elena Rastorgueva <[email protected]>

* rename var

Signed-off-by: Elena Rastorgueva <[email protected]>

---------

Signed-off-by: Elena Rastorgueva <[email protected]>

* fix path location and branch (NVIDIA#8304)

* fix path location and branch

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* change to a floating point number

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Somshubra Majumdar <[email protected]>

* add deallocate pipeline output optimization (NVIDIA#8279)

* add deallocate pipeline output optimization

Signed-off-by: Jimmy Zhang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix memory leak caused by context parallelism hanging references by omegaconf (NVIDIA#8299)

* save cp_size to self

Signed-off-by: Jimmy Zhang <[email protected]>

* use parallel_state instead of self

Signed-off-by: Jimmy Zhang <[email protected]>

---------

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
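The two bullets above capture the fix pattern: caching a value plucked from a live omegaconf config on `self` keeps the whole config tree reachable, while copying out the scalar (or querying `parallel_state` on demand) does not. A pure-Python illustration of the difference (class names are stand-ins, not the actual NeMo code):

```python
import gc
import weakref

class DictConfigStandIn:
    """Stand-in for an omegaconf DictConfig holding a large config tree."""
    context_parallel_size = 2

class LeakyModule:
    def __init__(self, cfg):
        self.cfg = cfg  # pins the entire config object for the module's lifetime

class FixedModule:
    def __init__(self, cfg):
        # copy out the scalar instead; no reference to cfg survives __init__
        self.cp_size = int(cfg.context_parallel_size)

cfg = DictConfigStandIn()
probe = weakref.ref(cfg)
leaky = LeakyModule(cfg)
del cfg
gc.collect()
print(probe() is not None)  # True: the module still keeps the config alive

cfg2 = DictConfigStandIn()
probe2 = weakref.ref(cfg2)
fixed = FixedModule(cfg2)
del cfg2
gc.collect()
print(probe2() is None)  # True: only the plain int was retained
```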

* remove assertion (NVIDIA#8302)

Signed-off-by: dimapihtar <[email protected]>

* Update PEFT Doc (NVIDIA#8262)

* update peft doc

Signed-off-by: Chen Cui <[email protected]>

* remove old prompt learning doc and notebook

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* Merge branch 'r1.23.0' into chcui/update_peft_doc

Signed-off-by: Chen Cui <[email protected]>

* revert accidental changes

Signed-off-by: Chen Cui <[email protected]>

* revert accidental changes

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>

* Attention encoder-decoder models for multiple speech-to-text tasks  (NVIDIA#8242) (NVIDIA#8324)

* Rebasing canary changes at current main

Signed-off-by: Piotr Żelasko <[email protected]>

* Move the changes from asr transformer to nlp transformer as originally intended

Signed-off-by: Piotr Żelasko <[email protected]>

* update eval to strip spaces before punctuations

Signed-off-by: stevehuang52 <[email protected]>

* update pc strip

Signed-off-by: stevehuang52 <[email protected]>

* [canary] Refactor: `PromptedAudioToTextLhotseDataset` and `EncDecMultiTaskModel` (NVIDIA#8247)

* Create a separate CanaryDataset and use it inside `transformer_bpe_models.py`. Ditches `token_sequence_format`.

Signed-off-by: Piotr Żelasko <[email protected]>

* [canary] Refactor: move changes in transformer_bpe_models.py to Canar… (NVIDIA#8252)

* [canary] Refactor: move changes in transformer_bpe_models.py to CanaryModel

Signed-off-by: Piotr Żelasko <[email protected]>

* Rename `CanaryModel` to `EncDecMultiTaskModel` and remove inheritance from `EncDecTransfModelBPE`; add a separate config for this model

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Rename `CanaryDataset` to `PromptedAudioToTextLhotseDataset`; add `prompt_format_fn` argument; clean-up the `_canary_prompt_format` function a bit

Signed-off-by: Piotr Żelasko <[email protected]>

* Move tokenization into `prompt_format_fn`, fix usage, add docs

Signed-off-by: Piotr Żelasko <[email protected]>

* Backward-compatible utterance validation

Signed-off-by: Piotr Żelasko <[email protected]>

* Improve type annotations

Signed-off-by: Piotr Żelasko <[email protected]>

* config and prompt_fn registration changes from review

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* fix transcribe config

Signed-off-by: stevehuang52 <[email protected]>

* Refactor Canary to follow schema of remaining ASR models (NVIDIA#8260)

* Initial draft of multi task beam decoding strategy

Signed-off-by: smajumdar <[email protected]>

* Stabilize inference

Signed-off-by: smajumdar <[email protected]>

* Update AED Multi Task model to mostly conform to Archetype-Type format. Update config

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add change decoding strategy

Signed-off-by: smajumdar <[email protected]>

* Remove redundant imports

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* Cleanup

Signed-off-by: smajumdar <[email protected]>

* remove asr transformer dependency on nlp

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* copy token_classifier from nlp to asr

Signed-off-by: stevehuang52 <[email protected]>

* Address comments

Signed-off-by: smajumdar <[email protected]>

* Add typing to beam decoding

Signed-off-by: smajumdar <[email protected]>

* Make prompt format configurable

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* drop asr dependency on nlp

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: stevehuang52 <[email protected]>

* fix transcribe, update asr evaluator

Signed-off-by: stevehuang52 <[email protected]>

* Extend the docs for the canary prompt_fn

Signed-off-by: Piotr Żelasko <[email protected]>

* Incorporate changes from Nithin's code review

Signed-off-by: Piotr Żelasko <[email protected]>

* training bug fix and adding launch script for speech_multitask (NVIDIA#8270)

* bug fix and adding launch script for speech_multitask

Signed-off-by: Krishna Puvvada <[email protected]>

* update launch script example in speech_to_text_aed.py

Signed-off-by: Krishna Puvvada <[email protected]>

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>

* Fix: drop_last must be true in validation/test, otherwise the training will hang

Signed-off-by: Piotr Żelasko <[email protected]>
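The drop_last fix above guards against a classic distributed-validation hang: when per-rank shards are uneven and drop_last is false, one rank steps its dataloader an extra time and its collective call never gets a match. A plain-Python sketch of the arithmetic (batch and shard sizes are hypothetical):

```python
import math

def val_steps(per_rank_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches one rank's dataloader yields."""
    if drop_last:
        return per_rank_samples // batch_size
    return math.ceil(per_rank_samples / batch_size)

# Hypothetical 2-rank split with uneven shards: 49 vs 48 samples.
# With drop_last=False, rank 0 runs one extra step; its collective call
# (e.g. the loss all_reduce) never gets a matching call from rank 1.
print(val_steps(49, 8, drop_last=False), val_steps(48, 8, drop_last=False))  # 7 6
# With drop_last=True both ranks run the same number of steps.
print(val_steps(49, 8, drop_last=True), val_steps(48, 8, drop_last=True))    # 6 6
```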

* revert to current transcribe API

Signed-off-by: stevehuang52 <[email protected]>

* revert changes to NLP, update docs

Signed-off-by: stevehuang52 <[email protected]>

* update eval utils

Signed-off-by: stevehuang52 <[email protected]>

* update docs

Signed-off-by: stevehuang52 <[email protected]>

* Remove DALI; rename compute_audio_loss to compute_loss

Signed-off-by: Piotr Żelasko <[email protected]>

* set default use_model_transcribe=False

Signed-off-by: stevehuang52 <[email protected]>

* change os.path.dirname to pathlib

Signed-off-by: stevehuang52 <[email protected]>
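The os.path.dirname → pathlib change above is mechanical; for reference, the two spellings are equivalent (the path below is a hypothetical example, not one from the repository):

```python
import os
from pathlib import Path

p = "/data/models/checkpoint.nemo"
print(os.path.dirname(p))  # /data/models
print(Path(p).parent)      # /data/models (as a Path object)
```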

* [canary] Test for CanaryTokenizer + refactoring (NVIDIA#8285)

* Test for CanaryTokenizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Attempt at refactor...

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Update config for AED models (NVIDIA#8294)

Signed-off-by: smajumdar <[email protected]>

* set default calculate_wer=False in transcribe_speech.py

Signed-off-by: stevehuang52 <[email protected]>

* Attention encoder-decoder models for multiple speech-to-text tasks

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply suggestions from code review, part 1

Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>

* Apply suggestions from code review, part 2

Signed-off-by: Piotr Żelasko <[email protected]>

* Document compute_loss

Signed-off-by: Piotr Żelasko <[email protected]>

* update transcribe_speech.py

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Attention encoder-decoder models for multiple speech-to-text tasks

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: stevehuang52 <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: He Huang (Steve) <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
(cherry picked from commit d10726d)

Co-authored-by: Piotr Żelasko <[email protected]>

* add code for calling mcore_retro in NeMo

* add code for calling mcore_retro in NeMo

* runnable; training curve matches retro mcore and nemo

* working on retro inference

* working on megatron_retro_eval.py and megatron_retro_inference.yaml

* refactoring text_generation_utils code and retro inference relevant files

* clean PR

* resolving quick hacks (reading number of train/valid samples from workdir, discrepancy in total samples and samples with neighbors retrieved, tokenizers)

* clean repository

* revert changes to inference/eval code to original in main

* clean code

* runnable training code, with already implemented eval code

* [tutorial] fixed missing RIR scripts file. (NVIDIA#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (NVIDIA#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* Add Bert HF checkpoint converter (NVIDIA#8088)

* Add Bert HF checkpoint converter

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reformat

Signed-off-by: yaoyu-33 <[email protected]>

* Add BERT ONNX export

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add NeMo BERT to HF BERT script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Clean code

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update argument names

Signed-off-by: yaoyu-33 <[email protected]>

* Update build_transformer_config in Bert

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bobby Chen <[email protected]>

* revert to original eval code files

* revert to original eval code files 2

* revert to original eval code files 3

* revert to original eval code files 4

* clean code

* clean code

* update my code to support changes from latest main

* commit before rebase r1.23.0

* Multimodal r1.23.0 bug fix  (NVIDIA#8315)

* Rename quick-gelu

Signed-off-by: yaoyu-33 <[email protected]>

* ddpm config guard

Signed-off-by: yaoyu-33 <[email protected]>

* Fix ddpm edit api

Signed-off-by: yaoyu-33 <[email protected]>

* Fix insert_image_token cfg issue

Signed-off-by: yaoyu-33 <[email protected]>

* neva updates

Signed-off-by: yaoyu-33 <[email protected]>

* reformat

Signed-off-by: yaoyu-33 <[email protected]>

* Add back jenkins

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix jenkins

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bugs

Signed-off-by: yaoyu-33 <[email protected]>

* Update default neva template

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* copy paste files from r1.23.0

* clean PR

* Fixes for MoE parameter passing & use of AutoTokenizer/Model for mistral. (NVIDIA#8272)

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 (NVIDIA#8334)

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
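For context on the names in this fix: `cu_seqlens` is the prefix-sum tensor of sequence boundaries consumed by varlen attention kernels, and `max_seqlen` the longest sequence in the packed batch; the fix keeps them around for the later micro-batches instead of only the first. A plain-Python sketch of how they are derived (the sequence lengths are hypothetical):

```python
# cu_seqlens: cumulative boundaries delimiting packed sequences, so that
# sequence i occupies positions cu_seqlens[i]:cu_seqlens[i + 1].
seq_lens = [5, 3, 4]  # one hypothetical packed micro-batch
cu_seqlens = [0]
for n in seq_lens:
    cu_seqlens.append(cu_seqlens[-1] + n)
max_seqlen = max(seq_lens)
print(cu_seqlens, max_seqlen)  # [0, 5, 8, 12] 5
```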

* Remove asr webapp (NVIDIA#8347)

Signed-off-by: smajumdar <[email protected]>

* remove _target_ at model level in aed config (NVIDIA#8351)

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>

* revert changes for tts and asr

* Add change_vocabulary and save_tokenizers() support to Multitask ASR models (NVIDIA#8357)

* Add change_vocabulary and save_tokenizers() support

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/asr/models/aed_multitask_models.py

Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>

* Change default (NVIDIA#8371)

Signed-off-by: smajumdar <[email protected]>

* implement retro's own fwd_bwd_step() and validation_step() so they don't take the argument first_val_step, which the MLM commit doesn't support

* adding megatron compile_helpers(), in future can be fixed with correct MLM commit

* bug fix in fast-conformer-aed.yaml and adding jenkins test for speech_to_text_aed model (NVIDIA#8368)

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* Enable megatron core loggers for GPT pretraining (NVIDIA#8354)

* Logging changes tested for gpt_pretraining

Signed-off-by: Aishwarya Bhandare <[email protected]>

* Additional args

Signed-off-by: Aishwarya Bhandare <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* mcore ds fix (NVIDIA#8283)

* [tutorial] fixed missing RIR scripts file. (NVIDIA#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (NVIDIA#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* mcore ds fix

Signed-off-by: Dmytro Pykhtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

Signed-off-by: dimapihtar <[email protected]>

* revert asr files

Signed-off-by: dimapihtar <[email protected]>

* add comments

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

Signed-off-by: dimapihtar <[email protected]>

* update mcore version

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

Signed-off-by: dimapihtar <[email protected]>

* update mcore commit

Signed-off-by: dimapihtar <[email protected]>

* fix Bert unit tests

Signed-off-by: dimapihtar <[email protected]>

* update bert tests

Signed-off-by: dimapihtar <[email protected]>

* fix bert mcore test

Signed-off-by: dimapihtar <[email protected]>

* fix gpt jenkins tests

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update apex & TE commits

Signed-off-by: dimapihtar <[email protected]>

* revert apex installation

Signed-off-by: dimapihtar <[email protected]>

* turn off the fusion for jenkins

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>

* addressing Eric's reviews

* adding existing implementation RETRO files

* adding existing implementation RETRO files

* Add Finetuning tutorial with HF Datasets (NVIDIA#8356)

* Add Finetuning tutorial with HF Datasets

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* update on Som comments

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>

* release updates (NVIDIA#8378)

* [tutorial] fixed missing RIR scripts file. (NVIDIA#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (NVIDIA#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* mcore ds fix

Signed-off-by: Dmytro Pykhtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

Signed-off-by: dimapihtar <[email protected]>

* revert asr files

Signed-off-by: dimapihtar <[email protected]>

* add comments

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

Signed-off-by: dimapihtar <[email protected]>

* update mcore version

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

Signed-off-by: dimapihtar <[email protected]>

* update mcore commit

Signed-off-by: dimapihtar <[email protected]>

* fix Bert unit tests

Signed-off-by: dimapihtar <[email protected]>

* update bert tests

Signed-off-by: dimapihtar <[email protected]>

* fix bert mcore test

Signed-off-by: dimapihtar <[email protected]>

* fix gpt jenkins tests

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* add mock ds test

Signed-off-by: dimapihtar <[email protected]>

* add test for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* mcore ds fix

Signed-off-by: dimapihtar <[email protected]>

* data input fix

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>

* MCore dataset compatibility for tokenizers (NVIDIA#8390)

* Add unique_identifiers for all tokenizers and eod for SentencePieceTokenizer

Signed-off-by: Valerie Sarge <[email protected]>

* Add generalized token aliases to TokenizerSpec to conform with MegatronTokenizer's interface. Remove now-redundant individual fixes from AutoTokenizer and SentencePieceTokenizer.

Signed-off-by: Valerie Sarge <[email protected]>

---------

Signed-off-by: Valerie Sarge <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
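The alias mechanism described above can be pictured as a base-class property that maps a MegatronTokenizer-style name such as `eod` onto whatever id a concrete NeMo tokenizer actually defines. A hedged sketch, not the actual TokenizerSpec code (class and attribute names here are illustrative):

```python
class TokenizerSpecSketch:
    """Illustrative base class bridging the two naming conventions."""

    @property
    def eod(self):
        # prefer an explicit end-of-document id, fall back to EOS
        for name in ("eod_id", "eos_id"):
            if hasattr(self, name):
                return getattr(self, name)
        raise AttributeError("tokenizer defines neither eod_id nor eos_id")

class SentencePieceLike(TokenizerSpecSketch):
    eos_id = 2  # SentencePiece models typically expose EOS, not EOD

print(SentencePieceLike().eod)  # 2
```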

* Mcore customization doc (NVIDIA#8298)

* [tutorial] fixed missing RIR scripts file. (NVIDIA#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (NVIDIA#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* Add Bert HF checkpoint converter (NVIDIA#8088)

* Add Bert HF checkpoint converter

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reformat

Signed-off-by: yaoyu-33 <[email protected]>

* Add BERT ONNX export

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add NeMo BERT to HF BERT script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Clean code

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update argument names

Signed-off-by: yaoyu-33 <[email protected]>

* Update build_transformer_config in Bert

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bobby Chen <[email protected]>

* initial placeholder

Signed-off-by: Huiying Li <[email protected]>

* add to intro/index.rst

Signed-off-by: Huiying Li <[email protected]>

* initial content update

Signed-off-by: Huiying Li <[email protected]>

* add diff images

Signed-off-by: Huiying Li <[email protected]>

size

Signed-off-by: Huiying Li <[email protected]>

* minor fixes

* minor language change

Signed-off-by: Chen Cui <[email protected]>

* clean changes

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: Huiying Li <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

* wer fix (NVIDIA#8404)

Signed-off-by: Travis Bartley <[email protected]>

* updated link to pubmed (NVIDIA#8402)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>

* Update NFA video download link (NVIDIA#8406)

* update nfa nasa video link

Signed-off-by: Elena Rastorgueva <[email protected]>

* update link in markdown

Signed-off-by: Elena Rastorgueva <[email protected]>

---------

Signed-off-by: Elena Rastorgueva <[email protected]>

* revert changes (NVIDIA#8410)

Signed-off-by: Chen Cui <[email protected]>

* Fix dreambooth data sampler issue (NVIDIA#8400)

* Turn on drop last

Signed-off-by: yaoyu-33 <[email protected]>

* Some neva fixes

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fixed errors in the CTM gen functions (NVIDIA#8416)

Signed-off-by: Taejin Park <[email protected]>

* add ensemble decoding fix (NVIDIA#8427)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>

* SDE bugfix log (NVIDIA#8430)

Signed-off-by: George <[email protected]>

* mcore customization doc minor fix (NVIDIA#8421)

Signed-off-by: Huiying Li <[email protected]>

* NeMo-Mistral to HF converter bugfix. (NVIDIA#8353)

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Fixing mcore bert for TP, PP and SP (NVIDIA#8336)

* Fixing mcore bert for TP, PP and SP

* Fixing mcore bert for TP, PP and SP

* Fixing mcore version

* Fixing mcore version

* Update Jenkinsfile

Signed-off-by: Shanmugam Ramasamy <[email protected]>

* Update Jenkinsfile

Signed-off-by: Shanmugam Ramasamy <[email protected]>

* Update Jenkinsfile

Signed-off-by: Shanmugam Ramasamy <[email protected]>

---------

Signed-off-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add settings to suppress bf16 compile errors in CI on V100 (NVIDIA#8481)

* Add settings to suppress bf16 compile errors in CI on V100

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* MoE parameter passing (NVIDIA#8255)

* MoE parameter passing

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Pass EP/MoE params in consumer scripts.

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* PR fixes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Use latest commit of mcore-0.5

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* CI fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update k2 version (NVIDIA#8478) (NVIDIA#8492)

Signed-off-by: Vladimir Bataev <[email protected]>

* Add fp8 support for SD/Update notebook paths (NVIDIA#8489)

* Add fp8 support for SD/Update notebook paths

Signed-off-by: Mingyuan Ma <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mingyuan Ma <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* pin to 0.5.0 (NVIDIA#8465)

Signed-off-by: eharper <[email protected]>

* Update NeMo Multimodal Requirements (NVIDIA#8515)

* Update requirements_multimodal.txt

Signed-off-by: yaoyu-33 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update github raw content link (NVIDIA#8517)

Signed-off-by: Chen Cui <[email protected]>

* Add dep notice for notebooks (NVIDIA#8522)

* add dep notice

Signed-off-by: eharper <[email protected]>

* revert

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Revert FP8 integration (NVIDIA#8520)

* Revert FP8 integration

Signed-off-by: Mingyuan Ma <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mingyuan Ma <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update data prep notebook (NVIDIA#8532)

Signed-off-by: Mingyuan Ma <[email protected]>

* before update branch with latest r1.23.0

* update to run with MLM ae2817b3dde4efb1515061a5311d01d8f85bd99c (runnable training and saving checkpoint)

* remove compile_helpers

* revert changes from main branch to r1.23.0

* adding *_legacy files

* update MLM commit in Jenkinsfile to latest

* debugging Jenkinstest: test different mcore import in retro_dataset

* update Jenkinsfile edit megatron_retro_mutransfer_pretrain_legacy.py

* removing all mcore RETRO to pass the Jenkinstest

* fixing import legacy problem for tests/collections/nlp/test_indexed_retrieval_dataset.py

* update Jenkinsfile file to use TE v0.7

* update NeMo to work with latest mcore RETRO (solving TE problems)

* update TE commit Jenkinsfile to be the same with r1.23.0's Jenkinsfile

* update commit for MLM

* jenkinstest debugging

* temporarily fix RETRO's __init__ for jenkinstest

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster

* add model.data.dataloader_type=cyclic to jenkinsfile

* update code to work with latest megatron-lm main 81dab6067

* update M-LM commit in Jenkinsfile to latest main M-LM 81dab6067

* fix to bypass CI bf16 test problem (following this PR https://github.com/NVIDIA/NeMo/pull/8481/files)

* isort and black

* adjusting model.micro_batch_size to 1

* fix BRANCH = 'r1.23.0'

* replace tutorials dir from main branch to huvu/mcore_retro

* fix minor merge conflicts

* update Jenkinsfile

* runnable with a temporary fix from Jacek (unfound -unfinished problem)

* runnable with a temporary fix from Jacek (unfound -unfinished problem)

* modified nlp_overrides.py back to original

* fix checkpoint from Jacek Bieniusiewicz

* config Jenkinsfile test

* set RETRO Jenkins MBS to 1

* black fix

* isort fix

* update TE commit

* update to latest Jenkinsfile with latest container and commits

* remove new RETRO jenkinstest

* merge latest main

* put RETRO Jenkinstest to the right place

* update code for megatron_retro_pretraining_legacy.py

* untrack ipa_cmudict-0.7b_nv23.01.txt

* untrack ipa_cmudict-0.7b_nv23.01.txt

* set config in megatron_retro_pretraining_legacy.py to megatron_retro_config_legacy

* update new RETRO jenkinstest to run faster

* merging latest main, and edit Jenkinstest

* update Jenkinstest for new RETRO to run faster

* fix isort

* fix whitespace

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Travis Bartley <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: George <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Mingyuan Ma <[email protected]>
Co-authored-by: eharper <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Huiying Li <[email protected]>
Co-authored-by: tbartley94 <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: Ming <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
1 parent 1cc530c commit 63b44ca
Showing 15 changed files with 1,795 additions and 556 deletions.
75 changes: 67 additions & 8 deletions Jenkinsfile
@@ -125,6 +125,7 @@ pipeline {
sh 'python tests/core_ptl/check_imports.py --domain "nlp"'
}
}

stage('L0: Unit Tests GPU') {
steps {
sh 'NEMO_NUMBA_MINVER=0.53 pytest -m "not pleasefixme" --with_downloads'
@@ -3517,6 +3518,64 @@ pipeline {
failFast true
steps {
sh "python examples/nlp/language_modeling/megatron_retro_pretraining.py \
trainer.num_nodes=1 \
trainer.devices=2 \
trainer.precision=bf16 \
trainer.accelerator=gpu \
model.data.data_prefix=['none'] \
exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
model.mcore_gpt=True \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.optim.name=distributed_fused_adam \
model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
model.data.num_workers=4 \
model.micro_batch_size=1 \
model.data.shuffle_documents=False \
trainer.val_check_interval=30 \
+trainer.num_sanity_val_steps=0 \
model.init_method_std=0.023 \
model.optim.lr=6.0e-4 \
model.megatron_amp_O2=True \
model.data.splits_string=\'\"98,2,0\"\' \
model.data.dataloader_type=cyclic \
trainer.max_steps=10"
sh "python examples/nlp/language_modeling/megatron_retro_pretraining.py \
trainer.num_nodes=1 \
trainer.devices=2 \
trainer.precision=bf16 \
trainer.accelerator=gpu \
model.data.data_prefix=['none'] \
exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
model.mcore_gpt=True \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.optim.name=distributed_fused_adam \
model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
model.data.num_workers=4 \
model.micro_batch_size=1 \
model.data.shuffle_documents=False \
trainer.val_check_interval=30 \
+trainer.num_sanity_val_steps=0 \
model.init_method_std=0.023 \
model.optim.lr=6.0e-4 \
model.megatron_amp_O2=True \
model.data.splits_string=\'\"98,2,0\"\' \
model.data.dataloader_type=cyclic \
trainer.max_steps=20"
sh "rm -rf examples/nlp/language_modeling/mcore_retro_results"
}
}
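The stage above exercises checkpoint resume by launching the same pretraining script twice against one `exp_dir`: the first run stops at `trainer.max_steps=10` and writes a checkpoint, the second raises the limit to 20 and must resume from it. A minimal sketch of how such a command line is assembled is below; the script path and override names are copied from the pipeline stage, but treat the helper itself as an illustration, not a NeMo API:

```python
def build_retro_cmd(max_steps: int, exp_dir: str) -> list:
    """Build a Hydra-style override command for one RETRO pretraining run.

    Only a representative subset of the stage's overrides is included;
    the full stage also sets precision, parallelism sizes, optimizer, etc.
    """
    overrides = [
        "trainer.num_nodes=1",
        "trainer.devices=2",
        "exp_manager.exp_dir=%s" % exp_dir,
        "model.micro_batch_size=1",
        "model.data.dataloader_type=cyclic",
        "trainer.max_steps=%d" % max_steps,
    ]
    return ["python",
            "examples/nlp/language_modeling/megatron_retro_pretraining.py"] + overrides

# Same exp_dir both times: run 2 finds run 1's checkpoint and resumes.
first_run = build_retro_cmd(10, "examples/nlp/language_modeling/mcore_retro_results")
second_run = build_retro_cmd(20, "examples/nlp/language_modeling/mcore_retro_results")
```

Because both invocations share an experiment directory, the test fails fast if resume logic regresses: step 11 of the second run must start from the step-10 checkpoint rather than from scratch.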
stage('L2: (Legacy) Megatron RETRO Pretraining and Resume Training') {
when {
anyOf {
branch 'main'
changeRequest target: 'main'
}
}
failFast true
steps {
sh "python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
trainer.devices=2 \
trainer.num_nodes=1 \
trainer.accelerator=gpu \
@@ -3527,7 +3586,7 @@
trainer.precision=16 \
trainer.gradient_clip_val=1.0 \
trainer.val_check_interval=10 \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
model.data.data_prefix='' \
model.data.knn_index='' \
model.data.retrieval_prefix='' \
@@ -3546,7 +3605,7 @@
model.enc_cross_attention=[1] \
model.dec_cross_attention=[1] \
+model.data.mock=True"
sh "python examples/nlp/language_modeling/megatron_retro_pretraining.py \
sh "python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
trainer.devices=2 \
trainer.num_nodes=1 \
trainer.accelerator=gpu \
Expand All @@ -3557,7 +3616,7 @@ pipeline {
trainer.precision=16 \
trainer.gradient_clip_val=1.0 \
trainer.val_check_interval=10 \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
model.data.data_prefix='' \
model.data.knn_index='' \
model.data.retrieval_prefix='' \
@@ -3576,10 +3635,10 @@
model.enc_cross_attention=[1] \
model.dec_cross_attention=[1] \
+model.data.mock=True"
sh "rm -rf examples/nlp/language_modeling/retro_results"
sh "rm -rf examples/nlp/language_modeling/retro_legacy_results"
}
}
stage('L2: Megatron RETRO muTransfer Pretraining Performance') {
stage('L2: (Legacy) Megatron RETRO muTransfer Pretraining Performance') {
when {
anyOf {
branch 'main'
@@ -3600,7 +3659,7 @@
trainer.limit_val_batches=0 \
trainer.gradient_clip_val=1.0 \
+trainer.num_sanity_val_steps=0 \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_results/ \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results/ \
+exp_manager.version=smalltest \
model.data.neighbors=2 \
model.megatron_amp_O2=False \
@@ -3651,15 +3710,15 @@ import torch
if not (torch.cuda.is_available() and 'A100' in torch.cuda.get_device_name()):
import sys
sys.exit(0)
event_file = list(pathlib.Path('examples/nlp/language_modeling/retro_results/megatron_retro/smalltest').glob('events.out.tfevents*'))[0]
event_file = list(pathlib.Path('examples/nlp/language_modeling/retro_legacy_results/megatron_retro/smalltest').glob('events.out.tfevents*'))[0]
ea = EventAccumulator(str(event_file)).Reload()
vals = []
for i in ea.Scalars('reduced_train_loss'):
vals.append(i.value)
training_curve = pd.DataFrame({'loss': vals})
gt_curve = pd.read_csv('/home/TestData/nlp/megatron_retro/expected_learning_curve.csv')
assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
sh "rm -rf examples/nlp/language_modeling/retro_results"
sh "rm -rf examples/nlp/language_modeling/retro_legacy_results"
}
}
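The muTransfer stage above ends with a regression check: training-loss scalars are read from the run's TensorBoard event file and compared against a stored reference curve with `assert_frame_equal`. A standalone sketch of that comparison is below; the scalar read is stubbed with example values (the real stage pulls them via TensorBoard's `EventAccumulator`), and the tolerances mirror the ones in the pipeline:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def check_learning_curve(vals, gt_curve, rtol=1e-3, atol=1e-3):
    """Raise AssertionError if the observed loss curve drifts from the
    reference beyond the given relative/absolute tolerances."""
    training_curve = pd.DataFrame({'loss': list(vals)})
    assert_frame_equal(training_curve, gt_curve, rtol=rtol, atol=atol)

# Example: a curve within tolerance of the reference passes silently.
reference = pd.DataFrame({'loss': [10.0, 9.5, 9.1]})
check_learning_curve([10.0, 9.5, 9.1000005], reference)
```

Comparing whole DataFrames (rather than, say, only the final loss) makes the check sensitive to convergence-shape regressions, which is the point of a muTransfer performance test.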
stage('L2: BioMegatron Bert NER Task') {
@@ -158,4 +158,4 @@ model:
name: CosineAnnealing
warmup_steps: 500
constant_steps: 50000
min_lr: 2e-5
min_lr: 2e-5
