MoE parameter passing (#8490)

* MoE parameter passing (#8255)

* MoE parameter passing

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Pass EP/MoE params in consumer scripts.

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* PR fixes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Use latest commit of mcore-0.5

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* CI fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Jiaqiz/option to disable adapters & merge all lora layers (#8029)

* Added LoRA support for the Dense layer of Attention

* Added LoRA MLP support to MCore and NeMo models.

* Change LoRA config default to QKV.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed bug with ddp training.

* use adapter only when it is enabled

Signed-off-by: jiaqi zeng <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lora merge script (#8113)

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: Adi Renduchintala <[email protected]>

* add peft ckpt to nemo

Signed-off-by: Jiaqi Zeng <[email protected]>

* merge lora weights for all layers, mcore only

Signed-off-by: Jiaqi Zeng <[email protected]>

* support/fix cpu initialization

Signed-off-by: Chen Cui <[email protected]>

* add example usage

Signed-off-by: Chen Cui <[email protected]>

* fix TP due to distributed checkpoint

Signed-off-by: Chen Cui <[email protected]>

* updating the logic of merging lora weights for all layers, mcore only

Signed-off-by: Jiaqi Zeng <[email protected]>

* MCoreMixin changes.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* merge in fp32 then cast back

Signed-off-by: Jiaqi Zeng <[email protected]>

* remove ckpt to nemo

Signed-off-by: Jiaqi Zeng <[email protected]>

* fix import

Signed-off-by: Jiaqi Zeng <[email protected]>

---------

Signed-off-by: jiaqi zeng <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
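
A minimal sketch, not part of this commit, of the fp32 merge-and-cast flow the bullets above describe ("merge lora weights for all layers", "merge in fp32 then cast back"); the function and tensor names are illustrative, not NeMo's merge script:

import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float) -> torch.Tensor:
    # Accumulate the low-rank update in fp32 to avoid bf16/fp16 rounding error,
    # then cast the merged weight back to the base weight's dtype.
    merged = W.float() + scale * (B.float() @ A.float())
    return merged.to(W.dtype)

W = torch.randn(32, 16, dtype=torch.bfloat16)  # frozen base weight
A = torch.randn(8, 16)                         # LoRA down-projection (r=8)
B = torch.randn(32, 8)                         # LoRA up-projection
print(merge_lora(W, A, B, scale=2.0).dtype)    # torch.bfloat16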

* Update k2 version (#8478)

Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add mcore full TE transformer layer spec (#8328)

* Add spec and implement autocast layer

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jan Baczek <[email protected]>

* remove try-catches; these dependencies are mandatory for this file

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jan Baczek <[email protected]>

* Check out this cool try/except clause

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import

Signed-off-by: Jan Baczek <[email protected]>

* Add import tests to Jenkinsfile

Signed-off-by: Jan Baczek <[email protected]>

* Move import tests to Jenkins and remove code that is developed only for passing tests

Signed-off-by: Jan Baczek <[email protected]>

* Make test robust to faulty base configs

Signed-off-by: Jan Baczek <[email protected]>

* Use proper GPT implementation in the test

Signed-off-by: Jan Baczek <[email protected]>

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Sudhakar Singh <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Sudhakar Singh <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* Add TE knobs to the copy of AutocastTransformerLayer

Signed-off-by: Jan Baczek <[email protected]>

* Add TE knobs to the copy of AutocastTransformerLayer

Signed-off-by: Jan Baczek <[email protected]>

* Add dummy parameter to accommodate the changes in mcore

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update mcore to 0.5.0 in Jenkins pipeline

Signed-off-by: Jan Baczek <[email protected]>

* Bump mcore commit. This is a commit from ToT (top of tree), not a release.

Signed-off-by: Jan Baczek <[email protected]>

* Remove from the test config option that is incompatible with bias_activation_fusion

Signed-off-by: Jan Baczek <[email protected]>

* Bump TE version in CI to 1.4

Signed-off-by: Jan Baczek <[email protected]>

* Update test

Signed-off-by: Jan Baczek <[email protected]>

* Change precision for the test - current runners don't support bf16

Signed-off-by: Jan Baczek <[email protected]>

---------

Signed-off-by: Jan Baczek <[email protected]>
Signed-off-by: jbaczek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sudhakar Singh <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Handle float limit_val_batches (#8426)

* Handle float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Rectify reconfiguration of float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Remove unused imports

Signed-off-by: Abhishree <[email protected]>

* Scale len(val_dataloader) with float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Return len(dataloader) in microbatches

Signed-off-by: Abhishree <[email protected]>

* Add back resetting of num val samples

Signed-off-by: Abhishree <[email protected]>

* Fix to ensure float limit_val_batches is multiple of num_micro_batches

Signed-off-by: Abhishree <[email protected]>

* Remove forcing eval samples to 1 for float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Fix bug wrt 0 limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Add missing mock_dataset line

Signed-off-by: Abhishree <[email protected]>

* Avoid ensuring limit_val_batches is a multiple of microbatches for 1.0

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Restore the hack forcing number of validation and test epochs to 1

Signed-off-by: Jan Baczek <[email protected]>

* Change limit_val_batches to 1.0 for GPT pretraining test. The integer value is covered in other tests

Signed-off-by: Jan Baczek <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Jan Baczek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jan Baczek <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
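
A minimal sketch, not part of this commit, of the float limit_val_batches handling described above; the function name and exact rounding policy are illustrative assumptions:

def scaled_val_batches(num_val_batches: int, limit_val_batches: float, num_micro_batches: int) -> int:
    # Scale the dataloader length by the float fraction, then round down to a
    # whole number of global batches (a multiple of num_micro_batches).
    if limit_val_batches == 1.0:  # 1.0 keeps the full dataloader
        return num_val_batches
    scaled = int(num_val_batches * limit_val_batches)
    return (scaled // num_micro_batches) * num_micro_batches

print(scaled_val_batches(100, 0.25, 4))  # 24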

* Fix tutorial links in user guide (#8497)

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Sequence Parallel for LoRA (#8369)

* support lora + sequence parallel

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add more comments

Signed-off-by: Chen Cui <[email protected]>

* add lora SP CI test

Signed-off-by: Chen Cui <[email protected]>

* support lora for all linear modules as in #7988

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Call proper method to replace (#8498)

Signed-off-by: Naga Venkatesh Gavini <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Added memory logger (#8395)

* Added memory logger

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Canary refactor for Riva (#8363)

* initial commit of bleu score tracking

Signed-off-by: Travis Bartley <[email protected]>

* initial commit, refactoring aed models for riva

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updating Canary to support torch metrics

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* style fixes

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* missed an empty batch conditional

Signed-off-by: Travis Bartley <[email protected]>

* Fixing dataloader issues

Signed-off-by: Travis Bartley <[email protected]>

* Finishing merge conflict with transcribe update

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* style fix

Signed-off-by: Travis Bartley <[email protected]>

* copyright header fix

Signed-off-by: Travis Bartley <[email protected]>

* yet another merge conflict

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* making paired data management safer

Signed-off-by: Travis Bartley <[email protected]>

* sentencepiece needs bigger tokenizer...

Signed-off-by: Travis Bartley <[email protected]>

* sentencepiece tokenizer vocab needs to be +2 from vocab for canary

Signed-off-by: Travis Bartley <[email protected]>

* Update canary tokenizer to be more generic, updated metrics to manage special tokens removal themselves.

Signed-off-by: Travis Bartley <[email protected]>

* merge conflict

Signed-off-by: Travis Bartley <[email protected]>

* Simplified tokenizer and corrected bug in dataloader

Signed-off-by: Travis Bartley <[email protected]>

* Cleaning up docstrings and fixing inference bug.

Signed-off-by: Travis Bartley <[email protected]>

* adding example scripts

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleaning up useless imports

Signed-off-by: Travis Bartley <[email protected]>

* adding unit tests

Signed-off-by: Travis Bartley <[email protected]>

* fixing unit tests

Signed-off-by: Travis Bartley <[email protected]>

* cfg name change

Signed-off-by: Travis Bartley <[email protected]>

* adding custom check to pass pytests

Signed-off-by: Travis Bartley <[email protected]>

* removing print script

Signed-off-by: Travis Bartley <[email protected]>

* catching bugs regarding tokens.

Signed-off-by: Travis Bartley <[email protected]>

* added docstrings and made examples scripts more generic

Signed-off-by: Travis Bartley <[email protected]>

* docstring deleted by accident

Signed-off-by: Travis Bartley <[email protected]>

* plurals in namespace

Signed-off-by: Travis Bartley <[email protected]>

* changing example script

Signed-off-by: Travis Bartley <[email protected]>

---------

Signed-off-by: Travis Bartley <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add alpha scaling to lora (#8248)

* removed deprecated peft model

Signed-off-by: arendu <[email protected]>

* add alpha

Signed-off-by: arendu <[email protected]>

* default for alpha

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add alpha scaling to lora (#8483)

* coldfix (#8412)

Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Fixed errors in the CTM gen functions (#8416) (#8420)

Signed-off-by: Taejin Park <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add change_vocabulary and save_tokenizers() support to Multitask ASR models (#8357) (#8367)

* Add change_vocabulary and save_tokenizers() support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/asr/models/aed_multitask_models.py

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* fix path location and branch (#8314)

* fix path location and branch (#8304)

* fix path location and branch

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* change to a floating point number

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Somshubra Majumdar <[email protected]>

* update branch in tutorial

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Michal Futrega <[email protected]>

* Add TP comm overlap knobs to AutocastTransformerLayer (#8290)

Signed-off-by: Jaemin Choi <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* add deallocate pipeline output optimization (#8279) (#8318)

* add deallocate pipeline output optimization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* remove assertion (#8302) (#8321)

Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 (#8334) (#8346)

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Enable megatron core loggers for GPT pretraining (#8354) (#8384)

* Logging changes tested for gpt_pretraining

* Additional args

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Fix dreambooth data sampler issue (#8400) (#8413)

* Turn on drop last

* Some neva fixes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* add ensemble decoding fix (#8427) (#8433)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* NeVA Tutorial Notebook (#8217)

* init commit - neva tutorial

Signed-off-by: Pratyush Muthukumar <[email protected]>

* NeVA tutorial notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>

* init commit - neva tutorial

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* NeVA tutorial notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* add inference via script

Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>

* add codeblocks to run torchrun in notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>

---------

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* mcore customization doc minor fix (#8421) (#8437)

Signed-off-by: Huiying Li <[email protected]>
Co-authored-by: Huiying <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add `loop_labels` algorithm for TDT greedy decoding (#8215)

* Add `loop_labels` algorithm for TDT greedy decoding

Signed-off-by: Vladimir Bataev <[email protected]>

* Use `loop_labels` by default

Signed-off-by: Vladimir Bataev <[email protected]>

* Loop labels greedy decoding v2

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments. Clean up

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Add tests for batched hypotheses

Signed-off-by: Vladimir Bataev <[email protected]>

* Add tests for batched alignments

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix comment

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix test

Signed-off-by: Vladimir Bataev <[email protected]>

* Add computer for TDT

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix TDT decoding algorithm

Signed-off-by: Vladimir Bataev <[email protected]>

* Use loop frames by default for TDT

Signed-off-by: Vladimir Bataev <[email protected]>

* Remove "loop frames" implementation for TDT

Signed-off-by: Vladimir Bataev <[email protected]>

* Clean up

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix confidence. Use tensor for durations.

Signed-off-by: Vladimir Bataev <[email protected]>

---------

Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add dist ckpt support for regular optimizers (#7749) (#8293)

* Add dist ckpt support for regular optimizers

* [tutorial] fixed missing RIR scripts file. (#8257)

* fix imports

* imports fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci imports fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert asr notebook

* revert asr notebook

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* Multimodal r1.23.0 bug fix  (#8315) (#8339)

* Rename quick-gelu

* ddpm config guard

* Fix ddpm edit api

* Fix insert_image_token cfg issue

* neva updates

* reformat

* Add back jenkins

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix jenkins

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bugs

* Update default neva template

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* mcore ds fix (#8283) (#8385)

* [tutorial] fixed missing RIR scripts file. (#8257)

* add values to en tts dict (#7879)

* mcore ds fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

* revert asr files

* add comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

* update mcore version

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

* update mcore commit

* fix Bert unit tests

* update bert tests

* fix bert mcore test

* fix gpt jenkins tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update apex & TE commits

* revert apex installation

* turn off the fusion for jenkins

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* MCore dataset compatibility for tokenizers (#8390) (#8397)

* Add unique_identifiers for all tokenizers and eod for SentencePieceTokenizer

* Add generalized token aliases to TokenizerSpec to conform with MegatronTokenizer's interface. Remove now-redundant individual fixes from AutoTokenizer and SentencePieceTokenizer.

---------

Signed-off-by: Valerie Sarge <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Canary: inference tokenization improvements; preserving custom keys when creating tarred manifests (#8432)

* Improvements for Canary:

- carry over custom keys when creating tarred manifests
- selectable text field in ASR eval
- get rid of prompt slicing, create proper inference prompts

Signed-off-by: Piotr Żelasko <[email protected]>

* set ensure_ascii=False in tarred conversion to avoid breaking tokenizers trained on UTF-8 encoding

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* add  sbert to IR (#8445)

* add  sbert to IR

Signed-off-by: ataghibakhsh <[email protected]>

* add doc

Signed-off-by: ataghibakhsh <[email protected]>

* fix the auto_tokenizer property method reset bug

Signed-off-by: ataghibakhsh <[email protected]>

* addressed bot comments

Signed-off-by: ataghibakhsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: ataghibakhsh <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* Update readme (#8440)

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* landing pages added

* landing page added for vision

* landing pages updated

* some minor changes to the main readme

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* typo fixed

* update

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* NeMo-Mistral to HF converter bugfix. (#8353) (#8442)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Fixing mcore bert for TP, PP and SP (#8336) (#8443)

* Fixing mcore bert for TP, PP and SP

* Fixing mcore bert for TP, PP and SP

* Fixing mcore version

* Fixing mcore version

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

---------

Signed-off-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add LoRA support to all linear layers (#7988)

* Added LoRA support for the Dense layer of Attention

* Added LoRA MLP support to MCore and NeMo models.

* Change LoRA config default to QKV.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed bug with ddp training.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MCoreMixin changes.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* using new commit of meg-LM

Signed-off-by: arendu <[email protected]>

* add cpu_offloading_num_layers to conversion script until bug in megatron is fixed

Signed-off-by: Chen Cui <[email protected]>

* fix peft mixin arguments to follow mcore 0.5

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update megatron commit to fix ci error

Signed-off-by: Chen Cui <[email protected]>

* try to fix ci

Signed-off-by: Chen Cui <[email protected]>

* try to fix ci

Signed-off-by: Chen Cui <[email protected]>

* add cfg default

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: arendu <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Co-authored-by: arendu <[email protected]>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add Neva Template for NV-DPO Models  (#8358)

* add/rename from nvgpt to nv_steerlm, add nv_dpo template

Signed-off-by: HuiyingLi <[email protected]>

* add nv_dpo conversation to accommodate empty system message

Signed-off-by: HuiyingLi <[email protected]>

* handle nv_dpo template text generation

Signed-off-by: HuiyingLi <[email protected]>

* add prompt string to nvgpt

Signed-off-by: HuiyingLi <[email protected]>

* bugfix for inference prompt template

Signed-off-by: HuiyingLi <[email protected]>

* bug fix for grabbing clean text

Signed-off-by: Huiying Li <[email protected]>

* fix code format

Signed-off-by: Huiying Li <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Rebase scaling alpha

Signed-off-by: Michal Futrega <[email protected]>

* default for alpha

Signed-off-by: arendu <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Rebase scaling alpha

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jaemin Choi <[email protected]>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: ataghibakhsh <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: arendu <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Co-authored-by: arendu <[email protected]>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jaemin Choi <[email protected]>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: ataghibakhsh <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Michal Futrega <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
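
For reference, a minimal sketch of alpha scaling in the standard LoRA formulation that this commit adds a knob for; NeMo's actual module wiring differs, and all names below are illustrative:

import torch

def lora_forward(x, W, A, B, alpha: float, r: int):
    # Standard LoRA: y = x W^T + (alpha / r) * x A^T B^T, with W frozen.
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)

x = torch.randn(4, 16)
W = torch.randn(32, 16)        # frozen base weight
r, alpha = 8, 16               # alpha / r = 2.0 scales the adapter output
A = torch.randn(r, 16) * 0.01  # down-projection
B = torch.zeros(32, r)         # up-projection, zero-init so training starts at y = x W^T
print(lora_forward(x, W, A, B, alpha, r).shape)  # torch.Size([4, 32])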

* Update PEFT Doc (#8501)

* update peft doc

Signed-off-by: Chen Cui <[email protected]>

* remove old prompt learning doc and notebook

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* revert accidental commit

Signed-off-by: Chen Cui <[email protected]>

* revert accidental commit

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* release updates (#8394)

* release updates (#8378)

* [tutorial] fixed missing RIR scripts file. (#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* mcore ds fix

Signed-off-by: Dmytro Pykhtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

Signed-off-by: dimapihtar <[email protected]>

* revert asr files

Signed-off-by: dimapihtar <[email protected]>

* add comments

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

Signed-off-by: dimapihtar <[email protected]>

* update mcore version

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

Signed-off-by: dimapihtar <[email protected]>

* update mcore commit

Signed-off-by: dimapihtar <[email protected]>

* fix Bert unit tests

Signed-off-by: dimapihtar <[email protected]>

* update bert tests

Signed-off-by: dimapihtar <[email protected]>

* fix bert mcore test

Signed-off-by: dimapihtar <[email protected]>

* fix gpt jenkins tests

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* add mock ds test

Signed-off-by: dimapihtar <[email protected]>

* add test for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* mcore ds fix

Signed-off-by: dimapihtar <[email protected]>

* data input fix

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>

* Update megatron_gpt_model.py

Signed-off-by: Dmytro Pykhtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: jiaqi zeng <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Jan Baczek <[email protected]>
Signed-off-by: jbaczek <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Naga Venkatesh Gavini <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Travis Bartley <[email protected]>
Signed-off-by: arendu <[email protected]>
Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jaemin Choi <[email protected]>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: ataghibakhsh <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Sudhakar Singh <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: jbaczek <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Jan Baczek <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Naga Venkatesh Gavini <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: tbartley94 <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Michal Futrega <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Showing 6 changed files with 86 additions and 4 deletions.
21 changes: 18 additions & 3 deletions examples/nlp/language_modeling/megatron_gpt_eval.py
@@ -199,7 +199,9 @@ def main(cfg) -> None:

assert (
cfg.trainer.devices * cfg.trainer.num_nodes
== cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size
== cfg.tensor_model_parallel_size
* cfg.pipeline_model_parallel_size
* max(1, cfg.get('expert_model_parallel_size', 1))
), "devices * num_nodes should equal tensor_model_parallel_size * pipeline_model_parallel_size"

if cfg.gpt_model_file:
@@ -224,6 +226,8 @@ def main(cfg) -> None:
# with dist checkpointing we can use the model parallel config specified by the user
pretrained_cfg.tensor_model_parallel_size = cfg.tensor_model_parallel_size
pretrained_cfg.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size
pretrained_cfg.expert_model_parallel_size = cfg.get('expert_model_parallel_size', 1)
pretrained_cfg.micro_batch_size = 1
if trainer.precision == "16":
pretrained_cfg.megatron_amp_O2 = False
elif trainer.precision in ['bf16', 'bf16-mixed'] and cfg.get('megatron_amp_O2', False):
@@ -237,13 +241,23 @@
)
elif cfg.checkpoint_dir:
app_state = AppState()
if cfg.tensor_model_parallel_size > 1 or cfg.pipeline_model_parallel_size > 1:
app_state.model_parallel_size = cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size
if (
cfg.tensor_model_parallel_size > 1
or cfg.pipeline_model_parallel_size > 1
or cfg.get('expert_model_parallel_size', 1) > 1
):
app_state.model_parallel_size = (
cfg.tensor_model_parallel_size
* cfg.pipeline_model_parallel_size
* cfg.get('expert_model_parallel_size', 1)
)
app_state.tensor_model_parallel_size = cfg.tensor_model_parallel_size
app_state.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size
app_state.expert_model_parallel_size = cfg.get('expert_model_parallel_size', 1)
(
app_state.tensor_model_parallel_rank,
app_state.pipeline_model_parallel_rank,
app_state.expert_model_parallel_rank,
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
tensor_model_parallel_size_=cfg.tensor_model_parallel_size,
pipeline_model_parallel_size_=cfg.pipeline_model_parallel_size,
pipeline_model_parallel_split_rank_=cfg.pipeline_model_parallel_split_rank,
expert_model_parallel_size_=cfg.get('expert_model_parallel_size', 1),
)
checkpoint_path = os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name)
# checkpoint_path is a dir in case of distributed checkpointing
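
For orientation, a runnable sketch, not part of the commit, of the topology check the hunks above add to megatron_gpt_eval.py; the helper name and config dict are illustrative:

def check_world_size(devices: int, num_nodes: int, cfg: dict) -> None:
    # Mirrors the assertion above: total GPUs must cover TP x PP x EP.
    expected = (
        cfg["tensor_model_parallel_size"]
        * cfg["pipeline_model_parallel_size"]
        * max(1, cfg.get("expert_model_parallel_size", 1))
    )
    if devices * num_nodes != expected:
        raise ValueError(f"{devices * num_nodes} GPUs != TP*PP*EP = {expected}")

# 2 * 2 * 2 = 8 GPUs on one node passes the check.
check_world_size(8, 1, {
    "tensor_model_parallel_size": 2,
    "pipeline_model_parallel_size": 2,
    "expert_model_parallel_size": 2,
})
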
1 change: 1 addition & 0 deletions examples/nlp/language_modeling/tuning/megatron_gpt_sft.py
@@ -73,6 +73,7 @@ def _modify_config(gpt_cfg, cfg, add_cfg_to_tree=False):
gpt_cfg.ffn_dropout = cfg.model.ffn_dropout
gpt_cfg.use_flash_attention = cfg.model.get('use_flash_attention', False)
gpt_cfg.tensor_model_parallel_size = cfg.model.get('tensor_model_parallel_size', 1)
gpt_cfg.expert_model_parallel_size = cfg.model.get('expert_model_parallel_size', 1)
gpt_cfg.pipeline_model_parallel_size = cfg.model.get('pipeline_model_parallel_size', 1)
gpt_cfg.pipeline_model_parallel_split_rank = cfg.model.get('pipeline_model_parallel_split_rank', 0)

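The cfg.get(...) pattern above keeps consumer scripts backward compatible with configs that predate MoE; a small OmegaConf illustration (the config contents are hypothetical):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"model": {"tensor_model_parallel_size": 2}})
# Older configs without MoE fields fall back to 1, i.e. no expert parallelism.
ep = cfg.model.get("expert_model_parallel_size", 1)
print(ep)  # 1
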
@@ -161,7 +161,11 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
# Overrides used when converting checkpoints
if os.environ.get(NEMO_MEGATRON_MODEL_PARALLEL_APPSTATE_OVERRIDE, "false").lower() == "true":
app_state = AppState()
init_world_size = app_state.tensor_model_parallel_size * app_state.pipeline_model_parallel_size
init_world_size = (
app_state.tensor_model_parallel_size
* app_state.pipeline_model_parallel_size
* (app_state.expert_model_parallel_size or 1)
)
init_global_rank = app_state.global_rank
init_local_rank = app_state.local_rank
else:
global_rank=init_global_rank,
local_rank=init_local_rank,
tensor_model_parallel_size=cfg.get('tensor_model_parallel_size', 1),
expert_model_parallel_size=cfg.get('expert_model_parallel_size', 1),
pipeline_model_parallel_size=cfg.get('pipeline_model_parallel_size', 1),
virtual_pipeline_model_parallel_size=vp_size,
pipeline_model_parallel_split_rank=cfg.get('pipeline_model_parallel_split_rank', 0),
26 changes: 26 additions & 0 deletions nemo/collections/nlp/modules/common/megatron/megatron_init.py
@@ -33,6 +33,8 @@
from megatron.core import tensor_parallel
from megatron.core.parallel_state import (
get_pipeline_model_parallel_rank,
set_expert_model_parallel_rank,
set_expert_model_parallel_world_size,
set_pipeline_model_parallel_rank,
set_pipeline_model_parallel_split_rank,
set_pipeline_model_parallel_world_size,
@@ -60,6 +62,7 @@ def initialize_model_parallel_for_nemo(
global_rank,
local_rank,
tensor_model_parallel_size=1,
expert_model_parallel_size=1,
pipeline_model_parallel_size=1,
virtual_pipeline_model_parallel_size=None,
pipeline_model_parallel_split_rank=None,
app_state.global_rank = global_rank
app_state.world_size = world_size
app_state.local_rank = local_rank
app_state.expert_model_parallel_size = expert_model_parallel_size
app_state.tensor_model_parallel_size = tensor_model_parallel_size
app_state.pipeline_model_parallel_size = pipeline_model_parallel_size
app_state.virtual_pipeline_model_parallel_size = virtual_pipeline_model_parallel_size
(
app_state.tensor_model_parallel_rank,
app_state.pipeline_model_parallel_rank,
app_state.expert_model_parallel_rank,
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
virtual_pipeline_model_parallel_size_=virtual_pipeline_model_parallel_size,
pipeline_model_parallel_split_rank_=pipeline_model_parallel_split_rank,
context_parallel_size_=context_parallel_size,
expert_model_parallel_size_=expert_model_parallel_size,
)

# update apex.transformer globals
set_tensor_model_parallel_world_size(app_state.tensor_model_parallel_size)
set_tensor_model_parallel_rank(app_state.tensor_model_parallel_rank)

set_expert_model_parallel_world_size(app_state.expert_model_parallel_size)
set_expert_model_parallel_rank(app_state.expert_model_parallel_rank)

set_pipeline_model_parallel_rank(app_state.pipeline_model_parallel_rank)
if HAVE_INTERLEAVED:
set_virtual_pipeline_model_parallel_world_size(app_state.virtual_pipeline_model_parallel_size)
@@ -179,6 +188,7 @@ def fake_initialize_model_parallel(
pipeline_model_parallel_size_,
pipeline_model_parallel_split_rank_=None,
virtual_pipeline_model_parallel_size_=None,
expert_model_parallel_size_=1,
context_parallel_size_=1,
):
"""
@@ -302,6 +312,21 @@
logging.info(f'All tensor model parallel group ranks: {all_tensor_model_parallel_group_ranks}')
logging.info(f'Rank {rank} has tensor model parallel rank: {tensor_model_parallel_rank}')

# EP rank
expert_model_parallel_rank = 0
if expert_model_parallel_size_ is not None and expert_model_parallel_size_ > 1:
tensor_and_data_group_size: int = tensor_model_parallel_size * data_parallel_size
num_tensor_and_data_groups: int = world_size // tensor_and_data_group_size
tensor_and_expert_group_size: int = tensor_model_parallel_size * expert_model_parallel_size_
num_expert_groups: int = data_parallel_size // expert_model_parallel_size_
for i in range(num_tensor_and_data_groups):
for j in range(num_expert_groups):
start_rank = i * tensor_and_data_group_size + j * tensor_and_expert_group_size
end_rank = i * tensor_and_data_group_size + (j + 1) * tensor_and_expert_group_size
ranks = range(start_rank, end_rank)
if rank in ranks:
expert_model_parallel_rank = list(ranks).index(rank)

# Build the pipeline model-parallel groups and embedding groups
# (first and last rank in each pipeline model-parallel group).
all_pipeline_model_parallel_group_ranks = []
@@ -340,6 +365,7 @@
return (
tensor_model_parallel_rank,
pipeline_model_parallel_rank,
expert_model_parallel_rank,
model_parallel_size,
data_parallel_size,
pipeline_model_parallel_split_rank_,
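
A standalone sketch of the expert-parallel rank bookkeeping added to fake_initialize_model_parallel above; it assumes the same rank layout (tensor parallel fastest, then data parallel) and mirrors the loop in the hunk:

def expert_model_parallel_rank(rank: int, world_size: int, tp: int, ep: int, dp: int) -> int:
    # Ranks are tiled into TP x DP blocks; each block is split into DP // EP
    # expert groups of TP * EP consecutive ranks, and the result is the index
    # of `rank` within its tensor-and-expert group, as in the hunk above.
    if ep <= 1:
        return 0
    tensor_and_data = tp * dp
    tensor_and_expert = tp * ep
    for i in range(world_size // tensor_and_data):
        for j in range(dp // ep):
            start = i * tensor_and_data + j * tensor_and_expert
            ranks = range(start, start + tensor_and_expert)
            if rank in ranks:
                return list(ranks).index(rank)
    return 0

# 8 GPUs, TP=2, PP=1 => DP=4; EP=2 splits each data-parallel block in two.
for r in range(8):
    print(r, expert_model_parallel_rank(r, world_size=8, tp=2, ep=2, dp=4))
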
1 change: 1 addition & 0 deletions nemo/collections/nlp/parts/nlp_overrides.py
@@ -129,6 +129,7 @@ def init_model_parallel(sharp: bool, nccl_communicator_config_path: str = None)
context_parallel_size=app_state.context_parallel_size,
nccl_communicator_config_path=nccl_communicator_config_path,
use_sharp=sharp,
expert_model_parallel_size=app_state.expert_model_parallel_size,
)

# assert that fake tp and pp rank match after model parallel init
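
A single-process sketch of how this argument reaches Megatron-Core's initializer; the keyword arguments match the call above, while the gloo bootstrap and sizes of 1 are illustrative so the snippet can run in one CPU process:

import os
import torch
from megatron.core import parallel_state

# Minimal single-process process group so the initializer can run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)

# On a real MoE run expert_model_parallel_size would be > 1, with a
# correspondingly larger world size.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    expert_model_parallel_size=1,
)
print(parallel_state.model_parallel_is_initialized())  # True
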
34 changes: 34 additions & 0 deletions nemo/utils/app_state.py
@@ -39,13 +39,15 @@ def __init__(self):
self._local_rank = None
self._global_rank = None
self._tensor_model_parallel_rank = None
self._expert_model_parallel_rank = None
self._pipeline_model_parallel_rank = None
self._data_parallel_rank = None

self._world_size = None
self._model_parallel_size = None
self._tensor_model_parallel_size = None
self._tensor_model_parallel_group = None
self._expert_model_parallel_size = None
self._pipeline_model_parallel_size = None
self._virtual_pipeline_model_parallel_size = None
self._pipeline_model_parallel_group = None
@@ -141,6 +143,38 @@ def tensor_model_parallel_size(self, size):
"""
self._tensor_model_parallel_size = size

@property
def expert_model_parallel_rank(self):
""" Property returns the expert model parallel rank.
Returns:
Expert model parallel rank.
"""
return self._expert_model_parallel_rank

@expert_model_parallel_rank.setter
def expert_model_parallel_rank(self, rank):
""" Property sets the expert model parallel rank.
Args:
rank (int): Expert model parallel rank.
"""
self._expert_model_parallel_rank = rank

@property
def expert_model_parallel_size(self):
""" Property returns the number of GPUs in each expert parallel group.
Returns:
Number of GPUs in each expert parallel group.
"""
return self._expert_model_parallel_size

@expert_model_parallel_size.setter
def expert_model_parallel_size(self, size):
""" Property sets the number of GPUs in each expert parallel group.
Args:
size (int): Number of GPUs in each expert parallel group.
"""
self._expert_model_parallel_size = size

@property
def pipeline_model_parallel_size(self):
""" Property returns the number of GPUs in each model parallel group.
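
The new fields are plain properties on NeMo's AppState singleton; a small illustration (the values are hypothetical):

from nemo.utils.app_state import AppState

app_state = AppState()                       # AppState is a singleton
app_state.expert_model_parallel_size = 2
app_state.expert_model_parallel_rank = 0
print(app_state.expert_model_parallel_size)  # 2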
