MoE parameter passing (#8490)

* MoE parameter passing (#8255)

* MoE parameter passing

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Pass EP/MoE params in consumer scripts.

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* PR fixes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Use latest commit of mcore-0.5

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* CI fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Jiaqiz/option to disable adapters & merge all lora layers (#8029)

* Added LoRA support for the Dense layer of Attention

* Added LoRA MLP support to MCore and NeMo models.

* Change LoRA config default to QKV.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed bug with ddp training.

* use adapter only when it is enabled

Signed-off-by: jiaqi zeng <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lora merge script (#8113)

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: Adi Renduchintala <[email protected]>

* add peft ckpt to nemo

Signed-off-by: Jiaqi Zeng <[email protected]>

* merge lora weights for all layers, mcore only

Signed-off-by: Jiaqi Zeng <[email protected]>

* support/fix cpu initialization

Signed-off-by: Chen Cui <[email protected]>

* add example usage

Signed-off-by: Chen Cui <[email protected]>

* fix TP due to distributed checkpoint

Signed-off-by: Chen Cui <[email protected]>

* updating the logic of merging lora weights for all layers, mcore only

Signed-off-by: Jiaqi Zeng <[email protected]>

* MCoreMixin changes.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* merge in fp32 then cast back

Signed-off-by: Jiaqi Zeng <[email protected]>

* remove ckpt to nemo

Signed-off-by: Jiaqi Zeng <[email protected]>

* fix import

Signed-off-by: Jiaqi Zeng <[email protected]>

---------

Signed-off-by: jiaqi zeng <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
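
A minimal sketch, not part of this commit, of the fp32 merge-and-cast flow the bullets above describe ("merge lora weights for all layers", "merge in fp32 then cast back"); the function and tensor names are illustrative, not NeMo's merge script:

import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float) -> torch.Tensor:
    # Accumulate the low-rank update in fp32 to avoid bf16/fp16 rounding error,
    # then cast the merged weight back to the base weight's dtype.
    merged = W.float() + scale * (B.float() @ A.float())
    return merged.to(W.dtype)

W = torch.randn(32, 16, dtype=torch.bfloat16)  # frozen base weight
A = torch.randn(8, 16)                         # LoRA down-projection (r=8)
B = torch.randn(32, 8)                         # LoRA up-projection
print(merge_lora(W, A, B, scale=2.0).dtype)    # torch.bfloat16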

* Update k2 version (#8478)

Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add mcore full TE transformer layer spec (#8328)

* Add spec and implement autocast layer

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jan Baczek <[email protected]>

* remove try-catches; these dependencies are mandatory for this file

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jan Baczek <[email protected]>

* Check out this cool try/except clause

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import

Signed-off-by: Jan Baczek <[email protected]>

* Add import tests to Jenkinsfile

Signed-off-by: Jan Baczek <[email protected]>

* Move import tests to Jenkins and remove code that is developed only for passing tests

Signed-off-by: Jan Baczek <[email protected]>

* Make test robust to faulty base configs

Signed-off-by: Jan Baczek <[email protected]>

* Use proper GPT implementation in the test

Signed-off-by: Jan Baczek <[email protected]>

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Sudhakar Singh <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Sudhakar Singh <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* Update nemo/collections/nlp/models/language_modeling/megatron/gpt_full_te_layer_autocast_spec.py

Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: jbaczek <[email protected]>

* Add TE knobs to the copy of AutocastTransformerLayer

Signed-off-by: Jan Baczek <[email protected]>

* Add TE knobs to the copy of AutocastTransformerLayer

Signed-off-by: Jan Baczek <[email protected]>

* Add dummy parameter to accommodate the changes in mcore

Signed-off-by: Jan Baczek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update mcore to 0.5.0 in Jenkins pipeline

Signed-off-by: Jan Baczek <[email protected]>

* Bump mcore commit. This is a commit from ToT (top of tree), not a release.

Signed-off-by: Jan Baczek <[email protected]>

* Remove from the test config option that is incompatible with bias_activation_fusion

Signed-off-by: Jan Baczek <[email protected]>

* Bump TE version in CI to 1.4

Signed-off-by: Jan Baczek <[email protected]>

* Update test

Signed-off-by: Jan Baczek <[email protected]>

* Change precision for the test - current runners don't support bf16

Signed-off-by: Jan Baczek <[email protected]>

---------

Signed-off-by: Jan Baczek <[email protected]>
Signed-off-by: jbaczek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sudhakar Singh <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Handle float limit_val_batches (#8426)

* Handle float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Rectify reconfiguration of float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Remove unused imports

Signed-off-by: Abhishree <[email protected]>

* Scale len(val_dataloader) with float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Return len(dataloader) in microbatches

Signed-off-by: Abhishree <[email protected]>

* Add back resetting of num val samples

Signed-off-by: Abhishree <[email protected]>

* Fix to ensure float limit_val_batches is multiple of num_micro_batches

Signed-off-by: Abhishree <[email protected]>

* Remove forcing eval samples to 1 for float limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Fix bug wrt 0 limit_val_batches

Signed-off-by: Abhishree <[email protected]>

* Add missing mock_dataset line

Signed-off-by: Abhishree <[email protected]>

* Avoid ensuring limit_val_batches is a multiple of microbatches for 1.0

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Restore the hack forcing number of validation and test epochs to 1

Signed-off-by: Jan Baczek <[email protected]>

* Change limit_val_batches to 1.0 for GPT pretraining test. The integer value is covered in other tests

Signed-off-by: Jan Baczek <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Jan Baczek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jan Baczek <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
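
A minimal sketch, not part of this commit, of the float limit_val_batches handling described above; the function name and exact rounding policy are illustrative assumptions:

def scaled_val_batches(num_val_batches: int, limit_val_batches: float, num_micro_batches: int) -> int:
    # Scale the dataloader length by the float fraction, then round down to a
    # whole number of global batches (a multiple of num_micro_batches).
    if limit_val_batches == 1.0:  # 1.0 keeps the full dataloader
        return num_val_batches
    scaled = int(num_val_batches * limit_val_batches)
    return (scaled // num_micro_batches) * num_micro_batches

print(scaled_val_batches(100, 0.25, 4))  # 24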

* Fix tutorial links in user guide (#8497)

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Sequence Parallel for LoRA (#8369)

* support lora + sequence parallel

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add more comments

Signed-off-by: Chen Cui <[email protected]>

* add lora SP CI test

Signed-off-by: Chen Cui <[email protected]>

* support lora for all linear modules as in #7988

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Call proper method to replace (#8498)

Signed-off-by: Naga Venkatesh Gavini <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Added memory logger (#8395)

* Added memory logger

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Canary refactor for Riva (#8363)

* initial commit of bleu score tracking

Signed-off-by: Travis Bartley <[email protected]>

* initial commit, refactoring aed models for riva

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updating Canary to support torch metrics

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* style fixes

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* missed an empty batch conditional

Signed-off-by: Travis Bartley <[email protected]>

* Fixing dataloader issues

Signed-off-by: Travis Bartley <[email protected]>

* Finishing merge conflict with transcribe update

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* style fix

Signed-off-by: Travis Bartley <[email protected]>

* copyright header fix

Signed-off-by: Travis Bartley <[email protected]>

* yet another merge conflict

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* making paired data management safer

Signed-off-by: Travis Bartley <[email protected]>

* sentencepiece needs bigger tokenizer...

Signed-off-by: Travis Bartley <[email protected]>

* sentencepiece tokenizer vocab needs to be +2 from vocab for canary

Signed-off-by: Travis Bartley <[email protected]>

* Update canary tokenizer to be more generic, updated metrics to manage special tokens removal themselves.

Signed-off-by: Travis Bartley <[email protected]>

* merge conflict

Signed-off-by: Travis Bartley <[email protected]>

* Simplified tokenizer and corrected bug in dataloader

Signed-off-by: Travis Bartley <[email protected]>

* Cleaning up docstrings and fixing inference bug.

Signed-off-by: Travis Bartley <[email protected]>

* adding example scripts

Signed-off-by: Travis Bartley <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleaning up useless imports

Signed-off-by: Travis Bartley <[email protected]>

* adding unit tests

Signed-off-by: Travis Bartley <[email protected]>

* fixing unit tests

Signed-off-by: Travis Bartley <[email protected]>

* cfg name change

Signed-off-by: Travis Bartley <[email protected]>

* adding custom check to pass pytests

Signed-off-by: Travis Bartley <[email protected]>

* removing print script

Signed-off-by: Travis Bartley <[email protected]>

* catching bugs regarding tokens.

Signed-off-by: Travis Bartley <[email protected]>

* added docstrings and made examples scripts more generic

Signed-off-by: Travis Bartley <[email protected]>

* docstring deleted by accident

Signed-off-by: Travis Bartley <[email protected]>

* plurals in namespace

Signed-off-by: Travis Bartley <[email protected]>

* changing example script

Signed-off-by: Travis Bartley <[email protected]>

---------

Signed-off-by: Travis Bartley <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add alpha scaling to lora (#8248)

* removed deprecated peft model

Signed-off-by: arendu <[email protected]>

* add alpha

Signed-off-by: arendu <[email protected]>

* default for alpha

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add alpha scaling to lora (#8483)

* coldfix (#8412)

Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Fixed errors in the CTM gen functions (#8416) (#8420)

Signed-off-by: Taejin Park <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add change_vocabulary and save_tokenizers() support to Multitask ASR models (#8357) (#8367)

* Add change_vocabulary and save_tokenizers() support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/asr/models/aed_multitask_models.py

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* fix path location and branch (#8314)

* fix path location and branch (#8304)

* fix path location and branch

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* change to a floating point number

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Somshubra Majumdar <[email protected]>

* update branch in tutorial

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Michal Futrega <[email protected]>

* Add TP comm overlap knobs to AutocastTransformerLayer (#8290)

Signed-off-by: Jaemin Choi <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* add deallocate pipeline output optimization (#8279) (#8318)

* add deallocate pipeline output optimization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* remove assertion (#8302) (#8321)

Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 (#8334) (#8346)

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Enable megatron core loggers for GPT pretraining (#8354) (#8384)

* Logging changes tested for gpt_pretraining

* Additional args

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Fix dreambooth data sampler issue (#8400) (#8413)

* Turn on drop last

* Some neva fixes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* add ensemble decoding fix (#8427) (#8433)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* NeVA Tutorial Notebook (#8217)

* init commit - neva tutorial

Signed-off-by: Pratyush Muthukumar <[email protected]>

* NeVA tutorial notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>

* init commit - neva tutorial

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* NeVA tutorial notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>

* add inference via script

Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>

* requested changes

Signed-off-by: Pratyush Muthukumar <[email protected]>

* add codeblocks to run torchrun in notebook

Signed-off-by: Pratyush Muthukumar <[email protected]>

---------

Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* mcore customization doc minor fix (#8421) (#8437)

Signed-off-by: Huiying Li <[email protected]>
Co-authored-by: Huiying <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add `loop_labels` algorithm for TDT greedy decoding (#8215)

* Add `loop_labels` algorithm for TDT greedy decoding

Signed-off-by: Vladimir Bataev <[email protected]>

* Use `loop_labels` by default

Signed-off-by: Vladimir Bataev <[email protected]>

* Loop labels greedy decoding v2

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments. Clean up

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Add tests for batched hypotheses

Signed-off-by: Vladimir Bataev <[email protected]>

* Add tests for batched alignments

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix comment

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix test

Signed-off-by: Vladimir Bataev <[email protected]>

* Add computer for TDT

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix TDT decoding algorithm

Signed-off-by: Vladimir Bataev <[email protected]>

* Use loop frames by default for TDT

Signed-off-by: Vladimir Bataev <[email protected]>

* Remove "loop frames" implementation for TDT

Signed-off-by: Vladimir Bataev <[email protected]>

* Clean up

Signed-off-by: Vladimir Bataev <[email protected]>

* Add comments

Signed-off-by: Vladimir Bataev <[email protected]>

* Fix confidence. Use tensor for durations.

Signed-off-by: Vladimir Bataev <[email protected]>

---------

Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add dist ckpt support for regular optimizers (#7749) (#8293)

* Add dist ckpt support for regular optimizers

* [tutorial] fixed missing RIR scripts file. (#8257)

* fix imports

* imports fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ci imports fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert asr notebook

* revert asr notebook

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* Multimodal r1.23.0 bug fix  (#8315) (#8339)

* Rename quick-gelu

* ddpm config guard

* Fix ddpm edit api

* Fix insert_image_token cfg issue

* neva updates

* reformat

* Add back jenkins

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix jenkins

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bugs

* Update default neva template

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* mcore ds fix (#8283) (#8385)

* [tutorial] fixed missing RIR scripts file. (#8257)

* add values to en tts dict (#7879)

* mcore ds fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

* revert asr files

* add comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

* update mcore version

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

* update mcore commit

* fix Bert unit tests

* update bert tests

* fix bert mcore test

* fix gpt jenkins tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update apex & TE commits

* revert apex installation

* turn off the fusion for jenkins

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* MCore dataset compatibility for tokenizers (#8390) (#8397)

* Add unique_identifiers for all tokenizers and eod for SentencePieceTokenizer

* Add generalized token aliases to TokenizerSpec to conform with MegatronTokenizer's interface. Remove now-redundant individual fixes from AutoTokenizer and SentencePieceTokenizer.

---------

Signed-off-by: Valerie Sarge <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Canary: inference tokenization improvements; preserving custom keys when creating tarred manifests (#8432)

* Improvements for Canary:

- carry over custom keys when creating tarred manifests
- selectable text field in ASR eval
- get rid of prompt slicing, create proper inference prompts

Signed-off-by: Piotr Żelasko <[email protected]>

* set ensure_ascii=False in tarred conversion to avoid breaking tokenizers trained on UTF-8 encoding

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* add  sbert to IR (#8445)

* add  sbert to IR

Signed-off-by: ataghibakhsh <[email protected]>

* add doc

Signed-off-by: ataghibakhsh <[email protected]>

* fix the auto_tokenizer property method reset bug

Signed-off-by: ataghibakhsh <[email protected]>

* addressed bot comments

Signed-off-by: ataghibakhsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: ataghibakhsh <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Michal Futrega <[email protected]>

* Update readme (#8440)

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* landing pages added

* landing page added for vision

* landing pages updated

* some minor changes to the main readme

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* typo fixed

* update

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* NeMo-Mistral to HF converter bugfix. (#8353) (#8442)

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Fixing mcore bert for TP, PP and SP (#8336) (#8443)

* Fixing mcore bert for TP, PP and SP

* Fixing mcore bert for TP, PP and SP

* Fixing mcore version

* Fixing mcore version

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

---------

Signed-off-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add LoRA support to all linear layers (#7988)

* Added LoRA support for the Dense layer of Attention

* Added LoRA MLP support to MCore and NeMo models.

* Change LoRA config default to QKV.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed bug with ddp training.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MCoreMixin changes.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* using new commit of meg-LM

Signed-off-by: arendu <[email protected]>

* add cpu_offloading_num_layers to conversion script until bug in megatron is fixed

Signed-off-by: Chen Cui <[email protected]>

* fix peft mixin arguments to follow mcore 0.5

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update megatron commit to fix ci error

Signed-off-by: Chen Cui <[email protected]>

* try to fix ci

Signed-off-by: Chen Cui <[email protected]>

* try to fix ci

Signed-off-by: Chen Cui <[email protected]>

* add cfg default

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: arendu <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Co-authored-by: arendu <[email protected]>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Add Neva Template for NV-DPO Models  (#8358)

* add/rename from nvgpt to nv_steerlm, add nv_dpo template

Signed-off-by: HuiyingLi <[email protected]>

* add nv_dpo conversation to accommodate empty system message

Signed-off-by: HuiyingLi <[email protected]>

* handle nv_dpo template text generation

Signed-off-by: HuiyingLi <[email protected]>

* add prompt string to nvgpt

Signed-off-by: HuiyingLi <[email protected]>

* bugfix for inference prompt template

Signed-off-by: HuiyingLi <[email protected]>

* bug fix for grabbing clean text

Signed-off-by: Huiying Li <[email protected]>

* fix code format

Signed-off-by: Huiying Li <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Rebase scaling alpha

Signed-off-by: Michal Futrega <[email protected]>

* default for alpha

Signed-off-by: arendu <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>

* Rebase scaling alpha

Signed-off-by: Michal Futrega <[email protected]>

---------

Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jaemin Choi <[email protected]>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: ataghibakhsh <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: arendu <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Co-authored-by: arendu <[email protected]>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jaemin Choi <[email protected]>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: ataghibakhsh <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Michal Futrega <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
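
For reference, a minimal sketch of alpha scaling in the standard LoRA formulation that this commit adds a knob for; NeMo's actual module wiring differs, and all names below are illustrative:

import torch

def lora_forward(x, W, A, B, alpha: float, r: int):
    # Standard LoRA: y = x W^T + (alpha / r) * x A^T B^T, with W frozen.
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)

x = torch.randn(4, 16)
W = torch.randn(32, 16)        # frozen base weight
r, alpha = 8, 16               # alpha / r = 2.0 scales the adapter output
A = torch.randn(r, 16) * 0.01  # down-projection
B = torch.zeros(32, r)         # up-projection, zero-init so training starts at y = x W^T
print(lora_forward(x, W, A, B, alpha, r).shape)  # torch.Size([4, 32])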

* Update PEFT Doc (#8501)

* update peft doc

Signed-off-by: Chen Cui <[email protected]>

* remove old prompt learning doc and notebook

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* fix table

Signed-off-by: Chen Cui <[email protected]>

* revert accidental commit

Signed-off-by: Chen Cui <[email protected]>

* revert accidental commit

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

* release updates (#8394)

* release updates (#8378)

* [tutorial] fixed missing RIR scripts file. (#8257)

Signed-off-by: Xuesong Yang <[email protected]>

* add values to en tts dict (#7879)

Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>

* mcore ds fix

Signed-off-by: Dmytro Pykhtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update mcore

Signed-off-by: dimapihtar <[email protected]>

* revert asr files

Signed-off-by: dimapihtar <[email protected]>

* add comments

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for mcore mock dataset

Signed-off-by: dimapihtar <[email protected]>

* update mcore version

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt cfg

Signed-off-by: dimapihtar <[email protected]>

* update mcore commit

Signed-off-by: dimapihtar <[email protected]>

* fix Bert unit tests

Signed-off-by: dimapihtar <[email protected]>

* update bert tests

Signed-off-by: dimapihtar <[email protected]>

* fix bert mcore test

Signed-off-by: dimapihtar <[email protected]>

* fix gpt jenkins tests

Signed-off-by: dimapihtar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add support for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* add mock ds test

Signed-off-by: dimapihtar <[email protected]>

* add test for dict data input type

Signed-off-by: dimapihtar <[email protected]>

* mcore ds fix

Signed-off-by: dimapihtar <[email protected]>

* data input fix

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>

* Update megatron_gpt_model.py

Signed-off-by: Dmytro Pykhtar <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: jiaqi zeng <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Jiaqi Zeng <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Jan Baczek <[email protected]>
Signed-off-by: jbaczek <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Naga Venkatesh Gavini <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Travis Bartley <[email protected]>
Signed-off-by: arendu <[email protected]>
Signed-off-by: George Zelenfroynd <[email protected]>
Signed-off-by: Michal Futrega <[email protected]>
Signed-off-by: Taejin Park <[email protected]>
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Jaemin Choi <[email protected]>
Signed-off-by: Jimmy Zhang <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Sangkug Lym <[email protected]>
Signed-off-by: Aishwarya Bhandare <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Pratyush Muthukumar <[email protected]>
Signed-off-by: Huiying Li <[email protected]>
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Signed-off-by: Valerie Sarge <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: ataghibakhsh <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Adi Renduchintala <[email protected]>
Signed-off-by: HuiyingLi <[email protected]>
Signed-off-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: akoumpa <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: HeyyyyyyG <[email protected]>
Co-authored-by: Tugrul Konuk <[email protected]>
Co-authored-by: Adi Renduchintala <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: Vladimir Bataev <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Sudhakar Singh <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: jbaczek <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Jan Baczek <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Naga Venkatesh Gavini <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: tbartley94 <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Michal Futrega <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: JimmyZhang12 <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: ashbhandare <[email protected]>
Co-authored-by: Aishwarya Bhandare <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Pratyush Muthukumar <[email protected]>
Co-authored-by: Huiying <[email protected]>
Co-authored-by: mikolajblaz <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Mariana <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Valerie Sarge <[email protected]>
Co-authored-by: Ali Taghibakhshi <[email protected]>
Co-authored-by: ntajbakhsh <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Co-authored-by: Jiaqi Zeng <[email protected]>
Showing 6 changed files with 86 additions and 4 deletions.
21 changes: 18 additions & 3 deletions examples/nlp/language_modeling/megatron_gpt_eval.py
@@ -199,7 +199,9 @@ def main(cfg) -> None:

assert (
cfg.trainer.devices * cfg.trainer.num_nodes
== cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size
== cfg.tensor_model_parallel_size
* cfg.pipeline_model_parallel_size
* max(1, cfg.get('expert_model_parallel_size', 1))
), "devices * num_nodes should equal tensor_model_parallel_size * pipeline_model_parallel_size"

if cfg.gpt_model_file:
@@ -224,6 +226,8 @@ def main(cfg) -> None:
# with dist checkpointing we can use the model parallel config specified by the user
pretrained_cfg.tensor_model_parallel_size = cfg.tensor_model_parallel_size
pretrained_cfg.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size
pretrained_cfg.expert_model_parallel_size = cfg.get('expert_model_parallel_size', 1)
pretrained_cfg.micro_batch_size = 1
if trainer.precision == "16":
pretrained_cfg.megatron_amp_O2 = False
elif trainer.precision in ['bf16', 'bf16-mixed'] and cfg.get('megatron_amp_O2', False):
@@ -237,13 +241,23 @@
)
elif cfg.checkpoint_dir:
app_state = AppState()
if cfg.tensor_model_parallel_size > 1 or cfg.pipeline_model_parallel_size > 1:
app_state.model_parallel_size = cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size
if (
cfg.tensor_model_parallel_size > 1
or cfg.pipeline_model_parallel_size > 1
or cfg.get('expert_model_parallel_size', 1) > 1
):
app_state.model_parallel_size = (
cfg.tensor_model_parallel_size
* cfg.pipeline_model_parallel_size
* cfg.get('expert_model_parallel_size', 1)
)
app_state.tensor_model_parallel_size = cfg.tensor_model_parallel_size
app_state.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size
app_state.expert_model_parallel_size = cfg.get('expert_model_parallel_size', 1)
(
app_state.tensor_model_parallel_rank,
app_state.pipeline_model_parallel_rank,
app_state.expert_model_parallel_rank,
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
tensor_model_parallel_size_=cfg.tensor_model_parallel_size,
pipeline_model_parallel_size_=cfg.pipeline_model_parallel_size,
pipeline_model_parallel_split_rank_=cfg.pipeline_model_parallel_split_rank,
expert_model_parallel_size_=cfg.get('expert_model_parallel_size', 1),
)
checkpoint_path = os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name)
# checkpoint_path is a dir in case of distributed checkpointing
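
For orientation, a runnable sketch, not part of the commit, of the topology check the hunks above add to megatron_gpt_eval.py; the helper name and config dict are illustrative:

def check_world_size(devices: int, num_nodes: int, cfg: dict) -> None:
    # Mirrors the assertion above: total GPUs must cover TP x PP x EP.
    expected = (
        cfg["tensor_model_parallel_size"]
        * cfg["pipeline_model_parallel_size"]
        * max(1, cfg.get("expert_model_parallel_size", 1))
    )
    if devices * num_nodes != expected:
        raise ValueError(f"{devices * num_nodes} GPUs != TP*PP*EP = {expected}")

# 2 * 2 * 2 = 8 GPUs on one node passes the check.
check_world_size(8, 1, {
    "tensor_model_parallel_size": 2,
    "pipeline_model_parallel_size": 2,
    "expert_model_parallel_size": 2,
})
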
1 change: 1 addition & 0 deletions examples/nlp/language_modeling/tuning/megatron_gpt_sft.py
@@ -73,6 +73,7 @@ def _modify_config(gpt_cfg, cfg, add_cfg_to_tree=False):
gpt_cfg.ffn_dropout = cfg.model.ffn_dropout
gpt_cfg.use_flash_attention = cfg.model.get('use_flash_attention', False)
gpt_cfg.tensor_model_parallel_size = cfg.model.get('tensor_model_parallel_size', 1)
gpt_cfg.expert_model_parallel_size = cfg.model.get('expert_model_parallel_size', 1)
gpt_cfg.pipeline_model_parallel_size = cfg.model.get('pipeline_model_parallel_size', 1)
gpt_cfg.pipeline_model_parallel_split_rank = cfg.model.get('pipeline_model_parallel_split_rank', 0)

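The cfg.get(...) pattern above keeps consumer scripts backward compatible with configs that predate MoE; a small OmegaConf illustration (the config contents are hypothetical):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"model": {"tensor_model_parallel_size": 2}})
# Older configs without MoE fields fall back to 1, i.e. no expert parallelism.
ep = cfg.model.get("expert_model_parallel_size", 1)
print(ep)  # 1
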
@@ -161,7 +161,11 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
# Overrides used when converting checkpoints
if os.environ.get(NEMO_MEGATRON_MODEL_PARALLEL_APPSTATE_OVERRIDE, "false").lower() == "true":
app_state = AppState()
init_world_size = app_state.tensor_model_parallel_size * app_state.pipeline_model_parallel_size
init_world_size = (
app_state.tensor_model_parallel_size
* app_state.pipeline_model_parallel_size
* (app_state.expert_model_parallel_size or 1)
)
init_global_rank = app_state.global_rank
init_local_rank = app_state.local_rank
else:
global_rank=init_global_rank,
local_rank=init_local_rank,
tensor_model_parallel_size=cfg.get('tensor_model_parallel_size', 1),
expert_model_parallel_size=cfg.get('expert_model_parallel_size', 1),
pipeline_model_parallel_size=cfg.get('pipeline_model_parallel_size', 1),
virtual_pipeline_model_parallel_size=vp_size,
pipeline_model_parallel_split_rank=cfg.get('pipeline_model_parallel_split_rank', 0),
26 changes: 26 additions & 0 deletions nemo/collections/nlp/modules/common/megatron/megatron_init.py
@@ -33,6 +33,8 @@
from megatron.core import tensor_parallel
from megatron.core.parallel_state import (
get_pipeline_model_parallel_rank,
set_expert_model_parallel_rank,
set_expert_model_parallel_world_size,
set_pipeline_model_parallel_rank,
set_pipeline_model_parallel_split_rank,
set_pipeline_model_parallel_world_size,
@@ -60,6 +62,7 @@ def initialize_model_parallel_for_nemo(
global_rank,
local_rank,
tensor_model_parallel_size=1,
expert_model_parallel_size=1,
pipeline_model_parallel_size=1,
virtual_pipeline_model_parallel_size=None,
pipeline_model_parallel_split_rank=None,
app_state.global_rank = global_rank
app_state.world_size = world_size
app_state.local_rank = local_rank
app_state.expert_model_parallel_size = expert_model_parallel_size
app_state.tensor_model_parallel_size = tensor_model_parallel_size
app_state.pipeline_model_parallel_size = pipeline_model_parallel_size
app_state.virtual_pipeline_model_parallel_size = virtual_pipeline_model_parallel_size
(
app_state.tensor_model_parallel_rank,
app_state.pipeline_model_parallel_rank,
app_state.expert_model_parallel_rank,
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
virtual_pipeline_model_parallel_size_=virtual_pipeline_model_parallel_size,
pipeline_model_parallel_split_rank_=pipeline_model_parallel_split_rank,
context_parallel_size_=context_parallel_size,
expert_model_parallel_size_=expert_model_parallel_size,
)

# update apex.transformer globals
set_tensor_model_parallel_world_size(app_state.tensor_model_parallel_size)
set_tensor_model_parallel_rank(app_state.tensor_model_parallel_rank)

set_expert_model_parallel_world_size(app_state.expert_model_parallel_size)
set_expert_model_parallel_rank(app_state.expert_model_parallel_rank)

set_pipeline_model_parallel_rank(app_state.pipeline_model_parallel_rank)
if HAVE_INTERLEAVED:
set_virtual_pipeline_model_parallel_world_size(app_state.virtual_pipeline_model_parallel_size)
@@ -179,6 +188,7 @@ def fake_initialize_model_parallel(
pipeline_model_parallel_size_,
pipeline_model_parallel_split_rank_=None,
virtual_pipeline_model_parallel_size_=None,
expert_model_parallel_size_=1,
context_parallel_size_=1,
):
"""
@@ -302,6 +312,21 @@
logging.info(f'All tensor model parallel group ranks: {all_tensor_model_parallel_group_ranks}')
logging.info(f'Rank {rank} has tensor model parallel rank: {tensor_model_parallel_rank}')

# EP rank
expert_model_parallel_rank = 0
if expert_model_parallel_size_ is not None and expert_model_parallel_size_ > 1:
tensor_and_data_group_size: int = tensor_model_parallel_size * data_parallel_size
num_tensor_and_data_groups: int = world_size // tensor_and_data_group_size
tensor_and_expert_group_size: int = tensor_model_parallel_size * expert_model_parallel_size_
num_expert_groups: int = data_parallel_size // expert_model_parallel_size_
for i in range(num_tensor_and_data_groups):
for j in range(num_expert_groups):
start_rank = i * tensor_and_data_group_size + j * tensor_and_expert_group_size
end_rank = i * tensor_and_data_group_size + (j + 1) * tensor_and_expert_group_size
ranks = range(start_rank, end_rank)
if rank in ranks:
expert_model_parallel_rank = list(ranks).index(rank)

# Build the pipeline model-parallel groups and embedding groups
# (first and last rank in each pipeline model-parallel group).
all_pipeline_model_parallel_group_ranks = []
@@ -340,6 +365,7 @@
return (
tensor_model_parallel_rank,
pipeline_model_parallel_rank,
expert_model_parallel_rank,
model_parallel_size,
data_parallel_size,
pipeline_model_parallel_split_rank_,
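
A standalone sketch of the expert-parallel rank bookkeeping added to fake_initialize_model_parallel above; it assumes the same rank layout (tensor parallel fastest, then data parallel) and mirrors the loop in the hunk:

def expert_model_parallel_rank(rank: int, world_size: int, tp: int, ep: int, dp: int) -> int:
    # Ranks are tiled into TP x DP blocks; each block is split into DP // EP
    # expert groups of TP * EP consecutive ranks, and the result is the index
    # of `rank` within its tensor-and-expert group, as in the hunk above.
    if ep <= 1:
        return 0
    tensor_and_data = tp * dp
    tensor_and_expert = tp * ep
    for i in range(world_size // tensor_and_data):
        for j in range(dp // ep):
            start = i * tensor_and_data + j * tensor_and_expert
            ranks = range(start, start + tensor_and_expert)
            if rank in ranks:
                return list(ranks).index(rank)
    return 0

# 8 GPUs, TP=2, PP=1 => DP=4; EP=2 splits each data-parallel block in two.
for r in range(8):
    print(r, expert_model_parallel_rank(r, world_size=8, tp=2, ep=2, dp=4))
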
1 change: 1 addition & 0 deletions nemo/collections/nlp/parts/nlp_overrides.py
@@ -129,6 +129,7 @@ def init_model_parallel(sharp: bool, nccl_communicator_config_path: str = None)
context_parallel_size=app_state.context_parallel_size,
nccl_communicator_config_path=nccl_communicator_config_path,
use_sharp=sharp,
expert_model_parallel_size=app_state.expert_model_parallel_size,
)

# assert that fake tp and pp rank match after model parallel init
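
A single-process sketch of how this argument reaches Megatron-Core's initializer; the keyword arguments match the call above, while the gloo bootstrap and sizes of 1 are illustrative so the snippet can run in one CPU process:

import os
import torch
from megatron.core import parallel_state

# Minimal single-process process group so the initializer can run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)

# On a real MoE run expert_model_parallel_size would be > 1, with a
# correspondingly larger world size.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    expert_model_parallel_size=1,
)
print(parallel_state.model_parallel_is_initialized())  # True
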
34 changes: 34 additions & 0 deletions nemo/utils/app_state.py
@@ -39,13 +39,15 @@ def __init__(self):
self._local_rank = None
self._global_rank = None
self._tensor_model_parallel_rank = None
self._expert_model_parallel_rank = None
self._pipeline_model_parallel_rank = None
self._data_parallel_rank = None

self._world_size = None
self._model_parallel_size = None
self._tensor_model_parallel_size = None
self._tensor_model_parallel_group = None
self._expert_model_parallel_size = None
self._pipeline_model_parallel_size = None
self._virtual_pipeline_model_parallel_size = None
self._pipeline_model_parallel_group = None
@@ -141,6 +143,38 @@ def tensor_model_parallel_size(self, size):
"""
self._tensor_model_parallel_size = size

@property
def expert_model_parallel_rank(self):
""" Property returns the expert model parallel rank.
Returns:
Expert model parallel rank.
"""
return self._expert_model_parallel_rank

@expert_model_parallel_rank.setter
def expert_model_parallel_rank(self, rank):
""" Property sets the expert model parallel rank.
Args:
rank (int): Expert model parallel rank.
"""
self._expert_model_parallel_rank = rank

@property
def expert_model_parallel_size(self):
""" Property returns the number of GPUs in each expert parallel group.
Returns:
Number of GPUs in each expert parallel group.
"""
return self._expert_model_parallel_size

@expert_model_parallel_size.setter
def expert_model_parallel_size(self, size):
""" Property sets the number of GPUs in each expert parallel group.
Args:
size (int): Number of GPUs in each expert parallel group.
"""
self._expert_model_parallel_size = size

@property
def pipeline_model_parallel_size(self):
""" Property returns the number of GPUs in each model parallel group.
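
The new fields are plain properties on NeMo's AppState singleton; a small illustration (the values are hypothetical):

from nemo.utils.app_state import AppState

app_state = AppState()                       # AppState is a singleton
app_state.expert_model_parallel_size = 2
app_state.expert_model_parallel_rank = 0
print(app_state.expert_model_parallel_size)  # 2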
