
Commit f375d51

Authored by marcromeyn, akoumpa, github-actions[bot], and rachitgarg91 (Rachit Garg)
[NeMo-UX] Integrating mcore's DistributedDataParallel into MegatronStrategy (#9387)
* Integrating mcore's DistributedDataParallel into MegatronStrategy Signed-off-by: Marc Romeyn <[email protected]> * Apply isort and black reformatting Signed-off-by: marcromeyn <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Apply ddp-hooks from pytorch only when needed Signed-off-by: Marc Romeyn <[email protected]> * bugfix if using mcore distOpt with sft (#9356) * bugfix if using mcore distOpt Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * fix typo infer_seq_lenght -> infer_seq_length (#9370) Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: Marc Romeyn <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Rachitg/ag (#9083) * Rachitg/ag (#9081) * disable overlap for qkv Signed-off-by: Rachit Garg <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix * bugfix --------- Signed-off-by: Rachit Garg <[email protected]> Signed-off-by: Rachit Garg <[email protected]> Co-authored-by: Rachit Garg <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <[email protected]> --------- Signed-off-by: Rachit Garg <[email protected]> Signed-off-by: Rachit Garg <[email protected]> Signed-off-by: michal2409 <[email protected]> Co-authored-by: Rachit Garg <[email protected]> Co-authored-by: Rachit Garg <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <[email protected]> Co-authored-by: michal2409 <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Adding the original change made for label_models (#9377) (#9378) Signed-off-by: Taejin Park <[email protected]> Co-authored-by: Taejin Park <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Dgalvez/fix greedy batch strategy name r2.0.0rc0 (#9243) (#9253) * Lazily warn about using greedy strategy instead of greedy_batch strategy. Previously, the warning would often run spuriously, since several existing code paths simply call "change_decoding_strategy()" after having first initialized a Module, rather than changing the config before initializing the Module. This can be confusing. The only problem I can see with this is that using logging inside a forward() method might interfere with some compiler toolkits like Torchscript or thunder.compile. Presumably it would be easy to add a conditional statement to avoid this statement in a compiler context if necessary. Signed-off-by: Daniel Galvez <[email protected]> Co-authored-by: Daniel Galvez <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Update README.rst (#9393) Revised content per https://gitlab-master.nvidia.com/nemo-framework-tme/documentation/-/issues/25. Also removed reference to NIMs in LLMs and MMs Deployment and Optimization. It should be NVIDIA NeMo Microservices and not NIM. Removed nemo:24.03.framework and nemo:24.01.speech in Docker Containers section and replaced with 24.05 . Please verify all changes. 
Signed-off-by: jgerh <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * a2a fix removed tp world size and group from init (#8944) (#8952) Signed-off-by: Anmol Gupta <[email protected]> Co-authored-by: anmolgupt <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Add config option for FP32 embedding grads (#8953) * Add config option for FP32 embedding grads (#8946) Signed-off-by: Tim Moon <[email protected]> * Apply isort and black reformatting Signed-off-by: ericharper <[email protected]> --------- Signed-off-by: Tim Moon <[email protected]> Signed-off-by: ericharper <[email protected]> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: ericharper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Changes to enable CUDA graph for LLM (#8955) * Changes to enable CUDA graph for LLM (#8751) * Use next instead of get_batch Signed-off-by: Vasudevan Rengasamy <[email protected]> * CUDA graph changes Signed-off-by: Vasudevan Rengasamy <[email protected]> * Change to enable CG with weight caching Signed-off-by: Vasudevan Rengasamy <[email protected]> * Revert "Use next instead of get_batch" This reverts commit 0021bb444cdd1b27674fc0cfea909c1a42475336. Signed-off-by: Vasudevan Rengasamy <[email protected]> * Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py Signed-off-by: Jan Baczek <[email protected]> Signed-off-by: Vasudevan Rengasamy <[email protected]> * Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py" This reverts commit b4f736ed2b39f6c48d2868ac3febb82c763ab3fb. Signed-off-by: Vasudevan Rengasamy <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <[email protected]> * Remove skip_weight_update argument Signed-off-by: Vasudevan Rengasamy <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vasudevan Rengasamy <[email protected]> * Bug fix + cleanup Signed-off-by: Vasudevan Rengasamy <[email protected]> * Cleanup Signed-off-by: Vasudevan Rengasamy <[email protected]> * Use new TE API for FP8 Param transpose Signed-off-by: Vasudevan Rengasamy <[email protected]> * Change config param cuda_graph to enable_cuda_graph Signed-off-by: Vasudevan Rengasamy <[email protected]> * Enable TE RNGStatesTracker through config Signed-off-by: Vasudevan Rengasamy <[email protected]> * Change te_rng_tracker to use_te_rng_tracker Signed-off-by: Vasudevan Rengasamy <[email protected]> * FP8 weight transpose handled inside TE Signed-off-by: Vasudevan Rengasamy <[email protected]> * Cleanup Signed-off-by: Vasudevan Rengasamy <[email protected]> * Revert "Revert "Copy jbaczek/mcore_parallel_state_api_change branch leaving out changes to nemo/export/quantize/quantizer.py"" This reverts commit e31862481216f9adf7fa584a0c0262916c935639. 
Signed-off-by: Vasudevan Rengasamy <[email protected]> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <[email protected]> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <[email protected]> * Fix merge conflicts Signed-off-by: Vasudevan Rengasamy <[email protected]> --------- Signed-off-by: Vasudevan Rengasamy <[email protected]> Signed-off-by: Jan Baczek <[email protected]> Co-authored-by: Jaemin Choi <[email protected]> Co-authored-by: Jan Baczek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <[email protected]> --------- Signed-off-by: Vasudevan Rengasamy <[email protected]> Signed-off-by: Jan Baczek <[email protected]> Signed-off-by: ericharper <[email protected]> Co-authored-by: vasunvidia <[email protected]> Co-authored-by: Jaemin Choi <[email protected]> Co-authored-by: Jan Baczek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: ericharper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Enhance Distributed Adam (#9051) * Enhance Distributed Adam (#9037) * Fix deprecated env. Signed-off-by: Wil Kong <[email protected]> * Use user desired value for distributed adam. Signed-off-by: Wil Kong <[email protected]> * Preserve memory format in parameter buffer of distributed adam. Signed-off-by: Wil Kong <[email protected]> * Fix the contiguous_param_buffer bug about bprop overlap and redundant copy after all-gather. Signed-off-by: Wil Kong <[email protected]> * Provide API to lock SHArP tree for distributed adam within nodes. Signed-off-by: Wil Kong <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Wil Kong <[email protected]> --------- Signed-off-by: Wil Kong <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ericharper <[email protected]> --------- Signed-off-by: Wil Kong <[email protected]> Signed-off-by: ericharper <[email protected]> Co-authored-by: Wil Kong <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: ericharper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Force diarizer to use CUDA if cuda is available and if device=None. 
(#9380) (#9390) * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Fixed clustering diarizer to load MSDD to GPU by default if cuda on * Apply isort and black reformatting --------- Signed-off-by: Taejin Park <[email protected]> Signed-off-by: tango4j <[email protected]> Co-authored-by: Taejin Park <[email protected]> Co-authored-by: tango4j <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * ci: Properly catch failed tests by introduction of workflow templates (#9324) * ci: Refactor tests into reusable template Signed-off-by: Oliver Koenig <[email protected]> * ci: Fix sending alerts on failure Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * disable slack Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * fix alerting Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * ci: Increase timeout for `L0_Unit_Tests_CPU` Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * increase timeout Signed-off-by: Oliver Koenig <[email protected]> * increase timeout for `Speech_Checkpoints_tests` Signed-off-by: Oliver Koenig <[email protected]> * improve readability Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * test Signed-off-by: Oliver Koenig <[email protected]> * test Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * finalize Signed-off-by: Oliver Koenig <[email protected]> * fix Signed-off-by: Oliver Koenig <[email protected]> * add missing rm statement for `L2_PTQ_Llama2_Export_Only` Signed-off-by: Oliver Koenig <[email protected]> * all your comments are belong to us Signed-off-by: Oliver Koenig <[email protected]> * remove github output Signed-off-by: Oliver Koenig <[email protected]> * revive more comments Signed-off-by: Oliver Koenig <[email protected]> * add L2: ASR dev run - part two Signed-off-by: Oliver Koenig <[email protected]> --------- Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Pablo Garay <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Fix T5 G2P Input and Output Types (#9224) (#9269) * fix t5 g2p model * Apply isort and black reformatting --------- Signed-off-by: Jason <[email protected]> Signed-off-by: blisc <[email protected]> Co-authored-by: Jason <[email protected]> Co-authored-by: blisc <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. (#9198) * Fix the "cast ping pong" problem when we run AMP inference. This has been tested only for Parakeet-CTC-1.1B right now. This problem certainly exists elsewhere. Automatic mixed precision and inference do not play well together. First, automatic mixed precision was created back when neural networks were much simpler. In particular, they did not have softmax and layer norm as frequent operations. In the era of transformers, softmax and layer norm are very common. 
AMP will uncoditionally output fp32 outputs from these operations, even if their inputs are fp16. See here: https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32 This is no longer necessary, now that layer norm does accumulation in fp32 in pytorch, even if the input is fp16: https://github.com/pytorch/pytorch/issues/66707 Do infernece by casting model to bfloat16, not by using AMP. Do feature preprocessing in float32 for accuracy. Warn if someone tries to input a non-float32 tensor. Always create the output in the type the rest of the model expects. Sort manifests by duration. Signed-off-by: Daniel Galvez <[email protected]> * Always cast softmax inputs to float32 when in training mode. While we don't need this for accurate results in b/float16, this is a safety precaution to make sure that training accuracy does not regress. Signed-off-by: Daniel Galvez <[email protected]> --------- Signed-off-by: Daniel Galvez <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Huvu/rag pipeline citest (#9384) * huvu/NeMo_rag_citest first commit * adding llama-index to dependency * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjusting data/models path in ci-test to dependency * putting llama-index to optional * update cicd-main.yml --------- Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Marc Romeyn <[email protected]> * Re-org export code (#9353) * reorg the export code Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * replaced log with raise Signed-off-by: Onur Yilmaz <[email protected]> * add converter and loader folders Signed-off-by: Onur Yilmaz <[email protected]> * move nemo_ckpt_convert into the converter folder Signed-off-by: Onur Yilmaz <[email protected]> * move nemo_file into loader folder Signed-off-by: Onur Yilmaz <[email protected]> * reorg converter Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * continue to reorg converter Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * continue to reorg Signed-off-by: Onur Yilmaz <[email protected]> * move nemo file back into nemo folder Signed-off-by: Onur Yilmaz <[email protected]> * renamed nemo folder to nemo_ckpt_loader Signed-off-by: Onur Yilmaz <[email protected]> * remove unused function Signed-off-by: Onur Yilmaz <[email protected]> * removed nemo file Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * moved a function to tensorrt_llm_run file Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * Remove unused imports Signed-off-by: Onur Yilmaz <[email protected]> * Apply isort and black reformatting Signed-off-by: oyilmaz-nvidia <[email protected]> * import csv added Signed-off-by: Onur Yilmaz <[email protected]> --------- Signed-off-by: Onur Yilmaz <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * ci: Fix 
`L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_CitriNet_with_wav` (#9399) Signed-off-by: Oliver Koenig <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * disable overlap for qkv (#9079) * disable overlap for qkv (#9072) * disable overlap for qkv Signed-off-by: Rachit Garg <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Rachit Garg <[email protected]> Co-authored-by: Rachit Garg <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: michal2409 <[email protected]> --------- Signed-off-by: Rachit Garg <[email protected]> Signed-off-by: michal2409 <[email protected]> Signed-off-by: Michal Futrega <[email protected]> Co-authored-by: Rachit Garg <[email protected]> Co-authored-by: Rachit Garg <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Michal Futrega <[email protected]> Co-authored-by: michal2409 <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Fix circular import for MM dataprep notebook (#9287) (#9292) * update launcher name and fix mm circular import * Apply isort and black reformatting --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: Chen Cui <[email protected]> Co-authored-by: cuichenx <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * add check if num layers is divisible by pp size (#9208) (#9298) * add check if num_layers % pp == 0 * Apply isort and black reformatting * move num_layers / pp check to build_transformer_config --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: Dmytro Pykhtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Add HF siglip vision encoder (#9185) * temp save Signed-off-by: yaoyu-33 <[email protected]> * temp save 2 Signed-off-by: yaoyu-33 <[email protected]> * update code Signed-off-by: yaoyu-33 <[email protected]> * enable seq packing Signed-off-by: yaoyu-33 <[email protected]> * fix neva and clip Signed-off-by: yaoyu-33 <[email protected]> * Enable parallel seq packing algo and few other fixes Signed-off-by: yaoyu-33 <[email protected]> * Pipeline parallel support Signed-off-by: yaoyu-33 <[email protected]> * Update data preprocess Signed-off-by: yaoyu-33 <[email protected]> * fix few pp issues Signed-off-by: yaoyu-33 <[email protected]> * enable sequence packing w/ PP Signed-off-by: yaoyu-33 <[email protected]> * Fix cu_seqlens in inputs Signed-off-by: yaoyu-33 <[email protected]> * add assert Signed-off-by: yaoyu-33 <[email protected]> * Depend on PP to decide whether do padding Signed-off-by: yaoyu-33 <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add docstring Signed-off-by: yaoyu-33 <[email protected]> * Fix few evaluation issues Signed-off-by: yaoyu-33 <[email protected]> * Fix few PP evaluation issues Signed-off-by: yaoyu-33 <[email protected]> * Address comments Signed-off-by: yaoyu-33 <[email protected]> * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * add llama3 template Signed-off-by: yaoyu-33 <[email protected]> * address comments Signed-off-by: yaoyu-33 <[email protected]> * Fix license Signed-off-by: yaoyu-33 <[email protected]> * Fix llama3 Signed-off-by: yaoyu-33 <[email protected]> * Few fixes Signed-off-by: yaoyu-33 <[email protected]> * Few neva bugs Signed-off-by: yaoyu-33 <[email protected]> * Few neva bugs Signed-off-by: yaoyu-33 <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Few neva bugs Signed-off-by: yaoyu-33 <[email protected]> * llama3 inference fix Signed-off-by: yaoyu-33 <[email protected]> * Force vision encoder to run in fp32 Signed-off-by: yaoyu-33 <[email protected]> * Revert "Force vision encoder to run in fp32" This reverts commit 9d2160d96cb3e2a27a18538950ef43b4482c04da. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Try adding distributed format of checkpoint Signed-off-by: yaoyu-33 <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Allow dist checkpoint to be non-strict Signed-off-by: yaoyu-33 <[email protected]> * Fix Signed-off-by: yaoyu-33 <[email protected]> * Some fixes for PP + dist ckpt in Neva Signed-off-by: yaoyu-33 <[email protected]> * fix peft Signed-off-by: yaoyu-33 <[email protected]> * few fixes for lora Signed-off-by: yaoyu-33 <[email protected]> * checkpoint updates Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * bug fix Signed-off-by: yaoyu-33 <[email protected]> * Add HF siglip vision encoder Signed-off-by: HuiyingLi <[email protected]> * handle steerlm label in nv_dpo template Signed-off-by: HuiyingLi <[email protected]> * Add neva dist checkpoint converter Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix CLEAN RESPONSE logic to not use last EOS Signed-off-by: HuiyingLi <[email protected]> * strip extra_id_1 from clean response Signed-off-by: HuiyingLi <[email protected]> * change inference time image processor Signed-off-by: HuiyingLi <[email protected]> * resolve comments Signed-off-by: yaoyu-33 <[email protected]> * remove open_clip vision encoder for siglip Signed-off-by: HuiyingLi <[email protected]> * update neva dist ckpt apis Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix return Signed-off-by: yaoyu-33 <[email protected]> * resolve CLEAN RESPONSE multiturn issue Signed-off-by: HuiyingLi <[email protected]> * code format Signed-off-by: HuiyingLi <[email protected]> * fixes for isort Signed-off-by: HuiyingLi <[email protected]> * refac image processor loading to util Signed-off-by: HuiyingLi <[email protected]> * black and isort Signed-off-by: HuiyingLi <[email protected]> * move crop size assertion Signed-off-by: HuiyingLi <[email protected]> * few neva fixes Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: HuiyingLi <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: HuiyingLi <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: yaoyu-33 
<[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Co-authored-by: Pablo Garay <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * [Nemo CICD] timeouts fix (#9407) * timeouts fix * timeouts fix Signed-off-by: Marc Romeyn <[email protected]> * Removing un-used ModelConfig class (#9389) Co-authored-by: Chen Cui <[email protected]> Signed-off-by: Marc Romeyn <[email protected]> * Extend multimodal/speech_llm with lhotse, t5 and bestow supports (#9169) * Fixes * Docs fix * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support distributed_fused_adam Signed-off-by: zhehuaichen <[email protected]> * support distributed_fused_adam Signed-off-by: zhehuaichen <[email protected]> * Add support for sharded NeMo manifest files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support megatron_amp_O2 Signed-off-by: zhehuaichen <[email protected]> * Support heterogeneous sampling rates in non tarred NeMo manifests * migrate to PTL2.0 Signed-off-by: stevehuang52 <[email protected]> * clean up Signed-off-by: stevehuang52 <[email protected]> * update manifest util Signed-off-by: stevehuang52 <[email protected]> * Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * agg and normal tokenizers actually work * Support weights for NeMo tarred manifests * Temporarily hardcoded pnc stripping/lowercasing * fix * make pnc hack configurable from the config and disabled by default * fix the hack * migrate to ptl2.1 to support multiple dataloaders Signed-off-by: stevehuang52 <[email protected]> * support encoder overwrite Signed-off-by: zhehuaichen <[email protected]> * update misc Signed-off-by: stevehuang52 <[email protected]> * fix eval and clean up Signed-off-by: stevehuang52 <[email protected]> * support add_sep for perception model Signed-off-by: zhehuaichen <[email protected]> * fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803 Signed-off-by: zhehuaichen <[email protected]> * add_bos Signed-off-by: zhehuaichen <[email protected]> * Transformer decoder with conditioning for canary (#8091) * initial commit for multi-task conf-enc transf-dec for canary Signed-off-by: Krishna Puvvada <[email protected]> * removing decoder states caching during training Signed-off-by: Krishna Puvvada <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Krishna Puvvada <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Option to limit the number of open streams (#8095) * audio signal support in multi Signed-off-by: zhehuaichen <[email protected]> * update asr evaluator Signed-off-by: stevehuang52 <[email 
protected]> * fix from https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397 and https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa Signed-off-by: zhehuaichen <[email protected]> * transcribe fn for Canary models (#8110) * improve readability Signed-off-by: Krishna Puvvada <[email protected]> * adding context in transcribe function for ConfTransfModels Signed-off-by: Krishna Puvvada <[email protected]> * supporting relative paths in transcribe function for canary Signed-off-by: Krishna Puvvada <[email protected]> * removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference Signed-off-by: Krishna Puvvada <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Krishna Puvvada <[email protected]> Co-authored-by: Krishna Puvvada <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update for evaluation Signed-off-by: stevehuang52 <[email protected]> * update for eval Signed-off-by: stevehuang52 <[email protected]> * update for evaluation Signed-off-by: stevehuang52 <[email protected]> * fix bleu Signed-off-by: stevehuang52 <[email protected]> * fix typo Signed-off-by: stevehuang52 <[email protected]> * Add missing audio_filepath validation for Canary (#8119) * Add missing audio_filepath validation for Canary * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add default concat_sampling_probabilities Signed-off-by: zhehuaichen <[email protected]> * support lhotse dataset in speechllm Signed-off-by: zhehuaichen <[email protected]> * bypass get_iterator_k_split Signed-off-by: zhehuaichen <[email protected]> * tmp fix Signed-off-by: zhehuaichen <[email protected]> * try to use fixed batch with megatron Signed-off-by: zhehuaichen <[email protected]> * add batch logging Signed-off-by: zhehuaichen <[email protected]> * support unfrozen llm Signed-off-by: zhehuaichen <[email protected]> * Create README.md Signed-off-by: He Huang (Steve) <[email protected]> * Update README.md Signed-off-by: He Huang (Steve) <[email protected]> * Update README.md Signed-off-by: He Huang (Steve) <[email protected]> * update Signed-off-by: stevehuang52 <[email protected]> * rename Signed-off-by: stevehuang52 <[email protected]> * add llama prompt template Signed-off-by: zhehuaichen <[email protected]> * update and refactor Signed-off-by: stevehuang52 <[email protected]> * support sample alpha Signed-off-by: zhehuaichen <[email protected]> * support lhotse validation set and canary pretrained ckpt with pseudo label Signed-off-by: zhehuaichen <[email protected]> * make sure backward compatibility Signed-off-by: zhehuaichen <[email protected]> * remove pad Signed-off-by: zhehuaichen <[email protected]> * make sure asr_model is frozen Signed-off-by: zhehuaichen <[email protected]> * support greedy decoding Signed-off-by: zhehuaichen <[email protected]> * valid on lhotse Signed-off-by: zhehuaichen <[email protected]> * fix multi dataloader in val case for lhotse SALM; add default data names; keep asr model tokenizer by default to enable adding canary dataset Signed-off-by: zhehuaichen <[email protected]> * remove the bruteforce _keep_special_tokens 
implementation Signed-off-by: zhehuaichen <[email protected]> * decoding_ratio and convert_canary_prompt_to_text support Signed-off-by: zhehuaichen <[email protected]> * canary_tokens_augment_ratio Signed-off-by: zhehuaichen <[email protected]> * debug Signed-off-by: zhehuaichen <[email protected]> * bug fix Signed-off-by: zhehuaichen <[email protected]> * fix lhotse based eval of llama canary model Signed-off-by: zhehuaichen <[email protected]> * support some overwrite for eval Signed-off-by: zhehuaichen <[email protected]> * support zero shot prompt in training Signed-off-by: zhehuaichen <[email protected]> * support cross attention based SALM Signed-off-by: zhehuaichen <[email protected]> * support cross attention based SALM Signed-off-by: zhehuaichen <[email protected]> * fix for batch train/valid of cross Signed-off-by: zhehuaichen <[email protected]> * support learnable gate and plotting Signed-off-by: zhehuaichen <[email protected]> * support using pseudo label in prompt rather than cross att Signed-off-by: zhehuaichen <[email protected]> * bug fix for perception cfg and context tokens shift Signed-off-by: zhehuaichen <[email protected]> * DentityConnectorsAdd Signed-off-by: zhehuaichen <[email protected]> * fix ckpt saving Signed-off-by: zhehuaichen <[email protected]> * Support RnnGatedCrossAttention Signed-off-by: zhehuaichen <[email protected]> * add include_ffw and fix _optimizer_param_groups for all unfrozen run Signed-off-by: zhehuaichen <[email protected]> * support grad acc when using bucket Signed-off-by: zhehuaichen <[email protected]> * support TransformerCrossAttention Signed-off-by: zhehuaichen <[email protected]> * support ProjectTransformerCrossAttention Signed-off-by: zhehuaichen <[email protected]> * support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size Signed-off-by: zhehuaichen <[email protected]> * support question set on val without canary Signed-off-by: zhehuaichen <[email protected]> * support load_audio_encoder and wip in optim_param_groups Signed-off-by: zhehuaichen <[email protected]> * minor fix for audio pretrain model init Signed-off-by: zhehuaichen <[email protected]> * simplify canary_tokens_augment Signed-off-by: zhehuaichen <[email protected]> * use question in the manifest if it exists Signed-off-by: zhehuaichen <[email protected]> * support dataset weighting for non tar Signed-off-by: zhehuaichen <[email protected]> * Update SpeechLLM code (#8475) * add pleasefixme marker for potential failed nightly tests. (#7678) Signed-off-by: Xuesong Yang <[email protected]> * Add new text segmentation library for better TTS quality (#7645) * Add new text segmentation library for better TTS quality * Update zh_cn_pinyin.py added detailed instruction on how to install pkuseg. Signed-off-by: Xuesong Yang <[email protected]> * Update requirements_tts.txt remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need. 
Signed-off-by: Xuesong Yang <[email protected]> --------- Signed-off-by: Xuesong Yang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuesong Yang <[email protected]> * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774) * Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer * Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add '32-true' for precision values --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: Abhishree Thittenamane <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix(clustering_diarizer.py): fix typo (#7772) Signed-off-by: Jean-Louis Queguiner <[email protected]> * fix(diarization-README): typo (#7771) Signed-off-by: Jean-Louis Queguiner <[email protected]> * Fix bug wrt change decoding strategy for bpe models (#7762) (#7764) * Fix bug wrt change decoding strategy for bpe models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Remove incorrect extra argument for load_from_checkpoint_dir() (#7500) Signed-off-by: Robin Dong <[email protected]> Co-authored-by: Eric Harper <[email protected]> * Add nemo to mcore GPT conversion script (#7730) * add conversion script Signed-off-by: Chen Cui <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove references to 'ckpt' Signed-off-by: Chen Cui <[email protected]> * add one more sanity check to make sure there is no unexpected keys in state dict Signed-off-by: Chen Cui <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make cpu loading work Signed-off-by: Chen Cui <[email protected]> * make script work for llama2 models Signed-off-by: Chen Cui <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address code check Signed-off-by: Chen Cui <[email protected]> * remove trainer precision (was for old sanity check) Signed-off-by: Chen Cui <[email protected]> * fix script for llama2 model Signed-off-by: Chen Cui <[email protected]> * remove commented code Signed-off-by: Chen Cui <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> * Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785) Signed-off-by: anferico <[email protected]> * Add some docs and update scripts for ASR (#7790) * Add some docs and update scripts Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <[email protected]> 
Signed-off-by: Somshubra Majumdar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * set context for text memmap to fork (#7784) * set context for text memmap to fork Signed-off-by: arendu <[email protected]> * typo Signed-off-by: arendu <[email protected]> --------- Signed-off-by: arendu <[email protected]> * add training with multiple audios Signed-off-by: stevehuang52 <[email protected]> * Support flash decoding (#7744) * Add flash-decoding Signed-off-by: Cheng-Ping Hsieh <[email protected]> * Fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> --------- Signed-off-by: Cheng-Ping Hsieh <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang <[email protected]> * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761) * Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747) * Change accelerator to auto Signed-off-by: Abhishree <[email protected]> * Pass omegaconf object to trainer in nlp_checkpoint_port.py Signed-off-by: Abhishree <[email protected]> * Pass omegaconf object to trainer in export.py Signed-off-by: Abhishree <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Abhishree <[email protected]> * docs: fix typos (#7758) Signed-off-by: shuoer86 <[email protected]> Co-authored-by: Xuesong Yang <[email protected]> Signed-off-by: Abhishree <[email protected]> * Snake act (#7736) Signed-off-by: Abhishree <[email protected]> * Update gpt_dataset.py (#6963) Signed-off-by: Xin Yao <[email protected]> Co-authored-by: Sandeep Subramanian <[email protected]> Signed-off-by: Abhishree <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: shuoer86 <[email protected]> Signed-off-by: Xin Yao <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: shuoer86 <[email protected]> Co-authored-by: Xuesong Yang <[email protected]> Co-authored-by: Nithin Rao <[email protected]> Co-authored-by: Xin Yao <[email protected]> Co-authored-by: Sandeep Subramanian <[email protected]> * Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788) * add selection criteria for reference audios Signed-off-by: anferico <[email protected]> * Update configuration files Signed-off-by: anferico <[email protected]> * add informative comment in config files Signed-off-by: anferico <[email protected]> * sample random index for reference audio selection Signed-off-by: anferico <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: anferico <[email protected]> 
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update text server to support compute logprobs (#7733) * update text server to support compute logprobs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo --------- Signed-off-by: Zhilin Wang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add multi-layer feat extract and fix random question insertion Signed-off-by: stevehuang52 <[email protected]> * Configure MCore logger (#7781) Signed-off-by: Mikołaj Błaż <[email protected]> * Revert "PEFT eval fix (#7626) (#7638)" (#7693) This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9. * remove TN from ctc_segm tut (#7807) Signed-off-by: Evelina <[email protected]> * [TTS] Support audio offsets in TTS data loaders (#7156) * [TTS] Support audio offsets in TTS data loaders Signed-off-by: Ryan <[email protected]> * [TTS] Change docstring mentions of .pt to .npy Signed-off-by: Ryan <[email protected]> --------- Signed-off-by: Ryan <[email protected]> * Update Apex install command in Dockerfile (#7794) (#7804) * move core install to /workspace (#7706) * update apex install in dockerfile * use fetch head --------- Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: eharper <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> * fix typo Signed-off-by: stevehuang52 <[email protected]> * Nemo to HF converter for LLaMA model (#7770) * Create config_llama_truncate.yaml Signed-off-by: Utkarsh <[email protected]> * Add files via upload Signed-off-by: Utkarsh <[email protected]> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update config_llama_truncate.yaml Signed-off-by: Utkarsh <[email protected]> * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update convert_nemo_llama_to_hf.py Signed-off-by: Utkarsh <[email protected]> * clean up trainer * remove dependency on yaml config. load config from nemo file instead. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable ckpt saving into other precision formats * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support 70b + cleanup qkv slice logic * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bug * move hf model folder code from comment to function and add instruction to run * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Utkarsh <[email protected]> Signed-off-by: Chen Cui <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Chen Cui <[email protected]> * Save best NeMo model only when necessary (#7836) Signed-off-by: Ante Jukić <[email protected]> * add guard if its a distributed checkpoint (#7845) Signed-off-by: Gerald Shen <[email protected]> * Fix tn duplex (#7808) * fix duplex tn infer Signed-off-by: Evelina <[email protected]> * fix typo Signed-off-by: Evelina <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix TN docs Signed-off-by: Evelina <[email protected]> --------- Signed-off-by: Evelina <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update transformers cache on Jenkins (#7854) * update transformers cache Signed-off-by: eharper <[email protected]> * update Signed-off-by: eharper <[email protected]> * add cd Signed-off-by: eharper <[email protected]> --------- Signed-off-by: eharper <[email protected]> * Update README.rst for container update (#7844) Signed-off-by: fayejf <[email protected]> * Add support for finetuning with huggingface datasets (#7834) * add finetune with huggingface dataset Signed-off-by: stevehuang52 <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update yaml Signed-off-by: stevehuang52 <[email protected]> * update Signed-off-by: stevehuang52 <[email protected]> * update and refactor Signed-off-by: stevehuang52 <[email protected]> * add extrac hf text and update Signed-off-by: stevehuang52 <[email protected]> * update and refactor Signed-off-by: stevehuang52 <[email protected]> * move dataset dependency to common Signed-off-by: stevehuang52 <[email protected]> * add docstring Signed-off-by: stevehuang52 <[email protected]> * Add to Dics Signed-off-by: Nithin Rao Koluguri <nithinraok> * add ci test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add max steps in jenkins Signed-off-by: Nithin Rao Koluguri <nithinraok> * reduce max steps Signed-off-by: Nithin Rao Koluguri <nithinraok> * jenkins test Signed-off-by: Nithin Rao Koluguri <nithinraok> * add bs=2 Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: stevehuang52 <[email protected]> Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao <[email protected]> * Multimodal merge (#7728) * ControlNet TRT export * Final MR before release * SD2 update * Fixed export issue * Fix for instruct p2p and reformat * Fix SD export issue * Add nemo clip 
export for DB * Fix ins pix2pix * fix sd2 config * [Mingyuan Ma] BF16 and SD conversion script * [Imagen] NHWC Feature * Fix .nemo loading issue for NeMo CLIP in SD * NeMo r1.20.0 Multimodal Merge * fix the inductor issue in inference * Fix inductor loading .nemo issue * Add Neva Model Support * Imagen Optimizations * Neva inference code * NeMo TOT 1.21 to Internal/main * Update neva_inference.yaml * REBASING for latest code changes * Update internal/main to main tot * Parallel DDIM implementation * 1. Fixing indentation bug. (#7352) Signed-off-by: Micha Livne <[email protected]> * NeMo MCore llama2 support + MCore PEFT adapters (#7299) * start adding gpt from megatron core path Signed-off-by: ericharper <[email protected]> * set model parallel config Signed-off-by: ericharper <[email protected]> * use model parallel config object Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update args Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set vp size to none if it is 1 Signed-off-by: ericharper <[email protected]> * set vp size to none if it is 1 Signed-off-by: ericharper <[email protected]> * add TransformerConfig Signed-off-by: ericharper <[email protected]> * start updating to TransformerConfig Signed-off-by: ericharper <[email protected]> * add todo Signed-off-by: ericharper <[email protected]> * revert to model parallel config Signed-off-by: ericharper <[email protected]> * add hidden_size to model_parallel_config Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove imports Signed-off-by: ericharper <[email protected]> * revert Signed-off-by: ericharper <[email protected]> * remove import Signed-off-by: ericharper <[email protected]> * small clean up Signed-off-by: ericharper <[email protected]> * update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update module args Signed-off-by: ericharper <[email protected]> * add config obj to flash attention tests Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove args Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove sequence parallel arg Signed-off-by: ericharper <[email protected]> * update args Signed-off-by: ericharper <[email protected]> * add config to self Signed-off-by: ericharper <[email protected]> * update args Signed-off-by: ericharper <[email protected]> * update args Signed-off-by: ericharper <[email protected]> * update args Signed-off-by: ericharper <[email protected]> * add config to test Signed-off-by: ericharper <[email protected]> * get hidden_size from config Signed-off-by: ericharper <[email protected]> * add try except Signed-off-by: ericharper <[email protected]> * use default Signed-off-by: ericharper <[email protected]> * update config with hidden size Signed-off-by: ericharper <[email protected]> * remove arg Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci
* comment out jenkins test Signed-off-by: ericharper <[email protected]>
* revert import Signed-off-by: ericharper <[email protected]>
* build transformer config Signed-off-by: ericharper <[email protected]>
* add model to provider func Signed-off-by: ericharper <[email protected]>
* update forward and float16 wrapper Signed-off-by: ericharper <[email protected]>
* instantiate model parallel config after init model parallel Signed-off-by: ericharper <[email protected]>
* set virtual rank Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add GQA config to megatron gpt model (#7096)
* Add GQA config in gpt config file Signed-off-by: jasonwan <[email protected]>
* Verify mcore is enabled when using GQA Signed-off-by: jasonwan <[email protected]>
--------- Signed-off-by: jasonwan <[email protected]>
* revert Signed-off-by: ericharper <[email protected]>
* mcore llama2 ckpt conversion & small fix Signed-off-by: jasonwan <[email protected]>
* Add inference & sft config by Hongbin Co-authored-by: Hongbin Liu <[email protected]> Signed-off-by: jasonwan <[email protected]>
* fix config Signed-off-by: jasonwan <[email protected]>
* add inference param. update TP/PP script to support mcore gpt Signed-off-by: jasonwan <[email protected]>
* p-tuning Signed-off-by: jasonwan <[email protected]>
* modify ckpt conversion script (adding model cast) Signed-off-by: jasonwan <[email protected]>
* ckpt conversion use relative path for config Signed-off-by: jasonwan <[email protected]>
* start adding gpt from megatron core path Signed-off-by: ericharper <[email protected]>
* set model parallel config Signed-off-by: ericharper <[email protected]>
* use model parallel config object Signed-off-by: ericharper <[email protected]>
* update args Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* set vp size to none if it is 1 Signed-off-by: ericharper <[email protected]>
* set vp size to none if it is 1 Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add TransformerConfig Signed-off-by: ericharper <[email protected]>
* start updating to TransformerConfig Signed-off-by: ericharper <[email protected]>
* add todo Signed-off-by: ericharper <[email protected]>
* revert to model parallel config Signed-off-by: ericharper <[email protected]>
* add hidden_size to model_parallel_config Signed-off-by: ericharper <[email protected]>
* remove imports Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* remove import Signed-off-by: ericharper <[email protected]>
* small clean up Signed-off-by: ericharper <[email protected]>
* update hidden size in peft base model, add mcore commit to jenkins Signed-off-by: ericharper <[email protected]>
* update module args Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add config obj to flash attention tests Signed-off-by: ericharper <[email protected]>
* remove args Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* remove sequence parallel arg Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* update args Signed-off-by: ericharper <[email protected]>
* add config to self Signed-off-by: ericharper <[email protected]>
* update args Signed-off-by: ericharper <[email protected]>
* update args Signed-off-by: ericharper <[email protected]>
* update args Signed-off-by: ericharper <[email protected]>
* add config to test Signed-off-by: ericharper <[email protected]>
* get hidden_size from config Signed-off-by: ericharper <[email protected]>
* add try except Signed-off-by: ericharper <[email protected]>
* use default Signed-off-by: ericharper <[email protected]>
* update config with hidden size Signed-off-by: ericharper <[email protected]>
* remove arg Signed-off-by: ericharper <[email protected]>
* comment out jenkins test Signed-off-by: ericharper <[email protected]>
* revert import Signed-off-by: ericharper <[email protected]>
* remove optimizer_idx Signed-off-by: eharper <[email protected]>
* prefetch num microbatches Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* start adding gpt from megatron core path Signed-off-by: ericharper <[email protected]>
* set model parallel config Signed-off-by: ericharper <[email protected]>
* use model parallel config object Signed-off-by: ericharper <[email protected]>
* update args Signed-off-by: ericharper <[email protected]>
* fix for p-tuning sequence parallel Signed-off-by: jasonwan <[email protected]>
* support SFT/distOpt mcore (#7207)
* add inference param. update TP/PP script to support mcore gpt
* p-tuning Signed-off-by: jasonwan <[email protected]>
* change layer names for SFT Signed-off-by: Hongbin Liu <[email protected]>
* fix bug in SFT Signed-off-by: Hongbin Liu <[email protected]>
--------- Signed-off-by: jasonwan <[email protected]> Signed-off-by: Hongbin Liu <[email protected]> Co-authored-by: Hongbin Liu <[email protected]> Co-authored-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* start updating to TransformerConfig Signed-off-by: ericharper <[email protected]>
* revert to model parallel config Signed-off-by: ericharper <[email protected]>
* add hidden_size to model_parallel_config Signed-off-by: ericharper <[email protected]>
* remove imports Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* update module args Signed-off-by: ericharper <[email protected]>
* add config to self Signed-off-by: ericharper <[email protected]>
* build transformer config Signed-off-by: ericharper <[email protected]>
* add model to provider func Signed-off-by: ericharper <[email protected]>
* update forward and float16 wrapper Signed-off-by: ericharper <[email protected]>
* instantiate model parallel config after init model parallel Signed-off-by: ericharper <[email protected]>
* set virtual rank Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add GQA config to megatron gpt model (#7096)
* Add GQA config in gpt config file Signed-off-by: jasonwan <[email protected]>
* Verify mcore is enabled when using GQA Signed-off-by: jasonwan <[email protected]>
--------- Signed-off-by: jasonwan <[email protected]>
* revert Signed-off-by: ericharper <[email protected]>
* remove import Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* rollback model cast for p-tuning Signed-off-by: jasonwan <[email protected]>
* update for dist adam Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* use get_gpt_module_list Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* update ckpt conversion script Signed-off-by: jasonwan <[email protected]>
* ptl2.0 patch for llama config Signed-off-by: jasonwan <[email protected]>
* add plugins to trainer in scripts Signed-off-by: jasonwan <[email protected]>
* fix activation checkpointing mcore Signed-off-by: jasonwan <[email protected]>
* fix variable names Signed-off-by: jasonwan <[email protected]>
* overwrite normalization type for mcore/te Signed-off-by: jasonwan <[email protected]>
* Update megatron_llama_sft.yaml Signed-off-by: Jason Wang <[email protected]>
* add PEFT adapter support for mcore gpt path (#7276)
* implementation for mcore adapter/mxins Signed-off-by: jasonwan <[email protected]>
* small fix for lora and ptuning Signed-off-by: jasonwan <[email protected]>
* support layerwise peft Signed-off-by: jasonwan <[email protected]>
* support multiple target layers Signed-off-by: jasonwan <[email protected]>
* support lora GQA Signed-off-by: jasonwan <[email protected]>
* support amp O2 Signed-off-by: jasonwan <[email protected]>
* revert & more O2 fix Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* lora inject to attention Signed-off-by: jasonwan <[email protected]>
* support …
1 parent 8c58e13 commit f375d51

File tree

2 files changed: +64 -18 lines changed

nemo/lightning/megatron_parallel.py (+20)

@@ -24,6 +24,7 @@

import torch
import torch.distributed
+from megatron.core.distributed import DistributedDataParallelConfig
from torch import Tensor, nn

DataT = TypeVar("DataT", Tensor, Dict[str, Tensor], Sequence[Tensor])

@@ -105,6 +106,7 @@ def __init__(
        forward_step: Optional[Callable[[nn.Module, DataT], Tensor]] = None,
        loss_reduction: Optional[Callable[[nn.Module], "MegatronLossReduction"]] = None,
        vp_size: Optional[int] = None,
+       ddp_config: Optional[DistributedDataParallelConfig] = None,
        cpu: bool = False,
    ) -> None:
        from apex.transformer.tensor_parallel.layers import set_defaults_if_not_set_tensor_model_parallel_attributes

@@ -130,6 +132,23 @@ def __init__(
            _model.configure_model()
            _pipeline.append(_model)

+       if isinstance(ddp_config, DistributedDataParallelConfig):
+           from megatron.core.distributed import DistributedDataParallel as McoreDDP
+
+           _pipeline = [
+               McoreDDP(
+                   model_chunk.config,
+                   ddp_config,
+                   model_chunk,
+                   data_parallel_group=parallel_state.get_data_parallel_group(with_context_parallel=True),
+                   expert_data_parallel_group=parallel_state.get_data_modulo_expert_parallel_group(),
+                   # Turn off bucketing for model_chunk 2 onwards, since communication for these
+                   # model chunks is overlapped with compute anyway.
+                   disable_bucketing=(model_chunk_idx > 0),
+               )
+               for (model_chunk_idx, model_chunk) in enumerate(_pipeline)
+           ]
+
        for i, model_module in enumerate(_pipeline):
            if not cpu:
                model_module.cuda(torch.cuda.current_device())

@@ -162,6 +181,7 @@ def __init__(
        self.data_step = data_step or default_data_step
        self.forward_step = forward_step or default_forward_step
        self.loss_reduction: MegatronLossReduction = loss_reduction
+       self.ddp_config = ddp_config

    def forward(
        self,
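To make the new hook concrete, here is a minimal, hypothetical sketch of building a Megatron-core DDP config that could be passed through the new ddp_config argument. The specific DistributedDataParallelConfig fields shown are assumptions that can differ between megatron-core releases; only the ddp_config parameter itself comes from this diff.

# Hypothetical sketch only: the field names below are assumptions and may differ
# between megatron-core versions.
from megatron.core.distributed import DistributedDataParallelConfig

ddp_config = DistributedDataParallelConfig(
    grad_reduce_in_fp32=True,         # assumed field: keep the gradient all-reduce in fp32
    overlap_grad_reduce=False,        # assumed field: overlap grad reduction with backward compute
    use_distributed_optimizer=False,  # assumed field: pair with mcore's distributed optimizer
)

# Passing an object like this (instead of None) makes MegatronParallel wrap every model
# chunk in megatron.core.distributed.DistributedDataParallel, as shown in the hunk above.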

nemo/lightning/pytorch/strategies.py (+44 -18)

@@ -4,13 +4,14 @@
from collections import OrderedDict
from contextlib import ExitStack
from pathlib import Path
-from typing import TYPE_CHECKING, Any, ContextManager, Dict, List, Mapping, Optional, TypeVar, Union, cast
+from typing import TYPE_CHECKING, Any, ContextManager, Dict, List, Literal, Mapping, Optional, TypeVar, Union, cast

import pytorch_lightning as pl
import torch
import torch.distributed
from lightning_fabric.plugins import CheckpointIO, ClusterEnvironment
from lightning_fabric.utilities.optimizer import _optimizers_to_device
+from megatron.core.distributed import DistributedDataParallelConfig
from pytorch_lightning.accelerators import CPUAccelerator
from pytorch_lightning.callbacks.progress import TQDMProgressBar
from pytorch_lightning.loops import _AutomaticOptimization, evaluation_loop, fit_loop, prediction_loop

@@ -38,6 +39,9 @@
ConfigT = TypeVar("ConfigT")


+DDPLiteral = Literal["megatron", "pytorch"]
+
+
class MegatronStrategy(DDPStrategy, io.IOMixin):
    """Megatron plugin for Pytorch Lightning.

@@ -58,11 +62,11 @@ def __init__(
        parallel_devices: Optional[List[torch.device]] = None,
        cluster_environment=None, # TODO: Add type-hint
        checkpoint_io=None, # TODO: Add type-hint
-       no_ddp_communication_hook: bool = True,
        find_unused_parameters: bool = False,
        enable_nemo_ckpt_io: bool = True,
        ckpt_type: TrainerCkptProtocol = TrainerCheckpoint,
        ckpt_include_optimizer: bool = False,
+       ddp: Union[DDPLiteral, DistributedDataParallelConfig] = "megatron",
        lazy_init: bool = False,
        **kwargs,
    ) -> None:

@@ -73,7 +77,7 @@ def __init__(
            find_unused_parameters=find_unused_parameters,
            **kwargs,
        )
-       self.no_ddp_communication_hook = no_ddp_communication_hook
+
        self.megatron_callbacks = CallbackConnector()
        self.data_sampler: Optional['DataSampler'] = data_sampler
        self.tensor_model_parallel_size = tensor_model_parallel_size

@@ -85,6 +89,16 @@ def __init__(
        self.lazy_init = lazy_init
        self.ckpt_include_optimizer = ckpt_include_optimizer

+       if ddp == "megatron":
+           self.ddp_config = DistributedDataParallelConfig()
+       elif isinstance(ddp, DistributedDataParallelConfig):
+           self.ddp_config = ddp
+       elif ddp == "pytorch":
+           self.ddp_config = None
+           self.no_ddp_communication_hook = False
+       else:
+           raise ValueError(f"Invalid DDP type: {ddp}")
+
        # used in NVIDIA NGC PyTorch containers
        _strategy_lib.enable_nvidia_optimizations()
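As a usage illustration, the new ddp argument can be either a literal string or a full config object. The sketch below assumes the import path nemo.lightning.pytorch.strategies (the file touched here) and default values for the strategy's other arguments; the Trainer wiring is illustrative, not part of this commit.

import pytorch_lightning as pl
from megatron.core.distributed import DistributedDataParallelConfig
from nemo.lightning.pytorch.strategies import MegatronStrategy

# Default: Megatron-core DDP with a default-constructed config.
strategy = MegatronStrategy(ddp="megatron")

# Fall back to PyTorch's native DistributedDataParallel wrapping:
#   strategy = MegatronStrategy(ddp="pytorch")
# Or hand over an explicit config (field names depend on the megatron-core version):
#   strategy = MegatronStrategy(ddp=DistributedDataParallelConfig())

trainer = pl.Trainer(strategy=strategy, accelerator="gpu", devices=8)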

@@ -153,6 +167,9 @@ def setup(self, trainer: pl.Trainer) -> None:

        # set up optimizers after the wrapped module has been moved to the device
        self.setup_optimizers(trainer)
+
+       # TODO: Throw an execption if we have a mcore optimizer and no ddp_config
+
        if hasattr(self.precision_plugin, "convert_optimizer"):
            _optimizers = [*self.optimizers]
            _optimizers[0] = self.precision_plugin.convert_optimizer(self.optimizers[0])
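One way the TODO in this hunk could eventually be realized is sketched below; it is purely hypothetical, and the way the mcore/distributed optimizer is detected here is a placeholder, since this commit does not define the actual guard.

# Purely hypothetical guard for the TODO above (not part of this commit).
def _assert_ddp_config_for_mcore_optimizer(self) -> None:
    for optimizer in getattr(self, "optimizers", []):
        # Placeholder detection: adjust to however the mcore distributed optimizer
        # is actually exposed in NeMo.
        if "Distributed" in type(optimizer).__name__ and self.ddp_config is None:
            raise ValueError(
                "A distributed (mcore) optimizer requires ddp='megatron' or an "
                "explicit DistributedDataParallelConfig on MegatronStrategy."
            )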
@@ -204,6 +221,7 @@ def setup_megatron_parallel(self, trainer: pl.Trainer) -> None:
            precision_plugin=self.precision_plugin,
            vp_size=self.virtual_pipeline_model_parallel_size,
            cpu=isinstance(trainer.accelerator, CPUAccelerator),
+           ddp_config=self.ddp_config,
        )
        self.model = self.megatron_parallel
        self.model.trainer = trainer

@@ -212,6 +230,10 @@ def setup_megatron_parallel(self, trainer: pl.Trainer) -> None:
            self.model = self.precision_plugin.convert_module(self.model)
        self.model.callbacks.add(getattr(trainer, "callbacks"))

+       if hasattr(self, "optimizers") and self.optimizers:
+           for optimizer in self.optimizers:
+               self.model.callbacks.add(optimizer)
+
        if self.data_sampler:
            self.model.callbacks.add(self.data_sampler)

@@ -223,10 +245,11 @@ def setup_megatron_parallel(self, trainer: pl.Trainer) -> None:
    def configure_ddp(self) -> None:
        logging.debug(f"{self.__class__.__name__}: configuring MegatronParallel")
        self.model = self._setup_model(self.model)
-       self._register_ddp_hooks()
+       if self.ddp_config is None:
+           self._register_ddp_hooks()

    @override
-   def _setup_model(self, model: nn.Module) -> DistributedDataParallel:
+   def _setup_model(self, model: nn.Module) -> nn.Module:
        """Only called when we need to wrap the model for pytorch's ddp."""
        from megatron.core import parallel_state

@@ -236,16 +259,19 @@ def _setup_model(self, model: nn.Module) -> DistributedDataParallel:
        if app_state.model_parallel_size is not None:
            self._ddp_kwargs["process_group"] = parallel_state.get_data_parallel_group()

-       dist_data_parallel: DistributedDataParallel = super()._setup_model(model)
-       if self.no_ddp_communication_hook:
-           # When using custom gradient accumulation and allreduce, disable
-           # DDP communication hook that works on the gradient bucket.
-           # Instead, use the custom gradient function and communication hook,
-           # which is defined in the master optimizer wrapper.
-           dist_data_parallel.require_backward_grad_sync = False
-           dist_data_parallel.register_comm_hook(None, noop_hook)
+       # Only wrap the model if we are not using Megatron's DDP
+       if not self.ddp_config:
+           dist_data_parallel: DistributedDataParallel = super()._setup_model(model)
+           if self.no_ddp_communication_hook:
+               # When using custom gradient accumulation and allreduce, disable
+               # DDP communication hook that works on the gradient bucket.
+               # Instead, use the custom gradient function and communication hook,
+               # which is defined in the master optimizer wrapper.
+               dist_data_parallel.require_backward_grad_sync = False
+               dist_data_parallel.register_comm_hook(None, noop_hook)
+           model = dist_data_parallel

-       return dist_data_parallel
+       return model

    def _setup_parallel_ranks(self) -> None:
        self.set_world_ranks()
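For readers unfamiliar with the PyTorch branch that the "if not self.ddp_config" check preserves, this standalone sketch shows what disabling DDP's communication hook means in plain PyTorch. The toy model and the assumption that torch.distributed is already initialized are placeholders, not code from this commit.

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel
from torch.distributed.algorithms.ddp_comm_hooks.debugging_hooks import noop_hook

# Assumes torch.distributed.init_process_group(...) has already run elsewhere.
model = nn.Linear(16, 16).cuda()
ddp_model = DistributedDataParallel(model)

# Mirror the "pytorch" DDP path above: skip DDP's own bucketed all-reduce so a
# custom optimizer wrapper can drive gradient communication instead.
ddp_model.require_backward_grad_sync = False
ddp_model.register_comm_hook(None, noop_hook)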
@@ -260,7 +286,7 @@ def training_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:
        kwargs = self._update_step_kwargs(dataloader_iter, kwargs, "training")

        with self.precision_plugin.train_step_context(): # TODO: Do we need this?
-           return self.model(dataloader_iter, *args, **kwargs)
+           return self.model(dataloader_iter, forward_only=False, *args, **kwargs)

    @override
    def validation_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:

@@ -269,7 +295,7 @@ def validation_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:
        kwargs = self._update_step_kwargs(dataloader_iter, kwargs, "validation")

        with self.precision_plugin.val_step_context(): # TODO: Do we need this?
-           return self.model(dataloader_iter, *args, **kwargs)
+           return self.model(dataloader_iter, forward_only=True, *args, **kwargs)

    @override
    def test_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:

@@ -278,7 +304,7 @@ def test_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:
        kwargs = self._update_step_kwargs(dataloader_iter, kwargs, "test")

        with self.precision_plugin.test_step_context(): # TODO: Do we need this?
-           return self.model(dataloader_iter, *args, **kwargs)
+           return self.model(dataloader_iter, forward_only=True, *args, **kwargs)

    @override
    def predict_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:

@@ -287,7 +313,7 @@ def predict_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:
        kwargs = self._update_step_kwargs(dataloader_iter, kwargs, "predict")

        with self.precision_plugin.predict_step_context(): # TODO: Do we need this?
-           return self.model(dataloader_iter, *args, **kwargs)
+           return self.model(dataloader_iter, forward_only=True, *args, **kwargs)

    @override
    def teardown(self) -> None:
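The forward_only flag threaded into these step calls follows the usual Megatron convention of separating training passes from evaluation-style passes. The tiny sketch below is illustrative only and does not reproduce MegatronParallel's actual forward behaviour beyond what the diff shows.

import torch

# Illustrative only: training keeps autograd state, evaluation-style calls do not.
def run_step(model, dataloader_iter, forward_only: bool):
    if forward_only:
        with torch.no_grad():  # validation/test/predict path (forward_only=True)
            return model(dataloader_iter, forward_only=True)
    return model(dataloader_iter, forward_only=False)  # training path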
