
Commit 94bd346

stevehuang52, XuesongYang, RobinDong, pre-commit-ci[bot], and github-actions[bot] authored
Update SpeechLLM code (#8475)
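One entry squashed into this commit, "set context for text memmap to fork (#7784)", concerns the multiprocessing start method used by NeMo's text memmap dataset, most likely for the worker pool that builds its index files. The sketch below shows only the general standard-library pattern of requesting the "fork" start method explicitly. It is not NeMo's code: `build_index` and `build_all_indices` are hypothetical stand-ins, and "fork" is available only on POSIX systems.

```python
from __future__ import annotations

import multiprocessing as mp


def build_index(path: str) -> tuple[str, int]:
    """Hypothetical stand-in for building one file's index.

    It only returns the path and its length; a real implementation would
    scan the file and record newline offsets.
    """
    return path, len(path)


def build_all_indices(paths: list[str], workers: int = 2) -> list[tuple[str, int]]:
    # Ask for the "fork" start method explicitly rather than relying on the
    # default, which differs across platforms and Python versions. Forked
    # workers start from a copy of the parent's memory, so the main module
    # is not re-imported in each worker. "fork" is POSIX-only.
    ctx = mp.get_context("fork")
    with ctx.Pool(workers) as pool:
        # Pool.map returns results in the same order as the inputs.
        return pool.map(build_index, paths)


if __name__ == "__main__":
    print(build_all_indices(["train.jsonl", "val.jsonl"]))
    # [('train.jsonl', 11), ('val.jsonl', 9)]
```

Using a context object instead of `mp.set_start_method` keeps the choice local to this one pool and leaves the process-wide default untouched.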
* add pleasefixme marker for potential failed nightly tests. (#7678)
Signed-off-by: Xuesong Yang <[email protected]>
* Add new text segmentation library for better TTS quality (#7645)
* Add new text segmentation library for better TTS quality
* Update zh_cn_pinyin.py
added detailed instruction on how to install pkuseg.
Signed-off-by: Xuesong Yang <[email protected]>
* Update requirements_tts.txt
remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need it.
Signed-off-by: Xuesong Yang <[email protected]>
---------
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>
* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774)
* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer
* Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add '32-true' for precision values
---------
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(clustering_diarizer.py): fix typo (#7772)
Signed-off-by: Jean-Louis Queguiner <[email protected]>
* fix(diarization-README): typo (#7771)
Signed-off-by: Jean-Louis Queguiner <[email protected]>
* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)
* Fix bug wrt change decoding strategy for bpe models
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)
Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
* Add nemo to mcore GPT conversion script (#7730)
* add conversion script
Signed-off-by: Chen Cui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove references to 'ckpt'
Signed-off-by: Chen Cui <[email protected]>
* add one more sanity check to make sure there are no unexpected keys in state dict
Signed-off-by: Chen Cui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* make cpu loading work
Signed-off-by: Chen Cui <[email protected]>
* make script work for llama2 models
Signed-off-by: Chen Cui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* address code check
Signed-off-by: Chen Cui <[email protected]>
* remove trainer precision (was for old sanity check)
Signed-off-by: Chen Cui <[email protected]>
* fix script for llama2 model
Signed-off-by: Chen Cui <[email protected]>
* remove commented code
Signed-off-by: Chen Cui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)
Signed-off-by: anferico <[email protected]>
* Add some docs and update scripts for ASR (#7790)
* Add some docs and update scripts
Signed-off-by: smajumdar <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* set context for text memmap to fork (#7784)
* set context for text memmap to fork
Signed-off-by: arendu <[email protected]>
* typo
Signed-off-by: arendu <[email protected]>
---------
Signed-off-by: arendu <[email protected]>
* add training with multiple audios
Signed-off-by: stevehuang52 <[email protected]>
* Support flash decoding (#7744)
* Add flash-decoding
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Fix
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
---------
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>
* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)
* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)
* Change accelerator to auto
Signed-off-by: Abhishree <[email protected]>
* Pass omegaconf object to trainer in nlp_checkpoint_port.py
Signed-off-by: Abhishree <[email protected]>
* Pass omegaconf object to trainer in export.py
Signed-off-by: Abhishree <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>
* docs: fix typos (#7758)
Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>
* Snake act (#7736)
Signed-off-by: Abhishree <[email protected]>
* Update gpt_dataset.py (#6963)
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>
---------
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)
* add selection criteria for reference audios
Signed-off-by: anferico <[email protected]>
* Update configuration files
Signed-off-by: anferico <[email protected]>
* add informative comment in config files
Signed-off-by: anferico <[email protected]>
* sample random index for reference audio selection
Signed-off-by: anferico <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* update text server to support compute logprobs (#7733)
* update text server to support compute logprobs
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix typo
---------
Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add multi-layer feat extract and fix random question insertion
Signed-off-by: stevehuang52 <[email protected]>
* Configure MCore logger (#7781)
Signed-off-by: Mikołaj Błaż <[email protected]>
* Revert "PEFT eval fix (#7626) (#7638)" (#7693)
This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9.
* remove TN from ctc_segm tut (#7807)
Signed-off-by: Evelina <[email protected]>
* [TTS] Support audio offsets in TTS data loaders (#7156)
* [TTS] Support audio offsets in TTS data loaders
Signed-off-by: Ryan <[email protected]>
* [TTS] Change docstring mentions of .pt to .npy
Signed-off-by: Ryan <[email protected]>
---------
Signed-off-by: Ryan <[email protected]>
* Update Apex install command in Dockerfile (#7794) (#7804)
* move core install to /workspace (#7706)
* update apex install in dockerfile
* use fetch head
---------
Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>
* fix typo
Signed-off-by: stevehuang52 <[email protected]>
* Nemo to HF converter for LLaMA model (#7770)
* Create config_llama_truncate.yaml
Signed-off-by: Utkarsh <[email protected]>
* Add files via upload
Signed-off-by: Utkarsh <[email protected]>
* Update convert_nemo_llama_to_hf.py
Signed-off-by: Utkarsh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update config_llama_truncate.yaml
Signed-off-by: Utkarsh <[email protected]>
* Update convert_nemo_llama_to_hf.py
Signed-off-by: Utkarsh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update convert_nemo_llama_to_hf.py
Signed-off-by: Utkarsh <[email protected]>
* clean up trainer
* remove dependency on yaml config. load config from nemo file instead.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* enable ckpt saving into other precision formats
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* support 70b + cleanup qkv slice logic
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix bug
* move hf model folder code from comment to function and add instruction to run
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
* Save best NeMo model only when necessary (#7836)
Signed-off-by: Ante Jukić <[email protected]>
* add guard if it's a distributed checkpoint (#7845)
Signed-off-by: Gerald Shen <[email protected]>
* Fix tn duplex (#7808)
* fix duplex tn infer
Signed-off-by: Evelina <[email protected]>
* fix typo
Signed-off-by: Evelina <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix TN docs
Signed-off-by: Evelina <[email protected]>
---------
Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update transformers cache on Jenkins (#7854)
* update transformers cache
Signed-off-by: eharper <[email protected]>
* update
Signed-off-by: eharper <[email protected]>
* add cd
Signed-off-by: eharper <[email protected]>
---------
Signed-off-by: eharper <[email protected]>
* Update README.rst for container update (#7844)
Signed-off-by: fayejf <[email protected]>
* Add support for finetuning with huggingface datasets (#7834)
* add finetune with huggingface dataset
Signed-off-by: stevehuang52 <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update yaml
Signed-off-by: stevehuang52 <[email protected]>
* update
Signed-off-by: stevehuang52 <[email protected]>
* update and refactor
Signed-off-by: stevehuang52 <[email protected]>
* add extrac hf text and update
Signed-off-by: stevehuang52 <[email protected]>
* update and refactor
Signed-off-by: stevehuang52 <[email protected]>
* move dataset dependency to common
Signed-off-by: stevehuang52 <[email protected]>
* add docstring
Signed-off-by: stevehuang52 <[email protected]>
* Add to Docs
Signed-off-by: Nithin Rao Koluguri <nithinraok>
* add ci test
Signed-off-by: Nithin Rao Koluguri <nithinraok>
* add max steps in jenkins
Signed-off-by: Nithin Rao Koluguri <nithinraok>
* reduce max steps
Signed-off-by: Nithin Rao Koluguri <nithinraok>
* jenkins test
Signed-off-by: Nithin Rao Koluguri <nithinraok>
* add bs=2
Signed-off-by: Nithin Rao Koluguri <nithinraok>
---------
Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
* Multimodal merge (#7728)
* ControlNet TRT export
* Final MR before release
* SD2 update
* Fixed export issue
* Fix for instruct p2p and reformat
* Fix SD export issue
* Add nemo clip export for DB
* Fix ins pix2pix
* fix sd2 config
* [Mingyuan Ma] BF16 and SD conversion script
* [Imagen] NHWC Feature
* Fix .nemo loading issue for NeMo CLIP in SD
* NeMo r1.20.0 Multimodal Merge
* fix the inductor issue in inference
* Fix inductor loading .nemo issue
* Add Neva Model Support
* Imagen Optimizations
* Neva inference code
* NeMo TOT 1.21 to Internal/main
* Update neva_inference.yaml
* REBASING for latest code changes
* Update internal/main to main tot
* Parallel DDIM implementation
* 1. Fixing indentation bug. (#7352)
Signed-off-by: Micha Livne <[email protected]>
* NeMo MCore llama2 support + MCore PEFT adapters (#7299)
* start adding gpt from megatron core path
Signed-off-by: ericharper <[email protected]>
* set model parallel config
Signed-off-by: ericharper <[email protected]>
* use model parallel config object
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update args
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* set vp size to none if it is 1
Signed-off-by: ericharper <[email protected]>
* set vp size to none if it is 1
Signed-off-by: ericharper <[email protected]>
* add TransformerConfig
Signed-off-by: ericharper <[email protected]>
* start updating to TransformerConfig
Signed-off-by: ericharper <[email protected]>
* add todo
Signed-off-by: ericharper <[email protected]>
* revert to model parallel config
Signed-off-by: ericharper <[email protected]>
* add hidden_size to model_parallel_config
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove imports
Signed-off-by: ericharper <[email protected]>
* revert
Signed-off-by: ericharper <[email protected]>
* remove import
Signed-off-by: ericharper <[email protected]>
* small clean up
Signed-off-by: ericharper <[email protected]>
* update hidden size in peft base model, add mcore commit to jenkins
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update module args
Signed-off-by: ericharper <[email protected]>
* add config obj to flash attention tests
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove args
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove sequence parallel arg
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* add config to self
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* add config to test
Signed-off-by: ericharper <[email protected]>
* get hidden_size from config
Signed-off-by: ericharper <[email protected]>
* add try except
Signed-off-by: ericharper <[email protected]>
* use default
Signed-off-by: ericharper <[email protected]>
* update config with hidden size
Signed-off-by: ericharper <[email protected]>
* remove arg
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* comment out jenkins test
Signed-off-by: ericharper <[email protected]>
* revert import
Signed-off-by: ericharper <[email protected]>
* build transformer config
Signed-off-by: ericharper <[email protected]>
* add model to provider func
Signed-off-by: ericharper <[email protected]>
* update forward and float16 wrapper
Signed-off-by: ericharper <[email protected]>
* instantiate model parallel config after init model parallel
Signed-off-by: ericharper <[email protected]>
* set virtual rank
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add GQA config to megatron gpt model (#7096)
* Add GQA config in gpt config file
Signed-off-by: jasonwan <[email protected]>
* Verify mcore is enabled when using GQA
Signed-off-by: jasonwan <[email protected]>
---------
Signed-off-by: jasonwan <[email protected]>
* revert
Signed-off-by: ericharper <[email protected]>
* mcore llama2 ckpt conversion & small fix
Signed-off-by: jasonwan <[email protected]>
* Add inference & sft config by Hongbin
Co-authored-by: Hongbin Liu <[email protected]>
Signed-off-by: jasonwan <[email protected]>
* fix config
Signed-off-by: jasonwan <[email protected]>
* add inference param. update TP/PP script to support mcore gpt
Signed-off-by: jasonwan <[email protected]>
* p-tuning
Signed-off-by: jasonwan <[email protected]>
* modify ckpt conversion script (adding model cast)
Signed-off-by: jasonwan <[email protected]>
* ckpt conversion use relative path for config
Signed-off-by: jasonwan <[email protected]>
* start adding gpt from megatron core path
Signed-off-by: ericharper <[email protected]>
* set model parallel config
Signed-off-by: ericharper <[email protected]>
* use model parallel config object
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* set vp size to none if it is 1
Signed-off-by: ericharper <[email protected]>
* set vp size to none if it is 1
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add TransformerConfig
Signed-off-by: ericharper <[email protected]>
* start updating to TransformerConfig
Signed-off-by: ericharper <[email protected]>
* add todo
Signed-off-by: ericharper <[email protected]>
* revert to model parallel config
Signed-off-by: ericharper <[email protected]>
* add hidden_size to model_parallel_config
Signed-off-by: ericharper <[email protected]>
* remove imports
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove import
Signed-off-by: ericharper <[email protected]>
* small clean up
Signed-off-by: ericharper <[email protected]>
* update hidden size in peft base model, add mcore commit to jenkins
Signed-off-by: ericharper <[email protected]>
* update module args
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add config obj to flash attention tests
Signed-off-by: ericharper <[email protected]>
* remove args
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove sequence parallel arg
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update args
Signed-off-by: ericharper <[email protected]>
* add config to self
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* add config to test
Signed-off-by: ericharper <[email protected]>
* get hidden_size from config
Signed-off-by: ericharper <[email protected]>
* add try except
Signed-off-by: ericharper <[email protected]>
* use default
Signed-off-by: ericharper <[email protected]>
* update config with hidden size
Signed-off-by: ericharper <[email protected]>
* remove arg
Signed-off-by: ericharper <[email protected]>
* comment out jenkins test
Signed-off-by: ericharper <[email protected]>
* revert import
Signed-off-by: ericharper <[email protected]>
* remove optimizer_idx
Signed-off-by: eharper <[email protected]>
* prefetch num microbatches
Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* start adding gpt from megatron core path
Signed-off-by: ericharper <[email protected]>
* set model parallel config
Signed-off-by: ericharper <[email protected]>
* use model parallel config object
Signed-off-by: ericharper <[email protected]>
* update args
Signed-off-by: ericharper <[email protected]>
* fix for p-tuning sequence parallel
Signed-off-by: jasonwan <[email protected]>
* support SFT/distOpt mcore (#7207)
* add inference param. update TP/PP script to support mcore gpt
* p-tuning
Signed-off-by: jasonwan <[email protected]>
* change layer names for SFT
Signed-off-by: Hongbin Liu <[email protected]>
* fix bug in SFT
Signed-off-by: Hongbin Liu <[email protected]>
---------
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* start updating to TransformerConfig
Signed-off-by: ericharper <[email protected]>
* revert to model parallel config
Signed-off-by: ericharper <[email protected]>
* add hidden_size to model_parallel_config
Signed-off-by: ericharper <[email protected]>
* remove imports
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update module args
Signed-off-by: ericharper <[email protected]>
* add config to self
Signed-off-by: ericharper <[email protected]>
* build transformer config
Signed-off-by: ericharper <[email protected]>
* add model to provider func
Signed-off-by: ericharper <[email protected]>
* update forward and float16 wrapper
Signed-off-by: ericharper <[email protected]>
* instantiate model parallel config after init model parallel
Signed-off-by: ericharper <[email protected]>
* set virtual rank
Signed-off-by: ericharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add GQA config to megatron gpt model (#7096)
* Add GQA config in gpt config file
Signed-off-by: jasonwan <[email protected]>
* Verify mcore is enabled when using GQA
Signed-off-by: jasonwan <[email protected]>
---------
Signed-off-by: jasonwan <[email protected]>
* revert
Signed-off-by: ericharper <[email protected]>
* remove import
Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* rollback model cast for p-tuning
Signed-off-by: jasonwan <[email protected]>
* update for dist adam
Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* use get_gpt_module_list
Signed-off-by: eharper <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update ckpt conversion script
Signed-off-by: jasonwan <[email protected]>
* ptl2.0 patch for llama config
Signed-off-by: jasonwan <[email protected]>
* add plugins to trainer in scripts
Signed-off-by: jasonwan <[email protected]>
* fix activation checkpointing mcore
Signed-off-by: jasonwan <[email protected]>
* fix variable names
Signed-off-by: jasonwan <[email protected]>
* overwrite normalization type for mcore/te
Signed-off-by: jasonwan <[email protected]>
* Update megatron_llama_sft.yaml
Signed-off-by: Jason Wang <[email protected]>
* add PEFT adapter support for mcore gpt path (#7276)
* implementation for mcore adapter/mixins
Signed-off-by: jasonwan <[email protected]>
* small fix for lora and ptuning
Signed-off-by: jasonwan <[email protected]>
* support layerwise peft
Signed-off-by: jasonwan <[email protected]>
* support multiple target layers
Signed-off-by: jasonwan <[email protected]>
* support lora GQA
Signed-off-by: jasonwan <[email protected]>
* support amp O2
Signed-off-by: jasonwan <[email protected]>
* revert & more O2 fix
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* lora inject to attention
Signed-off-by: jasonwan <[email protected]>
* support lora weight tying
Signed-off-by: jasonwan <[email protected]>
* add copyright header
Signed-off-by: jasonwan <[email protected]>
* rollback ptuning name change. full string match mcore target
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove comment
Signed-off-by: jasonwan <[email protected]>
---------
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* clean up config
Signed-off-by: jasonwan <[email protected]>
* Sync llama branch (#7297)
* add inference param. update TP/PP script to support mcore gpt
* p-tuning
Signed-off-by: jasonwan <[email protected]>
* change layer names for SFT
Signed-off-by: Hongbin Liu <[email protected]>
* fix bug in SFT
Signed-off-by: Hongbin Liu <[email protected]>
* fix bug: cpu initialization is not really enabled
Signed-off-by: Hongbin Liu <[email protected]>
* add use_cpu_initialization to TransformerConfig
Signed-off-by: Hongbin Liu <[email protected]>
* fix bug: wrong config path when using relative ckpt path
Signed-off-by: Hongbin Liu <[email protected]>
* revert mcore config change
Signed-off-by: Jason Wang <[email protected]>
---------
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
* clean up ckpt conversion script
Signed-off-by: jasonwan <[email protected]>
* rollback git merge errors
Signed-off-by: jasonwan <[email protected]>
* update mcore, add check for mcore+te
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* formatting
Signed-off-by: jasonwan <[email protected]>
* make sft test dataset optional. fix indentation in config
Signed-off-by: jasonwan <[email protected]>
* one more fix for optional test set
Signed-off-by: jasonwan <[email protected]>
* support merging lora weights in mcore
Signed-off-by: jasonwan <[email protected]>
* update mcore for cpu init
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update ckpt conversion for code llama
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)
* add inference param. update TP/PP script to support mcore gpt
* p-tuning
Signed-off-by: jasonwan <[email protected]>
* add seq_len_interpolation_factor
Signed-off-by: Hongbin Liu <[email protected]>
---------
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
* fix old ptuning model, update mcore to support seq_len_interpolation_factor
Signed-off-by: jasonwan <[email protected]>
* support fused layernorm linear, fix ptuning O2
Signed-off-by: jasonwan <[email protected]>
* drop loss mask for mcore for now
Signed-off-by: jasonwan <[email protected]>
* disable dist ckpt in peft
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix loading non dist ckpt
Signed-off-by: jasonwan <[email protected]>
* add ckpt conversion to CI
Signed-off-by: jasonwan <[email protected]>
* update CI
Signed-off-by: jasonwan <[email protected]>
* mcore_mixin docstring
Signed-off-by: jasonwan <[email protected]>
* minor change in mcore peft error message
Signed-off-by: jasonwan <[email protected]>
* fix amp o2 in lora weight tying
Signed-off-by: jasonwan <[email protected]>
* correct mcore fp8 config
Signed-off-by: jasonwan <[email protected]>
* add TE installation
Signed-off-by: jasonwan <[email protected]>
* support mcore adapter tuning
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* comment out new CI test. rollback docker image
Signed-off-by: jasonwan <[email protected]>
* ignore FA tests, try new CI on 23.08
Signed-off-by: jasonwan <[email protected]>
* mark new CI as L2, put to beginning to test
Signed-off-by: jasonwan <[email protected]>
* minor fix for prompt learning
Signed-off-by: jasonwan <[email protected]>
* rollback to 23.06. comment out CI
Signed-off-by: jasonwan <[email protected]>
* minor fix ckpt conversion script
Signed-off-by: jasonwan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* minor rollback gpt model change
Signed-off-by: jasonwan <[email protected]>
---------
Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>
* Hiddens modules documentation (#7303)
* 1. Changed hiddens transformations module from `transformations` to `hiddens`.
Signed-off-by: Micha Livne <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* 1. Finished doc.
Signed-off-by: Micha Livne <[email protected]>
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
* 1. Debugging.
Signed-off-by: Micha Livne <[email protected]>
---------
Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
* Support for flash attention 2.0 (#7063)
* Add flash attn 2
Signed-off-by: MaximumEntropy <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add FA2 feature
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Remove debugging
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
* lora merge fix for O2 names (#7325)
* wip
Signed-off-by: arendu <[email protected]>
* adjust key names based on O2
Signed-off-by: arendu <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update
Signed-off-by: arendu <[email protected]>
* minor
Signed-off-by: arendu <[email protected]>
---------
Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* multiple fields can form a context (#7147)
* list of context fields and flexible prompt template
Signed-off-by: arendu <[email protected]>
* list of fields for context
Signed-off-by: arendu <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix bug
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Fix bug
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Add multiple truncation fields and middle truncation
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Compatible with old ckpt
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix tokenize detokenize issue
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove detokenization, add truncation augmentation
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Resolve comments
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Remove unused import
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* revert eos
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Add tokenizer space_sensitive attribute
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix error
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Fix error and use re
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix bug
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Change assert logic
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Follow adi suggestion
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove merge function
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add example and comment
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Remove context_key and add comment
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* Remove random truncation
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix bug
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix template none
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix bug
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
---------
Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
* Load buffers in checkpoint (#7357)
Signed-off-by: Jason Wang <[email protected]>
* Add migration guide for lightning 2.0 upgrade (#7360)
* Add lightning 2.0 migration guide in NeMo docs
Signed-off-by: Abhishree <[email protected]>
* Add remaining guide for
lightning 2.0 upgrade Signed-off-by: Abhishree <[email protected]> * Remove line spill over and continue in next line Signed-off-by: Abhishree <[email protected]> * Add missing dataloader_iter in the guide Signed-off-by: Abhishree <[email protected]> * Fix minor typo Signed-off-by: Abhishree <[email protected]> --------- Signed-off-by: Abhishree <[email protected]> * adding bias_dropout_add_fusion option for BERT (#7332) Signed-off-by: Alexander Jipa <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> * [TTS] Change audio codec token type to TokenIndex (#7356) Signed-off-by: Ryan <[email protected]> * enable selective unfreeze (#7326) * wip Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * avoid PTL method conflicts Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix typos (#7361) * fix typos Signed-off-by: omahs <[email protected]> * fix typo Signed-off-by: omahs <[email protected]> * fix typos Signed-off-by: omahs <[email protected]> * fix typos Signed-off-by: omahs <[email protected]> * fix typo Signed-off-by: omahs <[email protected]> * fix typos Signed-off-by: omahs <[email protected]> * fix typo Signed-off-by: omahs <[email protected]> * fix typo 
Signed-off-by: omahs <[email protected]> * fix typo Signed-off-by: omahs <[email protected]> --------- Signed-off-by: omahs <[email protected]> * pin numba=0.57.1 to fix reinstall.sh error (#7366) Signed-off-by: Xuesong Yang <[email protected]> * Update new conversion script for converting safetensors. * Upgrade pytorch container to 23.08 (#7353) * upgrade pytorch container Signed-off-by: eharper <[email protected]> * use mcore Signed-off-by: eharper <[email protected]> * revert test change Signed-off-by: eharper <[email protected]> * pleasefixme Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * check for ampere Signed-off-by: eharper <[email protected]> * comment test temporarily Signed-off-by: eharper <[email protected]> --------- Signed-off-by: eharper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * enable fp32 optimizer for output_layer in mcore (#7355) Signed-off-by: lhb8125 <[email protected]> * revert comment (#7368) Signed-off-by: eharper <[email protected]> * Update to core 23.08 branch ToT (#7371) Signed-off-by: Abhinav Khattar <[email protected]> * upper bounding ptl (#7370) Signed-off-by: eharper <[email protected]> * fix pipeline parallel inference (#7367) * fix pp inference Signed-off-by: jasonwan <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: jasonwan <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix for peft tied weights (#7372) Signed-off-by: arendu <[email protected]> * fixed trainer.strategy=auto from None. 
(#7369) Signed-off-by: Xuesong Yang <[email protected]> * add O2 option in gpt eval (#7358) * add O2 option in eval Signed-off-by: jasonwan <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add doc for O2 config Signed-off-by: jasonwan <[email protected]> * add to llama inference config Signed-off-by: jasonwan <[email protected]> --------- Signed-off-by: jasonwan <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> * Move model precision copy (#7336) * move cfg precision set to megatron base model Signed-off-by: Maanu Grover <[email protected]> * remove copy from other models Signed-off-by: Maanu Grover <[email protected]> * modify attribute not arg Signed-off-by: Maanu Grover <[email protected]> * fix gpt model test for ptl 2.0 Signed-off-by: Maanu Grover <[email protected]> * rename function and add docstring Signed-off-by: Maanu Grover <[email protected]> * replace precision to dtype conditionals with func call Signed-off-by: Maanu Grover <[email protected]> * unnecessary function and cfg reset Signed-off-by: Maanu Grover <[email protected]> * set default value Signed-off-by: Maanu Grover <[email protected]> * fix precision lookup in a few more places Signed-off-by: Maanu Grover <[email protected]> * rename mapping function Signed-off-by: Maanu Grover <[email protected]> * ununsed import Signed-off-by: Maanu Grover <[email protected]> * save torch datatype to model Signed-off-by: Maanu Grover <[email protected]> * set weights precision wrt amp o2 Signed-off-by: Maanu Grover <[email protected]> * Revert "set weights precision wrt amp o2" This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c. 
Signed-off-by: Maanu Grover <[email protected]> * revert half precision at inference attempt Signed-off-by: Maanu Grover <[email protected]> * move autocast dtype to base model Signed-off-by: Maanu Grover <[email protected]> * move params dtype to base model, enable fp16 O2 inf Signed-off-by: Maanu Grover <[email protected]> * unused imports Signed-off-by: Maanu Grover <[email protected]> --------- Signed-off-by: Maanu Grover <[email protected]> * Fix PEFT checkpoint loading (#7388) * Fix PEFT checkpoint loading Signed-off-by: Jason Wang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jason Wang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Use distributed optimizer support for multiple dtypes (#7359) * Update distopt wrapper with multiple dtype support Remove manual handling of separate FP32 optimizer. Signed-off-by: Tim Moon <[email protected]> * Use distopt support for contiguous buffers with multiple dtypes Signed-off-by: Tim Moon <[email protected]> * Fix typo Signed-off-by: Tim Moon <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Separate distopt buckets for first GPT layer and non-overlapped params Signed-off-by: Tim Moon <[email protected]> * Add distopt logic for int dtypes Signed-off-by: Tim Moon <[email protected]> * Update Apex commit Signed-off-by: Tim Moon <[email protected]> * Remove unused variables Signed-off-by: Tim Moon <[email protected]> * Update Apex commit in README and Jenkensfile Signed-off-by: Tim Moon <[email protected]> * Debug Dockerfile and Jenkinsfile Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: Tim Moon <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> * minor fix 
for llama ckpt conversion script (#7387) * minor fix for llama ckpt conversion script Signed-off-by: Jason Wang <[email protected]> * Update Jenkinsfile Signed-off-by: Jason Wang <[email protected]> * remove fast_swiglu configuration Signed-off-by: Jason Wang <[email protected]> --------- Signed-off-by: Jason Wang <[email protected]> Co-authored-by: Eric Harper <[email protected]> * Fix wrong calling of librosa.get_duration() in notebook (#7376) Signed-off-by: Robin Dong <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> * [PATCH] PEFT import mcore (#7393) * [PATCH] PEFT import mcore Signed-off-by: Jason Wang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jason Wang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [TTS] Added a callback for logging initial data (#7384) Signed-off-by: Ante Jukić <[email protected]> * Update Core Commit (#7402) * Update Core Commit Signed-off-by: Abhinav Khattar <[email protected]> * update commit Signed-off-by: Abhinav Khattar <[email protected]> --------- Signed-off-by: Abhinav Khattar <[email protected]> * Use cfg attribute in bert (#7394) * use cfg attribute instead of arg Signed-off-by: Maanu Grover <[email protected]> * use torch_dtype in place of cfg.precision Signed-off-by: Maanu Grover <[email protected]> * move precision copy before super constructor Signed-off-by: Maanu Grover <[email protected]> * use trainer arg Signed-off-by: Maanu Grover <[email protected]> --------- Signed-off-by: Maanu Grover <[email protected]> * Add support for bias conversion in Swiglu models (#7386) * Add support for bias conversion in Swiglu models Signed-off-by: smajumdar <[email protected]> * Add support for auto extracting tokenizer model Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, 
see https://pre-commit.ci * Add support for auto extracting tokenizer model Signed-off-by: smajumdar <[email protected]> * Fix issue with missing tokenizer Signed-off-by: smajumdar <[email protected]> * Refactor Signed-off-by: smajumdar <[email protected]> * Refactor Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update save_to and restore_from for dist checkpointing (#7343) * add dist ckpt to save to, in progress Signed-off-by: eharper <[email protected]> * move dist ckpt Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * clean up Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update restore from, need to figure out how to initialize distributed Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * launch distrib if needed when restoring dist ckpt Signed-off-by: eharper <[email protected]> * when using mcore we can change tp pp on the fly Signed-off-by: eharper <[email protected]> * add load_from_checkpoint support for dist ckpt Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update llama convert script to save dist .nemo Signed-off-by: eharper <[email protected]> * fix load dist ckpt Signed-off-by: jasonwan <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * setup TE TP groups if needed Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * setup te tp groups if needed Signed-off-by: eharper <[email protected]> * remove import Signed-off-by: eharper <[email protected]> --------- Signed-off-by: eharper <[email protected]> Signed-off-by: jasonwan <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: jasonwan <[email protected]> * fix forward for with mcore=false (#7403) Signed-off-by: Jimmy Zhang <[email protected]> Co-authored-by: Jimmy Zhang <[email protected]> * Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374) * Add CustomProgressBar class to exp_manager and trainer callbacks Signed-off-by: Abhishree <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the progress bar to reflect total microbatch cnt Signed-off-by: Abhishree <[email protected]> * Modify CustomProgressBar class 1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch 2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder Signed-off-by: Abhishree <[email protected]> * Add CustomProgressBar callback to tuning files Signed-off-by: Abhishree <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Set Activation Checkpointing Defaults (#7404) * Set Activation Checkpointing Defaults Signed-off-by: Abhinav Khattar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * check for None Signed-off-by: Abhinav Khattar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more 
information, see https://pre-commit.ci --------- Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * make loss mask default to false (#7407) Signed-off-by: eharper <[email protected]> * Add dummy userbuffer config files (#7408) Signed-off-by: Sangkug Lym <[email protected]> * add missing ubconf files (#7412) Signed-off-by: Abhinav Khattar <[email protected]> * New tutorial on Speech Data Explorer (#7405) * Added Google Colab based tutorial on Speech Data Explorer Signed-off-by: George Zelenfroynd <[email protected]> * Update ptl training ckpt conversion script to work with dist ckpt (#7416) * update ptl convert script Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * don't break legacy Signed-off-by: eharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: eharper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Allow disabling sanity checking when num_sanity_val_steps=0 (#7413) * Allow disabling sanity checking when num_sanity_val_steps=0 Signed-off-by: Abhishree <[email protected]> * Update num_sanity_val_steps to be a multiple of num_microbatches Signed-off-by: Abhishree Thittenamane <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <[email protected]> Signed-off-by: Abhishree Thittenamane <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Add comprehensive error messages (#7261) Signed-off-by: Anton Peganov <[email protected]> * check NEMO_PATH (#7418) Signed-off-by: Nikolay Karpov <[email protected]> * layer selection for ia3 (#7417) * layer 
selection for ia3 Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix missing pip package 'einops' (#7397) Signed-off-by: Robin Dong <[email protected]> * Fix failure of pyaudio in Google Colab (#7396) Signed-off-by: Robin Dong <[email protected]> * Update README.md: output_path --> output_manifest_filepath (#7442) Signed-off-by: Samuele Cornell <[email protected]> * Updating FlashAttention API to match FlashAttentionV2 * Multiple fixes for mm * Fix CI inductor issue and update to torch compile * Remove suppress error * Fix when conversion config uses fp16 and it complains about precision plugin * Fixing FAv2 API usage * Initial release of content filtering model * Added synthetic dataloader for precached and online mode * Mingyuanm/dreambooth opt * Add llama2 support in neva training * Fix sampler length * Fix all precision issues in nemo multimodal * Add rope dynamic linear scaling (#7437) * Add dynamic linear scaling Signed-off-by: Cheng-Ping Hsieh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug Signed-off-by: Cheng-Ping Hsieh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> --------- Signed-off-by: Cheng-Ping Hsieh <[email protected]> Co-authored-by: pre-commit-ci[bot] 
<66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Yang Zhang <[email protected]> * Fix None dataloader issue in PTL2.0 (#7455) * Fix None dataloader issue in PTL2.0 Signed-off-by: KunalDhawan <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updating values of self._validation_dl and self._test_dl as well Signed-off-by: KunalDhawan <[email protected]> * updating values of self._validation_dl and self._test_dl as well Signed-off-by: KunalDhawan <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: KunalDhawan <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [ASR] Confidence measure -> method renames (#7434) * measure -> method Signed-off-by: Aleksandr Laptev <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Aleksandr Laptev <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Add steps for document of getting dataset 'SF Bilingual Speech' (#7378) * Add steps for document of getting dataset 'SF Bilingual Speech' Signed-off-by: Robin Dong <[email protected]> * Update datasets.rst added a link from a tutorial demonstrating detailed data prep steps. 
Signed-off-by: Xuesong Yang <[email protected]> --------- Signed-off-by: Robin Dong <[email protected]> Signed-off-by: Xuesong Yang <[email protected]> Co-authored-by: Xuesong Yang <[email protected]> * RNN-T confidence and alignment bugfix (#7381) * new frame_confidence and alignments lists are now always created after the while loop Signed-off-by: Aleksandr Laptev <[email protected]> * tests added Signed-off-by: Aleksandr Laptev <[email protected]> --------- Signed-off-by: Aleksandr Laptev <[email protected]> * Fix resume from checkpoint in exp_manager (#7424) (#7426) Signed-off-by: Abhishree <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: Eric Harper <[email protected]> * Fix checking of cuda/cpu device for inputs of Decoder (#7444) * Fix checking of cuda/cpu device for inputs of Decoder Signed-off-by: Robin Dong <[email protected]> * Update tacotron2.py Signed-off-by: Jason <[email protected]> --------- Signed-off-by: Robin Dong <[email protected]> Signed-off-by: Jason <[email protected]> Co-authored-by: Jason <[email protected]> * Fix failure of ljspeech's get_data.py (#7430) * Fix failure of ljspeech's get_data.py Signed-off-by: Robin Dong <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Robin Dong <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [TTS] Fix audio codec type checks (#7373) * [TTS] Fix audio codec type checks Signed-off-by: Ryan <[email protected]> * [TTS] Fix audio codec tests Signed-off-by: Ryan <[email protected]> --------- Signed-off-by: Ryan <[email protected]> * [TTS] Add dataset to path of logged artifacts (#7462) * [TTS] Add dataset to path of logged artifacts Signed-off-by: Ryan <[email protected]> * [TTS] Revert axis name back to Audio Frames Signed-off-by: Ryan <[email protected]> --------- Signed-off-by: Ryan <[email protected]> 
* Fix sft dataset truncation (#7464) * Add fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: Cheng-Ping Hsieh <[email protected]> --------- Signed-off-by: Cheng-Ping Hsieh <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330) * striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling Signed-off-by: mburchi <[email protected]> * transpose conv1d inputs Signed-off-by: mburchi <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, s…
1 parent 73736ad commit 94bd346


1,217 files changed (+125197 -15420 lines)


.dockerignore

+2
@@ -17,3 +17,5 @@ coverage.xml
 .git
 **/*.nemo
 **/*.ckpt
+workspace
+nemo_experiments

.github/PULL_REQUEST_TEMPLATE.md

+3
@@ -14,6 +14,9 @@ Add a one line overview of what this PR aims to accomplish.
 # Add a code snippet demonstrating how to use this
 ```
 
+# Jenkins CI
+To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
+
 # Before your PR is "Ready for review"
 **Pre checks**:
 - [ ] Make sure you read and followed [Contributor guidelines](https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md)

.github/labeler.yml

+8
@@ -3,25 +3,33 @@ ASR:
 - examples/asr/**/*
 - tutorials/asr/**/*
 - docs/source/asr/**/*
+- tests/collections/asr/**
 
 NLP:
 - nemo/collections/nlp/**/*
 - examples/nlp/**/*
 - tutorials/nlp/**/*
 - docs/source/nlp/**/*
+- tests/collections/nlp/**
 
 Speaker Tasks:
 - examples/speaker_tasks/**/*
 - tutorials/speaker_tasks/**/*
 
 TTS:
 - nemo/collections/tts/**/*
+- nemo/collections/common/tokenizers/text_to_speech/**
 - examples/tts/**/*
 - tutorials/tts/**/*
 - docs/source/tts/**/*
+- scripts/dataset_processing/tts/**
+- scripts/tts_dataset_files/**
+- tests/collections/tts/**
+- tests/collections/common/tokenizers/text_to_speech/**
 
 core:
 - nemo/core/**/*
+- tests/core/**
 
 common:
 - nemo/collections/common/**/*
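To make the effect of these globs concrete, here is a minimal Python sketch of how changed file paths could be mapped to labels. It assumes minimatch-style glob semantics (`*` stays within one path segment, `**/` spans zero or more directories), which approximates the matching done by the GitHub labeler action but is not its implementation. `glob_to_regex`, `labels_for`, and the example paths are hypothetical, and `LABEL_RULES` holds only the patterns visible in this hunk.

```python
import re
from functools import lru_cache

# Excerpt of the label -> glob rules visible in this hunk; lines 1-2 of the
# real labeler.yml are not shown in the diff and are omitted here.
LABEL_RULES = {
    "ASR": ["examples/asr/**/*", "tutorials/asr/**/*",
            "docs/source/asr/**/*", "tests/collections/asr/**"],
    "TTS": ["nemo/collections/tts/**/*",
            "nemo/collections/common/tokenizers/text_to_speech/**",
            "examples/tts/**/*", "tutorials/tts/**/*", "docs/source/tts/**/*",
            "scripts/dataset_processing/tts/**", "scripts/tts_dataset_files/**",
            "tests/collections/tts/**",
            "tests/collections/common/tokenizers/text_to_speech/**"],
    "core": ["nemo/core/**/*", "tests/core/**"],
    "common": ["nemo/collections/common/**/*"],
}


@lru_cache(maxsize=None)
def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a minimatch-style glob into a regular expression (assumed semantics)."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")        # anything, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # anything within a single path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def labels_for(path: str) -> list[str]:
    """Return every label with at least one pattern that fully matches `path`."""
    return sorted(
        label
        for label, patterns in LABEL_RULES.items()
        if any(glob_to_regex(p).fullmatch(path) for p in patterns)
    )


print(labels_for("nemo/collections/common/tokenizers/text_to_speech/tts_tokenizers.py"))
# ['TTS', 'common']
```

Under these assumptions each label is evaluated independently, so a change to the shared text-to-speech tokenizers picks up both the `TTS` and `common` labels, while a file such as `README.md` matches no rule.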

.pre-commit-config.yaml

+1 -1
@@ -28,7 +28,7 @@ repos:
 - id: check-case-conflict
 - id: detect-private-key
 - id: check-added-large-files
-args: ['--maxkb=1000']
+args: ['--maxkb=5000']
 - id: requirements-txt-fixer
 
 - repo: https://github.com/PyCQA/isort
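Raising `--maxkb` from 1000 to 5000 loosens the size gate on newly added files. The Python sketch below shows this kind of check. It assumes that sizes are rounded up to whole KiB before being compared with `maxkb`, which you should verify against the hook's source. `too_large` and the staged file names are hypothetical.

```python
import math


def too_large(sizes_in_bytes: dict[str, int], maxkb: int) -> list[str]:
    """Return the files whose size, rounded up to whole KiB, exceeds maxkb."""
    flagged = []
    for name, size in sorted(sizes_in_bytes.items()):
        kb = math.ceil(size / 1024)  # round up, so a 10-byte file counts as 1 KiB
        if kb > maxkb:
            flagged.append(name)
    return flagged


# Hypothetical staged files: a 3 MiB image, a 6 MiB checkpoint, a tiny README.
staged = {
    "docs/source/asset.png": 3 * 1024 * 1024,
    "tests/data/sample.ckpt": 6 * 1024 * 1024,
    "README.md": 10,
}
print(too_large(staged, maxkb=1000))  # ['docs/source/asset.png', 'tests/data/sample.ckpt']
print(too_large(staged, maxkb=5000))  # ['tests/data/sample.ckpt']
```

With the new limit, a roughly 3 MiB asset passes, but anything above 5000 KiB is still rejected.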

.readthedocs.yml

+5 -1
@@ -20,12 +20,16 @@
 # Required field.
 version: 2
 
+build:
+os: ubuntu-22.04
+tools:
+python: "3.10"
+
 # Build documentation in the docs/ directory with Sphinx.
 sphinx:
 configuration: docs/source/conf.py
 
 # Set the version of Python and requirements required to build your docs
 python:
-version: 3.8
 install:
 - requirements: requirements/requirements_docs.txt

Dockerfile

+65 -20
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:23.06-py3
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:23.12-py3
 
 # build an image that includes only the nemo dependencies, ensures that dependencies
 # are included first for optimal caching, and useful for building a development
@@ -31,7 +31,7 @@ ARG REQUIRE_AIS_CLI=false
 
 # Ensure apt-get won't prompt for selecting options
 ENV DEBIAN_FRONTEND=noninteractive
-# libavdevice-dev rerquired for latest torchaudio
+# libavdevice-dev required for latest torchaudio
 RUN apt-get update && \
 apt-get upgrade -y && \
 apt-get install -y \
@@ -42,15 +42,48 @@ RUN apt-get update && \
 libavdevice-dev && \
 rm -rf /var/lib/apt/lists/*
 
-WORKDIR /workspace/
+# libtool, ... , libgts-dev are required for graphviz
+# graphviz is required for k2 and pynini visualization
+RUN apt-get update && \
+apt-get install -y \
+libtool \
+libltdl-dev \
+automake \
+autoconf \
+bison \
+flex \
+tcl \
+ghostscript \
+libgd-dev \
+fontconfig \
+libcairo2-dev \
+libpango1.0-dev \
+libgts-dev && \
+rm -rf /var/lib/apt/lists/*
 
-WORKDIR /tmp/
-# TODO: Remove once this Apex commit (5/12/23) is included in PyTorch
-# container
+WORKDIR /workspace/
+# Install megatron core, this can be removed once 0.3 pip package is released
+# We leave it here in case we need to work off of a specific commit in main
+RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
+cd Megatron-LM && \
+git checkout 27cbe46714a50c43ed290f1b1472db8d2780c55c && \
+pip install .
+
+# Performance optimizations for distributed optimizer: https://github.com/NVIDIA/apex/pull/1771
 RUN git clone https://github.com/NVIDIA/apex.git && \
 cd apex && \
-git checkout 8b7a1ff183741dd8f9b87e7bafd04cfde99cea28 && \
-pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
+git checkout b496d85fb88a801d8e680872a12822de310951fd && \
+pip install -v --no-build-isolation --disable-pip-version-check --no-cache-dir --config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam" ./
+
+# Transformer Engine 1.2.0
+RUN git clone https://github.com/NVIDIA/TransformerEngine.git && \
+cd TransformerEngine && \
+git fetch origin 4f9662fbe621671f5f905e772fc1138953af77f6 && \
+git checkout FETCH_HEAD && \
+git submodule init && git submodule update && \
+NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
+
+WORKDIR /tmp/
 
 # uninstall stuff from base container
 RUN pip3 uninstall -y sacrebleu torchtext
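The Apex step replaces five repeated `--global-option` flags with a single `--config-settings "--build-option=..."` argument. The Python sketch below uses hypothetical helper names. It tokenizes both command lines with `shlex.split`, roughly as a POSIX shell would before pip sees them, and checks that the same five build flags are forwarded in the same order. It only compares argument lists and says nothing about how pip or setuptools interpret either form.

```python
import shlex

# The two pip invocations from the hunk above, copied verbatim.
OLD_CMD = (
    'pip3 install -v --disable-pip-version-check --no-cache-dir '
    '--global-option="--cpp_ext" --global-option="--cuda_ext" '
    '--global-option="--fast_layer_norm" --global-option="--distributed_adam" '
    '--global-option="--deprecated_fused_adam" ./'
)
NEW_CMD = (
    'pip install -v --no-build-isolation --disable-pip-version-check --no-cache-dir '
    '--config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm '
    '--distributed_adam --deprecated_fused_adam" ./'
)


def flags_from_global_options(cmd: str) -> list[str]:
    """Collect the VALUE of every --global-option=VALUE token."""
    prefix = "--global-option="
    return [tok[len(prefix):] for tok in shlex.split(cmd) if tok.startswith(prefix)]


def flags_from_config_settings(cmd: str) -> list[str]:
    """Collect the space-separated flags inside --config-settings "--build-option=..."."""
    prefix = "--build-option="
    tokens = shlex.split(cmd)
    flags = []
    for i, tok in enumerate(tokens[:-1]):
        if tok == "--config-settings" and tokens[i + 1].startswith(prefix):
            flags.extend(tokens[i + 1][len(prefix):].split())
    return flags


old_flags = flags_from_global_options(OLD_CMD)
new_flags = flags_from_config_settings(NEW_CMD)
print(old_flags == new_flags)  # True
```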
@@ -67,19 +100,20 @@ RUN INSTALL_MSG=$(/bin/bash /tmp/torchaudio_build/scripts/installers/install_tor
 else echo "Skipping failed torchaudio installation"; fi \
 else echo "torchaudio installed successfully"; fi
 
-# install nemo dependencies
-WORKDIR /tmp/nemo
-COPY requirements .
-RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-check --no-cache-dir -r $f; done
-
-# install flash attention dependencies
-RUN pip install flash-attn
-# pinned triton version for flash-attention https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3
-RUN pip install triton==2.0.0.dev20221202
+COPY scripts /tmp/nemo/scripts/
+# install correct graphviz version (k2 and pynini visualization tool), skip if installation fails
+RUN INSTALL_MSG=$(/bin/bash /tmp/nemo/scripts/installers/install_graphviz.sh --docker); INSTALL_CODE=$?; \
+echo ${INSTALL_MSG}; \
+if [ ${INSTALL_CODE} -ne 0 ]; then \
+echo "graphviz installation failed"; \
+if [ "${REQUIRE_K2}" = true ]; then \
+exit ${INSTALL_CODE}; \
+else echo "Skipping failed graphviz installation"; fi \
+else echo "graphviz installed successfully"; fi
 
 # install k2, skip if installation fails
 COPY scripts /tmp/nemo/scripts/
-RUN INSTALL_MSG=$(/bin/bash /tmp/nemo/scripts/speech_recognition/k2/setup.sh); INSTALL_CODE=$?; \
+RUN INSTALL_MSG=$(/bin/bash /tmp/nemo/scripts/installers/install_k2.sh); INSTALL_CODE=$?; \
 echo ${INSTALL_MSG}; \
 if [ ${INSTALL_CODE} -ne 0 ]; then \
 echo "k2 installation failed"; \
@@ -88,13 +122,24 @@ RUN INSTALL_MSG=$(/bin/bash /tmp/nemo/scripts/speech_recognition/k2/setup.sh); I
 else echo "Skipping failed k2 installation"; fi \
 else echo "k2 installed successfully"; fi
 
+# install nemo dependencies
+WORKDIR /tmp/nemo
+ENV LHOTSE_REQUIRE_TORCHAUDIO=0
+COPY requirements .
+RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-check --no-cache-dir -r $f; done
+
+# install flash attention
+RUN pip install flash-attn
+# install numba for latest containers
+RUN pip install numba>=0.57.1
+
 # copy nemo source into a scratch image
 FROM scratch as nemo-src
 COPY . .
 
 # start building the final container
 FROM nemo-deps as nemo
-ARG NEMO_VERSION=1.20.0
+ARG NEMO_VERSION=1.23.0
 
 # Check that NEMO_VERSION is set. Build will fail without this. Expose NEMO and base container
 # version information as runtime environment variable for introspection purposes
@@ -103,7 +148,7 @@ RUN /usr/bin/test -n "$NEMO_VERSION" && \
 /bin/echo "export BASE_IMAGE=${BASE_IMAGE}" >> /root/.bashrc
 
 # Install NeMo
-RUN --mount=from=nemo-src,target=/tmp/nemo cd /tmp/nemo && pip install ".[all]"
+RUN --mount=from=nemo-src,target=/tmp/nemo,rw cd /tmp/nemo && pip install ".[all]"
 
 # Check install
 RUN python -c "import nemo.collections.nlp as nemo_nlp" && \
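
The graphviz and k2 steps in this Dockerfile follow the same control flow. Each one runs an installer script and captures its output and exit status. It stops the build only when `REQUIRE_K2` is `true`; otherwise the failure is logged and skipped. A minimal sketch of that flow as a reusable shell function is shown below. The helper `optional_install` and its argument order are hypothetical, because the Dockerfile inlines this logic in each `RUN` step. The commands `false` and `echo` stand in for the real installer scripts.

```shell
#!/bin/sh
# Sketch of the "optional install" control flow used by the graphviz and k2
# steps. optional_install is a hypothetical helper, not part of NeMo.
# Usage: optional_install NAME REQUIRED COMMAND [ARGS...]
#   REQUIRED=true -> a failing COMMAND makes the function return its status
#   otherwise     -> a failing COMMAND is reported and skipped
optional_install() {
  local name required msg code
  name=$1
  required=$2
  shift 2
  # Capture the installer's stdout and exit status. The if/else form keeps the
  # command's status intact and stays safe if the caller has enabled `set -e`.
  if msg=$("$@"); then code=0; else code=$?; fi
  echo "${msg}"
  if [ "${code}" -ne 0 ]; then
    echo "${name} installation failed"
    if [ "${required}" = true ]; then
      # Propagate the installer's status so a required install still fails.
      return "${code}"
    fi
    echo "Skipping failed ${name} installation"
  else
    echo "${name} installed successfully"
  fi
}

# Example calls: `false` and `echo` stand in for the real installer scripts.
optional_install graphviz false false      # reported and skipped
optional_install k2 false echo "k2 ok"     # reported as installed
```

Inside a `RUN` step, the original uses `exit ${INSTALL_CODE}`, which ends that step's shell and fails the build. The function uses `return` instead, so the caller decides what to do with the status.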

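The new dependency step contains a shell quoting detail. Docker runs shell-form `RUN` instructions through `/bin/sh -c`, where an unquoted `>` starts an output redirection. As a result, `RUN pip install numba>=0.57.1` runs `pip install numba` and writes its output to a file named `=0.57.1`, so the version floor never reaches pip. Quoting the specifier, as in `pip install "numba>=0.57.1"`, keeps it as a single argument. The sketch below demonstrates the difference. It assumes a POSIX shell and uses `printf` in place of `pip` so it runs offline.

```shell
#!/bin/sh
# Sketch: why a pip version specifier must be quoted in a shell command.
# printf stands in for pip; a temp dir holds the stray redirection target.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Unquoted: the shell sees the word "numba" plus the redirection ">=0.57.1",
# i.e. "write stdout to a file named =0.57.1". Nothing is captured here.
unquoted=$(printf '%s\n' numba>=0.57.1)

# Quoted: the full specifier is passed through as one argument.
quoted=$(printf '%s\n' "numba>=0.57.1")

# The unquoted command's output ended up in the stray file instead.
stray=$(cat ./=0.57.1)

echo "unquoted stdout: '${unquoted}'"
echo "quoted stdout:   '${quoted}'"
echo "stray file '=0.57.1' contains: '${stray}'"

# Clean up the temporary directory.
cd / && rm -rf "$demo_dir"
```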