Separate punctuation by whitespace #6574

karpnv · 2023-05-05T17:56:55Z

What does this PR do ?

Fix separate_punctuation=True option

Collection: ASR

Changelog

Do

'some text.' -> 'some text .'
'some text .' -> 'some text .'

instead of

'some text.' -> 'some text .'
'some text .' -> 'some text.'
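
As a rough illustration of the intended behavior (not the code changed in this PR), the sketch below pads each punctuation mark with spaces and then collapses repeated whitespace, so text that is already separated stays unchanged. The function name, the `PUNCTUATION` set, and the two regexes are assumptions made for this example only.

```python
import re

# Hypothetical punctuation set for illustration; the real option is configurable.
PUNCTUATION = ".,?"

# Capture any single punctuation mark so it can be re-emitted with padding.
_PUNCT_RE = re.compile(f"([{re.escape(PUNCTUATION)}])")
# Match runs of two or more whitespace characters left over after padding.
_EXTRA_SPACE_RE = re.compile(r"\s{2,}")


def separate_punctuation(text: str) -> str:
    """Put whitespace around each punctuation mark, without doubling existing spaces."""
    padded = _PUNCT_RE.sub(r" \1 ", text)
    return _EXTRA_SPACE_RE.sub(" ", padded).strip()


print(separate_punctuation("some text."))   # some text .
print(separate_punctuation("some text ."))  # some text .
```

Collapsing whitespace after padding is what makes the operation idempotent: applying it to already separated text returns the same text.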

PR Type:

[V ] Bugfix

Who can review?

@Kipok

Signed-off-by: Nikolay Karpov <[email protected]>

Kipok · 2023-05-05T18:00:36Z

Thanks @karpnv! Is it possible to add a couple of unit tests for this class to ensure that it works as expected on a few examples?

titu1994

Thanks !

titu1994 · 2023-05-05T18:18:02Z

For unit tests, the best case would be to wait for the ASR post-processor class to be developed; then we can add the regex and test it in isolation. This full script is hard to unit test.
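
A minimal sketch of what testing the regex in isolation could look like, assuming the logic is exposed as a standalone function. The helper `separate_punctuation` here is a hypothetical stand-in written for this example, not the class discussed above, and the cases are taken from the PR description plus two extra ones.

```python
import re
import unittest


def separate_punctuation(text: str, marks: str = ".,?") -> str:
    """Hypothetical stand-in: pad punctuation with spaces, then normalize whitespace."""
    padded = re.sub(f"([{re.escape(marks)}])", r" \1 ", text)
    # split() with no arguments drops empty strings, so repeated spaces collapse.
    return " ".join(padded.split())


class SeparatePunctuationTest(unittest.TestCase):
    def test_examples(self):
        cases = {
            "some text.": "some text .",
            "some text .": "some text .",  # already separated: unchanged
            "hello, world?": "hello , world ?",
            "": "",
        }
        for raw, expected in cases.items():
            with self.subTest(raw=raw):
                self.assertEqual(separate_punctuation(raw), expected)
```

Saved as a test module, this can be run with `python -m unittest`.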

Signed-off-by: Yi Dong <[email protected]> * fix name Signed-off-by: Yi Dong <[email protected]> * new chat dataset Signed-off-by: Yi Dong <[email protected]> * fix the system token Signed-off-by: Yi Dong <[email protected]> * upgrade gradio Signed-off-by: Yi Dong <[email protected]> * save the chat history Signed-off-by: Yi Dong <[email protected]> * update ui Signed-off-by: root <[email protected]> * update chat interface Signed-off-by: Yi Dong <[email protected]> * handles canonical form Signed-off-by: Yi Dong <[email protected]> * new sft chatbot Signed-off-by: Yi Dong <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change format Signed-off-by: Yi Dong <[email protected]> * check extra_id in the tokenizer Signed-off-by: Yi Dong <[email protected]> * added vocab property check Signed-off-by: Yi Dong <[email protected]> * added missing file Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]> Signed-off-by: root <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sandeep Subramanian <[email protected]> * Add support for RNNT/hybrid models to partial transcribe (#6609) * Add support for RNNT/hybrid models to partial transcribe Signed-off-by: He Huang (Steve) <[email protected]> * Update transcribe_utils.py Signed-off-by: He Huang (Steve) <[email protected]> * Update transcribe_speech.py Signed-off-by: He Huang (Steve) <[email protected]> * Update transcribe_utils.py Signed-off-by: He Huang (Steve) <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: He Huang (Steve) <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * eval_beamsearch_ngram.py with hybrid ctc (#6656) * 
separate_punctuation = false * ctc decoding strategy = model.decoding * transcribe(files, logprobs=True) returns logprobs --------- Signed-off-by: Nikolay Karpov <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * fix bucketing bug issue for picking new bucket (#6663) Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao Koluguri <nithinraok> * minor fix for missing chat attr (#6671) Signed-off-by: arendu <[email protected]> * [TTS] Add callback for saving audio during FastPitch training (#6665) * [TTS] Add callback for saving audio during FastPitch training Signed-off-by: Ryan <[email protected]> * [TTS] Allow NGC model name for vocoder Signed-off-by: Ryan <[email protected]> --------- Signed-off-by: Ryan <[email protected]> --------- Signed-off-by: KunalDhawan <[email protected]> Signed-off-by: smajumdar <[email protected]> Signed-off-by: MaximumEntropy <[email protected]> Signed-off-by: fayejf <[email protected]> Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: Nikolay Karpov <[email protected]> Signed-off-by: Nikolay Karpov <[email protected]> Signed-off-by: Ryan <[email protected]> Signed-off-by: He Huang (Steve) <[email protected]> Signed-off-by: Greg Clark <[email protected]> Signed-off-by: arendu <[email protected]> Signed-off-by: Adi Renduchintala <[email protected]> Signed-off-by: Vladimir Bataev <[email protected]> Signed-off-by: hsiehjackson <[email protected]> Signed-off-by: Elena Rastorgueva <[email protected]> Signed-off-by: Igor Gitman <[email protected]> Signed-off-by: Sangkug Lym <[email protected]> Signed-off-by: sam1373 <[email protected]> Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: andrusenkoau <[email protected]> Signed-off-by: Mikołaj Błaż <[email protected]> Signed-off-by: fayejf <[email protected]> Signed-off-by: Tim Moon <[email protected]> Signed-off-by: Eric Harper <[email protected]> Signed-off-by: ericharper <[email 
protected]> Signed-off-by: Tim Moon <[email protected]> Signed-off-by: Xuesong Yang <[email protected]> Signed-off-by: Yi Dong <[email protected]> Signed-off-by: root <[email protected]> Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Somshubra Majumdar <[email protected]> Co-authored-by: Sandeep Subramanian <[email protected]> Co-authored-by: fayejf <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Nikolay Karpov <[email protected]> Co-authored-by: Nikolay Karpov <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Ryan Langman <[email protected]> Co-authored-by: He Huang (Steve) <[email protected]> Co-authored-by: Vahid Noroozi <[email protected]> Co-authored-by: Greg Clark <[email protected]> Co-authored-by: Adi Renduchintala <[email protected]> Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: Vladimir Bataev <[email protected]> Co-authored-by: Cheng-Ping Hsieh <[email protected]> Co-authored-by: Elena Rastorgueva <[email protected]> Co-authored-by: Igor Gitman <[email protected]> Co-authored-by: Samuel Kriman <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Andrei Andrusenko <[email protected]> Co-authored-by: mikolajblaz <[email protected]> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: Adi Renduchintala <[email protected]> Co-authored-by: Xuesong Yang <[email protected]> Co-authored-by: Yi Dong <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Nithin Rao <[email protected]>

Signed-off-by: Nikolay Karpov <[email protected]> Signed-off-by: hsiehjackson <[email protected]>

…d Flash Attention (#6666) * move to nvidia megatron repo (#6465) (#6475) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Megatron KERPLE positional embeddings (#6478) (#6480) * [TTS] FastPitch adapter fine-tune and conditional layer normalization (#6416) [TTS] FastPitch adapter fine-tune and conditional layer normalization (#6416) --------- * [TTS] whitelist broken path fix. (#6412) * [TTS] whitelist broken path fix. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- * [TTS] FastPitch speaker encoder (#6417) * Add initial codes * Remove wemb * Fix import * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Restore aligner loss * Add ConditionalInput * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix error and support pre-trained config * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Follow comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rename config * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change copyright and random weight test * Add initial codes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix import error * Add initial codes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix dataset error * Remove reference speaker embedding * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove SV encoder * Follow comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix length type * Fix 
append * Move error msg * Add look-up into speaker encoder * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add valueerror msg * Move lookup * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix error * Rebase and Fix error * Fix spk encoder * Rename n_speakers * Follow comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix n_speakers None error --------- * Sharded manifests for tarred datasets (#6395) * testing sharded manifests * compatibility * proper fixes * adding flag tot convert_to_tarred_audio_dataset * shard_manifests conf param * propagating the shard_manifests param * propagating the shard_manifests param * distributed checks * typo * typo * fixes * fixes * fixes * fixes * fixes * fixes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes based on PR comments and tests * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes to convert_to_tarred_audio_dataset.py * reversing manifest shards flag * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests * excluding manifests from webdataset url expansion * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * expand manifest paths before attempting to cache from datastore * explicit use of UTF-8 for manifest i/o --------- * Update wfst_text_normalization.rst (#6374) Add Hungarian (incoming in NeMo-text-processing) * Support Swiglu in TP PP Conversion (#6437) (#6451) * Support Swiglu in TP PP Conversion * Guard activation * Guard activation --------- * Update NeMo_TTS_Primer.ipynb (#6436) * Update 
NeMo_TTS_Primer.ipynb Changed a mistake in line 782. Instead of frequency band (ie. pitch) we should write frequency bin. Note that frequency bins in FFT are not related to pitch. * Update NeMo_TTS_Primer.ipynb Corrected the description of spectrogram and mel spectrogram calculations in lines 782 & 783 and added a fourth point to the description and added a reference for more mathematical details at the end of this point. --------- * add rampup batch size support for Megatron GPT (#6424) * added rampup batch size support * added tests for rampup batch size * fixed the typos * added assertions * changed assertion rules * deleted unused imports * changed tests for rampup batch size * updated rampup batch size tests * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed styling * rampup batch size tests changes --------- * Meagtron encoder decoder fix for empty validation outputs (#6459) (#6461) * 1. Meagtron encoder decoder fix for empty validation outputs. * 1. Debugging. 
--------- * Code-Switching dataset creation - upgrading to aggregate tokenizer manifest format (#6448) * added functionality to create agg tokenizer compatible manifest for CS, flag to use this mode by default * updated README with the new agg_tokenizer_manifest flag * fixed typo in scripts/speech_recognition/code_switching/README.md * changed agg_tokenizer_manifest to is_lid_manifest --------- * Added/updated new Conformer configs (#6426) (#6467) * Update script for ngram rnnt and hat beam search decoding (#6370) * add rnnt ngram beamsearch script * add return encoding embedding option * update script * add rnnt and hat ngram decoding script * add some parameters * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add return_encoder_embeddings parameter to RNNTDecodingConfig * replace return_encoder_embeddings parameter * generalization of scipt behavior * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove return_encoder_embeddings parameter * remove return_encoder_embeddings parameter * add manual encoder_embeddings calculation * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix beam_width value to 8 * fix rescoring description --------- * BERT pre-training mp fork to spawn (#6442) (#6454) * change bert fork to spawn * num_workers=0 fix --------- * fix replace_bos_with_pad not found (#6443) (#6450) * reduce workers on NMT CI (#6472) (#6474) * 1. Added KERPLE positional embeddings to encoder-decoder. * 1. Added a missing file. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Fixing commits. * 1. Debugging. * 1. Debugging. * 1. Debugging. * 1. Debugging. 
--------- Signed-off-by: hsiehjackson <[email protected]> Signed-off-by: Xuesong Yang <[email protected]> Signed-off-by: Dima Rekesh <[email protected]> Signed-off-by: Jim O’Regan <[email protected]> Signed-off-by: smajumdar <[email protected]> Signed-off-by: Mostafa Ghorbandoost <[email protected]> Signed-off-by: Dmytro Pykhtar <[email protected]> Signed-off-by: Dmytro Pykhtar <[email protected]> Signed-off-by: Micha Livne <[email protected]> Signed-off-by: Kunal Dhawan <[email protected]> Signed-off-by: andrusenkoau <[email protected]> Signed-off-by: Andrei Andrusenko <[email protected]> Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Cheng-Ping Hsieh <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuesong Yang <[email protected]> Co-authored-by: Dima Rekesh <[email protected]> Co-authored-by: Jim O’Regan <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Somshubra Majumdar <[email protected]> Co-authored-by: Mostafa Ghorbandoost <[email protected]> Co-authored-by: Dmytro Pykhtar <[email protected]> Co-authored-by: Dmytro Pykhtar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Kunal Dhawan <[email protected]> Co-authored-by: Andrei Andrusenko <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Fix an invalid link in get_data.py of ljspeech (#6456) Usage of the link in line 63 leads to downloading a html file not a tsv file, so we need to change it to a raw link. Signed-off-by: Mostafa Ghorbandoost <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * 1. Added external index sample. 
(#6462) (#6483) Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Update README to add core installation (#6488) (#6489) * update README for megatron-core * fix --------- Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Fix cache aware hybrid bugs (#6466) (#6484) Signed-off-by: hsiehjackson <[email protected]> * Fix typos (#6494) (#6495) Signed-off-by: smajumdar <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add disclaimer about dataset for ASR (#6496) Signed-off-by: smajumdar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * fix (#6502) datastore_path_to_webdataset_url(p) if is_datastore_path(p) and is_tarred_path(p) else p NameError: name 'is_tarred_path' is not defined Co-authored-by: George <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * fix broken links r1.18.0 (#6501) (#6504) * fix broken links * fix broken links --------- Signed-off-by: Evelina <[email protected]> Co-authored-by: Evelina <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Create functions for TTS preprocessing without dataloader (#6317) * [TTS] Create functions for TTS preprocessing without dataloader Signed-off-by: Ryan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Cache aware streaming nfa (#6209) * add cache aware streaming to nemo aligner Signed-off-by: Slyne Deng <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [BugFix] Force _get_batch_preds() to keep logits in decoder timestamps generator (#6499) * [BugFix] _get_batch_preds() is forced to keep logits in decoder timestamps generators Signed-off-by: Taejin Park <[email protected]> * Ingnore keep_logits boolean in FrameASRBatchLogits Signed-off-by: Taejin 
Park <[email protected]> --------- Signed-off-by: Taejin Park <[email protected]> Co-authored-by: Jagadeesh Balam <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Fix FastPitch energy code (#6511) Signed-off-by: Ryan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * fix custom forward_torch_softmax (#6512) (#6517) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [TTS] fixed broken path. (#6514) (#6518) Signed-off-by: Xuesong Yang <[email protected]> Co-authored-by: Xuesong Yang <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Fix normalization of impulse response in ImpulsePerturbation (#6505) Signed-off-by: Ante Jukić <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add interleaved pp support (#6498) * Add support for Virtual Pipeline Parallel conversion Signed-off-by: smajumdar <[email protected]> * Add support for Virtual Pipeline Parallel conversion Signed-off-by: smajumdar <[email protected]> * Switch to megatron core Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * Fix typos (#6523) * Fix typos Signed-off-by: smajumdar <[email protected]> * Fix typos Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * New noise_norm perturbation based on Riva work (#6445) * Initial commit for new noise_norm perturbation Signed-off-by: Daniel Egert <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Minor fix to random seed in 
perturb Signed-off-by: Daniel Egert <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Updated code to reflect feedback Signed-off-by: Daniel Egert <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Updates for feedback given by code reviewers Signed-off-by: Daniel Egert <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Updates in response to PR feedback Signed-off-by: Daniel Egert <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added comment about ref_mic being None Signed-off-by: Daniel Egert <[email protected]> * Updated perturb to use inspect module Signed-off-by: Daniel Egert <[email protected]> --------- Signed-off-by: Daniel Egert <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Add script for computing feature stats (#6508) * [TTS] Add script for computing feature stats Signed-off-by: Ryan <[email protected]> * [TTS] Add overwrite config Signed-off-by: Ryan <[email protected]> --------- Signed-off-by: Ryan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add Frame-VAD model and datasets (#6441) * add model, dataset, necessary utils and tests Signed-off-by: stevehuang52 <[email protected]> * fix tarred data Signed-off-by: stevehuang52 <[email protected]> * fix typo Signed-off-by: stevehuang52 <[email protected]> * update docstring Signed-off-by: stevehuang52 <[email protected]> * update doc Signed-off-by: stevehuang52 <[email protected]> * update doc Signed-off-by: stevehuang52 <[email protected]> * update pretrained model info Signed-off-by: stevehuang52 <[email protected]> --------- Signed-off-by: stevehuang52 <[email 
protected]> Signed-off-by: hsiehjackson <[email protected]> * Support dynamic length batches with GPT SFT (#6510) * Support synamic length with GPT SFT Signed-off-by: Abhinav Khattar <[email protected]> * make branch functional Signed-off-by: Abhinav Khattar <[email protected]> --------- Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * added back the fast emit section to the configs. (#6540) (#6542) * added back the fast emit section to the configs. * added back the fast emit section to the configs. --------- Signed-off-by: Vahid <[email protected]> Co-authored-by: Vahid Noroozi <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * removing unnessary avoid_bfloat16_autocast_context (#6481) Signed-off-by: Dima Rekesh <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * FC models in menu (#6473) * FC models in menu Signed-off-by: Dima Rekesh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Dima Rekesh <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Add tutorials for FastPitch TTS speaker adaptation with adapters (#6431) * Add tts adapter tutorial Signed-off-by: hsiehjackson <[email protected]> * Update main tutorial Signed-off-by: hsiehjackson <[email protected]> * Add tts adapter tutorial Signed-off-by: hsiehjackson <[email protected]> * Update main tutorial Signed-off-by: hsiehjackson <[email protected]> * Update tutorial Signed-off-by: hsiehjackson <[email protected]> * Follow comments Signed-off-by: hsiehjackson <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Follow comments Signed-off-by: hsiehjackson <[email protected]> * Fix load .nemo error Signed-off-by: hsiehjackson <[email protected]> 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Support multi-speaker fine-tune Signed-off-by: hsiehjackson <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Follow comments Signed-off-by: hsiehjackson <[email protected]> * Use .nemo Signed-off-by: hsiehjackson <[email protected]> * Follow Comments Signed-off-by: hsiehjackson <[email protected]> * Fix bug Signed-off-by: hsiehjackson <[email protected]> * Fix bug Signed-off-by: hsiehjackson <[email protected]> * Fix bug Signed-off-by: hsiehjackson <[email protected]> * Add precomputed speaker emb Signed-off-by: hsiehjackson <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix space Signed-off-by: hsiehjackson <[email protected]> * Remove repeated argument Signed-off-by: hsiehjackson <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * optional batch size Signed-off-by: hsiehjackson <[email protected]> * Fix comments in notebook Signed-off-by: hsiehjackson <[email protected]> --------- Signed-off-by: hsiehjackson <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Create initial TTS dataset feature processors (#6507) Signed-off-by: Ryan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * fix (#6529) (#6546) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add FastConformer Hybrid ASR models for EN, ES, IT, DE, PL, HR, UA, BY (#6549) (#6553) * Added fastconfomer hybrid asr models for en, es, it, de, pl, hr, ua, by * updated ASR docs with the fastconformer hybrid checkpoints * added the fastconformer RNNT and CTC 
models --------- Signed-off-by: KunalDhawan <[email protected]> Co-authored-by: Kunal Dhawan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add scores for FastConformer models (#6557) (#6558) Signed-off-by: smajumdar <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Fix fp16 (#6543) (#6544) Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: Sandeep Subramanian <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Patch transcribe and support offline transcribe for hybrid model (#6550) (#6559) Signed-off-by: fayejf <[email protected]> Co-authored-by: fayejf <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Fix notebook bad json (#6561) Signed-off-by: smajumdar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Change Megatron Enc Dec model to use persistent_workers (#6548) (#6552) * persistent workers * fix --------- Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Make KenLM with PC for AggregateTokenizer and merge it (#6081) * do_lowercase, rm_punctuation Signed-off-by: Nikolay Karpov <[email protected]> * support beam_strategy = beam Signed-off-by: Nikolay Karpov <[email protected]> * black Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix config and^Cunctuation capitalization Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rm math Signed-off-by: Nikolay Karpov <[email protected]> * update kenlm Signed-off-by: Nikolay Karpov <[email protected]> * black Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * add opengrm Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * mv install_beamsearch_decoders Signed-off-by: Nikolay Karpov <[email protected]> * punctuation_to_preserve Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Only tikenizer opion Signed-off-by: Nikolay Karpov <[email protected]> * Black Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * DEFAULT_TOKEN_OFFSET Signed-off-by: Nikolay Karpov <[email protected]> * aggregate_tokenizer Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * install kenlm with more than 5gram Signed-off-by: Nikolay Karpov <[email protected]> * install_beamsearch_decoders Signed-off-by: Nikolay Karpov <[email protected]> * ngram_bin_path kenlm_bin_path Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * black Signed-off-by: Nikolay Karpov <[email protected]> * fix greedy PC bug Signed-off-by: Nikolay Karpov <[email protected]> * move global params Signed-off-by: Nikolay Karpov <[email protected]> * fix description and perplexity Signed-off-by: Nikolay Karpov <[email protected]> * fix description Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * NEMO_PATH Signed-off-by: Nikolay Karpov <[email protected]> * nemo:23.01 Signed-off-by: Nikolay Karpov <[email protected]> * License Signed-off-by: Nikolay Karpov <[email protected]> * description Signed-off-by: 
Nikolay Karpov <[email protected]> * isinstance Signed-off-by: Nikolay Karpov <[email protected]> * refactor kenlm stdin Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * black Signed-off-by: Nikolay Karpov <[email protected]> * add cmd arg Signed-off-by: Nikolay Karpov <[email protected]> * use new iter_files Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * EncDecHybridRNNTCTCModel Signed-off-by: Nikolay Karpov <[email protected]> * punctuation Signed-off-by: Nikolay Karpov <[email protected]> * train_kenlm args Signed-off-by: Nikolay Karpov <[email protected]> * add docstrings Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add ngram_merge docs Signed-off-by: Nikolay Karpov <[email protected]> * ngram_prune Signed-off-by: Nikolay Karpov <[email protected]> * rename to ngram_merge Signed-off-by: Nikolay Karpov <[email protected]> * rename to ngram Signed-off-by: Nikolay Karpov <[email protected]> * add comments Signed-off-by: Nikolay Karpov <[email protected]> * Ngram Signed-off-by: Nikolay Karpov <[email protected]> * nemo_model_file Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * install_opengrm_ngram Signed-off-by: Nikolay Karpov <[email protected]> * install opengrm Signed-off-by: Nikolay Karpov <[email protected]> * rename to install_opengrm.sh Signed-off-by: Nikolay Karpov <[email protected]> * rm extra import Signed-off-by: Nikolay Karpov <[email protected]> * train_paths Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * text_processing 
Signed-off-by: Nikolay Karpov <[email protected]> * fix ngram_bin_path Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * DECODERS_PATH Signed-off-by: Nikolay Karpov <[email protected]> * farcompile Signed-off-by: Nikolay Karpov <[email protected]> * rm text processing Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * text_processing Signed-off-by: Nikolay Karpov <[email protected]> * AggregateTokenizer.DummyTokenizer Signed-off-by: Nikolay Karpov <[email protected]> * comments Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * TextProcessingConfig Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * typo Signed-off-by: Nikolay Karpov <[email protected]> * doc Signed-off-by: Nikolay Karpov <[email protected]> * types Signed-off-by: Nikolay Karpov <[email protected]> * nemo_model_file Signed-off-by: Nikolay Karpov <[email protected]> * rm assert Signed-off-by: Nikolay Karpov <[email protected]> * import kenlm_utils Signed-off-by: Nikolay Karpov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * return None Signed-off-by: Nikolay Karpov <[email protected]> * Copyright Signed-off-by: Nikolay Karpov <[email protected]> * 2022 Signed-off-by: Nikolay Karpov <[email protected]> * 2023 Signed-off-by: Nikolay Karpov <[email protected]> --------- Signed-off-by: Nikolay Karpov <[email protected]> Signed-off-by: Nikolay Karpov <[email protected]> Co-authored-by: Nikolay Karpov <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> 
Signed-off-by: hsiehjackson <[email protected]> * fix for running on 1 GPU. Signed-off-by: hsiehjackson <[email protected]> * temp rtd fix (#6568) (#6569) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Add script for mapping speaker names to indices (#6509) Signed-off-by: Ryan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * whitespace (#6574) Signed-off-by: Nikolay Karpov <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Update manifest.py for speedup (#6565) (#6573) * Update manifest.py Re-order the checks for faster processing audio filepaths that are already absolute paths * Update manifest.py --------- Signed-off-by: He Huang (Steve) <[email protected]> Co-authored-by: He Huang (Steve) <[email protected]> Co-authored-by: Vahid Noroozi <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * More streaming conformer export fixes (#6567) (#6578) Signed-off-by: Greg Clark <[email protected]> Co-authored-by: Greg Clark <[email protected]> Co-authored-by: Vahid Noroozi <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * user selected max_seq_len should be less than model's max_seq_len (#6333) (#6386) * user selection should not break model max limit * eval max seq length --------- Signed-off-by: arendu <[email protected]> Signed-off-by: Adi Renduchintala <[email protected]> Co-authored-by: Adi Renduchintala <[email protected]> Co-authored-by: Sandeep Subramanian <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Framework for PEFT via mixins (#6391) * init commit ptuning via mixin Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updates Signed-off-by: arendu <[email protected]> * gpt ptuning places virtual tokens on the 
left only Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * encoder input modified when pre_process is true Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * optimizer group and state dict updates Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adapter ptuning working for pp>1 Signed-off-by: arendu <[email protected]> * adapter defaults Signed-off-by: arendu <[email protected]> * adapter ptuining config defaults Signed-off-by: arendu <[email protected]> * training works Signed-off-by: arendu <[email protected]> * loading and saving adapter only params during training Signed-off-by: arendu <[email protected]> * added checks and comments Signed-off-by: arendu <[email protected]> * clean up Signed-off-by: arendu <[email protected]> * checks for grad is None before calling all_reduce Signed-off-by: arendu <[email protected]> * load adapter .nemo file working Signed-off-by: arendu <[email protected]> * resume training for adapters Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * peft tuning Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor Signed-off-by: arendu <[email protected]> * file not needed Signed-off-by: arendu <[email protected]> * undo prompt learning dataset changes Signed-off-by: arendu <[email protected]> * undo updates to gpt prompt learning model Signed-off-by: arendu <[email protected]> * naming updates Signed-off-by: arendu <[email protected]> * decoding Signed-off-by: arendu <[email protected]> * predict_step in gpt_sft_model Signed-off-by: arendu <[email protected]> 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updates Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * removed inference from tuning config Signed-off-by: arendu <[email protected]> * no test in peft training Signed-off-by: arendu <[email protected]> * answer only loss and correct defaults for val_loss Signed-off-by: arendu <[email protected]> * hybrid adapters and ptuning Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * eval working.. Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * prepending tokens for ptuning Signed-off-by: arendu <[email protected]> * cleaned up eval config Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * clean up Signed-off-by: arendu <[email protected]> * update Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * default prompt template Signed-off-by: arendu <[email protected]> * Lora added Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Support synamic length with GPT SFT Signed-off-by: Abhinav Khattar <[email protected]> * make branch functional Signed-off-by: Abhinav Khattar <[email protected]> * defaults to max_pad_length=False in GPT SFT dataset Signed-off-by: arendu <[email protected]> * adapter parallel_adapters to support Lora Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added early stopping by 
default Signed-off-by: arendu <[email protected]> * eval script for peft and eval config. bug fixes in predict step and added out_features to t5 adapter config Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updates Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updates Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * docs Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * better defaults Signed-off-by: arendu <[email protected]> * updates Signed-off-by: arendu <[email protected]> * update Signed-off-by: arendu <[email protected]> * docs Signed-off-by: arendu <[email protected]> --------- Signed-off-by: arendu <[email protected]> Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: Adi Renduchintala <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * cache and reuse inputs (#6422) (#6452) Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add patches for Virtual Parallel conversion (#6589) * Add patches for Virtual Parllel conversion Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: smajumdar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * Pass `.scale` instead of scaler object to core 
(#6551) * pass .scale instead of scaler object to core (#6545) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Eric Harper <[email protected]> * Update megatron_gpt_model.py Signed-off-by: Abhinav Khattar <[email protected]> * scale changes for main Signed-off-by: Abhinav Khattar <[email protected]> --------- Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Documentation for ASR-TTS models (#6594) (#6595) * Add docs about hybrid ASR-TTS models * Add docs about text-only datasets * Add docs about ASR-TTS checkpoints * Add docs about ASR-TTS configs and training * Clean up * ASR-TTS docs: add to api, fix imports * Clean up * Wrap optional import * Revert general ASR import --------- Signed-off-by: Vladimir Bataev <[email protected]> Co-authored-by: Vladimir Bataev <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Fix aligner nan loss in fp32 (#6435) * Fix nan loss in fp32 Signed-off-by: hsiehjackson <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: hsiehjackson <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * Update SDP docs (#6485) (#6596) * add info about SDP e.g. 
processor classes in docs * add link to SDP docs in README * address code review comments and add SDP overview diagram * Fix spelling typo --------- Signed-off-by: Elena Rastorgueva <[email protected]> Co-authored-by: Elena Rastorgueva <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Bug/typo fixes (#6599) Signed-off-by: Igor Gitman <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Manual garbage collection with an interval (#6469) (#6482) * Manual garbage collection with an interval * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use trainer.global_step for tracking the interval of GC --------- Signed-off-by: Sangkug Lym <[email protected]> Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Make tensor split contiguous (#6580) (#6593) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [ASR] Fix for old models in change_attention_model (#6608) * fixes Signed-off-by: sam1373 <[email protected]> * done already Signed-off-by: sam1373 <[email protected]> --------- Signed-off-by: sam1373 <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Update manifest.py to use os.path for get_full_path (#6598) * Update manifest.py to use os.path for get_full_path Signed-off-by: He Huang (Steve) <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update manifest.py to get rid of pathlib Signed-off-by: He Huang (Steve) <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update manifest.py Signed-off-by: He Huang (Steve) <[email protected]> * 
Update manifest.py Signed-off-by: He Huang (Steve) <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: He Huang (Steve) <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Vahid Noroozi <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Cherry pick commits in #6601 to main (#6611) * fix write Signed-off-by: fayejf <[email protected]> * decoding ctc Signed-off-by: fayejf <[email protected]> * temp set rnnt decoding return_best_hypothesis to true Signed-off-by: fayejf <[email protected]> * add wer cal back to transcribe_speech as requested Signed-off-by: fayejf <[email protected]> * add wer cal back to speech_to_text_buffered_infer_rnnt as requested Signed-off-by: fayejf <[email protected]> * add wer cal back to speech_to_text_buffered_infer_ctc as requested Signed-off-by: fayejf <[email protected]> * style fix Signed-off-by: fayejf <[email protected]> * reflect change in asr_evaluator Signed-off-by: fayejf <[email protected]> * reflect som and vahid comment Signed-off-by: fayejf <[email protected]> * remove return_best_hy=true in transcribe_speech Signed-off-by: fayejf <[email protected]> * no text skip Signed-off-by: fayejf <[email protected]> * revert partial Signed-off-by: fayejf <[email protected]> --------- Signed-off-by: fayejf <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Create dummy iters to satisy len checks (#6600) (#6603) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * add GPT eval mode fix for interleaved to main (#6610) Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Fix batch size reconf for T5 FT for multi-validation (#6582) (#6588) 
Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Not doing CastToFloat by default (#6524) (#6563) * Not doing CastToFloat by default * Added docustring * Dummy commit --------- Signed-off-by: Boris Fomitchev <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Turn autocast off when precision is fp32 (#6576) * Turn autocast off when precision is fp32 (#6554) * Turn autocast off when precision is fp32 Signed-off-by: Abhinav Khattar <[email protected]> * address review Signed-off-by: Abhinav Khattar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by: Abhinav Khattar <[email protected]> * merge Signed-off-by: Abhinav Khattar <[email protected]> --------- Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> * correct auto-merge Signed-off-by: Abhinav Khattar <[email protected]> * correct auto-merge Signed-off-by: Abhinav Khattar <[email protected]> * add to GPT SFT Signed-off-by: Abhinav Khattar <[email protected]> --------- Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * update core commit hash in readme (#6622) (#6623) Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * add hat image to docs (#6619) (#6621) Signed-off-by: andrusenkoau 
<[email protected]> Co-authored-by: Andrei Andrusenko <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Allow indices exchange via distributed (#6618) (#6624) Signed-off-by: Mikołaj Błaż <[email protected]> Co-authored-by: mikolajblaz <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Offline and streaming inference support for hybrid model (#6570) * streaming buffered for hybrid + ctc Signed-off-by: fayejf <[email protected]> * change default model_stride in eval.yaml Signed-off-by: fayejf <[email protected]> * add fc model_stride Signed-off-by: fayejf <[email protected]> * small fix Signed-off-by: fayejf <[email protected]> * check whether model and decoding match Signed-off-by: fayejf <[email protected]> * small fix Signed-off-by: fayejf <[email protected]> * streaming buffered for hybrid + rnnt Signed-off-by: fayejf <[email protected]> * style fix Signed-off-by: fayejf <[email protected]> * fix yaml Signed-off-by: fayejf <[email protected]> * reflect comment wip Signed-off-by: fayejf <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by: fayejf <[email protected]> * refactor and verified Signed-off-by: fayejf <[email protected]> * add get_full_path to buffered Signed-off-by: fayejf <[email protected]> * small fix Signed-off-by: fayejf <[email protected]> * add RNNTDecodingConfig Signed-off-by: fayejf <[email protected]> * model name & instruction of changing decoding Signed-off-by: fayejf <[email protected]> --------- Signed-off-by: fayejf <[email protected]> Signed-off-by: fayejf <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * Patch decoding for PC models (#6630) (#6631) * Patch decoding logic for PC models * Patch decoding logic for PC models --------- Signed-off-by: smajumdar <[email protected]> Co-authored-by: Somshubra 
Majumdar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Fix wer.py where 'errors' variable was not set (#6633) (#6634) Fix wer.py where 'errors' variable was not set when both reference and hypothesis are empty strings Signed-off-by: He Huang (Steve) <[email protected]> Co-authored-by: He Huang (Steve) <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Restore GPT support for interleaved pipeline parallelism (#6528) (#6613) * Restore logic for data-parallel communication with pipeline parallelism in GPT * Support dynamic attention masks in GPT * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug typos * Debug data iterator caching with interleaved pipeline parallelism Each model chunk accesses the data iterator multiple times, so we need to cache multiple samples. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update Megatron-LM commit * Distinguish between list of data iterators and data iterator that is a list * Create dummy iters to satisy len checks * Kludge while waiting for Megatron-LM update * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set transformers offline to avoid rate limiting --------- Signed-off-by: Tim Moon <[email protected]> Signed-off-by: Eric Harper <[email protected]> Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: ericharper <[email protected]> Signed-off-by: Tim Moon <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add FA Signed-off-by: hsiehjackson <[email protected]> * Fix XPOS Signed-off-by: hsiehjackson <[email protected]> * Add warning Signed-off-by: hsiehjackson <[email protected]> * Fix bugs 
Signed-off-by: hsiehjackson <[email protected]> * Fix attention Signed-off-by: hsiehjackson <[email protected]> * Fix comment Signed-off-by: hsiehjackson <[email protected]> * Fix cast dtype Signed-off-by: hsiehjackson <[email protected]> * Undo xpos Signed-off-by: hsiehjackson <[email protected]> * bugfix (#6636) Signed-off-by: fayejf <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Disable interctc tests (#6638) Signed-off-by: Igor Gitman <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add megatron_core to requirements (#6639) (#6640) * add megatron_core to requirements * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: ericharper <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * Remove from jenkins (#6642) * Remove from jenkins (#6641) * add megatron_core to requirements Signed-off-by: ericharper <[email protected]> * remove from jenkins Signed-off-by: ericharper <[email protected]> --------- Signed-off-by: ericharper <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove dup Signed-off-by: ericharper <[email protected]> --------- Signed-off-by: ericharper <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * sft model can use this script for eval (#6637) * sft model can use this script for eval Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * please fix me Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more 
information, see https://pre-commit.ci * minor Signed-off-by: arendu <[email protected]> --------- Signed-off-by: arendu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Fix TTS audio preprocessing bugs (#6628) Signed-off-by: Ryan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Move black parameters to pyproject.toml (#6647) Signed-off-by: Vladimir Bataev <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * ASR-TTS Models: Support hybrid RNNT-CTC, improve docs. (#6620) * ASR-TTS: support hybrid RNNT-CTC models * Do not warn on optional import * Explain adding options to config * Fix import guard docs * Add docs for ConcatDataset * Add explanation for sampling parameters * Initial docs for the enhancer model * Fix use_start_end_token parameter usage --------- Signed-off-by: Vladimir Bataev <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * fix conversion and eval (#6648) * fix conversion and eval Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * Confidence ensembles implementation (#6614) * Working version to train conf model + save ensemble class Signed-off-by: Igor Gitman <[email protected]> * Working version Signed-off-by: Igor Gitman <[email protected]> * Remove copy of transcribe_speech.py Signed-off-by: Igor Gitman <[email protected]> * Move models parameter to config Signed-off-by: Igor Gitman <[email protected]> * Add explicit parameters to transcribe Signed-off-by: Igor Gitman <[email protected]> * Small cleanups Signed-off-by: Igor Gitman <[email protected]> * Add temperature and integration tests 
Signed-off-by: Igor Gitman <[email protected]> * Add more tests Signed-off-by: Igor Gitman <[email protected]> * Add pc removal config Signed-off-by: Igor Gitman <[email protected]> * Cleanup Signed-off-by: Igor Gitman <[email protected]> * Fix typo Signed-off-by: Igor Gitman <[email protected]> * Address review comments Signed-off-by: Igor Gitman <[email protected]> --------- Signed-off-by: Igor Gitman <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Patch memory used for NeMo Megatron models (#6615) * Patch memory used for NeMo Megatron models Signed-off-by: smajumdar <[email protected]> * Cleanup the dtype of embeddings Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor util function for parsing precision Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor util function for parsing precision Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Try patch for Megatron O2 Signed-off-by: smajumdar <[email protected]> * Refactor to incorporate megatron amp 02 state Signed-off-by: smajumdar <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor to incorporate megatron amp 02 state Signed-off-by: smajumdar <[email protected]> * Correct indent Signed-off-by: smajumdar <[email protected]> * Correct utils import Signed-off-by: smajumdar <[email protected]> --------- Signed-off-by: smajumdar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * handle artifacts when path is dir (#6658) Signed-off-by: arendu <[email protected]> Signed-off-by: hsiehjackson <[email 
protected]> * remove upgrading setuptools in reinstall.sh (#6659) Signed-off-by: Xuesong Yang <[email protected]> Co-authored-by: fayejf <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * merge lora weights into base model (#6597) * merge lora weights into base model Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * typo fix Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor update Signed-off-by: arendu <[email protected]> * update copyright Signed-off-by: arendu <[email protected]> * eval needs to know the PEFT class Signed-off-by: arendu <[email protected]> * add target class in training script so that we can use it in eval Signed-off-by: arendu <[email protected]> * update Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update to work for tp1 Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * set restore model path Signed-off-by: arendu <[email protected]> * peft can be none Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated merge script so that eval works easily Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * eval with peft or sft model Signed-off-by: arendu <[email protected]> * keep sentences in jsonl format Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * convert sft using correct classpath Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto 
fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated to force sft yaml to have the correct target Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated docs Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix conversion and eval Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]> * upgrade to 23.04 (#6660) Signed-off-by: ericharper <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Merge r1.18.0 bugfixes and doc updates to main (#6655) * update branch Signed-off-by: ericharper <[email protected]> * Remove from jenkins (#6641) * add megatron_core to requirements Signed-off-by: ericharper <[email protected]> * remove from jenkins Signed-off-by: ericharper <[email protected]> --------- Signed-off-by: ericharper <[email protected]> * remove dup Signed-off-by: ericharper <[email protected]> * update branch Signed-off-by: ericharper <[email protected]> * [TTS] reformat NeMo versions in the tts logging messages to avoid batch process them when upgrading NeMo versions. 
Signed-off-by: Xuesong Yang <[email protected]> --------- Signed-off-by: ericharper <[email protected]> Signed-off-by: Xuesong Yang <[email protected]> Co-authored-by: Xuesong Yang <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Confidence ensembles: fix issues and add tuning functionality (#6657) * Implement compute confidence to properly handle blanks Signed-off-by: Igor Gitman <[email protected]> * Implement proper confidence for transducers Signed-off-by: Igor Gitman <[email protected]> * Implement tuning logic Signed-off-by: Igor Gitman <[email protected]> * Add tests for confidence tuning Signed-off-by: Igor Gitman <[email protected]> * Remove unused imports Signed-off-by: Igor Gitman <[email protected]> * Add types/docs Signed-off-by: Igor Gitman <[email protected]> * Add comment about the main conf compute loop Signed-off-by: Igor Gitman <[email protected]> --------- Signed-off-by: Igor Gitman <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * [TTS] Implement new TextToSpeech dataset (#6575) * [TTS] Implement new TextToSpeech dataset Signed-off-by: Ryan <[email protected]> * [TTS] Add unit tests Signed-off-by: Ryan <[email protected]> * [TTS] Fix defaulting of use_log_energy Signed-off-by: Ryan <[email protected]> * [TTS] Fix TTS export test Signed-off-by: Ryan <[email protected]> --------- Signed-off-by: Ryan <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Dialogue dataset (#6654) * chatbot interface Signed-off-by: Yi Dong <[email protected]> * latest gradio Signed-off-by: Yi Dong <[email protected]> * default greedy Signed-off-by: Yi Dong <[email protected]> * better chatbot Signed-off-by: Yi Dong <[email protected]> * handle preamble Signed-off-by: Yi Dong <[email protected]> * added chatbot training capablity Signed-off-by: Yi Dong <[email protected]> * added chatbot ui Signed-off-by: Yi Dong <[email protected]> * remove debug code Signed-off-by: Yi Dong <[email protected]> * default human 
Signed-off-by: Yi Dong <[email protected]> * use special token for roles Signed-off-by: Yi Dong <[email protected]> * special tokens Signed-off-by: Yi Dong <[email protected]> * fix name Signed-off-by: Yi Dong <[email protected]> * new chat dataset Signed-off-by: Yi Dong <[email protected]> * fix the system token Signed-off-by: Yi Dong <[email protected]> * upgrade gradio Signed-off-by: Yi Dong <[email protected]> * save the chat history Signed-off-by: Yi Dong <[email protected]> * update ui Signed-off-by: root <[email protected]> * update chat interface Signed-off-by: Yi Dong <[email protected]> * handles canonical form Signed-off-by: Yi Dong <[email protected]> * new sft chatbot Signed-off-by: Yi Dong <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change format Signed-off-by: Yi Dong <[email protected]> * check extra_id in the tokenizer Signed-off-by: Yi Dong <[email protected]> * added vocab property check Signed-off-by: Yi Dong <[email protected]> * added missing file Signed-off-by: Yi Dong <[email protected]> --------- Signed-off-by: Yi Dong <[email protected]> Signed-off-by: root <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sandeep Subramanian <[email protected]> Signed-off-by: hsiehjackson <[email protected]> * Add support for RNNT/hybrid models to partial transcribe (#6609) * Add support for RNNT/hybrid models to partial transcribe Signed-off-by: He Huang (Steve) <[email protected]> * Update transcribe_utils.py Signed-off-by: He Huang (Steve) <[email protected]> * Update transcribe_speech.py Signed-off-by: He Huang (Steve) <[email protected]> * Update transcr…

Signed-off-by: Nikolay Karpov <[email protected]>

whitespace

20f49e7

Signed-off-by: Nikolay Karpov <[email protected]>

github-actions bot added the ASR label May 5, 2023

karpnv requested a review from titu1994 May 5, 2023 17:59

titu1994 approved these changes May 5, 2023

titu1994 merged commit 0084c04 into NVIDIA:main May 5, 2023
5 checks passed
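
For readers who want to see the behavior from the changelog in isolation, here is a minimal sketch. It is not the code merged in this PR: the helper name `separate_punctuation`, the default mark set `.,?!`, and the regex approach are all assumptions made for illustration. It only reproduces the two mappings given in the description (`'some text.' -> 'some text .'` and `'some text .' -> 'some text .'`).

```python
import re

# Hypothetical default set of marks; the set used by the real script may differ.
DEFAULT_MARKS = ".,?!"


def separate_punctuation(text: str, marks: str = DEFAULT_MARKS) -> str:
    """Return `text` with each punctuation mark set off by single spaces.

    Illustrative helper only, not the NeMo implementation.
    """
    # Surround every mark with spaces, whether or not it was already separated.
    spaced = re.sub(f"([{re.escape(marks)}])", r" \1 ", text)
    # Collapse the resulting whitespace runs and trim the ends, so input that
    # is already separated comes back unchanged.
    return " ".join(spaced.split())


if __name__ == "__main__":
    for sample in ("some text.", "some text ."):
        print(repr(sample), "->", repr(separate_punctuation(sample)))
```

Because the function is idempotent, applying it to already separated text is a no-op, which is the property the changelog asks for. Note that this sketch would also split decimals such as `3.14`; whether that matters depends on the text being processed.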

hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023

whitespace (NVIDIA#6574)

5fd9c7f

Signed-off-by: Nikolay Karpov <[email protected]> Signed-off-by: hsiehjackson <[email protected]>

yaoyu-33 pushed a commit that referenced this pull request Oct 16, 2023

whitespace (#6574)

615a256

Signed-off-by: Nikolay Karpov <[email protected]>
