From 40c1cccc9eb6773243a5637ddaf0e6ff023d87e9 Mon Sep 17 00:00:00 2001 From: Somshubra Majumdar Date: Wed, 12 Jul 2023 10:00:05 -0700 Subject: [PATCH] Add support for Numba FP16 RNNT Loss (#6991) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Force working space memory to always be in fp32 Signed-off-by: smajumdar * Add support for fp16 testing in Numba Signed-off-by: smajumdar * Add support for fp16 testing in Numba Signed-off-by: smajumdar * Add support for fp16 testing in Numba Signed-off-by: smajumdar * Fix cost calculation by upcasting to fp32 Signed-off-by: smajumdar * Fix cost calculation by upcasting to fp32 Signed-off-by: smajumdar * Add support to check if numba fp16 is available Signed-off-by: smajumdar * add RNN-T loss implemented by PyTorch and test code (#5312) * Fix the bugs in cache-aware streaming Conformer (#5032) Signed-off-by: Vahid Signed-off-by: Hainan Xu * IA3 support for GPT and T5 (#4909) * init commit for ia3 adater training in GPT Signed-off-by: arendu * ia3 adater training in GPT, models and adapter classes Signed-off-by: arendu * reshape to operate even on non-contiguous tensors Signed-off-by: arendu * configs Signed-off-by: arendu * fixed none init Signed-off-by: arendu * adding adapter and ia3 support for T5 based models Signed-off-by: arendu * style fix Signed-off-by: arendu * config update and t5 model adapter and ia3 Signed-off-by: arendu * removed unused imports Signed-off-by: arendu * predict step for inference Signed-off-by: arendu * style fix Signed-off-by: arendu * style fix Signed-off-by: arendu * adapter inference for t5 Signed-off-by: arendu * style fix Signed-off-by: arendu * fixed bug micro and global batch size in eval Signed-off-by: arendu * minor edit Signed-off-by: arendu * agressive truncation if in test examples if no truncation field is given Signed-off-by: arendu * corrected for language_model_path name changes in main Signed-off-by: arendu * removed unused import Signed-off-by: arendu * name change for language_model_path Signed-off-by: arendu * include inter_attention to IA3 Signed-off-by: arendu * minor fix in confg Signed-off-by: arendu * minor fixes Signed-off-by: arendu * removed unused flag Signed-off-by: arendu * addressing PR comments Signed-off-by: arendu * address PR comments Signed-off-by: arendu * minor fix Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * style fix Signed-off-by: arendu * CI test Signed-off-by: arendu * minor fix in jenkinsfile Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Bug fix - Limit val batches set to 1.0 (#5023) * Bug fix Signed-off-by: shanmugamr1992 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Adressed sandeep's comments * Fixing limit val batches support in bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixing limit val batches support in bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: shanmugamr1992 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * [bug_fix] kv_channels is used when available (#5066) * fix bug s.t kv_channels is used when available Signed-off-by: arendu 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * P&C Docs (#5068) (#5069) Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Co-authored-by: Matvei Novikov Signed-off-by: Hainan Xu * Add spe_split_by_unicode_script arg (#5072) * Add spe_split_by_unicode_script arg Signed-off-by: Anas * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Anas Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * probabilites -> probabilities (#5078) (#5079) Signed-off-by: nithinraok Signed-off-by: nithinraok Signed-off-by: nithinraok Co-authored-by: Nithin Rao Signed-off-by: Hainan Xu * increase PR and Issue sweep quantity and active close PRs. (#5073) * increase PR and Issue sweep quantity and active close PRs. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * update with stricter rules, 30 days to be stale and 7 days to be closed for both Issues and PRs. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * [TTS] added missing German phoneme tokenizer. (#5070) (#5074) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * rename to match prompt leanring (#5076) Signed-off-by: arendu Signed-off-by: arendu Signed-off-by: Hainan Xu * Missing fixes from r1.11.0 to T5 finetuning eval (#5054) (#5061) * Fixes to seq2seq eval Signed-off-by: MaximumEntropy * Style Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: MaximumEntropy Co-authored-by: Sandeep Subramanian Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Notebook bug fixes (#5084) (#5085) * Notebook bug fixes Signed-off-by: Virginia Adams * Turned nemo install back on Signed-off-by: Virginia Adams * reverted notebook Signed-off-by: Virginia Adams * Updated one line in entity linking nb Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams Co-authored-by: Eric Harper Signed-off-by: Virginia Adams Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Hainan Xu * update strategy in notebook from ddp_fork to dp (#5088) (#5089) Co-authored-by: Zhilin Wang Signed-off-by: Hainan Xu * Fix bug in Squeezeformer Conv block (#5011) (#5024) * Fix bug in Squeezeformer Conv block Signed-off-by: smajumdar * Fix kernel context Signed-off-by: smajumdar * Fix access mixin Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: Hainan Xu * fixed megatron lm conversion bug (PTL related) (#5038) (#5063) Signed-off-by: David Mosallanezhad Signed-off-by: David Mosallanezhad Co-authored-by: David Mosallanezhad Signed-off-by: David Mosallanezhad Co-authored-by: David Co-authored-by: David Mosallanezhad Co-authored-by: Eric Harper Signed-off-by: Hainan Xu * Fix Unhashable type list for Numba Cuda spec augment kernel 
(#5093) (#5094) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: Hainan Xu * Fix numba (#5098) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: Hainan Xu * Make it possible to specify output_filename in normalize_with_audio.py (#5092) Signed-off-by: Elena Rastorgueva Signed-off-by: Elena Rastorgueva Signed-off-by: Hainan Xu * Greedy decoding confidence for CTC and RNNT (#4931) * rnnt confidence draft Signed-off-by: Aleksandr Laptev * word confidence Signed-off-by: Aleksandr Laptev * advanced entropies added Signed-off-by: Aleksandr Laptev * refactoring Signed-off-by: Aleksandr Laptev * oops forgot a file Signed-off-by: Aleksandr Laptev * metrics and benchmarking script added Signed-off-by: Aleksandr Laptev * style fix Signed-off-by: Aleksandr Laptev * texterrors installation added Signed-off-by: Aleksandr Laptev * lgtm and bug fix Signed-off-by: Aleksandr Laptev * fix comments Signed-off-by: Aleksandr Laptev * fix typos Signed-off-by: Aleksandr Laptev * add missing import after rebase Signed-off-by: Aleksandr Laptev Signed-off-by: Aleksandr Laptev Co-authored-by: Aleksandr Laptev Signed-off-by: Hainan Xu * [Add] SLURP models and examples (#4668) * add model, util and loss Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * refactor annd update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * update and refactor Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * update available models Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 * refactor data processing Signed-off-by: stevehuang52 * fix typo Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * refactor and update Signed-off-by: stevehuang52 * update doc Signed-off-by: stevehuang52 * move transformer to asr.modules Signed-off-by: stevehuang52 * move transformer to asr.modules Signed-off-by: stevehuang52 * get rid of jsonlines Signed-off-by: stevehuang52 * refactor Signed-off-by: stevehuang52 * revert changes to nlp Signed-off-by: stevehuang52 Signed-off-by: stevehuang52 Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Jagadeesh Balam <4916480+jbalam-nv@users.noreply.github.com> Signed-off-by: Hainan Xu * only optimize params that are part of the adapter modules (#5086) Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Signed-off-by: Hainan Xu * Pipeline Parallel T5 Prompt Learning (#4956) * Added pre process flag checks and pipeline parallel in fwd Signed-off-by: Virginia Adams * Added rank check for pipeline parallel Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * T5 prompt 
learning works! Signed-off-by: Virginia Adams * IA3 passing CI Signed-off-by: Virginia Adams * Fixed typo Signed-off-by: Virginia Adams * removed optimizer setup so Adi's change will not conflict Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams Signed-off-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Co-authored-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * [TTS] remove phonemizer.py (#5090) remove phonemizer.py and convert code block to markdown in the tutorial. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * T5 Decoding with PP > 2 fix (#5091) (#5103) * set sequence lenghts in the pipeline properly Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * [TTS] fixed wrong val loss for epoch 0 and inconsistent metrics names (#5087) (#5102) * fixed hifigan configs as well * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Fix and refactor consumed samples save/restore for Megatron models. (#5077) * Fixes and refactor Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy * Remove unused imports Signed-off-by: MaximumEntropy * Empty Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Signed-off-by: Hainan Xu * RIR corpus generator tool (#4927) Signed-off-by: Ante Jukić Signed-off-by: Ante Jukić Signed-off-by: Hainan Xu * Multiprocessing fix (#5106) (#5107) Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Signed-off-by: Matvei Novikov Co-authored-by: Matvei Novikov Signed-off-by: Hainan Xu * [Bug fix] PC lexical + audio (#5109) (#5110) * training running Signed-off-by: ekmb * revert Signed-off-by: ekmb * revert Signed-off-by: ekmb Signed-off-by: ekmb Signed-off-by: ekmb Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: Hainan Xu * [Fix] schedulers with no max_steps param (#4564) * fix schedulers Signed-off-by: stevehuang52 * update to use python inspect module Signed-off-by: stevehuang52 * update Signed-off-by: stevehuang52 Signed-off-by: stevehuang52 Signed-off-by: Hainan Xu * T5 prompt learning fixes missing from r.11.0 merge (#5075) (#5101) * Fix special tokens Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy * Empty Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Co-authored-by: David Signed-off-by: MaximumEntropy Co-authored-by: Sandeep Subramanian Co-authored-by: David Co-authored-by: Eric Harper Signed-off-by: Hainan Xu * [TTS] Add NeMo TTS Primer Tutorial (#4933) * [TTS] Add NeMo TTS Primer Tutorial Signed-off-by: Ryan Signed-off-by: Hainan Xu * Add Squeezeformer CTC model checkpoints on Librispeech (#5121) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: Hainan Xu * adding loss normalization options to rnnt joint (#4829) * adding normalization options to rnnt joint loss * moving the param to joint * moving loss normalization to rnnt loss config * style * cleaning up * fixing sum reduction in joint Signed-off-by: Dima Rekesh * moving reduction into RNNT loss 
class * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactoring * typos Signed-off-by: Dima Rekesh Signed-off-by: Dima Rekesh Co-authored-by: Dima Rekesh Co-authored-by: Oleksii Kuchaiev Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Asr concat dataloader (#5108) * forced precision * typo * initial commit Signed-off-by: Dima Rekesh * typos and bugs Signed-off-by: Dima Rekesh * reverting conformer encoder Signed-off-by: Dima Rekesh * additional checks Signed-off-by: Dima Rekesh * adding support to CTC models as well * reverting conformer_encoder Signed-off-by: Dima Rekesh * typo Signed-off-by: Dima Rekesh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactoring Signed-off-by: Dima Rekesh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactoring Signed-off-by: Dima Rekesh * merging Signed-off-by: Dima Rekesh Signed-off-by: Dima Rekesh Signed-off-by: Dima Rekesh Co-authored-by: Dima Rekesh Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Somshubra Majumdar Signed-off-by: Hainan Xu * fix blossom ci unittests Signed-off-by: Oleksii Kuchaiev Signed-off-by: Hainan Xu * bugfix: pybtex.database.InvalidNameString: Too many commas in author field. (#5112) (#5115) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * Uppdate container version to 22.09 (#5105) * update container version Signed-off-by: ericharper * pin click Signed-off-by: ericharper * pin click 8.0.2 Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: Hainan Xu * Remove unsupported arguments from MegatronNMT (#5065) * Fixes Signed-off-by: MaximumEntropy * Fixes Signed-off-by: MaximumEntropy * Style Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy * More fixes Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Signed-off-by: Hainan Xu * pp2 support for T5 IA3 learning and T5 Adapters learning (#5116) * enabling pp2 Signed-off-by: arendu * optimizer update Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * T5 pp>1 support for adapters and ia3 Signed-off-by: arendu * fix bug with missing adapter_tuning Signed-off-by: arendu * inference error fixed, pp=2 Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Signed-off-by: Hainan Xu * T5 Prompt Learning Fixes for Pipeline Parallel (#5120) * Initial fixes Signed-off-by: MaximumEntropy * Added back validation acc Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Put num workers back Signed-off-by: Virginia Adams * added relative encoding if statament Signed-off-by: Virginia Adams * Added back val loss only validation Signed-off-by: Virginia Adams * Revert "Added back val loss only validation" This reverts commit 86d8f4806fe30335c40c3716ce18259939df500f. 
* Removed val acc for PP > 1 Signed-off-by: Virginia Adams * Removed enc_seq_len if statement Signed-off-by: Virginia Adams * Added back validation acc calc Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: MaximumEntropy Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams Co-authored-by: Virginia Adams Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Virginia Adams Signed-off-by: Hainan Xu * add doc info (#4721) Signed-off-by: Yang Zhang Signed-off-by: Yang Zhang Signed-off-by: Hainan Xu * [TTS] Add SpanishCharsTokenizer (#5135) * [TTS] Add SpanishCharsTokenizer Signed-off-by: Ryan Signed-off-by: Hainan Xu * Update megatron interface to dialogue (#4936) * fix style formatting Signed-off-by: Zhilin Wang * update template to include description of intent Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * changes based on requests in review Signed-off-by: Zhilin Wang * add compatibility with assistant dataset Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * remove dialogue_state_tracking Signed-off-by: Zhilin Wang * update huggingface utils for dialogue Signed-off-by: Zhilin Wang * rename dialogue_state_tracking_hybrid to dialogue_state_tracking_sgdqa Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * fix style Signed-off-by: Zhilin Wang * style fix nemo/collections/nlp/models/dialogue_state_tracking_sgdqa/__init__.py Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * update Jenkinsfile for SGDGEN Signed-off-by: Zhilin Wang * fix typo Signed-off-by: Zhilin Wang * add docstrings for assistant data processsor Signed-off-by: Zhilin Wang * update Jenkins for SGDGEN local checkpoint Signed-off-by: Zhilin Wang * update style Signed-off-by: Zhilin Wang * use local vocab file for Jenkinsfile Signed-off-by: Zhilin Wang * patch for Jenkins CI using local file Signed-off-by: Zhilin Wang * add slot filling prediction and metrics Signed-off-by: Zhilin Wang * remove unused code Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * refactor metrics code out of Dialogue GPT Model Signed-off-by: Zhilin Wang * integrate backward compatible support for IntentSlotClassificationModel (bert model) Signed-off-by: Zhilin Wang * save prediction file for IntentSlotClassification Signed-off-by: Zhilin Wang * update dialogue gpt model training for megatron gpt Signed-off-by: Zhilin Wang * remove batch generate for HF GPT2, which causes lower performance Signed-off-by: Zhilin Wang * add few shot capability to dialogue gpt model Signed-off-by: Zhilin Wang * update Jenkinsfile and remove unused import Signed-off-by: Zhilin Wang * update code description and clarity Signed-off-by: Zhilin Wang * address PR comments Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * integrate compatibility with ZeroShotIntentModel Signed-off-by: Zhilin Wang * rename folder to dialogue due to increased scope and further refactor for clarity Signed-off-by: Zhilin Wang * added dialogue GPT for sequence generation task (e.g. 
answer extender) Signed-off-by: Zhilin Wang * add CI test for DialogueGPTGenerationModel Signed-off-by: Zhilin Wang * integrate DialogueS2SGenerationModel for generation task (e.g. answer extender) Signed-off-by: Zhilin Wang * modify huggingface utils to support HF t5/BART models Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * remove unused imports Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * update bleu metric Signed-off-by: Zhilin Wang * fix bleu metric style Signed-off-by: Zhilin Wang * debug bleu metric Signed-off-by: Zhilin Wang * debug bleu metric Signed-off-by: Zhilin Wang * update based on PR #3893 Signed-off-by: Zhilin Wang * update 2 based on PR #3893 Signed-off-by: Zhilin Wang * update 3 based on PR #3893 Signed-off-by: Zhilin Wang * integrate sgd generation based on user user utterance and system slot-values to generate system utterance Signed-off-by: Zhilin Wang * add validation model saving capabilities Signed-off-by: Zhilin Wang * cleaned up code for SGD Based Answer extender Signed-off-by: Zhilin Wang * update Dialogue Generation CI Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * update Jenkinsfile Signed-off-by: Zhilin Wang * fix Jenkins CI issue" Signed-off-by: Zhilin Wang * add support for design dataset Signed-off-by: Zhilin Wang * remove unnecessary imports Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * support megatron for dialogue_s2s_generation_model Signed-off-by: Zhilin Wang * reduce loaded samples in MSMarcoDataProcessor to 64 when cfg.model.dataset.debug_mode=True Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update CI Signed-off-by: Zhilin Wang * update checkpoint and predictions filename to include epoch number Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * integrate HF BART MNLI into zero shot intent model Signed-off-by: Zhilin Wang * integrate Dialogue Nearest Neighbour Model Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * update Jenkins Signed-off-by: Zhilin Wang * refactor Dialogue SGD Data Processor to make interface for models cleaner Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * update Dialogue S2S Generation model for DialogueSGDDataProcessor interface Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * update jenkins Signed-off-by: Zhilin Wang * support sgd and drive thru datasets by zero shot model and nearest neighbour model Signed-off-by: Zhilin Wang * add prediction saving code to nearest neighbour and zero shot intent models Signed-off-by: Zhilin Wang * fix typo in sgd data processor Signed-off-by: Zhilin Wang * integrate Dialogue Mellon QA Data Processor Signed-off-by: Zhilin Wang * update mellon qa Signed-off-by: Zhilin Wang * update dialogue.py to remove outdated info Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update dialogue_config.yaml Signed-off-by: Zhilin Wang * update dialogue_config.yaml Signed-off-by: Zhilin Wang * add dialogue docs Signed-off-by: Zhilin Wang * address review comments Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix 
Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix for cfg Signed-off-by: Zhilin Wang * make dependency on apex optional Signed-off-by: Zhilin Wang * change NLPDDPluggin calling logic to make it possible to run without apex Signed-off-by: Zhilin Wang * add first draft of tutorial Signed-off-by: Zhilin Wang * reduce ms marco size by removing lines without wellFormedAnswers Signed-off-by: Zhilin Wang * address pr comments Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update colab tutorial link in dialogue docs Signed-off-by: Zhilin Wang * include unit test and some refactor to facilitate unit test Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * address pr issues Signed-off-by: Zhilin Wang * remove typos in dialogue tutorial Signed-off-by: Zhilin Wang * support larger files for question answering Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * remove unnecessary artifacts to reduce memory use Signed-off-by: Zhilin Wang * put 0 tensor to device Signed-off-by: Zhilin Wang * update link within dialogue tutorial Signed-off-by: Zhilin Wang * restore previously delete files Signed-off-by: Zhilin Wang * update error handling when loss = nan Signed-off-by: Zhilin Wang * update nan handling Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update spanning loss func Signed-off-by: Zhilin Wang * update spanning loss Signed-off-by: Zhilin Wang * fix type error raised in qa_dataset.py Signed-off-by: Zhilin Wang * add error checking message Signed-off-by: Zhilin Wang * revert back to float32 Signed-off-by: Zhilin Wang * revert back to float32 Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update exp logging Signed-off-by: Zhilin Wang * update error msgs Signed-off-by: Zhilin Wang * update loading of large file from pickle to json Signed-off-by: Zhilin Wang * update loading of large file from pickle to json Signed-off-by: Zhilin Wang * limit number of negative samples Signed-off-by: Zhilin Wang * revert post processing Signed-off-by: Zhilin Wang * revert post processing Signed-off-by: Zhilin Wang * remove unused methods and style fix Signed-off-by: Zhilin Wang * add more documentation Signed-off-by: Zhilin Wang * remove unused imports Signed-off-by: Zhilin Wang * changes base on PR review Signed-off-by: Zhilin Wang * set wandb logger falseby default Signed-off-by: Zhilin Wang * update interface with megatron gpt prompt learning Signed-off-by: Zhilin Wang * update inline documentation Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update prompt_ids Signed-off-by: Zhilin Wang * update error msg Signed-off-by: Zhilin Wang * update config Signed-off-by: Zhilin Wang * update config Signed-off-by: Zhilin Wang * set inference = False for dialgue prompt learning during trainng Signed-off-by: Zhilin Wang * set inference = False for dialgue prompt learning during trainng Signed-off-by: Zhilin Wang * remove unused code Signed-off-by: Zhilin Wang * update config yaml Signed-off-by: Zhilin Wang * fix bug for megatron gpt prompt 
learning Signed-off-by: Zhilin Wang * remove unused import Signed-off-by: Zhilin Wang * address comments in PR Signed-off-by: Zhilin Wang * address comments in PR Signed-off-by: Zhilin Wang * address typo Signed-off-by: Zhilin Wang * add megatron t5 inference Signed-off-by: Zhilin Wang * fix bug due to bert tokenizer not being space-aware Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update style Signed-off-by: Zhilin Wang * update IntentSlotModel onnx export test Signed-off-by: Zhilin Wang * update style Signed-off-by: Zhilin Wang * update exportable Signed-off-by: Zhilin Wang * address PR comments Signed-off-by: Zhilin Wang * replace functools.cache_property with functools.lru_cache to maintain python 3.7 compatibility Signed-off-by: Zhilin Wang * improve speed of rank_candidates and support for p tuning Signed-off-by: Zhilin Wang * update dialogue.py Signed-off-by: Zhilin Wang * fix megatron prompt learning saving bug Signed-off-by: Zhilin Wang * update generate_candidate method Signed-off-by: Zhilin Wang * remove repeated init text ids and invert attention masks Signed-off-by: Zhilin Wang * update typo Signed-off-by: Zhilin Wang * custom collate fn to remove excess padding in batch Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update complete method to mitigate issue when max seq len is low Signed-off-by: Zhilin Wang * address pr comments Signed-off-by: Zhilin Wang * update generation interface Signed-off-by: Zhilin Wang Signed-off-by: Zhilin Wang Co-authored-by: Zhilin Wang Co-authored-by: Oleksii Kuchaiev Co-authored-by: Yang Zhang Co-authored-by: Eric Harper Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * Added save inference ready .nemo file with every checkpoint (#5055) * Added save inference ready .nemo file with every checkpoint Signed-off-by: Virginia Adams * Python style fix Signed-off-by: Virginia Adams * addressed Adi's comment Signed-off-by: Virginia Adams * Added ptuning check in model checkpoint saving Signed-off-by: Virginia Adams * Changed save_nemo_on_valdaition default to False Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Changes global batch size of adapter CI Signed-off-by: Virginia Adams * Changed num workers to 0 Signed-off-by: Virginia Adams * added first stage of pipeline check Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Virginia Adams Signed-off-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * Fixes for docs/typos + remove max_utts parameter from tarred datasets as it causes hang in training (#5118) * Remove ; from jupyter notebook cells Signed-off-by: Igor Gitman * Fix typos in documentation/code Signed-off-by: Igor Gitman * Fix output message to have 'or equal' Signed-off-by: Igor Gitman * Link formatting fixes Signed-off-by: Igor Gitman * Add error if max_utts is used in tarred datasets Signed-off-by: Igor Gitman * Remove max_utts parameter from tarred datasets Signed-off-by: Igor Gitman * Fix max_utts removal in tests Signed-off-by: Igor Gitman * Fix typo if -> is Signed-off-by: Igor Gitman Signed-off-by: Igor Gitman Signed-off-by: Hainan Xu * Merge r1.12.0 main (#5139) * update branch Signed-off-by: 
ericharper * Add cherry-pick action (#4958) * add cherry-pick action Signed-off-by: ericharper * Pin Transformers version to fix CI (#4955) * Pin transformers version in CI to prevent offline tokenizer loading error Signed-off-by: SeanNaren * Drop version Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Enable offline Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: ericharper Signed-off-by: SeanNaren Co-authored-by: Sean Naren * upper bound transformers Signed-off-by: ericharper * remove duplicate transformers requirement Signed-off-by: ericharper * Release SOTA Lang ID model (#5080) * add pretrained lang id model ambernet Signed-off-by: fayejf * update doc and style fix Signed-off-by: fayejf Signed-off-by: fayejf * update branch and package info Signed-off-by: ericharper * remove upper bounds on lightning and transformers Signed-off-by: ericharper * remove transformers offline from ci Signed-off-by: ericharper * upper bound transformers Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: SeanNaren Signed-off-by: fayejf Co-authored-by: Sean Naren Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Hainan Xu * Added ASR model comparison to SDE (#5043) SDE: Added ASR model comparison tool to SDE transcribe speech: Added support for many predictions in one file, as well as custom field names Signed-off-by: George Zelenfroynd Signed-off-by: Hainan Xu * fix nmt eval sampler (#5154) Signed-off-by: Abhinav Khattar Signed-off-by: Abhinav Khattar Signed-off-by: Hainan Xu * Fix Global init steps (#5143) * move global step to base Signed-off-by: Yi Dong * fix fused softmax Signed-off-by: Yi Dong * add the missing file Signed-off-by: Yi Dong * update the fused kernel Signed-off-by: Yi Dong * fix import error Signed-off-by: Yi Dong * fix import again Signed-off-by: Yi Dong Signed-off-by: Yi Dong Signed-off-by: Yi Dong Co-authored-by: Yi Dong Co-authored-by: Sandeep Subramanian Signed-off-by: Hainan Xu * [TTS] bug fix - sample rate was being ignored in vocoder dataset (#4518) * bug fix - sample rate was being ignored in vocoder dataset when not loading mel * handled n segments for a different sampling rate than original sampling rate * Added case for n_segments 0, warning for n_segments greater than file length Signed-off-by: Paarth Neekhara Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Jocelyn Signed-off-by: Hainan Xu * Add EMA support to NeMo (#4764) * Added Base files Signed-off-by: SeanNaren * Some refactors, swap to using MNIST Lnet Signed-off-by: SeanNaren * Add a few more tests, allow the callback to be set via the exp manager Signed-off-by: SeanNaren * Actually run validation for testing Signed-off-by: SeanNaren * Run isort Signed-off-by: SeanNaren * Add test for saving state/fix saving state Signed-off-by: SeanNaren * Use dummy model Signed-off-by: SeanNaren * Fix test Signed-off-by: SeanNaren * Add copyright Signed-off-by: SeanNaren * Support saving separate EMA weight module Signed-off-by: SeanNaren * Add standalone functionality/logging Signed-off-by: SeanNaren * Expose more parameters Signed-off-by: SeanNaren * Modify to allow option to replace validation Signed-off-by: SeanNaren * Add jenkins test, formatting Signed-off-by: SeanNaren * Pin Transformers version to fix CI (#4955) * Pin transformers version in CI to prevent offline tokenizer loading error Signed-off-by: SeanNaren * Drop version 
Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Enable offline Signed-off-by: SeanNaren Signed-off-by: SeanNaren * Add cherry-pick action (#4958) (#4961) * add cherry-pick action Signed-off-by: ericharper * Pin Transformers version to fix CI (#4955) * Pin transformers version in CI to prevent offline tokenizer loading error Signed-off-by: SeanNaren * Drop version Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Disable offline temporarily Signed-off-by: SeanNaren * Enable offline Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: ericharper Signed-off-by: SeanNaren Co-authored-by: Sean Naren Signed-off-by: ericharper Signed-off-by: SeanNaren Co-authored-by: Eric Harper Co-authored-by: Sean Naren Signed-off-by: SeanNaren * Fix changelog builder (#4962) (#4963) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: SeanNaren * fix cherry pick workflow (#4964) (#4965) Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: ericharper Co-authored-by: Eric Harper Signed-off-by: SeanNaren * reorder model check (#4959) (#4967) Signed-off-by: nithinraok Signed-off-by: nithinraok Signed-off-by: nithinraok Co-authored-by: Nithin Rao Signed-off-by: SeanNaren * check for active conda environment (#4970) (#4971) Signed-off-by: SeanNaren * [TTS] fix broken tutorial for MixerTTS. (#4949) (#4976) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: SeanNaren * Checkpoint averaging class fix (#4946) * 1. Added args.class_path to provide it externally. Signed-off-by: Micha Livne * 1. Fixed style. Signed-off-by: Micha Livne Signed-off-by: Micha Livne Signed-off-by: SeanNaren * Add ability to give seperate datasets for test, train and validation (#4798) * Add ability to give seperate datasets for test, train and validation * Addressed Sandeeps comments * Addressed Sandeeps comments * Add ability to give seperate datasets for test, train and validation * Add ability to give seperate datasets for test, train and validation * Addressed review comments * Bug fix for common dataset utils * Add CI tests Signed-off-by: shanmugamr1992 * Reformat code Signed-off-by: shanmugamr1992 * Bug fix Signed-off-by: shanmugamr1992 * Bug fix * Bug Fix * Bug Fix * Update Jenkinsfile * Addressed comments * Addressed Eriks comments. 
* Addressed Sandeep * Update Jenkinsfile * Update Jenkinsfile * Update dataset_utils.py * Update Jenkinsfile * Update Jenkinsfile * Use GPT CI config Signed-off-by: MaximumEntropy Signed-off-by: shanmugamr1992 Signed-off-by: MaximumEntropy Co-authored-by: MaximumEntropy Signed-off-by: SeanNaren * fix label models restoring issue from wrighted cross entropy (#4968) (#4975) Signed-off-by: nithinraok Signed-off-by: nithinraok Signed-off-by: nithinraok Co-authored-by: Nithin Rao Signed-off-by: SeanNaren * Add simple pre-commit file (#4983) * Add simple pre-commit file Signed-off-by: SeanNaren * Exclude docs folder Signed-off-by: SeanNaren * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: SeanNaren * Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" This reverts commit 053bd5ba579537a5f311b431871c21f3381b43eb. Signed-off-by: SeanNaren * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: SeanNaren Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: SeanNaren * Import pycuda.autoprimaryctx or pycuda.autoinit to init pycuda execution environment (#4951) Signed-off-by: Jin Li Signed-off-by: Jin Li Co-authored-by: Somshubra Majumdar Signed-off-by: SeanNaren * Adding speaker embedding conditioning in fastpitch (#4986) Signed-off-by: subhankar-ghosh Signed-off-by: subhankar-ghosh Signed-off-by: SeanNaren * Fix ASR issues (#4984) (#4991) * Fix ASR issues Signed-off-by: smajumdar * Revert fix Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: SeanNaren * Fix current tests Signed-off-by: SeanNaren * More test coverage Signed-off-by: SeanNaren * Address reviews Signed-off-by: SeanNaren * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address review Signed-off-by: SeanNaren * Drop bf16 test Signed-off-by: SeanNaren * Address review Signed-off-by: SeanNaren * remove print Signed-off-by: SeanNaren * Add bf16 Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: ericharper Signed-off-by: smajumdar Signed-off-by: nithinraok Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Micha Livne Signed-off-by: shanmugamr1992 Signed-off-by: MaximumEntropy Signed-off-by: Jin Li Signed-off-by: subhankar-ghosh Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: Somshubra Majumdar Co-authored-by: Nithin Rao Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Micha Livne Co-authored-by: shanmugamr1992 <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: liji-nv <59594262+liji-nv@users.noreply.github.com> Co-authored-by: Subhankar Ghosh Signed-off-by: Hainan Xu * Fix BF16 test (#5162) Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: Hainan Xu * Fix errors in speaker diarization nemo docs (#5153) * fix docs and docstrings for MSDD Signed-off-by: Taejin Park * fix nemo docs errors Signed-off-by: Taejin Park * reflected review comments Signed-off-by: Taejin Park Signed-off-by: Taejin Park Signed-off-by: Hainan Xu * Add interleaved pipeline schedule to GPT (#5025) * add virtual pipeline 
size to config Signed-off-by: ericharper * convert model to list of modules Signed-off-by: ericharper * convert model to list of modules Signed-off-by: ericharper * convert model to list of modules Signed-off-by: ericharper * update for list of modules Signed-off-by: ericharper * add virtual to init Signed-off-by: ericharper * update first last stage embedding all reduce Signed-off-by: ericharper * update sequence parallel all reduce for virtual models Signed-off-by: ericharper * runs but we get an error Signed-off-by: ericharper * set virtual rank 0 after looping Signed-off-by: ericharper * account for virtual when determinining first and last pipeline stages Signed-off-by: ericharper * checkpointing for virtual models in progress Signed-off-by: ericharper * add checkpoint hooks Signed-off-by: ericharper * working on validation when resuming Signed-off-by: ericharper * skip sanity val steps by default in config Signed-off-by: ericharper * remove comment Signed-off-by: ericharper * log number of params Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * style Signed-off-by: ericharper * check if self.model is a list Signed-off-by: ericharper * make virtual pipeline default size None on init Signed-off-by: ericharper * make virtual pipeline default to None in config Signed-off-by: ericharper * remove ensure_divisibility call Signed-off-by: ericharper * fix lgtm alerts Signed-off-by: ericharper * remove num_sanity_val_steps from config Signed-off-by: ericharper * default virtual pipeline size to none Signed-off-by: ericharper * check for list Signed-off-by: ericharper * update assert to make sure we are only doing virtual for gpt Signed-off-by: ericharper * revert change to get_params_for_weight_decay Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * init var Signed-off-by: ericharper * add import guard for set virtual model parallel world size Signed-off-by: ericharper * use import guard Signed-off-by: ericharper * update calls to fake init in eval scripts Signed-off-by: ericharper * add _get_fwd_bwd_function Signed-off-by: ericharper * log all total model parameters Signed-off-by: ericharper * remove unused import Signed-off-by: ericharper Signed-off-by: ericharper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Hainan Xu * reduced to 14 inactive days to be stale for PRs. (#5165) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Hainan Xu * refactor TTS documentation organization and add new contents. (#5137) * refactor TTS documentation organization and add new contents. * fix asr api bug. * fix broken links. * fix unexpected indentation errors. * fixed unexpected indentation. * fixed broken paper reference. * fixed cross-reference and typos. * fixed toctree errors. * revert to 'Augmentors' * reordered TTS tutorial list in starthere. * ordered api classes alphabetically for each Section. * fixed underscore typo for fastpitch checkpoint. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * upcase 'Tuning' Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fixed typo for RAD-TTS Aligner Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * reorder aligner section after mel-gen and vocoders in models.rst. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * clarify Mixer-TTS-X and reorder model descriptions alphabetically. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fixed some typos and formats. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * removed old megatron.rst. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fixed block quote ends without a blank line warnings. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * remove duplicate reference; fixed missing key nlp-megatron-shoeybi2019megatron Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * Revert "removed old megatron.rst." This reverts commit c5ea1dc3f23272eecfe8040e3abfa54fa122cf73. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * removed Russian, a hyphen, and add a note about G2P in tts/config.rst Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * added pynini installation in wfst_text_normalization.rst Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * added description of manifest key/value pairs. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add toctree in tts/intro Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * replace main branch to stable. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add 'upcoming' for e2e systems. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * replaced main branch to stabl… * Multiblank Transducer (#5527) * multi-blank transducers Signed-off-by: Hainan Xu * one line bug fix Signed-off-by: Hainan Xu * change interface of RNNTDecoding class to extract num-extra-output from joint instead of constructor Signed-off-by: Hainan Xu * addressed PR comments Signed-off-by: Hainan Xu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Hainan Xu Co-authored-by: Hainan Xu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Default RNNT loss to int64 targets (#6011) Signed-off-by: smajumdar * Rebase Signed-off-by: smajumdar * Begin refactoring tests Signed-off-by: smajumdar * Pass all tests for RNNT numba loss Signed-off-by: smajumdar * Pass all tests for RNNT numba loss Signed-off-by: smajumdar * Remove print Signed-off-by: smajumdar * Fix test for version Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert bad merges Signed-off-by: smajumdar * Revert bad merges Signed-off-by: smajumdar * Address comments Signed-off-by: smajumdar * Remove wrong file Signed-off-by: smajumdar --------- Signed-off-by: smajumdar Signed-off-by: Vahid Signed-off-by: Hainan Xu Signed-off-by: arendu Signed-off-by: shanmugamr1992 Signed-off-by: Matvei Novikov Signed-off-by: Anas Signed-off-by: nithinraok Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: MaximumEntropy Signed-off-by: Virginia Adams Signed-off-by: smajumdar Signed-off-by: David Mosallanezhad Signed-off-by: Elena Rastorgueva Signed-off-by: Aleksandr Laptev Signed-off-by: stevehuang52 Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Signed-off-by: Ante Jukić Signed-off-by: ekmb Signed-off-by: Ryan Signed-off-by: Dima Rekesh Signed-off-by: Dima Rekesh Signed-off-by: Oleksii Kuchaiev Signed-off-by: ericharper Signed-off-by: Virginia Adams Signed-off-by: Yang Zhang Signed-off-by: Zhilin Wang Signed-off-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Signed-off-by: Igor Gitman Signed-off-by: SeanNaren Signed-off-by: fayejf Signed-off-by: George Zelenfroynd Signed-off-by: Abhinav Khattar Signed-off-by: Yi Dong Signed-off-by: Yi Dong Signed-off-by: Paarth Neekhara Signed-off-by: Micha Livne Signed-off-by: Jin Li Signed-off-by: subhankar-ghosh Signed-off-by: Taejin Park Signed-off-by: Miguel Martínez Signed-off-by: miguelangel Signed-off-by: Jocelyn Huang Signed-off-by: 彭震东 <275331498@qq.com> Signed-off-by: Alexandra Antonova Signed-off-by: Jason Signed-off-by: Patrick Simianer Signed-off-by: Shantanu Acharya Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Signed-off-by: Tim Moon Signed-off-by: eharper Signed-off-by: Micha Livne Signed-off-by: Oleksii Volkovskyi Signed-off-by: Yuekai Zhang Signed-off-by: Boris Fomitchev Signed-off-by: Sasha Meister <117230141+ssh-meister@users.noreply.github.com> Signed-off-by: whrichd Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Signed-off-by: Vladimir Bataev Signed-off-by: Viraj Karandikar Signed-off-by: Yu Yao Signed-off-by: PeganovAnton Signed-off-by: Somshubra Majumdar Signed-off-by: Jonghwan Hyeon Signed-off-by: Boris Fomitchev Signed-off-by: shane carroll Co-authored-by: 
Samuel Kriman Co-authored-by: Hainan Xu Co-authored-by: Vahid Noroozi Co-authored-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Sandeep Subramanian Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Matvei Novikov Co-authored-by: Anas Abou Allaban Co-authored-by: Nithin Rao Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: Zhilin Wang Co-authored-by: David Co-authored-by: David Mosallanezhad Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Co-authored-by: Aleksandr Laptev Co-authored-by: Aleksandr Laptev Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Jagadeesh Balam <4916480+jbalam-nv@users.noreply.github.com> Co-authored-by: anteju <108555623+anteju@users.noreply.github.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Co-authored-by: Ryan Langman Co-authored-by: Dima Rekesh Co-authored-by: Dima Rekesh Co-authored-by: Oleksii Kuchaiev Co-authored-by: Oleksii Kuchaiev Co-authored-by: Virginia Adams Co-authored-by: Virginia Adams Co-authored-by: Yang Zhang Co-authored-by: Zhilin Wang Co-authored-by: Igor Gitman Co-authored-by: Sean Naren Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Abhinav Khattar Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com> Co-authored-by: Yi Dong Co-authored-by: Paarth Neekhara Co-authored-by: Jocelyn Co-authored-by: Micha Livne Co-authored-by: liji-nv <59594262+liji-nv@users.noreply.github.com> Co-authored-by: Subhankar Ghosh Co-authored-by: Taejin Park Co-authored-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com> Co-authored-by: miguelangel Co-authored-by: 彭震东 <275331498@qq.com> Co-authored-by: Igor Gitman Co-authored-by: bene-ges <61418381+bene-ges@users.noreply.github.com> Co-authored-by: Alexandra Antonova Co-authored-by: Jason Co-authored-by: Rajesh Ilango Co-authored-by: pks Co-authored-by: Shantanu Acharya Co-authored-by: Shantanu Acharya Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by: Sangkug Lym Co-authored-by: Kirthi Shankar Sivamani Co-authored-by: ksivamani Co-authored-by: Eric Harper Co-authored-by: Sandeep Subramanian Co-authored-by: Micha Livne Co-authored-by: Oleksii Volkovskyi Co-authored-by: Yuekai Zhang Co-authored-by: Hainan Xu Co-authored-by: Boris Fomitchev Co-authored-by: anmolgupt <14880251+anmolgupt@users.noreply.github.com> Co-authored-by: Anmol Gupta Co-authored-by: Sasha Meister <117230141+ssh-meister@users.noreply.github.com> Co-authored-by: Riqiang Wang <43883260+whrichd@users.noreply.github.com> Co-authored-by: Vladimir Bataev Co-authored-by: Shanmugam Ramasamy Co-authored-by: Viraj Karandikar <16838694+virajkarandikar@users.noreply.github.com> Co-authored-by: Shane Carroll <50530592+1-800-BAD-CODE@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Yi Dong Co-authored-by: PeganovAnton Co-authored-by: Jonghwan Hyeon Co-authored-by: Kaden Uhlig Co-authored-by: Kaden Uhlig Co-authored-by: 
Boris Fomitchev
Co-authored-by: Jonghwan Hyeon
---
 nemo/collections/asr/losses/rnnt.py           |  26 +++-
 nemo/collections/asr/losses/rnnt_pytorch.py   |   5 +
 .../asr/parts/numba/rnnt_loss/rnnt.py         |   2 +-
 .../asr/parts/numba/rnnt_loss/rnnt_numpy.py   |   5 +
 .../asr/parts/numba/rnnt_loss/rnnt_pytorch.py |   7 +-
 .../rnnt_loss/utils/cpu_utils/cpu_rnnt.py     |   8 +-
 .../numba/rnnt_loss/utils/rnnt_helper.py      |   3 +-
 nemo/core/utils/numba_utils.py                |  36 +++++
 .../asr/numba/rnnt_loss/test_rnnt_pytorch.py  | 126 ++++++++++++------
 .../rnnt_loss/utils/test_gpu_rnnt_kernel.py   |  64 +++++----
 .../asr/numba/rnnt_loss/utils/test_reduce.py  |  18 ++-
 .../numba/rnnt_loss/utils/test_rnnt_helper.py |  75 +++++++----
 12 files changed, 263 insertions(+), 112 deletions(-)

diff --git a/nemo/collections/asr/losses/rnnt.py b/nemo/collections/asr/losses/rnnt.py
index 10b85acb42ef..a884f7d3cc68 100644
--- a/nemo/collections/asr/losses/rnnt.py
+++ b/nemo/collections/asr/losses/rnnt.py
@@ -38,9 +38,10 @@
 from nemo.collections.asr.losses.rnnt_pytorch import MultiblankRNNTLossPytorch, RNNTLossPytorch, TDTLossPytorch
 from nemo.core.classes import Loss, typecheck
 from nemo.core.neural_types import LabelsType, LengthsType, LogprobsType, LossType, NeuralType
+from nemo.core.utils import numba_utils
 from nemo.core.utils.k2_utils import K2_INSTALLATION_MESSAGE
 from nemo.core.utils.numba_utils import NUMBA_INSTALLATION_MESSAGE
-from nemo.utils import logging, model_utils
+from nemo.utils import logging, logging_mode, model_utils
 
 try:
     import warprnnt_pytorch as warprnnt
@@ -98,7 +99,7 @@ class RNNTLossConfig:
         min_version='0.53.0',
         is_available=NUMBA_RNNT_AVAILABLE,
         installation_msg=NUMBA_INSTALLATION_MESSAGE,
-        force_float32=True,
+        force_float32=not numba_utils.NUMBA_FP16_SUPPORTED,
     ),
     "pytorch": RNNTLossConfig(
         loss_name="pytorch",
@@ -387,7 +388,7 @@ def __init__(self, num_classes, reduction: str = 'mean_batch', loss_name: str =
            for the standard "blank" symbol. In particular, say V is the number of non-blank tokens in
            the vocabulary, then in the case of,
            standard RNNT: num_classes = V
-           multiblank RNNT: num_classes = V + number-big-blanks (since we store big-blanks before 
+           multiblank RNNT: num_classes = V + number-big-blanks (since we store big-blanks before
            standard blank, and the standard blank is the last symbol in the vocab)
            TDT: num_classes = V. Note, V here does not include any of the "duration outputs".
 
@@ -413,6 +414,7 @@ def __init__(self, num_classes, reduction: str = 'mean_batch', loss_name: str =
         self.reduction = reduction
         self._loss = resolve_rnnt_loss(loss_name, blank_idx=self._blank, loss_kwargs=loss_kwargs)
         self._force_float32 = RNNT_LOSS_RESOLVER[loss_name].force_float32
+        self._fp16_compat_checked = False
 
     def reduce(self, losses, target_lengths):
@@ -442,8 +444,22 @@ def forward(self, log_probs, targets, input_lengths, target_lengths):
         max_targets_len = target_lengths.max()
 
         # Force cast joint to float32
-        # TODO: Remove once Numba supports FP16
-        if self._force_float32 and log_probs.dtype != torch.float32:
+        if not self._force_float32 and numba_utils.NUMBA_FP16_SUPPORTED:
+            # Execute the kernel in fp16
+            pass
+        elif self._force_float32 and log_probs.dtype != torch.float32:
+            # Log just once if fp16 tensor was passed and fp16 Numba CUDA loss could not be used.
+            if log_probs.dtype == torch.float16 and not self._fp16_compat_checked:
+                _, reason = numba_utils.is_numba_cuda_fp16_supported(return_reason=True)
+                logging.warning(
+                    f"Provided RNNT Joint tensor is of dtype {log_probs.dtype}, but RNNT loss could not be calculated "
+                    f"in fp16 due to following reason stated below. Loss will be calculated in fp32. \n\n"
+                    f"{reason}",
+                    mode=logging_mode.ONCE,
+                )
+                self._fp16_compat_checked = True
+
+            # Upcast the activation tensor and compute loss and grads in fp32
             logits_orig = log_probs
             log_probs = log_probs.float()
             del logits_orig  # save memory *before* computing the loss
diff --git a/nemo/collections/asr/losses/rnnt_pytorch.py b/nemo/collections/asr/losses/rnnt_pytorch.py
index bc6e5a25a3b2..c8eee90a2eb5 100644
--- a/nemo/collections/asr/losses/rnnt_pytorch.py
+++ b/nemo/collections/asr/losses/rnnt_pytorch.py
@@ -47,7 +47,12 @@ def __init__(self, blank, reduction):
         self.reduction = reduction
 
     def forward(self, acts, labels, act_lens, label_lens):
+        # CPU patch for FP16
+        if not acts.is_cuda and acts.dtype == torch.float16:
+            acts = acts.float()
+
         acts = torch.log_softmax(acts, -1)
+
         forward_logprob = self.compute_forward_prob(acts, labels, act_lens, label_lens)
         losses = -forward_logprob
         if self.reduction == 'mean_batch':
diff --git a/nemo/collections/asr/parts/numba/rnnt_loss/rnnt.py b/nemo/collections/asr/parts/numba/rnnt_loss/rnnt.py
index 118ee88acbfe..046aea425e20 100644
--- a/nemo/collections/asr/parts/numba/rnnt_loss/rnnt.py
+++ b/nemo/collections/asr/parts/numba/rnnt_loss/rnnt.py
@@ -186,7 +186,7 @@ def rnnt_loss_gpu(
     # Select GPU index
     cuda.select_device(acts.device.index)
-    gpu_workspace = torch.zeros(gpu_size, device=acts.device, dtype=acts.dtype, requires_grad=False)
+    gpu_workspace = torch.zeros(gpu_size, device=acts.device, dtype=torch.float32, requires_grad=False)
 
     ### VIEW TENSORS AS VECTORS FOR POINTER INDEXING ###
     acts, acts_shape = rnnt_helper.flatten_tensor(acts)
diff --git a/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_numpy.py b/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_numpy.py
index eaa6d332a0fc..58508970aa83 100644
--- a/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_numpy.py
+++ b/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_numpy.py
@@ -344,10 +344,15 @@ def forward(self, acts, labels, act_lens, label_lens):
         _assert_no_grad(label_lens)
         certify_inputs(acts, labels, act_lens, label_lens)
 
+        # CPU Patch for fp16 - force cast to fp32
+        if not acts.is_cuda and acts.dtype == torch.float16:
+            acts = acts.float()
+
         if self.clamp > 0.0:
             acts = LogSoftmaxGradModification.apply(acts, self.clamp)
 
         acts = torch.nn.functional.log_softmax(acts, -1)
+
         return self.rnnt(acts, labels, act_lens, label_lens, self.blank, self.fastemit_lambda)
diff --git a/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_pytorch.py b/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_pytorch.py
index 2ffe08be361e..5960d5ab6b18 100644
--- a/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_pytorch.py
+++ b/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_pytorch.py
@@ -57,7 +57,7 @@ def forward(ctx, acts, labels, act_lens, label_lens, blank, reduction, fastemit_
         loss_func = rnnt.rnnt_loss_gpu if is_cuda else rnnt.rnnt_loss_cpu
         grads = torch.zeros_like(acts) if acts.requires_grad else None
         minibatch_size = acts.size(0)
-        costs = torch.zeros(minibatch_size, device=acts.device, dtype=acts.dtype)
+        costs = torch.zeros(minibatch_size, device=acts.device, dtype=torch.float32)
 
         loss_func(
             acts,
@@ -119,7 +119,6 @@ def forward(
             label_lens: Tensor of (batch) containing label length of each example
             fastemit_lambda: Float scaling factor for FastEmit regularization. Refer to
                 FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
-            durations: list of durations for TDT model, must include 0 and 1, e.g. [0, 1, 2, 3, 4].
             sigma: hyper-parameter for logit under-normalization method for training
@@ -417,6 +416,10 @@ def forward(self, acts, labels, act_lens, label_lens):
             label_lens: Tensor of (batch) containing label length of each example
         """
         if not acts.is_cuda:
+            # Force FP32 until log_softmax() is implemented for fp16 on CPU
+            if acts.dtype == torch.float16:
+                acts = acts.float()
+
             # Since CPU requires log_softmax to be computed explicitly, we need to perform grad clipping
             # *after* we have obtained the gradients of loss(logsoftmax()).
             # This is highly wasteful since it requires a copy of the entire joint tensor which is expensive.
diff --git a/nemo/collections/asr/parts/numba/rnnt_loss/utils/cpu_utils/cpu_rnnt.py b/nemo/collections/asr/parts/numba/rnnt_loss/utils/cpu_utils/cpu_rnnt.py
index 1528606716e1..3feb7b513a50 100644
--- a/nemo/collections/asr/parts/numba/rnnt_loss/utils/cpu_utils/cpu_rnnt.py
+++ b/nemo/collections/asr/parts/numba/rnnt_loss/utils/cpu_utils/cpu_rnnt.py
@@ -231,8 +231,8 @@ def cost_and_grad_kernel(
         )
 
         # Scale llForward by FastEmit lambda
-        llForward *= 1.0 + self.fastemit_lambda_
-        llBackward *= 1.0 + self.fastemit_lambda_
+        llForward += llForward * self.fastemit_lambda_
+        llBackward += llBackward * self.fastemit_lambda_
 
         diff = (llForward - llBackward).abs()
         if diff > 0.1:
@@ -300,6 +300,10 @@ def compute_betas_and_grads(
         Returns:
             Loglikelihood of the forward variable and inplace updates the grad tensor.
         """
+        # Patch for CPU + fp16
+        if log_probs.dtype == torch.float16 and not log_probs.is_cuda:
+            log_probs = log_probs.float()
+
         idx = CpuRNNT_index(U, self.maxU_, self.minibatch_, self.alphabet_size_, self.batch_first)
         betas[idx(T - 1, U - 1)] = log_probs[idx(T - 1, U - 1) * 2]
diff --git a/nemo/collections/asr/parts/numba/rnnt_loss/utils/rnnt_helper.py b/nemo/collections/asr/parts/numba/rnnt_loss/utils/rnnt_helper.py
index b579b7315ef2..6ca7cd237264 100644
--- a/nemo/collections/asr/parts/numba/rnnt_loss/utils/rnnt_helper.py
+++ b/nemo/collections/asr/parts/numba/rnnt_loss/utils/rnnt_helper.py
@@ -30,6 +30,7 @@
 import math
 from typing import Optional, Tuple
 
+import numba
 import torch
 from numba import cuda
 
@@ -112,7 +113,7 @@ def compute_costs_data(source: torch.Tensor, dest: torch.Tensor, fastemit_lambda
     if idx < length:
         copy_data_1d(source, dest, idx)
         dest[idx] *= -1.0
-        dest[idx] *= 1.0 + fastemit_lambda
+        dest[idx] *= numba.float32(1.0 + fastemit_lambda)
 
 
 def get_workspace_size(
diff --git a/nemo/core/utils/numba_utils.py b/nemo/core/utils/numba_utils.py
index 6e1a8cb247d6..04010a2f7db4 100644
--- a/nemo/core/utils/numba_utils.py
+++ b/nemo/core/utils/numba_utils.py
@@ -17,6 +17,8 @@
 import operator
 import os
 
+from typing import Tuple, Union
+
 from nemo.utils import model_utils
 
 # Prevent Numba CUDA logs from showing at info level
@@ -26,6 +28,11 @@
 __NUMBA_DEFAULT_MINIMUM_VERSION__ = "0.53.0"
 __NUMBA_MINIMUM_VERSION__ = os.environ.get("NEMO_NUMBA_MINVER", __NUMBA_DEFAULT_MINIMUM_VERSION__)
 
+__NUMBA_MINIMUM_VERSION_FP16_SUPPORTED__ = "0.57.0"
+NUMBA_FP16_SUPPORTED = model_utils.check_lib_version(
+    'numba', __NUMBA_MINIMUM_VERSION_FP16_SUPPORTED__, operator=operator.ge
+)[0]
+
 
 NUMBA_INSTALLATION_MESSAGE = (
     "Could not import `numba`.\n"
@@ -148,6 +155,35 @@ def numba_cuda_is_supported(min_version: str) -> bool:
return False +def is_numba_cuda_fp16_supported(return_reason: bool = False) -> Union[bool, Tuple[bool, str]]: + """ + Utility method that returns a bool, stating if FP16 is supported for numba cuda kernels or not. + + Returns: + bool, whether Numba CUDA will support fp16 or not. + """ + reason = "" + use_nvidia_binding = os.environ.get('NUMBA_CUDA_USE_NVIDIA_BINDING', None) + if use_nvidia_binding is not None: + use_nvidia_binding = use_nvidia_binding.lower() == "1" + reason += "Env variable `NUMBA_CUDA_USE_NVIDIA_BINDING` is available and set to `1`. " + else: + use_nvidia_binding = False + reason += "Env variable `NUMBA_CUDA_USE_NVIDIA_BINDING` is not available or has not set to `1`." + + if NUMBA_FP16_SUPPORTED: + reason += f"Numba CUDA FP16 is supported in installed numba version." + else: + reason += f"Numba CUDA FP16 is not supported in installed numba version." + + result = use_nvidia_binding and NUMBA_FP16_SUPPORTED + + if return_reason: + return result, reason + else: + return result + + def skip_numba_cuda_test_if_unsupported(min_version: str): """ Helper method to skip pytest test case if numba cuda is not supported. diff --git a/tests/collections/asr/numba/rnnt_loss/test_rnnt_pytorch.py b/tests/collections/asr/numba/rnnt_loss/test_rnnt_pytorch.py index 3fbfcf6df54b..1a29a14f540d 100644 --- a/tests/collections/asr/numba/rnnt_loss/test_rnnt_pytorch.py +++ b/tests/collections/asr/numba/rnnt_loss/test_rnnt_pytorch.py @@ -34,9 +34,14 @@ DEVICES.append('cuda') +DTYPES = [np.float32] +if numba_utils.is_numba_cuda_fp16_supported(): + DTYPES.append(np.float16) + + def wrap_and_call(fn, acts, labels, device): if not torch.is_tensor(acts): - acts = torch.FloatTensor(acts) + acts = torch.tensor(acts) if 'cuda' in device: acts = acts.cuda() @@ -72,7 +77,8 @@ def wrap_and_call(fn, acts, labels, device): class TestRNNTLossPytorch: @pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) - def test_case_small(self, device): + @pytest.mark.parametrize('dtype', DTYPES) + def test_case_small(self, device, dtype): if device == 'cuda': numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) @@ -83,9 +89,13 @@ def test_case_small(self, device): [[0.1, 0.6, 0.1, 0.1, 0.1], [0.1, 0.1, 0.2, 0.1, 0.1], [0.7, 0.1, 0.2, 0.1, 0.1]], ] ] - ) + ).astype(dtype) labels = [[1, 2]] + cost_threshold = 1e-8 if dtype == np.float32 else 5e-4 + grad_threshold = 1e-8 if dtype == np.float32 else 1e-4 + rtol = 1e-5 if dtype == np.float32 else 1e-3 + fn_pt = RNNTLossNumba(blank=0, reduction='sum') pt_cost, pt_grads = wrap_and_call(fn_pt, acts, labels, device) @@ -113,23 +123,28 @@ def test_case_small(self, device): ] ) - assert np.allclose(pt_cost, expected_cost, rtol=1e-6), "small_test costs mismatch." - assert np.allclose(pt_grads, expected_grads), "small_test gradient mismatch." + assert np.allclose(pt_cost, expected_cost, atol=cost_threshold, rtol=1e-6), "small_test costs mismatch." + assert np.allclose(pt_grads, expected_grads, atol=grad_threshold, rtol=rtol), "small_test gradient mismatch." - assert np.allclose(pt_cost, np_cost, rtol=1e-6), "small_test costs mismatch." - assert np.allclose(pt_grads, np_grads), "small_test gradient mismatch." + assert np.allclose(pt_cost, np_cost, atol=cost_threshold, rtol=rtol), "small_test costs mismatch." + assert np.allclose(pt_grads, np_grads, atol=grad_threshold, rtol=rtol), "small_test gradient mismatch." - assert np.allclose(ag_cost, np_cost, rtol=1e-6), "small_test costs mismatch." 
- assert np.allclose(ag_grads, np_grads), "small_test gradient mismatch." + assert np.allclose(ag_cost, np_cost, atol=cost_threshold, rtol=rtol), "small_test costs mismatch." + assert np.allclose(ag_grads, np_grads, atol=cost_threshold, rtol=rtol), "small_test gradient mismatch." @pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) - def test_case_small_random(self, device): + @pytest.mark.parametrize('dtype', DTYPES) + def test_case_small_random(self, device, dtype): if device == 'cuda': numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) + cost_threshold = 1e-8 if dtype == np.float32 else 5e-4 + grad_threshold = 1e-8 if dtype == np.float32 else 1e-4 + rtol = 1e-5 if dtype == np.float32 else 1e-3 + rng = np.random.RandomState(0) - acts = rng.randn(1, 4, 3, 3) + acts = rng.randn(1, 4, 3, 3).astype(dtype) labels = [[1, 2]] fn_pt = RNNTLossNumba(blank=0, reduction='sum') @@ -141,16 +156,17 @@ def test_case_small_random(self, device): fn_ag = RNNTLossPytorch(blank=0, reduction='sum') # ag for automatic gradient computation ag_cost, ag_grads = wrap_and_call(fn_ag, acts, labels, device) - assert np.allclose(pt_cost, np_cost, rtol=1e-6), "small_random_test costs mismatch." - assert np.allclose(pt_grads, np_grads), "small_random_test gradient mismatch." + assert np.allclose(pt_cost, np_cost, atol=cost_threshold, rtol=rtol), "small_random_test costs mismatch." + assert np.allclose(pt_grads, np_grads, atol=grad_threshold, rtol=rtol), "small_random_test gradient mismatch." - assert np.allclose(pt_cost, ag_cost, rtol=1e-6), "small_random_test costs mismatch." - assert np.allclose(pt_grads, ag_grads), "small_random_test gradient mismatch." + assert np.allclose(pt_cost, ag_cost, atol=cost_threshold, rtol=rtol), "small_random_test costs mismatch." + assert np.allclose(pt_grads, ag_grads, atol=grad_threshold, rtol=rtol), "small_random_test gradient mismatch." @pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) + @pytest.mark.parametrize('dtype', DTYPES) @pytest.mark.parametrize('fastemit_lambda', [1.0, 0.01, 0.00001]) - def test_case_small_random_fastemit_reg(self, device, fastemit_lambda): + def test_case_small_random_fastemit_reg(self, device, dtype, fastemit_lambda): if device == 'cuda': numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) @@ -165,11 +181,12 @@ def test_case_small_random_fastemit_reg(self, device, fastemit_lambda): np_cost, np_grads = wrap_and_call(fn_np, acts, labels, device) assert np.allclose(pt_cost, np_cost, rtol=1e-6), "small_random_test costs mismatch." - assert np.allclose(pt_grads, np_grads, atol=1e-5, rtol=1e-5), "small_random_test gradient mismatch." + assert np.allclose(pt_grads, np_grads, rtol=1e-5), "small_random_test gradient mismatch." 
@pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) - def test_case_big_tensor(self, device): + @pytest.mark.parametrize('dtype', DTYPES) + def test_case_big_tensor(self, device, dtype): if device == 'cuda': numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) @@ -269,9 +286,13 @@ def test_case_big_tensor(self, device): ], ] - activations = np.array(activations) + activations = np.array(activations).astype(dtype) labels = [[1, 2], [1, 1]] + cost_threshold = 1e-8 if dtype == np.float32 else 5e-4 + grad_threshold = 1e-8 if dtype == np.float32 else 1e-4 + rtol = 1e-3 if dtype == np.float32 else 0.1 + fn_pt = RNNTLossNumba(blank=0, reduction='sum') pt_costs, pt_grads = wrap_and_call(fn_pt, activations, labels, device) @@ -281,23 +302,30 @@ def test_case_big_tensor(self, device): fn_ag = RNNTLossPytorch(blank=0, reduction='sum') ag_costs, ag_grads = wrap_and_call(fn_ag, activations, labels, device) - assert np.allclose(pt_costs, sum(expected_costs)), "big_test average costs mismatch." - assert np.allclose(pt_grads, expected_grads, rtol=1e-3), "big_test grads for average cost mismatch." + assert np.allclose(pt_costs, sum(expected_costs), atol=cost_threshold), "big_test average costs mismatch." + assert np.allclose( + pt_grads, expected_grads, atol=grad_threshold, rtol=1e-3 + ), "big_test grads for average cost mismatch." - assert np.allclose(pt_costs, np_costs), "big_test average costs mismatch." - assert np.allclose(pt_grads, np_grads, rtol=1e-3), "big_test grads for average cost mismatch." + assert np.allclose(pt_costs, np_costs, atol=cost_threshold, rtol=rtol), "big_test average costs mismatch." + assert np.allclose( + pt_grads, np_grads, atol=grad_threshold, rtol=rtol + ), "big_test grads for average cost mismatch." - assert np.allclose(pt_costs, ag_costs), "big_test average costs mismatch." - assert np.allclose(pt_grads, ag_grads, rtol=1e-3), "big_test grads for average cost mismatch." + assert np.allclose(pt_costs, ag_costs, atol=cost_threshold, rtol=rtol), "big_test average costs mismatch." + assert np.allclose( + pt_grads, ag_grads, atol=grad_threshold, rtol=rtol + ), "big_test grads for average cost mismatch." @pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) - def test_case_large_random(self, device): + @pytest.mark.parametrize('dtype', DTYPES) + def test_case_large_random(self, device, dtype): if device == 'cuda': numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) rng = np.random.RandomState(0) - acts = rng.randn(4, 8, 11, 5) + acts = rng.randn(4, 8, 11, 5).astype(dtype) labels = [ [1, 2, 4, 3, 2, 2, 1, 1, 1, 1], [3, 2, 2, 3, 4, 1, 1, 1, 1, 1], @@ -305,6 +333,10 @@ def test_case_large_random(self, device): [1, 1, 2, 1, 2, 3, 3, 1, 1, 1], ] + cost_threshold = 1e-8 if dtype == np.float32 else 5e-4 + grad_threshold = 1e-8 if dtype == np.float32 else 1e-4 + rtol = 1e-3 if dtype == np.float32 else 5e-2 + fn_pt = RNNTLossNumba(blank=0, reduction='sum') pt_cost, pt_grads = wrap_and_call(fn_pt, acts, labels, device) @@ -314,14 +346,15 @@ def test_case_large_random(self, device): fn_ag = RNNTLossPytorch(blank=0, reduction='sum') ag_cost, ag_grads = wrap_and_call(fn_ag, acts, labels, device) - assert np.allclose(pt_cost, np_cost, atol=1e-5, rtol=1e-3), "large_random_test costs mismatch." - assert np.allclose(ag_cost, np_cost, atol=1e-5, rtol=1e-3), "large_random_test costs mismatch." - assert np.allclose(pt_grads, np_grads, atol=1e-5, rtol=1e-3), "large_random_test gradient mismatch." 
- assert np.allclose(ag_grads, np_grads, atol=1e-5, rtol=1e-3), "large_random_test gradient mismatch." + assert np.allclose(pt_cost, np_cost, atol=cost_threshold, rtol=rtol), "large_random_test costs mismatch." + assert np.allclose(ag_cost, np_cost, atol=cost_threshold, rtol=rtol), "large_random_test costs mismatch." + assert np.allclose(pt_grads, np_grads, atol=grad_threshold, rtol=rtol), "large_random_test gradient mismatch." + assert np.allclose(ag_grads, np_grads, atol=grad_threshold, rtol=rtol), "large_random_test gradient mismatch." @pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) - def test_case_small_clamp(self, device): + @pytest.mark.parametrize('dtype', DTYPES) + def test_case_small_clamp(self, device, dtype): if device == 'cuda': numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) @@ -333,9 +366,13 @@ def test_case_small_clamp(self, device): [[0.1, 0.6, 0.1, 0.1, 0.1], [0.1, 0.1, 0.2, 0.1, 0.1], [0.7, 0.1, 0.2, 0.1, 0.1]], ] ] - ) + ).astype(dtype) labels = [[1, 2]] + cost_threshold = 1e-8 if dtype == np.float32 else 5e-4 + grad_threshold = 1e-8 if dtype == np.float32 else 5e-5 + rtol = 1e-5 if dtype == np.float32 else 1e-3 + fn_pt = RNNTLossNumba(blank=0, reduction='sum', clamp=GRAD_CLAMP) pt_cost, pt_grads = wrap_and_call(fn_pt, acts, labels, device) @@ -360,16 +397,17 @@ def test_case_small_clamp(self, device): ] ) - assert np.allclose(pt_cost, expected_cost, rtol=1e-6), "small_test costs mismatch." - assert np.allclose(pt_grads, expected_grads), "small_test gradient mismatch." + assert np.allclose(pt_cost, expected_cost, atol=cost_threshold, rtol=rtol), "small_test costs mismatch." + assert np.allclose(pt_grads, expected_grads, atol=grad_threshold, rtol=rtol), "small_test gradient mismatch." - assert np.allclose(pt_cost, np_cost, rtol=1e-6), "small_test costs mismatch." - assert np.allclose(pt_grads, np_grads), "small_test gradient mismatch." + assert np.allclose(pt_cost, np_cost, atol=cost_threshold, rtol=rtol), "small_test costs mismatch." + assert np.allclose(pt_grads, np_grads, atol=grad_threshold, rtol=rtol), "small_test gradient mismatch." @pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) + @pytest.mark.parametrize('dtype', DTYPES) @pytest.mark.parametrize('fastemit_lambda', [1.0, 0.01, 0.00001]) - def test_case_small_fastemit_clamp(self, device, fastemit_lambda): + def test_case_small_fastemit_clamp(self, device, dtype, fastemit_lambda): if device == 'cuda': numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) @@ -381,9 +419,13 @@ def test_case_small_fastemit_clamp(self, device, fastemit_lambda): [[0.1, 0.6, 0.1, 0.1, 0.1], [0.1, 0.1, 0.2, 0.1, 0.1], [0.7, 0.1, 0.2, 0.1, 0.1]], ] ] - ) + ).astype(dtype) labels = [[1, 2]] + cost_threshold = 1e-8 if dtype == np.float32 else 1e-3 + grad_threshold = 1e-8 if dtype == np.float32 else 5e-4 + rtol = 1e-5 if dtype == np.float32 else 1e-3 + fn_pt = RNNTLossNumba(blank=0, reduction='sum', fastemit_lambda=fastemit_lambda, clamp=GRAD_CLAMP) pt_cost, pt_grads = wrap_and_call(fn_pt, acts, labels, device) @@ -393,9 +435,9 @@ def test_case_small_fastemit_clamp(self, device, fastemit_lambda): expected_cost = 4.495666 expected_cost += expected_cost * fastemit_lambda - assert np.allclose(pt_cost, expected_cost, rtol=1e-6), "small_test costs mismatch." - assert np.allclose(pt_cost, np_cost, rtol=1e-6), "small_test costs mismatch." - assert np.allclose(pt_grads, np_grads), "small_test gradient mismatch." 
+ assert np.allclose(pt_cost, expected_cost, atol=cost_threshold, rtol=rtol), "small_test costs mismatch." + assert np.allclose(pt_cost, np_cost, atol=cost_threshold, rtol=rtol), "small_test costs mismatch." + assert np.allclose(pt_grads, np_grads, atol=grad_threshold, rtol=rtol), "small_test gradient mismatch." @pytest.mark.unit @pytest.mark.parametrize('device', DEVICES) diff --git a/tests/collections/asr/numba/rnnt_loss/utils/test_gpu_rnnt_kernel.py b/tests/collections/asr/numba/rnnt_loss/utils/test_gpu_rnnt_kernel.py index 230b6b7c099f..cb5a9816e237 100644 --- a/tests/collections/asr/numba/rnnt_loss/utils/test_gpu_rnnt_kernel.py +++ b/tests/collections/asr/numba/rnnt_loss/utils/test_gpu_rnnt_kernel.py @@ -25,8 +25,14 @@ from nemo.core.utils.numba_utils import __NUMBA_MINIMUM_VERSION__ +DTYPES = [torch.float32] +if numba_utils.is_numba_cuda_fp16_supported(): + DTYPES.append(torch.float16) + + def log_softmax(x, axis=-1): x = torch.from_numpy(x) # zero-copy + x = x.float() x = torch.log_softmax(x, dim=axis) x = x.numpy() return x @@ -42,12 +48,14 @@ def log_softmax_grad(x, axis=-1): class TestRNNTCUDAKernels: @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Reductions can only be run when CUDA is available") @pytest.mark.unit - def test_compute_alphas_kernel(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_compute_alphas_kernel(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) random = np.random.RandomState(0) original_shape = [1, 5, 11, 3] B, T, U, V = original_shape + threshold = 1e-5 if dtype == torch.float32 else 3e-4 # Numpy kernel x = random.randn(*original_shape) @@ -67,7 +75,7 @@ def test_compute_alphas_kernel(self): else: stream = cuda.default_stream() - x_c = torch.tensor(x, device=device, dtype=torch.float32) + x_c = torch.tensor(x, device=device, dtype=dtype) labels_c = torch.tensor(labels, device=device, dtype=torch.int64) # Allocate workspace memory @@ -100,22 +108,24 @@ def test_compute_alphas_kernel(self): alphas = alphas.view([B, T, U]) diff = ground_alphas - alphas[0].cpu().numpy() - assert np.abs(diff).mean() <= 1e-5 - assert np.square(diff).mean() <= 1e-10 + assert np.abs(diff).mean() <= threshold + assert np.square(diff).mean() <= (threshold ** 2) ll_diff = ground_log_likelihood - llForward[0].cpu().numpy() - assert np.abs(ll_diff).mean() <= 1e-5 - assert np.square(ll_diff).mean() <= 1e-10 + assert np.abs(ll_diff).mean() <= threshold + assert np.square(ll_diff).mean() <= (threshold ** 2) @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Reductions can only be run when CUDA is available") @pytest.mark.unit - def test_compute_betas_kernel(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_compute_betas_kernel(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) random = np.random.RandomState(0) original_shape = [1, 5, 11, 3] B, T, U, V = original_shape + threshold = 1e-5 if dtype == torch.float32 else 3e-4 # Numpy kernel x = random.randn(*original_shape) @@ -135,7 +145,7 @@ def test_compute_betas_kernel(self): else: stream = cuda.default_stream() - x_c = torch.tensor(x, device=device, dtype=torch.float32) + x_c = torch.tensor(x, device=device, dtype=dtype) labels_c = torch.tensor(labels, device=device, dtype=torch.int64) # Allocate workspace memory @@ -168,17 +178,18 @@ def test_compute_betas_kernel(self): betas = betas.view([B, T, U]) diff = ground_alphas - betas[0].cpu().numpy() - assert np.abs(diff).mean() <= 1e-5 - assert np.square(diff).mean() <= 
1e-10 + assert np.abs(diff).mean() <= threshold + assert np.square(diff).mean() <= (threshold ** 2) ll_diff = ground_log_likelihood - llBackward[0].cpu().numpy() - assert np.abs(ll_diff).mean() <= 1e-5 - assert np.square(ll_diff).mean() <= 1e-10 + assert np.abs(ll_diff).mean() <= threshold + assert np.square(ll_diff).mean() <= (threshold ** 2) @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Reductions can only be run when CUDA is available") @pytest.mark.unit - def test_compute_grads_kernel(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_compute_grads_kernel(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) fastemit_lambda = 0.0 @@ -187,6 +198,7 @@ def test_compute_grads_kernel(self): random = np.random.RandomState(0) original_shape = [1, 5, 11, 3] B, T, U, V = original_shape + threshold = 1e-5 if dtype == torch.float32 else 3e-5 # Numpy kernel x = random.randn(*original_shape) @@ -220,7 +232,7 @@ def test_compute_grads_kernel(self): else: stream = cuda.default_stream() - x_c = torch.tensor(x, device=device, dtype=torch.float32) + x_c = torch.tensor(x, device=device, dtype=dtype) labels_c = labels.clone().to(device=device, dtype=torch.int64) # Allocate workspace memory @@ -283,12 +295,13 @@ def test_compute_grads_kernel(self): grads = grads.view([B, T, U, V]) diff = true_grads - grads[0].cpu().numpy() - assert np.abs(diff).mean() <= 1e-5 - assert np.square(diff).mean() <= 1e-10 + assert np.abs(diff).mean() <= threshold + assert np.square(diff).mean() <= (threshold ** 2) * 5.0 @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Reductions can only be run when CUDA is available") @pytest.mark.unit - def test_compute_grads_kernel_fastemit(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_compute_grads_kernel_fastemit(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) fastemit_lambda = 0.001 @@ -297,6 +310,7 @@ def test_compute_grads_kernel_fastemit(self): random = np.random.RandomState(0) original_shape = [1, 5, 11, 3] B, T, U, V = original_shape + threshold = 1e-5 if dtype == torch.float32 else 3e-5 # Numpy kernel x = random.randn(*original_shape) @@ -330,7 +344,7 @@ def test_compute_grads_kernel_fastemit(self): else: stream = cuda.default_stream() - x_c = torch.tensor(x, device=device, dtype=torch.float32) + x_c = torch.tensor(x, device=device, dtype=dtype) labels_c = labels.clone().to(device=device, dtype=torch.int64) # Allocate workspace memory @@ -393,12 +407,13 @@ def test_compute_grads_kernel_fastemit(self): grads = grads.view([B, T, U, V]) diff = true_grads - grads[0].cpu().numpy() - assert np.abs(diff).mean() <= 1e-5 - assert np.square(diff).mean() <= 1e-10 + assert np.abs(diff).mean() <= threshold + assert np.square(diff).mean() <= (threshold ** 2) * 5 @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Reductions can only be run when CUDA is available") @pytest.mark.unit - def test_compute_grads_kernel_clamp(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_compute_grads_kernel_clamp(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) fastemit_lambda = 0.0 @@ -407,6 +422,7 @@ def test_compute_grads_kernel_clamp(self): random = np.random.RandomState(0) original_shape = [1, 5, 11, 3] B, T, U, V = original_shape + threshold = 1e-5 if dtype == torch.float32 else 3e-5 # Numpy kernel x = random.randn(*original_shape) @@ -440,7 +456,7 @@ def test_compute_grads_kernel_clamp(self): else: stream = cuda.default_stream() - 
x_c = torch.tensor(x, device=device, dtype=torch.float32) + x_c = torch.tensor(x, device=device, dtype=dtype) labels_c = labels.clone().to(device=device, dtype=torch.int64) # Allocate workspace memory @@ -503,8 +519,8 @@ def test_compute_grads_kernel_clamp(self): grads = grads.view([B, T, U, V]) diff = true_grads - grads[0].cpu().numpy() - assert np.abs(diff).mean() <= 1e-5 - assert np.square(diff).mean() <= 1e-10 + assert np.abs(diff).mean() <= threshold + assert np.square(diff).mean() <= (threshold ** 2) * 5 class TestTDTCUDAKernels: diff --git a/tests/collections/asr/numba/rnnt_loss/utils/test_reduce.py b/tests/collections/asr/numba/rnnt_loss/utils/test_reduce.py index 7c2ba6a41208..5994d53e1d8f 100644 --- a/tests/collections/asr/numba/rnnt_loss/utils/test_reduce.py +++ b/tests/collections/asr/numba/rnnt_loss/utils/test_reduce.py @@ -20,17 +20,22 @@ from nemo.core.utils import numba_utils from nemo.core.utils.numba_utils import __NUMBA_MINIMUM_VERSION__ +DTYPES = [np.float32] +if numba_utils.is_numba_cuda_fp16_supported(): + DTYPES.append(np.float16) + class TestRNNTCUDAReductions: @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Reductions can only be run when CUDA is available") @pytest.mark.unit - def test_reduce_max(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_reduce_max(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) random = np.random.RandomState(0) original_shape = [1, 5, 4, 3] - x = random.randn(*original_shape).reshape([-1]) - dx = random.randn(*x.shape) + x = random.randn(*original_shape).reshape([-1]).astype(dtype) + dx = random.randn(*x.shape).astype(dtype) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -53,13 +58,14 @@ def test_reduce_max(self): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Reductions can only be run when CUDA is available") @pytest.mark.unit - def test_reduce_exp(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_reduce_exp(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) random = np.random.RandomState(0) original_shape = [1, 5, 4, 2] - x = random.randn(*original_shape).reshape([-1]) - dx = np.zeros_like(x) + x = random.randn(*original_shape).reshape([-1]).astype(dtype) + dx = np.zeros_like(x).astype(dtype) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) diff --git a/tests/collections/asr/numba/rnnt_loss/utils/test_rnnt_helper.py b/tests/collections/asr/numba/rnnt_loss/utils/test_rnnt_helper.py index 243fe727e172..08f12da8324d 100644 --- a/tests/collections/asr/numba/rnnt_loss/utils/test_rnnt_helper.py +++ b/tests/collections/asr/numba/rnnt_loss/utils/test_rnnt_helper.py @@ -20,11 +20,16 @@ from nemo.core.utils import numba_utils from nemo.core.utils.numba_utils import __NUMBA_MINIMUM_VERSION__ +DTYPES = [np.float32] +if numba_utils.is_numba_cuda_fp16_supported(): + DTYPES.append(np.float16) + class TestRNNTHelper: @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_log_sum_exp(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_log_sum_exp(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -34,8 +39,9 @@ def _kernel(x, y): if x_pos < x.shape[0] and x_pos < y.shape[0]: x[x_pos] = rnnt_helper.log_sum_exp(x[x_pos], y[x_pos]) - x = np.zeros([8]) # np.random.rand(8192) - y = np.ones([8]) # np.random.rand(8192) + x = 
np.zeros([8]).astype(dtype) # np.random.rand(8192) + y = np.ones([8]).astype(dtype) # np.random.rand(8192) + threshold = 1e-5 if dtype == np.float32 else 2e-3 stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -52,11 +58,12 @@ def _kernel(x, y): x_new = x_c.copy_to_host(stream=stream) del x_c, y_c - assert (x_new.sum() - 10.506093500145782) <= 1e-5 + assert (x_new.sum() - 10.506093500145782) <= threshold @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_log_sum_exp_neg_inf(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_log_sum_exp_neg_inf(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -66,8 +73,8 @@ def _kernel(x, y): if x_pos < x.shape[0] and x_pos < y.shape[0]: x[x_pos] = rnnt_helper.log_sum_exp(x[x_pos], y[x_pos]) - x = np.asarray([global_constants.FP32_NEG_INF] * 8) - y = np.ones([len(x)]) + x = np.asarray([global_constants.FP32_NEG_INF] * 8).astype(dtype) + y = np.ones([len(x)]).astype(dtype) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -88,7 +95,8 @@ def _kernel(x, y): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_div_up(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_div_up(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -98,8 +106,8 @@ def _kernel(x, y): if x_pos < x.shape[0] and x_pos < y.shape[0]: x[x_pos] = rnnt_helper.div_up(x[x_pos], y[x_pos]) - x = np.full([8], fill_value=10) # np.random.rand(8192) - y = np.full([8], fill_value=2) # np.random.rand(8192) + x = np.full([8], fill_value=10).astype(dtype) # np.random.rand(8192) + y = np.full([8], fill_value=2).astype(dtype) # np.random.rand(8192) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -121,7 +129,8 @@ def _kernel(x, y): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_add(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_add(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -131,8 +140,8 @@ def _kernel(x, y): if x_pos < x.shape[0] and x_pos < y.shape[0]: x[x_pos] = rnnt_helper.add(x[x_pos], y[x_pos]) - x = np.full([8], fill_value=10) # np.random.rand(8192) - y = np.full([8], fill_value=2) # np.random.rand(8192) + x = np.full([8], fill_value=10).astype(dtype) # np.random.rand(8192) + y = np.full([8], fill_value=2).astype(dtype) # np.random.rand(8192) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -154,7 +163,8 @@ def _kernel(x, y): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_maximum(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_maximum(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -164,8 +174,8 @@ def _kernel(x, y): if x_pos < x.shape[0] and x_pos < y.shape[0]: x[x_pos] = rnnt_helper.maximum(x[x_pos], y[x_pos]) - x = np.full([8], fill_value=10) # np.random.rand(8192) - y = np.full([8], fill_value=2) # np.random.rand(8192) + x = np.full([8], fill_value=10).astype(dtype) # 
np.random.rand(8192) + y = np.full([8], fill_value=2).astype(dtype) # np.random.rand(8192) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -187,7 +197,8 @@ def _kernel(x, y): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_identity(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_identity(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -197,7 +208,7 @@ def _kernel(x): if x_pos < x.shape[0]: x[x_pos] = rnnt_helper.identity(x[x_pos]) - x = np.full([8], fill_value=10) # np.random.rand(8192) + x = np.full([8], fill_value=10).astype(dtype) # np.random.rand(8192) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -218,7 +229,8 @@ def _kernel(x): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_negate(self): + @pytest.mark.parametrize('dtype', [np.float32, np.float16]) + def test_negate(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -228,7 +240,7 @@ def _kernel(x): if x_pos < x.shape[0]: x[x_pos] = rnnt_helper.negate(x[x_pos]) - x = np.full([8], fill_value=10) # np.random.rand(8192) + x = np.full([8], fill_value=10).astype(dtype) # np.random.rand(8192) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -249,7 +261,8 @@ def _kernel(x): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_exponential(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_exponential(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -259,7 +272,7 @@ def _kernel(x): if x_pos < x.shape[0]: x[x_pos] = rnnt_helper.exponential(x[x_pos]) - x = np.random.rand(8) + x = np.random.rand(8).astype(dtype) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -281,7 +294,8 @@ def _kernel(x): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.unit - def test_log_plus(self): + @pytest.mark.parametrize('dtype', DTYPES) + def test_log_plus(self, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) # wrapper kernel for device function that is tested @@ -291,8 +305,8 @@ def _kernel(x, y): if x_pos < x.shape[0] and x_pos < y.shape[0]: x[x_pos] = rnnt_helper.log_plus(x[x_pos], y[x_pos]) - x = np.full([8], fill_value=10.0) # np.random.rand(8192) - y = np.full([8], fill_value=2.0) # np.random.rand(8192) + x = np.full([8], fill_value=10.0).astype(dtype) # np.random.rand(8192) + y = np.full([8], fill_value=2.0).astype(dtype) # np.random.rand(8192) stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -317,12 +331,15 @@ def _kernel(x, y): @pytest.mark.skipif(not cuda.is_available(), reason="CUDA Helpers can only be run when CUDA is available") @pytest.mark.parametrize('batch_size', [8, 128, 256]) @pytest.mark.parametrize('fastemit_lambda', [0.0, 0.001]) + @pytest.mark.parametrize('dtype', DTYPES) @pytest.mark.unit - def test_compute_costs_data(self, batch_size, fastemit_lambda): + def test_compute_costs_data(self, batch_size, fastemit_lambda, dtype): numba_utils.skip_numba_cuda_test_if_unsupported(__NUMBA_MINIMUM_VERSION__) + 
np.random.seed(0) x = np.full([batch_size], fill_value=0.0) # np.random.rand(8192) - y = np.random.randn(batch_size) # np.random.rand(8192) + y = np.random.randn(batch_size).astype(dtype) # np.random.rand(8192) + threshold = 1e-5 if dtype == np.float32 else 1e-5 stream = cuda.stream() x_c = cuda.to_device(x, stream=stream) @@ -340,11 +357,11 @@ def test_compute_costs_data(self, batch_size, fastemit_lambda): x_new = x_c.copy_to_host(stream=stream) del x_c, y_c - res = -(y.copy()) + res = -(y.astype(np.float32).copy()) res *= 1.0 + fastemit_lambda for i in range(len(x_new)): - assert x_new[i] == res[i], f"index failed {i}" + assert abs(x_new[i] - res[i]) < threshold, f"index failed {i}" if __name__ == '__main__':
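
Taken together, the changes above gate the Numba CUDA fp16 kernel on numba_utils.is_numba_cuda_fp16_supported() while keeping the workspace and the per-batch costs in fp32 (grads are still allocated with torch.zeros_like(acts), so they follow the input dtype). Below is a minimal usage sketch, not part of the patch, that mirrors the test_case_small pattern from the test diff; the RNNTLossNumba import path and the int64 dtypes chosen for labels and lengths are assumptions for illustration and are not taken from this diff.

    import torch

    from nemo.core.utils import numba_utils
    # Assumed import path for illustration; adjust to wherever RNNTLossNumba is exposed in the installed NeMo.
    from nemo.collections.asr.parts.numba.rnnt_loss import RNNTLossNumba

    if torch.cuda.is_available() and numba_utils.is_numba_cuda_fp16_supported():
        # Joint tensor layout [B, T, U+1, V+1], as in the small test case above.
        B, T, U, V = 1, 2, 3, 5
        acts = torch.randn(B, T, U, V, device='cuda', dtype=torch.float16, requires_grad=True)
        labels = torch.tensor([[1, 2]], device='cuda', dtype=torch.int64)    # integer dtype assumed
        act_lens = torch.tensor([T], device='cuda', dtype=torch.int64)       # integer dtype assumed
        label_lens = torch.tensor([2], device='cuda', dtype=torch.int64)     # integer dtype assumed

        loss_fn = RNNTLossNumba(blank=0, reduction='sum')
        loss = loss_fn(acts, labels, act_lens, label_lens)  # Numba CUDA kernel consumes the fp16 joint
        loss.backward()  # gradients match acts' dtype; costs and workspace remain fp32 per this patch
    # If fp16 is not supported, the RNNTLoss wrapper in rnnt.py upcasts the joint to fp32
    # and logs the reason once via is_numba_cuda_fp16_supported(return_reason=True).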