Added/updated new Conformer configs #6426

Merged: 10 commits, Apr 20, 2023
32 changes: 20 additions & 12 deletions README.rst
@@ -75,12 +75,20 @@ Key Features
* Speech processing
* `HuggingFace Space for Audio Transcription (File, Microphone and YouTube) <https://huggingface.co/spaces/smajumdar/nemo_multilingual_language_id>`_
* `Automatic Speech Recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, Squeezeformer-CTC, Squeezeformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, FastConformer-CTC, FastConformer-Transducer, Conformer-HAT...
* Supports CTC, Transducer/RNNT and Hybrid losses/decoders
* Supported ASR models: `<https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html>`_
  * Jasper, QuartzNet, CitriNet, ContextNet
  * Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer
  * Squeezeformer-CTC and Squeezeformer-Transducer
  * LSTM-Transducer (RNNT) and LSTM-CTC
* Supports the following decoders/losses:
  * CTC
  * Transducer/RNNT
  * Hybrid Transducer/CTC
  * NeMo Original `Multi-blank Transducers <https://arxiv.org/abs/2211.03541>`_
* Streaming/Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference>`_
* Cache-aware Streaming Conformer - `<https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer>`_
* Beam Search decoding
* `Language Modelling for ASR <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
* Streaming and Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference>`_
* `Support of long audios for Conformer with memory efficient local attention <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/results.html#inference-on-long-audio>`_
* `Speech Classification, Speech Command Recognition and Language Identification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (Command Recognition), AmberNet (LangID)
* `Voice activity Detection (VAD) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad>`_: MarbleNet
@@ -98,12 +106,12 @@ Key Features
* `Punctuation and Capitalization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html>`_
* `Token classification (named entity recognition) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/token_classification.html>`_
* `Text classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_classification.html>`_
* `Joint Intent and Slot Classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/joint_intent_slot.html>`_
* `Question answering <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/question_answering.html>`_
* `GLUE benchmark <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/glue_benchmark.html>`_
* `Information retrieval <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/information_retrieval.html>`_
* `Entity Linking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/entity_linking.html>`_
* `Dialogue State Tracking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/sgd_qa.html>`_
* `Prompt Learning <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/prompt_learning.html>`_
* `NGC collection of pre-trained NLP models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_nlp>`_
* `Synthetic Tabular Data Generation <https://developer.nvidia.com/blog/generating-synthetic-data-with-transformers-a-solution-for-enterprise-data-challenges/>`_
@@ -170,7 +178,7 @@ We recommend installing NeMo in a fresh Conda environment.
conda create --name nemo python==3.8.10
conda activate nemo

Install PyTorch using their `configurator <https://pytorch.org/get-started/locally/>`_.

.. code-block:: bash

@@ -237,7 +245,7 @@ Install it manually if not using the NVIDIA PyTorch container.
git checkout 57057e2fcf1c084c0fcc818f55c0ff6ea1b24ae2
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./

It is highly recommended to use the NVIDIA PyTorch or NeMo container if you have issues installing Apex or any other dependencies.

Installing Apex may raise an error if the CUDA version on your system does not match the CUDA version PyTorch was compiled with.
You can avoid this check by commenting it out here: https://github.com/NVIDIA/apex/blob/master/setup.py#L32
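As a quick, optional sanity check before building Apex, the following minimal Python sketch (not part of the official install steps) prints the CUDA version PyTorch was compiled with and the toolkit version reported by ``nvcc``, so you can confirm they match:

.. code-block:: python

    # Optional sanity check: compare PyTorch's CUDA build version with the system toolkit.
    import subprocess

    import torch

    print("PyTorch built with CUDA:", torch.version.cuda)  # e.g. "11.8"
    # `nvcc --version` reports the toolkit Apex will be compiled against (requires CUDA on PATH).
    print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
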
@@ -251,21 +259,21 @@ cuda-nvprof is needed to install Apex. The version should match the CUDA version
packaging is also needed:

.. code-block:: bash

pip install packaging


Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo Megatron GPT has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_.
Transformer Engine enables FP8 training on NVIDIA Hopper GPUs.
`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ it manually if not using the NVIDIA PyTorch container.

.. code-block:: bash

pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable

It is highly recommended to use the NVIDIA PyTorch or NeMo container if you have issues installing Transformer Engine or any other dependencies.

Transformer Engine requires PyTorch to be built with CUDA 11.8.

@@ -275,15 +283,15 @@ NeMo Text Processing, specifically (Inverse) Text Normalization, is now a separa

Docker containers:
~~~~~~~~~~~~~~~~~~
We release NeMo containers alongside NeMo releases. For example, NeMo ``r1.16.0`` comes with the ``nemo:23.01`` container; you can find more details about released containers on the `releases page <https://github.com/NVIDIA/NeMo/releases>`_.

To use a pre-built container, run:

.. code-block:: bash

docker pull nvcr.io/nvidia/nemo:23.01

To build a NeMo container with the Dockerfile from a branch, run:

.. code-block:: bash

6 changes: 6 additions & 0 deletions docs/source/asr/models.rst
@@ -154,6 +154,7 @@ This allows using the model on longer audio (up to 70 minutes with Fast Conforme
can be used with limited context attention even if trained with full context. However, if you also want to use global tokens,
which help aggregate information from outside the limited context, then training is required.

You may find more examples under ``<NeMo_git_root>/examples/asr/conf/fastconformer/``.
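As an illustration, switching an already trained full-context Conformer/FastConformer model to limited-context (local) attention for long-audio inference could look like the minimal sketch below; the checkpoint name is only an example, and the ``change_attention_model`` helper and its argument values are assumptions based on the description above:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Example checkpoint; any full-context Conformer/FastConformer model is handled similarly.
    asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")

    # Swap full-context self-attention for limited-context (local) attention so long
    # recordings fit in memory. Using global tokens would require (re)training.
    asr_model.change_attention_model(
        self_attention_model="rel_pos_local_attn",
        att_context_size=[128, 128],  # left/right attention context per layer
    )

    transcript = asr_model.transcribe(["long_recording.wav"])[0]
    print(transcript)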

Cache-aware Streaming Conformer
-------------------------------
@@ -212,6 +213,8 @@ To simulate cache-aware streaming, you may use the script at ``<NeMo_git_root>/e
This script can be used for models trained offline with full context, but the accuracy will not be great unless the chunk size is large enough, which in turn results in high latency.
It is recommended to train a model in streaming mode with limited context for use with this script. More info can be found in the script.

You may find FastConformer variants of cache-aware streaming models under ``<NeMo_git_root>/examples/asr/conf/fastconformer/``.
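These checkpoints load like any other NeMo ASR model. The sketch below is only illustrative (the checkpoint name is hypothetical); streaming latency and accuracy should be measured with the simulation script mentioned above:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Hypothetical checkpoint name; substitute a cache-aware streaming (Fast)Conformer
    # checkpoint or a local .nemo file trained with one of the streaming configs.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        "stt_en_fastconformer_hybrid_large_streaming_80ms"
    )

    # Plain offline transcription works out of the box; for true streaming behaviour use
    # <NeMo_git_root>/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
    print(asr_model.transcribe(["sample.wav"]))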

.. _LSTM-Transducer_model:

LSTM-Transducer
@@ -284,6 +287,9 @@ You may find the example config files of Conformer variant of such hybrid models
``<NeMo_git_root>/examples/asr/conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc_char.yaml`` and
with sub-word encoding at ``<NeMo_git_root>/examples/asr/conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc_bpe.yaml``.

Similar example configs for FastConformer variants of Hybrid models can be found here:
``<NeMo_git_root>/examples/asr/conf/fastconformer/hybrid_transducer_ctc/``
``<NeMo_git_root>/examples/asr/conf/fastconformer/hybrid_cache_aware_streaming/``
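As an illustration, the sketch below switches a hybrid model between its Transducer and CTC decoders at inference time; the checkpoint name is hypothetical, and the ``decoder_type`` switch assumes the hybrid model's ``change_decoding_strategy`` helper:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Hypothetical checkpoint; applies to any model trained with a hybrid_transducer_ctc config.
    asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_hybrid_large_pc")

    # Hybrid models carry both a Transducer (RNNT) head and a CTC head; choose which one decodes.
    asr_model.change_decoding_strategy(decoding_cfg=None, decoder_type="rnnt")
    print(asr_model.transcribe(["sample.wav"]))  # Transducer decoding

    asr_model.change_decoding_strategy(decoding_cfg=None, decoder_type="ctc")
    print(asr_model.transcribe(["sample.wav"]))  # CTC decoding from the same encoder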

.. _Conformer-HAT_model:

@@ -1,14 +1,14 @@
# It contains the default values for training a streaming Conformer-CTC ASR model, large size (~120M) with CTC loss and sub-word encoding.
# It contains the default values for training a streaming cache-aware Conformer-CTC ASR model, large size (~120M) with CTC loss and sub-word encoding.
# Models trained with this config have limited right context, which makes them efficient for streaming ASR.

# Architecture and training config:
# You may find more detail on the architecture and training config at NeMo/examples/asr/comf/offline/conformer_ctc_bpe.yaml
# You may find more detail:
# Conformer's architecture config: NeMo/examples/asr/conf/conformer/conformer_ctc_bpe.yaml
# Cache-aware Streaming Conformer: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer

# Models trained with this config have limited right context which make them efficient for streaming ASR
# You may use NeMo/examples/asr/speech_to_text_streaming_infer.py to simulate/evaluate this model in cache-aware streaming mode
# You may use NeMo/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py to simulate/evaluate this model in cache-aware streaming mode
# Pre-trained ASR models can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html

# if loss does not go down properly or gives NAN, you may try the followings:
# + using gradient clipping of 1.0
# + increase the warmup steps from 10K to 20K
# Note: if loss does not go down properly or diverges, you may try increasing the warmup steps from 10K to 20K.

name: "Conformer-CTC-BPE-Streaming"

@@ -24,8 +24,6 @@ model:
shuffle: true
num_workers: 8
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
min_duration: 0.1
# tarred datasets
@@ -41,18 +39,18 @@ model:
sample_rate: ${model.sample_rate}
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: false
use_start_end_token: false
num_workers: 8
pin_memory: true
use_start_end_token: false

test_ds:
manifest_filepath: null
sample_rate: ${model.sample_rate}
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: false
use_start_end_token: false
num_workers: 8
pin_memory: true
use_start_end_token: false

# recommend small vocab size of 128 or 256 when using 4x sub-sampling
# you may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
@@ -63,7 +61,7 @@ model:
preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
sample_rate: ${model.sample_rate}
normalize: "per_feature"
normalize: "NA" # No normalization for mel-spectogram makes streaming easier
window_size: 0.025
window_stride: 0.01
window: "hann"
@@ -89,9 +87,8 @@ model:
d_model: 512

# Sub-sampling params
# stacking and stacking_norm would result in significant accuracy degradation and training instability with CTC models, recommend to use striding
# stacking_norm is more stable and robust for CTC models and can be around 25% faster during inference
subsampling: striding # vggnet, striding, stacking or stacking_norm, dw_striding
# stacking_norm, stacking and dw_striding can be around 25% faster than striding during inference, while they may give similar or slightly worse results in terms of accuracy for Transducer models
subsampling: striding # vggnet, striding, stacking, stacking_norm, or dw_striding
subsampling_factor: 4 # must be power of 2 for striding and vggnet
subsampling_conv_channels: -1 # -1 sets it to d_model
causal_downsampling: true # enables causal convolutions during downsampling
@@ -106,7 +103,9 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
att_context_size: [102, 33] # -1 means unlimited context
# for chunked_limited you may calculate the look-ahead or right context by the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 27*4*0.01=1.08s
att_context_size: [140, 27] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited

xscaling: true # scales up the input embeddings by sqrt(d_model)
@@ -161,7 +160,6 @@ model:
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 10000 # you may try a larger warmup like 20K if training is not stable
warmup_ratio: null
min_lr: 1e-6

trainer:
@@ -174,7 +172,7 @@ trainer:
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 1.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
precision: 32 # 16, 32, or bf16
log_every_n_steps: 10 # Interval of logging.
enable_progress_bar: True
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.