Added/updated new Conformer configs #6426

Merged
merged 10 commits into from
Apr 20, 2023
32 changes: 20 additions & 12 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -75,12 +75,20 @@ Key Features
* Speech processing
* `HuggingFace Space for Audio Transcription (File, Microphone and YouTube) <https://huggingface.co/spaces/smajumdar/nemo_multilingual_language_id>`_
* `Automatic Speech Recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, Squeezeformer-CTC, Squeezeformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, FastConformer-CTC, FastConformer-Transducer, Conformer-HAT...
* Supports CTC, Transducer/RNNT and Hybrid losses/decoders
* Supported ASR models: `<https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html>`_
* Jasper, QuartzNet, CitriNet, ContextNet
* Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer
* Squeezeformer-CTC and Squeezeformer-Transducer
* LSTM-Transducer (RNNT) and LSTM-CTC
* Supports the following decoders/losses:
* CTC
* Transducer/RNNT
* Hybrid Transducer/CTC
* NeMo Original `Multi-blank Transducers <https://arxiv.org/abs/2211.03541>`_
* Streaming/Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference>`_
* Cache-aware Streaming Conformer - `<https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer>`_
* Beam Search decoding
* `Language Modelling for ASR <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
* Streaming and Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference>`_
* `Support of long audios for Conformer with memory efficient local attention <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/results.html#inference-on-long-audio>`_
* `Speech Classification, Speech Command Recognition and Language Identification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (Command Recognition), AmberNet (LangID)
* `Voice activity Detection (VAD) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad>`_: MarbleNet
Expand All @@ -98,12 +106,12 @@ Key Features
* `Punctuation and Capitalization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html>`_
* `Token classification (named entity recognition) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/token_classification.html>`_
* `Text classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_classification.html>`_
* `Joint Intent and Slot Classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/joint_intent_slot.html>`_
* `Question answering <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/question_answering.html>`_
* `GLUE benchmark <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/glue_benchmark.html>`_
* `Information retrieval <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/information_retrieval.html>`_
* `Entity Linking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/entity_linking.html>`_
* `Dialogue State Tracking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/sgd_qa.html>`_
* `Prompt Learning <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/prompt_learning.html>`_
* `NGC collection of pre-trained NLP models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_nlp>`_
* `Synthetic Tabular Data Generation <https://developer.nvidia.com/blog/generating-synthetic-data-with-transformers-a-solution-for-enterprise-data-challenges/>`_
Expand Down Expand Up @@ -170,7 +178,7 @@ We recommend installing NeMo in a fresh Conda environment.
conda create --name nemo python==3.8.10
conda activate nemo

Install PyTorch using their `configurator <https://pytorch.org/get-started/locally/>`_.

.. code-block:: bash

Expand Down Expand Up @@ -237,7 +245,7 @@ Install it manually if not using the NVIDIA PyTorch container.
git checkout 57057e2fcf1c084c0fcc818f55c0ff6ea1b24ae2
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./

We highly recommend using the NVIDIA PyTorch or NeMo container if you have issues installing Apex or any other dependencies.

Installing Apex may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
This error can be avoided by commenting out the check here: https://github.com/NVIDIA/apex/blob/master/setup.py#L32
Expand All @@ -251,21 +259,21 @@ cuda-nvprof is needed to install Apex. The version should match the CUDA version
packaging is also needed:

.. code-block:: bash

pip install packaging


Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo Megatron GPT has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_.
Transformer Engine enables FP8 training on NVIDIA Hopper GPUs.
`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ it manually if not using the NVIDIA PyTorch container.

.. code-block:: bash

pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable

We highly recommend using the NVIDIA PyTorch or NeMo container if you have issues installing Transformer Engine or any other dependencies.

Transformer Engine requires PyTorch to be built with CUDA 11.8.

Expand All @@ -275,15 +283,15 @@ NeMo Text Processing, specifically (Inverse) Text Normalization, is now a separa

Docker containers:
~~~~~~~~~~~~~~~~~~
We release NeMo containers alongside NeMo releases. For example, NeMo ``r1.16.0`` comes with the container ``nemo:23.01``. You can find more details about released containers on the `releases page <https://github.com/NVIDIA/NeMo/releases>`_.

To use a pre-built container, run:

.. code-block:: bash

docker pull nvcr.io/nvidia/nemo:23.01

To build a NeMo container with the Dockerfile from a branch, run:

.. code-block:: bash

Expand Down
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
# It contains the default values for training a streaming Conformer-CTC ASR model, large size (~120M) with CTC loss and sub-word encoding.
# It contains the default values for training a streaming cache-aware Conformer-CTC ASR model, large size (~120M) with CTC loss and sub-word encoding.
# Models trained with this config have limited right context, which makes them efficient for streaming ASR.

# Architecture and training config:
# You may find more detail on the architecture and training config at NeMo/examples/asr/comf/offline/conformer_ctc_bpe.yaml
# You may find more detail:
# Conformer's architecture config: NeMo/examples/asr/conf/conformer/conformer_ctc_bpe.yaml
# Cache-aware Streaming Conformer: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer

# Models trained with this config have limited right context which make them efficient for streaming ASR
# You may use NeMo/examples/asr/speech_to_text_streaming_infer.py to simulate/evaluate this model in cache-aware streaming mode
# You may use NeMo/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py to simulate/evaluate this model in cache-aware streaming mode
# Pre-trained ASR models can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html

# if loss does not go down properly or gives NAN, you may try the followings:
# + using gradient clipping of 1.0
# + increase the warmup steps from 10K to 20K
# Note: if loss does not go down properly or diverges, you may try increasing the warmup steps from 10K to 20K.

name: "Conformer-CTC-BPE-Streaming"

Expand All @@ -24,8 +24,6 @@ model:
shuffle: true
num_workers: 8
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
min_duration: 0.1
# tarred datasets
Expand All @@ -41,18 +39,18 @@ model:
sample_rate: ${model.sample_rate}
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: false
use_start_end_token: false
num_workers: 8
pin_memory: true
use_start_end_token: false

test_ds:
manifest_filepath: null
sample_rate: ${model.sample_rate}
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: false
use_start_end_token: false
num_workers: 8
pin_memory: true
use_start_end_token: false

# recommend small vocab size of 128 or 256 when using 4x sub-sampling
# you may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
Expand All @@ -63,7 +61,7 @@ model:
preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
sample_rate: ${model.sample_rate}
normalize: "per_feature"
normalize: "NA" # No normalization for mel-spectrogram makes streaming easier
window_size: 0.025
window_stride: 0.01
window: "hann"
Expand All @@ -89,9 +87,8 @@ model:
d_model: 512

# Sub-sampling params
# stacking and stacking_norm would result in significant accuracy degradation and training instability with CTC models, recommend to use striding
# stacking_norm is more stable and robust for CTC models and can be around 25% faster during inference
subsampling: striding # vggnet, striding, stacking or stacking_norm, dw_striding
# stacking_norm, stacking and dw_striding can be around 25% faster than striding during inference, while they may give similar or slightly worse results in terms of accuracy for Transducer models
subsampling: striding # vggnet, striding, stacking, stacking_norm, or dw_striding
subsampling_factor: 4 # must be power of 2 for striding and vggnet
subsampling_conv_channels: -1 # -1 sets it to d_model
causal_downsampling: true # enables causal convolutions during downsampling
Expand All @@ -106,7 +103,9 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
att_context_size: [102, 33] # -1 means unlimited context
# for chunked_limited you may calculate the look-ahead or right context by the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 27*4*0.01=1.08s
att_context_size: [140, 27] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited

xscaling: true # scales up the input embeddings by sqrt(d_model)
Expand Down Expand Up @@ -161,7 +160,6 @@ model:
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 10000 # you may try a larger warmup like 20K if training is not stable
warmup_ratio: null
min_lr: 1e-6

trainer:
Expand All @@ -174,7 +172,7 @@ trainer:
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 1.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
precision: 32 # 16, 32, or bf16
log_every_n_steps: 10 # Interval of logging.
enable_progress_bar: True
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
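
The look-ahead formula and the divisibility rule stated in the ``att_context_size`` comments of this config can be checked with a short calculation. The following is a minimal sketch, not part of NeMo: the helper names ``look_ahead_seconds`` and ``is_valid_chunked_limited`` are hypothetical, and the defaults simply mirror ``subsampling_factor: 4`` and ``window_stride: 0.01`` from the config. It assumes both context values are non-negative (it does not handle ``-1`` for unlimited context).

```python
def look_ahead_seconds(att_context_size, subsampling_factor=4, window_stride=0.01):
    """Right-context look-ahead in seconds for att_context_style=chunked_limited.

    Implements the formula from the config comment:
    look-ahead(secs) = att_context_size[1] * subsampling_factor * window_stride
    """
    _, right = att_context_size
    return right * subsampling_factor * window_stride


def is_valid_chunked_limited(att_context_size):
    """Check that the left context is divisible by (right context + 1)."""
    left, right = att_context_size
    return left % (right + 1) == 0


# Compare the previous setting [102, 33] with the new setting [140, 27].
for ctx in ([102, 33], [140, 27]):
    print(ctx, round(look_ahead_seconds(ctx), 2), is_valid_chunked_limited(ctx))
```

Under these assumptions, the new ``[140, 27]`` setting gives the 1.08 s look-ahead quoted in the comment (the previous ``[102, 33]`` gives 1.32 s), and both settings satisfy the divisibility rule.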
Expand Down
Original file line number Diff line number Diff line change
@@ -1,15 +1,14 @@
# It contains the default values for training a streaming Conformer-Transducer ASR model, large size (~120M) with Transducer loss and sub-word encoding.
# It contains the default values for training a streaming cache-aware Conformer-Transducer ASR model, large size (~120M) with Transducer loss and sub-word encoding.
# Models trained with this config have limited right context, which makes them efficient for streaming ASR.

# Architecture and training config:
# You may find more detail on the architecture and training config at NeMo/examples/asr/comf/offline/conformer_transducer_bpe.yaml
# You may find more detail:
# Conformer's architecture config: NeMo/examples/asr/conf/conformer/conformer_transducer_bpe.yaml
# Cache-aware Streaming Conformer: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer

# Models trained with this config have limited right context which make them efficient for streaming ASR
# You may use NeMo/examples/asr/speech_to_text_streaming_infer.py to simulate/evaluate this model in cache-aware streaming mode
# You may use NeMo/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py to simulate/evaluate this model in cache-aware streaming mode
# Pre-trained ASR models can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html

# if loss does not go down properly or gives NAN, you may try the followings by order:
# + using gradient clipping of 1.0
# + increase the warmup steps from 10K to 20K
# + use striding instead of stacking for downsampling
# Note: if loss does not go down properly or diverges, you may try increasing the warmup steps from 10K to 20K.

name: "Conformer-Transducer-BPE-Streaming"

Expand All @@ -30,8 +29,6 @@ model:
shuffle: true
num_workers: 8
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
min_duration: 0.1
# tarred datasets
Expand All @@ -47,18 +44,18 @@ model:
sample_rate: ${model.sample_rate}
batch_size: 16
shuffle: false
use_start_end_token: false
num_workers: 8
pin_memory: true
use_start_end_token: false

test_ds:
manifest_filepath: null
sample_rate: ${model.sample_rate}
batch_size: 16
shuffle: false
use_start_end_token: false
num_workers: 8
pin_memory: true
use_start_end_token: false

# You may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
tokenizer:
Expand All @@ -68,7 +65,7 @@ model:
preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
sample_rate: ${model.sample_rate}
normalize: "per_feature"
normalize: "NA" # No normalization for mel-spectrogram makes streaming easier
window_size: 0.025
window_stride: 0.01
window: "hann"
Expand All @@ -93,8 +90,8 @@ model:
d_model: 512

# Sub-sampling params
# stacking_norm and stacking can be around 25% faster than striding during inference, while they may give similar or slightly worse results in terms of accuracy for Transducer models
subsampling: striding # vggnet, striding, stacking or stacking_norm
# stacking_norm, stacking and dw_striding can be around 25% faster than striding during inference, while they may give similar or slightly worse results in terms of accuracy for Transducer models
subsampling: striding # vggnet, striding, stacking, stacking_norm, or dw_striding
subsampling_factor: 4 # must be power of 2 for striding and vggnet
subsampling_conv_channels: -1 # -1 sets it to d_model
causal_downsampling: true # enables causal convolutions during downsampling
Expand All @@ -116,7 +113,9 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
att_context_size: [102, 33] # -1 means unlimited context
# for chunked_limited you may calculate the look-ahead or right context by the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 27*4*0.01=1.08s
att_context_size: [140, 27] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited

xscaling: true # scales up the input embeddings by sqrt(d_model)
Expand Down Expand Up @@ -169,7 +168,7 @@ model:
# However, to preserve memory, this ratio can be 1:8 or even 1:16.
# Extreme case of 1:B (i.e. fused_batch_size=1) should be avoided as training speed would be very slow.
fuse_loss_wer: true
fused_batch_size: 16
fused_batch_size: 4

jointnet:
joint_hidden: ${model.model_defaults.joint_hidden}
Expand Down Expand Up @@ -201,11 +200,6 @@ model:
fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

# Adds Gaussian noise to the gradients of the decoder to avoid overfitting
variational_noise:
start_step: 0
std: 0.0

optim:
name: adamw
lr: 5.0
Expand All @@ -219,7 +213,6 @@ model:
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 10000 # you may try a larger warmup like 20K if training is not stable
warmup_ratio: null
min_lr: 1e-6

trainer:
Expand All @@ -231,8 +224,8 @@ trainer:
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
gradient_clip_val: 1.0
precision: 32 # 16, 32, or bf16
log_every_n_steps: 10 # Interval of logging.
enable_progress_bar: True
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
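
The advice to raise ``warmup_steps`` from 10K to 20K when training is unstable can be illustrated with a small learning-rate calculation. This is a sketch under a stated assumption: the scheduler name is not visible in the hunks above, so the code assumes a standard Noam-style (inverse square root) schedule, which is what the ``d_model`` and ``warmup_steps`` fields suggest. The function ``noam_lr`` is a hypothetical helper, not NeMo's implementation, and its defaults mirror ``lr: 5.0``, ``d_model: 512``, ``warmup_steps: 10000`` and ``min_lr: 1e-6`` from the config.

```python
def noam_lr(step, lr=5.0, d_model=512, warmup_steps=10000, min_lr=1e-6):
    """Noam-style learning rate (assumed schedule, see the note above).

    Linear warmup for `warmup_steps`, then decay proportional to step ** -0.5,
    scaled by d_model ** -0.5 and clamped from below at `min_lr`.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    scale = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return max(lr * scale, min_lr)


# Peak learning rate is reached at the end of warmup.
for warmup in (10000, 20000):
    peak = noam_lr(warmup, warmup_steps=warmup)
    print(f"warmup_steps={warmup}: peak lr = {peak:.6f}")
```

Under this assumption, the nominal ``lr: 5.0`` is only a multiplier: the effective peak is a few thousandths, and doubling the warmup both lowers that peak and reaches it more slowly, which is why a longer warmup can help a run that diverges early.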