Multi-lookahead cache-aware streaming models (#6711)
* added methods.

Signed-off-by: Vahid <[email protected]>

* added methods.

Signed-off-by: Vahid <[email protected]>

* added initial code.

Signed-off-by: Vahid <[email protected]>

* added initial code.

Signed-off-by: Vahid <[email protected]>

* added initial code.

Signed-off-by: Vahid <[email protected]>

* added config files.

Signed-off-by: Vahid <[email protected]>

* fixed bugs.

Signed-off-by: Vahid <[email protected]>

* updated confs.

Signed-off-by: Vahid <[email protected]>

* updated confs.

Signed-off-by: Vahid <[email protected]>

* updated confs.

Signed-off-by: Vahid <[email protected]>

* updated confs.

Signed-off-by: Vahid <[email protected]>

* improved f.conv1d

Signed-off-by: Vahid <[email protected]>

* pulled from main.

Signed-off-by: Vahid <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* pulled from main.

Signed-off-by: Vahid <[email protected]>

* added postpostnorm.

Signed-off-by: Vahid <[email protected]>

* fixed the target continuous bug.

Signed-off-by: Vahid <[email protected]>

* added dw_striding causal.

Signed-off-by: Vahid <[email protected]>

* added print for debugging.

Signed-off-by: Vahid <[email protected]>

* added print for debugging.

Signed-off-by: Vahid <[email protected]>

* fixed causal convolutions.

Signed-off-by: Vahid <[email protected]>

* added _midnorm.

Signed-off-by: Vahid <[email protected]>

* fixed transcribe.

Signed-off-by: Vahid <[email protected]>

* cleaned code.

Signed-off-by: Vahid <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* moved back configs.

Signed-off-by: Vahid <[email protected]>

* moved back configs.

Signed-off-by: Vahid <[email protected]>

* updated fast emit for FC models.

Signed-off-by: Vahid <[email protected]>

* updated fast emit for FC models.

Signed-off-by: Vahid <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed bug.

Signed-off-by: Vahid <[email protected]>

* fixed bug and addressed comments.

Signed-off-by: Vahid <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed configs.

Signed-off-by: Vahid <[email protected]>

* fixed configs.

Signed-off-by: Vahid <[email protected]>

* dropped the test.

Signed-off-by: Vahid <[email protected]>

---------

Signed-off-by: Vahid <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
VahidooX and pre-commit-ci[bot] committed Jun 8, 2023
1 parent 9cca92b commit b67d410
Showing 12 changed files with 272 additions and 158 deletions.
@@ -103,10 +103,16 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited you may calculate the look-ahead or right context by the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 27*4*0.01=1.08s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[140,27],[140,13],[140,2],[140,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [140, 27] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
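To make the two rules above concrete, here is a minimal sketch in plain Python (the helper names are illustrative, not NeMo API) that evaluates the look-ahead formula and checks the chunked_limited divisibility constraint for the default [140, 27] context:

```python
# Minimal sketch of the rules above; function names are illustrative.
def lookahead_secs(att_context_size, subsampling_factor, window_stride):
    # look-ahead(secs) = right context steps * subsampling factor * window stride
    return att_context_size[1] * subsampling_factor * window_stride

def check_chunked_limited(att_context_size):
    # chunked_limited requires: left context divisible by (right context + 1)
    left, right = att_context_size
    if left % (right + 1) != 0:
        raise ValueError(f"left context {left} is not divisible by {right + 1}")

check_chunked_limited([140, 27])           # 140 % 28 == 0, OK
print(lookahead_secs([140, 27], 4, 0.01))  # 27 * 4 * 0.01 = 1.08 s
```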
@@ -113,10 +113,16 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited you may calculate the look-ahead or right context by the following formula:
- # look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 27*4*0.01=1.08s
+ # look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[140,27],[140,13],[140,2],[140,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [140, 27] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
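As a hedged sketch of the multi-lookahead behavior described in these comments, the snippet below draws one context size per training step from att_context_size according to att_context_probs and falls back to the first item otherwise; it mirrors the comments' description, not NeMo's actual implementation:

```python
import random

# Illustrative sketch, not NeMo's implementation: sample a context size per
# training step; use the first entry for test/validation/inference.
att_context_size = [[140, 27], [140, 13], [140, 2], [140, 0]]
att_context_probs = [0.25, 0.25, 0.25, 0.25]

def pick_att_context(training: bool):
    if training:
        return random.choices(att_context_size, weights=att_context_probs, k=1)[0]
    return att_context_size[0]  # default outside of training

print(pick_att_context(training=True))   # e.g. [140, 2]
print(pick_att_context(training=False))  # [140, 27]
```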
@@ -97,10 +97,17 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited you may calculate the look-ahead or right context by the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -100,11 +100,19 @@ model:
n_heads: 8 # may need to be lower for smaller d_models

# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
- # for att_context_style=regular, the right context is recommended to be a small number around 0 to 2 as multiple-layers may increase the effective right context too large
+ # for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
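For intuition on att_context_style=chunked_limited, here is a simplified sketch (an assumption about the masking scheme, not NeMo's exact code) of an attention mask in which each frame sees its own chunk of right+1 steps plus `left` past steps; note how the divisibility constraint makes the left window align with whole chunks:

```python
import numpy as np

# Simplified chunked_limited mask (an assumption, not NeMo's exact code):
# frames are grouped into chunks of (right + 1) steps; each frame attends to
# its whole chunk (up to `right` future steps) plus `left` past steps.
def chunked_limited_mask(T: int, left: int = 70, right: int = 13) -> np.ndarray:
    chunk = right + 1
    mask = np.zeros((T, T), dtype=bool)
    for t in range(T):
        chunk_start = (t // chunk) * chunk
        start = max(0, chunk_start - left)   # left window covers whole past
        end = min(chunk_start + chunk, T)    # chunks, since 70 == 5 * 14
        mask[t, start:end] = True
    return mask

m = chunked_limited_mask(42)
print(m[0, :14].all(), m[0, 14:].any())  # first chunk sees only itself: True False
```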
@@ -102,10 +102,17 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited you may calculate the look-ahead or right context by the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -191,9 +198,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
- # You may enable FastEmit to reduce the latency of the model for streaming
- # It also helps to improve the accuracy of the model in streaming mode
- fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
+ # You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
+ # You may set it to lower values like 1e-3 for models with larger right context
+ fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim:
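For intuition on fastemit_lambda, below is a toy sketch of the FastEmit idea from the linked paper: the gradient through label (non-blank) emissions is scaled by (1 + lambda) to encourage earlier emission. The real computation happens inside the warp-rnnt numba loss kernels; this is illustrative only:

```python
import numpy as np

# Toy sketch of FastEmit (https://arxiv.org/abs/2010.11148): scale the
# gradient of label (non-blank) emissions by (1 + fastemit_lambda), leaving
# blank untouched, nudging the transducer to emit tokens earlier.
# Illustrative only; the real math lives in the transducer loss kernels.
fastemit_lambda = 5e-3
blank_id = 0

rng = np.random.default_rng(0)
grads = rng.normal(size=(2, 6, 5))  # toy (batch, lattice_node, blank+labels)

grads_fastemit = grads.copy()
grads_fastemit[..., blank_id + 1:] *= 1.0 + fastemit_lambda  # boost emissions
```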
@@ -106,11 +106,19 @@ model:
n_heads: 8 # may need to be lower for smaller d_models

# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
- # for att_context_style=regular, the right context is recommended to be a small number around 0 to 2 as multiple-layers may increase the effective right context too large
+ # for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -196,9 +204,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
- # You may enable FastEmit to reduce the latency of the model for streaming
- # It also helps to improve the accuracy of the model in streaming mode
- fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
+ # You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
+ # You may set it to lower values like 1e-3 for models with larger right context
+ fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim:
@@ -8,6 +8,8 @@
# FastConformer-CTC's architecture config: NeMo/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml
# FastConformer-Transducer's architecture config, along with the optimal batch size and precision: NeMo/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml

# Note: if the training loss does not converge, you may increase the warm-up to 20K steps.

name: "FastConformer-Hybrid-Transducer-CTC-BPE-Streaming"

model:
@@ -106,8 +108,15 @@ model:
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -206,9 +215,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
- # You may enable FastEmit to reduce the latency of the model for streaming
- # It also helps to improve the accuracy of the model in streaming mode
- fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
+ # You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
+ # You may set it to lower values like 1e-3 for models with larger right context
+ fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim:
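The warm-up note above concerns the LR scheduler's warm-up steps. As a generic sketch (a Noam-style schedule is assumed here, not necessarily the exact scheduler in this config), raising warmup_steps to 20K delays and lowers the LR peak, which can stabilize training that fails to converge:

```python
# Noam-style LR schedule sketch (assumed shape, not necessarily this config's
# scheduler): a longer warm-up reaches a lower peak LR, later.
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 20000) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1000, 20000, 100000):
    print(s, round(noam_lr(s), 6))  # ramps up to the peak at warmup_steps
```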
@@ -8,6 +8,8 @@
# FastConformer-CTC's architecture config: NeMo/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml
# FastConformer-Transducer's architecture config, along with the optimal batch size and precision: NeMo/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml

# Note: if the training loss does not converge, you may increase the warm-up to 20K steps.

name: "FastConformer-Hybrid-Transducer-CTC-Char-Streaming"

model:
@@ -111,8 +113,15 @@ model:
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as multiple layers may make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, different context sizes are used randomly, with the distribution specified by att_context_probs.
# The first item in the list is the default for test/validation/inference.
# An example of settings for multi-lookahead:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -211,9 +220,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
- # You may enable FastEmit to reduce the latency of the model for streaming
- # It also helps to improve the accuracy of the model in streaming mode
- fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
+ # You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
+ # You may set it to lower values like 1e-3 for models with larger right context
+ fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim: