docs: fix typos (NVIDIA#7758)
Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
2 people authored and pzelasko committed Jan 3, 2024
1 parent e51d0d1 commit fb36f2e
Showing 8 changed files with 10 additions and 10 deletions.
2 changes: 1 addition & 1 deletion docs/source/asr/datasets.rst
@@ -300,7 +300,7 @@ Bucketing Datasets
For training ASR models, audios with different lengths may be grouped into a batch. This makes it necessary to use padding to make them all the same length.
This extra padding is a significant source of computation waste. Splitting the training samples into buckets with different lengths and sampling from the same bucket for each batch would increase the computation efficiency.
It may result in a training speedup of more than 2X. To enable and use the bucketing feature, you need to create the bucketing version of the dataset by using the `conversion script here <https://github.com/NVIDIA/NeMo/tree/stable/scripts/speech_recognition/convert_to_tarred_audio_dataset.py>`_.
- You may use --buckets_num to specify the number of buckets (Recommened to use 4 to 8 buckets). It creates multiple tarred datasets, one per bucket, based on the audio durations. The range of [min_duration, max_duration) is split into equal sized buckets.
+ You may use --buckets_num to specify the number of buckets (Recommend to use 4 to 8 buckets). It creates multiple tarred datasets, one per bucket, based on the audio durations. The range of [min_duration, max_duration) is split into equal sized buckets.


To enable the bucketing feature in the dataset section of the config files, you need to pass the multiple tarred datasets as a list of lists.
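For example, with --buckets_num=4 and audio durations spanning [0 s, 20 s), the conversion script creates four buckets covering [0 s, 5 s), [5 s, 10 s), [10 s, 15 s), and [15 s, 20 s). A minimal sketch of the resulting list-of-lists entry is shown below; the paths and shard pattern are made up for illustration and should be checked against the tarred dataset examples on this page:

model:
  train_ds:
    # one inner list per bucket, ordered by audio duration;
    # buckets are produced by convert_to_tarred_audio_dataset.py with --buckets_num
    manifest_filepath:
      - [/data/bucket1/tarred_audio_manifest.json]
      - [/data/bucket2/tarred_audio_manifest.json]
      - [/data/bucket3/tarred_audio_manifest.json]
      - [/data/bucket4/tarred_audio_manifest.json]
    tarred_audio_filepaths:
      - ["/data/bucket1/audio_{0..127}.tar"]  # shard pattern is illustrative
      - ["/data/bucket2/audio_{0..127}.tar"]
      - ["/data/bucket3/audio_{0..127}.tar"]
      - ["/data/bucket4/audio_{0..127}.tar"]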
2 changes: 1 addition & 1 deletion docs/source/asr/speaker_diarization/configs.rst
@@ -121,7 +121,7 @@ The hyper-parameters for MSDD models are under the ``msdd_module`` key. The mode
num_lstm_layers: 3 # Number of stacked LSTM layers
dropout_rate: 0.5 # Dropout rate
cnn_output_ch: 32 # Number of filters in a conv-net layer.
- conv_repeat: 2 # Determins the number of conv-net layers. Should be greater or equal to 1.
+ conv_repeat: 2 # Determines the number of conv-net layers. Should be greater or equal to 1.
emb_dim: 192 # Dimension of the speaker embedding vectors
scale_n: ${model.scale_n} # Number of scales for multiscale segmentation input
weighting_scheme: 'conv_scale_weight' # Type of weighting algorithm. Options: ('conv_scale_weight', 'attn_scale_weight')
4 changes: 2 additions & 2 deletions docs/source/asr/speaker_diarization/models.rst
@@ -43,7 +43,7 @@ The data flow of the multiscale speaker diarization system is shown in the above
.. image:: images/scale_weight_cnn.png
:align: center
:width: 800px
- :alt: A figure explaing CNN based scale weighting mechanism
+ :alt: A figure explaining CNN based scale weighting mechanism


A neural network model named multi-scale diarization decoder (MSDD) is trained to take advantage of a multi-scale approach by dynamically calculating the weight of each scale. MSDD takes the initial clustering results and compares the extracted speaker embeddings with the cluster-average speaker representation vectors.
@@ -53,7 +53,7 @@ Most importantly, the weight of each scale at each time step is determined throu
.. image:: images/weighted_sum.png
:align: center
:width: 800px
- :alt: A figure explaing weighted sum of cosine similarity values
+ :alt: A figure explaining weighted sum of cosine similarity values

The estimated scale weights are applied to cosine similarity values calculated for each speaker and each scale. The above figure shows the process of calculating the context vector by applying the estimated scale weights on cosine similarity calculated between cluster-average speaker embedding and input speaker embeddings.
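Schematically, this weighted sum can be written as follows; the notation is assumed here for illustration and is not quoted from the MSDD documentation:

.. math::

    d_{k}(t) = \sum_{s=1}^{S} w_{s}(t) \, \cos\big(\mathbf{e}_{s}(t), \mathbf{c}_{k,s}\big)

where :math:`\mathbf{e}_{s}(t)` is the input speaker embedding at scale :math:`s` and time step :math:`t`, :math:`\mathbf{c}_{k,s}` is the cluster-average embedding of speaker :math:`k` at that scale, and :math:`w_{s}(t)` are the estimated scale weights.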

2 changes: 1 addition & 1 deletion docs/source/asr/speech_classification/datasets.rst
@@ -121,7 +121,7 @@ Voxlingua107

VoxLingua107 consists of short speech segments automatically extracted from YouTube videos.
It contains 107 languages. The total amount of speech in the training set is 6628 hours, and 62 hours per language on average but it's highly imbalanced.
- It also includes seperate evaluation set containing 1609 speech segments from 33 languages, validated by at least two volunteers.
+ It also includes separate evaluation set containing 1609 speech segments from 33 languages, validated by at least two volunteers.

You could download dataset from its `official website <http://bark.phon.ioc.ee/voxlingua107/>`_.

2 changes: 1 addition & 1 deletion docs/source/asr/ssl/configs.rst
@@ -240,7 +240,7 @@ where the mlm loss uses targets from the quantization module of the contrastive
We can also use other losses which require labels instead of mlm, such as ctc or rnnt loss. Since these losses, unlike mlm,
don't require our targets to have a direct alignment with our steps, we may also want to set the ``reduce_ids`` parameter of the
- contrastive loss to true, to convert any sequence of consecutive equivalent ids to a single occurence of that id.
+ contrastive loss to true, to convert any sequence of consecutive equivalent ids to a single occurrence of that id.

An example of a ``loss_list`` consisting of contrastive+ctc loss can look like this:
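Purely as a structural sketch (the entries below are placeholders, not the documented example; refer to the SSL configuration files for the actual decoder and loss settings), each loss in the list is paired with its own decoder:

loss_list:
  contrastive:
    decoder:
      # decoder head for the contrastive branch (placeholder)
    loss:
      # contrastive loss settings (placeholder)
      reduce_ids: true  # collapse runs of identical target ids, as described above
  ctc:
    decoder:
      # CTC decoder head over the encoder outputs (placeholder)
    loss:
      # CTC loss settings (placeholder)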

2 changes: 1 addition & 1 deletion docs/source/asr/ssl/intro.rst
@@ -8,7 +8,7 @@ from observed part of the input (e.g., filling in the blanks in a sentence or pr
an image is upright or inverted).

SSL for speech/audio understanding broadly falls into either contrastive or reconstruction
- based approaches. In contrastive methods, models learn by distinguising between true and distractor
+ based approaches. In contrastive methods, models learn by distinguishing between true and distractor
tokens (or latents). Examples of contrastive approaches are Contrastive Predictive Coding (CPC),
Masked Language Modeling (MLM) etc. In reconstruction methods, models learn by directly estimating
the missing (intentionally leftout) portions of the input. Masked Reconstruction, Autoregressive
4 changes: 2 additions & 2 deletions docs/source/core/exp_manager.rst
@@ -265,7 +265,7 @@ name as shown below -
project: "<Add some project name here>"
# HP Search may crash due to various reasons, best to attempt continuation in order to
- # resume from where the last failure case occured.
+ # resume from where the last failure case occurred.
resume_if_exists: true
resume_ignore_no_checkpoint: true
@@ -321,7 +321,7 @@ tracking tool and can simply rerun the best config after the search is finished.

When running hydra scripts, you may sometimes face config issues which crash the program. In NeMo Multi-Run, a crash in
any one run will **not** crash the entire program; we will simply take note of it and move on to the next job. Once all
- jobs are completed, we then raise the error in the order that it occured (it will crash the program with the first error's
+ jobs are completed, we then raise the error in the order that it occurred (it will crash the program with the first error's
stack trace).

In order to debug Multi-Run, we suggest commenting out the full hyper parameter config set inside ``sweep.params``
2 changes: 1 addition & 1 deletion docs/source/tts/models.rst
@@ -116,7 +116,7 @@ End2End Models

VITS
~~~~~~~~~~~~~~~
- VITS is an end-to-end speech synthesis model, which generates raw waveform audios from grapheme/phoneme input. It uses Variational Autoencoder to combine GlowTTS-like spectrogram generator with HiFi-GAN vocoder model. Also, it has separate flow-based duration predictor, which samples alignments from noise with conditioning on text. Please refer to :cite:`tts-models-kim2021conditional` for details. The model is experemental yet, so we do not guarantee clean running.
+ VITS is an end-to-end speech synthesis model, which generates raw waveform audios from grapheme/phoneme input. It uses Variational Autoencoder to combine GlowTTS-like spectrogram generator with HiFi-GAN vocoder model. Also, it has separate flow-based duration predictor, which samples alignments from noise with conditioning on text. Please refer to :cite:`tts-models-kim2021conditional` for details. The model is experimental yet, so we do not guarantee clean running.

.. image:: images/vits_model.png
:align: center
