docs: fix typos (NVIDIA#7758)
Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
2 people authored and pzelasko committed Jan 3, 2024
1 parent e51d0d1 commit fb36f2e
Showing 8 changed files with 10 additions and 10 deletions.
2 changes: 1 addition & 1 deletion docs/source/asr/datasets.rst
@@ -300,7 +300,7 @@ Bucketing Datasets
For training ASR models, audios with different lengths may be grouped into a batch. This makes it necessary to use padding to make them all the same length.
This extra padding is a significant source of computation waste. Splitting the training samples into buckets with different lengths and sampling from the same bucket for each batch would increase the computation efficiency.
It may result in a training speedup of more than 2X. To enable and use the bucketing feature, you need to create the bucketing version of the dataset by using the `conversion script here <https://github.com/NVIDIA/NeMo/tree/stable/scripts/speech_recognition/convert_to_tarred_audio_dataset.py>`_.
- You may use --buckets_num to specify the number of buckets (Recommened to use 4 to 8 buckets). It creates multiple tarred datasets, one per bucket, based on the audio durations. The range of [min_duration, max_duration) is split into equal sized buckets.
+ You may use --buckets_num to specify the number of buckets (Recommend to use 4 to 8 buckets). It creates multiple tarred datasets, one per bucket, based on the audio durations. The range of [min_duration, max_duration) is split into equal sized buckets.


To enable the bucketing feature in the dataset section of the config files, you need to pass the multiple tarred datasets as a list of lists.
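For example, with --buckets_num=4 and audio durations spanning [0 s, 20 s), the conversion script creates four buckets covering [0 s, 5 s), [5 s, 10 s), [10 s, 15 s), and [15 s, 20 s). A minimal sketch of the resulting list-of-lists entry is shown below; the paths and shard pattern are made up for illustration and should be checked against the tarred dataset examples on this page:

model:
  train_ds:
    # one inner list per bucket, ordered by audio duration;
    # buckets are produced by convert_to_tarred_audio_dataset.py with --buckets_num
    manifest_filepath:
      - [/data/bucket1/tarred_audio_manifest.json]
      - [/data/bucket2/tarred_audio_manifest.json]
      - [/data/bucket3/tarred_audio_manifest.json]
      - [/data/bucket4/tarred_audio_manifest.json]
    tarred_audio_filepaths:
      - ["/data/bucket1/audio_{0..127}.tar"]  # shard pattern is illustrative
      - ["/data/bucket2/audio_{0..127}.tar"]
      - ["/data/bucket3/audio_{0..127}.tar"]
      - ["/data/bucket4/audio_{0..127}.tar"]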
2 changes: 1 addition & 1 deletion docs/source/asr/speaker_diarization/configs.rst
@@ -121,7 +121,7 @@ The hyper-parameters for MSDD models are under the ``msdd_module`` key. The mode
num_lstm_layers: 3 # Number of stacked LSTM layers
dropout_rate: 0.5 # Dropout rate
cnn_output_ch: 32 # Number of filters in a conv-net layer.
- conv_repeat: 2 # Determins the number of conv-net layers. Should be greater or equal to 1.
+ conv_repeat: 2 # Determines the number of conv-net layers. Should be greater or equal to 1.
emb_dim: 192 # Dimension of the speaker embedding vectors
scale_n: ${model.scale_n} # Number of scales for multiscale segmentation input
weighting_scheme: 'conv_scale_weight' # Type of weighting algorithm. Options: ('conv_scale_weight', 'attn_scale_weight')
4 changes: 2 additions & 2 deletions docs/source/asr/speaker_diarization/models.rst
@@ -43,7 +43,7 @@ The data flow of the multiscale speaker diarization system is shown in the above
.. image:: images/scale_weight_cnn.png
:align: center
:width: 800px
- :alt: A figure explaing CNN based scale weighting mechanism
+ :alt: A figure explaining CNN based scale weighting mechanism


A neural network model named multi-scale diarization decoder (MSDD) is trained to take advantage of a multi-scale approach by dynamically calculating the weight of each scale. MSDD takes the initial clustering results and compares the extracted speaker embeddings with the cluster-average speaker representation vectors.
@@ -53,7 +53,7 @@ Most importantly, the weight of each scale at each time step is determined throu
.. image:: images/weighted_sum.png
:align: center
:width: 800px
- :alt: A figure explaing weighted sum of cosine similarity values
+ :alt: A figure explaining weighted sum of cosine similarity values

The estimated scale weights are applied to cosine similarity values calculated for each speaker and each scale. The above figure shows the process of calculating the context vector by applying the estimated scale weights on cosine similarity calculated between cluster-average speaker embedding and input speaker embeddings.
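Schematically, this weighted sum can be written as follows; the notation is assumed here for illustration and is not quoted from the MSDD documentation:

.. math::

    d_{k}(t) = \sum_{s=1}^{S} w_{s}(t) \, \cos\big(\mathbf{e}_{s}(t), \mathbf{c}_{k,s}\big)

where :math:`\mathbf{e}_{s}(t)` is the input speaker embedding at scale :math:`s` and time step :math:`t`, :math:`\mathbf{c}_{k,s}` is the cluster-average embedding of speaker :math:`k` at that scale, and :math:`w_{s}(t)` are the estimated scale weights.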

2 changes: 1 addition & 1 deletion docs/source/asr/speech_classification/datasets.rst
@@ -121,7 +121,7 @@ Voxlingua107

VoxLingua107 consists of short speech segments automatically extracted from YouTube videos.
It contains 107 languages. The total amount of speech in the training set is 6628 hours, and 62 hours per language on average but it's highly imbalanced.
- It also includes seperate evaluation set containing 1609 speech segments from 33 languages, validated by at least two volunteers.
+ It also includes separate evaluation set containing 1609 speech segments from 33 languages, validated by at least two volunteers.

You could download dataset from its `official website <http://bark.phon.ioc.ee/voxlingua107/>`_.

2 changes: 1 addition & 1 deletion docs/source/asr/ssl/configs.rst
@@ -240,7 +240,7 @@ where the mlm loss uses targets from the quantization module of the contrastive
We can also use other losses which require labels instead of mlm, such as ctc or rnnt loss. Since these losses, unlike mlm,
don't require our targets to have a direct alignment with our steps, we may also want to set the ``reduce_ids`` parameter of the
- contrastive loss to true, to convert any sequence of consecutive equivalent ids to a single occurence of that id.
+ contrastive loss to true, to convert any sequence of consecutive equivalent ids to a single occurrence of that id.

An example of a ``loss_list`` consisting of contrastive+ctc loss can look like this:
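Purely as a structural sketch (the entries below are placeholders, not the documented example; refer to the SSL configuration files for the actual decoder and loss settings), each loss in the list is paired with its own decoder:

loss_list:
  contrastive:
    decoder:
      # decoder head for the contrastive branch (placeholder)
    loss:
      # contrastive loss settings (placeholder)
      reduce_ids: true  # collapse runs of identical target ids, as described above
  ctc:
    decoder:
      # CTC decoder head over the encoder outputs (placeholder)
    loss:
      # CTC loss settings (placeholder)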

2 changes: 1 addition & 1 deletion docs/source/asr/ssl/intro.rst
@@ -8,7 +8,7 @@ from observed part of the input (e.g., filling in the blanks in a sentence or pr
an image is upright or inverted).

SSL for speech/audio understanding broadly falls into either contrastive or reconstruction
- based approaches. In contrastive methods, models learn by distinguising between true and distractor
+ based approaches. In contrastive methods, models learn by distinguishing between true and distractor
tokens (or latents). Examples of contrastive approaches are Contrastive Predictive Coding (CPC),
Masked Language Modeling (MLM) etc. In reconstruction methods, models learn by directly estimating
the missing (intentionally leftout) portions of the input. Masked Reconstruction, Autoregressive
4 changes: 2 additions & 2 deletions docs/source/core/exp_manager.rst
@@ -265,7 +265,7 @@ name as shown below -
project: "<Add some project name here>"
# HP Search may crash due to various reasons, best to attempt continuation in order to
- # resume from where the last failure case occured.
+ # resume from where the last failure case occurred.
resume_if_exists: true
resume_ignore_no_checkpoint: true
@@ -321,7 +321,7 @@ tracking tool and can simply rerun the best config after the search is finished.

When running hydra scripts, you may sometimes face config issues which crash the program. In NeMo Multi-Run, a crash in
any one run will **not** crash the entire program; we will simply take note of it and move on to the next job. Once all
- jobs are completed, we then raise the error in the order that it occured (it will crash the program with the first error's
+ jobs are completed, we then raise the error in the order that it occurred (it will crash the program with the first error's
stack trace).

In order to debug Multi-Run, we suggest commenting out the full hyper parameter config set inside ``sweep.params``
2 changes: 1 addition & 1 deletion docs/source/tts/models.rst
@@ -116,7 +116,7 @@ End2End Models

VITS
~~~~~~~~~~~~~~~
- VITS is an end-to-end speech synthesis model, which generates raw waveform audios from grapheme/phoneme input. It uses Variational Autoencoder to combine GlowTTS-like spectrogram generator with HiFi-GAN vocoder model. Also, it has separate flow-based duration predictor, which samples alignments from noise with conditioning on text. Please refer to :cite:`tts-models-kim2021conditional` for details. The model is experemental yet, so we do not guarantee clean running.
+ VITS is an end-to-end speech synthesis model, which generates raw waveform audios from grapheme/phoneme input. It uses Variational Autoencoder to combine GlowTTS-like spectrogram generator with HiFi-GAN vocoder model. Also, it has separate flow-based duration predictor, which samples alignments from noise with conditioning on text. Please refer to :cite:`tts-models-kim2021conditional` for details. The model is experimental yet, so we do not guarantee clean running.

.. image:: images/vits_model.png
:align: center
