NVIDIA · ekmb · Apr 26, 2023
diff --git a/tutorials/tools/CTC_Segmentation_Tutorial.ipynb b/tutorials/tools/CTC_Segmentation_Tutorial.ipynb
@@ -280,7 +280,7 @@
         "* `max_length` argument - max number of words in a segment for alignment (used only if there are no punctuation marks present in the original text. Long non-speech segments are better for segments split and are more likely to co-occur with punctuation marks. Random text split could deteriorate the quality of the alignment.\n",
         "* out-of-vocabulary words will be removed based on pre-trained ASR model vocabulary, and the text will be changed to lowercase \n",
         "* sentences for alignment with the original punctuation and capitalization will be stored under  `$OUTPUT_DIR/processed/*_with_punct.txt`\n",
-        "* numbers will be converted from written to their spoken form with `num2words` package. For English, it's recommended to use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`tutorials/text_processing/Text_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Text_Normalization.ipynb) for more details). Even `num2words` normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English, German and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. Note, the audio-based normalization of long audio samples is not supported due to multiple normalization options. See [NeMo/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo/blob/stable/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
+        "* numbers will be converted from written to their spoken form with `num2words` package. For English, it's recommended to use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`https://github.com/NVIDIA/NeMo-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA/NeMo-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb) for more details). Even `num2words` normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English, German and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. See [https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
         "\n",
         "### Audio preprocessing:\n",
         "* non '.wav' audio files will be converted to `.wav` format\n",
@@ -714,4 +714,4 @@
       ]
     }
   ]
-}
+}
diff --git a/tutorials/tts/Pronunciation_customization.ipynb b/tutorials/tts/Pronunciation_customization.ipynb
@@ -58,7 +58,7 @@
     "* *[heteronyms](https://en.wikipedia.org/wiki/Heteronym_&#40;linguistics&#41;)* - words with the same spelling but different pronunciations and/or meanings, e.g., *bass* (the fish) and *bass* (the musical instrument).\n",
     "\n",
     "#### Important NeMo flags:\n",
-    "* `your_spec_generator_model.vocab.g2p.phoneme_dict` - phoneme dictionary that maps words to their phonetic transcriptions, e.g., [ARPABET-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/r1.14.0/scripts/tts_dataset_files/cmudict-0.7b_nv22.10) or [IPA-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/r1.14.0/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt)\n",
+    "* `your_spec_generator_model.vocab.g2p.phoneme_dict` - phoneme dictionary that maps words to their phonetic transcriptions, e.g., [ARPABET-based CMU Dictionary](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/scripts/tts_dataset_files/cmudict-0.7b_nv22.10) or [IPA-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/stable/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt)\n",
     "* `your_spec_generator_model.vocab.g2p.heteronyms` - list of the model's heteronyms, grapheme form of these words will be used even if the word is present in the phoneme dictionary.\n",
     "* `your_spec_generator_model.vocab.g2p.ignore_ambiguous_words`: if is set to **True**, words with more than one phonetic representation in the pronunciation dictionary are ignored. This flag is relevant to the words with multiple valid phonetic transcriptions in the dictionary that are not in `your_spec_generator_model.vocab.g2p.heteronyms` list.\n",
     "* `your_spec_generator_model.vocab.phoneme_probability` - phoneme probability flag in the Tokenizer and the same from in the G2P module: `your_spec_generator_model.vocab.g2p.phoneme_probability` ([0, 1]). If a word is present in the phoneme dictionary, we still want our TTS model to see graphemes and phonemes during training to handle OOV words during inference. The `phoneme_probability` determines the probability of an unambiguous dictionary word appearing in phonetic form during model training, `(1 - phoneme_probability)` is the probability of the graphemes. This flag is set to `1` in the parse() method during inference.\n",