Commit c142c17

G2P docs (#4841)
* g2p docs added
  Signed-off-by: ekmb <[email protected]>
* fix references
  Signed-off-by: ekmb <[email protected]>
* address review feedback
  Signed-off-by: ekmb <[email protected]>

Signed-off-by: ekmb <[email protected]>
1 parent d8c513b commit c142c17

11 files changed

+296
-3
lines changed

docs/source/asr/datasets.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -171,7 +171,7 @@ The audio files can be of any format supported by `Pydub <https://github.com/jia
171171
WAV files as they are the default and have been most thoroughly tested.
172172

173173
There should be one manifest file per dataset that will be passed in, therefore, if the user wants separate training and validation
174-
datasets, they should also have separate manifests. Otherwise, thay will be loading validation data with their training data and vice
174+
datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice
175175
versa.
176176

177177
Each line of the manifest should be in the following format:

docs/source/conf.py

+1
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,7 @@
120120
'nlp/text_normalization/tn_itn_all.bib',
121121
'tools/tools_all.bib',
122122
'tts_all.bib',
123+
'text_processing/text_processing_all.bib',
123124
'core/adapters/adapter_bib.bib',
124125
]
125126

docs/source/index.rst

+9
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@ NVIDIA NeMo User Guide
4444
nlp/machine_translation/machine_translation
4545
nlp/text_normalization/intro
4646
nlp/api
47+
nlp/models
4748

4849

4950
.. toctree::
@@ -60,6 +61,14 @@ NVIDIA NeMo User Guide
6061
:caption: Common
6162
:name: Common
6263

64+
text_processing/intro
65+
66+
.. toctree::
67+
:maxdepth: 2
68+
:caption: Text Processing
69+
:name: Text Processing
70+
71+
text_processing/g2p/g2p
6372
common/intro
6473

6574

+209
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,209 @@
1+
.. _g2p:
2+
3+
Grapheme-to-Phoneme Models
4+
==========================
5+
6+
Grapheme-to-phoneme conversion (G2P) is the task of transducing graphemes (i.e., orthographic symbols) to phonemes (i.e., units of the sound system of a language).
7+
For example, in the `International Phonetic Alphabet (IPA) <https://en.wikipedia.org/wiki/International_Phonetic_Alphabet>`__: ``"Swifts, flushed from chimneys …" → "ˈswɪfts, ˈfɫəʃt ˈfɹəm ˈtʃɪmniz …"``.
8+
9+
Modern text-to-speech (TTS) models can learn pronunciations from raw text input and its corresponding audio data,
10+
but by relying on grapheme input during training, such models fail to provide a reliable way of correcting wrong pronunciations. As a result, many TTS systems use phonetic input
11+
during training to directly access and correct pronunciations at inference time. G2P systems allow users to enforce the desired pronunciation by providing a phonetic transcript of the input.
12+
13+
G2P models convert out-of-vocabulary (OOV) words, e.g., proper names and loanwords, as well as heteronyms, to their phonetic form to improve the quality of the synthesized speech.
14+
15+
*Heteronyms* are words that have the same spelling but different pronunciations, e.g., “read” in “I will read the book.” vs. “She read her project last week.” A single model that handles both OOV words and heteronyms and replaces dictionary lookups can significantly simplify the pipeline and improve the quality of synthesized speech.
16+
17+
We support the following G2P models:
18+
19+
* **ByT5 G2P** - a text-to-text model based on the ByT5 :cite:`g2p--xue2021byt5` neural network; this approach was originally proposed in :cite:`g2p--vrezavckova2021t5g2p` and :cite:`g2p--zhu2022byt5`.
20+
21+
* **G2P-Conformer** CTC model - uses a Conformer encoder :cite:`g2p--ggulati2020conformer` followed by a linear decoder; the model is trained with CTC loss. The G2P-Conformer model has about 20 times fewer parameters than the ByT5 model and is non-autoregressive, which makes it faster during inference.
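To illustrate why a CTC-trained, non-autoregressive model can decode in a single pass, here is a minimal sketch of greedy CTC collapsing: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The function name and the ``_`` blank symbol are assumptions made for this illustration; this is not NeMo's decoding code.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_labels, blank="_"):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


if __name__ == "__main__":
    # Each element stands for the most likely label of one output frame.
    print(ctc_greedy_collapse(["k", "k", "_", "æ", "æ", "_", "t"]))
```

A blank between two identical labels keeps them distinct, so ``["l", "_", "l"]`` collapses to ``ll`` rather than ``l``.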
22+
23+
The models can be trained using words or sentences as input.
24+
If trained with sentence-level input, the models can handle out-of-vocabulary (OOV) words and heteronyms along with unambiguous words in a single pass.
25+
See :ref:`Sentence-level Dataset Preparation Pipeline <sentence_level_dataset_pipeline>` on how to label data for G2P model training.
26+
27+
Additionally, we support a purpose-built BERT-based classification model for heteronym disambiguation; see :ref:`this section <bert_heteronym_cl>` for details.
28+
29+
Model Training, Evaluation and Inference
30+
----------------------------------------
31+
32+
This section covers both ByT5 and G2P-Conformer models.
33+
34+
The models take input data in ``.json`` manifest format, and there should be separate training and validation manifests.
35+
Each line of the manifest should be in the following format:
36+
37+
.. code::
38+
39+
{"text_graphemes": "Swifts, flushed from chimneys.", "text": "ˈswɪfts, ˈfɫəʃt ˈfɹəm ˈtʃɪmniz."}
40+
41+
Manifest fields:
42+
43+
* ``text`` - name of the field in manifest_filepath for ground truth phonemes
44+
45+
* ``text_graphemes`` - name of the field in manifest_filepath for input grapheme text
46+
47+
The models can handle input with and without punctuation marks.
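As a minimal sketch of the manifest format described above, the following writes one JSON object per line with the ``text_graphemes`` and ``text`` fields and reads it back. The helper names ``write_g2p_manifest`` and ``read_g2p_manifest`` are hypothetical and not part of NeMo.

```python
import json


def write_g2p_manifest(pairs, path):
    """Write (graphemes, phonemes) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for graphemes, phonemes in pairs:
            entry = {"text_graphemes": graphemes, "text": phonemes}
            # ensure_ascii=False keeps IPA symbols readable in the file.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_g2p_manifest(path):
    """Read a manifest back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    pairs = [("Swifts, flushed from chimneys.", "ˈswɪfts, ˈfɫəʃt ˈfɹəm ˈtʃɪmniz.")]
    write_g2p_manifest(pairs, "train_manifest.json")
    print(read_g2p_manifest("train_manifest.json"))
```

Training and validation data would each go through a separate call, so that the two manifests stay in separate files.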
48+
49+
To train the ByT5 G2P model and evaluate it at the end of training, run:
50+
51+
.. code::
52+
53+
python examples/text_processing/g2p/g2p_train_and_evaluate.py \
54+
# (Optional: --config-path=<Path to dir of configs> --config-name=<name of config without .yaml>) \
55+
model.train_ds.manifest_filepath="<Path to manifest file>" \
56+
model.validation_ds.manifest_filepath="<Path to manifest file>" \
57+
model.test_ds.manifest_filepath="<Path to manifest file>" \
58+
trainer.devices=1 \
59+
do_training=True \
60+
do_testing=True
61+
62+
Example of the config file: ``NeMo/examples/text_processing/g2p/conf/t5_g2p.yaml``.
63+
64+
65+
To train the G2P-Conformer model and evaluate it at the end of training, run:
66+
67+
.. code::
68+
69+
python examples/text_processing/g2p/g2p_train_and_evaluate.py \
70+
# (Optional: --config-path=<Path to dir of configs> --config-name=<name of config without .yaml>) \
71+
model.train_ds.manifest_filepath="<Path to manifest file>" \
72+
model.validation_ds.manifest_filepath="<Path to manifest file>" \
73+
model.test_ds.manifest_filepath="<Path to manifest file>" \
74+
model.tokenizer.dir=<Path to pretrained tokenizer> \
75+
model.tokenizer_grapheme.do_lower=False \
76+
model.tokenizer_grapheme.add_punctuation=True \
77+
trainer.devices=1 \
78+
79+
do_training=True \
80+
do_testing=True
81+
82+
Example of the config file: ``NeMo/examples/text_processing/g2p/conf/g2p_conformer_ctc.yaml``.
83+
84+
85+
To evaluate a pretrained G2P model, run:
86+
87+
.. code::
88+
89+
python examples/text_processing/g2p/g2p_train_and_evaluate.py \
90+
# (Optional: --config-path=<Path to dir of configs> --config-name=<name of config without .yaml>) \
91+
pretrained_model="<Path to .nemo file or pretrained model name from list_available_models()>" \
92+
model.test_ds.manifest_filepath="<Path to manifest file>" \
93+
trainer.devices=1 \
94+
do_training=False \
95+
do_testing=True
96+
97+
To run inference with a pretrained G2P model, run:
98+
99+
.. code-block::
100+
101+
python g2p_inference.py \
102+
pretrained_model="<Path to .nemo file or pretrained model name for G2PModel from list_available_models()>" \
103+
manifest_filepath="<Path to .json manifest>" \
104+
output_file="<Path to .json manifest to save prediction>" \
105+
batch_size=32 \
106+
num_workers=4 \
107+
pred_field="pred_text"
108+
109+
The model's predictions are saved in the ``pred_field`` field of ``output_file``.
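Once predictions are written, they can be compared against the ground-truth ``text`` field. The sketch below computes a character-level error rate with a plain Levenshtein distance; it assumes the output manifest keeps the ``text`` field next to ``pred_text``, and it is not necessarily the metric reported by the NeMo scripts. All function names are hypothetical.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution/match
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phoneme_error_rate(entries, ref_field="text", pred_field="pred_text"):
    """Total edit distance divided by total reference length, over all entries."""
    errors = sum(edit_distance(e[ref_field], e[pred_field]) for e in entries)
    total = sum(len(e[ref_field]) for e in entries)
    return errors / total if total else 0.0


def load_manifest(path):
    """Load a JSON-lines manifest into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Note that this counts every character, including stress marks and spaces, as one unit; a stricter evaluation would first split the strings into phoneme tokens.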
110+
111+
.. _sentence_level_dataset_pipeline:
112+
113+
Sentence-level Dataset Preparation Pipeline
114+
-------------------------------------------
115+
116+
Here is an overview of the data labeling pipeline for sentence-level G2P model training:
117+
118+
.. image:: images/data_labeling_pipeline.png
119+
:align: center
120+
:alt: Data labeling pipeline for sentence-level G2P model training
121+
:scale: 70%
122+
123+
Here we describe the automatic phoneme-labeling process for generating augmented data. The figure above shows the phoneme-labeling steps to prepare data for sentence-level G2P model training. We first convert known unambiguous words to their phonetic pronunciations with dictionary lookups, e.g., using the CMU dictionary.
124+
Next, we automatically label heteronyms using a RAD-TTS Aligner :cite:`g2p--badlani2022one`. More details on how to disambiguate heteronyms with a pretrained Aligner model can be found in `NeMo/tutorials/tts/Aligner_Inference_Examples.ipynb <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/tts/Aligner_Inference_Examples.ipynb>`__, which can also be opened in `Google Colab <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Aligner_Inference_Examples.ipynb>`_.
125+
Finally, we mask out OOV words with a special masking token, shown as “<unk>” in the figure above (note that we use the ``model.tokenizer_grapheme.unk_token="҂"`` symbol during G2P model training).
126+
Using this unknown token forces a G2P model to produce the same masking token as a phonetic representation during training. During inference, the model generates phoneme predictions for OOV words without emitting the masking token as long as this token is not included in the grapheme input.
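The three labeling steps can be sketched as follows. This is a toy illustration under stated assumptions: the lexicon is a hand-written dictionary rather than the CMU dictionary, the ``disambiguate`` callback is a hypothetical stand-in for the Aligner-based heteronym labeling, and the word pattern is deliberately simple.

```python
import re

UNK = "҂"  # masking token mentioned in the text above


def label_sentence(sentence, lexicon, disambiguate, unk=UNK):
    """Replace each word with its phonemes, or with `unk` if it is OOV.

    lexicon: dict mapping a lowercase word to a list of pronunciations.
    disambiguate: callable(word, candidates) -> chosen pronunciation,
        used only for words with more than one pronunciation (heteronyms).
    """

    def replace_word(match):
        word = match.group(0)
        candidates = lexicon.get(word.lower())
        if not candidates:  # step 3: OOV word, mask it
            return unk
        if len(candidates) == 1:  # step 1: unambiguous dictionary lookup
            return candidates[0]
        return disambiguate(word, candidates)  # step 2: heteronym

    # Only alphabetic runs are replaced; punctuation and spacing are preserved.
    return re.sub(r"[A-Za-z']+", replace_word, sentence)


if __name__ == "__main__":
    toy_lexicon = {"swifts": ["ˈswɪfts"], "read": ["ˈɹid", "ˈɹɛd"]}
    pick_first = lambda word, candidates: candidates[0]
    print(label_sentence("Swifts read Zyzzyva.", toy_lexicon, pick_first))
```

Because OOV words map to the masking token on both the grapheme and phoneme side during training, the model is never asked to learn a guessed pronunciation for them.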
127+
128+
129+
130+
.. _bert_heteronym_cl:
131+
132+
Purpose-built BERT-based classification model for heteronym disambiguation
133+
--------------------------------------------------------------------------
134+
135+
HeteronymClassificationModel is a BERT-based :cite:`g2p--ddevlin2018bert` token classification model that can handle multiple heteronyms at once. The model takes a sentence as input and, for every word, selects a heteronym option out of the available forms.
136+
We mask irrelevant forms to disregard the model’s predictions for non-ambiguous words. E.g., given the input “The Poems are simple to read and easy to comprehend.” the model scores possible {READ_PRESENT and READ_PAST} options for the word “read”.
137+
Possible heteronym forms are extracted from the WikipediaHomographData :cite:`g2p--gorman2018improving`.
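The masking idea can be illustrated with a small sketch: given scores over all labels, only the forms that belong to the current heteronym are considered. The function name, the score dictionary and the ``wordid_map`` structure are assumptions made for this example and do not reflect the model's internal implementation.

```python
def pick_word_id(scores, word, wordid_map):
    """Return the best-scoring label among the forms allowed for `word`.

    scores: dict mapping every label to a model score.
    wordid_map: dict mapping a heteronym to its list of candidate labels.
    Labels of other words are ignored, which is the effect of masking them.
    """
    candidates = wordid_map[word]
    return max(candidates, key=lambda label: scores.get(label, float("-inf")))


if __name__ == "__main__":
    scores = {"read_present": 0.2, "read_past": 0.5, "diffuse_vrb": 0.9}
    wordid_map = {"read": ["read_present", "read_past"]}
    # "diffuse_vrb" has the highest raw score but is not a form of "read".
    print(pick_word_id(scores, "read", wordid_map))
```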
138+
139+
The model expects input to be in ``.json`` manifest format, where each line contains at least the following fields:
140+
141+
.. code::
142+
143+
{"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
144+
145+
Manifest fields:
146+
147+
* ``text_graphemes`` - input sentence
148+
149+
* ``start_end`` - beginning and end of the heteronym span in the input sentence
150+
151+
* ``homograph_span`` - heteronym word in the sentence
152+
153+
* ``word_id`` - heteronym label, e.g., the word ``diffuse`` has the following possible labels: ``diffuse_vrb`` and ``diffuse_adj``. See `https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv <https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv>`__ for more details.
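In the example entry above, ``start_end`` behaves like a Python slice: the start index is inclusive and the end index is exclusive, so ``text_graphemes[23:30]`` is ``diffuse``. The sketch below builds and checks such an entry; ``make_entry`` and ``check_entry`` are hypothetical helpers, and taking the first occurrence of the word is a simplification.

```python
def make_entry(sentence, heteronym, word_id):
    """Build a manifest entry for the first occurrence of `heteronym`."""
    start = sentence.find(heteronym)
    if start < 0:
        raise ValueError(f"{heteronym!r} not found in sentence")
    return {
        "text_graphemes": sentence,
        "start_end": [start, start + len(heteronym)],  # end is exclusive
        "homograph_span": heteronym,
        "word_id": word_id,
    }


def check_entry(entry):
    """Return True if `start_end` slices out exactly `homograph_span`."""
    start, end = entry["start_end"]
    return entry["text_graphemes"][start:end] == entry["homograph_span"]


if __name__ == "__main__":
    sentence = "Oxygen is less able to diffuse into the blood, leading to hypoxia."
    entry = make_entry(sentence, "diffuse", "diffuse_vrb")
    print(entry["start_end"], check_entry(entry))
```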
154+
155+
To convert the WikipediaHomographData to the ``.json`` format suitable for HeteronymClassificationModel training, run:
156+
157+
.. code-block::
158+
159+
# WikipediaHomographData can be downloaded from https://github.com/google-research-datasets/WikipediaHomographData
160+
161+
python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
162+
--data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/eval/ \
163+
--output=eval.json
164+
python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
165+
--data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/train/ \
166+
--output=train.json
167+
168+
To train and evaluate the model, run:
169+
170+
.. code-block::
171+
172+
python heteronym_classification_train_and_evaluate.py \
173+
train_manifest="<Path to manifest file>" \
174+
validation_manifest="<Path to manifest file>" \
175+
model.encoder.pretrained="<Path to .nemo file or pretrained model name from list_available_models()>" \
176+
model.wordids=<Path to wordids.tsv file, similar to https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv> \
177+
do_training=True \
178+
do_testing=True
179+
180+
181+
To run inference with a pretrained HeteronymClassificationModel, run:
182+
183+
.. code-block::
184+
185+
python heteronym_classification_inference.py \
186+
manifest="<Path to .json manifest>" \
187+
pretrained_model="<Path to .nemo file or pretrained model name from list_available_models()>" \
188+
output_file="<Path to .json manifest to save prediction>"
189+
190+
Note that if the input manifest contains the target ``word_id`` field, evaluation is also performed. During inference, the model predicts the heteronym ``word_id`` and saves predictions in the ``pred_text`` field of ``output_file``:
191+
192+
.. code::
193+
194+
{"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "pred_text": "diffuse_vrb", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
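When both ``word_id`` and ``pred_text`` are present, a simple accuracy can be recomputed from the output manifest. This is a minimal sketch with a hypothetical helper name; it only compares the two fields and is not the evaluation code used by the script.

```python
import json


def heteronym_accuracy(entries, target_field="word_id", pred_field="pred_text"):
    """Fraction of entries whose predicted label equals the target label."""
    scored = [e for e in entries if target_field in e and pred_field in e]
    if not scored:
        return 0.0
    correct = sum(e[target_field] == e[pred_field] for e in scored)
    return correct / len(scored)


if __name__ == "__main__":
    line = (
        '{"text_graphemes": "Oxygen is less able to diffuse into the blood, '
        'leading to hypoxia.", "pred_text": "diffuse_vrb", "start_end": [23, 30], '
        '"homograph_span": "diffuse", "word_id": "diffuse_vrb"}'
    )
    print(heteronym_accuracy([json.loads(line)]))
```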
195+
196+
197+
Requirements
198+
------------
199+
200+
G2P requires the NeMo NLP and ASR collections to be installed. See `Installation instructions <https://github.com/NVIDIA/NeMo/blob/main/docs/source/starthere/intro.rst#installation>`__ for more details.
201+
202+
203+
References
204+
----------
205+
206+
.. bibliography:: ../text_processing_all.bib
207+
:style: plain
208+
:labelprefix: g2p-
209+
:keyprefix: g2p--

docs/source/text_processing/intro.rst

+12
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
NeMo Text Processing
2+
====================
3+
4+
NeMo provides a set of models for processing the text input and/or output of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models: \
5+
`https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing <https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing>`__ .
6+
7+
.. toctree::
8+
:maxdepth: 1
9+
10+
g2p
11+
12+
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
@article{xue2021byt5,
2+
title={ByT5: Towards a token-free future with pre-trained byte-to-byte models},
3+
author={Xue, Linting and Barua, Aditya and Constant, Noah and Al-Rfou, Rami and Narang, Sharan and Kale, Mihir and Roberts, Adam and Raffel, Colin},
4+
journal={arXiv preprint arXiv:2105.13626},
5+
year={2021}
6+
}
7+
8+
@article{vrezavckova2021t5g2p,
9+
title={T5g2p: Using text-to-text transfer transformer for grapheme-to-phoneme conversion},
10+
author={{\v{R}}ez{\'a}{\v{c}}kov{\'a}, Mark{\'e}ta and {\v{S}}vec, Jan and Tihelka, Daniel},
11+
year={2021},
12+
journal={International Speech Communication Association}
13+
}
14+
15+
@article{zhu2022byt5,
16+
title={ByT5 model for massively multilingual grapheme-to-phoneme conversion},
17+
author={Zhu, Jian and Zhang, Cong and Jurgens, David},
18+
journal={arXiv preprint arXiv:2204.03067},
19+
year={2022}
20+
}
21+
22+
@article{ggulati2020conformer,
23+
title={Conformer: Convolution-augmented transformer for speech recognition},
24+
author={Gulati, Anmol and Qin, James and Chiu, Chung-Cheng and Parmar, Niki and Zhang, Yu and Yu, Jiahui and Han, Wei and Wang, Shibo and Zhang, Zhengdong and Wu, Yonghui and others},
25+
journal={arXiv preprint arXiv:2005.08100},
26+
year={2020}
27+
}
28+
29+
@article{ddevlin2018bert,
30+
title={Bert: Pre-training of deep bidirectional transformers for language understanding},
31+
author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
32+
journal={arXiv preprint arXiv:1810.04805},
33+
year={2018}
34+
}
35+
36+
@inproceedings{gorman2018improving,
37+
title={Improving homograph disambiguation with supervised machine learning},
38+
author={Gorman, Kyle and Mazovetskiy, Gleb and Nikolaev, Vitaly},
39+
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
40+
year={2018}
41+
}
42+
43+
44+
@inproceedings{badlani2022one,
45+
title={One TTS alignment to rule them all},
46+
author={Badlani, Rohan and {\L}a{\'n}cucki, Adrian and Shih, Kevin J and Valle, Rafael and Ping, Wei and Catanzaro, Bryan},
47+
booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
48+
pages={6092--6096},
49+
year={2022},
50+
organization={IEEE}
51+
}
52+
53+

examples/text_processing/g2p/conf/g2p_conformer_ctc.yaml

+2-2
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,7 @@ model:
2727
feat_in: ${model.embedding.d_model}
2828
feat_out: -1 # you may set it if you need different output size other than the default d_model
2929
n_layers: 16
30-
d_model: 256
30+
d_model: 176
3131

3232
# Sub-sampling params
3333
subsampling: null # vggnet or striding, vggnet may give better results but needs more memory
@@ -39,7 +39,7 @@ model:
3939

4040
# Multi-headed Attention Module's params
4141
self_attention_model: rel_pos # rel_pos or abs_pos
42-
n_heads: 8 # may need to be lower for smaller d_models
42+
n_heads: 4 # may need to be lower for smaller d_models
4343
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
4444
att_context_size: [ -1, -1 ] # -1 means unlimited context
4545
xscaling: true # scales up the input embeddings by sqrt(d_model)

examples/text_processing/g2p/g2p_train_and_evaluate.py

+4
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,8 @@
3838
trainer.devices=1 \
3939
do_training=True \
4040
do_testing=True
41+
42+
Example of the config file: NeMo/examples/text_processing/g2p/conf/t5_g2p.yaml
4143
4244
# Training Conformer-G2P Model and evaluation at the end of training:
4345
python examples/text_processing/g2p/g2p_train_and_evaluate.py \
@@ -50,6 +52,8 @@
5052
do_training=True \
5153
do_testing=True
5254
55+
Example of the config file: NeMo/examples/text_processing/g2p/conf/g2p_conformer_ctc.yaml
56+
5357
# Run evaluation of the pretrained model:
5458
python examples/text_processing/g2p/g2p_train_and_evaluate.py \
5559
# (Optional: --config-path=<Path to dir of configs> --config-name=<name of config without .yaml>) \

examples/text_processing/g2p/heteronym_classification_inference.py

+2
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,8 @@
3131
This script runs inference with HeteronymClassificationModel
3232
If the input manifest contains target "word_id", evaluation will be also performed.
3333
34+
To prepare dataset, see NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py
35+
3436
python heteronym_classification_inference.py \
3537
manifest="<Path to .json manifest>" \
3638
pretrained_model="<Path to .nemo file or pretrained model name from list_available_models()>" \

examples/text_processing/g2p/heteronym_classification_train_and_evaluate.py

+3
Original file line numberDiff line numberDiff line change
@@ -27,11 +27,14 @@
2727
"""
2828
This script runs training and evaluation of HeteronymClassificationModel
2929
30+
To prepare dataset, see NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py
31+
3032
To run training and testing:
3133
python heteronym_classification_train_and_evaluate.py \
3234
train_manifest=<Path to manifest file>" \
3335
validation_manifest=<Path to manifest file>" \
3436
model.encoder.pretrained="<Path to .nemo file or pretrained model name from list_available_models()>" \
37+
model.wordids=<Path to wordids.tsv file, similar to https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv> \
3538
do_training=True \
3639
do_testing=True
3740
"""
