Tts fixed vocab #6172
Conversation
Force-pushed from 8652ca7 to 8bd8806
Signed-off-by: Jocelyn Huang <[email protected]>
Force-pushed from 8bd8806 to 003ee6b
LGTM. Could you please add a unit test for `IPATokenizer` with the `fixed_vocab` param on and off, at `tests.collections.common.tokenizers.text_to_speech.test_tts_tokenizers`? Just make sure the behavior of the overridden symbol set is as expected. Thanks.
Signed-off-by: Jocelyn Huang <[email protected]>
Added a test to make sure the phoneme dict in G2P and the tokenization output are as expected.
```python
tokens = set(g2p.symbols)
# Build tokens list if fixed_vocab isn't set
if fixed_vocab:
    tokens = set(fixed_vocab)
```
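The override logic in the snippet above can be exercised standalone; here is a minimal sketch (the function name and standalone form are illustrative, not NeMo's actual API):

```python
def build_token_set(g2p_symbols, fixed_vocab=None):
    """Default to the G2P symbol set, but let a user-supplied
    fixed_vocab completely override it (mirrors the snippet above)."""
    tokens = set(g2p_symbols)
    if fixed_vocab:
        tokens = set(fixed_vocab)
    return tokens

# The G2P symbols are used unless fixed_vocab is provided.
assert build_token_set(["a", "b", "c"]) == {"a", "b", "c"}
# A non-empty fixed_vocab replaces the G2P symbol set entirely.
assert build_token_set(["a", "b", "c"], ["a", "x"]) == {"a", "x"}
```

Note that an empty `fixed_vocab` is falsy, so it leaves the G2P symbols in place rather than producing an empty token set.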
I wonder if this function takes good care of the case where, e.g., 'ö' can be encoded as b'\xc3\xb6' (one char) as well as b'o\xcc\x88' (two chars). We discussed something similar a long time ago: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py#L96-L101
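To make the concern concrete, Python's `unicodedata` module shows how those two byte sequences decode to different strings until they are normalized to a common form (NFC is used here for illustration; which form `normalize_unicode_text` applies is a question for the linked code):

```python
import unicodedata

# 'ö' has two valid UTF-8 encodings: a single precomposed code point,
# and 'o' followed by a combining diaeresis.
composed = b"\xc3\xb6".decode("utf-8")     # one code point: U+00F6
decomposed = b"o\xcc\x88".decode("utf-8")  # two code points: U+006F U+0308

# Naive string comparison treats them as different symbols...
assert composed != decomposed
# ...but they become equal after normalizing to the same form.
assert unicodedata.normalize("NFC", decomposed) == composed
```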
I'm thinking here that it's reasonable to assume the user has passed in a "correct"/"canonical" version of the symbols they want (mostly I'm assuming they're copy/pasting from a previous config or model).
Are you suggesting we run normalization over the user-provided fixed vocab?
Hmm... I was thinking we should apply the same process (calling `normalize_unicode_text`) as we do now. But in our current implementation this process is applied in `g2p/modules.py` rather than in `tts_tokenizers.py`. I guess the `replace_symbols` func would be a better place?
Tagging @rlangman, who can comment better on this.
Right now it is done both in `tts_tokenizers.py` as part of `text_preprocessing_func`, as well as in `g2p/modules.py`. I would favor putting any text normalization in `tts_tokenizers.py` where possible.
Alright, added a bit of code to preprocess the fixed vocab symbols in the tokenizer init.
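A sketch of what that preprocessing could look like (hypothetical standalone function; NeMo's actual init, and the normalization form it applies, may differ):

```python
import unicodedata

def preprocess_fixed_vocab(fixed_vocab):
    """Normalize each user-supplied symbol so composed and decomposed
    unicode forms map to the same token (assumes NFC normalization)."""
    return [unicodedata.normalize("NFC", symbol) for symbol in fixed_vocab]

# 'o' + combining diaeresis collapses to the single code point 'ö',
# so it matches the same symbol entered in precomposed form.
assert preprocess_fixed_vocab(["o\u0308", "a"]) == ["\u00f6", "a"]
```

Applying this once in the tokenizer init means the rest of the pipeline only ever sees one canonical spelling of each symbol.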
Signed-off-by: Jocelyn Huang <[email protected]>
* Draft code for fixing grapheme/phoneme vocabulary
* Check for grapheme case before filtering
* Fix imports and style
* Update import path
* Fix support for grapheme prefixes
* Add test for fixed vocab filtering
* Fix attribute error
* Bugfix for attribute error, uncomment decorator
* Remove dataset filtering if fixed_vocab is set (handled by tokenizer)
* Add tokenizer test for setting fixed vocab
* Add preprocessing to fixed vocab for unicode normalization

Signed-off-by: Jocelyn Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hsiehjackson <[email protected]>
What does this PR do?

Allows users to pass in a grapheme + phoneme list/set as an enforced symbol vocabulary when using the IPA classes. Other symbols, such as punctuation, OOV tokens, etc., are still handled by other tokenizer arguments.

Once set, G2P dictionary entries are filtered: pronunciations containing out-of-vocab symbols are dropped if `keep_alternate=True` (default), otherwise the word is removed entirely. The `TTSDataset` will also filter out all manifest entries whose normalized forms contain any illegal graphemes.

Note: Users should take care that the passed-in vocab is consistent with `grapheme_prefix` and `grapheme_case`, or else filtering may not work properly.

Collection: TTS
Changelog
- Added a `replace_symbols()` function to `IPAG2P` to enforce the user-set vocab and filter the existing G2P dict
- Updated `IPATokenizer` to support taking in a user-set vocab
- Updated `TTSDataset` to check for illegal graphemes if a fixed vocab is used
- `replace_symbols()`
PR Type: