
Tts fixed vocab #6172

Merged 20 commits into main from tts_fixed_vocab on Mar 20, 2023

Conversation

redoctopus (Collaborator):

What does this PR do?

Allows users to pass in a grapheme and phoneme list/set as an enforced symbol vocabulary when using the IPA classes. Other symbols, such as punctuation and OOV tokens, are still handled by the tokenizer's other arguments.

Once set, G2P dictionary entries will be filtered in the following way:

  • Words containing any illegal graphemes are removed entirely.
  • Words with a single pronunciation that contains any illegal phonemes are removed entirely.
  • If a word has multiple pronunciations and every one of them contains at least one illegal phoneme, the word is removed entirely.
  • If at least one pronunciation is valid, the valid pronunciations are kept when keep_alternate=True (the default); otherwise the word is removed entirely.

The TTSDataset will also filter out all manifest entries whose normalized forms have any illegal graphemes.

Note: Users should take care that the passed-in vocab is consistent with grapheme_prefix and grapheme_case, or else filtering may not work properly.
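The filtering rules above can be sketched as a small standalone function. The name filter_g2p_dict and its arguments are illustrative stand-ins (the PR implements this as replace_symbols() on IPAG2P); only the filtering logic follows the description.

```python
def filter_g2p_dict(g2p_dict, graphemes, phonemes, keep_alternate=True):
    """Filter a word -> list-of-pronunciations dict against fixed symbol sets."""
    filtered = {}
    for word, prons in g2p_dict.items():
        # Drop words containing any illegal grapheme.
        if any(ch not in graphemes for ch in word):
            continue
        # Keep only pronunciations whose phonemes are all legal.
        valid = [pron for pron in prons if all(ph in phonemes for ph in pron)]
        # Drop the word entirely if no pronunciation survives.
        if not valid:
            continue
        # Keep surviving alternates only if keep_alternate is True; otherwise
        # any invalid pronunciation removes the whole word.
        if keep_alternate or len(valid) == len(prons):
            filtered[word] = valid
    return filtered
```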

Collection: TTS

Changelog

  • Added a replace_symbols() function to IPAG2P to enforce user-set vocab and filter existing G2P dict
  • Modified IPATokenizer to support taking in a user-set vocab
  • Modified TTSDataset to check for illegal graphemes if a fixed vocab is used
  • Added unit tests for replace_symbols()
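As a minimal illustration of the IPATokenizer change, the sketch below shows a tokenizer whose symbol set can be overridden by a user-supplied vocab at init time. FixedVocabTokenizer and its constructor arguments are hypothetical stand-ins, not the NeMo API.

```python
class FixedVocabTokenizer:
    def __init__(self, g2p_symbols, fixed_vocab=None):
        # Default: take the symbol set from the G2P module.
        tokens = set(g2p_symbols)
        # If the user supplies a fixed vocab, it overrides the G2P symbols.
        if fixed_vocab:
            tokens = set(fixed_vocab)
        self.tokens = sorted(tokens)
        self.token2id = {t: i for i, t in enumerate(self.tokens)}

    def encode(self, symbols):
        # Drop symbols outside the enforced vocabulary.
        return [self.token2id[s] for s in symbols if s in self.token2id]
```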

PR Type:

  • New Feature
  • Bugfix
  • Documentation


XuesongYang (Collaborator) left a comment:


LGTM. Could you please add a unit test for IPATokenizer with the fixed_vocab parameter on and off in tests.collections.common.tokenizers.text_to_speech.test_tts_tokenizers? Just make sure the behavior of the overridden symbol set is as expected. Thanks.

redoctopus (Collaborator, Author) replied:

> LGTM. Could you please add a unit test for IPATokenizer with the fixed_vocab parameter on and off in tests.collections.common.tokenizers.text_to_speech.test_tts_tokenizers? Just make sure the behavior of the overridden symbol set is as expected. Thanks.

Added a test to make sure the phoneme dict in G2P and the tokenization output is as expected.

rlangman previously approved these changes Mar 15, 2023
```python
tokens = set(g2p.symbols)
# Override the default token list if fixed_vocab is set
if fixed_vocab:
    tokens = set(fixed_vocab)
```
XuesongYang (Collaborator), Mar 15, 2023:

I wonder if this function takes good care of the case where, e.g., 'ö' can be encoded as b'\xc3\xb6' (one character) as well as b'o\xcc\x88' (two characters). We discussed something similar a long time ago: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py#L96-L101
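The two encodings in question can be demonstrated with Python's standard unicodedata module (an illustration of the concern, not code from the PR):

```python
import unicodedata

# Two byte sequences that both render as the character 'ö':
nfc = b'\xc3\xb6'.decode('utf-8')   # single code point U+00F6 (composed, NFC)
nfd = b'o\xcc\x88'.decode('utf-8')  # 'o' + U+0308 combining diaeresis (decomposed, NFD)

print(nfc == nfd)                                # False: distinct strings
print(unicodedata.normalize('NFC', nfd) == nfc)  # True once both are normalized
```

So without normalizing the user-supplied vocab, a symbol stored in one form would fail to match text arriving in the other.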

redoctopus (Collaborator, Author), Mar 15, 2023:

I'm thinking here that it's reasonable to assume that the user has passed in a "correct"/"canonical" version of the symbols they want (mostly I'm assuming they're copy/pasting from a previous config or model).

Are you suggesting we run normalize over the user input fixed vocab?

XuesongYang (Collaborator):

Hmm... I was thinking of applying the same process (calling normalize_unicode_text) as we do now. But in the current implementation this is applied in g2p/modules.py rather than in tts_tokenizers.py. I guess the replace_symbols function would be a better place?

XuesongYang (Collaborator):

@rlangman for better comments.

rlangman (Collaborator):

Right now it is done both in tts_tokenizers.py as part of text_preprocessing_func, as well as in g2p/modules.py. I would favor putting any text normalization in tts_tokenizers.py where possible.

redoctopus (Collaborator, Author):

Alright, added a bit of code to preprocess the fixed vocab symbols in the tokenizer init.

XuesongYang previously approved these changes Mar 20, 2023
XuesongYang merged commit 1c2e45a into main Mar 20, 2023
XuesongYang deleted the tts_fixed_vocab branch March 20, 2023 22:54
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* Draft code for fixing grapheme/phoneme vocabulary
* Check for grapheme case before filtering
* Fix imports and style
* Update import path
* Fix support for grapheme prefixes
* Add test for fixed vocab filtering
* Fix attribute error
* Bugfix for attribute error, uncomment decorator
* Remove dataset filtering if fixed_vocab is set (handled by tokenizer)
* Add tokenizer test for setting fixed vocab
* Add preprocessing to fixed vocab for unicode normalization

Signed-off-by: Jocelyn Huang <[email protected]>

---------

Signed-off-by: Jocelyn Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hsiehjackson <[email protected]>