Tts fixed vocab #6172
Conversation
Force-pushed from 8652ca7 to 8bd8806
Signed-off-by: Jocelyn Huang <[email protected]>
Force-pushed from 8bd8806 to 003ee6b
LGTM. Could you please add a unit test for `IPATokenizer` with the `fixed_vocab` param on and off, at `tests.collections.common.tokenizers.text_to_speech.test_tts_tokenizers`? Just make sure the behavior of the overridden symbol set is as expected. Thanks.
Signed-off-by: Jocelyn Huang <[email protected]>
Added a test to make sure the phoneme dict in G2P and the tokenization output are as expected.
```python
tokens = set(g2p.symbols)
# Build tokens list if fixed_vocab isn't set
if fixed_vocab:
    tokens = set(fixed_vocab)
```
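The override logic in the snippet above can be exercised standalone; here is a minimal sketch (the function name and standalone form are illustrative, not NeMo's actual API):

```python
def build_token_set(g2p_symbols, fixed_vocab=None):
    """Default to the G2P symbol set, but let a user-supplied
    fixed_vocab completely override it (mirrors the snippet above)."""
    tokens = set(g2p_symbols)
    if fixed_vocab:
        tokens = set(fixed_vocab)
    return tokens

# The G2P symbols are used unless fixed_vocab is provided.
assert build_token_set(["a", "b", "c"]) == {"a", "b", "c"}
# A non-empty fixed_vocab replaces the G2P symbol set entirely.
assert build_token_set(["a", "b", "c"], ["a", "x"]) == {"a", "x"}
```

Note that an empty `fixed_vocab` is falsy, so it leaves the G2P symbols in place rather than producing an empty token set.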
I wonder if this function takes good care of the case where, e.g., 'ö' can be encoded as b'\xc3\xb6' (one char) as well as b'o\xcc\x88' (two chars). We discussed something similar a long time ago: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py#L96-L101
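To make the concern concrete, Python's `unicodedata` module shows how those two byte sequences decode to different strings until they are normalized to a common form (NFC is used here for illustration; which form `normalize_unicode_text` applies is a question for the linked code):

```python
import unicodedata

# 'ö' has two valid UTF-8 encodings: a single precomposed code point,
# and 'o' followed by a combining diaeresis.
composed = b"\xc3\xb6".decode("utf-8")     # one code point: U+00F6
decomposed = b"o\xcc\x88".decode("utf-8")  # two code points: U+006F U+0308

# Naive string comparison treats them as different symbols...
assert composed != decomposed
# ...but they become equal after normalizing to the same form.
assert unicodedata.normalize("NFC", decomposed) == composed
```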
I'm thinking here that it's reasonable to assume the user has passed in a "correct"/"canonical" version of the symbols they want (mostly I'm assuming they're copy/pasting from a previous config or model).
Are you suggesting we run normalization over the user-provided fixed vocab?
Hmm... I was thinking we should apply the same process (calling `normalize_unicode_text`) as we do now. But in our current implementation this process is applied in `g2p/modules.py` rather than in `tts_tokenizers.py`. I guess the `replace_symbols` func would be a better place?
Tagging @rlangman, who can comment better on this.
Right now it is done both in `tts_tokenizers.py` as part of `text_preprocessing_func`, as well as in `g2p/modules.py`. I would favor putting any text normalization in `tts_tokenizers.py` where possible.
Alright, added a bit of code to preprocess the fixed vocab symbols in the tokenizer init.
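A sketch of what that preprocessing could look like (hypothetical standalone function; NeMo's actual init, and the normalization form it applies, may differ):

```python
import unicodedata

def preprocess_fixed_vocab(fixed_vocab):
    """Normalize each user-supplied symbol so composed and decomposed
    unicode forms map to the same token (assumes NFC normalization)."""
    return [unicodedata.normalize("NFC", symbol) for symbol in fixed_vocab]

# 'o' + combining diaeresis collapses to the single code point 'ö',
# so it matches the same symbol entered in precomposed form.
assert preprocess_fixed_vocab(["o\u0308", "a"]) == ["\u00f6", "a"]
```

Applying this once in the tokenizer init means the rest of the pipeline only ever sees one canonical spelling of each symbol.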
Signed-off-by: Jocelyn Huang <[email protected]>
* Draft code for fixing grapheme/phoneme vocabulary
* Check for grapheme case before filtering
* Fix imports and style
* Update import path
* Fix support for grapheme prefixes
* Add test for fixed vocab filtering
* Fix attribute error
* Bugfix for attribute error, uncomment decorator
* Remove dataset filtering if fixed_vocab is set (handled by tokenizer)
* Add tokenizer test for setting fixed vocab
* Add preprocessing to fixed vocab for unicode normalization

Signed-off-by: Jocelyn Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hsiehjackson <[email protected]>
What does this PR do?

Allows users to pass in a grapheme + phoneme list/set as an enforced symbol vocabulary when using the IPA classes. Other symbols, such as punctuation, OOV tokens, etc., are still handled by other tokenizer arguments.

Once set, G2P dictionary entries are filtered: pronunciations containing out-of-vocab symbols are dropped if `keep_alternate=True` (default), otherwise the word is removed entirely. The `TTSDataset` will also filter out all manifest entries whose normalized forms contain any illegal graphemes.

Note: Users should take care that the passed-in vocab is consistent with `grapheme_prefix` and `grapheme_case`, or else filtering may not work properly.

Collection: TTS
Changelog
- Added a `replace_symbols()` function to `IPAG2P` to enforce the user-set vocab and filter the existing G2P dict
- Updated `IPATokenizer` to support taking in a user-set vocab
- Updated `TTSDataset` to check for illegal graphemes if a fixed vocab is used
- `replace_symbols()`
PR Type: