Tts fixed vocab #6172
Merged
20 commits:
- 3e4e1ce: Draft code for fixing grapheme/phoneme vocabulary (redoctopus)
- 0d80781: Check for grapheme case before filtering (redoctopus)
- 5226e8b: Fix imports and style (redoctopus)
- 557b960: Update import path (redoctopus)
- 1068c99: Fix support for grapheme prefixes (redoctopus)
- bdd663c: Add test for fixed vocab filtering (redoctopus)
- f1063eb: Fix attribute error (redoctopus)
- 003ee6b: Bugfix for attribute error, uncomment decorator (redoctopus)
- 67b7f0e: Remove dataset filtering if fixed_vocab is set (handled by tokenizer) (redoctopus)
- 65064b2: Add tokenizer test for setting fixed vocab (redoctopus)
- c19dbd2: Merge branch 'main' into tts_fixed_vocab (redoctopus)
- 1355ac1: Merge branch 'main' into tts_fixed_vocab (redoctopus)
- 053b5b9: Add preprocessing to fixed vocab for unicode normalization (redoctopus)
- ddfa0e9: Merge branch 'main' into tts_fixed_vocab (redoctopus)
- 1c1ef9a: [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- fbf3c60: Merge branch 'main' into tts_fixed_vocab (redoctopus)
- 0635f47: Merge branch 'main' into tts_fixed_vocab (redoctopus)
- a3bb341: Fix typo, move check for set equality (redoctopus)
- b36daae: Merge branch 'main' into tts_fixed_vocab (redoctopus)
- d59ba2c: Bugfix - use else case (redoctopus)
I wonder if this function takes good care of the case where, e.g., 'ö' can be encoded as b'\xc3\xb6' (one character) as well as b'o\xcc\x88' (two characters). We discussed something similar a long time ago: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py#L96-L101
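For anyone reading along: the two byte sequences above are the precomposed (NFC) and decomposed (NFD) Unicode forms of the same character. A quick standard-library illustration of the issue (not NeMo code, just `unicodedata`):

```python
import unicodedata

# 'ö' as a single precomposed code point, U+00F6 (UTF-8: b'\xc3\xb6') ...
precomposed = b"\xc3\xb6".decode("utf-8")
# ... and as 'o' followed by a combining diaeresis, U+0308 (UTF-8: b'o\xcc\x88')
decomposed = b"o\xcc\x88".decode("utf-8")

# The two strings render identically but are not equal as Python strings,
# so a naive vocabulary membership check would treat them as different symbols.
assert precomposed != decomposed

# Normalizing both inputs to the same form makes them compare equal.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is why normalizing both the vocabulary and the incoming text to one canonical form matters: membership checks only work if both sides use the same encoding of each symbol.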
I'm thinking here that it's reasonable to assume the user has passed in a "correct"/"canonical" version of the symbols they want (mostly I'm assuming they're copy/pasting from a previous config or model).
Are you suggesting we run normalization over the user-supplied fixed vocab?
Hmm, I was thinking we'd apply the same process (calling normalize_unicode_text) as we do now. But in our current implementation this process is applied in g2p/modules.py rather than in tts_tokenizers.py. I guess the replace_symbols func would be a better place?
@rlangman for better comments.
Right now it is done both in tts_tokenizers.py, as part of text_preprocessing_func, and in g2p/modules.py. I would favor putting any text normalization in tts_tokenizers.py where possible.
Alright, added a bit of code to preprocess the fixed vocab symbols in the tokenizer init.
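A minimal sketch of what that kind of init-time preprocessing can look like. This is illustrative only: the helper name and the deduplication details are hypothetical, and NeMo's actual implementation uses its own normalize_unicode_text utility rather than calling `unicodedata` directly.

```python
import unicodedata

def preprocess_fixed_vocab(fixed_vocab):
    """Hypothetical helper: normalize user-supplied vocab symbols so that
    visually identical symbols compare equal regardless of how they were
    encoded, then drop any duplicates that normalization creates."""
    seen = set()
    result = []
    for symbol in fixed_vocab:
        # NFC is used here as the canonical form; the real code may pick
        # a different normalization form.
        normalized = unicodedata.normalize("NFC", symbol)
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result

# Both encodings of 'ö' (precomposed U+00F6 and 'o' + combining U+0308)
# collapse to a single vocabulary entry.
vocab = preprocess_fixed_vocab(["a", "\u00f6", "o\u0308"])
```

Deduplicating after normalization matters because two distinct user-supplied symbols can collapse to the same canonical form, which would otherwise leave a silent duplicate in the vocabulary.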