-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(text): apply global cleaners to symbol sets #408
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #408 +/- ##
==========================================
+ Coverage 73.29% 73.45% +0.16%
==========================================
Files 43 43
Lines 2816 2837 +21
Branches 462 467 +5
==========================================
+ Hits 2064 2084 +20
Misses 668 668
- Partials 84 85 +1 ☔ View full report in Codecov by Sentry. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just started reading the code, but I have to go home, so here's my comment so far.
everyvoice/config/text_config.py
Outdated
for k, v in self.symbols: | ||
normalized_symbols[k] = v | ||
if k in ["punctuation", "silence"]: | ||
continue | ||
normalized_symbols[k] = [ | ||
normalize_text_helper(x, self.to_replace, self.cleaners) for x in v | ||
] | ||
self.symbols = Symbols(**normalized_symbols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not changing which keys are in self.symbols
here, so I don't understand why you can't just do something like:
for k, v in self.symbols:
if k not in ["punctuation", "silence"]:
self.symbols[k] = [
normalize_text_helper(x, self.to_replace, self.cleaners) for x in v
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, this was my first solution, but Pydantic doesn't allow item assignment like this. I get an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, of course, they're attributes rather than dictionary entries. So you can use setattr(self.symbols, k, [ ... ])
instead. I just tested and this works, at least in a toy example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha right, of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the loop with setattr()
(see comments) and handle norm_form = none
, then this will be good.
everyvoice/text/phonemizer.py
Outdated
@@ -40,6 +41,9 @@ def g2p_engine(normalized_input_text: str) -> list[str]: | |||
tokens = tokenise( | |||
text, replace=False, tones=True, strict=False, unknown=True | |||
) | |||
# normalize the output since ipatok applies NFD | |||
unicode_normalization_form = phonemizer.transducers[-1].norm_form.value | |||
tokens = [normalize(unicode_normalization_form, token) for token in tokens] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will raise a ValueError when norm_form is "none"
, which is one of the options we allow in g2p, even though I'm not sure we ever use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right - good point. thanks
refactors text normalization code out to utils so that it can be used before initializing the text config and ensures that the phonemizer applies the same normalization as the final transducer since ipatok outputs NFD fixes #407
8bcf334
to
0d4cc94
Compare
refactors text normalization code out to utils
so that it can be used before initializing the text config
and ensures that the phonemizer applies the same normalization as the final transducer since ipatok outputs NFD
fixes #407
PR Goal?
This PR ensures that symbols are processed with the defined (global) cleaners. And also fixes #407 by ensuring that the tokenizer doesn't change the normalization form of the g2p engine.
Fixes?
#407
Feedback sought?
Just sanity check I didn't do something silly
Priority?
Medium/high, since @MENGZHEGENG would like to base his work on cleaners on top of this.
Tests added?
Basic unit tests added.
How to test?
I think just reviewing the code is sufficient. In particular, I wonder if there's a better way to write the model_validator. I can't directly assign values to items that I don't know the names of in-advance, which is why I have that somewhat funny creation of a new
Symbols
object.If you really want to see it in action:
Create a config, change the symbols to uppercase or NFD or something, and run preprocessing. The uppercase and NFD symbols should have been lowercased and NFC normalized
Confidence?
medium/high
Version change?
patch only