feat(text): apply global cleaners to symbol sets #408

Merged
roedoejet merged 1 commit into main from dev.ap/phonemize-normalization
Apr 30, 2024

Conversation

@roedoejet roedoejet commented Apr 29, 2024

Refactors the text normalization code out into utils so that it can be used before the text config is initialized.

Also ensures that the phonemizer applies the same normalization as the final transducer, since ipatok outputs NFD.

fixes #407

PR Goal?

This PR ensures that symbols are processed with the globally defined cleaners. It also fixes #407 by ensuring that the tokenizer doesn't change the normalization form produced by the g2p engine.
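To illustrate why the normalization form matters here, the following is a minimal sketch using only Python's standard `unicodedata` module. The symbol set and the token are made up for illustration and are not taken from the EveryVoice code; the point is only that an NFD token does not match an NFC symbol until it is re-normalized.

```python
import unicodedata

# Hypothetical symbol set, stored in NFC (precomposed "é" is one code point).
symbols = {"\u00e9", "a", "b"}

# A tokenizer that emits NFD returns "e" followed by a combining acute accent.
token_nfd = unicodedata.normalize("NFD", "\u00e9")

# Without re-normalization the lookup misses, although the text renders identically.
found_raw = token_nfd in symbols

# Re-normalizing the token to the configured form restores the match.
found_normalized = unicodedata.normalize("NFC", token_nfd) in symbols

print(len(token_nfd), found_raw, found_normalized)
```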

Fixes?

#407

Feedback sought?

Just a sanity check that I didn't do something silly.

Priority?

Medium/high, since @MENGZHEGENG would like to base his work on cleaners on top of this.

Tests added?

Basic unit tests added.

How to test?

I think just reviewing the code is sufficient. In particular, I wonder if there's a better way to write the model_validator. I can't directly assign values to items whose names I don't know in advance, which is why I create a new Symbols object in that somewhat odd way.

If you really want to see it in action:

Create a config, change the symbols to uppercase or NFD or something similar, and run preprocessing. The uppercase and NFD symbols should come out lowercased and NFC-normalized.
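The expected outcome of that manual test can be sketched roughly as follows. This is a standalone illustration, not the EveryVoice implementation: the `lower` and `nfc` cleaners, the `clean_symbol_sets` helper, and the plain-dict symbol sets are all hypothetical names chosen for the example. Only the idea of skipping the "punctuation" and "silence" keys comes from the code under review.

```python
import unicodedata
from typing import Callable

Cleaner = Callable[[str], str]

def lower(text: str) -> str:
    # Hypothetical cleaner: lowercase the text.
    return text.lower()

def nfc(text: str) -> str:
    # Hypothetical cleaner: normalize the text to NFC.
    return unicodedata.normalize("NFC", text)

def apply_cleaners(text: str, cleaners: list[Cleaner]) -> str:
    # Apply each cleaner in order.
    for cleaner in cleaners:
        text = cleaner(text)
    return text

def clean_symbol_sets(
    symbols: dict[str, list[str]],
    cleaners: list[Cleaner],
    skip: tuple[str, ...] = ("punctuation", "silence"),
) -> dict[str, list[str]]:
    # Clean every symbol set except the skipped ones, which are copied as-is.
    return {
        name: list(values)
        if name in skip
        else [apply_cleaners(s, cleaners) for s in values]
        for name, values in symbols.items()
    }

# "E" + combining acute is an uppercase, NFD-style symbol.
symbols = {"letters": ["A", "E\u0301"], "silence": ["<SIL>"]}
cleaned = clean_symbol_sets(symbols, [lower, nfc])
print(cleaned)
```

The cleaners run in list order, so lowercasing happens before NFC composition in this sketch.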

Confidence?

medium/high

Version change?

patch only

codecov bot commented Apr 29, 2024

Codecov Report

Attention: Patch coverage is 86.11111%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 73.45%. Comparing base (c863f0c) to head (0d4cc94).

Files                            Patch %   Lines
everyvoice/text/utils.py         80.00%    2 Missing and 2 partials ⚠️
everyvoice/text/phonemizer.py    75.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #408      +/-   ##
==========================================
+ Coverage   73.29%   73.45%   +0.16%     
==========================================
  Files          43       43              
  Lines        2816     2837      +21     
  Branches      462      467       +5     
==========================================
+ Hits         2064     2084      +20     
  Misses        668      668              
- Partials       84       85       +1     


github-actions bot commented Apr 29, 2024

CLI load time: 0:00.28
Pull Request HEAD: 0d4cc94f61c4b73d56915ad2032cb3b5557bd505
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

@joanise joanise left a comment

just started reading the code, but I have to go home, so here's my comment so far.

Comment on lines 98 to 105
for k, v in self.symbols:
    normalized_symbols[k] = v
    if k in ["punctuation", "silence"]:
        continue
    normalized_symbols[k] = [
        normalize_text_helper(x, self.to_replace, self.cleaners) for x in v
    ]
self.symbols = Symbols(**normalized_symbols)

You're not changing which keys are in self.symbols here, so I don't understand why you can't just do something like:

for k, v in self.symbols:
    if k not in ["punctuation", "silence"]:
        self.symbols[k] = [
            normalize_text_helper(x, self.to_replace, self.cleaners) for x in v
        ]

Member Author

Yea, this was my first solution, but Pydantic doesn't allow item assignment like this. I get an error.

@joanise joanise Apr 30, 2024

Right, of course, they're attributes rather than dictionary entries. So you can use setattr(self.symbols, k, [ ... ]) instead. I just tested and this works, at least in a toy example.

Member Author

haha right, of course
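The difference between item assignment and `setattr()` discussed in this thread can be sketched without Pydantic. The `SymbolsStandIn` class below is a hypothetical stand-in, not the real `Symbols` model: it only mimics the two properties that matter here, namely that fields are attributes rather than dictionary entries and that iterating yields `(name, value)` pairs. The lowercasing step stands in for the real `normalize_text_helper` call.

```python
class SymbolsStandIn:
    """Hypothetical stand-in for a model whose fields are attributes."""

    def __init__(self, **fields):
        for name, value in fields.items():
            setattr(self, name, value)

    def __iter__(self):
        # Yield (field_name, value) pairs from a snapshot of the attributes.
        return iter(list(vars(self).items()))

symbols = SymbolsStandIn(letters=["A", "B"], silence=["<SIL>"])

# Item assignment fails because the object defines no __setitem__.
try:
    symbols["letters"] = ["a", "b"]
    item_assignment_failed = False
except TypeError:
    item_assignment_failed = True

# setattr() works with field names that are only known at runtime.
for name, values in symbols:
    if name not in ("punctuation", "silence"):
        setattr(symbols, name, [v.lower() for v in values])

print(item_assignment_failed, symbols.letters, symbols.silence)
```

Whether a real Pydantic model accepts the `setattr()` call also depends on its configuration (for example, frozen models reject assignment), so this is only a sketch of the mechanism.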

@joanise joanise left a comment

I would change the loop with setattr() (see comments) and handle norm_form = none, then this will be good.

@@ -40,6 +41,9 @@ def g2p_engine(normalized_input_text: str) -> list[str]:
        tokens = tokenise(
            text, replace=False, tones=True, strict=False, unknown=True
        )
        # normalize the output since ipatok applies NFD
        unicode_normalization_form = phonemizer.transducers[-1].norm_form.value
        tokens = [normalize(unicode_normalization_form, token) for token in tokens]

This will raise a ValueError when norm_form is "none", which is one of the options we allow in g2p, even though I'm not sure we ever use it.

Member Author

ah right - good point. thanks
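One way to guard against the "none" case is sketched below. The `normalize_tokens` helper is a hypothetical name for illustration and is not necessarily how the merged code handles it; it relies only on the standard-library behaviour that `unicodedata.normalize` raises `ValueError` for an unrecognized form.

```python
import unicodedata

def normalize_tokens(tokens: list[str], norm_form: str) -> list[str]:
    # Pass tokens through unchanged when normalization is disabled.
    if norm_form.lower() == "none":
        return list(tokens)
    # Otherwise apply the requested Unicode normalization form to each token.
    return [unicodedata.normalize(norm_form, token) for token in tokens]

# Calling unicodedata.normalize directly with "none" raises ValueError.
try:
    unicodedata.normalize("none", "a")
    raised = False
except ValueError:
    raised = True

tokens_nfd = ["e\u0301", "a"]
as_nfc = normalize_tokens(tokens_nfd, "NFC")
unchanged = normalize_tokens(tokens_nfd, "none")
print(raised, as_nfc, unchanged)
```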

@roedoejet roedoejet force-pushed the dev.ap/phonemize-normalization branch from 8bcf334 to 0d4cc94 Compare April 30, 2024 16:29
@roedoejet roedoejet merged commit c80c8ef into main Apr 30, 2024
4 checks passed
@roedoejet roedoejet deleted the dev.ap/phonemize-normalization branch April 30, 2024 16:39
Successfully merging this pull request may close these issues.

Read in symbols in NFC/NFD according to the user's choice on text normalization
2 participants