feat(text): apply global cleaners to symbol sets #408

roedoejet · 2024-04-29T20:37:44Z

refactors text normalization code out to utils
so that it can be used before initializing the text config

and ensures that the phonemizer applies the same normalization as the final transducer since ipatok outputs NFD

fixes #407

PR Goal?

This PR ensures that symbols are processed with the defined (global) cleaners. It also fixes #407 by ensuring that the tokenizer doesn't change the normalization form produced by the g2p engine.

Fixes?

#407

Feedback sought?

Just a sanity check that I didn't do something silly.

Priority?

Medium/high, since @MENGZHEGENG would like to base his work on cleaners on top of this.

Tests added?

Basic unit tests added.

How to test?

I think just reviewing the code is sufficient. In particular, I wonder if there's a better way to write the model_validator. I can't directly assign values to items whose names I don't know in advance, which is why I have that somewhat funny creation of a new Symbols object.

If you really want to see it in action:

Create a config, change the symbols to uppercase or NFD or something, and run preprocessing. The uppercase and NFD symbols should have been lowercased and NFC-normalized.
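
For illustration, the expected effect on a symbol set can be sketched with the standard library alone. This is not the `normalize_text_helper` from this PR: `clean_symbol`, `lowercase`, and `nfc` are hypothetical stand-ins, and the sketch assumes the configured cleaners are lowercasing followed by NFC normalization.

```python
import unicodedata
from typing import Callable


def lowercase(text: str) -> str:
    """Hypothetical cleaner: lowercase the text."""
    return text.lower()


def nfc(text: str) -> str:
    """Hypothetical cleaner: compose characters into NFC form."""
    return unicodedata.normalize("NFC", text)


def clean_symbol(symbol: str, cleaners: list[Callable[[str], str]]) -> str:
    """Apply each cleaner in order, as a global cleaner list would be applied."""
    for cleaner in cleaners:
        symbol = cleaner(symbol)
    return symbol


# "E\u0301" is an uppercase E followed by a combining acute accent (NFD form).
raw_symbols = ["A", "E\u0301"]
cleaned = [clean_symbol(s, [lowercase, nfc]) for s in raw_symbols]
```

After cleaning, the decomposed uppercase symbol becomes the single precomposed code point U+00E9, so it matches text that went through the same cleaners.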

Confidence?

medium/high

Version change?

patch only

codecov · 2024-04-29T20:39:39Z

Codecov Report

Attention: Patch coverage is 86.11111% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 73.45%. Comparing base (c863f0c) to head (0d4cc94).

Files	Patch %	Lines
everyvoice/text/utils.py	80.00%	2 Missing and 2 partials ⚠️
everyvoice/text/phonemizer.py	75.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #408      +/-   ##
==========================================
+ Coverage   73.29%   73.45%   +0.16%     
==========================================
  Files          43       43              
  Lines        2816     2837      +21     
  Branches      462      467       +5     
==========================================
+ Hits         2064     2084      +20     
  Misses        668      668              
- Partials       84       85       +1

github-actions · 2024-04-29T20:40:36Z

CLI load time: 0:00.28
Pull Request HEAD: 0d4cc94f61c4b73d56915ad2032cb3b5557bd505
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

joanise

just started reading the code, but I have to go home, so here's my comment so far.

joanise · 2024-04-29T21:30:27Z

everyvoice/config/text_config.py

+        for k, v in self.symbols:
+            normalized_symbols[k] = v
+            if k in ["punctuation", "silence"]:
+                continue
+            normalized_symbols[k] = [
+                normalize_text_helper(x, self.to_replace, self.cleaners) for x in v
+            ]
+        self.symbols = Symbols(**normalized_symbols)


You're not changing which keys are in self.symbols here, so I don't understand why you can't just do something like:

    for k, v in self.symbols:
        if k not in ["punctuation", "silence"]:
            self.symbols[k] = [
                normalize_text_helper(x, self.to_replace, self.cleaners) for x in v
            ]

Yeah, this was my first solution, but Pydantic doesn't allow item assignment like this. I get an error.

Right, of course, they're attributes rather than dictionary entries. So you can use setattr(self.symbols, k, [ ... ]) instead. I just tested and this works, at least in a toy example.

haha right, of course
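
A toy version of the `setattr()` approach might look like the following. It assumes Pydantic v2 and a `Symbols` model that accepts extra fields; the field defaults and the `clean` function are hypothetical and only stand in for the real model and for `normalize_text_helper`.

```python
from pydantic import BaseModel, ConfigDict


class Symbols(BaseModel):
    # Assumption: extra fields are allowed, so symbol set names
    # do not have to be known in advance.
    model_config = ConfigDict(extra="allow")
    silence: list[str] = ["<SIL>"]
    punctuation: list[str] = ["!", "?"]


def clean(symbol: str) -> str:
    """Hypothetical stand-in for the real normalization helper."""
    return symbol.lower()


symbols = Symbols(letters=["A", "B"])

# Item assignment fails because model fields are attributes, not dict entries.
try:
    symbols["letters"] = ["a", "b"]
    item_assignment_works = True
except TypeError:
    item_assignment_works = False

# Iterating over a model yields (name, value) pairs; setattr() updates in place.
# list() takes a snapshot so the model is not mutated while being iterated.
for key, values in list(symbols):
    if key not in ("punctuation", "silence"):
        setattr(symbols, key, [clean(v) for v in values])
```

This keeps the existing `Symbols` instance rather than building a new one from a dictionary, which is the simplification suggested in the review.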

joanise

I would change the loop to use setattr() (see comments) and handle norm_form = none; then this will be good.

joanise · 2024-04-30T14:16:26Z

everyvoice/text/phonemizer.py

@@ -40,6 +41,9 @@ def g2p_engine(normalized_input_text: str) -> list[str]:
            tokens = tokenise(
                text, replace=False, tones=True, strict=False, unknown=True
            )
+            # normalize the output since ipatok applies NFD
+            unicode_normalization_form = phonemizer.transducers[-1].norm_form.value
+            tokens = [normalize(unicode_normalization_form, token) for token in tokens]


This will raise a ValueError when norm_form is "none", which is one of the options we allow in g2p, even though I'm not sure we ever use it.

ah right - good point. thanks
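
One way to guard against this, sketched with the standard library only: `normalize_tokens` is a hypothetical helper, and it assumes the form arrives as a plain string such as "NFC", "NFD", or "none" (for example, from `norm_form.value` in the diff above).

```python
import unicodedata


def normalize_tokens(tokens: list[str], norm_form: str) -> list[str]:
    """Normalize each token unless the configured form is "none"."""
    if norm_form.lower() == "none":
        # unicodedata.normalize() rejects "none", so pass tokens through unchanged.
        return list(tokens)
    return [unicodedata.normalize(norm_form, token) for token in tokens]


# "e\u0301" is an NFD sequence: e followed by a combining acute accent.
nfd_tokens = ["e\u0301", "a"]
as_nfc = normalize_tokens(nfd_tokens, "NFC")
untouched = normalize_tokens(nfd_tokens, "none")

# Without the guard, the standard library raises ValueError for "none".
try:
    unicodedata.normalize("none", "a")
    unguarded_call_raises = False
except ValueError:
    unguarded_call_raises = True
```

Passing tokens through untouched for "none" leaves ipatok's NFD output as-is in that case, which may or may not be the desired behavior.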

roedoejet requested review from SamuelLarkin, joanise and MENGZHEGENG April 29, 2024 20:37

joanise reviewed Apr 29, 2024

joanise requested changes Apr 30, 2024

feat(text): apply global cleaners to symbol sets

0d4cc94

roedoejet force-pushed the dev.ap/phonemize-normalization branch from 8bcf334 to 0d4cc94 Compare April 30, 2024 16:29

roedoejet merged commit c80c8ef into main Apr 30, 2024
4 checks passed

roedoejet deleted the dev.ap/phonemize-normalization branch April 30, 2024 16:39
