Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of inputs #348

Merged
merged 36 commits into from
Apr 11, 2024
Merged

Improve handling of inputs #348

merged 36 commits into from
Apr 11, 2024

Conversation

roedoejet
Copy link
Member

@roedoejet roedoejet commented Mar 20, 2024

Fixes #216
Fixes: #361

This PR implements the following:

  • We require the user to indicate whether a filelist represents characters, phones, or arpabet. The text column header is no longer used by the preprocessor. The Wizard converts the text column in a filelist to either characters, phones, or arpabet.
  • We always handle arpabet filelists by converting the data into phones.
  • The wizard guesses the symbol set by applying Unicode Grapheme Clustering rules instead of just iterating through each character in the filelist. This is done after the wizard applies the defined cleaners to the text.
  • Added a g2p module that specifies all supported g2p engines and ISO codes. TODO: Add documentation for adding a custom g2p engine: add documentation for adding a custom g2p engine #354
  • If the user provides a filelist representing characters and indicates that the language of their dataset is one that is supported by a supported g2p engine, then phones will be automatically calculated, and phone tokens will be guessed using IPA tokenization library ipatok.
  • I removed all the symbol questions in the wizard
  • Text tensors are not saved to disk anymore. Instead, they are saved to filelists where the tokenization can be inspected (like h/e/l/l/o h/o/w a/r/e y/o/u). The DataLoader then encodes them on-the-fly. Multi-hot phonological features are still saved to disk.
  • Models can specify whether they would like to train using characters, phones, or phonological features.
  • All punctuation is mapped to specific categories. TODO: allow different encoding of word-final punctuation: punctuation config should handle word-final punctuation #355
  • When synthesizing, you also now need to specify which representation of text you are synthesizing from (i.e. everyvoice synthesize from-text model.ckpt --text 'hello' --text-representation 'characters')

Minor changes:

  • The pad token is changed from _ to \x80 (Unicode Pad token)
  • whitespace is handled internally (not by the text config)
  • consolidated the code we use to load filelists. clarify in the wizard whether we are loading a list of lists or a list of dicts.
  • update features.py to use numpy everywhere instead of just adding lists of ints together.
  • refactored the TextProcessor to consolidate methods and added doctests for improved clarity

Related PRs:

Remaining issues:

To illustrate the major changes related to preprocessing, previously for LJ text we would provide a filelist with an ambiguous text column and receive the following:

basename text language speaker
LJ001-0001 printing, in the only sense eng default

Whereas now, still providing same information, we get the following:

basename characters phones language speaker character_tokens phone_tokens
LJ001-0001 printing, in the only sense pɹɪntɪŋ, ɪn ðʌ oʊnli sɛns eng default p/r/i/n/t/i/n/g/<SB>/ /i/n/ /t/h/e/ /o/n/l/y/ /s/e/n/s/e/ p/ɹ/ɪ/n/t/ɪ/ŋ/<SB>/ /ɪ/n/ /ð/ʌ/ /o/ʊ/n/l/i/ /s/ɛ/n/s

@roedoejet roedoejet force-pushed the dev.ap/inputs branch 2 times, most recently from 2ce7fbc to 15e3226 Compare March 22, 2024 00:10
Copy link

codecov bot commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 83.69352% with 83 lines in your changes are missing coverage. Please review.

Project coverage is 73.15%. Comparing base (0a99b5d) to head (887a633).
Report is 1 commits behind head on main.

Files Patch % Lines
everyvoice/text/text_processor.py 81.96% 15 Missing and 7 partials ⚠️
everyvoice/preprocessor/preprocessor.py 74.02% 11 Missing and 9 partials ⚠️
everyvoice/text/features.py 78.94% 6 Missing and 6 partials ⚠️
everyvoice/text/utils.py 50.00% 8 Missing ⚠️
everyvoice/wizard/dataset.py 86.66% 3 Missing and 5 partials ⚠️
everyvoice/utils/__init__.py 87.17% 3 Missing and 2 partials ⚠️
everyvoice/wizard/utils.py 90.38% 1 Missing and 4 partials ⚠️
everyvoice/model/e2e/dataset.py 40.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #348      +/-   ##
==========================================
+ Coverage   71.49%   73.15%   +1.66%     
==========================================
  Files          39       43       +4     
  Lines        2543     2786     +243     
  Branches      390      459      +69     
==========================================
+ Hits         1818     2038     +220     
- Misses        653      664      +11     
- Partials       72       84      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roedoejet roedoejet changed the title [DRAFT] Improve handling of inputs Improve handling of inputs Mar 26, 2024
Copy link
Contributor

github-actions bot commented Mar 26, 2024

CLI load time: 0:00.28
Pull Request HEAD: 887a63329946b53270506625a636ffe4bc0db8a9
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package
import time:       424 |     100437 |       everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.synthesize
import time:       313 |     116449 |     everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.cli
import time:       259 |     117300 |   everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli
import time:        21 |     117320 | everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.preprocess

@marctessier
Copy link
Collaborator

I ran a few test with success but getting lots of warnings. ( I did not really look at the results , just ran the full pipeline all the way to synth with success using CRK data.

BUT Noticing this bug when you use a filelist that includes a headerline during the wizard.

EX: I give a file that included a "header line" like below:
basename|text|speaker|language
When I tell the wizard to use the info from that headerline, the TEST-filelist.pst produced is missing the |phones field . Afterwards when you run PREPROCESS , it will crash.

pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language

VS using the same file but ignoring the header lines, no crash on proprocess.

pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST_manual
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language|phones

@roedoejet
Copy link
Member Author

I ran a few test with success but getting lots of warnings. ( I did not really look at the results , just ran the full pipeline all the way to synth with success using CRK data.

BUT Noticing this bug when you use a filelist that includes a headerline during the wizard.

EX: I give a file that included a "header line" like below: basename|text|speaker|language When I tell the wizard to use the info from that headerline, the TEST-filelist.pst produced is missing the |phones field . Afterwards when you run PREPROCESS , it will crash.

pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language

VS using the same file but ignoring the header lines, no crash on proprocess.

pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST_manual
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language|phones

I turned this into an issue: #363

@roedoejet roedoejet force-pushed the dev.ap/inputs branch 3 times, most recently from 03df507 to 4966efd Compare April 2, 2024 16:51
@MENGZHEGENG
Copy link
Collaborator

Thanks Adian for your great effort!

It seems that we need to include grapheme in the requirements.txt.

@joanise
Copy link
Member

joanise commented Apr 3, 2024

It seems that we need to include grapheme in the requirements.txt.

Yeah, I had the same reaction when I started testing, but it's because I had not reinstalled. It's already declared, but you need to rerun pip install -e . after updating your sandbox.

@joanise
Copy link
Member

joanise commented Apr 3, 2024

Some disorganized feedback...

  • I love how the character set questions are gone, those were just annoying.
    • But I wonder, what do with do now with languages like moh or west coast ones that might use a command or an apostrophe as a letter?
  • found a bug running preprocess, got a crash and lots of warnings - sent that log by e-mail

Some new mypy warnings that should be fixable:

everyvoice/utils/__init__.py:375: error: Incompatible types in assignment (expression has type "islice[str]", variable has type "TextIOWrapper")  [assignment]
everyvoice/text/text_processor.py:355: error: Argument "lang_id" to "apply_g2p_and_tokenization" of "TextProcessor" has incompatible type "str | None"; expected "str"  [arg-type]
everyvoice/preprocessor/preprocessor.py:687: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/preprocessor/preprocessor.py:696: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/preprocessor/preprocessor.py:707: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/preprocessor/preprocessor.py:716: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/model/feature_prediction/FastSpeech2_lightning/fs2/cli/check_data.py:35: error: Incompatible types in assignment (expression has type "list[dict[Any, Any]]", variable has type "Path")  [assignment]

@roedoejet
Copy link
Member Author

Thanks for the feedback @joanise !

But I wonder, what do with do now with languages like moh or west coast ones that might use a command or an apostrophe as a letter?

Yes, I also worry about this. I think we either:

  • Prompt the user in the wizard about punctuation
  • Prompt the user in the wizard to review their text_config

Related, but orthogonal to that, perhaps we should remove comma ,, apostrophe ', and colon : from the default punctuation. Or maybe just prompt about those? I'm not really sure. Open to suggestions.

@MENGZHEGENG
Copy link
Collaborator

MENGZHEGENG commented Apr 3, 2024

Now we may have a symbol defined twice in everyvoice-shared-text.yaml (once in lower_ascii and the other in cluster detected symbols):

... | WARNING | everyvoice.text.text_processor:__init__:96 - Symbol 'a' has already been declared at position 18 so we will use that index instead of the current index 45. Please remove duplicates from your configuration.

Not sure if this is a desired operation. Perhaps we can modify the lower_ascii set once automatically when we get the cluster detected symbols.

Again thanks Aidan for the great effort in making the input so much better!

@roedoejet
Copy link
Member Author

roedoejet commented Apr 3, 2024

Now we may have a symbol defined twice in everyvoice-shared-text.yaml (once in lower_ascii and the other in cluster detected symbols):

Thanks Tim! I just added 4a0dc56 and 3fdc561 which I think fixes this. It removes lower_ascii, removes all duplicates automatically, and sorts the symbols first based on length, and then based on standard Python sorting rules. I think this provides better stability and doesn't annoy the user with duplicate-defined symbol warnings because it doesn't really matter if the user defines duplicates, it just matters that there is only one place in the embedding table that the symbol is stored. Thanks for this feedback!

Copy link
Collaborator

@SamuelLarkin SamuelLarkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this is a big one, may be too big. It was hard to review.

everyvoice/config/shared_types.py Outdated Show resolved Hide resolved
everyvoice/config/text_config.py Show resolved Hide resolved
everyvoice/text/text_processor.py Show resolved Hide resolved
everyvoice/tests/data/metadata.csv Show resolved Hide resolved
everyvoice/tests/data/metadata_arpabet.csv Outdated Show resolved Hide resolved
everyvoice/tests/test_preprocessing.py Outdated Show resolved Hide resolved
everyvoice/tests/test_text.py Show resolved Hide resolved
everyvoice/tests/test_text.py Show resolved Hide resolved
everyvoice/exceptions.py Outdated Show resolved Hide resolved
everyvoice/exceptions.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@SamuelLarkin SamuelLarkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Humm starting yet another review.
I forgot a suggestion.

everyvoice/text/features.py Outdated Show resolved Hide resolved
joanise and others added 22 commits April 10, 2024 17:54
the only thing that is required is that the filelist has the appropriate column headers
refactor to process cleaners on filelist-lists before parsing into a list of dicts
@roedoejet
Copy link
Member Author

Wow, what a monster! Lots of comments below. I would say we want to merge this sooner than later, as soon as it's stable enough to open the repo, so these comments can be turned into issues for future work whenever they're too much to do just now.

Thanks so much for the super helpful review! I think this is good to go now, modulo the bug that you reported when training. I'm happy to merge it now and sort that bug out later though, if all looks good otherwise?

Copy link
Member

@joanise joanise left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this now, and do further improvements as new PRs.

It's in good enough shape to merge, and we know this is where we want to work from.

@roedoejet roedoejet merged commit 4bc5861 into main Apr 11, 2024
4 checks passed
@roedoejet roedoejet deleted the dev.ap/inputs branch April 11, 2024 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove lowercase_ascii default from symbol set. Properly handle inputs for IPA, character, Arpabet modelling
5 participants