[Bug] Gruut espeak inconsistencies makes the training harder. #680
Comments
It looks like this is happening because pronunciations in eSpeak depend on the surrounding words. Compare:
espeak-ng -v en-us -q --ipa 'to'
tˈuː
versus
espeak-ng -v en-us -q --ipa 'to develop'
tə dɪvˈɛləp
When creating my eSpeak lexicons, I get phonemes for each word individually, so the pronunciation for "to" is coming through as tˈuː.
For English, I was able to get part-of-speech aware eSpeak pronunciations by prepending a specific word. For example:
espeak-ng -v en-us -q --ipa 'wound'
wˈuːnd
espeak-ng -v en-us -q --ipa 'had wound'
hæd wˈaʊnd
Similarly, adding another word after "to" changes its pronunciation, though here the two words get combined:
espeak-ng -v en-us -q --ipa 'to to'
tʊtˈʊ
EDIT: After some more tests, it looks like I can stop this behavior by injecting phonemes for the second word instead:
espeak-ng -v en-us -q --ipa "to [[ x ]]"
tˈə x
Unless you have another suggestion, I can re-generate the eSpeak lexicons with this method and see if it helps.
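A minimal sketch of how the phoneme-injection trick could be scripted when building a lexicon word by word. The helper names (`espeak_command`, `strip_dummy`, `phonemize_word`) are hypothetical and not part of gruut or eSpeak; only the `espeak-ng` flags and the `[[ x ]]` syntax come from the commands above. It assumes `espeak-ng` is installed and on the PATH.

```python
import subprocess


def espeak_command(word, voice="en-us"):
    """Build the espeak-ng call from this thread, with a dummy phoneme after the word."""
    # "[[ x ]]" injects a raw phoneme, so eSpeak sees following context
    # without a second real word that it might merge with the first.
    return ["espeak-ng", "-v", voice, "-q", "--ipa", f"{word} [[ x ]]"]


def strip_dummy(output, dummy="x"):
    """Drop the trailing injected phoneme from one line of eSpeak output."""
    parts = output.strip().split()
    if parts and parts[-1] == dummy:
        parts = parts[:-1]
    return " ".join(parts)


def phonemize_word(word, voice="en-us"):
    """Run espeak-ng for a single word and return its in-context phonemes."""
    result = subprocess.run(
        espeak_command(word, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return strip_dummy(result.stdout)
```

With the output shown above, `strip_dummy("tˈə x")` leaves `tˈə` as the lexicon entry for "to".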
Sounds good to me, but it seems the contextual differences will remain a problem unless the TTS model resolves them. Do you know how eSpeak deals with context? Just a side note, training models with
For part of speech at least, it seems to maintain a set of counters that mean things like "expect perfect tense in the next N words" or "don't expect a verb next". I haven't found yet where and how it decides to alter the pronunciation of a word just due to the presence of other words, or what triggers the decision to combine two words into one.
OK, this is good to know. At least there's a viable alternative until I can get the eSpeak problems sorted out.
Small side note: you can actually trigger this behavior in phonemizer as well due to how it handles punctuation preservation!
echo 'To, be.' | phonemize
tə biː
echo 'To, be.' | phonemize --preserve-punctuation
tuː , biː .
echo 'To be.' | phonemize
tə biː
The punctuation mark right after "to" causes it to get run through eSpeak alone, triggering its "single word" pronunciation. It's only a corner case here, but makes me wonder what other weird behaviors of eSpeak lurk beneath... 😉
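To illustrate why punctuation preservation can isolate a word, here is a simplified model of the splitting step. This is an assumption about the general approach, not phonemizer's actual implementation; the function name and the punctuation set are made up for the example.

```python
import re

# A small, assumed set of punctuation marks; a real tool may use a different set.
PUNCTUATION = r"[,.;:!?]"


def split_at_punctuation(text):
    """Split text into the chunks that would each be sent to the backend alone."""
    chunks = re.split(PUNCTUATION, text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

Under this model, `'To, be.'` becomes two chunks, `To` and `be`, so "to" reaches eSpeak without any following word, while `'To be.'` stays a single chunk and keeps its context.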
I do not want to hijack this discussion, and I'm not sure whether this has been discussed elsewhere, but I'm confused about how Gruut deals with foreign-language words in a sentence. In Germany we use some English words in German sentences, and these are pronounced wrong.
espeak
echo "Ein Song geht mir nicht mehr aus den Ohren." | phonemize -b espeak -l de
TTS server with Gruut
It seems that Gruut recognizes it's an English word and is tagging it right, but the spoken audio doesn't sound English.
I tried the following sentence with the current Coqui release:
Leading to the following phonemes:
Here's the spoken output: https://sndup.net/4jcy It sounds a little bit as if the language tags are being spoken too ;-)
Thank you for this, Thorsten. Just when I think I've figured out all of eSpeak's quirks, it hits me with something new 🤦♂️ There does not appear to be any option for disabling the "language switching" flags that eSpeak so "helpfully" inserts into the phoneme output. I had assumed they were opt-in, but phonemizer must have been manually removing them with a regex or something. I will rebuild the eSpeak lexicons with these flags removed. Thank you!
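A rough sketch of what removing the language-switching flags with a regex might look like. It assumes the flags appear as a parenthesized language code such as `(en)` or `(de)` inside the phoneme string; that format, the pattern, and the function name are assumptions for illustration, not gruut's or phonemizer's actual code.

```python
import re

# Assumed flag shape: "(en)", "(de)", "(en-us)" and similar.
LANG_FLAG = re.compile(r"\([a-z]{2,3}(?:-[a-z0-9]+)*\)")


def strip_language_flags(phonemes):
    """Remove language-switch flags and tidy up any whitespace left behind."""
    without_flags = LANG_FLAG.sub("", phonemes)
    return re.sub(r"\s+", " ", without_flags).strip()
```

For a placeholder string like `xx (en)yy(de) zz` (not real phonemes), this returns `xx yy zz`. A stricter pattern may be needed if parentheses can appear in the output for other reasons.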
I've pushed up a new version of gruut (1.3.1) with updated eSpeak lexicons and g2p models. I can create a PR for the new version if that's OK with you, @erogol
The updated lexicons are produced by doubling each word like this:
echo 'a,a' | espeak-ng -v en-us -q --ipa
ɐ ˈeɪ
and then taking the phonemes from the first word. This works for the original bug @erogol reported, but produces a new "problem" with other words:
echo 'it,it' | espeak-ng -v en-us -q --ipa
ɪɾ ˈɪt
eSpeak replaces t with ɾ in the first word. If I can understand eSpeak's rules better, I should be able to create context-sensitive pronunciations for gruut that follow these rules. Right now, part of speech is the only feature used to disambiguate multiple pronunciations, but others (like "precedes vowel") could be added pretty easily.
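One way a "precedes vowel" feature could sit alongside other disambiguation features is sketched below. This is a hypothetical data structure, not gruut's lexicon format: the `Pronunciation` class, the `choose` function, and the spelling-based vowel check are all assumptions. The two pronunciations of "it" are taken from the eSpeak output above.

```python
from dataclasses import dataclass
from typing import Optional

# Crude spelling-based stand-in; a real check would look at the next word's phonemes.
VOWEL_LETTERS = set("aeiou")


@dataclass(frozen=True)
class Pronunciation:
    phonemes: str
    precedes_vowel: Optional[bool] = None  # None means "no constraint"


LEXICON = {
    "it": [
        Pronunciation("ɪɾ", precedes_vowel=True),
        Pronunciation("ˈɪt"),
    ],
}


def choose(lexicon, word, next_word=None):
    """Pick the pronunciation whose context constraint matches, else the default."""
    before_vowel = bool(next_word) and next_word[0].lower() in VOWEL_LETTERS
    candidates = lexicon[word.lower()]
    # Prefer an entry that explicitly matches the context.
    for candidate in candidates:
        if candidate.precedes_vowel is not None and candidate.precedes_vowel == before_vowel:
            return candidate.phonemes
    # Otherwise fall back to the first unconstrained entry.
    for candidate in candidates:
        if candidate.precedes_vowel is None:
            return candidate.phonemes
    return candidates[0].phonemes
```

Here `choose(LEXICON, "it", "is")` selects the flapped variant, while `choose(LEXICON, "it")` falls back to the unconstrained one. Whether eSpeak's own rule is really "precedes vowel" is not settled in this thread.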
Great news @synesthesiam. Feel free to send a PR ✨
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels. |
@synesthesiam sorry to bother you, but I'm curious whether you've pushed the PR?
Hi @skol101, no bother 🙂 Thanks for your patience with this issue. I did push a PR, but it's since been removed and those changes will be bundled together with fixes/additions for two other issues: You may be able to use my fixes anyway by simply upgrading the version of |
Describe the bug
Inconsistency between gruut with espeak phonemes and phonemizer. Gruut adds an additional : between characters. It breaks the pronunciation, especially when saying "to" or "to be".
To Reproduce
For the sentence
It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.
Phonemizer with espeak-ng:
Gruut:
Additional context
I see that these inconsistencies make learning harder for 🐸 TTS models.
In general, training a model with raw characters produces good results faster than phoneme-based training. I assume this is because of such inconsistencies between phonemizer and gruut.
I am now training a model with use_espeak_phonemes=False to see if it makes any difference.