How exactly do the different phonemes sound? (Preparation for fine-tuning...) #40
Comments
Two more things...
You do not need to use that exact phoneme inventory. Most of its entries are a standard phoneme with some diacritics attached. For the diacritics, you can find info, for example, here. You can use a much simpler phoneme inventory if that satisfies your purpose. Actually, I think the default phoneme inventory is hard to recognize. I am not very familiar with espeak-ng's inventory. If it is in X-SAMPA format, you can convert it using this file from panphon. In the fine-tuning data, you should only include phonemes separated by spaces; do not use other special signs, as they might be interpreted as phonemes.
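The "phonemes separated by spaces, nothing else" rule can be checked before fine-tuning. The following is only a minimal sketch: `find_unknown_phones` is a hypothetical helper (not part of allosaurus), and the inventory shown is a small illustrative subset, not the full Hungarian list. Both sides are normalized to NFD so that visually identical phones with different Unicode encodings still compare equal.

```python
import unicodedata


def normalize(phone: str) -> str:
    """Normalize a phone to NFD so equivalent Unicode spellings compare equal."""
    return unicodedata.normalize("NFD", phone)


def find_unknown_phones(transcript: str, inventory: set[str]) -> list[str]:
    """Return the space-separated tokens of a transcript that are not in the inventory."""
    known = {normalize(p) for p in inventory}
    return [tok for tok in transcript.split() if normalize(tok) not in known]


# Illustrative subset only (assumption): a few phones from the inventory in this issue.
inventory = {"k", "k\u02d0", "o", "o\u02d0", "r"}

print(find_unknown_phones("k o\u02d0 r", inventory))  # every token is a known phone
print(find_unknown_phones("k o: r", inventory))       # ASCII colon instead of the IPA length mark
```

A non-empty result points at tokens such as punctuation or an ASCII `:` used in place of `ː`, which could otherwise be read as phonemes.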
Thanks for the answer, @xinjli. I have started to check the links you mentioned. I would like to use a simple phoneme inventory, of course, if that is possible, but I still have a lot of questions regarding the actual phoneme inventory (I think I can understand the diacritics, so that part is not a question). Let me give you one example! I have even tried to use the topk parameter, but even with topk=5, t͡s does not come out in the result... (By the way, I have checked Phoible, and according to [https://phoible.org/languages/hung1274], all the different inventories for Hungarian have this ts phoneme (either t͡s, or ts, or t̪s̪).) Regarding the fine-tuning... You mentioned I should not use special signs... Thank you and best regards!
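To check systematically whether a phone such as t͡s ever shows up among the top-k candidates, the output string can be parsed. This is a rough sketch under an assumption: each frame is taken to be printed as `phone (probability)` pairs, with frames separated by `|`. The helper names and the sample line (including its probabilities) are made up for illustration.

```python
import re

# Assumed candidate format: "<phone> (<probability>)", frames separated by "|".
CANDIDATE = re.compile(r"(\S+) \(([0-9.]+)\)")


def parse_topk(line: str) -> list[list[tuple[str, float]]]:
    """Split a top-k output line into frames of (phone, probability) pairs."""
    return [
        [(phone, float(prob)) for phone, prob in CANDIDATE.findall(frame)]
        for frame in line.split("|")
    ]


def frames_containing(line: str, phone: str) -> list[int]:
    """Return the indices of the frames in which `phone` is one of the candidates."""
    return [
        i
        for i, frame in enumerate(parse_topk(line))
        if any(candidate == phone for candidate, _ in frame)
    ]


# Hypothetical output line; the values are invented, not real model output.
sample = "t\u0255 (0.41) t (0.30) t\u0361s (0.12) | \u0252 (0.80) \u0251 (0.10)"
print(frames_containing(sample, "t\u0361s"))
```

An empty list for a real recording would confirm that the phone is not among the k best candidates anywhere, rather than merely being outranked.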
It might be that the likelihood ordering is tɕ >= t >= t͡s. This is typically caused by the unbalanced training set I used when I trained the model. As for the special signs, you can use 'zː', 'z̪' or 'z̻' as long as they are valid IPA.
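Since most phones in the inventory are a base symbol plus modifiers, a small lookup can make entries like 'zː', 'z̪' or 'z̻' easier to read. The sketch below assumes the inventory uses the standard IPA code points listed in the table (for example U+0320 for the retracted mark); `describe` is a hypothetical helper, and the table covers only the modifiers that appear in this issue.

```python
# Standard IPA readings of the modifiers used in the inventory quoted in this issue.
MODIFIERS = {
    "\u02d0": "long",       # MODIFIER LETTER TRIANGULAR COLON
    "\u032a": "dental",     # COMBINING BRIDGE BELOW
    "\u033b": "laminal",    # COMBINING SQUARE BELOW
    "\u0320": "retracted",  # COMBINING MINUS SIGN BELOW
}


def describe(phone: str) -> tuple[str, list[str]]:
    """Split a phone into its base symbol(s) and the names of its modifiers."""
    base = "".join(ch for ch in phone if ch not in MODIFIERS)
    marks = [MODIFIERS[ch] for ch in phone if ch in MODIFIERS]
    return base, marks


for phone in ["z\u02d0", "z\u032a", "z\u033b", "t\u032a\u02d0"]:
    print(phone, describe(phone))
```

Read this way, 'z̪' is a dental z and 't̪ː' is a long dental t, so a simplified inventory could drop the place diacritics while keeping the length distinction if that suits the application.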
Thanks for the explanation. So I will prepare the datasets for model fine-tuning. I have read in the README that the audio files should be shorter than 10 seconds. But can you tell me which is better: to have only one word in each audio file, or complete (short) sentences? And is a short silence needed at the beginning and the end of the audio files or not? And may I ask what kind of dataset you are using for the training? Does it contain samples from all the languages? If there are Hungarian samples in it, is it possible to check them? Thanks!
I think both styles are possible (one word per file, or a short sentence per file). It depends on your final application, so you can use whatever you think is appropriate. The files do not need to contain silence at the beginning for training. As for the dataset other than English, it was mainly from a corpus collection called the Babel dataset, which is a telephone conversation corpus. You can see the list of corpora in the linked paper. The model available here is not using the exact same corpus set, but it is very similar. There were no Hungarian samples when I trained it.
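Whichever style is chosen, the 10-second limit mentioned above can be verified before training. This is a minimal sketch using only the standard-library `wave` module, so it assumes uncompressed PCM WAV files; the function names are hypothetical.

```python
import wave

MAX_SECONDS = 10.0  # limit taken from the README statement quoted in this thread


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(path: str, limit: float = MAX_SECONDS) -> bool:
    """Check that an audio file is strictly shorter than the limit."""
    return wav_duration_seconds(path) < limit
```

Running `is_short_enough` over every file in the training and validation sets would flag recordings that need to be split into shorter utterances.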
I have started the fine-tuning... So what should I do? Do I have to put the long consonants into the phoneme inventory as well, or something else? |
I might be wrong, but as far as I know only vowels can have this "long version". In your case, k itself is a very short consonant and probably should not have a long version; it seems more reasonable for it to be something like k o: r. If you still want to distinguish them, you can treat them as two different phonemes (o, o:) and train the model.
This is not correct. There are two ways of transcribing a geminate (or long) consonant: [kk] or [kː]. The first is ambiguous, since it can represent a sequence of two [k]s or a long counterpart of [k]. |
So maybe we can decompose [k:] into [k] [k] in this case?
I don't think we can decompose [k:] into [k] [k]. |
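If the long consonant is kept as its own phoneme, the transcripts need a tokenizer that leaves [kː] as a single token instead of splitting it. The sketch below is one possible way to do that, assuming the length mark is the IPA character U+02D0 (not an ASCII colon) and that affricates are written with a tie bar; `tokenize_ipa` and `to_training_line` are hypothetical helpers, not part of allosaurus.

```python
import unicodedata

TIE_BARS = {"\u0361", "\u035c"}      # combining tie bars above / below (as in t͡s)
LENGTH_MARKS = {"\u02d0", "\u02d1"}  # long and half-long marks


def tokenize_ipa(text: str) -> list[str]:
    """Split an IPA string into phones, keeping diacritics, length marks and ties attached."""
    tokens: list[str] = []
    for ch in text:
        if ch.isspace():
            continue
        # Combining diacritics and length marks belong to the preceding symbol.
        attaches = unicodedata.combining(ch) != 0 or ch in LENGTH_MARKS
        # A symbol right after a tie bar completes the preceding phone.
        after_tie = bool(tokens) and tokens[-1][-1] in TIE_BARS
        if tokens and (attaches or after_tie):
            tokens[-1] += ch
        else:
            tokens.append(ch)
    return tokens


def to_training_line(text: str) -> str:
    """Join the phones with single spaces, as expected in the fine-tuning text files."""
    return " ".join(tokenize_ipa(text))


print(to_training_line("k\u02d0o\u02d0r"))
print(to_training_line("t\u0361s\u0252k"))
```

With this scheme [kː] stays one token that must be listed in the inventory, while the ambiguous spelling [kk] comes out as two separate [k] tokens.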
Hi,
When I use allosaurus with the eng2102 model on an English wav file, the results look quite good (although there is one issue: if there is no silence at the beginning of the wav file, some phonemes from the beginning of the speech will be missing. I am still testing this; maybe later I will open a separate issue on this topic).
But when I use the universal model for a Hungarian wav file, the results are not so good (of course, I know it is not a very well known language ;-)).
So I would like to fine-tune the model. But for this, I need to create the text files containing the phonemes of the sentences. As stated in the doc, the phones here should be restricted to the phone inventory of my target language.
The phone inventory for the Hungarian language is the following:
aː b bː c d dː d̠ d̪ d̪ː d̻ eː f fː h hː i iː j jː k kː l lː l̪ l̪ː m mː n nː n̪ n̪ː o oː p pː r rː r̪ r̪ː s sː s̪ s̻ t tː t̠ t̪ t̪ː t̻ u uː v vː w y yː z zː z̪ z̻ æ ø øː ɑ ɒ ɔ ɛ ɟ ɡ ɡː ɲ ɲː ɾ ʃ ʃː ʒ ʒː ʝ ʝː
But there are some phonemes I cannot recognize.
Here is the explanation for the IPA signs for the Hungarian language:
https://hu.wikipedia.org/wiki/IPA_magyar_nyelvre
(unfortunately, it is in Hungarian, but the IPA signs are easy to find...)
Can you help me to understand this, or give me a link to any document, describing these phonemes?
Thanks!