Train new language #18

Open
thewh1teagle opened this issue Jun 27, 2024 · 2 comments

Comments

@thewh1teagle

Can I use this repo to train a new TTS model in another language?
How many hours of audio + transcripts do I need?
Does the text need to have diacritical marks?

@nipponjo
Owner

nipponjo commented Jun 27, 2024

You can certainly do that. In the end, the models in this repo learn a token ids -> mel frames mapping, independent of the language. In order to train on some dataset, you will have to write a data loader that maps your text to token ids, as is done for Modern Standard Arabic in this repo.

The Arabic Speech Corpus has around 2 hours, and I sampled 30-60 minutes per speaker for the multi-speaker model. In my experience it is usually better to have 10+ hours for the prosody, but that will also depend on the quality of the audio files.

So far, I have only trained on diacritized text. I assume that it is possible for these models to learn the diacritization, but I haven't tried so far since I don't know a good-quality dataset for that. Of course, it is possible to train a model with diacritized text, sample audio files for diacritized text, remove the diacritics, and train on that.
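For illustration, here is a minimal sketch of such a text -> token ids mapping and data loader. The symbol set, special tokens, and data layout here are hypothetical placeholders, not the vocabulary actually used in this repo:

```python
import torch
from torch.utils.data import Dataset

# Hypothetical character-level symbol set -- a placeholder,
# not the vocabulary actually used in this repo.
symbols = ["<pad>", "<eos>"] + list("abcdefghijklmnopqrstuvwxyz '.,?!")
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def text_to_token_ids(text: str) -> torch.LongTensor:
    """Map cleaned text to token ids, dropping characters outside the whitelist."""
    ids = [symbol_to_id[ch] for ch in text.lower() if ch in symbol_to_id]
    ids.append(symbol_to_id["<eos>"])
    return torch.LongTensor(ids)

class TTSDataset(Dataset):
    """Pairs each transcript with a precomputed mel spectrogram saved as a .pt file."""
    def __init__(self, entries):
        self.entries = entries  # list of (transcript, mel_path) tuples

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        text, mel_path = self.entries[idx]
        return text_to_token_ids(text), torch.load(mel_path)
```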

@thewh1teagle
Author

thewh1teagle commented Jul 15, 2024

Thanks a lot for the comment!

> the models in this repo learn a token ids -> mel frames mapping, independent of the language.

Let me know if that process sounds good:

  1. Find 10-20 hours of high-quality recordings of a single speaker
  2. Split them into multiple files of 5-20 seconds (I can use voice activity detection for clean splits)
  3. Fix the transcriptions by converting numbers (1, 2, 3) to their spelled-out forms
  4. Fix the transcriptions by converting symbols (such as $) to their spoken names
  5. Remove punctuation marks (Can I keep them? I think they're important for speech)
  6. Remove any character that is not in the character whitelist (used later for token ids; a rough sketch of steps 3-6 follows below)
  7. Add vowel points to the transcriptions
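Something like this is what I have in mind for steps 3-6. This is a rough sketch with a hypothetical whitelist and toy digit-by-digit number spelling; a real pipeline would use a library such as num2words or a language-specific equivalent:

```python
import re

# Hypothetical whitelist and symbol names -- adjust to the target language.
WHITELIST = set("abcdefghijklmnopqrstuvwxyz '")
SYMBOL_NAMES = {"$": " dollars ", "%": " percent ", "&": " and "}

# Toy digit-by-digit number speller, for illustration only.
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_number(match: re.Match) -> str:
    return " " + " ".join(DIGIT_NAMES[int(d)] for d in match.group()) + " "

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\d+", spell_number, text)             # step 3
    for sym, name in SYMBOL_NAMES.items():                # step 4
        text = text.replace(sym, name)
    text = "".join(ch for ch in text if ch in WHITELIST)  # steps 5-6
    return re.sub(r"\s+", " ", text).strip()

print(normalize("It costs $25!"))  # -> "it costs dollars two five"
```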

From here I'm not sure: how do I convert the cleaned, voweled text into token IDs?
Do I start the training from a pretrained model? But the pretrained models are mostly English, no?
How do I make sure that punctuation marks affect the audio, e.g. as pauses between sentences?

Can I use the repo's source code for the training, or is it too different when using another language?

Regarding FastPitch and HiFi-GAN: do I need to change anything, or can they be used exactly as in this repo?
Also, do you think Google Colab is suitable for such training?

Also, did you train it from scratch or use a pretrained English model?
