Fine-tuning for Hindi #525

Hi @blue-fish, I am trying to fine-tune the model to clone the voices of Hindi speakers. I wanted to know the steps to follow, and also the amount of data I'd need for the model to work well.

Edit: I shall use Google Colab for fine-tuning.
Comments
Hi @thehetpandya, please start by reading this: #431 (comment). It is not possible to fine-tune the English model to another language; a new model needs to be trained from scratch. This is because the model relates the letters of the alphabet to their associated sounds, so what the model knows about English does not transfer over to Hindi. At a minimum, you will need a total of 50 hours of transcribed speech from at least 100 speakers; for a better model, get 10 times that amount. That is what you need to do. Good luck and have fun!
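To illustrate why the alphabet matters: the synthesizer embeds each input character, so the character inventory has to cover the target language. Here is a hypothetical sketch of what a Devanagari-capable symbol set could look like (the repo keeps its real inventory in synthesizer/utils/symbols.py; everything below is an illustrative assumption, not the project's code):

```python
# Hypothetical character inventory for a from-scratch Hindi model.
# Not the project's actual symbols.py contents.
_pad = "_"
_eos = "~"
_punctuation = "!'(),.:;? -"
# The Devanagari block U+0900..U+097F covers Hindi letters, matras, and digits
_devanagari = [chr(cp) for cp in range(0x0900, 0x0980)]

symbols = [_pad, _eos] + list(_punctuation) + _devanagari
char_to_id = {c: i for i, c in enumerate(symbols)}

def text_to_ids(text):
    # Unknown characters are dropped here; a real pipeline would normalize first
    return [char_to_id[c] for c in text if c in char_to_id]

print(text_to_ids("नमस्ते!"))  # every symbol resolves to an embedding index
```

An English-only symbol table simply has no entries for these characters, which is why the English model's knowledge cannot be fine-tuned across.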
@blue-fish Thanks a lot for the response! Yes, I have begun exploring the issues to get a better understanding of the workflow before beginning the training process. I also read in #492 (comment) that training the synthesizer first would be a good start, and only if the encoder doesn't seem to give proper results should one proceed with training/fine-tuning the encoder. Does the same apply to a totally different language, like Hindi in my case?
I'm working on a forked version of SV2TTS to train a local dialect of Chinese. Using the dataset from Common Voice (about 22k utterances), I couldn't get the data to converge. But if I add the local dialect on top of a pre-trained model (the main dialect of Chinese), the result seems to be actually quite good. FYI, the local dialect and the main dialect have different but similar alphabet romanization systems (for example, the main dialect has 4 tones, but the local dialect has 8). Attached are samples from using the pre-trained model and then adding the local dataset. @blue-fish
@lawrence124 Interesting, thanks for sharing that result! Occasionally the model fails to learn attention; you might try restarting the training from scratch with a different random seed. It might also help to trim the starting and ending silences. If your data is at 16 kHz, then webrtcvad can do that for you (see the trim_long_silences function in encoder/audio.py).
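For reference, here is a minimal sketch of that kind of VAD-based trimming. It is not the repo's trim_long_silences (which also collapses long pauses inside the utterance); it only strips leading and trailing silence, and it assumes 16 kHz mono float audio in [-1, 1]:

```python
import numpy as np
import webrtcvad

def trim_edge_silences(wav, sampling_rate=16000, frame_ms=30, mode=3):
    """Strip leading/trailing silence from a float waveform in [-1, 1]."""
    vad = webrtcvad.Vad(mode)  # 0 = least aggressive, 3 = most aggressive
    samples_per_frame = sampling_rate * frame_ms // 1000
    # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz, in 10/20/30 ms frames
    pcm = (wav * 32767).astype(np.int16)
    n_frames = len(pcm) // samples_per_frame
    flags = []
    for i in range(n_frames):
        frame = pcm[i * samples_per_frame:(i + 1) * samples_per_frame]
        flags.append(vad.is_speech(frame.tobytes(), sampling_rate))
    if not any(flags):
        return wav  # nothing detected as speech; leave the audio untouched
    first = flags.index(True)
    last = n_frames - 1 - flags[::-1].index(True)
    return wav[first * samples_per_frame:(last + 1) * samples_per_frame]
```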
Thanks @blue-fish, I went through the issues you mentioned. You gave me a good set of resources to start from. Much appreciated!
@lawrence124 Glad to see your results! Did you have to train the encoder from scratch? Or did using the pre-trained encoder/synthesizer work for you?
I'm using the pretrained encoder from Kuangdd, but judging by the file size and date, it seems to be the same as the pretrained encoder from this repo.
Okay, thanks @lawrence124! Seems like using the pretrained encoder is good to go for now.
By the way, I modified a script from adueck a bit. The script converts video/audio with an SRT file into audio clips with matching transcripts for training. I'm not quite sure about the exact format SV2TTS expects, but I think you may find it useful if you are trying to get more data to train on. https://github.com/adueck/split-video-by-srt
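A rough sketch of the same idea, assuming pysrt and the ffmpeg binary are available; the file names are placeholders:

```python
import subprocess
import pysrt

def fmt(t):
    # pysrt SubRipTime -> "HH:MM:SS.mmm", a format ffmpeg accepts
    return f"{t.hours:02d}:{t.minutes:02d}:{t.seconds:02d}.{t.milliseconds:03d}"

subs = pysrt.open("talk.srt")  # placeholder input files
for i, sub in enumerate(subs):
    # Cut one 16 kHz mono wav per subtitle line
    subprocess.run([
        "ffmpeg", "-y", "-i", "talk.mp4",
        "-ss", fmt(sub.start), "-to", fmt(sub.end),
        "-ar", "16000", "-ac", "1", f"clip_{i:04d}.wav",
    ], check=True)
    # Save the matching transcript next to the clip
    with open(f"clip_{i:04d}.txt", "w", encoding="utf-8") as f:
        f.write(sub.text.replace("\n", " "))
```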
I would like to ask a rather random question: have you tried the demo TTS at https://www.readspeaker.com/? To my ear, the results in Chinese/Cantonese are pretty good, and I would like to discuss: is their proprietary algorithm simply superior, or do they simply have the resources to build a better dataset to train on? Based on this job description, what they are doing is not too different from Tacotron / SV2TTS: https://www.isca-speech.org/iscapad/iscapad.php?module=article&id=17363&back=p,250
@lawrence124 That website demo uses a different algorithm that probably does not involve machine learning. It sounds like a concatenative method of synthesis, where prerecorded sounds are joined together. Listening closely, it is unnatural and obviously computer-generated. To their credit, they do use high-quality audio samples to build the output. Here's a wav of the demo text synthesized by zhrtvc, using Griffin-Lim as the vocoder. Tacotron speech flows a lot more smoothly than their demo. zhrtvc could sound better than the demo TTS if 1) it is trained on higher-quality audio, and 2) a properly configured vocoder is available.
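For context, Griffin-Lim is a phase-reconstruction algorithm rather than a learned vocoder, which is partly why it tends to sound metallic. A minimal sketch of inverting a mel spectrogram with librosa (the parameter values are illustrative, not what zhrtvc uses):

```python
import librosa
import soundfile as sf

# Load speech and compute a mel spectrogram (a stand-in for a synthesizer's output)
wav, sr = librosa.load(librosa.example("libri1"), sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Invert mel -> audio; librosa runs Griffin-Lim internally to estimate the phase
recon = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256, n_iter=60)
sf.write("griffin_lim_demo.wav", recon, sr)
```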
@blue-fish Yeah, as with other data analysis work, getting a good/clean dataset is always difficult (the preliminary result of adding YouTube clips is not good). Attached (20200915-204053_melgan_10240ms.zip) is an example using "Mandarin + Cantonese" as the synthesizer, along with a MelGAN vocoder. I don't know if it's just my ears, but I don't really like the Griffin-Lim output from zhrtvc; it has a "robotic" noise in the background. By the way, it seems you are updating the synthesizer of SV2TTS? Is the backbone still Tacotron?
@lawrence124 Thanks, I shall take a look at it, since I might need more data if I cannot find a public dataset.
@thehetpandya were you able to generate the model for cloning Hindi sentences?
@GauriDhande I'm still looking for a good Hindi speech dataset. Do you have any sources?
Was going to ask the same thing. I haven't found an open Hindi speech dataset on the internet yet.
You might be able to combine the two sources below. First train a single-speaker model on source 1, then tune the voice cloning aspect on source 2. Some effort and experimentation will be required. Source 1 (24 hours, single speaker): https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages
Thanks @blue-fish, I've already applied for Source 1. Will also check out the second one. Your efforts on this project are much appreciated! |
Hi @thehetpandya , have you made any progress on this recently? |
Hi @blue-fish, no, I couldn't make progress on this one. I tried fine-tuning https://github.com/Kyubyong/dc_tts instead, which gave clearer pronunciation of Hindi words.
Thanks for trying, @thehetpandya. If you decide to work on this later, please reopen the issue and I'll try to help.
Greetings @thehetpandya! Were you able to do real-time voice cloning of given English text in an Indian accent in your experiments? Could you please help/guide me with voice cloning of English text in my voice, with an Indian accent? Thanks
Hi @amrahsmaytas, no, I couldn't land on good results, and then I had to shift to another task. Still, I'd be glad if I could be of any help.
Thanks for the reply, Het! ✌
@GauriDhande and @thehetpandya, were you able to generate the model for cloning Hindi sentences? Please reply. Thanks.
Hi @rajuc110, sorry for the delayed response. No, I couldn't reproduce the results in Hindi and had to shift to another task in the meantime.
Can you share your work? |
I am also facing this issue. Does anyone have an update on it?
Hey guys, has anyone found a solution for Hindi voice cloning? Thanks
Has anybody already trained a model for the Hindi language?
Any progress on training Real-Time Voice Cloning on a Hindi dataset?