English synthesis is good, how about Chinese? #58
Never tried, sorry, but it'd be interesting to see.
Chinese also works well with this model. And compared with other Tacotron models, this model gets a clear voice in less time. In my test, with the same dataset, 10,000 steps can synthesize voice of quality similar to Tacotron at 50,000 steps.
@dvbfuns great to hear that. Do you have any samples to share? It'd be great to put them on the main page, if you don't mind.
@dvbfuns Which training dataset are you using? A Chinese version of TTS would be a good enhancement for this great repo.
@erogol I'd like to share the samples, but I have trouble accessing soundcloud.com. Any suggestions for sharing? Or I can share them with you by e-mail?
@dvbfuns e-mail would work: [email protected]. Thanks for your help.
@erogol, I've already sent you a mail with the model and samples, please kindly check.
@jinfagang I can put up whatever @dvbfuns can provide. But I also understand if he'd rather not share the model.
@erogol Could you resend the voice samples to me? I'd like to check the performance of the Chinese results. [email protected], thanks in advance.
@jinfagang anything I receive will be posted on GitHub as soon as I get it.
I'm closing this due to inactivity. Feel free to reopen.
@mazzzystar Thanks for sharing your results. They sound quite okay to me, but I am not a Chinese speaker. I'd suggest replacing the stop token layer with an RNN, as it was in the previous versions. The RNN-based model is larger but more reliable. Here is a snapshot:
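The swap suggested above (a small RNN over decoder outputs instead of a single feed-forward stop-token layer) could look roughly like this in PyTorch. This is an illustrative sketch only; the class name, dimensions, and wiring are assumptions, not the repo's actual code:

```python
import torch
import torch.nn as nn

class RNNStopNet(nn.Module):
    """Hypothetical RNN-based stop-token predictor: a small GRU over
    decoder outputs followed by a sigmoid head, instead of a single
    feed-forward layer. Sizes are assumptions for illustration."""

    def __init__(self, in_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_outputs: torch.Tensor) -> torch.Tensor:
        # decoder_outputs: (batch, time, in_dim)
        h, _ = self.rnn(decoder_outputs)
        # per-step stop probability in [0, 1]
        return torch.sigmoid(self.proj(h)).squeeze(-1)  # (batch, time)

x = torch.randn(2, 50, 256)
stop_probs = RNNStopNet()(x)
```

The recurrent state lets the predictor see how the decoder trajectory evolves over time, which is why it tends to be more reliable at deciding when to stop than a stateless layer.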
@erogol One part I think needs to be improved is that the voice texture is still a little bit "electronic" and unlike a real human, though it's good enough. I may start to focus on this part and try out some methods, such as a different vocoder or another attention method. BTW, have you considered using BERT? Finally, thanks again for your great work!
@mazzzystar I'd say attention is more about laying down the right pronunciation, but naturalness is a matter of the vocoder. You can also try the attention windowing implemented in the dev branch. When it comes to BERT, I've not tried it yet. One problem with BERT is that it requires more memory compared to an RNN, so it might be impractical to train on low-budget systems, which I prefer to stay away from. However, if you'd like to try, I am here to help. Thanks again!
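The attention-windowing idea mentioned above can be sketched in a few lines: scores outside a window around the last attended encoder step are masked before the softmax, preventing the alignment from jumping. A minimal numpy sketch, where the window sizes and function name are assumptions, not the dev-branch implementation:

```python
import numpy as np

def windowed_attention(scores, prev_focus, back=1, ahead=3):
    """Mask attention scores outside a window around the previously
    attended encoder step, then renormalize with softmax.
    Sketch of the windowing idea; window sizes are assumptions."""
    masked = np.full_like(scores, -np.inf)
    lo = max(0, prev_focus - back)
    hi = min(len(scores), prev_focus + ahead + 1)
    masked[lo:hi] = scores[lo:hi]
    e = np.exp(masked - masked.max())  # exp(-inf) -> 0 outside window
    return e / e.sum()

scores = np.array([0.1, 2.0, 0.3, 5.0, 0.2, 0.1])
w = windowed_attention(scores, prev_focus=1)
```

Because the mask forbids attending far ahead or behind, windowing mainly helps monotonic alignment at inference time; it does not by itself improve vocoder naturalness.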
@mazzzystar Your Chinese result is really impressive! May I ask which Chinese voice corpus you used? And how did you organize your data?
@jinfagang
@mazzzystar hello man, the
@mazzzystar @jinfagang @dvbfuns @erogol yes, I also tried it out on a Chinese corpus. The model just gets better alignment than the other tacotron2 projects, especially
@tsungruihon Which repo are you using?
@jinfagang just use
@tsungruihon Sorry, I mean, which corpus?
@jinfagang audio posted in some app.
Hello @erogol, thanks for your great work! I'm new to the TTS domain and trying to adapt your repo to a Chinese dataset (10,000 sentences, 12 hours). Training is still ongoing but seems promising. I have several doubts when looking into the details; I hope you could give me some advice:
@erogol thanks for the reply. Now I'm training without forward attention and the problem in the figure above seems to have disappeared for now; I will wait longer to see what it becomes. For the fine-tuning, unfortunately I don't even get the chance to see a loss spike, because I could not launch restore (or continue) training due to the issue I described in #318. Any idea for this? I tried many modifications but none of them worked.
@erogol Hello erogol, thanks for your great work and the replies to my questions. I finally succeeded in training a tacotron2 model on a public Chinese dataset, as well as a WaveRNN using predicted mels. The results sound good. I'd like to share some audio samples here in a few days.
@puppyapple Great to hear that!! Your question ... if you train WaveRNN with the final mel specs, you are likely to get better results. However, even without that it should sound good enough.
@erogol OK. Then I think I will give it a try anyway! 😁
Here are two samples from my Tacotron 2 + WaveRNN using the dev branch of this repo, thanks for your work! The alignment is shown in the figure (forward attention is enabled during inference). It seems the 'target' parameter has a significant impact on voice quality: the audio with target=4000 sounds more 'trembling' than the one with target=22000, which is much 'cleaner'.
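In WaveRNN-style batched inference, `target` controls how long each parallel segment is: a long utterance is folded into overlapping chunks that are generated simultaneously and then crossfaded back together. A smaller `target` means more segments and more crossfade seams, which may explain the 'trembling' at target=4000. A rough numpy sketch of the folding step, with names and defaults as assumptions rather than the repo's exact API:

```python
import numpy as np

def fold_for_batched_generation(mel, target=4000, overlap=400):
    """Split a long conditioning sequence into overlapping segments so
    batched inference can generate them in parallel. Each segment keeps
    `overlap` extra frames for crossfading at the seams. Sketch only."""
    total = len(mel)
    starts = list(range(0, max(total - overlap, 1), target))
    segments = []
    for s in starts:
        seg = mel[s:s + target + overlap]
        if len(seg) < target + overlap:  # zero-pad the final segment
            seg = np.pad(seg, (0, target + overlap - len(seg)))
        segments.append(seg)
    return np.stack(segments)  # (n_segments, target + overlap)

mel = np.arange(10000, dtype=float)
batch = fold_for_batched_generation(mel, target=4000, overlap=400)
```

With a larger `target` there are fewer seams (at the cost of slower, less parallel generation), which matches the cleaner sound reported for target=22000.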
@puppyapple Amazing, this is the best result I have ever seen on a Chinese dataset. Will you share a branch for this?
@jinfagang Thanks, nothing special has been added. You could check my forked code, which is all based on @erogol's work. A few modifications were made to fit the Chinese data (Biaobei 10000).
@puppyapple would you mind sharing your
@puppyapple On which branch? How should one prepare for training on Biaobei?
@jinfagang @tsungruihon All is in the dev branch. For the Biaobei dataset I have not made any extra preparations; I just followed the implementation in erogol's repo and got positive results. Still, this public dataset is small and its scripts lack punctuation symbols, so not all synthesised sentences are as natural as in my samples; some also come out with bad or wrong punctuation handling. In general the results are not bad.
@puppyapple thanks my friend. It seems that you use
@tsungruihon yes, and I also finetuned with a BN prenet as erogol described in #26.
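The BN-prenet finetuning mentioned above (see #26) replaces the always-on dropout in the decoder prenet with BatchNorm layers. A hypothetical PyTorch sketch of such a prenet, with layer sizes as assumptions:

```python
import torch
import torch.nn as nn

class BNPrenet(nn.Module):
    """Sketch of a prenet that uses BatchNorm instead of always-on
    dropout, as discussed for finetuning. Sizes are illustrative."""

    def __init__(self, in_dim: int = 80, sizes=(256, 256)):
        super().__init__()
        layers, prev = [], in_dim
        for size in sizes:
            layers += [nn.Linear(prev, size), nn.BatchNorm1d(size), nn.ReLU()]
            prev = size
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) -- one decoder frame per batch item
        return self.net(x)

y = BNPrenet()(torch.randn(4, 80))
```

The design trade-off discussed in #26: dropout at inference time injects noise that helps attention generalize but hurts audio fidelity, so finetuning with BatchNorm after convergence can sharpen the output.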
@puppyapple, I've got two questions, since I'm new to the project:
@puppyapple Thanks, my friend.
@puppyapple I noticed the audio you offered is 48000 Hz. Is your sample_rate in config.json 48000? Because upsampling (22 kHz -> 48 kHz) wouldn't add any high-frequency detail.
@WhiteFu Yes, since the Biaobei dataset is 48 kHz, I just kept it as is, without any resampling.
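Keeping everything consistent at 48 kHz as described above means scaling the frame parameters with the sample rate. A sketch of what the audio settings might look like; the key names mirror typical config.json fields but are assumptions, not the repo's exact schema:

```python
# Hypothetical audio config for 48 kHz Biaobei data; key names are
# assumptions modeled on common TTS config.json fields.
audio_config = {
    "sample_rate": 48000,   # match the recordings; no resampling
    "num_mels": 80,
    "fft_size": 2048,       # larger FFT for the higher sample rate
    "hop_length": 600,      # 12.5 ms at 48 kHz
    "win_length": 2400,     # 50 ms at 48 kHz
}

# Frame durations in milliseconds stay comparable to common 22 kHz
# setups (12.5 ms hop, 50 ms window), only the sample counts change.
hop_ms = 1000 * audio_config["hop_length"] / audio_config["sample_rate"]
win_ms = 1000 * audio_config["win_length"] / audio_config["sample_rate"]
```

The key point is that hop and window lengths are specified in samples, so they must be recomputed when the sample rate changes; reusing 22 kHz values at 48 kHz silently halves the effective frame duration.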
Thank you for your reply. I will check more details in your forked branch :)
@erogol @puppyapple Hi, I am a newbie in this area. I'm trying to use TTS2 to train a Chinese multi-speaker model. Here are my samples. And I have some questions.
The format of the file name is |
@chynphh Since I'm also new to the TTS domain, I can only try to answer your questions from my own point of view, which may not be correct.
@puppyapple thanks for your reply! After my experiments, using Pinyin is indeed better than phonemes.
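One way to realize "Pinyin instead of phonemes" is to treat each tone-numbered syllable as an atomic input symbol rather than expanding it into phonemes. A minimal, self-contained sketch; the space-separated convention and helper names are assumptions:

```python
def pinyin_to_symbols(text: str) -> list:
    """Treat each tone-numbered pinyin syllable ('ni3', 'hao3') as one
    input symbol instead of expanding it to phonemes. Minimal sketch;
    assumes the transcript is already space-separated pinyin."""
    return [syl for syl in text.strip().split(" ") if syl]

def build_symbol_table(corpus_lines) -> dict:
    """Collect every distinct syllable symbol and assign integer ids."""
    symbols = sorted({s for line in corpus_lines
                        for s in pinyin_to_symbols(line)})
    return {s: i for i, s in enumerate(symbols)}

lines = ["ni3 hao3", "ni3 men2 hao3"]
table = build_symbol_table(lines)
ids = [table[s] for s in pinyin_to_symbols("ni3 hao3")]  # -> [2, 0]
```

Mandarin has a small, closed syllable inventory (roughly 1,300 tone-syllable combinations), so a syllable-level symbol table stays compact while carrying tone information directly, which may be why it outperformed generic phonemizers here.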
@chynphh Mels generated by the trained Tacotron 2 model as input, and ground-truth audio files as target. Have you extracted the mels using the right config? You could refer to the benchmark notebook in this repo to do that; maybe a few modifications are needed. For #26 (comment), maybe try to locate the out-of-range sample to find out the reason (like a 'hop_length' mismatch, etc.).
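A quick way to catch the hop_length mismatch suggested above is to check, before WaveRNN training, that the mel frame count agrees with the audio length for each pair. A sketch with an assumed tolerance for STFT edge padding (the exact frame formula depends on the padding settings):

```python
def check_mel_audio_alignment(n_audio_samples: int,
                              n_mel_frames: int,
                              hop_length: int,
                              slack: int = 2) -> bool:
    """Sanity-check that a mel spectrogram and its ground-truth audio
    agree on hop_length, so vocoder indexing never goes out of range.
    `slack` frames of tolerance cover STFT edge padding; sketch only."""
    expected = n_audio_samples // hop_length
    return abs(n_mel_frames - expected) <= slack

# 1 second of 48 kHz audio with hop 600 -> ~80 frames: consistent
ok = check_mel_audio_alignment(48000, 80, 600)
# same audio checked against hop 256 -> ~187 frames expected: mismatch
bad = check_mel_audio_alignment(48000, 80, 256)
```

Running such a check over the whole dataset pinpoints the offending sample immediately, instead of failing mid-training with an out-of-range index.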
@puppyapple Thanks for your suggestions and answers, I will double check my code. |
Is there any blog post or writeup on these attempts to do TTS in Chinese?