English synthesis is good, how about Chinese? #58

lucasjinreal · 2018-10-26T13:39:52Z

Is there any blog post or attempt at doing TTS on Chinese with this?

erogol · 2018-10-26T22:05:48Z

Never tried, sorry, but it'd be interesting to see.

dvbfuns · 2018-11-21T08:51:53Z

Chinese also works well with this model. Compared with other Tacotron models, this one produces a clear voice in less training time. In my test, with the same dataset, 10,000 steps can synthesize a voice of similar quality to Tacotron at 50,000 steps.

erogol · 2018-11-21T10:45:43Z

@dvbfuns great to hear that. Do you have any samples to share? It'd be great to put into the main page, if you don't mind.

lucasjinreal · 2018-11-22T02:00:10Z

@dvbfuns Which training dataset are you using? A Chinese version of TTS would be a good enhancement to this great repo.

dvbfuns · 2018-11-22T02:54:32Z

@erogol I would like to share the samples, but I have trouble accessing soundcloud.com. Any suggestions for how to share them? Or I can send them to you by e-mail?

erogol · 2018-11-22T09:39:54Z

@dvbfuns e-mail would work [email protected] . Thanks for your help.

erogol · 2018-11-22T09:41:11Z

@dvbfuns you might even consider opening a PR with your Chinese changes. I agree @dvbfuns, that would be a great addition.

dvbfuns · 2018-11-23T01:41:17Z

@erogol, I have already sent you an e-mail with the model and samples; please take a look.

lucasjinreal · 2018-11-23T01:51:30Z

@erogol Would you like to add this to the README or a model zoo? @dvbfuns BTW, did you use your own labeled dataset?

erogol · 2018-11-23T10:06:25Z

@jinfagang I can put up whatever @dvbfuns can provide. But I'd also understand if he doesn't want to share the model.

lucasjinreal · 2018-11-24T02:41:04Z

@erogol Could you resend the voice samples to me? I'd like to check the quality of the Chinese results. [email protected] , thanks in advance

erogol · 2018-11-24T22:37:18Z

@jinfagang anything I have will be posted on GitHub as soon as I receive it.

erogol · 2018-12-14T11:32:55Z

I'm closing this due to inactivity. Feel free to reopen.

mazzzystar · 2019-01-17T09:32:14Z

@jinfagang @erogol
Hi! I'd like to share some Chinese results. You can download demo.zip

The 'Decoder stopped with max_decoder_steps' warning still sometimes appears when inferring long sentences (>20). I'd be glad to hear if you know a good way to handle it.

erogol · 2019-01-17T11:40:34Z

@mazzzystar Thanks for sharing your results. They sound quite okay to me, but I am not a Chinese speaker.

I'd suggest replacing the stop token layer with an RNN, as it was in the previous versions. The RNN-based model is larger but more reliable. Here is a snapshot:

import torch.nn as nn


class StopNet(nn.Module):
    r"""
    Predicting stop-token in decoder.
    
    Args:
        r (int): number of output frames of the network.
        memory_dim (int): feature dimension for each output frame.
    """
    
    def __init__(self, r, memory_dim):
        super(StopNet, self).__init__()
        self.rnn = nn.GRUCell(memory_dim * r, memory_dim * r)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(r * memory_dim, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, inputs, rnn_hidden):
        """
        Args:
            inputs: network output tensor with r x memory_dim feature dimension.
            rnn_hidden: hidden state of the RNN cell.
        """
        rnn_hidden = self.rnn(inputs, rnn_hidden)
        outputs = self.relu(rnn_hidden)
        outputs = self.linear(outputs)
        outputs = self.sigmoid(outputs)
        return outputs, rnn_hidden
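
For context, here is a minimal sketch of how a stop-token probability is typically consumed at inference time, which is where the 'Decoder stopped with max_decoder_steps' warning comes from. It is illustrative only: `decode_until_stop`, `stop_threshold`, and the hard-coded probability lists are assumptions for this example and are not taken from the repo.

```python
def decode_until_stop(stop_probs, max_decoder_steps=500, stop_threshold=0.5):
    """Return (num_steps, hit_limit) for a sequence of per-step stop probabilities.

    stop_probs stands in for the sigmoid output of a stop network such as
    StopNet; a real decoder would compute one value per decoder step.
    """
    steps = 0
    for prob in stop_probs:
        steps += 1
        if prob > stop_threshold:
            return steps, False  # the stop token fired
        if steps >= max_decoder_steps:
            return steps, True  # safety limit reached: the warning case
    return steps, False


# A well-behaved stop net: the probability rises at the end of the utterance.
good = [0.01, 0.02, 0.05, 0.9]
# A stop net that never fires: decoding only ends at the step limit.
bad = [0.1] * 1000

print(decode_until_stop(good))
print(decode_until_stop(bad, max_decoder_steps=50))
```

If the second case shows up mostly on long sentences, the stop predictor has probably not seen enough long utterances yet, or the step limit is simply too low for them.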

mazzzystar · 2019-01-17T12:07:56Z

@erogol
Thanks for your reply, I will try it out.
Actually, in Chinese it's really important to know where to pause and how long each pause should be within a single sentence. Normally a pause happens several times, and if all the pauses are correct, the result is considered to have "good naturalness". As far as I know, my model based on mozilla-TTS outperforms most current Mandarin Chinese TTS systems in naturalness. Thanks for your work!

One part I think needs improvement is that the voice texture is still a little bit "electronic" and unlike a real human, though it's good enough. I may start to focus on this part and try out some methods, such as a different vocoder or another attention method. BTW, have you considered using a Transformer to replace the current RNN part? I noticed that more and more people prefer Transformers to RNNs after BERT came out.

Finally, thanks again for your great work!

erogol · 2019-01-17T13:14:34Z

@mazzzystar
Thanks for your words :). Yeah, I'd guess things would be much better if we could combine TTS with a neural vocoder. It is in progress, but we need some time to solve some internal technicalities before we continue. You could also try the WORLD vocoder. There is a discussion about it in the issues as well, with some example scripts to help you. It shouldn't be so hard.

I'd say attention is more about getting the pronunciation right, while naturalness is a matter of the vocoder. You can also try the attention windowing implemented in layers/attention.py on the dev branch. It would give more monotonic attention with less noise. Based on the window size, you can also roughly control the pace of the speech. You can also try multiplying the attention weights by ~4 before applying normalization. That would also lead to a clearer alignment.
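
To make the two suggestions concrete, here is a rough pure-Python sketch of one way to read them: mask the scores outside a window around the current position, scale what remains by a factor of about 4, then normalize. The function names, the window sizes, and the choice of softmax as the normalization are assumptions for this example; the actual code in layers/attention.py may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_sharp_attention(scores, center, win_back=1, win_front=3, scale=4.0):
    """Mask scores outside [center - win_back, center + win_front], scale, normalize."""
    masked = []
    for i, s in enumerate(scores):
        inside = center - win_back <= i <= center + win_front
        # Positions outside the window get -inf, so they receive zero weight.
        masked.append(s * scale if inside else float("-inf"))
    return softmax(masked)


scores = [0.2, 1.0, 1.5, 0.9, 0.3, 0.1]
plain = softmax(scores)
sharp = windowed_sharp_attention(scores, center=2)

print(["%.3f" % w for w in plain])
print(["%.3f" % w for w in sharp])
```

The scaled version concentrates more weight on the highest-scoring encoder position, and the window prevents the attention from jumping far backward or forward in a single step.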

When it comes to BERT, I've not tried it yet. One problem with BERT is that it requires more memory than an RNN. That could make training difficult on low-budget systems, which is something I'd prefer to avoid. However, if you'd like to try, I am here to help.

Thanks again!

lucasjinreal · 2019-01-18T02:14:18Z

@mazzzystar Your Chinese result is really impressive! May I ask which Chinese voice corpus you used? Or how you organized your data?

mazzzystar · 2019-01-18T02:40:34Z

@jinfagang
Sorry, I can't share the details because it's part of my current work, and doing so might hurt my company's interests. I hope you can understand. I'm here just to let you know that mozilla-TTS works well for Chinese synthesis.

OswaldoBornemann · 2019-02-15T02:32:41Z

@mazzzystar Hello, the demo.zip file doesn't seem to work. How can I download it?

OswaldoBornemann · 2019-02-25T02:19:31Z

@mazzzystar @jinfagang @dvbfuns @erogol Yes, I also tried it on a Chinese corpus. The model gets a better alignment than other Tacotron 2 projects, especially nvidia/tacotron2. But I haven't yet listened to the quality of the synthesized voice.

lucasjinreal · 2019-02-25T02:32:03Z

@tsungruihon Which repo are you using?

OswaldoBornemann · 2019-02-25T03:01:54Z

@jinfagang I just use Mozilla TTS.

lucasjinreal · 2019-02-25T03:56:19Z

@tsungruihon Sorry, I mean, which corpus

OswaldoBornemann · 2019-02-25T05:45:06Z

@jinfagang Audio that was posted in some app.

puppyapple · 2019-11-27T02:04:37Z

Hello @erogol, thanks for your great work! I'm new to the TTS domain and am trying to adapt your repo to a Chinese dataset (10,000 sentences, 12 hours). Training is still ongoing but seems promising. I have several questions after looking into the details, and I hope you can give me some advice:

1. I noticed that in character training mode (use_phonemes=false) there is no 'enable_eos_bos' option to add an end token to the end of sentences, which I saw a lot in other discussions such as NVIDIA/Tacotron2. Instead, the model is left to learn this through the stopnet. In this case, should I always wait for the stop loss to converge to zero? For now, my alignment always has gaps after the stop point, as shown below, along with the 'Decoder stopped with max_decoder_steps' warning, so I assume the model has not learned when to stop. Why not add a stop token here to help?
2. On training time: I saw your shared pretrained LJSpeech models on Google Drive, which you trained for 160k steps with a batch size of 16. Should we use the eval loss to decide when to stop training, or just let training continue as long as the training loss improves (overfitting?)
3. When I tried the NVIDIA/Tacotron2 repo, there was a problem with restoring training (the loss spikes after the first step and the model starts from scratch), which I found is probably related to the Adam optimizer. Have you ever encountered such an issue?

Thanks!

erogol · 2019-12-02T09:29:12Z

1. It should learn to stop after enough training, and that is more reliable than using eos. Otherwise, you can also try eos.
2. Neither the eval nor the train loss exactly reflects the final performance. The best approach is to listen for yourself and pick the best-sounding model.
3. In my implementation, fine-tuning should work flawlessly.
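
For readers who want to try the eos route, here is a minimal sketch of what adding begin/end tokens to a character sequence can look like. The symbol inventory, the `^` and `~` markers, and the `text_to_sequence` signature are hypothetical and chosen for this example; they are not the repo's actual text front end.

```python
# Hypothetical symbol inventory; a real project defines its own.
PAD, BOS, EOS = "_", "^", "~"
CHARACTERS = "abcdefghijklmnopqrstuvwxyz12345 ,.?!"
SYMBOLS = [PAD, BOS, EOS] + list(CHARACTERS)
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def text_to_sequence(text, add_bos_eos=False):
    """Map characters to integer IDs, skipping characters not in CHARACTERS."""
    ids = [SYMBOL_TO_ID[c] for c in text.lower() if c in CHARACTERS]
    if add_bos_eos:
        # Explicit markers give the model a fixed symbol to attend to at the
        # start and end of every sentence.
        ids = [SYMBOL_TO_ID[BOS]] + ids + [SYMBOL_TO_ID[EOS]]
    return ids


plain = text_to_sequence("ni3 hao3")
marked = text_to_sequence("ni3 hao3", add_bos_eos=True)
print(plain)
print(marked)
```

With the markers enabled, every training sentence ends in the same symbol, which gives the attention a consistent final target. Whether that helps more than a well-trained stopnet is an empirical question.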

puppyapple · 2019-12-02T11:35:51Z

@erogol Thanks for the reply. I'm now training without forward attention, and the problem in the figure above seems to have disappeared for now. I will wait longer to see how it turns out. As for fine-tuning, unfortunately I don't even get the chance to see a loss spike, because I cannot launch restore (or continue) training due to the issue I described in #318. Any idea about this? I tried many modifications but none of them worked.

puppyapple · 2019-12-11T07:44:29Z

@erogol Hello erogol, thanks for your great work and for replying to my questions. I finally succeeded in training a Tacotron 2 model on a public Chinese dataset, as well as a WaveRNN using predicted mels. The results sound good. I'd like to share some audio samples here in a few days.
Following #26, I'm now trying to fine-tune the Tacotron 2 model with the 'BN' prenet, and the improvement in loss is significant! It is nearly the same as in the figures you shared. The training is still ongoing, and I will compare the generated audio afterwards.
Just one small question: after fine-tuning with the 'BN' prenet, is it necessary to retrain (or fine-tune) my WaveRNN model with the new predicted mels? Thanks!

erogol · 2019-12-11T08:05:56Z

@puppyapple Great to hear that !!

On your question: if you train WaveRNN with the final mel specs, you are likely to get better results. However, even without that it should sound good enough.

puppyapple · 2019-12-11T08:55:49Z

@erogol OK. Then I think I will give it a try anyway! 😁

puppyapple · 2019-12-16T01:07:03Z

Here are two samples from my Tacotron 2 + WaveRNN using the dev branch of this repo, thanks for your work! The alignment is shown in the figure (forward attention is enabled during inference). It seems the 'target' parameter has a significant impact on voice quality: the audio with target=4000 sounds more 'trembling' than the one with target=22000, which is much 'cleaner'.
samples.zip
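
One plausible reading of this observation, assuming 'target' is the segment length used for batched WaveRNN generation (as in some open-source WaveRNN implementations, where the signal is split into overlapping segments that are generated in parallel and cross-faded): a smaller target means more segments and therefore more seams. The sketch below only counts segments; `fold_boundaries` and the numbers are illustrative assumptions, and the thread does not confirm this as the cause.

```python
def fold_boundaries(total_len, target, overlap):
    """Return (start, end) index pairs for overlapping segments.

    Each segment covers target + 2 * overlap samples, and consecutive
    segments share `overlap` samples that can be cross-faded.
    The last segment may extend past total_len and would need padding.
    """
    step = target + overlap
    seg_len = target + 2 * overlap
    bounds = []
    start = 0
    while True:
        bounds.append((start, start + seg_len))
        if start + seg_len >= total_len:
            break
        start += step
    return bounds


total_len = 100000  # hypothetical number of output samples
small = fold_boundaries(total_len, target=4000, overlap=400)
large = fold_boundaries(total_len, target=22000, overlap=400)

# Every boundary between two segments is a seam that must be cross-faded.
print(len(small) - 1, "seams with target=4000")
print(len(large) - 1, "seams with target=22000")
```

Under these assumptions the smaller target produces several times as many seams for the same utterance, which would be consistent with the 'trembling' impression, though other factors could also be involved.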

lucasjinreal · 2019-12-16T14:21:00Z

@puppyapple Amazing, this is the best result I have ever seen on a Chinese dataset. Will you share a branch for this?

puppyapple · 2019-12-17T02:59:51Z

@jinfagang Thanks, nothing special has been added. You can check my forked code, which is all from @erogol's work. A few modifications were made to fit the Chinese data (Biaobei 10000).

OswaldoBornemann · 2019-12-17T03:59:37Z

@puppyapple would you mind sharing your config.json file ?

lucasjinreal · 2019-12-17T06:45:02Z

@puppyapple On which branch? How do I prepare for training on Biaobei?

puppyapple · 2019-12-17T15:20:49Z

@jinfagang @tsungruihon Everything is in the dev branch. For the Biaobei dataset I have not done any extra preparation; I just followed erogol's implementation and got positive results. Still, this public dataset is too small and lacks punctuation symbols in the scripts, so not all synthesized sentences are as natural as those in my samples, and some also have bad or wrong punctuation. In general the results are not bad.

OswaldoBornemann · 2019-12-18T04:16:43Z

@puppyapple thanks my friend. It seems that you use Tacotron2 with location sensitive attention instead of forward attention, according to the config.json from your dev branch.

puppyapple · 2019-12-18T06:55:45Z

@tsungruihon yes and I also finetuned with BN prenet like erogol described in #26.

shad94 · 2019-12-18T09:28:35Z

@puppyapple, I have two questions, since I am new to the project:

1. Have you changed the content of the files in TTS/tests for Chinese? The same question for TTS/mozilla-us-phonemes.
2. How do I generate the encoder vs. decoder graph?

Thank you

puppyapple · 2019-12-18T09:38:37Z

@shad94

1. I didn't use TTS/tests for testing, but rather the benchmark Jupyter notebook in TTS/notebooks, with some modifications.
2. It's already implemented by erogol in the logger class.

OswaldoBornemann · 2019-12-19T06:19:45Z

@puppyapple . Thanks my friend.

WhiteFu · 2020-01-07T07:23:49Z

@puppyapple I found that the audio you provided is 48000 Hz. Is the sample_rate in your config.json 48000? I ask because upsampling (22 kHz -> 48 kHz) would not add high-frequency detail.

puppyapple · 2020-01-07T07:30:21Z

@WhiteFu Yes, since the Biaobei dataset is 48 kHz, I just kept it as it is, without any upsampling.
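
A quick way to confirm a dataset's native rate before setting sample_rate in config.json is to read it from the WAV header. The sketch below uses only Python's standard `wave` module and builds a tiny in-memory file so that it runs without a dataset; `wav_sample_rate` is a hypothetical helper name.

```python
import io
import struct
import wave


def wav_sample_rate(file_obj):
    """Read the sample rate from a WAV file header."""
    with wave.open(file_obj, "rb") as wav:
        return wav.getframerate()


# Build a tiny in-memory 48 kHz mono WAV so the example runs without a dataset.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(48000)
    wav.writeframes(struct.pack("<4h", 0, 1000, 0, -1000))
buffer.seek(0)

rate = wav_sample_rate(buffer)
nyquist = rate / 2  # highest frequency the file can represent
print(rate, nyquist)
```

This is also why upsampling does not help: audio stored at roughly 22 kHz holds nothing above about 11 kHz, and resampling to 48 kHz cannot recreate the missing band.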

WhiteFu · 2020-01-07T07:32:33Z

Thank you for your reply. I will check more details in your fork :)

chynphh · 2020-01-12T18:57:44Z

@erogol @puppyapple Hi, I am a newbie in this area. I'm trying to use TTS2 to train a Chinese multi-speaker model. Here are my samples. And I have some questions.

The generated audio files are understandable but very noisy (the samples are in samples/phonemes/120Kstep/). I did not use any vocoder (GL or WaveRNN). Is this normal?
How should I deal with this problem? By using a vocoder, or is there another approach?
For Chinese, is it better to use pinyin or phonemes? When I use phonemes, some tones are not accurate, like a non-native speaker speaking Chinese. My model using pinyin has not yet converged.
Why is there a big difference between training and testing? I set the same parameters for the synthesis function. The results for the test text during training (train.py) are much better than during testing (Benchmark.ipynb). The training-time samples are in samples/without_phonemes(use pinyin)/29037steps and samples/without_phonemes(use pinyin)/30886steps. The testing-time samples are in samples/without_phonemes(use pinyin)/30000steps.
Is there a big difference between training WaveRNN with raw wav files and with the TTS2 model? Which is better? Is there a guide to training the WaveRNN model?

The file name format is {text}-{speaker id}-{train steps}.
Thank you very much! :)

puppyapple · 2020-01-16T02:37:09Z

@chynphh Since I'm also new to the TTS domain, I can only try to answer your questions from my own point of view, which may not be correct.

Using a vocoder will certainly give better audio quality. In this repo, erogol has already implemented GL to generate test audio for TensorBoard display; have you listened to the result? I've tried both WaveRNN and ParallelWaveGAN. WaveRNN can reach high quality, but only with a large 'overlap' parameter, which increases inference time. The ParallelWaveGAN result is a little noisy, though not very noticeably, and it is much faster.
In my own tests, pinyin is sufficient to get good pronunciation.
30k steps seems far from enough; you could wait longer.
I have not tried ground-truth mels from raw wav files for WaveRNN; Tacotron 2 generated mels seem to work well. You can try to understand erogol's implementation and give it a try. For me it's clear enough.
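
As an illustration of the pinyin route, the sketch below assumes transcripts are written as tone-numbered pinyin (for example "ni3 hao3"), so that the tone is an explicit character the model sees. The thread does not state the exact transcript format, so the format, the regular expression, and `split_tones` are assumptions for this example.

```python
import re

# One pinyin syllable followed by a tone digit (5 = neutral tone here).
SYLLABLE = re.compile(r"^([a-zü]+)([1-5])$")


def split_tones(pinyin_line):
    """Split 'ni3 hao3' into [('ni', 3), ('hao', 3)]; raise on malformed tokens."""
    pairs = []
    for token in pinyin_line.lower().split():
        match = SYLLABLE.match(token)
        if match is None:
            raise ValueError("not tone-numbered pinyin: %r" % token)
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


print(split_tones("ni3 hao3"))
```

A check like this can catch transcripts where the tone digit is missing before training starts. A missing tone is one possible source of the inaccurate tones described above, though not necessarily the one at play here.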

chynphh · 2020-01-16T06:46:56Z

@puppyapple thanks for your reply!

After my experiments, using pinyin is indeed better than phonemes.
I trained Tacotron 2 for 240K steps. The results were good but still a bit noisy.
Now I'm trying to train a WaveRNN model. I tried to use the mels generated by Tacotron 2, but they do not work with the raw wav files. It seems to be caused by a mismatch between the raw wav files and the mels generated by Tacotron 2 (#26 (comment)).
So I trained WaveRNN with raw wav files and ground-truth mels. So far, it hasn't worked after 180K steps.
When training WaveRNN with mels from Tacotron 2, which wavs do you use: the ground-truth wav files or the wavs generated by Tacotron 2?

puppyapple · 2020-01-16T07:05:27Z

@chynphh Mels generated by the trained Tacotron 2 model as input, and ground-truth audio files as the target. Have you extracted the mels using the right config? You could refer to the benchmark notebook in this repo to do that; a few modifications may be needed. For #26 (comment), maybe try to locate the out-of-range sample to find the reason (such as a 'hop_length' mismatch).
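
A simple sanity check for the kind of mismatch mentioned here is to compare each wav's length with the number of frames in its mel. The sketch assumes a centered STFT, where the frame count is one plus the sample count divided by hop_length (as with librosa's default settings). `expected_num_frames`, `check_pair`, and the tolerance are hypothetical names and values for this example.

```python
def expected_num_frames(num_samples, hop_length):
    """Frames produced by a centered STFT: one per hop, plus one."""
    return num_samples // hop_length + 1


def check_pair(num_samples, num_mel_frames, hop_length, tolerance=1):
    """Return True if a wav and its mel spectrogram plausibly line up."""
    expected = expected_num_frames(num_samples, hop_length)
    return abs(expected - num_mel_frames) <= tolerance


# One second of 48 kHz audio.
num_samples = 48000

# A mel extracted with hop_length=256 has 188 frames ...
mel_frames = expected_num_frames(num_samples, 256)
# ... which lines up when the vocoder also assumes hop_length=256,
print(check_pair(num_samples, mel_frames, hop_length=256))
# ... but not when the vocoder config says hop_length=600.
print(check_pair(num_samples, mel_frames, hop_length=600))
```

Running such a check over the whole training list makes it easier to find the specific files that trigger an out-of-range error. Mels predicted by Tacotron 2 may carry a few extra padded frames, so the tolerance may need loosening.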

chynphh · 2020-01-16T07:09:26Z

@puppyapple Thanks for your suggestions and answers, I will double check my code.

erogol closed this as completed Dec 14, 2018

puppyapple mentioned this issue Nov 28, 2019

Problem with finetune model #318

Closed
