poor performance compared to the main paper? #411

Closed
amintavakol opened this issue Jul 9, 2020 · 6 comments

Comments

@amintavakol

Hi,
For those of you working with this repo to synthesize different voices: have you noticed a huge difference between the voices generated by this repo and the samples released with the main paper here? If so, let's discuss and try to find out the reason(s).

@ghost

ghost commented Jul 9, 2020

I would recommend reading #41 and the first few posts of #126 for some context. The main points as I understand them:

  • The SV2TTS authors trained their speaker encoder for 50M steps; this one has just over 1M.
  • The SV2TTS authors used an embedding size of 768; this one uses 256 (a sketch of where that parameter lives is at the end of this comment).
  • The SV2TTS authors used a larger, proprietary dataset for encoder training, which gives better results.

The speaker encoder was trained on a proprietary voice search corpus containing 36M utterances with median duration of 3.9 seconds from 18K English speakers in the United States. This dataset is not transcribed, but contains anonymized speaker identities. It is never used to train synthesis networks.

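For readers unfamiliar with where these numbers live, here is a minimal sketch (not the repo's actual code) of a GE2E-style speaker encoder, with hyperparameter values assumed to mirror this repo's documented defaults (40 mel channels, a 3-layer LSTM of width 256, a 256-dimensional embedding). The 256-vs-768 difference mentioned above comes in through the width/embedding parameters below.

```python
# Minimal sketch of a GE2E-style speaker encoder (illustrative, not the repo's code).
# Hyperparameter values are assumptions based on this repo's documented defaults.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, mel_n_channels=40, hidden_size=256,
                 embedding_size=256, num_layers=3):
        super().__init__()
        # Stacked LSTM over mel-spectrogram frames
        self.lstm = nn.LSTM(mel_n_channels, hidden_size, num_layers, batch_first=True)
        # Projection to the utterance embedding: this is where embedding size is set
        self.linear = nn.Linear(hidden_size, embedding_size)
        self.relu = nn.ReLU()

    def forward(self, mels):
        # mels: (batch, frames, mel_n_channels)
        _, (hidden, _) = self.lstm(mels)
        embeds_raw = self.relu(self.linear(hidden[-1]))
        # L2-normalize so embeddings lie on the unit hypersphere, as in GE2E
        return embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
```

Note that changing the embedding size also changes the conditioning input the synthesizer expects, so the synthesizer would need to be retrained to match.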

@CorentinJ
Owner

CorentinJ commented Jul 10, 2020

Don't forget these too:

  • Use LibriTTS instead of LibriSpeech in order to have punctuation.
  • LibriTTS needs to be curated to remove speakers with bad prosody.
  • You can lower the upper bound I put on utterance duration; I suspect this removes long utterances that are more likely to contain pauses (I formally evaluated models trained this way and found they generate long pauses less often). It also trains faster and has no drawbacks: with a good attention paradigm, the model can generate longer sentences than it saw in training. (A rough sketch of such a duration filter is given after this list.)
  • The attention paradigm needs to be replaced; forward attention is poor. (A generic sketch of one common alternative also follows below.)
  • If the attention paradigm holds prosody-specific parameters, it may be complemented with a speaker embedding mechanism.

#364 (comment)
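To make the duration bullet concrete, here is a rough sketch of filtering a LibriTTS-style corpus by a maximum utterance length before preprocessing. The directory layout, the 7-second cap, and the use of soundfile are illustrative assumptions, not the repo's actual preprocessing code.

```python
# Hypothetical duration filter for a LibriTTS-style corpus (not the repo's code).
from pathlib import Path
import soundfile as sf

MAX_DURATION_S = 7.0  # assumed upper bound; tune it per the discussion above

def keep_utterance(wav_path: Path) -> bool:
    info = sf.info(str(wav_path))  # reads only the header, so this is cheap
    return info.frames / info.samplerate <= MAX_DURATION_S

dataset_root = Path("LibriTTS/train-clean-100")  # assumed location
kept = [p for p in dataset_root.rglob("*.wav") if keep_utterance(p)]
print(f"Keeping {len(kept)} utterances of at most {MAX_DURATION_S:.1f} s")
```

On the attention bullet, the comment does not say what to replace forward attention with. One widely used alternative in Tacotron-style synthesizers is location-sensitive attention (as in Tacotron 2); the generic PyTorch sketch below is only an illustration of that mechanism, with all names and dimensions assumed rather than taken from this repo.

```python
# Generic location-sensitive attention (Tacotron 2 style); an illustration only.
import torch
import torch.nn as nn

class LocationSensitiveAttention(nn.Module):
    def __init__(self, attn_dim=128, query_dim=1024, memory_dim=512,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # Convolution over the previous + cumulative attention weights
        self.location_conv = nn.Conv1d(2, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_dense = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, attention_weights_cat):
        # query: (B, query_dim) decoder state
        # memory: (B, T, memory_dim) encoder outputs
        # attention_weights_cat: (B, 2, T) previous and cumulative weights
        processed = (self.query_layer(query).unsqueeze(1)
                     + self.memory_layer(memory)
                     + self.location_dense(
                         self.location_conv(attention_weights_cat).transpose(1, 2)))
        energies = self.v(torch.tanh(processed)).squeeze(-1)          # (B, T)
        weights = torch.softmax(energies, dim=1)                      # alignment
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (B, memory_dim)
        return context, weights
```

Conditioning the alignment on the previous and cumulative attention weights discourages skipping and repetition, which is one reason this kind of mechanism tends to handle long utterances better than purely content-based schemes.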

@CorentinJ
Owner

At Resemble.AI we also have better results by using a new vocoder that my colleague @fatchord developed. I believe he's about to publish the paper he wrote about it.

@ghost

ghost commented Jul 12, 2020

We can reduce artifacts in the vocoder with additional training (#126 (comment)). However, it does not make a perceptible difference in the cloned voice. This result also suggests that, to the extent the vocoder affects output quality, we are reaching the limits of what is possible with WaveRNN.

@winterfate

> At Resemble.AI we also have better results by using a new vocoder that my colleague @fatchord developed. I believe he's about to publish the paper he wrote about it.

Absolutely astounding what you're all doing at Resemble. I saw the LTT videos done in cooperation with you lot as well; I was very happy to see some publicity in front of the average tech nerd.

@CorentinJ
Owner

Yeah, and the LTT video uses models dating from January; our sound quality has improved a lot since then.
