Tacotron 1 libritts multispeaker model #210

JRMeyer · 2021-03-07T08:38:50Z

JRMeyer
Mar 7, 2021
Maintainer

>>> maneeshkyadav
[May 26, 2020, 9:06pm]

Just thought I'd share some simple observations; I would enjoy hearing
from others to see if what I am seeing makes sense.

These are some of the better samples from a tt1 model (no bn, no fwd
attn and using phonemes) on a random 500 speaker subset of LibriTTS:
soundcloud link

Some observations:
quality is clearly much better.

sound borgy than females (I set mel_fmin to 50.0; I'm not sure how to set
it with a mixed-sex dataset). I'm still not totally sure what contributes
the most to 'borginess'.

sound reasonably good. I am sure they can be improved with better
vocoding, but I don't think better vocoding can 'rescue' models that
sound worse than the samples I currently have.
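
On the mel_fmin question, a small sketch can show which frequencies the lowest mel bands cover for different settings. This is only an illustration, not the TTS repo's own filterbank code: it assumes the HTK-style mel formula, and the values `fmax=8000.0`, `n_mels=80`, and the alternative `fmin` values of 0.0 and 95.0 are picked for illustration rather than taken from this thread.

```python
import math


def hz_to_mel(hz):
    # HTK-style mel formula (an assumption; other libraries may use a different variant).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(fmin, fmax, n_mels):
    """Return n_mels + 2 edge frequencies in Hz, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# Compare where the lowest bands start for a few candidate mel_fmin values.
for fmin in (0.0, 50.0, 95.0):
    edges = mel_band_edges(fmin, 8000.0, 80)
    print(fmin, [round(e, 1) for e in edges[:4]])
```

Raising fmin discards everything below that frequency, so for a mixed-sex dataset a value chosen for higher-pitched voices may cut into the fundamental of lower-pitched speakers; printing the edges like this is one way to check what a given setting keeps.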

Have people been able to do much better with multi speaker? I believe
the average amount of audio per speaker is reasonably short in LibriTTS
(~21 min IIRC), but I haven't been able to get high quality across all
speakers. I've tried with TT2 and the WaveRNN universal vocoder without
any luck so far (I'm currently trying to train my own WaveRNN model).

[Image: Screen Shot 2020-05-26 at 12.46.21 PM, 587x500]

[This is an archived TTS discussion thread from discourse.mozilla.org/t/tacotron-1-libritts-multispeaker-model]

JRMeyer · 2021-03-07T08:38:52Z

>>> georroussos
[May 26, 2020, 9:14pm]

It is very interesting that you get better results with Taco and not
Taco2; my experience is the exact opposite, both single and
multispeaker. But I use a fork that is multispeaker oriented and
includes GST support for Taco2 as well. However, I can see why Taco
might give you better results much quicker (smaller network).

I wonder why you disabled forward attention; I would bet enabling it
would improve performance. I would also totally recommend checking out
Edresson's fork if you're very serious about multispeaker.

The amount of audio per speaker shouldn't be a problem, because the
dataset is large and the alignments are learned across all speakers in
the end; so if the vocabulary is large, it is fine.

Vocoding may help, but not always. In short, for multispeaker, if you
want good results, what matters is a good embedding representation, a
reasonably large dataset (at least the size of VCTK, which has 109
speakers and roughly 44 hours of speech) that is also of good recording
quality, maybe GST, and a vocoder. For the vocoder, I would recommend a
universal ParallelWaveGAN (which I am trying to train right now).

[Archived Post]

JRMeyer · 2021-03-07T08:38:55Z

>>> maneeshkyadav
[May 26, 2020, 9:21pm]

I swear I tried it with fwd attn on and got a worse result...I'll be
sure to really check next time.

[Archived Post]

JRMeyer · 2021-03-07T08:38:57Z

>>> georroussos
[May 26, 2020, 9:33pm]

Taco with forward attention is also what got me bad results, so it
checks out. I would suggest getting familiar with Edresson's fork; it's
multispeaker

[Archived Post]

JRMeyer · 2021-03-07T08:39:00Z

>>> maneeshkyadav
[May 26, 2020, 9:40pm]

Do all your speakers 'sound good' with that fork? I.e. do any sound
'borgy'?

[Archived Post]

JRMeyer · 2021-03-07T08:39:03Z

>>> georroussos
[May 26, 2020, 9:51pm]

They sound okay. I am using a private dataset though.

[Archived Post]

JRMeyer · 2021-03-07T08:39:05Z

>>> maneeshkyadav
[May 27, 2020, 4:50am]

Cool. I have a small amount of data I am training on (~20 min) with
LibriTTS, which ends up sounding pretty bad. I'm trying to understand
whether it is the data quality or the training params. If anyone else
can share their multispeaker LibriTTS experience/results to compare,
that would be awesome.

[Archived Post]

JRMeyer · 2021-03-07T08:39:08Z

>>> georroussos
[May 27, 2020, 7:17am]

My suggestion would be to try training a new model on Edresson's fork,
using Taco2 and GST. There is no reason why LibriTTS should be the
culprit because it is a big dataset. If you read
this paper, it is clear that the
speaker encoder plays a big role in the representations and Edresson has
changed the way the speaker embeddings are handled. I should also say
that, using his fork, I am able to get much better alignments much
sooner.

I will try to train the LibriTTS PWGAN soon, so you can try it too.

[Archived Post]

JRMeyer · 2021-03-07T08:39:10Z

>>> maneeshkyadav
[May 27, 2020, 7:23am]

> There is no reason why LibriTTS should be the culprit because it is a
> big dataset.

Just to be sure I am interpreting correctly, you believe that (in
principle) we should not get 'borgy' speakers training on LibriTTS? Do
you believe that some of the better speakers in the soundcloud link in
the OP are probably reasonable quality (they do sound like it to me)?

[Archived Post]

JRMeyer · 2021-03-07T08:39:13Z

>>> georroussos
[May 27, 2020, 7:36am]

My guess is that it would definitely be possible, but there may be a
lot of reasons for it. It is a fact that some voices are a bit harder to
model than others, and maybe the way the embeddings are learnt is not
good for them. In terms of vocabulary, I think LibriTTS should be very
extensive, so even a voice that sounds worse should still produce an
aligned output. The paper I mentioned above highlights that they learn
embeddings by sampling; that is, they extract speaker embeddings from
all utterances of a speaker and then average them. I guess this might
help in having a wider representation of the voice. With the small
dataset I have, I haven't really noticed this problem, and it was able
to synthesize with any voice from the dataset that I gave it. But I
remember I did use a vocoder, and it is also known that vocoders are
good with spectrogram representations. Specifically, I tried to
synthesize with one particular voice using both GL and WaveRNN, and I do
remember that the GL quality was much worse than I expected. So there
are a lot of factors at play.
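
To make the averaging idea above concrete, here is a minimal sketch of turning per-utterance embeddings into one speaker-level embedding. It is not the code from the paper or from any fork mentioned in this thread: the function names are hypothetical, the toy vectors stand in for real speaker-encoder outputs, and the L2 normalization before and after averaging is an assumption.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def average_speaker_embedding(utterance_embeddings):
    """Average per-utterance embeddings into one speaker-level embedding."""
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    dim = len(utterance_embeddings[0])
    # Normalize each utterance first so that no single utterance dominates the mean.
    normalized = [l2_normalize(e) for e in utterance_embeddings]
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


# Toy 3-dimensional embeddings for one hypothetical speaker.
utterances = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
speaker_vec = average_speaker_embedding(utterances)
print(speaker_vec)
```

Averaging over many utterances smooths out per-utterance variation (noise, prosody, content), which may be why it gives a wider representation of the voice than a single-utterance embedding.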

[Archived Post]
