>>> maneeshkyadav
[May 26, 2020, 9:06pm]
Just thought I'd share some simple observations; I would enjoy hearing
from others to see if what I am seeing makes sense.
These are some of the better samples from a TT1 model (no bn, no forward
attention, and using phonemes) on a random 500-speaker subset of LibriTTS:
soundcloud link
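For anyone wanting to reproduce the "random 500-speaker subset" step, here is a minimal sketch of one way to do it. It is not the script used for the model above: the function names, the fixed seed, and the `(wav_path, speaker_id)` item format are all assumptions made for illustration.

```python
import random


def sample_speakers(speaker_ids, k=500, seed=0):
    """Return a reproducible random subset of k distinct speaker IDs.

    Hypothetical helper: `seed` is only there so the same subset can be
    regenerated later; it is not taken from the original post.
    """
    unique_ids = sorted(set(speaker_ids))
    if k > len(unique_ids):
        raise ValueError(f"asked for {k} speakers, only {len(unique_ids)} available")
    rng = random.Random(seed)
    return sorted(rng.sample(unique_ids, k))


def filter_utterances(items, keep_speakers):
    """Keep only (wav_path, speaker_id) pairs whose speaker was selected."""
    keep = set(keep_speakers)
    return [(path, spk) for path, spk in items if spk in keep]
```

Sampling at the speaker level (rather than the utterance level) keeps every utterance of a chosen speaker, so the per-speaker audio duration is unchanged by the subsetting.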
Some observations:
- quality is clearly much better.
- sound borgy than females (I set mel_fmin to 50.0; I'm not sure how to set
  it with a mixed-sex dataset). I'm still not totally sure what contributes
  the most to 'borginess'.
- sound reasonably good. I am sure they can be improved with better
  vocoding, but I don't think better vocoding can 'rescue' models that
  sound worse than the samples I currently have.
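To make the mel_fmin question a bit more concrete, the sketch below shows how the lower frequency bound moves the lowest mel filter centers. It uses the HTK-style mel formula with no external libraries; the actual feature extraction in a given TTS codebase may use a different mel scale, and `fmax=8000.0`, `n_mels=80`, and the compared `fmin` values other than 50.0 are assumptions chosen for illustration.

```python
import math


def hz_to_mel(f):
    """HTK-style Hz -> mel conversion (an assumption; other scales exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(fmin, fmax, n_mels=80):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * i) for i in range(1, n_mels + 1)]


# Compare where the three lowest filters sit for a few illustrative fmin values.
for fmin in (0.0, 50.0, 95.0):
    centers = mel_centers(fmin, 8000.0)
    print(fmin, [round(c, 1) for c in centers[:3]])
```

A single mel_fmin applies to every speaker, so with a mixed-sex dataset a value low enough for the lowest-pitched voices spends a few filters on frequencies that higher-pitched voices barely use. Whether that matters for 'borginess' is an open question here.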
Have people been able to do much better with multi-speaker models? I believe
the average amount of audio per speaker is reasonably short in LibriTTS (~21
min IIRC), but I haven't been able to get high quality across all
speakers. I've tried TT2 and the WaveRNN universal vocoder without
any luck so far (I'm currently trying to train my own WaveRNN model).
[Image: Screen Shot 2020-05-26 at 12.46.21 PM]
[This is an archived TTS discussion thread from discourse.mozilla.org/t/tacotron-1-libritts-multispeaker-model]