Training the model end2end #11
Comments
I'm not sure what you mean by never being able to predict useful values for inference. It looks like you never used
Sorry that it is unclear what I mean from the code. Let me try to explain. This is just the training pass; the idea is much the same as FastSpeech-like models. I try to optimize the predictors with an L1 loss alongside the model, instead of in a separate second-stage training. Inference is pretty much the same as your code. What I mean is that the predicted F0 and energy values do not really approximate the correct values; when you actually plot them, they look quite random. I can get you the other parts of the training code if you are interested.
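The joint objective described above (mel reconstruction plus L1 terms on the F0 and energy predictors, optimized together rather than in a second stage) could be sketched roughly as follows. This is a minimal illustration, not the actual code from the branch; the function names and loss weights are assumptions.

```python
import numpy as np

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error -- the L1 objective mentioned above."""
    return float(np.mean(np.abs(pred - target)))

def joint_loss(mel_pred, mel_target,
               f0_pred, f0_target,
               energy_pred, energy_target,
               lambda_f0=1.0, lambda_energy=1.0) -> float:
    """One combined training objective: mel reconstruction plus
    predictor L1 terms, backpropagated together in a single stage.
    The lambda weights are illustrative, not from the thread."""
    return (l1_loss(mel_pred, mel_target)
            + lambda_f0 * l1_loss(f0_pred, f0_target)
            + lambda_energy * l1_loss(energy_pred, energy_target))
```

In an actual training loop these terms would share one backward pass, so the text encoder receives gradients from the predictor losses as well as from the mel loss, which is exactly where the underfitting question below comes in.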
How does this compare to those predicted when you trained them in two stages? Did you find the same problem when you trained stage 1 and 2 separately? How does the gradient flow to the predictor with your
Training in two stages worked way better, but I did not wait for the second stage to converge; it produced speech of reasonable quality. BTW, how do you decide when to stop training each stage?
Then there could be some problem when you train them end-to-end. Can you try to include all objectives in the second stage, not just the L1 loss? FastSpeech 2 uses a binned representation, but here we use the exact F0 curve, so using an L1 loss on the curve alone may not be sufficient. As for how I decide when to stop training: it is simple for the first stage, as it is just like training a vocoder for reconstruction, so you can focus on the validation mel loss. For the second stage it can be a little more difficult, but usually I focus on the duration loss. Once the duration loss converges, the quality of the model is pretty much fixed. The F0 might still change a little, but the effect is negligible, so you can stop at any time after the duration loss converges. This is likely because convergence of the duration loss means the representation learned from the text no longer changes, and F0 prediction depends on the representations learned for duration prediction.
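The stopping heuristic described above ("stop once the duration loss converges") could be sketched as a simple plateau check on the validation duration loss. The window size and tolerance below are illustrative assumptions, not values from the thread.

```python
def duration_loss_converged(losses, window=5, tol=1e-3):
    """Heuristic stopping check: treat the duration loss as converged
    when the spread of the last `window` values falls below `tol`.
    `window` and `tol` are illustrative defaults, not from the thread."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tol
```

In practice you would call this on the running list of per-epoch validation duration losses and stop (or start counting patience) once it returns True.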
I think I can close this issue because I have managed to train it end-to-end in StyleTTS 2: https://github.com/yl4579/StyleTTS2. Please follow this repo. I will clean up the code and make it public around July or August. |
Hi Eren, not sure if you are still interested in StyleTTS (and StyleTTS 2). StyleTTS 2 has now gotten some attention, and people are interested in multilingual support: yl4579/StyleTTS2#41. It would be greatly appreciated if you could help integrate this model into Coqui with multilingual support. You can email me at [email protected] if you have any further questions.
This is just a heads-up about the discussion in #7.
I tried training the model end2end in different ways, but the F0 and energy predictors were always underfitting, even though the eval loss kept going down. They were never able to predict useful values for inference.
Here is roughly my forward pass. I can also share the branch if useful. Happy to see any feedback.