Questions about the toolbox from @mbdash #433
Q1: Do you see a way in the future to reduce or tweak the minimum output audio length below 5 seconds? It can be worked around by padding the input with extra words and then post-processing to remove the padding. For example, @plummet555 found this workaround (#360 (comment)):
Q2
Yes, the toolbox should perform much better on speakers that it is trained on.
If I were attempting this, I would extract a single embedding for the desired speaker and then fine-tune the synthesizer and vocoder models using that hardcoded embedding, following this training process: #429 (comment) (substituting your single-speaker dataset for the accent datasets). The amount of data would depend on how well the existing models work on the target speaker.

It would be an interesting project to attempt an open-source version of the resemble.ai voice cloner, where we define a set of utterances to be recorded and fine-tune a single-speaker model using the above process. I would guess 5-10 minutes of data should be sufficient for most voices.
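A minimal sketch of the first step, extracting and saving one fixed speaker embedding with the pretrained encoder. The file paths and the idea of averaging over several reference clips are my own assumptions, not part of the comment above:

```python
# Sketch: extract one fixed speaker embedding to reuse during fine-tuning.
# Assumes the repo's encoder package (encoder/inference.py) and a folder of
# reference WAVs for the target speaker; paths are placeholders.
from pathlib import Path
import numpy as np
from encoder import inference as encoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

wav_paths = sorted(Path("datasets/target_speaker").glob("*.wav"))
embeds = []
for wav_path in wav_paths:
    wav = encoder.preprocess_wav(wav_path)       # resample, trim silence, normalize
    embeds.append(encoder.embed_utterance(wav))  # 256-dim utterance embedding

# Average and re-normalize to get the single "hardcoded" speaker embedding.
speaker_embed = np.mean(embeds, axis=0)
speaker_embed = speaker_embed / np.linalg.norm(speaker_embed, 2)
np.save("target_speaker_embed.npy", speaker_embed)
```

During fine-tuning, this saved embedding would then stand in for the per-utterance embeddings normally produced by the synthesizer preprocessing step.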
Also see here for more info on what it would take to properly improve the models to fix your Q1: #364 (comment). The problem has been reported previously in #53 and #227.
Re: Q1 and Q2: I will look into both of your suggestions.
I'd be happy to share the change I made to add and then cut out a leading word. Just need to find some time to tidy up the repo.
@plummet555, that would be great! This is my first time interacting with a GitHub community and you guys are awesome. I originally just wanted to say thanks to @blue-fish and wasn't expecting this kind of exchange. I will try to learn how to use GitHub properly, follow your example, and share back any changes I make to the libraries I'm currently experimenting with.
Looks like the project has moved forward (which is great!), but I think it will be a while before I get a chance to rebase and try it out. So, if it helps, for now I'll just copy here the code I wrote to add the word 'skip' to the start of each line and then find the silence following it so it can be trimmed back out of the output. It's a copy of demo_cli.py (which I called sv2tts_cli.py). You can run it as e.g.: where input.txt contains one or more lines of text. --cpu is optional. Hope this helps.
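The original script isn't reproduced here, but a minimal sketch of the trimming step it describes might look like the following. The silence threshold, frame length, and function name are illustrative assumptions; this is not @plummet555's actual code:

```python
# Sketch of the "leading word" workaround: prepend "skip" to the input text,
# synthesize, then cut the output at the first silent gap after that word.
# Thresholds and window sizes below are illustrative assumptions.
import numpy as np

def trim_leading_word(wav, sample_rate, silence_threshold=0.02, min_silence_s=0.15):
    """Return the waveform with everything up to the first sufficiently long
    silent gap removed (i.e. drop the padded leading word)."""
    frame_len = int(0.01 * sample_rate)                  # 10 ms frames
    n_frames = len(wav) // frame_len
    frames = wav[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))          # per-frame energy
    silent = rms < silence_threshold

    min_silent_frames = int(min_silence_s / 0.01)
    run = 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run >= min_silent_frames:
            # Cut here: the leading word plus at least min_silence_s of the
            # gap are removed; any leftover silence is at the clip's start.
            return wav[(i + 1) * frame_len :]
    return wav  # no silent gap found; return unchanged

# Usage (assuming `generated_wav` came from vocoder.infer_waveform at 16 kHz):
# trimmed = trim_leading_word(generated_wav, 16000)
```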
Thank you for sharing your code with us, @plummet555. Please ask any follow-up questions as needed, @mbdash, and close the issue when you are satisfied.
@mbdash wrote this in #449 but I am moving it here just to keep the issues organized:
In general it takes a lot of effort to make a practical implementation of whatever is demonstrated in research papers. This project is one example, and Corentin made a master's thesis out of it, which is on the order of 1,000 hours of work. So my reaction to most new research tends to be "cool, but I'll wait for someone else to build it." Just because you can do it doesn't mean you should. Life is too short.
haha Great answer. thx again.
I don't understand Korean, so the demo didn't make much of an impact. I am also new to TTS and ML in general, so I can't claim to understand the paper either. The general concept is promising, though; I wonder if others have attempted something similar. From the paper:
Training a multispeaker TTS requires a lot of input data, which could also be used to train an "emotion encoder" to automatically assign (P,A,D) values based on clues from the text and the recorded speech. (Section 2.2 says that the actual model uses 32 dimensions for emotion, so the emotion encoder could output in 32-D.) Then use that in synthesizer training. I think it should generalize well because how the emotion manifests itself in an utterance should be independent of the voice of the person speaking it. Furthermore, you could also use the info to correct or normalize the utterance embeddings generated by the speaker encoder.
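A rough sketch of what such an emotion encoder could look like. The architecture, layer sizes, and input features here are purely illustrative assumptions, not anything from the paper or from this repo:

```python
# Illustrative sketch only: a small "emotion encoder" that maps a mel
# spectrogram to a 32-D emotion embedding, to be used alongside the
# speaker embedding during synthesizer training. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionEncoder(nn.Module):
    def __init__(self, n_mels=40, hidden_size=128, emotion_dims=32):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden_size, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_size, emotion_dims)

    def forward(self, mels):                  # mels: (batch, frames, n_mels)
        _, (hidden, _) = self.lstm(mels)
        embed = self.proj(hidden[-1])         # final hidden state of last layer
        return F.normalize(embed, p=2, dim=1) # unit-norm 32-D emotion embedding

# Usage idea: concatenate with the 256-D speaker embedding before conditioning
# the synthesizer, e.g.
#   cond = torch.cat([speaker_embed, EmotionEncoder()(mels)], dim=1)
```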
@mbdash If you don't have plans for your GPU after the LibriTTS model finishes, would you be willing to help train a new encoder for better voice cloning quality? You would use the same process as wiki/Training, but change the params for a hidden layer size of 768 instead of the current 256. There is a lot of info on this in #126, but the model in that issue was trained with an output size of 768, which makes it incompatible with everything else we have. According to wiki/Pretrained-models, the current encoder trained to 1.56M steps in 20 days on a 1080ti.
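For reference, the change being proposed amounts to a single hyperparameter edit, something like the sketch below (assuming the encoder hyperparameters still live in encoder/params_model.py as in the current repo layout):

```python
# encoder/params_model.py (sketch of the proposed change; other values unchanged)

## Model parameters
model_hidden_size = 768     # was 256; larger LSTM hidden size for better quality
model_embedding_size = 256  # keep 256 so the new encoder stays compatible with
                            # the existing synthesizer and vocoder
model_num_layers = 3
```

Keeping the embedding size at 256 is the point of difference from the #126 model, which changed the output size and therefore broke compatibility downstream.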
I will gladly put my GPU to work whenever I can. We could go in milestones; eventually I will need it for other things.
Thank you for contributing your time and hardware, @mbdash. If you've had a chance to look at the figure in #30 (comment), you'll notice that:
If an upstream element is changed, the downstream elements need to be retrained in most cases. Therefore, if we are changing the encoder, we should also retrain or at least fine-tune the synthesizer; if the synthesizer changes, then similarly update the vocoder. So we should make our best effort to train the encoder and do any follow-on work with the synth and vocoder. If it turns out that the outputs are similar enough, we should be able to jump back and forth and fine-tune as you are proposing. Though it may still be good to proceed serially if there is any desire to make the training process repeatable for those who want to improve on the models in the future.
Alright then, let's switch to training the encoder. Provide me with the instructions and I'll do it.
Please start by downloading the following datasets. These datasets are huge!
Later I'll open a new issue for the encoder training and post instructions there. |
I already got train-other-500. |
Just to double-check: you have the LibriSpeech (not LibriTTS) version of train-other-500? I know it's not going to make a big difference, but I'd prefer the LibriSpeech version so we can precisely replicate Corentin's setup with only one change (hidden model size). Let's also plan on training the synth to 278k before switching to encoder training. In the meantime I am trying to work out the PyTorch synthesizer (#447).
In #432, @mbdash wrote: