Training a new encoder model #458
@mbdash I will be unavailable this weekend. Hopefully the commands will just work. You can reach out to the community for help, in particular @sberryman who has gone through this in #126. There are two things that I hope to learn from training this new encoder model: (1) whether the change improves voice cloning quality, and (2) whether the new encoder remains compatible with the existing synthesizer and vocoder.
My hypothesis is that for 1, we will not see a difference unless we also retrain the synth with many more voices. And for 2, that it should be compatible since the dimensions, input data and loss function are not changing. I may very well be wrong on that since I have not studied the encoder in detail. |
update: still preparing the data. I might start the training tomorrow. |
Preprocessing started at 10h00 EST, 2020-08-03. Question:
It's optional, but starting a visdom server allows you to visualize the training results by navigating to the server's address in your browser. The UMAP projections will let us know whether the encoder has learned to distinguish between the voices in the training set. This in turn helps us decide when to stop training.
By any chance did you have the text file ( |
Nope. I noticed the file rights; rights inheritance might have caused some issues since my user falls under group:users. I'll keep you posted on updates.
@mbdash Searching on the error message I came across the suggestion to convert the m4a files to wav, which should fix the problem for VoxCeleb2: #76 (comment)
@blue-fish I will try to convert them when I have some time. |
@blue-fish m4a to wav conversion in progress.
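For reference, a minimal sketch of one way to batch-convert the files, assuming ffmpeg is on the PATH. The dev/aac path and the 16 kHz mono target (matching the encoder's sampling rate; the preprocessing resamples anyway) are assumptions, so adjust to your extraction layout:

```python
import subprocess
from pathlib import Path

voxceleb2_dev = Path("VoxCeleb2/dev/aac")  # assumed extraction path, adjust as needed

for m4a_path in voxceleb2_dev.rglob("*.m4a"):
    wav_path = m4a_path.with_suffix(".wav")
    if wav_path.exists():
        continue  # makes the conversion resumable
    # Convert to 16 kHz mono wav; the encoder preprocessing resamples anyway,
    # so the exact rate here is a convenience, not a requirement.
    subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-i", str(m4a_path),
         "-ar", "16000", "-ac", "1", str(wav_path)],
        check=True,
    )
```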
@mbdash Although it is preferable to change just one variable with our training experiment, we know that the encoder gets better with more voices so I would like to suggest including the Mozilla CommonVoice dataset, which has over 60k unique English speakers: https://voice.mozilla.org/en Let's try to incorporate this one if you have the time and patience to preprocess it. @sberryman has written a snippet of code for just that purpose: https://github.com/sberryman/Real-Time-Voice-Cloning/blob/d6ba3e1ec0f950636e9cac3656c0be5c331821cc/encoder/preprocess.py#L224-L244 |
Some thoughts on the encoder model: Maybe a better encoder is not needed after all, depending on the objective. Although the SV2TTS paper demonstrated the possibility of high-quality zero-shot cloning, I think what most people are after is a high-quality single-speaker TTS. If that is the objective, we have demonstrated in #437 that a decent single-speaker model can be finetuned from the pretrained models with significantly less effort than traditional TTS models. The required dataset goes from 10+ hours to about 10 minutes, a reduction of nearly 2 orders of magnitude. For this purpose, the speaker encoder acts as a starting point for the finetuning task, and the quality of the encoding mainly determines how much finetuning is needed. The best case is that no additional training is needed, i.e. high-quality zero-shot voice cloning per the SV2TTS paper. The worst case is bounded by the 10+ hours needed to train a single-speaker TTS. With a better encoder and synthesizer, the required dataset for finetuning can really only go down by 1 order of magnitude: just 1 minute of audio. A reduction of 2 orders of magnitude (a dataset of 10 seconds) is equivalent to zero-shot in terms of performance. While the idea of making a voice with just 1 minute of training data is more appealing than the current 10 minutes, is it an order of magnitude improvement from the perspective of the end user? Or in other words, how much effort is appropriate for the encoder given the potential improvement to be had? Arguably, the encoder is already good enough and our limited resources are better spent on the synthesizer, which has a lot of known issues.
I have begun downloading the Mozilla CommonVoice dataset and will add it to the encoder pretraining. I am adding the preprocess fn to my version of encoder/preprocess.py
I also updated my encoder_preprocess.py accordingly. I will keep you guys updated. |
Following up on my earlier comments, this table is from 1806.04558 (the SV2TTS paper):
@mbdash It is still worth an attempt to add the 60k speakers from CommonVoice to the encoder and increase the hidden layer size to see if we can achieve open-source zero-shot voice cloning that is as good as the results they published. While you're preparing that dataset I will also read the GE2E paper to see if anything else should be changed for this experiment. Edit: If you think VoxCeleb is too noisy of a dataset we can also try LibriSpeech + CommonVoice, or just CommonVoice alone.
Just one mod to make, |
@mbdash One more request to make if training has not started yet. I just read #364 (comment) and would like to do the encoder training in 2 phases.
The question is, does the phase 2 model perform better than the model from phase 1? I forget how the SV2TTS folder is structured and whether this can be easily implemented (maybe you could make a separate …). Edit: Given that VCTK is a popular dataset it would be nice to include it either in phase 1 or 1.5 to ensure the resulting model performs well with it. But it is only 110 voices, so just a drop in the bucket compared to the others, and not worth holding up the training for it.
@blue-fish I am still converting CommonVoice mp3s to wav; it has been running for hours. (I began downloading VCTK.) Can I simply move all the VoxCeleb 1 & 2 folders out of the /SV2TTS/encoder folder and move them back in phase 2? I did not look deep enough into the code to see if the preprocessing does anything other than populating the /SV2TTS/encoder folder. Here is a sample of what the /SV2TTS/encoder folder looks like:
Good idea @mbdash , I looked at the code and I think that will work. |
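If it helps, here is a rough sketch of that folder shuffle. It assumes the preprocessed speaker folders are prefixed with the dataset name (e.g. VoxCeleb1_..., VoxCeleb2_...), which is how encoder_preprocess.py appears to name them; the holding directory name is a placeholder.

```python
import shutil
from pathlib import Path

encoder_dir = Path("<datasets_root>/SV2TTS/encoder")  # adjust to your datasets_root
parked_dir = encoder_dir.parent / "encoder_voxceleb"  # temporary holding area
parked_dir.mkdir(exist_ok=True)

# Phase 1: park every VoxCeleb speaker folder outside the training directory
for speaker_dir in encoder_dir.glob("VoxCeleb*"):
    shutil.move(str(speaker_dir), str(parked_dir / speaker_dir.name))

# Phase 2: move them back before resuming training
# for speaker_dir in parked_dir.iterdir():
#     shutil.move(str(speaker_dir), str(encoder_dir / speaker_dir.name))
```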
Please note that by separating the CommonVoice files into subfolders, the preprocessing is interpreting each folder as a different speaker. I do not know if the training quality will be affected by this, since random files from different speakers are mixed within those folders. The preprocessing seems to keep going even with the warning.
Thanks for bringing this up @mbdash . The whole point of the speaker encoder is to learn to distinguish voices from different speakers. If the folder name is used to uniquely ID the speaker then mixing will be disastrous. Is there any metadata in CommonVoice that can help sort things out before you preprocess? @sberryman can you share how you preprocessed CommonVoice for encoder training? Edit: Would this issue still exist if you treat each CommonVoice subfolder as its own dataset? |
@blue-fish and @mbdash You should cancel your current pre-processing. You need a unique folder per speaker in the pre-processed folder for the encoder. I wrote a little script to pre-process the Common Voice dataset(s) for each language. It was run against a release of CV from at least a year ago; I doubt the format of validated.tsv has changed much since then. You'll need to adjust line 26 for the base directory of Common Voice. One of the arguments to the script is the language; the other arguments are for the min and max number of audio segments per speaker. Feel free to adjust those based on your needs; I found that a minimum of 5 worked well for me. So this loops over every speaker id in validated.tsv. It takes a while but works great; I did the entire CV dataset for my encoder (all languages). Edit: Also be very careful about lines 60-61. It will …
@sberryman Where can I find the validated.tsv?
@mbdash There is a validated.tsv included in every language download from Common Voice. For example:
The clips folder is where all the audio clips are stored, and is what I'm assuming you are running encoder pre-processing against. My script will create a new folder with one subfolder per speaker. Once that script finishes you'll be able to run the encoder pre-processing against that new folder instead of clips.
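For anyone following along, a minimal sketch of that approach (not @sberryman's actual script): it reads validated.tsv, groups clips by client_id, and copies each speaker's clips into its own folder. The column names follow the current Common Voice format; the output folder name and clip limits are placeholders.

```python
import csv
import shutil
from collections import defaultdict
from pathlib import Path

def organize_commonvoice(cv_root, out_name="speakers", min_clips=5, max_clips=40):
    """Group Common Voice clips into one folder per speaker using validated.tsv."""
    cv_root = Path(cv_root)
    clips_by_speaker = defaultdict(list)
    with open(cv_root / "validated.tsv", newline="", encoding="utf-8") as tsv:
        for row in csv.DictReader(tsv, delimiter="\t"):
            clips_by_speaker[row["client_id"]].append(row["path"])

    out_dir = cv_root / out_name
    for client_id, clips in clips_by_speaker.items():
        if len(clips) < min_clips:
            continue  # too few validated clips to form useful GE2E batches
        speaker_dir = out_dir / client_id[:20]  # first 20 chars keep folder names short
        speaker_dir.mkdir(parents=True, exist_ok=True)
        for clip in clips[:max_clips]:
            shutil.copy(cv_root / "clips" / clip, speaker_dir / clip)
```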
@sberryman Thank you so much for the prompt and helpful replies. |
@sberryman Yes, thank you for your quick response. I guess I might have "misplaced" a folder :-s I will delete my extracted files, untar again, and be more careful. Thank you.
@mbdash no need to apologize, none of this is really documented. Requires reading through tons of comments on issues, some of which are closed I'm sure. It is neat to see such strong demand for this project. I had an idea a while ago to create a platform for cloning facial images and speech and standardize the training process a bit by making it easy to swap out backbone architectures, etc. Then ideally people could join in the project in various capacities. Some might help label new data, add new datasets, contribute their GPU(s) to a training pool, etc. Then we could build a web based UI to interact and run inference on pre-trained models. It is clear to me that the UI aspect of this project made it very approachable to everyone. But fine tuning or changing out datasets is confusing. (I've been working on the visual side recently) |
I should also clarify that the script I linked to does NOT add the language to the speaker-specific folders. It just grabs the first 20 chars of the speaker_id provided by CV. Ensure you are using the modifications I made to encoder/preprocess.py.
I can confirm I previously deleted the tsv during the process.... |
@ustraymond Code for calculating EER is shared in #61 (comment) and #126 (comment). The code is identical but the context for discussion is slightly different.
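For convenience, here is a minimal, self-contained EER calculation in the same spirit (a sketch using scikit-learn and SciPy, not the exact snippet from those comments). It takes same/different-speaker labels and similarity scores for a set of trial pairs:

```python
from scipy.interpolate import interp1d
from scipy.optimize import brentq
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal error rate: the point where the false acceptance rate equals the false rejection rate.

    labels: 1 for same-speaker pairs, 0 for different-speaker pairs.
    scores: similarity scores (higher means more likely the same speaker).
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    # Find the operating point where FPR == 1 - TPR (i.e. FAR == FRR)
    eer = brentq(lambda x: 1.0 - x - interp1d(fpr, tpr)(x), 0.0, 1.0)
    return eer

# Example with made-up cosine-similarity trials
print(compute_eer([1, 1, 0, 1, 0, 0], [0.81, 0.74, 0.76, 0.40, 0.35, 0.29]))
```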
Dear @mbdash
@mueller91 I am not the guy you are looking for; the wise guy with the answers is @blue-fish. Note that there are 2 encoder models: "LibriSpeech + CommonVoice + VCTK only, until step 315k" and "LibriSpeech + CommonVoice + VCTK until step 315k + VoxCeleb1&2". The latest upload is at 525k steps. I would wait for me to reach 750k before doing anything with this encoder, if I were you.
@mueller91 I think your observations are explained by our continued use of ReLU (which is not a deliberate choice, we just used the repo code without modification). @sberryman removed ReLU in resemble-ai/Resemblyzer#13 which causes inter-similarity to be centered around zero instead of 0.5 as in our case. We will continue using ReLU as long as the encoder model needs to support Corentin's pretrained encoder which also uses it. |
@blue-fish How does @mbdash's model work with the existing synth/vocoder? I would assume not very well, producing a generic voice? If that is the case, @mbdash should stop training and start over with Tanh as the final activation. Then you'll use @mbdash's new model to train the synthesizer and vocoder from scratch (several weeks' worth of GPU time). Replicating Corentin's work training from scratch would likely require well over 700 hours of training using two 1080 Tis. They are using much larger GPUs (40 GB of memory, I believe) at Resemble.ai to train more quickly. The 700 hours is a very rough estimate to illustrate the several weeks of training for each of the three models.
@sberryman Could you elaborate on why you chose Tanh as the final activation?
@sberryman When I train a VCTK-based synthesizer to 100k steps on my basic GPU, I get a very similar result for voice cloning regardless of whether I use Corentin's model or @mbdash's 315k model (LS+VCTK+CV) as the speaker encoder for training and inference. See: #458 (comment) That result was completely unexpected, and maybe I should delete my pycache just to be very sure that I performed the experiment properly. But because of the different hidden unit size it is impossible to accidentally use the wrong encoder with a given synthesizer. Also, the pretrained encoder bundled with Resemblyzer is identical to the one in this repo, which I recall took ~20 days to train on a single 1080 Ti.
@mueller91 I used tanh as the final activation to force the values between -1 and 1, as opposed to 0-1 with ReLU. I never tried training with no final activation, so I'm not sure how that would turn out, to be honest. resemble-ai/Resemblyzer#13 (comment) Edit: I didn't do a good job documenting and checking in code throughout all the experiments, so there is a chance the model I trained for 1M+ steps didn't have a final activation. @blue-fish probably knows that better than me at this point. There is a chance tanh was only used on an experiment when I was trying to build an encoder model based on raw waveforms. Edit 2: Corentin doesn't think it makes much of a difference though. resemble-ai/Resemblyzer#15 (comment) Edit 3: This implementation doesn't use an activation function after the LSTM. https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/master/speech_embedder_net.py
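To make the activation discussion concrete, here is a simplified sketch of a GE2E-style speaker encoder modeled on this repo's encoder/model.py. The final_activation switch is my own illustration, not an existing option in the repo, and the 768 defaults reflect this experiment rather than the shipped model:

```python
import torch
from torch import nn

class SpeakerEncoder(nn.Module):
    def __init__(self, mel_n_channels=40, hidden_size=768, embedding_size=768,
                 num_layers=3, final_activation="relu"):
        super().__init__()
        self.lstm = nn.LSTM(mel_n_channels, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, embedding_size)
        # ReLU keeps every embedding component non-negative, so unrelated speakers
        # still land around 0.5 cosine similarity; Tanh allows negative components,
        # which centers inter-speaker similarity near 0 instead.
        self.activation = nn.ReLU() if final_activation == "relu" else nn.Tanh()

    def forward(self, mels):
        # mels: (batch, n_frames, mel_n_channels)
        _, (hidden, _) = self.lstm(mels)
        embeds_raw = self.activation(self.linear(hidden[-1]))
        # L2-normalize so every embedding lies on the unit hypersphere
        return embeds_raw / (torch.norm(embeds_raw, dim=1, keepdim=True) + 1e-5)
```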
There are no states associated with the activation, so it's not possible to tell just by looking at the checkpoint file. When I hacked the final linear layer in #458 (comment) I noticed the loss came down very quickly using ReLU. The loss was already down to 0.01 within 10 steps of restarting training. So if I had to guess, that particular 768/768 English encoder was likely using ReLU. @sberryman Thanks for digging up and sharing those additional links.
At step 600k, loss is still moving between 0.015 and 0.04 https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing |
@mbdash Training is looking good, still some overlap on clusters. Have you tried to plot cross-similarity matrices? resemble-ai/Resemblyzer#13 (comment) I also did a few plots for a much larger number of speakers here: Default is the model included in this repository and 768 was the model I trained. It looks like my EER was 0.00392 at 2.38M steps.
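In case it saves someone a search, a minimal sketch of a cross-similarity matrix using Resemblyzer's public API (the wav file names are placeholders). Since the embeddings are unit-norm, the inner product is the cosine similarity:

```python
import numpy as np
from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # loads the pretrained encoder bundled with Resemblyzer

def embed_utterances(wav_paths):
    # One embedding per utterance; Resemblyzer returns L2-normalized 256-d vectors
    return np.array([encoder.embed_utterance(preprocess_wav(Path(p))) for p in wav_paths])

embeds_a = embed_utterances(["speaker_a_01.wav", "speaker_a_02.wav"])
embeds_b = embed_utterances(["speaker_b_01.wav", "speaker_b_02.wav"])

# Rows: utterances of speaker A, columns: utterances of speaker B
sim_matrix = np.inner(embeds_a, embeds_b)
print(sim_matrix)
```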
Dear @mbdash , any updates? If you find the time to share the current model, it'd be much appreciated! :) |
https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing loss 0.015 to 0.027 cheers |
Hey guys, just went through this thread quickly. I indeed removed the ReLU layer in the voice encoder we use at Resemble.AI. I think the model on Resemblyzer still has it. I planned to release a new one which, among other things, wouldn't be trained on data that would have silences at the start and end of each clip. I don't think I'll update the code in this repo, but I should update the code on Resemblyzer when the new model's released. |
Hi, we are making a group effort to build a new dataset, curated and cleaned: quality over quantity. VoxCeleb is rejected due to its horrible quality. Join the Slack for more info.
@mbdash what slack? |
Anyone who wants to contribute in some way to the RTVC project is welcome to join the Slack. Leave a comment in #474 and we will provide an invite link. |
The idea is to have a dataset with low quality though |
Mozilla TTS is also developing a speaker encoder in mozilla/TTS#512. I am inviting Mozilla TTS contributors to this discussion to see if we can decide on a common model structure. Also share thoughts on datasets and preprocessing techniques. In a best case situation we could even share the model. |
A small update for anyone watching this thread. I saw @CorentinJ 's comment.
We are just playing around, pooling our resources and experimenting to improve audio output quality, as well as adding some punctuation support. I am done with LibriTTS60/train-clean-100.
@mbdash I believe Corentin was referring to low quality audio (background noise, static, etc.) being important while training the encoder. Clean audio is important for synthesis.
In the SV2TTS paper it is stated that "the audio quality [for encoder training] can be lower than for TTS training" but between this and the GE2E paper I have not seen a statement that it should be of lower quality. The noise might train the network to distinguish based on features that humans can perceive as opposed to subtler differences that can be found in clean audio. But I think it's worth running the experiment to see if this is truly the case. |
@blue-fish I completely agree, running the experiment is the best option. |
If anyone is wondering, training is still paused while @mbdash is denoising datasets and @steven850 is doing trial runs to determine best hparams for the encoder. It's a slow process. We plan to swap out the ReLU for a Tanh activation, but will try to match the model structure of the updated Resemblyzer encoder if it is released. |
It's going to be a while before we get back to encoder training. I'm going to close this issue for now. Will reopen when we restart. |
Hmmm, what kind of data do I need to download?
In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.
Instructions
1. Download the following datasets:
   - LibriSpeech train-other-500 (extracted as LibriSpeech/train-other-500)
   - VoxCeleb1 (extracted as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
   - VoxCeleb2 (extracted as VoxCeleb2/dev)
2. Change model_hidden_size to 768 in encoder/params_model.py (see the sketch below)
3. Run python encoder_preprocess.py <datasets_root>
4. (Optional) Start a visdom server to monitor training
5. Run python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder
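As mentioned in step 2 above, the hidden-size change is a one-line edit. For reference, this is roughly what the model parameters in encoder/params_model.py look like after the change (the other values are the repo defaults as far as I recall, so double-check against your checkout):

```python
## Model parameters
model_hidden_size = 768     # repo default is 256; this experiment uses 768
model_embedding_size = 256  # unchanged: embedding dimension the synthesizer consumes
model_num_layers = 3        # unchanged: number of LSTM layers
```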