Train a better Speaker Encoder #512
Hi Erogol,
My code is not based on the TTS repo, but I'll try to integrate it and submit a PR in the upcoming days.
@mueller91 that is great. I can train the model if you can send me a script which does all the processing. I am not sure if there is enough space to allocate all the datasets, but I can try. BTW, I also don't have an SSD that fits all the datasets. I think the latency is normal since each batch loads a lot of data, and I don't think computing specs on the fly is the cause of the problem. One option is to keep a number of batches in memory and sample from them to fill half of the next batch, loading the rest from disk. I think that would reduce the requirements quite a lot. Does it make sense?
Hi @erogol, to minimize the data loaded from disk, your suggestion makes sense; but for all utterances we'd reuse, we'd have identical pairs in the GE2E loss matrix as in the batch before. Not sure if that's desirable ... I was thinking to re-use a given batch two or three times, and just select a new random 1.6 s section of the MFCC for each utterance. What do you think?
Let's assume our batch size is B. If we keep N batches in memory, replace the oldest batch with each new one, sample B/2 instances from the in-memory samples, and load the rest from disk, it is very likely that every batch is different from the others. That is more important than having the same pairs appear in a batch a couple of times, since the average gradient would still be different. What do you think?
This also sounds like a good idea. Maybe we can combine these two ideas. We can also add random noise to each speaker in the batch, so even if we use the same speaker from the cache, the model sees a slightly different version of the speaker's voice.
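A rough sketch of how these three ideas (a bounded in-memory cache, re-cropping a random 1.6 s window, light additive noise) could fit together. The class name, the `loader_fn` argument and the default values are illustrative placeholders, not the repo's actual DataLoader, and the N-speakers-by-M-utterances grouping needed for GE2E is omitted for brevity.

```python
import random
from collections import deque

import numpy as np


class CachedSampler:
    """Sketch: half of each batch is re-sampled from a bounded in-memory store,
    the rest is loaded from disk and pushed into that store."""

    def __init__(self, file_paths, loader_fn, batch_size=64,
                 cache_batches=25, crop_len=25600, noise_std=0.005):
        self.file_paths = file_paths          # utterance paths on disk
        self.loader_fn = loader_fn            # placeholder wav-loading function -> np.ndarray
        self.batch_size = batch_size
        self.cache = deque(maxlen=cache_batches * batch_size)  # oldest samples drop out
        self.crop_len = crop_len              # 1.6 s at 16 kHz
        self.noise_std = noise_std

    def _augment(self, wav):
        # new random crop + small additive noise, so reused samples differ each time
        start = random.randint(0, max(0, len(wav) - self.crop_len))
        wav = wav[start:start + self.crop_len]
        return wav + np.random.normal(0.0, self.noise_std, size=wav.shape)

    def next_batch(self):
        n_cached = min(len(self.cache), self.batch_size // 2)
        batch = [self._augment(random.choice(self.cache)) for _ in range(n_cached)]
        for _ in range(self.batch_size - n_cached):
            wav = self.loader_fn(random.choice(self.file_paths))
            self.cache.append(wav)            # newest samples replace the oldest
            batch.append(self._augment(wav))
        return batch
```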
Sounds good, I'll implement those three ideas. Also, I've just added the LibriTTS, VoxCeleb1+2 and CommonVoice datasets.
Finally: is there a reason the silence is not trimmed?
I just assumed that the datasets are preprocessed, and I like to keep a bit of silence to be robust against it. But it might be set differently for different cases.
I implemented the three improvements we discussed above. Attached are the first TensorBoard plots. You can find the source in my fork. I'll keep training this for a bit and then publish the model + config. Let me know if you are interested in a different set of hyperparameters.
It is great!! Looks like the loss is smoothly going down. How many samples do you have in total for this model? Have you made any particular changes to the model? I was planning to remove the last ReLU layer that, in my opinion, skews the output distribution. Also, with all these datasets, we could train a larger model. You could also use AngleProtoLoss, with which @Edresson reported better results. Are you planning to share the model and the code at the end? If you are, then I can work on the universal vocoder more, and @Edresson is working on the multi-speaker TTS model. After we merge all these, we would have the best model possible.
@mueller91 this is very good, congratulations :). As @erogol commented, I got better results with Angular Prototypical (AngleProtoLoss) in my training. I recommend you try it :). The paper In defense of metric learning for speaker recognition also shows the superiority of Angular Prototypical.
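For reference, a minimal PyTorch sketch of the Angular Prototypical loss as described in that paper: each speaker's query utterance is compared, via a learnable scaled cosine similarity, against every speaker's centroid, and a cross-entropy pushes it toward its own centroid. This is an illustration, not necessarily the exact implementation used in the repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AngleProtoLoss(nn.Module):
    """Angular Prototypical loss sketch; embeddings: (n_speakers, n_utts, dim)."""

    def __init__(self, init_w=10.0, init_b=-5.0):
        super().__init__()
        # learnable scale and bias applied to the cosine similarities
        self.w = nn.Parameter(torch.tensor(init_w))
        self.b = nn.Parameter(torch.tensor(init_b))
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, embeddings):
        # query: last utterance of each speaker; prototype: mean of the rest
        query = embeddings[:, -1, :]                 # (N, D)
        proto = embeddings[:, :-1, :].mean(dim=1)    # (N, D)
        cos_sim = F.cosine_similarity(
            query.unsqueeze(1), proto.unsqueeze(0), dim=2)  # (N, N)
        with torch.no_grad():
            self.w.clamp_(min=1e-6)                  # keep the scale positive
        scores = self.w * cos_sim + self.b
        # each query should be closest to its own speaker's prototype
        labels = torch.arange(embeddings.size(0), device=embeddings.device)
        return self.criterion(scores, labels)
```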
With the datasets mentioned above, I have 25.5k speakers and 2M utterances. You can see my config here. I've submitted a PR, and will be happy to share the model once trained.
@mueller91 Could you train the model with the audio settings from this config here? This would allow us to use this model to calculate the loss in the TTS model and generate speakers with voices closest to the originals :)
@Edresson Are you planning on using the speaker encoder to create an additional similarity loss term for the multi-speaker Tacotron? I tried that for a while with my own implementation; it didn't improve anything, but my speaker encoder was bad back then. Most of the datasets are 16 kHz, so upsampling to 22050 Hz may slow the data loader down; I'll have to see how it turns out. Upsampling should not affect the MFCCs in a negative way, right?
I am not sure, but the sampling rate of the speaker encoder should not make an important difference. In the end, the TTS model would learn what it needs from the embedding regardless of the encoder's rate. But maybe I am wrong.
Yes, exactly that. I've tried this and the results improve even with a bad speaker encoder. Training with a better speaker encoder should improve them even more, especially for speakers not seen during training. Resampling is really slow. @erogol In some tests, when I fed 16 kHz audio upsampled to 22 kHz into a speaker encoder trained at 22 kHz, the performance dropped a lot. However, I didn't try it without the upsampling. @mueller91 @erogol Do you think it is feasible and makes sense to train with audio at 22 kHz and 16 kHz at the same time?
Here is the current model: trained to 180k steps on LibriTTS Clean 100, 360 and 500, VCTK, VoxCeleb1+2 and Mozilla Common Voice; a total of >25k speakers with 64 speakers per batch. Loss is at about 0.25. You can download the model and config at: @Edresson I can't train at 22 kHz and 16 kHz at the same time because I have access to only a single GPU, and the current model (with 768 hidden units and 64 speakers per batch) does not fit on my GPU twice.
@mueller91 It should work, but the quality may not be as good for real applications. If it is just for data generation, I believe it is a good one. Perhaps it would be interesting to test how the speaker encoder behaves when receiving 22 kHz audio instead of 16 kHz (my test was the opposite: a 22 kHz-trained speaker encoder received a 16 kHz sample that was upsampled). If the performance loss is not great, we can use the trained 16 kHz speaker encoder to calculate the distance between speakers during training (speaker encoder extra loss) for a model trained at 22 kHz :)
@mueller91 it is a great contribution. Thanks! I see that it was still converging. I guess you need the GPU, since you stopped training.
@Edresson I still don't think we need a different sampling rate for the encoder. You can always resample the audio before computing the embedding vector.
@erogol I'll keep training; this was only a snapshot. @Edresson I have not forgotten your request. However, I have only a single GPU available, and I would like to train the current model a bit more before I start with your config. Upsampling to 22 kHz introduces significant overhead during data loading; would 16 kHz and 80 mel_channels be helpful to you? This paper reports SOTA with 16 kHz.
@erogol The idea is to use the speaker encoder to calculate the loss during TTS training. And I don't know how to resample a spectrogram, so the ideal would be to have a speaker encoder trained at 22 kHz. @mueller91 can focus on the 16 kHz speaker encoder :). As I said above, there may not be a big difference in performance, and we could use it on 22 kHz audio. I trained a 22 kHz model compatible with the TTS audio configuration on LibriTTS 360 and 100 clean a while ago; this model is not as good as yours, but it works :).
@Edresson You don't need to resample the spectrogram. You resample the audio and then compute the spec. Basically, use separate audio processors for the speaker encoder and the rest.
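To illustrate the "separate audio processors" point, a sketch using librosa: load the same wav at the encoder's rate and at the TTS rate before computing the spectrograms. Function names, sample rates and mel settings here are just examples, not the repo's actual API.

```python
import librosa


def load_for_encoder(wav_path, encoder_sr=16000, n_mels=80):
    # resample on load to the speaker encoder's rate, then compute its spec
    wav, _ = librosa.load(wav_path, sr=encoder_sr)
    return librosa.feature.melspectrogram(y=wav, sr=encoder_sr, n_mels=n_mels)


def load_for_tts(wav_path, tts_sr=22050, n_mels=80):
    # resample on load to the TTS model's rate, then compute its spec
    wav, _ = librosa.load(wav_path, sr=tts_sr)
    return librosa.feature.melspectrogram(y=wav, sr=tts_sr, n_mels=n_mels)
```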
I have further optimized the DataLoader and now incur zero overhead when loading the data from disk (see LoaderTime); I train 1000 steps in about 15 minutes (around 1.25 steps per second).
@Edresson I have started training the 80-mel, 16 kHz speaker encoder; I'll keep you updated. Is the speaker-encoder-based similarity loss already implemented?
@mueller91 Yes, on one of my branches. We intend to merge it with TTS in the future :). Are you training with this audio config here, except for the sample rate, correct? For the sample rate, @erogol had the idea of using interpolation, as discussed in issue #520; we can try this :).
@Edresson Yes, I used your audio config, except for the sampling rate and do_trim_silence, which I set to true. Edit: I noticed that changing the win_length from 400 to 1024 results in fewer frames given 1.6 s of audio. Do you think it makes sense to increase the length of the audio to maybe 2 or 3 s during training? As far as I remember, the original paper reported improvements for longer audio files (up to 5 s) during inference.
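A quick back-of-the-envelope check of that frame-count observation, assuming no center padding and a hop length of 160 samples (10 ms at 16 kHz); the actual config values may differ.

```python
def num_frames(n_samples, win_length, hop_length):
    # frames that fit completely inside the signal (no center padding)
    return 1 + (n_samples - win_length) // hop_length


n_samples = int(1.6 * 16000)                 # 1.6 s at 16 kHz = 25600 samples
print(num_frames(n_samples, 400, 160))       # 158 frames
print(num_frames(n_samples, 1024, 160))      # 154 frames
```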
I should also mention that. Thanks for reminding me. Yes, I use the latest encoder. |
Great! Huge thanks to you @erogol and to you as well @mueller91 for the impressive work.
Hi, does compute_embeddings.py work with the released speaker encoder model?
It works, I just used it. Are you sure you are using the right model?
Yeah! I am using master and the models from the drive link (I tried all the models on the link), and it still doesn't work.
Ah, I used compute_embeddings.py from dev, which worked for me.
Which commit are you using? compute_embeddings.py is not there anymore.
Current dev: https://github.com/mozilla/TTS/blob/dev/TTS/bin/compute_embeddings.py
Oh, that's where it was 🤦🏻♂️ thanks mate. Very strange, it's still not working, even though I pulled the latest dev. It crashes at the model.load_state_dict line, and the only thing I changed was mapping the storage to CPU because I am trying to load it on my laptop. I just added strict=False and it seems to be doing the trick. Weird. Thanks a lot for trying to help. 🤗
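For anyone hitting the same issue, a sketch of that workaround: load a GPU-trained checkpoint on a CPU-only machine and tolerate key mismatches. The assumption that the weights sit under a "model" key follows the usual TTS checkpoint layout and may not hold for every file.

```python
import torch


def load_checkpoint_on_cpu(model, checkpoint_path):
    """Load a GPU-trained checkpoint on a CPU-only machine, ignoring key mismatches."""
    state = torch.load(checkpoint_path, map_location=torch.device("cpu"))
    # checkpoints usually keep the weights under a "model" key (assumption)
    model_state = state["model"] if "model" in state else state
    model.load_state_dict(model_state, strict=False)
    model.eval()
    return model
```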
I guess it is the 320K checkpoint. @Edresson has computed the embeddings.
@WeberJulian The best_model was trained to ~370k steps, so I would assume it should be better?
@sanjaesc Yeah, it's probably better, but I'm fine-tuning the VCTK multi-speaker model in my language, so I need the exact checkpoint used to compute the embeddings, even if they are worse, or else my model won't work properly (I think).
Shouldn't a better speaker encoder compute more accurate embeddings for your dataset and thus result in a more robust model?
I don't know, since the embeddings wouldn't mean the same thing anymore. I don't have enough speakers in my dataset to make the model learn (slightly?) different embeddings. I need a model that already knows how to interpret the embeddings. At least that's my intuition. But if you think the newer checkpoint might work better, I may try it after this training ends. Thanks for the advice.
Hi, thanks for the great effort! I'm experimenting with various multi-speaker TTS recipes shared in this project. Has anyone tried training a Tacotron model with LibriSpeech/LibriTTS data, or any other large-scale US English dataset? I'm able to get decent results with the VCTK-based Tacotron model, but it's limited to UK English and the speaker variety is not sufficient for my application. I'm aware that we can create random speaker embeddings, or even random style tokens if it's a GST-based model, but I still think that when Tacotron sees only a limited number of speakers, as in VCTK, all you can generate is limited to that speaker set in terms of speaker variation. If a larger-scale Tacotron model hasn't been trained, I might be able to put some effort into it and share a pre-trained model if it goes well. Any thoughts?
Hi, I think the model based on VCTK is the latest and greatest in this repo, but it shouldn't be too hard to fine-tune it on a larger dataset.
The VCTK Tacotron model is based on the UK English phoneme set. I don't know exactly what espeak does when you switch dialects, but I'm guessing the phoneme sets will be different, so training from scratch would be inevitable. Otherwise, the Tacotron output will be based on UK English espeak pronunciations. It may not be as accurate as using US English, say if you are using LibriTTS for Tacotron training.
I think en-uk has more phonemes in common with en-us than different ones. I just tried transfer learning from this model to French and it works reasonably well, so you shouldn't have any trouble with your use case. Try the faster path first, and if it doesn't suit you, you can always take the longer one.
Yes, makes sense. My naive guess is that it will perform better than using characters as input, but maybe a bit worse than the 'correct' phoneme set. The definition of a 'correct' phoneme set is also a bit fuzzy. It all depends on how well it represents the pronunciations of the speakers in your training database, which may contain accented speech, etc., that you might be unaware of.
Hi, a couple of questions, especially for @mueller91. I am trying to recreate the experiment with the same config, the same datasets and a handful of private speakers (not more than 120, so definitely not a lot). However, I am having issues initiating training. It seems that it freezes 15 minutes in; the RAM starts going up slowly (CPU allocation looks healthy), then fills up and the entire thing freezes. I have tried with both 4 and 8 workers and it did not work. I have a machine with 8 vCPUs, 32 GB RAM and a V100. Thanks! And thanks for the model. 😀😅
Are you using my code, where part of the samples is kept in memory to reduce I/O?
I am using the dev branch, so I guess it has this, yes. I also tried your fork and got the same problem. I tried decreasing storage_size to 15, but it didn't really do anything, and if it is lower, the I/O increases a lot. How much RAM is needed to cache all the wavs without problems?
Try decreasing it to zero and see if the RAM problem persists. And yeah, the I/O really is a problem; you really need SSDs for it.
Actually, the SSD is not a problem, because I never run on an HDD. So the problems I am getting all occur on an SSD. How much RAM did you use? Setting storage_size to 1 (0 is not accepted) works, but then the loss jumps to 0.60, even though I use the same training sets you use. Did you only use the caching because you have an HDD?
I had to use caching because I use an HDD. I have 120 GB of RAM.
120 GB 🤯 No wonder my small 30 GB will not work! Thanks a lot for the clarification and for confirming it affects the loss.
You can use swap space as an easy workaround. If you create it on the SSD, it should be fast enough.
Hi, I was wondering if anybody has tried clustering in order to get a better understanding of what the network learns. I extracted some embeddings for my speakers and tried clustering them with HDBSCAN, but it only gives one (zero) label and then -1, which is apparently noise. This is what I have tried:
and I get
I set min_cluster_size to 5 because anything higher only gives back the noise label. Maybe it indeed only has one label (and it is the pitch), but isn't it a bit weird that it doesn't learn anything else?
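The original snippet and its output were not preserved here; a minimal reconstruction of what such an attempt could look like is below. The embedding file name is a placeholder, and min_cluster_size=5 follows the comment above.

```python
import numpy as np
import hdbscan

# placeholder path; expected shape: (num_utterances, embedding_dim)
embeddings = np.load("speaker_embeddings.npy")

# if the embeddings are L2-normalized, Euclidean distance behaves much like cosine
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric="euclidean")
labels = clusterer.fit_predict(embeddings)   # -1 marks points treated as noise

print(np.unique(labels, return_counts=True))
```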
@mueller91 Do you have a branch where the inter- and intra-losses are implemented? In a screenshot you shared above they are there, but they are not in dev or any other branch I tried, and I am not sure how to implement them.
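On the inter-/intra-similarity question: one common way to compute such diagnostics from a set of embeddings is sketched below; it may not match the exact definition behind the plots in that screenshot.

```python
import itertools

import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def intra_inter_similarity(embeddings):
    """embeddings: dict mapping speaker id -> array of shape (num_utts, dim)."""
    intra, inter = [], []
    # intra: average similarity between utterances of the same speaker
    for utts in embeddings.values():
        for a, b in itertools.combinations(utts, 2):
            intra.append(cosine(a, b))
    # inter: average similarity between centroids of different speakers
    centroids = {spk: utts.mean(axis=0) for spk, utts in embeddings.items()}
    for s1, s2 in itertools.combinations(centroids, 2):
        inter.append(cosine(centroids[s1], centroids[s2]))
    return np.mean(intra), np.mean(inter)
```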
Our current speaker encoder is trained with only the LibriTTS (100, 360) datasets. However, we can improve its performance using other available datasets (VoxCeleb, LibriTTS-500, Common Voice, etc.). This would also increase the performance of our multi-speaker model and make it easier to adapt to new voices.
I can't really work on this alone due to the recent changes and the amount of work needed, so I need a hand here to work together.
So I list the TODOs as follows; feel free to contribute to any part of it or suggest changes: