Training from scratch #126
Great work and great questions! I'll pin this issue for others in need of help. Firstly, one thing I notice from your profiler output is that you would benefit from a 2x speedup by putting your data on a faster disk (or maybe by increasing the number of threads in the DataLoader, if you set them too low).
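For anyone reading along, here is a minimal sketch of the DataLoader knobs being referred to (worker processes and pinned memory); the dataset is a random stand-in, not the repo's speaker-verification dataset, and the numbers are illustrative.

```python
# Minimal sketch: more DataLoader workers plus pinned memory so batch loading
# overlaps with GPU work. The TensorDataset is only a stand-in for the repo's
# speaker-verification dataset; numbers are illustrative.
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(1024, 160, 40))  # stand-in for mel partials
    loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)
    for (batch,) in loader:
        pass  # training step would go here

if __name__ == "__main__":  # required on platforms that spawn worker processes
    main()
```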
|
Thanks for the quick reply! I also noticed the blocking operation taking a long time and found it very strange, as the mel spectrograms are stored on a Samsung 960 EVO 1TB NVMe drive and
Edit:

Edit 2:

    Step    1030   Loss: 3.2002   EER: 0.2662   Step time:  mean:  871ms  std:   58ms
    Average execution time over 10 steps:
      Blocking, waiting for batch (threaded) (10/10):  mean:  103ms   std:   26ms
      Data to cuda (10/10):                            mean:    3ms   std:    0ms
      Forward pass (10/10):                            mean:    7ms   std:    1ms
      Loss (10/10):                                    mean:   73ms   std:    3ms
      Backward pass (10/10):                           mean:  569ms   std:   67ms
      Parameter update (10/10):                        mean:  116ms   std:    3ms
      Extras (visualizations, saving) (10/10):         mean:    1ms   std:    4ms

Edit 3:

    Step     310   Loss: 3.6576   EER: 0.3275   Step time:  mean:  425ms  std:  233ms
    Average execution time over 10 steps:
      Blocking, waiting for batch (threaded) (10/10):  mean:  104ms   std:  122ms
      Data to cuda (10/10):                            mean:    3ms   std:    0ms
      Forward pass (10/10):                            mean:   39ms   std:    1ms
      Loss (10/10):                                    mean:   23ms   std:    1ms
      Backward pass (10/10):                           mean:   80ms   std:    5ms
      Parameter update (10/10):                        mean:  121ms   std:    2ms
      Extras (visualizations, saving) (10/10):         mean:    1ms   std:    3ms
    ..........
    Step     320   Loss: 3.6723   EER: 0.3339   Step time:  mean:  322ms  std:   98ms
    Average execution time over 10 steps:
      Blocking, waiting for batch (threaded) (10/10):  mean:   60ms   std:   97ms
      Data to cuda (10/10):                            mean:    3ms   std:    0ms
      Forward pass (10/10):                            mean:   39ms   std:    0ms
      Loss (10/10):                                    mean:   22ms   std:    1ms
      Backward pass (10/10):                           mean:   77ms   std:    4ms
      Parameter update (10/10):                        mean:  121ms   std:    2ms
      Extras (visualizations, saving) (10/10):         mean:    2ms   std:    4ms
    ..........
    Step     330   Loss: 3.6419   EER: 0.3309   Step time:  mean:  362ms  std:  140ms
    Average execution time over 10 steps:
      Blocking, waiting for batch (threaded) (10/10):  mean:   97ms   std:  139ms
      Data to cuda (10/10):                            mean:    3ms   std:    0ms
      Forward pass (10/10):                            mean:   39ms   std:    1ms
      Loss (10/10):                                    mean:   24ms   std:    3ms
      Backward pass (10/10):                           mean:   78ms   std:    4ms
      Parameter update (10/10):                        mean:  121ms   std:    1ms
      Extras (visualizations, saving) (10/10):         mean:    1ms   std:    3ms
thank you |
There are quite a few ways to gain disk reading speedups for the encoder, but don't forget that you still need variety in the samples/batches. Another bottleneck is the GPU VRAM not being entirely used. Since the complexity of the forward/backward pass is cubic w.r.t. the batch size, you would benefit from putting multiple batches in parallel on the same GPU rather than using a larger batch size. It's something worth looking into. I had no idea you could specify to run the backward pass on the GPU, how did you do that?
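Here is a rough sketch of the "multiple batches in parallel on the same GPU" idea, assuming (as in this repo's encoder) a model that returns per-utterance embeddings and exposes a GE2E loss over a (speakers, utterances, embedding) tensor; the function and its arguments are placeholders, not the actual training loop.

```python
# Rough sketch, not the repo's training loop: sum the GE2E loss over two or
# more independent sub-batches, then do a single backward pass. Because the
# similarity matrix grows with speakers_per_batch, several small sub-batches
# are cheaper than one batch of the combined size. `model.loss(...)` returning
# (loss, eer) is an assumption modeled on this repo's encoder; a CUDA device
# is assumed.
def train_step(model, optimizer, sub_batches, speakers_per_batch, utterances_per_speaker):
    optimizer.zero_grad()
    total_loss = 0.0
    for frames in sub_batches:                 # each: (speakers * utterances, n_frames, n_mels)
        embeds = model(frames.cuda())          # (speakers * utterances, embedding_size)
        embeds = embeds.view(speakers_per_batch, utterances_per_speaker, -1)
        loss, _eer = model.loss(embeds)        # GE2E loss over this sub-batch only
        total_loss = total_loss + loss
    total_loss.backward()                      # one backward pass over the summed loss
    optimizer.step()
    return float(total_loss)
```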
Thanks for the continuous feedback.
## Model parameters:
learning_rate_init: 0.0001
model_embedding_size: 768
model_hidden_size: 256
model_num_layers: 3
speakers_per_batch: 64
utterances_per_speaker: 10
## Data parameters:
audio_norm_target_dBFS: -30
inference_n_frames: 80
mel_n_channels: 40
mel_window_length: 25
mel_window_step: 10
partials_n_frames: 160
sampling_rate: 16000
vad_max_silence_length: 6
vad_moving_average_width: 8
vad_window_length: 30
The combined npz files have been working great for me: they load all the utterances per speaker and still use your same sampling code to grab a random sample per speaker. The only thing I removed is loading from individual npy files.

I assume I changed the backward pass to GPU; either way the GPU utilization is much higher and the profiler is showing significantly lower mean durations for "Backward pass". I changed these two lines so that the tensor, not the parameter, is moved to the GPU:

    self.similarity_weight = nn.Parameter(torch.tensor([10.]).to(loss_device))
    self.similarity_bias = nn.Parameter(torch.tensor([-5.]).to(loss_device))

and changed the GPU sync in:

    def sync(device: torch.device):
        # FIXME
        # return
        # For correct profiling (cuda operations are async)
        if device.type == "cuda":
            # torch.cuda.synchronize(device)
            torch.cuda.synchronize()

I'm now up to step 447,200 and have included the loss and UMAP plots to show progress. I also changed the UMAP visualization to show 30 speakers by adding more colors to the color map.

New color map:

    colormap = np.array([
        [32, 25, 35],
        [255, 255, 255],
        [252, 255, 93],
        [125, 252, 0],
        [14, 196, 52],
        [34, 140, 104],
        [138, 216, 232],
        [35, 91, 84],
        [41, 189, 171],
        [57, 152, 245],
        [55, 41, 79],
        [39, 125, 167],
        [55, 80, 219],
        [242, 32, 32],
        [153, 25, 25],
        [255, 203, 165],
        [230, 143, 102],
        [197, 97, 51],
        [150, 52, 28],
        [99, 40, 25],
        [255, 196, 19],
        [244, 122, 34],
        [47, 42, 160],
        [183, 50, 204],
        [119, 43, 157],
        [240, 124, 171],
        [211, 11, 148],
        [237, 239, 243],
        [195, 165, 180],
        [148, 106, 162],
        [93, 76, 134],
        [0, 0, 0],
        [183, 183, 183],
    ], dtype=np.float) / 255
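Not from the thread: a minimal sketch of plotting a UMAP projection of speaker embeddings with the extended `colormap` above, assuming umap-learn and matplotlib are installed; `embeds` and `speaker_ids` are placeholder data.

```python
# Minimal sketch (placeholder data): color a UMAP projection of speaker
# embeddings with the extended `colormap` defined above.
import numpy as np
import umap
import matplotlib.pyplot as plt

embeds = np.random.rand(300, 256)            # placeholder: (n_utterances, embedding_size)
speaker_ids = np.repeat(np.arange(30), 10)   # placeholder: 30 speakers, 10 utterances each

projection = umap.UMAP().fit_transform(embeds)
colors = colormap[speaker_ids % len(colormap)]   # `colormap` is the (n_colors, 3) array above
plt.scatter(projection[:, 0], projection[:, 1], c=colors, s=10)
plt.gca().set_aspect("equal")
plt.show()
```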
Ah, I had put a warning not to compute the loss on GPU because for some reason it wasn't working (either it was some intricacies with torch or I forgot to enable grad on some tensor) and would return None. If that works, then I should update the repo to make it the default and have only 1 device for the encoder. |
You are correct, it was not working until I changed the two lines to move the tensor, not the parameter, to the GPU. That was all I had to change (I believe; if not, I can dig through all my changes and help you isolate that fix). Technically I changed loss_device to the cuda device. Also, in the sync function, I had to remove the device parameter and simply use torch.cuda.synchronize(). Clusters are getting tighter, but I plan on training until at least 700-900k steps. I'm also tempted to train an English-only model to compare. |
@sberryman will you be submitting a pull request? I'd be very interested to see the results of using more data for the speaker encoder - the GE2E paper demonstrated that having more data for the encoder is critical to getting the similarity of the cloned speaker close to the original. Also, in my own experience, the compatibility of Fatchord's Taco1 with WaveRNN makes it a great candidate, and the codebase is easy to work with. I still believe that Taco2 would be an upgrade in terms of the quality of the speaker's inflection, but the out-of-the-box compatibility of Fatchord's synthesizer with the vocoder makes it a natural choice. Do note that Fatchord's synthesizer does not support multiple speakers, so you would need to add that capability yourself (and a PR on Fatchord's repo would be especially appreciated for adding that capability :) ) |
I'm also very interested in the results. I'm currently training the encoder on about 2k speakers in Swedish and about 4k mixed, mainly English. I would really like to see examples from your encoder model on multiple languages, to see whether it's worth crawling radio and TV shows with Resemblyzer's diarization to create a fully Swedish dataset, or whether 6k speakers with 1/3 being Swedish can compare to 25k mixed, mainly English, for Swedish voice cloning. My hunch is m0ar data. |
Current: I'm at ~700k steps and there are still quite a few tight clusters; I'm not sure if this is due to the fact that I trained for 350k steps on 9,000 speakers prior to adding 16,668 more speakers (which also introduced quite a few more languages). I'm going to continue training for another 200k steps, which will be done by this time tomorrow morning. To-Do:
@TheButlah First, thanks for the massive PR that landed on Fatchord's WaveRNN 4 days ago; really excited you added multi-GPU training and mels in numpy format! To your question on a PR, I can certainly submit PRs to this repo and WaveRNN. The code to utilize most of the datasets from OpenSLR and Common Voice is a bit of a hack, but if people want it I'm open to working on a PR for that as well. Thanks for the feedback on Taco1 and WaveRNN from Fatchord's repo; that is the route I will go. I will most likely run into issues adding multi-speaker support, but I will start an issue in that repo when I get there. @ViktorAlm Great to hear about someone else testing multiple languages! Have you changed any of the data or model parameters? Funny you mention using Resemble's diarization, as I've had a tab open to that code for a few days and planned on using it against 7,000 hours of local (English) news video I have, once I finish training a new model. As far as sharing the models I'm training, I'm open to it. Here is the model trained to 697,500 steps (768 model embedding size and 256 hidden size). I would be interested to know how it performs against your Swedish data, @ViktorAlm. |
Thanks! I have not changed any params. I was on step 150k with my data, trying to do a real run with all the models. I did one run where I only did 100k steps on each model, with about 900 Swedish speakers and about 90 GB of data in total. It did not clone the voice, but it produced good audio quality and at least a male voice came out when I ran my own voice. I paused it and did a quick test with yours, and the encoding result is way better than the small test run I did. Swedish and Norwegian are pretty similar. I didn't see any specific Swedish/Norwegian cluster grouping, but I only did two tests, and UMAP might remove any visible difference, I guess. Here's a converter if you wish to add Norwegian, Danish, and Swedish data to your mix: I also added some results from your encoder in /Results. When I've played around a bit more I might make a script that evaluates different languages better. |
@ViktorAlm Thanks for sharing! Is your Swedish and Norwegian dataset private? I'm up for including those speakers in the next training run, where I'll use 768 for the hidden/embedding size, if you can share. There are only 20 Swedish voices among the 25,668 speakers I am training on and zero Norwegian. Common Voice had 44 speakers for Swedish, but I filtered those down to 20 since I set a floor of 12 unique utterances per speaker. Other updates
If anyone else is aware of other datasets I can include please let me know! |
Nice! I edited my old comment because I did not want to clutter your thread with my bad screenshots. I added my converter with links to the datasets. It's very hacky, and if you want to add them I really should clean up the code some. I think a simple merge of the folders, then looping through to get the spl files (files with info on location, etc.) and loading the files, would be the best way, instead of my weird way of scanning the folders. I was testing on just one of the extracted folders and the speech folders did not contain the wavs specified in the spl file; then everything went weird from there. https://github.com/ViktorAlm/Nasjonalbank-converter |
Just in case this wasn't clear, Resemblyzer is also my project and is merely an interface to the speaker encoder of this repo. You can replace the pretrained model in the package and put yours instead. I could also distribute models that you provide me for other languages. |
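A small, hedged example of using Resemblyzer to compare two speakers with its bundled encoder follows; the wav paths are hypothetical, and the exact mechanism for swapping in custom weights (constructor argument vs. replacing the packaged model file) should be checked against the installed version.

```python
# Small sketch: embed two (hypothetical) wav files with Resemblyzer's bundled
# speaker encoder and compare the speakers. Pointing the encoder at custom
# weights is left as a comment because the exact loading mechanism depends on
# the installed Resemblyzer version.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()                   # bundled pretrained encoder
# encoder = VoiceEncoder(...)              # or load your own weights, if the version supports it

wav_a = preprocess_wav("speaker_a.wav")    # hypothetical paths
wav_b = preprocess_wav("speaker_b.wav")
embed_a = encoder.embed_utterance(wav_a)
embed_b = encoder.embed_utterance(wav_b)
print(float(np.dot(embed_a, embed_b)))     # embeddings are L2-normalized, so this is cosine similarity
```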
I also would like to leave my script for evaluating the EER over the test set. It's not clean and I'm not sure if it's correct either (given that you won't find the right procedure for evaluating the EER over a dataset anywhere). You should use this if you want to formally evaluate the performance of the speaker encoder. If someone manages to make it better, then I would gladly include it in the repo.
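Not the attached script: below is a minimal sketch of one common way to estimate the EER from verification scores, assuming you have already computed a similarity score and a same/different label for each trial pair.

```python
# Minimal EER sketch (not the attached script): find the operating point where
# the false acceptance and false rejection rates cross.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """scores: similarity per trial pair; labels: 1 = same speaker, 0 = different."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage with synthetic scores:
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels * 0.5 + rng.normal(0.0, 0.3, size=1000)
print(compute_eer(scores, labels))
```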
|
Also I don't know about that:
|
Thanks @CorentinJ. I'm well aware Resemblyzer is your project; that is how I ended up finding it. Thanks for open sourcing that project as well. Looking forward to seeing what your next project is! Thanks for the test script; I was thinking about how I was going to evaluate the models I'm training, and it would be great to compare them to your public model. Originally I was just going to plot a random 5-10 utterances for every single speaker to get an idea of the overall distribution. Interesting on not adjusting the learning rate; I'm more accustomed to training image classification models, where reducing/decaying the learning rate is almost a requirement. I will not adjust the learning rate any further then. I was not aware the SV2TTS authors trained for 50M steps; obviously it is time for me to read their paper. Also, this is turning into more of a discussion than an "issue". I'm happy to move it to another location or continue using GitHub issues; completely up to you. Thanks again! |
Nah, it's common for issues to serve a broader purpose than just solving bugs. I don't decay the learning rate simply because it's not a necessity with Adam. The original authors did not use Adam, and they did decay the learning rate, by the way. Also, you will have to read GE2E to know more about the speaker encoder, because there isn't much info in SV2TTS about how they train or evaluate it.
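Purely as an illustration of the alternative being described (this repo keeps a flat learning rate with Adam): a step-based decay in PyTorch would look roughly like this, with a placeholder module and made-up schedule numbers.

```python
# Illustration only: this repo does NOT do this. A step-based decay with a
# PyTorch scheduler, using a placeholder module and made-up schedule numbers.
import torch

model = torch.nn.Linear(40, 256)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30_000_000, gamma=0.5)

for step in range(3):                         # stand-in for the training loop
    optimizer.zero_grad()
    model(torch.randn(4, 40)).sum().backward()
    optimizer.step()
    scheduler.step()                          # halves the LR every `step_size` scheduler steps
```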
@sberryman Shaun, it would be awesome if you created a PR. If you don't feel it's polished enough, just mark it WIP so it won't be merged but can serve as an inspiration for others :) |
@slavaGanzin I have pushed my work in progress to my own fork. There are hard coded paths and changes related to grouping all the .npy files into a single .npz for each speaker. I also use docker and volume mappings so I left the basic Dockerfile in there. I don't plan on ever submitting a PR for that branch as I'm still experimenting quite heavily. Basically, feel free to use any of the scripts as a starting point but don't count on them working out of the box. https://github.com/sberryman/Real-Time-Voice-Cloning/tree/wip Other updates
The model trained to 1,005,000 steps is available on my Dropbox account now. https://www.dropbox.com/s/69wv21ajt6l2pag/cv_run_bak_1005000.pt?dl=0 |
Hi sberryman, may I ask which languages the trained model on your Dropbox supports? |
I need a Chinese pretrained model for a project in grad school. Can you guide me on that? |
@Jessicamat777 The models I have uploaded to Dropbox are all for experimentation and I have NOT trained the synthesizer or vocoder with them yet, so they will be of little value unless you want to use them with CorentinJ's Resemblyzer. That being said, the models on Dropbox were trained on the following datasets:

1. [LibriTTS](https://ai.google/tools/datasets/libri-tts/) (train-other-500)
2. [VoxCeleb1](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
3. [VoxCeleb2](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html)
4. [OpenSLR](http://www.openslr.org/resources.php) (42-44, 61-66, 69-80)
5. [VCTK](https://datashare.is.ed.ac.uk/handle/10283/2651)
6. [Common Voice](https://voice.mozilla.org/en/datasets)

A vast majority of the speakers are English. Based on a very tiny sampling against languages it has NOT been trained on, it doesn't appear the foreign speakers make much of a difference. That is most likely due to the unbalanced training set and the extremely small number of speakers per additional language; I just wanted to see if it made a difference to include foreign languages while training. Meaning: the clusters for foreign languages are okay, but nowhere near as well defined as those for English speakers. Look at this issue, where I show how my model(s) perform against the one trained by CorentinJ on Swedish and Norwegian: resemble-ai/Resemblyzer#9. I haven't made an effort to train on Chinese, but it shouldn't be difficult if you have enough data. CorentinJ has done a great job of documenting the training process and answering questions on what size dataset you would need to train from scratch. |
Thanks for the reply. Can I use multiple GPUs to train the encoder and then combine the results into one model at the end? It would save training time. Please let me know.
|
@Jessicamat777 multi-GPU training is NOT implemented. If you do implement it, could you please submit a pull request to this repository so others can benefit?
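Not a tested implementation, but for anyone attempting it, the lowest-effort starting point is usually torch.nn.DataParallel around the encoder's forward pass, with the GE2E loss still computed on a single device after the embeddings are gathered. The tiny module below is a stand-in, not the repo's SpeakerEncoder, and it assumes at least one CUDA GPU.

```python
# Untested sketch: data-parallel forward pass for an encoder-like module, with
# the GE2E loss still computed on one device afterwards. Assumes CUDA is
# available; TinyEncoder is a stand-in, not the repo's SpeakerEncoder.
import torch
from torch import nn
from torch.nn import functional as F

class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(40, 256, num_layers=3, batch_first=True)
        self.linear = nn.Linear(256, 256)

    def forward(self, mels):                  # (batch, n_frames, n_mels)
        _, (hidden, _) = self.lstm(mels)
        embeds = F.relu(self.linear(hidden[-1]))
        return F.normalize(embeds, dim=1)

model = TinyEncoder().cuda()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)            # splits each batch across the visible GPUs

frames = torch.randn(640, 160, 40).cuda()
embeds = model(frames)                        # gathered back on cuda:0; compute the GE2E loss here
```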
Training the encoder is interesting, but I'm not entirely convinced that the problem is the encoder. (Where "problem" is defined as "the current model has a lot of trouble reproducing female voices accurately.") Are we certain that for every possible human voice, there exists an embedding which allows tacotron2 to produce spectrograms indistinguishable from that voice? If not, then it seems beneficial if tacotron2 were trained on the new diverse speech dataset in addition to the encoder. For example, in my experiments it has seemed impossible to generate spectrograms with cartoon-style inflections: lots of expressive vocalizations, rapid pitch changes, and so on. If that's how a speaker sounds normally, then it seems like it's impossible for the encoder to generate any latent vector that would cause tacotron2 to produce spectrograms that sound anything like the speaker. Perhaps I am confused, but just to confirm: there are three separate things that need to be trained, right? The encoder, the synthesizer (text to spectrogram), and the vocoder (spectrogram to wav). This training process is focusing entirely on the encoder. How is the loss being calculated? If the loss is calculated in terms of "tacotron2 is able to generate spectrograms that sound more like this speaker," then the training here will not have a huge impact on overall quality or diversity. The training would need to be done on the synth, then the encoder. Do I have this backwards? Is it true that the encoder's final quality is bounded by the expressiveness of the synth? If that's correct, then the synth is what would benefit from the larger dataset. |
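To make the three-stage split in the question above concrete, here is a hedged pseudocode sketch of the pipeline; the method names are placeholders rather than this repo's exact API, and each stage is trained separately.

```python
# Pseudocode sketch of the three separately trained stages; the method names
# are placeholders, not this repo's API.
def clone_voice(reference_wav, text, encoder, synthesizer, vocoder):
    embedding = encoder.embed(reference_wav)         # stage 1: trained alone with the GE2E loss
    mel = synthesizer.text_to_mel(text, embedding)   # stage 2: trained with the encoder frozen
    wav = vocoder.mel_to_wav(mel)                    # stage 3: trained on mel spectrograms (ground-truth or synthesizer output)
    return wav
```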
It's not intuitive, I agree. However, this is clearly the conclusion the authors of the SV2TTS paper reached. They argue that most of the ability to clone voices lies in the training of the encoder. They also clearly show that the framework has limitations (which we observe in this repo as well):
If you give a listen to their LibriSpeech samples, you will notice that as well. |
Training updates

## Encoder
I've stopped training both the mixed and English encoders; the mixed encoder reached just over 2.1 million steps with 27,432 speakers.

## Synthesizer
Since I'm using LibriTTS I had to make some changes to the code base. First I used the Montreal Forced Aligner to come up with the alignments. Then I realized Google had already normalized the audio and removed the leading and trailing silence, so at that point I just skipped the alignment portion of preprocessing and used the original transcript (as opposed to the normalized one, which is also provided) with all punctuation and capitalization left in place. I know the English cleaner converts everything to lowercase though. I started training last night across two GTX 1080 Ti's and GPU utilization bounces between 20% and 93%. Overridden hparams:

## Training progress
TensorBoard

Stdout

    Step 27753 [1.664 sec/step, loss=0.68117, avg_loss=0.67622]
    Step 27754 [1.690 sec/step, loss=0.64809, avg_loss=0.67585]
    Step 27755 [1.687 sec/step, loss=0.68754, avg_loss=0.67603]
    Step 27756 [1.686 sec/step, loss=0.67575, avg_loss=0.67593]
    Step 27757 [1.675 sec/step, loss=0.65758, avg_loss=0.67573]
    Step 27758 [1.684 sec/step, loss=0.66391, avg_loss=0.67550]
    Step 27759 [1.687 sec/step, loss=0.66689, avg_loss=0.67528]
    Step 27760 [1.710 sec/step, loss=0.66279, avg_loss=0.67525]
    Step 27761 [1.681 sec/step, loss=0.69119, avg_loss=0.67565]
    Step 27762 [1.679 sec/step, loss=0.67129, avg_loss=0.67552]
    Step 27763 [1.677 sec/step, loss=0.69174, avg_loss=0.67563]
    Step 27764 [1.693 sec/step, loss=0.65657, avg_loss=0.67544]
    Step 27765 [1.692 sec/step, loss=0.66381, avg_loss=0.67518]
    Step 27766 [1.672 sec/step, loss=0.70290, avg_loss=0.67546]

Plots

WAVs

Questions:
|
Thanks for the feedback @LordBaaa . I generated that sample five times on the 428k model trying to get that pop to go away, before I became convinced that it was a feature of the model. |
Hello @sberryman! Could you provide the pretrained weights from #126 (comment) for the Mixed version? |
@blue-fish The wavs that you shared sound good! Are the wavs just the output of the vocoder, or end-to-end results where the encoder predicts the embedding and then the tacotron and vocoder models synthesize the audio? |
@Liujingxiu23 They are end-to-end results where I replicate the audio samples of the SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/ I use the reference audio from VCTK p240 and p260 to create the embedding and generate synthesized samples #0 and #1 using tacotron and the vocoder model. |
@Oktai15 I thought I had posted the links to the encoder for the mixed version. The tacotron and vocoder weights that I trained are useless; however, the encoder is quite good. |
@Oktai15 I think these are the settings you need to use @sberryman 's mixed encoder: #126 (comment) I have not tried it though. Please let us know if it works for you. |
@blue-fish Did you train the encoder, synthesizer, and vocoder yourself, as described above? I trained the encoder and synthesizer using a Chinese corpus, but the result is not as good as yours. For the encoder, did you remove the ReLU activation function in the last linear layer?
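For context on the ReLU question above, here is a tiny illustrative sketch of the final projection with and without the activation before L2-normalization; shapes and names are placeholders, so check the repo's encoder model for the real code.

```python
# Illustrative only: the final projection of a speaker encoder with / without
# the ReLU before L2-normalization. Shapes and names are placeholders.
import torch
from torch import nn

linear = nn.Linear(256, 256)
hidden = torch.randn(8, 256)                  # last hidden state, (batch, hidden_size)

embeds_raw = torch.relu(linear(hidden))       # with the ReLU (as asked about)
# embeds_raw = linear(hidden)                 # without the ReLU (the proposed change)
embeds = embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
```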
@Liujingxiu23 The info about the model training comes from this page: https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models The encoder and synthesizer are the original models by @CorentinJ . All I did was take his original vocoder model and continue the training to see what would result. I didn't even change any parameters except cutting the batch size in half (100 to 50) so it would fit in my GPU's limited memory. Edit: In case it is not clear, I used the training code in the repo without modification. I also used the same datasets (LibriSpeech train-clean-100 and -360) and processed them following these instructions: https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training Also, since Chinese is your target language, you should see @KuangDD 's work here: #30 (comment) if you haven't already. |
@blue-fish I see, thank you very much. |
I also experience that.
Thanks for publishing the code and basic training instructions!
Environment
Datasets: (9,063 speakers)
I'm working on adding TEDLIUM_release-3, which would add 1,925 new speakers, and potentially SLR68, which would add 1,017 Chinese speakers but would require some cleanup as there is a lot of silence in the audio files.
Hyper Parameters:
Left all parameters untouched.
Encoder training:
39,300 steps:
115,900 steps: (almost exactly 24 hours of training)
Typical step
Questions