Training a new encoder model #458

Closed
ghost opened this issue Jul 28, 2020 · 113 comments

ghost commented Jul 28, 2020

In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.

Instructions

  1. Download the LibriSpeech/train-other-500 and VoxCeleb 1/2 datasets. Extract these to your <datasets_root> folder as follows:
    • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
    • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
    • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)
  2. Change model_hidden_size to 768 in encoder/params_model.py (see the sketch after this list)
  3. python encoder_preprocess.py <datasets_root>
  4. Open a separate terminal and start visdom
  5. python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder
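
For step 2, here is a minimal sketch of what encoder/params_model.py would look like after the edit. Only model_hidden_size changes; the other values shown are assumed to be the repo defaults, so verify them against your checkout.

# encoder/params_model.py -- sketch after the edit (only model_hidden_size changes;
# the remaining values are assumed defaults, verify against your copy of the repo)

## Model parameters
model_hidden_size = 768     # was 256; widens the LSTM hidden state
model_embedding_size = 256  # the final embedding size is unchanged
model_num_layers = 3

## Training parameters
learning_rate_init = 1e-4
speakers_per_batch = 64
utterances_per_speaker = 10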

ghost commented Jul 31, 2020

@mbdash I will be unavailable this weekend. Hopefully the commands will just work. You can reach out to the community for help, in particular @sberryman, who has gone through this in #126.

There are two things that I hope to learn from training this new encoder model.

  1. Does the voice cloning improve with the new model? (i.e. does increasing the hidden layer size from 256 to 768 make a difference when the output is still projected down to a 256-dimensional embedding at the end?)
  2. Will the new encoder model be compatible with the existing synthesizer?

My hypothesis is that for 1, we will not see a difference unless we also retrain the synth with many more voices. And for 2, that it should be compatible since the dimensions, input data and loss function are not changing. I may very well be wrong on that since I have not studied the encoder in detail.


mbdash commented Aug 3, 2020

update: still preparing the data. I might start the training tomorrow.


mbdash commented Aug 3, 2020

Preprocess started @10h00 EST 2020 08 03

Question:
Why do I need to start visdom?


ghost commented Aug 3, 2020

It's optional, but starting a visdom server allows you to visualize the training results by navigating to http://localhost:8097

The umap projections will let us know whether the encoder has learned to distinguish between the voices in the training set. This in turn helps us decide when to stop training.
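
For reference, a minimal sketch of how to bring up visdom and confirm it is reachable before launching encoder_train.py (assumes the visdom package is installed; 8097 is visdom's default port):

# Start the server in a separate terminal:
#   python -m visdom.server
# Optional connectivity check (hypothetical snippet, not part of the repo):
import visdom

vis = visdom.Visdom(server="http://localhost", port=8097)
if vis.check_connection():
    print("visdom is up; training plots and umap projections will appear at http://localhost:8097")
else:
    print("visdom not reachable; start it with: python -m visdom.server")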


mbdash commented Aug 3, 2020

update
I got a crash. I have to figure out what happened.

The dataset resides on an NFS share on my Unraid host, with many TB available, so it is not a lack of space for the dataset.
I will force a chown -R user and chmod -R 766 on the whole dataset and try again.

[screenshot: error traceback]


ghost commented Aug 3, 2020

By any chance did you have the text file (<datasets_root>/LibriSpeech/_sources.txt) open in a viewer?


mbdash commented Aug 3, 2020

Nope.
It might have been a hiccup due to using an NFS share.

I noticed these file permissions:
-rw-r--r-- 1 99 users Log_LibriSpeech_train-other-500.txt

Permission inheritance might have caused some issues, since my user falls under group:users.
Changing the dataset root folder owner recursively, instead of relying on group membership, should fix the issue.

I'll keep you posted on updates.


mbdash commented Aug 3, 2020

I fixed the previous error (see bottom of comment) but I got another crash in VoxCeleb2:

[screenshot: error traceback]

Here is my current pysoundfile version:
[screenshot: pysoundfile version]

Here are the last files processed:

drwxr-xr-x 1    99 users    8676 Aug  3 21:08 VoxCeleb1_wav_id11249
drwxr-xr-x 1    99 users    3594 Aug  3 20:48 VoxCeleb1_wav_id11250
drwxr-xr-x 1    99 users    2586 Aug  3 20:56 VoxCeleb1_wav_id11251
drwxr-xr-x 1    99 users      24 Aug  3 21:34 VoxCeleb2_dev_aac_id00517
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id00906
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id00924
drwxr-xr-x 1    99 users     864 Aug  3 21:34 VoxCeleb2_dev_aac_id01184
drwxr-xr-x 1    99 users     192 Aug  3 21:34 VoxCeleb2_dev_aac_id02074
drwxr-xr-x 1    99 users     570 Aug  3 21:34 VoxCeleb2_dev_aac_id02477
drwxr-xr-x 1    99 users    1074 Aug  3 21:34 VoxCeleb2_dev_aac_id03184
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id03701
drwxr-xr-x 1    99 users    1074 Aug  3 21:34 VoxCeleb2_dev_aac_id04961
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id06261
drwxr-xr-x 1    99 users     318 Aug  3 21:34 VoxCeleb2_dev_aac_id07417
drwxr-xr-x 1    99 users     108 Aug  3 21:34 VoxCeleb2_dev_aac_id07531

For the previous crash,
my guess is that in encoder/preprocess.py the log file handle is kept open for too long (1h30min+) and eventually becomes invalid.
So I made some local modifications: the log file is opened for writing only during init, and then reopened for appending on each write and when finalizing.

# Imports needed to run this snippet standalone (these already exist at module level in encoder/preprocess.py)
from datetime import datetime
from pathlib import Path

import numpy as np


class DatasetLog:
    def __init__(self, root, name):
        self.fpath = Path(root, "Log_%s.txt" % name.replace("/", "_"))
        self.sample_data = dict()
        start_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
        with open(self.fpath, "w") as f:
            self.write_line("Creating dataset %s on %s" % (name, start_time), file_handle=f)
            self.write_line("-----", file_handle=f)
            self._log_params(file_handle=f)
        
    def _log_params(self, file_handle):
        from encoder import params_data
        self.write_line("Parameter values:", file_handle=file_handle)
        for param_name in (p for p in dir(params_data) if not p.startswith("__")):
            value = getattr(params_data, param_name)
            self.write_line("\t%s: %s" % (param_name, value), file_handle=file_handle)
        self.write_line("-----", file_handle=file_handle)
    
    def write_line(self, line, file_handle=None):
        # If a handle is passed in (init/finalize), write through it; otherwise open the
        # log in append mode just for this write, so no handle stays open across the run.
        if file_handle:
            file_handle.write("%s\n" % line)
        else:
            with open(self.fpath, "a") as f:
                f.write("%s\n" % line)
        
    def add_sample(self, **kwargs):
        for param_name, value in kwargs.items():
            if not param_name in self.sample_data:
                self.sample_data[param_name] = []
            self.sample_data[param_name].append(value)
            
    def finalize(self):
        with open(self.fpath, "a") as f:
            self.write_line("Statistics:", file_handle=f)
            for param_name, values in self.sample_data.items():
                self.write_line("\t%s:" % param_name, file_handle=f)
                self.write_line("\t\tmin %.3f, max %.3f" % (np.min(values), np.max(values)), file_handle=f)
                self.write_line("\t\tmean %.3f, median %.3f" % (np.mean(values), np.median(values)), file_handle=f)
            self.write_line("-----", file_handle=f)
            end_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
            self.write_line("Finished on %s" % end_time, file_handle=f)


ghost commented Aug 4, 2020

@mbdash Searching on the error message, I came across the suggestion to convert the m4a files to wav, which should fix the problem for VoxCeleb2: #76 (comment)


mbdash commented Aug 4, 2020

@blue-fish I will try to convert them when I have some time.
I'll keep you posted.


mbdash commented Aug 6, 2020

@blue-fish m4a to wav conversion in progress.
(I was out of commission for a few days.)
24k files done as of writing this.


ghost commented Aug 6, 2020

@mbdash Although it is preferable to change just one variable with our training experiment, we know that the encoder gets better with more voices so I would like to suggest including the Mozilla CommonVoice dataset, which has over 60k unique English speakers: https://voice.mozilla.org/en

Let's try to incorporate this one if you have the time and patience to preprocess it. @sberryman has written a snippet of code for just that purpose: https://github.com/sberryman/Real-Time-Voice-Cloning/blob/d6ba3e1ec0f950636e9cac3656c0be5c331821cc/encoder/preprocess.py#L224-L244


ghost commented Aug 6, 2020

Some thoughts on the encoder model

Maybe a better encoder is not needed after all, depending on the objective. Although the SV2TTS paper demonstrated the possibility of high-quality zero-shot cloning, I think what most people are after is a high-quality single-speaker TTS. If that is the objective, we have demonstrated in #437 that a decent single-speaker model can be finetuned from the pretrained models with significantly less effort than traditional TTS models. The required dataset goes from 10+ hours to about 10 minutes, a reduction of nearly 2 orders of magnitude.

For this purpose, the speaker encoder acts as a starting point for the finetuning task and the quality of encoding mainly determines how much finetuning is needed. The best case is that no additional training is needed, i.e. high-quality zero shot voice cloning per the SV2TTS paper. The worst case is bounded by the 10+ hours needed to train a single-speaker TTS.

With a better encoder and synthesizer, the required dataset for finetuning can realistically only go down by 1 order of magnitude, to about 1 minute of audio. A reduction of 2 orders of magnitude (a 10-second dataset) would be equivalent to zero-shot cloning in terms of performance.

While the idea of making a voice with just 1 minute of training data is more appealing than the current 10 minutes, is it an order of magnitude improvement from the perspective of the end user? Or in other words, how much effort is appropriate for the encoder given the potential improvement to be had? Arguably, the encoder is already good enough and our limited resources are better spent on the synthesizer which has a lot of known issues.


mbdash commented Aug 6, 2020

I have begun downloading the Mozilla CommonVoice dataset and will add it to the encoder pretraining.

I am adding the preprocess fn to my version of encoder/preprocess.py
(note that I hardcoded a default fallback value: lang = lang or 'en')

def preprocess_commonvoice(datasets_root: Path, out_dir: Path, lang=None, skip_existing=False):
    lang = lang or 'en'    
    # simple dataset path
    dataset_name = "CommonVoice/{0}/speakers".format(lang)

    # Initialize the preprocessing
    dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
    if not dataset_root:
        return

    # Preprocess all speakers
    speaker_dirs = sorted(list(dataset_root.glob("*")))

    # speaker_dirs = speaker_dirs[0:4000] (complete)
    # speaker_dirs = speaker_dirs[4000:5000] (complete)
    # speaker_dirs = speaker_dirs[5000:7000] (complete)
    # speaker_dirs = speaker_dirs[7000:8000] (complete)
    # speaker_dirs = speaker_dirs[8000:9000] (in-progress)
    # speaker_dirs = speaker_dirs[9000:] (in-progress)

    _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir, "wav",
                             skip_existing, logger)

I also updated my encoder_preprocess.py accordingly.
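
For reference, a minimal sketch of how the new function could be wired into encoder_preprocess.py. The dataset-name dispatch shown here is an assumption about how the script is organized, so adapt it to your local copy:

# encoder_preprocess.py (hypothetical excerpt) -- register the CommonVoice preprocessor
from encoder.preprocess import (preprocess_librispeech, preprocess_voxceleb1,
                                preprocess_voxceleb2, preprocess_commonvoice)

preprocess_func = {
    "librispeech_other": preprocess_librispeech,
    "voxceleb1": preprocess_voxceleb1,
    "voxceleb2": preprocess_voxceleb2,
    "commonvoice": preprocess_commonvoice,   # new entry
}

# ...then, for each dataset name passed on the command line, something like:
#   preprocess_func[dataset](args.datasets_root, out_dir, skip_existing=args.skip_existing)
# preprocess_commonvoice also accepts lang=, falling back to "en" as hardcoded above.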

I will keep you guys updated.
(and I will attempt to push my changes when done)


ghost commented Aug 7, 2020

Following up on my earlier comments, this table is from 1806.04558 (the SV2TTS paper):

[screenshot: table of results from the SV2TTS paper]

Finally, we note that the proposed model, which uses a speaker encoder trained separately on a corpus of 18K speakers, significantly outperforms all baselines

@mbdash It is still worth an attempt to add the 60k speakers from CommonVoice to the encoder training and increase the hidden layer size, to see if we can achieve open-source zero-shot voice cloning that is as good as the results they published. While you're preparing that dataset I will also read the GE2E paper to see if anything else should be changed for this experiment.

Edit: If you think VoxCeleb is too noisy of a dataset we can also try LibriSpeech + CommonVoice, or just CommonVoice alone.


mbdash commented Aug 9, 2020

I am about to begin preprocessing CommonVoice.
If there are any modifications you want to make before I start the training, please let me know.

Update:
mp3 files return the same errors as m4a...
I guess I will have to convert them to wav...

[screenshot: error traceback]

Update 2h later: I think I have converted 50%+ to wav.


ghost commented Aug 9, 2020

If there are any modifications you want to make before I start the training, please let me know.

Just one mod to make: model_hidden_size = 768 in encoder/params_model.py.


ghost commented Aug 9, 2020

@mbdash One more request to make if training has not started yet. I just read #364 (comment) and would like to do the encoder training in 2 phases.

  1. Start training on LibriSpeech + CommonVoice only, it should converge relatively fast, save off the model
  2. Resume training on the model after adding VoxCeleb 1+2 to the training set

The question is, does the phase 2 model perform better than the model from phase 1? I forget how the SV2TTS folder is structured and if this can be easily implemented (maybe you could make a separate datasets_root/SV2TTS/encoder folder and use symbolic links to present only the selected datasets to the encoder). It would be a very interesting data point that would help those who want to do encoder training in the future.

Edit: Given that VCTK is a popular dataset it would be nice to include it either in phase 1 or 1.5 to ensure the resulting model performs well with it. But it is only 110 voices so just a drop in the bucket compared to the others, and not worth holding up the training for it.
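
A minimal sketch of the symlink idea mentioned above (a hypothetical helper, not repo code): keep everything in one preprocessed folder and expose only the phase-1 subset to encoder_train.py through a second folder of symbolic links. The folder-name prefixes are examples and should be checked against the actual names in SV2TTS/encoder.

# build_phase_dir.py (hypothetical) -- expose a subset of preprocessed speakers via symlinks
from pathlib import Path

def build_phase_dir(full_dir: Path, phase_dir: Path, include_prefixes):
    """Symlink every speaker folder in full_dir whose name starts with one of
    include_prefixes (e.g. "LibriSpeech_", "CommonVoice_") into phase_dir."""
    phase_dir.mkdir(parents=True, exist_ok=True)
    for speaker_dir in full_dir.iterdir():
        if speaker_dir.is_dir() and speaker_dir.name.startswith(tuple(include_prefixes)):
            link = phase_dir / speaker_dir.name
            if not link.exists():
                link.symlink_to(speaker_dir.resolve(), target_is_directory=True)

# Phase 1: train on LibriSpeech + CommonVoice only
#   build_phase_dir(Path("<datasets_root>/SV2TTS/encoder"),
#                   Path("<datasets_root>/SV2TTS/encoder_phase1"),
#                   ["LibriSpeech_", "CommonVoice_"])
#   python encoder_train.py new_model <datasets_root>/SV2TTS/encoder_phase1
# Phase 2: point encoder_train.py back at the full encoder folder and resume training.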


mbdash commented Aug 9, 2020

@blue-fish I am still converting CommonVoice mp3s to wav. It has been running for hours.

(I began downloading VCTK)

Can I simply move all the VoxCeleb 1 & 2 folders out of the /SV2TTS/encoder folder and move them back for phase 2?
Or should I restart the preprocessing, selecting only LibriSpeech + CommonVoice?

I did not look deep enough into the code to see if the preprocessing does anything other than populate the /SV2TTS/encoder folder.

Here is a sample of what the /SV2TTS/encoder folder looks like:
[screenshot: folder listing]


ghost commented Aug 10, 2020

Can I simply move all the VoxCeleb 1 & 2 folders out of the /SV2TTS/encoder folder and move them back for phase 2?

Good idea @mbdash , I looked at the code and I think that will work.


mbdash commented Aug 10, 2020

Please note that because the CommonVoice files were separated into subfolders, the preprocessing is interpreting each folder as a different speaker.

I do not know if the training quality will be affected by this, since random files from different speakers are mixed within those folders.

Also note this error:
[screenshot: warning message]

The preprocessing seems to keep going despite the warning.


ghost commented Aug 10, 2020

I do not know if the training quality will be affected by this, since random files from different speakers are mixed within those folders.

Thanks for bringing this up @mbdash . The whole point of the speaker encoder is to learn to distinguish voices from different speakers. If the folder name is used to uniquely ID the speaker then mixing will be disastrous. Is there any metadata in CommonVoice that can help sort things out before you preprocess? @sberryman can you share how you preprocessed CommonVoice for encoder training?

Edit: Would this issue still exist if you treat each CommonVoice subfolder as its own dataset?


sberryman commented Aug 10, 2020

@blue-fish and @mbdash

You should cancel your current pre-processing. You need a unique folder per speaker in the pre-processed folder for the encoder. I wrote a little script to pre-process common voice dataset(s) for each language. It was run against a release of CV from at least a year ago. I doubt the format of validated.tsv has changed but just keep that in mind.

Script: https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/scripts/cv_2_speakers.py

You'll need to adjust line 26 for the base directory of common voice. One of the arguments to the script is --lang which is just the subfolder for the language. Fairly useless if you plan to hardcode the path on line 26.

The other arguments are for min and max number of audio segments per speaker. Feel free to adjust that based on your needs, I found that minimum of 5 worked well for me.

So this loops over every speaker id in the validated.tsv file and groups the audio clips per speaker into a dictionary. Then it processes each speaker, grabs the first 20 chars of the speaker id and uses that for the path name in the pre-processed directory. Finally it uses ffmpeg to convert the mp3s to wav and downsample to 16,000 Hz. The sample rate is hardcoded, so if you want to adjust it, change it on line 93.

It takes a while but works great; I did the entire CV dataset for my encoder (all languages).

Edit: Also be very careful about lines 60-61. It will rmtree the output path!
Edit 2: When this step has finished you can run the encoder pre-processing script against the {base_path}/{lang}/speakers directory.
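
For anyone following along, here is a rough sketch of the flow described above. This is not the linked script; the paths, the validated.tsv column names and the ffmpeg flags are assumptions made for illustration:

# Hypothetical sketch: group CommonVoice clips per speaker and convert mp3 -> 16 kHz wav
import csv
import subprocess
from collections import defaultdict
from pathlib import Path

base_dir = Path("cv-corpus-5.1-2020-06-22/en")    # adjust to your download
out_dir = base_dir / "speakers"
min_clips = 5                                     # skip speakers with too few clips

# Group clips by speaker id from validated.tsv (assumed columns: client_id, path)
clips_by_speaker = defaultdict(list)
with open(base_dir / "validated.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        clips_by_speaker[row["client_id"]].append(row["path"])

for speaker_id, clips in clips_by_speaker.items():
    if len(clips) < min_clips:
        continue
    speaker_dir = out_dir / speaker_id[:20]       # first 20 chars of the speaker id
    speaker_dir.mkdir(parents=True, exist_ok=True)
    for clip in clips:
        src = base_dir / "clips" / clip
        dst = speaker_dir / (Path(clip).stem + ".wav")
        # ffmpeg: mono, resampled to 16 kHz
        subprocess.run(["ffmpeg", "-y", "-i", str(src), "-ac", "1", "-ar", "16000", str(dst)],
                       check=True, capture_output=True)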


mbdash commented Aug 10, 2020

@sberryman where can I find the validated.tsv?


sberryman commented Aug 10, 2020

@mbdash There is a validated.tsv included in every language download from Common Voice.

For example:

  1. Download Greek dataset
  2. Extract el.tar.gz
  3. base_dir="./cv-corpus-5.1-2020-06-22/{lang}/"
  4. You will see several tsv files, validated.tsv being one of them.

The clips folder is where all the audio clips are stored and what I'm assuming you are running encoder pre-processing against.

My script will create a new folder called speakers in the base_dir and will then create a new sub folder for each speaker which will include the language (el) and the first 20 chars of the speaker_id provided by CV.

Once that script finishes you'll be able to run the encoder pre-processing against the {base_dir}/speakers directory.


ghost commented Aug 10, 2020

@sberryman Thank you so much for the prompt and helpful replies.


mbdash commented Aug 10, 2020

@sberryman yes, thank you for your quick response. I guess I might have "misplaced" a folder :-s

I will delete my extracted files, untar again and be more careful.

Thank you.

@sberryman

@mbdash no need to apologize, none of this is really documented. It requires reading through tons of comments on issues, some of which are closed, I'm sure.

It is neat to see such strong demand for this project.

I had an idea a while ago to create a platform for cloning facial images and speech and standardize the training process a bit by making it easy to swap out backbone architectures, etc. Then ideally people could join in the project in various capacities. Some might help label new data, add new datasets, contribute their GPU(s) to a training pool, etc. Then we could build a web based UI to interact and run inference on pre-trained models. It is clear to me that the UI aspect of this project made it very approachable to everyone. But fine tuning or changing out datasets is confusing. (I've been working on the visual side recently)

@sberryman

I should also clarify that the script I linked to does NOT add the language to the speaker-specific folders. It just grabs the first 20 chars of the speaker_id provided by CV.

Ensure you are using the modifications I made to encoder/preprocess.py for preprocessing common voice.
https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/encoder/preprocess.py#L224-L244


mbdash commented Aug 10, 2020

I can confirm I previously deleted the tsv files during the process...
I can see them now that I am re-extracting the archive.
I will also re-process using your script, since it specifies a sample rate for the wav conversion, which I didn't do when I did it last weekend.
I will keep you guys posted.


ghost commented Aug 25, 2020

@ustraymond Code for calculating EER is shared in #61 (comment) and #126 (comment). The code is identical but the context for discussion is slightly different.

@mueller91

Dear @mbdash,
thank you for providing the GPU and publishing the models. One curious observation, though: I use your model to embed a batch of utterances and compute the inter- and intra-class cosine similarity (i.e. the cosine similarity for all pairs s_i, s_j where the speakers are different, or the same, respectively).
I obtain a mean inter-similarity of around 0.45, and an intra-similarity of around 0.9.

  • I would expect an inter-similarity close to 0 (since this is the training objective).
  • These values do not correspond to the very low loss you report.

Are you still using the ReLU activation? Shouldn't ReLU be disadvantageous, since it limits the expressiveness of the model, or, more intuitively, it limits 'where' on the hypersphere the model can map to, namely only the positive domain?
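
For anyone who wants to reproduce this kind of check, a minimal sketch of the inter/intra-class cosine similarity computation (hypothetical helper; it assumes you already have L2-normalized utterance embeddings and integer speaker labels):

# Hypothetical sketch: mean intra- vs inter-speaker cosine similarity of embeddings
import numpy as np

def inter_intra_similarity(embeds: np.ndarray, speaker_ids: np.ndarray):
    """embeds: (N, D) array of L2-normalized utterance embeddings.
    speaker_ids: (N,) array of speaker labels."""
    sims = embeds @ embeds.T                         # cosine similarity (unit-norm vectors)
    same = speaker_ids[:, None] == speaker_ids[None, :]
    off_diag = ~np.eye(len(embeds), dtype=bool)      # ignore self-similarity
    intra = sims[same & off_diag].mean()             # same speaker, different utterances
    inter = sims[~same].mean()                       # different speakers
    return intra, inter

# One would expect intra >> inter; with a ReLU before L2 normalization all embedding
# components are non-negative, so inter tends to sit well above 0 rather than near 0.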


mbdash commented Aug 25, 2020

@mueller91 I am not the guy you are looking for, the wise guy with the answers is @blue-fish.

Note that there are two encoder models.
Be sure you are downloading the proper one.

The 1st one was trained on "LibriSpeech + CommonVoice + VCTK" only, until step 315k.
Available here: https://drive.google.com/drive/folders/1OkHpeV3i5fGzI6shhjY3nkpN9jXGk7Ak?usp=sharing

The 2nd encoder model, "LibriSpeech + CommonVoice + VCTK until step 315k + VoxCeleb1&2", is the encoder above with VoxCeleb 1 & 2 added to the dataset after step 315k.
(I stopped the training at 315k, added more datasets (VoxCeleb), and resumed training from step 315k.)
I am constantly (daily) updating this new model.
Available here: https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

The latest upload is at 525k steps.
Currently I am at step 531k locally.

I would wait for me to reach 750k before doing anything with this encoder if I were you;
the loss is currently bouncing between 0.026 and 0.04 non-stop.

[screenshot: loss plot]


ghost commented Aug 25, 2020

@mueller91 I think your observations are explained by our continued use of ReLU (which is not a deliberate choice, we just used the repo code without modification). @sberryman removed ReLU in resemble-ai/Resemblyzer#13 which causes inter-similarity to be centered around zero instead of 0.5 as in our case.

We will continue using ReLU as long as the encoder model needs to support Corentin's pretrained encoder which also uses it.
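
For context, a rough sketch of the final projection being discussed (a simplification, not the repo's exact code): the last LSTM hidden state goes through a Linear layer, an optional activation, then L2 normalization. With ReLU every embedding component is non-negative, which is why inter-speaker similarities end up centered near 0.5 instead of 0.

# Simplified sketch (PyTorch) of the encoder's final projection; not the repo's exact code
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    def __init__(self, hidden_size=768, embedding_size=256, activation="relu"):
        super().__init__()
        self.linear = nn.Linear(hidden_size, embedding_size)
        self.activation = activation

    def forward(self, lstm_last_hidden):
        x = self.linear(lstm_last_hidden)
        if self.activation == "relu":      # repo default: embedding components are non-negative
            x = torch.relu(x)
        elif self.activation == "tanh":    # sberryman's variant: components in [-1, 1]
            x = torch.tanh(x)
        # else: no final activation, as in some other implementations
        return x / (x.norm(dim=1, keepdim=True) + 1e-5)   # L2-normalize onto the unit hypersphere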

@sberryman

@blue-fish how does @mbdash's model work with the existing synth/vocoder? I would assume not very well, producing a generic voice?

If that is the case, @mbdash should stop training and start over with Tanh as the final activation. Then you'll use @mbdash's new model to train the synthesizer and vocoder from scratch (several weeks' worth of GPU time).

Replicating Corentin's work training from scratch would likely require well over 700 hours of training using two 1080 Tis. They are using much larger GPUs (40 GB of memory, I believe) at Resemble.ai to train more quickly. The 700 hours is a very rough estimate to illustrate the several weeks of training needed for each of the three models.

@mueller91

@sberryman Could you elaborate on why you chose Tanh as the final activation?


ghost commented Aug 25, 2020

@sberryman When I train a VCTK-based synthesizer to 100k steps on my basic GPU, I get a very similar result for voice cloning regardless of whether I use Corentin's model or @mbdash's 315k model (LS+VCTK+CV) as the speaker encoder for training and inference. See: #458 (comment)

That result was completely unexpected, and maybe I should delete my pycache just to be very sure that I performed the experiment properly. But because of the different hidden unit size, it is impossible to use the wrong encoder with the wrong synthesizer.

Also, the pretrained encoder bundled with Resemblyzer is identical to the one in this repo, which I recall took ~20 days to train on a single 1080 TI.


sberryman commented Aug 25, 2020

@mueller91 I used tanh as the final activation to force the values between -1 and 1 as opposed to 0-1 with ReLU. I never tried training with no final activation so I'm not sure how that would turn out to be honest.

resemble-ai/Resemblyzer#13 (comment)

Edit: I didn't do a good job documenting and checking in code throughout all the experiments so there is a chance the model I trained for 1M+ steps didn't have a final activation. @blue-fish probably knows that better than me at this point. There is a chance tanh was only used in an experiment where I was trying to build an encoder model based on raw waveform.

Edit 2: Corentin doesn't think it makes much of a difference though. resemble-ai/Resemblyzer#15 (comment)

Edit 3: This implementation doesn't use an activation function after the LSTM. https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/master/speech_embedder_net.py


ghost commented Aug 25, 2020

Edit: I didn't do a good job documenting and checking in code throughout all the experiments so there is a chance the model I trained for 1M+ steps didn't have a final activation. @blue-fish probably knows that better than me at this point.

There is no state associated with the activation, so it's not possible to tell just by looking at the checkpoint file. When I hacked the final linear layer in #458 (comment), I noticed the loss came down very quickly using ReLU. The loss was already down to 0.01 within 10 steps of restarting training. So if I had to guess, that particular 768/768 English encoder was likely using ReLU.

@sberryman Thanks for digging up and sharing those additional links.


mbdash commented Aug 27, 2020

At step 600k, the loss is still moving between 0.015 and 0.04,
but it is hitting lower values more often.

https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

[screenshot]

encoder_mdl_ls_cv_vctk_vc12_umap_600000

@sberryman

@mbdash training is looking good, though there is still some overlap between clusters. Have you tried plotting cross-similarity matrices?

resemble-ai/Resemblyzer#13 (comment)

I also did a few plots for a much larger number of speakers here:
resemble-ai/Resemblyzer#13 (comment)

"Default" is the model included in this repository and "768" is the model I trained. It looks like my EER was 0.00392 at 2.38M steps.


mbdash commented Sep 1, 2020

Step 750k reached; we are still at around 0.03 loss.

[screenshot]

encoder_mdl_ls_cv_vctk_vc12_umap_757500

@mueller91

Dear @mbdash , any updates? If you find the time to share the current model, it'd be much appreciated! :)


mbdash commented Sep 6, 2020

[screenshot]
encoder_mdl_ls_cv_vctk_vc12_umap_945000

https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

loss 0.015 to 0.027

cheers


mbdash commented Sep 7, 2020

1,000,000 steps reached.

Loss 0.016 to 0.022.

[screenshot]

encoder_mdl_ls_cv_vctk_vc12_umap_1000100

@CorentinJ

Hey guys, just went through this thread quickly.

I indeed removed the ReLU layer in the voice encoder we use at Resemble.AI. I think the model on Resemblyzer still has it. I planned to release a new one which, among other things, wouldn't be trained on data that would have silences at the start and end of each clip.

I don't think I'll update the code in this repo, but I should update the code on Resemblyzer when the new model's released.


mbdash commented Sep 11, 2020

Hi, we are making a group effort to build a new dataset, curated and cleaned: quality over quantity.

VoxCeleb is rejected due to its horrible quality.
VCTK might eventually have some bits in it.
CommonVoice will be part of it.
LibriTTS 100 / 360 / 500 will mostly be the base (1st iteration).

Join the Slack for more info.


lnguyen commented Sep 11, 2020

@mbdash what slack?


ghost commented Sep 11, 2020

Anyone who wants to contribute in some way to the RTVC project is welcome to join the Slack. Leave a comment in #474 and we will provide an invite link.

@CorentinJ

VoxCeleb is rejected due to its horrible quality.

The idea is to have a dataset with low quality though


ghost commented Sep 11, 2020

Mozilla TTS is also developing a speaker encoder in mozilla/TTS#512. I am inviting Mozilla TTS contributors to this discussion to see if we can decide on a common model structure and share thoughts on datasets and preprocessing techniques. In the best case, we could even share the model.


mbdash commented Sep 29, 2020

A small update for anyone watching this thread.
@steven and @blue-fish are doing some experimentation with training.
I am currently cleaning up datasets to remove noise and artifacts from the source data used to train the models.

I saw @CorentinJ's comment:

VoxCeleb is rejected due to its horrible quality.

The idea is to have a dataset with low quality though

We are just playing around, pooling our resources and experimenting to improve audio output quality, as well as adding some punctuation support.

I am done with LibriTTS60/train-clean-100
and progressing through 360.

@sberryman

@mbdash I believe Corentin was referring to low-quality audio (background noise, static, etc.) being important while training the encoder. Clean audio is important for synthesis.


ghost commented Sep 30, 2020

In the SV2TTS paper it is stated that "the audio quality [for encoder training] can be lower than for TTS training" but between this and the GE2E paper I have not seen a statement that it should be of lower quality. The noise might train the network to distinguish based on features that humans can perceive as opposed to subtler differences that can be found in clean audio. But I think it's worth running the experiment to see if this is truly the case.

@sberryman

@blue-fish I completely agree, running the experiment is the best option.


ghost commented Oct 14, 2020

If anyone is wondering, training is still paused while @mbdash is denoising datasets and @steven850 is doing trial runs to determine the best hparams for the encoder. It's a slow process. We plan to swap out the ReLU for a Tanh activation, but will try to match the model structure of the updated Resemblyzer encoder if it is released.


ghost commented Oct 25, 2020

It's going to be a while before we get back to encoder training. I'm going to close this issue for now. Will reopen when we restart.

@ghost ghost closed this as completed Oct 25, 2020
@webbrows

  • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
  • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)

Hmm, which of these do I need to download?
The VoxCeleb site has both the metadata and the audio files.
