
Updates for synthesizer training using LibriTTS #413

Closed
ghost opened this issue Jul 9, 2020 · 9 comments · Fixed by #441
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@ghost

ghost commented Jul 9, 2020

I am certain someone has done this before (such as @sberryman in #126). Would someone please share the code modifications needed to train the synthesizer on LibriTTS?

If we can improve the training process to use LibriTTS in place of LibriSpeech, we can also generate a new set of pretrained models for better output quality.

Here are some questions to get it started... but feel free to skip ahead and share finished code if it's already available.

ghost added the enhancement and help wanted labels on Jul 9, 2020
@ghost

ghost commented Jul 9, 2020

Naturally we will also want to train the vocoder on LibriTTS, but one step at a time.

@sberryman

@blue-fish It is good to see someone continue working on this project!

I've made an enormous number of changes to the project, but I've included a few of the most important ones in a forked branch, which you can find at https://github.com/sberryman/Real-Time-Voice-Cloning/tree/wip

For the encoder, you'll see preprocessing methods for the LibriTTS, VoxCeleb, VCTK, Common Voice, TIMIT, Nasjonal, and TED-LIUM datasets.
https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/encoder/preprocess.py

For the synthesizer, I've only implemented preprocessing for LibriTTS.
https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/synthesizer/preprocess.py

To generate alignments for LibriTTS, I used the Montreal Forced Aligner (MFA). I created a Dockerfile specifically for alignment, which you can find in my fork:
https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/Dockerfile.align

Alignments:

bin/mfa_align \
  /datasets/CommonVoice/en/speakers \
  /datasets/slr60/english.dict \
  /opt/Montreal-Forced-Aligner/dist/montreal-forced-aligner/pretrained_models/english.zip \
  /output/montreal-aligned/cv-en/

That is an example of generating alignments for Common Voice, but changing the path to slr60/LibriTTS should be very simple.
Once you generate the TextGrid files using MFA, you need to convert them to alignment files in the format required for training. You'll find that script in the repository as well:
https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/scripts/textgrid_to_alignments.py
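
The conversion itself is conceptually simple. Here is a rough sketch of the idea, assuming the textgrid pip package and the LibriSpeech-style alignment format (one line per utterance: id, quoted comma-separated words, quoted comma-separated end times); the real logic lives in the script linked above:

import textgrid  # pip install textgrid

def textgrid_to_alignment_line(utt_id, textgrid_path):
    # Collapse the "words" tier of an MFA TextGrid into one alignment line.
    tg = textgrid.TextGrid.fromFile(textgrid_path)
    words, end_times = [], []
    for interval in tg.getFirst("words"):
        words.append(interval.mark or "")            # empty mark == silence
        end_times.append("%.3f" % interval.maxTime)  # word end time in seconds
    return '%s "%s" "%s"' % (utt_id, ",".join(words), ",".join(end_times))

# e.g. textgrid_to_alignment_line("84_121123_000008_000000", "84_121123_000008_000000.TextGrid")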

I spent months training and trying all sorts of variants of the encoder, so my code and project are a complete mess. I'm happy to help answer any questions you have, though.

@ghost

ghost commented Jul 10, 2020

Thanks for sharing all that @sberryman . Is there any reason we cannot generate and publish a set of alignments for LibriTTS (just like the LibriSpeech alignments)? This seems like a good idea, to keep the training setup as easy and repeatable as it currently is.

I'm also surprised that the sample rate in synthesizer/hparams.py is not updated to match the 24k sampling rate of LibriTTS. Can you confirm if an hparams update is needed for synthesizer training on LibriTTS?
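
For reference, if an update is needed, I'd expect it to look roughly like this (just scaling the current 16 kHz settings by 1.5x to keep the same frame shift and window length; these numbers are my guess, not tested values):

# synthesizer/hparams.py: hypothetical edits for 24 kHz LibriTTS audio
sample_rate = 24000   # LibriTTS is distributed at 24 kHz (was 16000)
n_fft = 1200          # was 800
hop_size = 300        # keeps the 12.5 ms frame shift (was 200)
win_size = 1200       # keeps the 50 ms window (was 800)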

@sberryman

I think I've deleted my copy of the alignments, but I don't see why you wouldn't be able to include those to speed up training. I would suggest documenting and leaving scripts in the repo to generate alignments, though. That will make it much easier for people to use new/other datasets for training. I wasted a lot of time figuring out how to install the dependencies, generate the TextGrid files, and convert them into the proper format for training.

I don't think I updated synthesizer/hparams.py on my branch. Like I said, code is a mess and I have 10+ derivatives of it scattered around.

If someone is going to take on the full-stack training, they should start from scratch and adjust the encoder hparams to 768 hidden units with an embedding size of 256. I had started training before Corentin replied to say the embedding size should stay at 256. They should also be prepared to wait a LONG, LONG time to train the 3 models, or ideally have access to better GPUs than my 1080 Tis. The more memory you have on the GPU for training the encoder, the faster it will converge (if it ever does?). I know Corentin has tried batch sizes significantly higher than I was able to hit with two 1080 Tis, which trained much more quickly.

If you read through my issue here and on the Resemblyzer project, you'll see that Corentin suggested using a different vocoder. There are quite a few open-source vocoders out there that should be fairly easy to implement; you just need to concatenate the speaker embedding to make them multi-speaker.
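
To illustrate that last point, here is a minimal PyTorch sketch of what conditioning on the speaker embedding could look like; the module and dimensions are made up for illustration, not taken from any particular vocoder:

import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    # Tile a speaker embedding across time and concatenate it onto the mel
    # spectrogram before it enters a vocoder's conditioning network.
    def __init__(self, n_mels=80, spk_dim=256, hidden=128):
        super().__init__()
        self.proj = nn.Conv1d(n_mels + spk_dim, hidden, kernel_size=1)

    def forward(self, mels, spk_embed):
        # mels: (batch, n_mels, frames), spk_embed: (batch, spk_dim)
        spk = spk_embed.unsqueeze(-1).expand(-1, -1, mels.size(-1))
        return self.proj(torch.cat([mels, spk], dim=1))

cond = SpeakerConditioning()
out = cond(torch.randn(2, 80, 100), torch.randn(2, 256))  # -> (2, 128, 100)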

It would also be nice to know the impact of training with multiple languages. I didn't have time to test the effect of an unbalanced dataset (>70% of the audio was English) but I did notice the embeddings from my model did a slightly better job when testing against unseen English audio. My assumption is this was due to more training steps, not multiple languages.

I'm currently playing around with talking heads / code. I've also come across Lip2Wav which is a really interesting project that uses this repository as a base.

@ghost

ghost commented Jul 12, 2020

@sberryman Taking a closer look at LibriTTSLabel, it has the raw TextGrid output from MFA, so that will save a lot of time. I will use your script to generate the alignment files after downloading the datasets to get the transcript files.

If someone is going to take on the full-stack training, they should start from scratch and adjust the encoder hparams to 768 hidden units with an embedding size of 256. I had started training before Corentin replied to say the embedding size should stay at 256.

If we keep the embedding size at 256, while increasing the hidden units from 256 to 768, would the resulting encoder be compatible with the existing synthesizer and vocoder models?

  1. If not, what changes need to be made to the synthesizer and vocoder hparams?
  2. If yes, can we continue training @CorentinJ 's pretrained synthesizer and vocoder models, or would those have to be trained from scratch?

Looking at encoder/model.py, the answer seems to be yes in a strict sense. It depends on how well the new encoder's embedding matches the original encoder's when torch.nn.Linear reduces the size from 768 to 256. But I would think the loss function used in encoder training should cause it to converge in a similar direction.
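
To make that concrete, here is a stripped-down sketch of the arrangement I have in mind (not the actual encoder/model.py code, just the shape of it, with illustrative sizes):

import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    # Widen the LSTM to 768 hidden units while the final Linear layer still
    # projects down to a 256-dim embedding, so the output shape stays
    # compatible with the existing synthesizer and vocoder.
    def __init__(self, n_mels=40, hidden=768, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden, embed_dim)

    def forward(self, mels):
        # mels: (batch, frames, n_mels)
        _, (hidden, _) = self.lstm(mels)
        embeds = torch.relu(self.linear(hidden[-1]))
        return embeds / torch.norm(embeds, dim=1, keepdim=True)  # unit length

enc = SpeakerEncoderSketch()
print(enc(torch.randn(4, 160, 40)).shape)  # torch.Size([4, 256])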

I'm currently playing around with talking heads / code. I've also come across Lip2Wav which is a really interesting project that uses this repository as a base.

Those are very interesting, thanks for sharing the links.

@sberryman

@blue-fish I found the TextGrid files if you want them.

You can't adjust anything upstream without retraining downstream. The encoder is at the top of the stream (upstream), followed by the synthesizer and finally the vocoder. If you make a change at the encoder level, you need to retrain the synthesizer and vocoder. If you change the synthesizer, you need to retrain the vocoder.

I'm also not an expert in any of this. I'm a programmer who happened to drop by and test a few things out along the way. All I can say is that I'm happy to share my experiences but nobody should rely on anything I say. I'm probably wrong.

What is your goal with this project?

  1. Reproduce Corentin's results?
  2. Make it more accessible?
  3. Train on new languages or datasets?

I understand you have commit access to the repository but I don't understand your motivation or goal.

@ghost

ghost commented Jul 13, 2020

@sberryman Thank you for the offer, but would your raw TextGrid files differ from the set that's publicly available via LibriTTSLabel? I don't want to waste your time, so let's assume it's not necessary unless their files give me trouble.

Your discussions in #126 were really helpful for understanding the training process. I get what you're saying about downstream models needing to be retrained when an upstream element is changed. From a naive point of view, if the new encoder also has an embedding size of 256, and the loss function makes it resemble the original encoder (taking both to be a black box), it seems like we could use the existing synthesizer and vocoder models as a starting point for training. My knowledge is seriously lacking and I need to put the computer down and pick up a textbook before going any further. But it does seem like an opportunity to save time and electricity.

My goal with this project is incremental improvement with a focus on usability. I would also like better quality output, but I'm not able to contribute anything except enthusiasm and low-level grunt work. LibriTTS support is one step in that direction. I don't expect to get there alone, just to clear an obstacle or two so someone else can pick it up and take it the rest of the way.

@sberryman

sberryman commented Jul 13, 2020

Not sure if it is any different; it's not a hassle at all to share it, though.

My gut says that any retraining or even fine-tuning of the encoder requires re-training of all downstream modules. I haven't tried to train another encoder model and compare the embeddings though, so at this point it is just a guess.

I would also really like to see improvement in this field, but keep in mind that the pace of research is incredible. Given the number of people who have starred this repo on GitHub, it is obviously a very interesting topic to a lot of people. Maybe a quick survey would be beneficial to see what people are looking for? Quite a few people could be using Corentin's work in further research and not be interested in contributing code. Maybe most are looking to implement virtual voices in their own projects? If we knew the use case for most of the interested developers, we could try to build a foundation for further research and development.

For clarity, my purpose in re-training was to focus on the encoder. I wanted to train voice embeddings similar to facial recognition embeddings while taking advantage of GE2E loss. I have a personal dataset of over 7,000 hours of local and national broadcast news in the USA from about 300 stations. That video was recorded during a single week, from 2019-05-27 through 2019-06-02. I had already run some of the most popular ML networks (YOLOv3, Mask R-CNN, face detection and embeddings) on all of the keyframes from the video. My next task was to identify the voices and determine when the face and voice embeddings overlapped. Then I could easily tell if the person was on-camera and speaking at the same time. As of right now (visual only) it is very easy to see the most common clusters of faces across a station and nationally.

Local

[image: most common face clusters for a local station]

National

[image: most common face clusters nationally]

It would be interesting to hear if you have a project in mind, I would assume everyone here does.

Edit:
Here is the link to the TextGrid files: https://www.dropbox.com/s/xov6qyc6e33tf7n/libritts.textgrid.zip?dl=0

@CorentinJ
Owner

Thanks for sharing all that @sberryman . Is there any reason we cannot generate and publish a set of alignments for LibriTTS (just like the LibriSpeech alignments)? This seems like a good idea, to keep the training setup as easy and repeatable as it currently is.

I'm also surprised that the sample rate in synthesizer/hparams.py is not updated to match the 24k sampling rate of LibriTTS. Can you confirm if an hparams update is needed for synthesizer training on LibriTTS?

I found that using alignments to split long text samples is a waste of time. Simply discard samples that are too long.
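
In practice that can be as simple as a duration filter in the synthesizer preprocessing, something along these lines (the cutoff value and function name are only an example, not a recommendation):

MAX_DURATION_S = 11.25  # example cutoff only

def keep_utterance(wav, sample_rate=24000, max_duration_s=MAX_DURATION_S):
    # Skip utterances that are too long instead of splitting them on alignments.
    return len(wav) / sample_rate <= max_duration_s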

This issue was closed.