This is a TensorFlow + PyTorch implementation, adapted from the Real-Time-Voice-Cloning implementation at https://github.com/CorentinJ/Real-Time-Voice-Cloning.
- Python 3.8
- Install PyTorch (>=1.0.1).
- Install the NVIDIA version of TensorFlow 1.15.
- Install ffmpeg.
- Install Kaldi.
- Install PyKaldi.
- Run `pip install -r requirements.txt` to install the remaining necessary packages; a full setup sketch is shown below.
- Download the pretrained TDNN-F model, extract it, and set `PRETRAIN_ROOT` in `kaldi_scripts/extract_features_kaldi.sh` to the pretrained model directory.
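The installation steps above can be summarized as the following shell sketch. The conda environment, the `nvidia-pyindex`/`nvidia-tensorflow` packages, and apt are assumptions about your platform; Kaldi and PyKaldi are built from source following their own install guides and are omitted here.

```bash
# Create an isolated Python 3.8 environment (conda is an assumption; venv also works)
conda create -n accent_conversion python=3.8
conda activate accent_conversion

# PyTorch >= 1.0.1; pick the wheel matching your CUDA version
pip install torch

# NVIDIA's maintained TensorFlow 1.15 build (pulled in via the nvidia-pyindex shim)
pip install nvidia-pyindex
pip install nvidia-tensorflow

# ffmpeg from the system package manager (Ubuntu shown; adjust for your OS)
sudo apt-get install -y ffmpeg

# Remaining Python dependencies of this repo
pip install -r requirements.txt
```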
- Acoustic Model: trained on LibriSpeech. Download the pretrained TDNN-F acoustic model here.
- You also need to set `KALDI_ROOT` and `PRETRAIN_ROOT` in `kaldi_scripts/extract_features_kaldi.sh` accordingly; an illustrative excerpt is shown after this list.
- Speaker Encoder: trained on LibriSpeech; see here for the detailed training process.
- Synthesizer (i.e., the seq2seq model): trained on ARCTIC and L2-ARCTIC; see here for a merged version.
- Vocoder: trained on LibriSpeech; see here for the detailed training process.
All the pretrained models are available here.
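For reference, the relevant lines near the top of `kaldi_scripts/extract_features_kaldi.sh` would look roughly like the excerpt below; the paths are placeholders and the exact variable layout in the script may differ.

```bash
# Illustrative excerpt (paths are placeholders)
KALDI_ROOT=/path/to/kaldi            # root of your compiled Kaldi checkout
PRETRAIN_ROOT=/path/to/tdnnf_model   # directory the pretrained TDNN-F model was extracted to
```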
- Use Kaldi to extract BNFs (bottleneck features) for the reference L1 speaker:
```bash
./kaldi_scripts/extract_features_kaldi.sh /path/to/L2-ARCTIC/BDL
```
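If you need BNFs for more than one speaker, the same script can be looped over speaker directories; the speaker names below are just examples.

```bash
# Run the extraction once per speaker directory (BDL is the reference L1 speaker)
for spk in BDL ABA THV; do
    ./kaldi_scripts/extract_features_kaldi.sh /path/to/L2-ARCTIC/"$spk"
done
```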
- Preprocessing:
```bash
python synthesizer_preprocess_audio.py /path/to/L2-ARCTIC BDL /path/to/L2-ARCTIC/BDL/kaldi --out_dir=your_preprocess_output_dir
python synthesizer_preprocess_embeds.py your_preprocess_output_dir
```
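A worked example with placeholder paths (the `/data/...` locations are assumptions; the third argument points at the Kaldi features extracted above):

```bash
# 1) Preprocess audio for the reference speaker BDL
python synthesizer_preprocess_audio.py /data/L2-ARCTIC BDL /data/L2-ARCTIC/BDL/kaldi \
    --out_dir=/data/preprocess_out

# 2) Compute speaker embeddings for every preprocessed utterance
python synthesizer_preprocess_embeds.py /data/preprocess_out
```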
- Training:
```bash
python synthesizer_train.py Accetron_train your_preprocess_output_dir
```
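Training runs for a long time, so it can help to keep a log of the console output; the `2>&1 | tee` part is plain shell, not a flag of the training script.

```bash
# Save the training log while still printing to the console
python synthesizer_train.py Accetron_train your_preprocess_output_dir 2>&1 | tee synthesizer_train.log
```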