
Voice Conversion Using Zero-Shot Learning

This is a TensorFlow + PyTorch implementation of voice conversion using zero-shot learning. It is adapted from the Real-Time Voice Cloning implementation at https://github.com/CorentinJ/Real-Time-Voice-Cloning.

Installation

  • Install Python 3.8.
  • Install PyTorch (>=1.0.1).
  • Install the NVIDIA build of TensorFlow 1.15.
  • Install ffmpeg.
  • Install Kaldi.
  • Install PyKaldi.
  • Run pip install -r requirements.txt to install the remaining required packages.
  • Download the pretrained TDNN-F model, extract it, and set PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh to the pretrained model directory. A consolidated sketch of these steps follows this list.
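For reference, here is a minimal, untested sketch of the installation steps on a Linux machine; the nvidia-pyindex/nvidia-tensorflow package names and all paths are assumptions to verify against your environment.

```bash
# Minimal install sketch (assumes Linux, a CUDA-capable GPU, and an active Python 3.8 env).
pip install "torch>=1.0.1"

# NVIDIA's maintained TensorFlow 1.15 build (package names are an assumption; verify for your setup).
pip install nvidia-pyindex
pip install nvidia-tensorflow

# System dependency.
sudo apt-get install ffmpeg

# Kaldi and PyKaldi are built from source; follow their official instructions.

# Remaining Python dependencies for this repo.
pip install -r requirements.txt

# After downloading and extracting the pretrained TDNN-F model, set
# PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh to its directory.
```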

Dataset

  • Acoustic Model: LibriSpeech. Download the pretrained TDNN-F acoustic model here.
    • You also need to set KALDI_ROOT and PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh accordingly (see the snippet after this list).
  • Speaker Encoder: LibriSpeech; see here for the detailed training process.
  • Synthesizer (i.e., the seq2seq model): ARCTIC and L2-ARCTIC. See here for a merged version.
  • Vocoder: LibriSpeech; see here for the detailed training process.
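Both variables live in kaldi_scripts/extract_features_kaldi.sh; the exact lines may differ, but the edit looks roughly like this (paths are placeholders):

```bash
# In kaldi_scripts/extract_features_kaldi.sh (placeholder paths; adjust to your setup):
KALDI_ROOT=/path/to/kaldi
PRETRAIN_ROOT=/path/to/pretrained_tdnnf_model
```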

All the pretrained models are available here.

Quick Start

See the inference script.
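The exact script name is not given above, so the following invocation is purely illustrative; the script name and flags are hypothetical, not the repo's actual CLI.

```bash
# Hypothetical invocation (script name and flags are assumptions, not the repo's actual CLI):
python inference.py \
    --source /path/to/source_utterance.wav \
    --target /path/to/target_speaker.wav \
    --out converted.wav
```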

Training

  • Use Kaldi to extract bottleneck features (BNFs) for the reference L1 speaker:

    ./kaldi_scripts/extract_features_kaldi.sh /path/to/L2-ARCTIC/BDL

  • Preprocessing:

    python synthesizer_preprocess_audio.py /path/to/L2-ARCTIC BDL /path/to/L2-ARCTIC/BDL/kaldi --out_dir=your_preprocess_output_dir
    python synthesizer_preprocess_embeds.py your_preprocess_output_dir

  • Training (a commented end-to-end recap follows):

    python synthesizer_train.py Accetron_train your_preprocess_output_dir
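Putting the three steps together, here is a commented recap of the same pipeline; the paths and the output directory name are placeholders carried over from the commands above.

```bash
# 1. Extract Kaldi bottleneck features (BNFs) for the reference L1 speaker (BDL).
./kaldi_scripts/extract_features_kaldi.sh /path/to/L2-ARCTIC/BDL

# 2. Preprocess the audio and compute speaker embeddings.
python synthesizer_preprocess_audio.py /path/to/L2-ARCTIC BDL \
    /path/to/L2-ARCTIC/BDL/kaldi --out_dir=your_preprocess_output_dir
python synthesizer_preprocess_embeds.py your_preprocess_output_dir

# 3. Train the synthesizer (experiment name "Accetron_train") on the preprocessed data.
python synthesizer_train.py Accetron_train your_preprocess_output_dir
```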