In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Figures (from the paper): VITS at training and VITS at inference.
Clone the repo:
git clone git@github.com:SameerSri72/Sanskrit-text-to-speech.git
The following steps assume you have navigated to the repository root folder after cloning.
NOTE: This is tested under python3.11 with a conda env. For other python versions, you might encounter version conflicts.
Dependencies: PyTorch 2.0. For the full list of required packages, please refer to requirements.txt.
# install required packages (for pytorch 2.0)
conda create -n vits python=3.11
conda activate vits
pip install -r requirements.txt
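After installing the requirements, a quick optional sanity check is to confirm the PyTorch version and GPU visibility from Python. This is only a minimal sketch, not part of the repo:

```python
# Optional sanity check for the training environment.
import torch

print("PyTorch version:", torch.__version__)         # expect a 2.x release
print("CUDA available:", torch.cuda.is_available())  # training is much faster on a GPU
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```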
There are three options you can choose from: LJ Speech, VCTK, and custom dataset.
- LJ Speech: LJ Speech dataset. Used for single speaker TTS.
- VCTK: VCTK dataset. Used for multi-speaker TTS.
- Custom dataset: You can use your own dataset. (I have used Sanskrit text and audio files here; the data is only available to me locally and cannot be uploaded publicly due to copyright.)
- create a folder with wav files (if your audio is not already at the target sampling rate, see the resampling sketch after the config example below)
- create a configuration file in configs. Change the following fields in custom_base.json:
{
  "data": {
    "training_files": "filelists/custom_audio_text_train_filelist.txt.cleaned", // path to the cleaned training filelist
    "validation_files": "filelists/custom_audio_text_val_filelist.txt.cleaned", // path to the cleaned validation filelist
    "text_cleaners": ["sanskrit_cleaners"], // text cleaner
    "bits_per_sample": 16, // bit depth of the wav files
    "sampling_rate": 22050, // sampling rate if you resampled your wav files
    ...
    "n_speakers": 0, // number of speakers in your dataset if you use the multi-speaker setting
    "cleaned_text": true // if you already cleaned your text (see text_phonemizer.ipynb), set this to true
  },
  ...
}
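If your source audio is not already at the sampling rate set in the config (22050 Hz in the example above), resample it before training. Below is a minimal sketch using librosa and soundfile; treat these two packages and the folder paths as assumptions (any resampling tool works):

```python
# Minimal resampling sketch (assumes librosa and soundfile are installed).
# Converts every wav in a source folder to 22050 Hz, 16-bit PCM.
from pathlib import Path
import librosa
import soundfile as sf

SRC = Path("/path/to/raw_wavs")        # hypothetical source folder
DST = Path("/path/to/custom_dataset")  # folder your filelists will point to
DST.mkdir(parents=True, exist_ok=True)

for wav_path in SRC.glob("*.wav"):
    audio, _ = librosa.load(wav_path, sr=22050)                        # load and resample
    sf.write(str(DST / wav_path.name), audio, 22050, subtype="PCM_16")  # write 16-bit wav
```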
- install espeak-ng (optional)
NOTE: This is required for preprocess.py and the inference.ipynb notebook to work. If you don't need them, you can skip this step. Please refer to espeak-ng.
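For context, espeak-ng is only exercised by phoneme-based cleaners (such as english_cleaners2); the Sanskrit cleaner does not phonemize. A minimal sketch of what such a cleaner does under the hood, assuming the phonemizer package with the espeak backend is installed:

```python
# Minimal phonemization sketch (only relevant for phoneme-based cleaners,
# e.g. english_cleaners2; not needed for the Sanskrit pipeline).
from phonemizer import phonemize

text = "Text to speech converts written words into audio."
phonemes = phonemize(
    text,
    language="en-us",  # espeak-ng language/voice code
    backend="espeak",  # requires the espeak-ng binary installed system-wide
    strip=True,
)
print(phonemes)
```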
- preprocess text
You can do this step by step (no phonemization is needed for Sanskrit):
- create a dataset of text files. See text_dataset.ipynb
- phonemize or just clean up the text. Please refer to text_phonemizer.ipynb
- create filelists and their cleaned versions with a train/test split. See text_split.ipynb (a minimal sketch is also shown after this list)
- rename or create a link to the dataset folder. Please refer to text_split.ipynb
ln -s /path/to/custom_dataset DUMMY3
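The notebooks cover the filelist step end to end; the sketch below only illustrates the single-speaker VITS filelist format (one `wav_path|text` line per utterance) with a hypothetical transcripts dict and a simple train/validation split:

```python
# Minimal filelist sketch (hypothetical data; the real workflow is in
# text_dataset.ipynb, text_phonemizer.ipynb and text_split.ipynb).
import random

# wav path (relative to the DUMMY3 link) -> already-cleaned transcript
transcripts = {
    "DUMMY3/utt_0001.wav": "cleaned sanskrit text for utterance one",
    "DUMMY3/utt_0002.wav": "cleaned sanskrit text for utterance two",
    # ...
}

items = [f"{path}|{text}" for path, text in transcripts.items()]
random.shuffle(items)
split = int(0.95 * len(items))  # e.g. 95% train / 5% validation

# Assumes the filelists/ folder from the repo; names match the config example above.
with open("filelists/custom_audio_text_train_filelist.txt.cleaned", "w", encoding="utf-8") as f:
    f.write("\n".join(items[:split]) + "\n")
with open("filelists/custom_audio_text_val_filelist.txt.cleaned", "w", encoding="utf-8") as f:
    f.write("\n".join(items[split:]) + "\n")
```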
# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base
# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base
# Sanskrit
python train.py -c configs/sans_base.json -m sans_base
# Custom dataset (multi-speaker)
python train_ms.py -c configs/custom_base.json -m custom_base
After training, refer to infer.ipynb for inference.
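For orientation, here is a condensed inference sketch following the pattern of the original VITS inference notebook; the config name, checkpoint path, and input text are placeholders for your own run, and infer.ipynb remains the reference:

```python
# Condensed inference sketch in the style of the original VITS inference notebook.
# Config, checkpoint path, and text below are placeholders; adapt to your run.
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/sans_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.eval()
utils.load_checkpoint("logs/sans_base/G_100000.pth", net_g, None)  # placeholder checkpoint

# Text must go through the same cleaners used for training.
# If add_blank is true in your config, the notebook also intersperses blank
# tokens (commons.intersperse); see infer.ipynb for that variant.
seq = text_to_sequence("your cleaned sanskrit text here", hps.data.text_cleaners)
x = torch.LongTensor(seq).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1.0)[0][0, 0].cpu().numpy()
# `audio` is a float waveform at hps.data.sampling_rate; save it with any wav writer.
```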